Benchmarking and System Validation Software Engineer

Raleigh, NC
Full Time
Experienced

Summary

We are seeking a talented Software Engineer to join our Raleigh, North Carolina team focused on benchmarking, system validation, and test automation for large-scale distributed systems. In this role, you will write applications that benchmark next-generation computing infrastructure for performance and scale using real-world Machine Learning workloads, and build system topologies that validate our customer use cases.
 

Roles and Responsibilities:

  • Model and benchmark large-scale Machine Learning workloads
  • Characterize performance of distributed deep learning applications with data and model parallelism, and model sharding across devices and memories 
  • Write applications, libraries, and kernel modules that stress I/O technology capabilities, including NCCL and CUDA GPU technology
  • Develop low-level software applications to test the I/O performance of next-gen compute systems
  • Validate customer use cases using our technology, and assist with such deployments
  • Implement broad system-level and solution-level testing
  • Create White Papers that showcase Data Center I/O technology

Desired Knowledge and Skill Set:

  • Hands-on experience with ML Collective Communication and CUDA programming
  • Hands-on experience with ML frameworks such as PyTorch and TensorFlow
  • Familiarity with standard Machine Learning workload benchmarks for Training and Inference
  • Strong coding skills in multiple languages such as Python, C and C++
  • Background in low-level I/O performance analysis of networking and server systems 
  • Good knowledge of TCP/IP and performance of other networking protocols 
  • Detailed understanding of server components and applicable drivers for CPUs, memory, GPUs, networking devices and storage
  • Experience validating large-scale data center networking and server solutions
  • Working knowledge of high-performance communication technologies such as MPI, InfiniBand, RDMA, GPUDirect, and NVLink is desirable
  • Linux systems knowledge
  • 5+ years of software development experience working closely with hardware

This role requires the employee to be on-site in the Raleigh, North Carolina office. There is no hybrid work option.

About Us 

Enfabrica is on a mission to revolutionize AI compute systems and infrastructure at scale through the development of superior-scaling networking silicon and software, which we call the Accelerated Compute Fabric. Founded and led by an executive team assembled from first-class semiconductor and distributed systems/software companies throughout the industry, Enfabrica sets itself apart from other startups with a very strong engineering pedigree, a proven track record of delivering, deploying, and scaling products in data center production environments, and significant investor support for our ambitious journey! With our differentiated approach to solving the I/O bottlenecks in distributed AI and accelerated compute clusters, Enfabrica is unleashing the revolution in next-gen computing fabrics.
