Benchmarking and System Validation Software Engineer
Summary
We are seeking a talented Software Engineer to join our Durham, North Carolina team focused on Benchmarking, System Validation and Test Automation for large-scale distributed systems. In this role, you will write applications that benchmark next-generation computing infrastructure for performance and scale using real-world Machine Learning workloads, and build system topologies to validate our customer use cases.
Roles and Responsibilities:
- Model and benchmark large-scale Machine Learning workloads
- Characterize the performance of distributed deep learning applications that use data parallelism, model parallelism, and model sharding across devices and memories
- Write applications, libraries and kernel modules that stress I/O technology capabilities, including those that stress NCCL and CUDA GPU technology
- Develop low-level software applications to test I/O performance of next-gen compute systems
- Validate customer use cases using our technology, and assist with such deployments
- Implement broad system-level and solution-level testing
- Create White Papers that showcase Data Center I/O technology
Desired Knowledge and Skill Set:
- Hands-on experience with ML Collective Communication and CUDA programming
- Hands-on experience with ML frameworks such as PyTorch and TensorFlow
- Familiarity with standard Machine Learning workload benchmarks for Training and Inference
- Strong coding skills in multiple languages such as Python, C and C++
- Background in low-level I/O performance analysis of networking and server systems
- Good knowledge of TCP/IP and performance of other networking protocols
- Detailed understanding of server components and applicable drivers for CPUs, memory, GPUs, networking devices and storage
- Experience validating large-scale Data Center networking and server solutions
- Working knowledge of high-performance communication technologies like MPI, InfiniBand, RDMA, GPUDirect and NVLink is desirable
- Linux systems knowledge
- 5+ years of software development experience working closely with hardware
About Us
Enfabrica is on a mission to revolutionize AI compute systems and infrastructure at scale through the development of superior-scaling networking silicon and software, which we call the Accelerated Compute Fabric. Founded and led by an executive team assembled from first-class semiconductor and distributed systems/software companies throughout the industry, Enfabrica sets itself apart from other startups with a very strong engineering pedigree, a proven track record of delivering, deploying and scaling products in data center production environments, and significant investor support for our ambitious journey! With our differentiated approach to solving the I/O bottlenecks in distributed AI and accelerated compute clusters, Enfabrica is unleashing the revolution in next-gen computing fabrics.