Senior HPC Scheduler Engineer
NVIDIA
Santa Clara, California, United States
What you’ll be doing:
Provide engineering solutions and prototypes to enable efficient resource management and job scheduling for large scale clusters, ensure technical relationships with internal and external engineering teams, and assist system architects and machine learning/deep learning engineers in building creative solutions based on NVIDIA technology.
Be an internal reference for scheduling and resource management concepts and methodologies among the NVIDIA technical community.
Test, evaluate, and benchmark new technologies and products and work with vendors, partners and peers to improve functionality and optimize performance.

What we need to see:
5+ years of experience designing and running scheduling and resource management systems in large datacenter/AI/HPC solutions.
Knowledge and experience with resource management / scheduling code bases: SLURM preferred, other implementations (LSF, SGE, Torque...).
Proven understanding of performance clusters, infrastructure and workload patterns.
Experience using and installing Linux-based server platforms.
C/Python/Bash/Lua programming/scripting experience.
Experience working with engineering or academic research community supporting HPC or deep learning.
Strong teamwork and both verbal and written communication skills.
Ability to multitask efficiently in a very dynamic environment!
Action driven with strong analytical and troubleshooting skills.
Desire to be involved in multiple diverse and innovative projects.
BS in Engineering, Mathematics, Physics, or Computer Science or equivalent experience. MS or PhD desirable.

Ways to stand out from the crowd:
Experience with HPC cluster administration for AI.
Experience deploying containerized services.
Experience with orchestrators (e.g. Kubernetes).
Demonstrated work with Open-Source software: building, debugging, patching and contributing code.
Experience tuning memory, storage, and networking settings for performance on Linux systems.
Exposure to monitoring and telemetry systems.