Senior HPC Scheduler Engineer

What you’ll be doing:

Provide engineering solutions and prototypes to enable efficient resource management and job scheduling for large scale clusters, maintain technical relationships with internal and external engineering teams, and assist system architects and machine learning/deep learning engineers in building creative solutions based on NVIDIA technology.

Be an internal reference for scheduling and resource management concepts and methodologies among the NVIDIA technical community. Test, evaluate, and benchmark new technologies and products and work with vendors, partners and peers to improve functionality and optimize performance.


What we need to see:

5+ years of experience designing and running scheduling and resource management systems in large datacenter/AI/HPC solutions.

Knowledge of and experience with resource management / scheduling code bases: SLURM preferred; other implementations (LSF, SGE, Torque...) also relevant.

Proven understanding of performance clusters, infrastructure and workload patterns.

Experience using and installing Linux-based server platforms.

C/Python/Bash/Lua programming/scripting experience.

Experience working with engineering or academic research community supporting HPC or deep learning.

Strong teamwork and both verbal and written communication skills.

Ability to multitask efficiently in a very dynamic environment!

Action driven with strong analytical and troubleshooting skills.

Desire to be involved in multiple diverse and innovative projects.

BS in Engineering, Mathematics, Physics, or Computer Science or equivalent experience. MS or PhD desirable.


Ways to stand out from the crowd:

Experience with HPC cluster administration for AI.

Experience deploying containerized services.

Experience with orchestrators (e.g. Kubernetes).

Demonstrated work with Open-Source software: building, debugging, patching and contributing code.

Experience tuning memory, storage, and networking settings for performance on Linux systems.

Exposure to monitoring and telemetry systems.

Company: NVIDIA
Job Posted: a year ago
Job Type: Full-time
Work Mode: On-site
Experience Level: 3-7 years
Location: Santa Clara, California, United States
Qualification: Bachelor

Related Jobs

Senior Performance Engineer

NVIDIA

Santa Clara, California, United States

Posted: a year ago

What you’ll be doing:

Lead all aspects of implementing performance practices in large scale infrastructure, and deliver powerful tools, methodologies, and flows to validate and improve several datacenter products in parallel.

Accelerate strategic customer deployments and ensure speed-of-light bringup and deployment of ground-breaking AI infrastructure by working hand in hand to tailor design and faster processes to customer needs.

Own the architecting of performance design and settings of at-scale datacenter products, implemented in both FW and SW components, to ensure velocity and scale while efficiently using resources. This involves early engagement with HW/FW/SW/platform internal and customer teams, and other groups, to build end-to-end solutions and optimize datacenter product designs.

As a key member, contribute to architecting the implementation of server and rack level telemetry, and collaborate on and establish continuous improvements in our design flows.

Participate in engagements with various SW and FW (BMC/SBIOS/OS/drivers etc.) teams to develop best-in-class practices and tools, and analyze, debug, and resolve critical firmware and software issues for the best AI workload performance at scale.

Provide engineering solutions to enable large scale performance strategies for Datacenter GPU Computing products and software stacks, ensure technical relationships with internal and external engineering teams, and assist systems engineers in building creative solutions based on NVIDIA technology.

Be an internal reference for firmware, at-scale deployment for datacenter, and large-scale GPU-accelerated system solutions among the NVIDIA technical community.


What we need to see:

5+ years of experience in using accelerated computing for datacenter container computing solutions.

Strong knowledge of accelerated computing software stacks (CUDA).

Experience using and handling modern Cloud and container-based Enterprise computing architectures.

C/C++/Python/Bash programming/scripting experience.

Experience with CPU architecture.

Experience with container technology and Linux based OSes.

Experience working with engineering or academic research community supporting high performance computing or deep learning.

Strong verbal and written communication skills.

Strong teamwork and social skills.

Ability to multitask effectively in a dynamic environment.

Action driven with strong analytical skills.

Desire to be involved in multiple diverse and creative projects.

BS in Engineering, Mathematics, Physics, or Computer Science (or equivalent experience). MS or PhD desirable.


Ways to stand out from the crowd:

Deep Learning framework skills.

DL and graph compiling programming skills.

Exposure to virtualization techniques, cloud platform solutions.

Exposure to scheduling and resource management systems.

Experience with high performance or large scale computing environments.

Senior Software Engineer, NGC Data Platform

NVIDIA

Santa Clara, California, United States

Posted: a year ago

What you will be doing:

Design and build software code and cloud services for Data Management, including providing a catalog and metadata storage for datasets.

Connect with other technical leaders across NVIDIA to ensure you are using existing technologies where possible and that we are collaborating with their systems appropriately.

Collaborate with the NVIDIA research team to use new Storage and Compute innovations: GPU direct storage, DPU.


What we need to see:

BS in Computer Science, Information Systems, or Computer Engineering (or equivalent experience).

5+ years of proven experience.

Experience building robust services at scale.

Experience building and maintaining high volume / low latency data platform services.

Strong foundation in algorithms and data structures and their real-world use cases.

Experience with distributed systems, databases, and Big Data systems (Spark, Hadoop).

Experience building and shipping services around Kubernetes, Cloud Native, and Cloud Service Providers.

Experience with one of the leading cloud providers: AWS, GCP, or Azure.

Experience collaborating with teams to write software to support cloud services.

Experience with backend systems and software engineering.

Programming experience in a relevant language, e.g., Go, Python, C/C++, Java.

Understanding of standard approaches to software engineering, software architecture, and design.

Ability to document software and services.

Ability to break down projects into practical tasks.

Ability to communicate design, status, and other sophisticated subjects in written, visual, and oral formats.

Ability and passion for working across teams and with collaborators on all sides of the project.


Ways to stand out from the crowd:

Hands-on experience in building and managing large-scale data platform services.

Experience building products and services to solve enterprise-grade customer data analytics problems.

Experience with Apache Spark, Object Storage, Metadata Management, Data lake tools (Apache Iceberg), and Machine Learning infrastructure toolsets (Feature Stores).

Computer science background with Distributed systems as a specialization.