The Job logo

What

Where

Compute Cluster DevOps Engineer, GPU - HPC

ApplyJoin for More Updates

You must Sign In before continuing to the company website to apply.

What you will be doing:

  • Design, implement and support large scale infrastructure with monitoring, logging, and alerting with promised uptime.
  • Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation, and refinement.
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management.
  • Maintain infra and services once they are live by measuring and monitoring availability, latency, and overall system health.
  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems.
  • Understand complex and vast infrastructure and support it during on call weeks.
  • Work with different SME and help provide quality resolution to the production issues to the customer.

 

What we need to see:

  • BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics) or equivalent.
  • 4+ years of hands-on industry experience in the above mentioned areas
  • Experience with automation around the Linux system administration.
  • Experience in one or more of the following: Python, Perl.
  • Good understanding of open-source IT Automation tools like Ansible, slat.
  • Interest in crafting, analyzing, and fixing large-scale distributed systems.
  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
  • Ability to debug and optimize code and automate routine tasks.

 

Ways to stand out of the crowd:

  • Good hands on experience on schedulers like LSF and SLURM.
  • Experience in maintaining and writing automation around the HPC cluster.
  • Strong System Administration skills in Linux.


 

Set alert for similar jobsCompute Cluster DevOps Engineer, GPU - HPC role in Bengaluru, India
NVIDIA Logo

Company

NVIDIA

Job Posted

a year ago

Job Type

Full-time

WorkMode

On-site

Experience Level

3-7 years

Locations

Bengaluru, Karnataka, India

Qualification

Bachelor

Applicants

Be an early applicant

Related Jobs

NVIDIA Logo

Compute Cluster SRE Engineer, GPU - HPC

NVIDIA

Bengaluru, Karnataka, India

Posted: a year ago

Design, implement and support large scale infrastructure with monitoring, logging, and alerting. Engage in and improve the lifecycle of services. Support services before they go live through activities such as capacity management. Maintain infra and services by measuring and monitoring availability, latency, and system health. Scale systems sustainably through automation. Practice sustainable incident response. Understand complex infrastructure and provide quality resolution to production issues. BS degree in Computer Science or related field and 4+ years of industry experience required.

NVIDIA Logo

Senior HPC Scheduler Engineer

NVIDIA

Santa Clara, California, United States

Posted: a year ago

What you’ll be doing: Provide engineering solutions and prototypes to enable efficient resource management and job scheduling for large scale clusters, ensure technical relationships with internal and external engineering teams, and assist system architects and machine learning/deep learning engineers in building creative solutions based on NVIDIA technology. Be an internal reference for scheduling and resource management concepts and methodologies among the NVIDIA technical community. Test, evaluate, and benchmark new technologies and products and work with vendors, partners and peers to improve functionality and optimize performance. What we need to see: 5+ years of experience designing and running scheduling and resource management systems in large datacenter/AI/HPC solutions. Knowledge and experience with resource management / scheduling code bases: SLURM preferred, other implementations (LSF, SGE, Torque...). Proven understanding of performance clusters, infrastructure and workload patterns. Experience using and installing Linux-based server platforms. C/Python/Bash/Lua programming/scripting experience. Experience working with engineering or academic research community supporting HPC or deep learning. Strong teamwork and both verbal and written communication skills. Ability to multitask efficiently in a very dynamic environment! Action driven with strong analytical and troubleshooting skills. Desire to be involved in multiple diverse and innovative projects. BS in Engineering, Mathematics, Physics, or Computer Science or equivalent experience. MS or PhD desirable. Ways to stand out from the crowd: Experience with HPC cluster administration for AI. Experience deploying containerized services. Experience with orchestrators (e.g. Kubernetes). Demonstrated work with Open-Source software: building, debugging, patching and contributing code. Experience tuning memory, storage, and networking settings for performance on Linux systems. Exposure to monitoring and telemetry systems.voyager