High Performance Computing (HPC) - Systems Engineer

What role you will play in our team

The HPC Systems Engineer works within a team to provide a performant, reliable, and secure high-performance computing (HPC) environment. The engineer is involved in various aspects of designing and engineering our HPC system and is responsible for managing day-to-day operations and maintenance activities, including but not limited to: general troubleshooting of any issues that may arise, monitoring overall system health, performing system maintenance tasks, and evaluating new hardware and system software.

What you will do

Primary Job Functions

Establish strategies for overall support of the system
Evaluate new hardware and software and understand the potential benefits and impacts they can have in the environment
Perform hardware maintenance
Perform software installations and upgrades, inclusive of operating system
Monitor overall system performance and health
Provide support for the management of data in the environment
Work with users to resolve problems and ensure they are able to effectively utilize the system
Interact with both business customers and technical teams that are globally distributed and within varied time zones
Engage with vendors to resolve problems with existing infrastructure and to discuss roadmaps and new technologies for evaluation
Foster a supportive work environment and maintain open, productive interactions within the team and across organizations
Build and maintain cross-organizational contacts to facilitate execution of work

About You

Skills and Qualifications

B.E./B.Tech in Computer Science or a related degree area (e.g. Computer Engineering, Information Systems), or equivalent work experience, with a CGPA of 6 or above
Excellent technical, analytical, and communication skills
A minimum of 3 years of hands-on Linux experience (e.g. RHEL, CentOS) and production infrastructure support (e.g. networking, storage, monitoring, compute, installation, configuration, maintenance, upgrade, retirement)
Experience in system administration and technical support (e.g. installation, configuration, maintenance, upgrade, retirement, problem resolution)
Experience in HPC technologies such as parallel/distributed file systems (e.g. Lustre, GPFS), high-speed interconnect fabrics (e.g. InfiniBand, Omni-Path), and HPC batch scheduling software suites (e.g. PBS Pro, SLURM)
Proficiency in technical writing and documentation of solutions
Solid understanding of data center operations fundamentals in networking, cooling, and power
Works well in a team environment.
Self-motivated

Preferred Qualifications/ Experience

Strong IT skills in infrastructure and applications
Experience with supporting large scale production environments.
Experience in implementing changes and security controls in a global framework.
Understanding of data center operations fundamentals in networking, cooling, and power
Knowledge and experience with installing/compiling vendor and open-source software.
Knowledge and experience with application/infrastructure deployment and support in one or more of the major cloud environments
Comfortable relocating to Bengaluru and working the 1:30 PM to 10:30 PM IST shift

Your benefits

An ExxonMobil career is one designed to last. Our commitment to you runs deep: our employees grow personally and professionally, with benefits built on our core categories of health, security, finance and life. We offer you:

Competitive compensation
Medical plans, maternity leave and benefits, life, accidental death and dismemberment benefits
Retirement benefits
Global networking & cross-functional opportunities
Annual vacations & holidays
Day care assistance program
Training and development program
Tuition assistance program
Workplace flexibility policy
Relocation program
Transportation facility

Please note benefits may change from time to time without notice, subject to applicable laws. The benefits programs are based on the Company’s eligibility guidelines.

High Performance Computing (HPC) - Systems Engineer role in Bengaluru, India

Company

ExxonMobil

Job Posted

a year ago

Job Type

Full-time

Work Mode

On-site

Experience Level

3-7 years
