Site Reliability Engineer

Join our team as a Site Reliability Engineer (SRE) at S&P Global in Gurgaon, Haryana, India. Responsibilities include ensuring system availability, performance, stability, and efficiency, and collaborating with development teams on reliable and scalable solutions. You will implement observability solutions, resolve latency issues, optimize performance, and automate deployment procedures. Requirements include application observability, AWS experience, proficiency in Terraform, Docker, and Kubernetes, and knowledge of CI/CD pipelines. This is a full-time, on-site opportunity for candidates with 5+ years of experience in SRE or related roles.

Job description

Position summary

We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) to join the Enterprise Solutions SRE team. In this role, you will be responsible for ensuring the availability, latency, performance, efficiency, and stability of our critical infrastructure, which supports a range of data platforms, applications, and services. You will collaborate closely with multiple stakeholders, including development teams, to implement and maintain reliable and scalable systems while adhering to industry best practices and security standards.

Responsibilities

Design, implement, and maintain comprehensive observability solutions to track the health and performance of our systems.
Analyze observability data to identify potential issues and proactively troubleshoot problems before they impact users.
Develop and implement alerts and notifications for critical events to ensure timely intervention.
Collaborate with development teams to design and implement solutions that enhance system resilience and reduce downtime.
Analyze performance metrics to identify and resolve latency bottlenecks in our infrastructure.
Implement performance optimization techniques and tools to improve the overall responsiveness of our systems.
Work with development teams to ensure that new features and code changes do not introduce performance regressions.
Develop and maintain metrics dashboards to track key performance indicators (KPIs) for our critical systems.
Identify performance trends and anomalies that may indicate potential issues or areas for improvement.
Recommend and implement performance optimization strategies to enhance the overall efficiency of our systems.
Optimize resource utilization and minimize unnecessary expenditure on IT infrastructure.
Identify and implement cost-effective solutions that improve the efficiency of our IT operations and reduce toil.
Design and implement automated deployment and rollback procedures to mitigate risks associated with software updates.
Monitor the performance of new releases and address any issues that arise promptly.
Analyze root causes of incidents to identify and implement preventive measures to minimize their recurrence.
Document incident responses and communicate lessons learned to enhance our incident handling processes.

Requirements

Proficient in application observability, Splunk OpenTelemetry preferred.
Ability to build and maintain a system and culture that supports and implements SLOs.
Experienced in AWS, including IAM, Lambda, CloudFront, and RDS (SQL Server and PostgreSQL).
Comfortable with Terraform and CloudFormation.
Familiar with Docker & Kubernetes.
Experienced in one or more programming languages, such as Python or C# (.NET).
Familiar with CI/CD pipelines such as Azure DevOps, GitHub Actions, or GitLab.
Understanding of how an application should be designed and built for the cloud.
Strong sense of ownership, urgency, and drive.
Familiarity with working in an agile environment.
Ability to maintain relationships with other disciplines and stakeholders.
Ability to review architecture design to ensure high availability and solid disaster recovery principles.
Willingness to participate in the on-call rotation.

Qualifications

Bachelor's degree in Computer Science, Information Technology, or a related field.
5+ years of experience as a Site Reliability Engineer or in an equivalent role.
Proven experience in monitoring, analyzing, and optimizing the performance of large-scale distributed systems.
Expertise in Windows & Linux systems administration, including managing servers, operating systems, and network configurations.
Strong scripting and automation skills, preferably with experience in Bash, Python, PowerShell, or similar languages.
Familiarity with AWS.
Experience with DevOps tools and practices, such as GitLab CI/CD and Docker.
Excellent troubleshooting and problem-solving skills with a knack for identifying and resolving complex technical issues.
Ability to work independently and as part of a collaborative team, effectively communicating technical concepts to both technical and non-technical stakeholders.
A passion for maintaining high availability, performance, and reliability of critical systems in a fast-paced financial environment.

Company

S&P Global

Job Posted

a year ago

Job Type

Full-time

WorkMode

On-site

Experience Level

3-7 Years
