Join our team as a Site Reliability Engineer (SRE) at S&P Global in Gurgaon, Haryana, India. Responsibilities include ensuring system availability, performance, stability, and efficiency, and collaborating with development teams on reliable and scalable solutions. You will implement observability solutions, resolve latency issues, optimize performance, and automate deployment procedures. Requirements include proficiency in application observability, AWS experience, proficiency in Terraform, Docker, and Kubernetes, and knowledge of CI/CD pipelines. This is a full-time, on-site opportunity for candidates with 5+ years of experience in SRE or related roles.
Job description
Position summary
We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) to join the Enterprise Solutions SRE team. In this role, you will be responsible for ensuring the availability, latency, performance, efficiency, and stability of our critical infrastructure, which supports a range of data platforms, applications, and services. You will collaborate closely with multiple stakeholders, including development teams, to implement and maintain reliable and scalable systems while adhering to industry best practices and security standards.
Responsibilities
- Design, implement, and maintain comprehensive observability solutions to track the health and performance of our systems.
- Analyze observability data to identify potential issues and proactively troubleshoot problems before they impact users.
- Develop and implement alerts and notifications for critical events to ensure timely intervention.
- Collaborate with development teams to design and implement solutions that enhance system resilience and reduce downtime.
- Analyze performance metrics to identify and resolve latency bottlenecks in our infrastructure.
- Implement performance optimization techniques and tools to improve the overall responsiveness of our systems.
- Work with development teams to ensure that new features and code changes do not introduce performance regressions.
- Develop and maintain metrics dashboards to track key performance indicators (KPIs) for our critical systems.
- Identify performance trends and anomalies that may indicate potential issues or areas for improvement.
- Recommend and implement performance optimization strategies to enhance the overall efficiency of our systems.
- Optimize resource utilization and minimize unnecessary expenditure on IT infrastructure.
- Identify and implement cost-effective solutions that improve the efficiency of our IT operations and reduce toil.
- Design and implement automated deployment and rollback procedures to mitigate risks associated with software updates.
- Monitor the performance of new releases and address any issues that arise promptly.
- Analyze root causes of incidents to identify and implement preventive measures to minimize their recurrence.
- Document incident responses and communicate lessons learned to enhance our incident handling processes.
Requirements
- Proficient in application observability (Splunk OpenTelemetry preferred).
- Ability to build and maintain a system and culture that supports and implements SLOs.
- Experienced in AWS, including IAM, Lambda, CloudFront, RDS SQL Server, and PostgreSQL.
- Comfortable with Terraform and CloudFormation.
- Familiar with Docker & Kubernetes.
- Experienced in one or more programming languages, such as Python or C# (.NET).
- Familiar with CI/CD platforms such as Azure DevOps, GitHub Actions, or GitLab.
- Understanding of how an application should be designed and built for the cloud.
- Strong sense of ownership, urgency, and drive.
- Familiarity with working in an agile environment.
- Ability to maintain relationships with other disciplines and stakeholders.
- Ability to review architecture design to ensure high availability and solid disaster recovery principles.
- Willingness to participate in an on-call rotation.
Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field.
- 5+ years of experience as a Site Reliability Engineer or in an equivalent role.
- Proven experience in monitoring, analyzing, and optimizing the performance of large-scale distributed systems.
- Expertise in Windows & Linux systems administration, including managing servers, operating systems, and network configurations.
- Strong scripting and automation skills, preferably with experience in Bash, Python, PowerShell, or similar languages.
- Familiarity with AWS.
- Experience with DevOps tools and practices, such as GitLab CI/CD and Docker.
- Excellent troubleshooting and problem-solving skills with a knack for identifying and resolving complex technical issues.
- Ability to work independently and as part of a collaborative team, effectively communicating technical concepts to both technical and non-technical stakeholders.
- A passion for maintaining high availability, performance, and reliability of critical systems in a fast-paced financial environment.