Job description
Requirements:
Design, implement, and maintain highly available and scalable infrastructure solutions across the full technology stack of Metro, AWL, and HINT.
Develop and maintain monitoring, alerting, and logging systems to ensure timely detection and resolution of issues using Splunk, ARGOS, and Oneconsole.
Effectively use Quantum Metrics for proactive and reactive monitoring.
Collaborate with development teams to design and implement automated processes.
Conduct capacity planning and performance tuning to ensure optimal system performance and resource utilization.
Implement and maintain disaster recovery and failover strategies to minimize downtime and ensure business continuity.
Troubleshoot and resolve complex technical issues across the entire technology stack, including infrastructure, networking, and application layers.
Participate in on-call rotations and respond to incidents to ensure 24/7 availability of critical systems.
Drive initiatives to improve system reliability, performance, and efficiency through automation and process optimization.
Stay current with industry trends and best practices in site reliability engineering, cloud technologies, and full stack development.
Mentor and coach junior members of the team, fostering a culture of collaboration, learning, and innovation.
Job Responsibilities:
Job Description: Site Reliability Engineer (SRE) – Metro, AWL, HINT
Role Overview:
As a Site Reliability Engineer (SRE) for the Metro, AWL, and HINT applications, you will be responsible for ensuring the reliability, scalability, and performance of their entire technology stack. You will work closely with development, operations, and other cross-functional teams to implement best practices in site reliability engineering, automate processes, and optimize system performance.
Key Responsibilities:
Timing: Late start in India time, overlapping with the PST morning
Monitoring & Alerting Tools
Splunk
Kubernetes
Conduktor
Spark
Scylla
Oracle (Admin and SQL Developer)
Snowflake (Snowsight)
Grafana
Kafka