Site Reliability Engineer - Azure

Smart Summary (powered by Roshi)
We are looking for engineers who are passionate about reliability, performance, and efficiency. You will build tools, services, and automation to manage and improve production services, and work to improve the reliability and performance of distributed systems and containerized deployments. Day-to-day work includes troubleshooting complex systems and automating tasks so that they need minimal intervention. Knowledge of Linux, networking, and monitoring is essential. Experience with Linux cloud services, containerization technologies, and cloud platforms such as Azure, GCP, and AWS is preferred. Familiarity with database technologies and scripting languages such as Perl, Go, or Python is good to have. You should be able to participate in 24x7 on-call rotations and contribute actively to performance testing, capacity planning, and high-availability practices. The role suits a team player with a resourceful attitude who can help onboard new team members and monitor and resolve infrastructure issues.

We are looking for engineers who are passionate about reliability, performance, and efficiency, and who have experience building tools, services, and automation to manage and improve production services.

 

  • Knowledge of systems internals and security, Linux, networking, and monitoring.
  • Work to improve the reliability and performance of the next generation of distributed systems and containerized deployments.
  • Diagnose and troubleshoot complex distributed systems handling millions of queries per second.
  • Knowledge of Linux cloud services using KVM/QEMU/LVM.
  • Knowledge of containerization technologies such as Docker, including deploying and troubleshooting containers.
  • Understanding of cloud platforms such as Azure, GCP, and AWS, and the ability to set up, configure, monitor, and troubleshoot PaaS components such as firewalls, VPN gateways, load balancers, storage accounts, and networks.
  • In-depth knowledge of Perl/Go/Python to automate tasks so that they need minimal intervention.
  • Day-to-day work is heavily command-line driven, which requires a strong understanding of Linux.
  • Troubleshoot issues across the entire stack: hardware, software, application, and network.
  • Knowledge of database technologies, specifically MySQL/NoSQL, is good to have.
  • Participate in 24x7 on-call rotations.
  • Design, build, and maintain core infrastructure that enables PhonePe to scale to support hundreds of thousands of concurrent users.
  • Actively take part in the analysis and system improvement plan.
  • Drive performance testing, capacity planning, and high-availability practices.
  • Own implementations of new technologies while ensuring proper testing and documentation.
  • Proactively monitor, identify, and solve issues that could have an impact on our infrastructure.
  • Be a natural team player with a resourceful attitude.
  • Buddy new team members and get them production-ready.

Company: PhonePe
Job Posted: a year ago
Job Type: Full-time
Work Mode: On-site
Experience Level: 0-2 years
Locations: Bengaluru, Karnataka, India
Qualification: Bachelor

Related Jobs

Site Reliability Engineer

Groww

Gurgaon, Haryana, India

+2 more

Posted: a year ago

Monitor and troubleshoot system performance, availability, and security. Analyze metrics and trace data. Collaborate with development teams for scalability and reliability. Manage app releases and resolve production issues. Conduct root cause analysis. Optimize system performance and capacity planning. Utilize CI/CD tools.

Senior Site Reliability Engineer

NVIDIA

Bengaluru, Karnataka, India

Posted: a year ago

What you will be doing: Design, implement and support large scale Kubernetes clusters with monitoring, logging and alerting. Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement. Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management and launch reviews. Maintain services once they are live by measuring and monitoring availability, latency and overall system health. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems. Be part of an on-call rotation to support production systems.

What we need to see: A minimum of 3 years of hands-on experience in setup, administration and maintenance of multiple large (100+ nodes) Kubernetes clusters on-prem and on Cloud Service Providers like AWS, Azure, GCP, OCI. Strong coding experience in one or more of the following languages: Go, Python, Perl, Java, C, C++, Ruby. Hands-on system administration experience of at least 2 years on large scale UNIX production environments, with validated debugging and troubleshooting skills. Ability to maintain platform SLAs through accurate resolutions. Outstanding teammate who can collaborate and influence in a multifaceted environment. Demonstrable experience in handling algorithms, data structures, complexity analysis and software design. BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics).

Ways to stand out from the crowd: Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker. Demonstrated ability to automate routine tasks, debug and optimize existing code. Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive. Hands-on experience in network and storage administration. Unit testing and benchmarking are an integral part of your code. Ability to reason and choose the best possible algorithm to meet scaling and availability challenges. Ability to decompose complex requirements into simple tasks and reuse available solutions to implement most of those.