Site Reliability Engineer - Azure

Smart Summary (powered by Roshi)

We are looking for engineers who are passionate about reliability, performance, and efficiency. Your role will be to build tools, services, and automation to manage and improve production services. You will work to improve the reliability and performance of distributed systems and containerized deployments. Troubleshooting complex systems and automating tasks with minimal intervention will be part of your day-to-day work. Knowledge of Linux, networking, and monitoring is essential. Experience with Linux cloud services, containerization technologies, and cloud platforms like Azure, GCP, and AWS is preferred. In-depth knowledge of a scripting language such as Perl, GoLang, or Python is expected, and knowledge of database technologies is good to have. You should be able to participate in 24x7 on-call rotations and actively contribute to performance testing, capacity planning, and high availability practices. The ideal candidate is a team player with a resourceful attitude who can help onboard new team members and monitor and solve infrastructure issues.

We are looking for engineers who are passionate about reliability, performance, and efficiency, and who have experience building tools, services, and automation to manage and improve production services.

Knowledge of systems internals/security, Linux, networking, and monitoring.
Work to improve the reliability and performance of the next generation of distributed systems and containerized deployments.
Diagnose and troubleshoot complex distributed systems handling millions of queries per second.
Knowledge of Linux cloud services using KVM/QEMU/LVM.
Knowledge of containerization technologies like Docker, including the deployment and troubleshooting of containers.
Understanding of cloud platforms like Azure, GCP, and AWS, and the ability to set up, configure, monitor, and troubleshoot various PaaS components such as firewalls, VPN gateways, load balancers, storage accounts, networks, and others.
In-depth knowledge of Perl/GoLang/Python to automate tasks with minimal intervention.
Day-to-day work is heavily command-line driven, which requires a strong understanding of Linux.
Troubleshoot issues across the entire stack: hardware, software, application, and network.
Knowledge of database technologies, specifically MySQL/NoSQL, is good to have.
Participate in 24x7 on-call rotations.
Design, build, and maintain core infrastructure that enables PhonePe to scale to support hundreds of thousands of concurrent users.
Actively take part in the analysis and system improvement plan.
Drive performance testing, capacity planning and high availability practices.
Own implementations of new technologies while ensuring proper testing and documentation.
Proactively monitor, identify, and solve issues that could have a potential impact on our infrastructure.
Be a natural team player with a resourceful attitude.
Buddy new team members and get them production ready.

Company

PhonePe

Job Posted

2 years ago

Job Type

Full-time

Work Mode

On-site

Experience Level

0-2 years

Locations

Bengaluru, Karnataka, India

Qualification

Bachelor

Related Jobs

Site Reliability Engineer

Groww

Gurgaon, Haryana, India

+2 more

Posted: 2 years ago

Monitor and troubleshoot system performance, availability, and security. Analyze metrics and trace data. Collaborate with development teams for scalability and reliability. Manage app releases and resolve production issues. Conduct root cause analysis. Optimize system performance and capacity planning. Utilize CI/CD tools.

Senior Site Reliability Engineer

NVIDIA

Bengaluru, Karnataka, India

Posted: 2 years ago

What you will be doing: Design, implement and support large scale Kubernetes clusters with monitoring, logging and alerting. Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement. Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management and launch reviews. Maintain services once they are live by measuring and monitoring availability, latency and overall system health. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity. Practice sustainable incident response and blameless postmortems. Be part of an on-call rotation to support production systems.

What we need to see: A minimum of 3 years of hands-on experience in setup, administration and maintenance of multiple large (100+ nodes) Kubernetes clusters on-prem and on Cloud Service Providers like AWS, Azure, GCP, OCI. Strong coding experience in one or more of the following languages: Go, Python, Perl, Java, C, C++, Ruby. Hands-on system administration experience of at least 2 years on large scale UNIX production environments, with validated debugging and troubleshooting skills. Ability to maintain platform SLAs through accurate resolutions. Outstanding teammate who can collaborate and influence in a multifaceted environment. Demonstrable experience in handling algorithms, data structures, complexity analysis and software design. BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics).

Ways to stand out from the crowd: Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker. Demonstrated ability to automate routine tasks, debug and optimize existing code. Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive. Hands-on experience in network and storage administration. Unit testing and benchmarking are an integral part of your code. Ability to reason and choose the best possible algorithm to meet scaling and availability challenges. Ability to decompose complex requirements into simple tasks and reuse available solutions to implement most of those.

Sr. Site Reliability Engineer

Opentext

Bengaluru, Karnataka, India

Posted: 2 years ago

Seeking a skilled Linux OS system administrator with a deep understanding of security and experience in networking administration. Must have practical experience with AWS services and be adept in designing and managing cloud-based solutions. The candidate should be proficient in implementing CI/CD pipelines and possess strong bash scripting skills. The ability to actively participate in team meetings, prioritize tasks, and contribute to system architecture improvement is essential. Immediate availability to address requests and incidents is required.

Staff Site Reliability Engineer

Netskope

Bengaluru, Karnataka, India

Posted: 2 years ago

About the role: Please note, this team is hiring across all levels and candidates are individually assessed and appropriately leveled based upon their skills and experience. The SRE Data / Provisioner team supports the Netskope Data Product Suite, and Provisioner, a critical component of our foundational technologies and the single source of truth for all user data across all Netskope Apps. We are a team of software engineers focused on improving availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of the engineering stacks. If you are passionate about solving complex problems and developing cloud services at scale, we would like to speak with you.

Job Responsibilities:
Partner closely with our development teams and product managers to architect and build features that are highly available, performant and secure.
Develop innovative ways to smartly measure, monitor and report application and infrastructure health.
Gain deep knowledge of our application stack.
Experience improving the performance of microservices and solving scaling/performance issues.
Capacity management and planning.
Function well in a fast-paced and rapidly changing environment.
Participate in 24x7 on-call rotations.

Preferred Qualifications:
BS or MS in Computer Science or equivalent technical degree or related practical experience.

Preferred Technical Skills:
10+ years of experience troubleshooting Unix/Linux.
Understanding of networking concepts: TCP/IP, SSL/TLS, IPSec, GRE, VPN.
Experience with algorithms, data structures, complexity analysis, and software design.
Experience in one or more of the following: C, C++, Python, Go.
Experience in managing a large-scale web operations role.
Bonus points for experience with Ansible, Kubernetes, SQL and NoSQL datastores, CI/CD.
Hands-on experience working with private or public cloud services in a highly available and scalable production environment.

Desired Technical Skills:
Knowledge of distributed systems is a big plus.

Additional Skills:
Great written and verbal communication.
Ability to work in a geo-distributed cross-functional group.
Demonstrated ability to own and deliver projects independently.
Demonstrated ability in technical mentoring and coaching.
Strong interpersonal communication skills (including listening, speaking, and writing) and the ability to work well in a diverse, team-focused environment with other SREs, developers, Product Managers, etc.

Manager Site Reliability Engineer

Zeta

Bengaluru, Karnataka, India

Posted: 2 years ago

Assigns and monitors the work of technical personnel, ensures application development and deployment are done in the best possible way, and implements quality control and review systems. Manages design and development of custom tools and integration with existing tools to increase engineering productivity. Takes responsibility for the architecture and technical leadership of the entire DevOps infrastructure.

Principal Site Reliability Engineer

Zeta

Bengaluru, Karnataka, India

Posted: 2 years ago

The System Reliability Engineer is responsible for 24/7 availability of Zeta’s cloud SaaS platform. Build, deploy, and manage business applications on cloud platforms using container orchestration, service mesh, API gateways, CI/CD components, and observability stacks. Collaborate with product managers, designers, and developers in self-sufficient teams to implement and follow best SRE practices.