Site Reliability Engineer/ DevOps (7-14 years)


As a Senior Site Reliability Engineer (SRE), you will assume a leadership role in ensuring the reliability, scalability, and performance of our company's software systems and infrastructure. You will be responsible for driving the evolution of SRE practices and collaborating closely with engineering teams to architect and implement highly available and resilient systems. The role requires a deep understanding of software development, system design, and operations, as well as the ability to mentor and guide junior SRE team members.

What You Will Do:

System Architecture and Design: Lead the design and implementation of highly available, scalable, and fault-tolerant systems in collaboration with software development teams. Employ best practices and architectural principles to ensure long-term system stability and maintainability.
Incident Response and Management: Take ownership of critical incidents and coordinate cross-functional teams to resolve them efficiently. Conduct thorough post-mortem analysis and leverage learnings to enhance system resilience and response procedures.
Performance Optimization and Capacity Planning: Analyze system performance, identify bottlenecks, and work with engineering teams to optimize performance. Develop capacity planning strategies to support business growth and future demands.
Automation and Tooling: Drive automation initiatives to streamline operational tasks, deployment processes, monitoring, and incident response. Mentor team members on best practices in automation and encourage a culture of innovation.
Security and Compliance: Ensure that security measures are integrated into system design and operations. Collaborate with security teams to proactively address potential vulnerabilities and maintain compliance with industry standards and regulations.
Monitoring and Alerting: Oversee the implementation and maintenance of robust monitoring and alerting systems. Ensure timely responses to alerts and lead ongoing efforts to improve the monitoring framework.
Continuous Integration and Continuous Deployment (CI/CD): Enhance the CI/CD pipeline to enable seamless and reliable deployments. Foster a culture of continuous improvement in the deployment process.
Documentation and Knowledge Sharing: Establish comprehensive documentation and knowledge sharing practices within the SRE team and across engineering teams. Mentor junior members to improve their technical expertise and problem-solving skills.
Technical Leadership: Provide technical guidance and mentorship to junior SRE team members. Collaborate with other senior stakeholders to drive technical strategy and foster a culture of technical excellence.

Who You Are:

8+ years of experience and a Bachelor's degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience).
Substantial experience as a Site Reliability Engineer or in a similar role, with proven progression in responsibility and leadership.
Expertise in software development and proficiency in multiple programming languages (e.g., Python, Go, Java).
In-depth knowledge of cloud platforms (e.g., AWS, Google Cloud, Azure) and containerization technologies (Docker, Kubernetes).
Strong understanding of system architecture, distributed systems, and networking principles.
Experience with monitoring and logging tools such as Prometheus, Grafana, Datadog, and ThousandEyes.
Proven track record of driving automation initiatives and using infrastructure-as-code tools (e.g., Terraform, Ansible).
Excellent problem-solving and critical-thinking skills, with a focus on root cause analysis.
Ability to lead and mentor technical teams, fostering a collaborative and innovative environment.

Location

Bengaluru, India

Company

Cisco

Job Posted

2 years ago

Job Type

Full-time

WorkMode

On-site

Experience Level

8-12 years

Related Jobs

AppD Senior DevOps Engineer with Distributed Systems & Microservices (exp 10-14 Yrs)

Cisco

Bengaluru, Karnataka, India

Posted: 2 years ago

DevOps Engineer (Data Platform) – Site Reliability – Web Services Engineering | Bengaluru

About AppDynamics

AppDynamics is an Application Intelligence company. With AppDynamics, enterprises have real-time insights into application performance, user performance, and business performance to move faster in an increasingly sophisticated, software-driven world. Our integrated suite of products is built on our innovative, full-stack observability platform that enables our customers to make faster decisions that enhance customer engagement and improve operational and business performance. AppDynamics is uniquely positioned to enable enterprises to accelerate their digital transformations by actively monitoring, analyzing, and optimizing complex application environments at scale, which has led to proven success and trust with the world’s most innovative companies.

About You

We are looking for a talented, motivated engineer to join our engineering team and help us continue to build and scale metrics platforms. In this role, you are expected to work on distributed systems that handle real-time ingestion and analytics at massive scale. You are passionate about data engineering, scalability, availability, and performance. You also have:

Minimum of a bachelor’s degree in CSE, EE, CSM, or a related technical discipline.
Minimum of a combined 8-11 years of Site Reliability, DevOps, and/or Software Development experience, ideally in a growth-stage environment.
Experience operating within, and supporting, complex SaaS production or revenue-critical 24/7 web services environments.
Must have experience developing and operationalizing system installations and upgrades.
Experience with Unix/Linux system administration, especially Red Hat Linux (CentOS).
Experience running and administering services in AWS or other cloud platforms (Azure, GCP).
Significant experience with one or more scripting/coding languages or tools, ideally Ansible, Terraform, or Python.
Experience with big data platform engineering.
Experience with scaling and operationalizing distributed data stores, file systems, and services (Kafka, Elasticsearch, HBase, Druid, etc.).
Experience with virtualization and containerization platforms (Docker), container orchestration tools (Kubernetes), and Kubernetes tooling that eases delivery (Istio/Helm/Kube2Iam).
Availability for occasional after-hours on-call support.

Day-to-day responsibilities include:

Building systems that ensure the reliable operation of distributed data stores.
Helping to build infrastructure that facilitates rapid service deployments.
Documenting findings and recommendations for improvement.
Maintaining and enhancing deployment tools and methodologies, and leading the advancement of our 'Infrastructure as Code' architecture.
Improving the monitoring systems that support our service reliability.
Creating repeatable, efficient, and scalable artifact deployment pipelines.
Making recommendations to, and interfacing with, engineering to ensure 100% application uptime.
Monitoring the SaaS environment and working with QA, Developers, and Ops to identify and tackle problems.
Ensuring that failover mechanisms are in place and working correctly.
Responding to and resolving technical emergencies.

About the Role

The data platform powers AppDynamics’ Application Intelligence Platform. It handles billions of requests and massive amounts of metrics, events, and other data. It is real-time, highly scalable, and highly available. It is the data source for performance monitoring and troubleshooting, policy evaluation, workflow automation, data visualization, and slice-and-dice data analysis. Come join the Data Platform team and build the world’s next phenomenal Application Intelligence Platform. The engineers on this team are passionate about big data and analytics and about infinitely scalable, highly available platforms. They understand the importance of data collected from every application and component in a software-defined business environment (web, mobile, server, infrastructure, and hardware) in enabling the most advanced and effective business and IT decision-making.

Site Reliability Engineer

Juniper Networks

Bengaluru, Karnataka, India

Posted: a year ago

Job Description

Juniper is changing what’s possible in networking. We’re going beyond building the networks customers expect; we’re building the networks customers deserve. And the world is taking note. But to continue to excel, we have work to do. Change in our industry is accelerating. To power connections and empower change, we need radical thinkers, eternal optimists, and energized personalities. We need people like you.

Juniper is seeking a full-time SRE to join our talented team and support high-quality technology solutions that revolutionize wireless and wired networks, powered by Artificial Intelligence in the cloud. Juniper provides services through SaaS applications to several enterprises, including Fortune 100 and Fortune 500 customers. You will be responsible for maintaining and improving the company's production environment for rapid scaling and outstanding performance, and for keeping cloud uptime and reliability stellar. Your primary responsibilities will be incident management and release management in cloud instances across various regions.

Responsibilities:

Manage system availability, health, and service levels (SLAs, SLOs) of the large-scale cloud infrastructure running in AWS and GCP.
Proactively monitor, diagnose, and analyze failures, and support software engineers in debugging production issues across microservices and distributed platforms.
Work with the development team to resolve the issues found.
Participate in the on-call rotation and resolve issues in a 24x7 multi-cloud (AWS/GCP) environment.
Monitor metrics and performance of applications and cloud infrastructure.
Manage code releases, i.e., push code and patches to the cloud.
Own the entire incident lifecycle (incident management), including reporting, analyzing, and handling incidents through to closure, and writing RCAs.
Maintain a laser focus on analyzing scalability, reliability, high availability, performance, software maintainability, and operational challenges.
Write and maintain runbooks for knowledge-driven automated processes and bots.
Perform capacity planning based on performance, usage, and utilization stats.
Perform after-hours infrastructure updates and maintenance.
Follow SRE best practices and procedures.

Required Skills:

Bachelor’s degree in Computer Science, Computer Engineering, or equivalent.
Minimum 6-7 years of DevOps/SRE experience.
5+ years of hands-on experience with AWS or GCP: EC2 (GCE), IAM, S3 (GS), Docker, Kubernetes pods, Jenkins, Prometheus, CloudWatch (Stackdriver), Linux, Ansible, Salt.
5+ years of experience deploying code and infrastructure in AWS or GCP using continuous integration/continuous delivery (CI/CD) tools in production environments.
5+ years of automation experience using Python, Golang, and/or shell scripting.
6+ years of prior experience developing metrics to monitor the health of infrastructure and applications.
5+ years of experience managing SaaS application infrastructure, with REST-based test automation experience using Python.
Basic understanding of Terraform, CloudFormation, or any IaC tool is preferred.
Ideally, a detailed understanding of IP routing, security, and cloud services such as CGNAT, IPsec, IDP, and SD-WAN/SDN for different customer use cases.
A thorough understanding of networking fundamentals (TCP/IP, UDP, DHCP, DNS, ICMP, ARP, routing and switching).
General understanding of distributed systems.
Understanding of data management technologies, including relational and non-relational databases.
Hands-on experience operating large-scale cloud-based distributed applications.
Knowledge of build pipelines and infrastructure such as Jenkins, GitHub, and CI/CD would be an added advantage.
The ability to "fix the plane while in flight".

Senior Site Reliability Engineer

HealthEdge

Bengaluru, Karnataka, India

Posted: a year ago

HealthEdge is seeking a Senior Site Reliability Engineer to enhance stability, availability, and scalability of private cloud environments. Guide teams in problem resolution, system architecture recommendations, troubleshooting, and software solutions. Mentor junior developers and adopt industry best practices. Full-time onsite opportunity in Bengaluru, Karnataka, India.

Software Engineer (DevOps/SRE - Cloud AWS, Kubernetes, Docker, Python, Terraform, Ansible) - 8+ years

Cisco

Bengaluru, Karnataka, India

Posted: 2 years ago

Cloud Security Engineering at Cisco drives the technology that's transforming the way customers secure their networks and, more importantly, their users. We're seeking a Software Engineer with a robust background in software development and familiarity with DevOps practices. The individual in this role will be crucial in shaping our infrastructure, enhancing our deployment pipelines, and maintaining our monitoring systems. As a key member of the Network eXperience organization, you will be part of a team responsible for the design, development, and operation of key microservices focused on the cloud network experience, traffic optimization, and the related insights that our Umbrella and Cisco Secure Access products offer. This is a small team that does big things.

What You'll Do

• Develop, implement, and optimize continuous delivery pipelines for various applications.
• Ensure all systems are scalable, reliable, secure, and efficient.
• Collaborate with software engineers to make sure operational issues (such as system sizing, system configuration, or load balancing) are considered in software design.
• Build and manage dashboards that provide visibility into production system health and performance.
• Solve complex problems related to cloud infrastructure services and build automation to prevent problem recurrence.
• Participate in the creation of new distributed components and services.
• Utilize various open source technologies, tools, and cloud services to support continuous integration efforts.
• Foster a culture of continuous improvement by learning, teaching, and implementing innovative practices.

Basic Qualifications:

• Bachelor’s degree in Computer Science, Engineering, or a related field.
• At least 2 years of experience in DevOps, Site Reliability Engineering (SRE), or similar roles.
• Proficiency in scripting languages such as Python, Bash, or JavaScript.
• Familiarity with cloud services (AWS, GCP, Azure) and containerization technologies (Docker, Kubernetes).
• Understanding of CI/CD pipelines and configuration management.

Desired Qualifications:

• Experience with infrastructure as code (IaC) using tools like Terraform, Ansible, or similar.
• Familiarity with database systems, both SQL and NoSQL.
• Good communication and teamwork skills.

Who You'll Work With

The members of the Cloud Security Engineering Network eXperience team build and operate core control plane services for the Umbrella and Cisco Secure Access platform. We are a team that is supportive of learning and experimentation. We work closely with the rest of the Cloud Security Engineering teams and other engineering groups across Cisco.

Software Engineer - Snowflake / Cloud Data Warehouse / Python - 7+ years

Cisco

Bengaluru, Karnataka, India

Posted: 2 years ago

Want to be part of something big? Ready for that next challenge? Well then, we want you. We are looking for creative and passionate people to join the Supply Chain transformation community to accelerate our pace of innovation and increase the value our systems deliver to the business. Cisco is growing and driving new business models across the enterprise. In Cisco Supply Chain IT, our vision is to enable an adaptable, innovative, and scalable Supply Chain for operational perfection and to power Cisco’s growth. We do this through flexible services with a focus on Simplify, Innovate, and Accelerate. Cisco has an unparalleled culture and, year after year, is ranked the world’s #1 best place to work according to www.greatplacetowork.com.

Responsibilities:

Interface with the business to understand data foundation requirements.
Identify data sources, analyze the data, and design data models.
Develop data pipelines for the extraction, transformation, and loading of data.
Test the loaded data, coordinate business testing, and gather sign-off from the business on data quality and accuracy.
Support the data warehouse, ensure job refreshes are completed on time, and support reported user issues.
Support initiatives for data integrity and normalization.
Assess tests, implement new or upgraded software, and assist with strategic decisions on new systems.
Generate reports from single or multiple systems.
Troubleshoot the reporting database environment and reports.
Evaluate changes and updates to source production systems.
Communicate insights and provide solutions that have proven results.
Provide technical expertise in data storage structures, data mining, and data cleansing.

Minimum Requirements:

Bachelor’s degree in computer science from an accredited university or college.
Minimum 4 years in data engineering/data analyst roles.
Experience working with the Snowflake cloud data warehouse, Python programming, AI/ML projects, and OLTP databases such as Oracle.
Exposure to Supply Chain data and processes is an advantage.
Ability to work with stakeholders to assess potential risks.
Experience working in Agile Scrum teams and with Jira.
Understanding of addressing and metadata standards.
