Site Reliability Engineering Specialist

Smart Summary
Manage a team to ensure BT delivers reliable and cost-effective cloud services. Collaborate with engineering leadership for goal achievement. Provide technical support for optimization and problem-solving. Implement infrastructure upgrades. Coach and develop talent. Improve processes.

JOB DESCRIPTION

The Site Reliability Engineering Manager manages a team within the site reliability engineering organization, ensuring BT delivers the service performance, reliability and availability that internal and external customers expect, by contributing to cross-functional engineering discussions to achieve scalable, measurable, fault-tolerant, and cost-effective cloud services.
 

Accountabilities

 

  • Collaborates with engineering leadership to support the development of the architectures and practices needed to deliver on engineering and operational goals.
  • Acts as In-Life Manager for all CSOC/P1 and P2 incidents across the Corporate Units estate. Owns and leads retrospective and preventive actions after each high-severity production incident.
  • Provides technical support to product teams to optimize availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
  • Ensures the Corporate Units application estate has 100% monitoring coverage.
  • Solves problems relating to mission-critical services and builds automation to prevent problem recurrence with the goal of automated response to all non-exceptional service conditions.
  • Executes new builds of infrastructure tooling that improve reliability across the entire product surface area, dealing with massive distributed scale.
  • Coordinates work with development teams during the design phase to build and perform infrastructure upgrades that support application availability and reliability.
  • Oversees MTTR governance across all incidents in Corporate Units and works with the application management team to continuously improve MTTR.
  • Drives incident-to-PR linkage for all reputable incidents, enabling cluster analysis that supports AI-Ops adoption.
  • Oversees the delivery of infrastructure as code software to improve the availability, scalability, latency, and efficiency of services.
  • Oversees the implementation of robust monitoring and alerting systems.
  • Manages the queue and support processing to ensure early warning of support issues.
  • Champions, continuously develops, and shares with the team knowledge of emerging trends and changes in site reliability engineering best practices and industry standards.
  • Coaches talent and manages others to develop capabilities and ensure performance through upskilling, development and recruitment.
  • Implements ways to improve working processes within the area of site reliability engineering responsibility.

 

Experience you are expected to have

 

•    Computing infrastructure experience
•    Understanding of Unix administration, networking, protocols and hardware
•    Infrastructure as code: practical, hands-on knowledge of Terraform or CloudFormation
•    Orchestration of operational jobs using tools such as Rundeck (or alternatives)
•    Configuration management tools and frameworks, primarily Ansible, though skills in other tools are acceptable (Puppet, Chef, Salt, etc.)
•    General networking skills, especially in the context of a public cloud (e.g. AWS – VPC, subnets, routing tables, NAT / internet gateways, DNS, security groups)
•    Understanding of how to specify an environment for high availability
•    Understanding of the need to deploy geo-resilient application architectures
•    The ability to work in a fast-paced, high-energy team environment
 

Preferred:
•    Cloud experience – specifically AWS 
•    Experience using databases
•    Knowledge of relevant BT security standards
•    DevOps engineering tooling such as Ansible, Rundeck, Checkmk, Docker, Kubernetes, Nagios
•    Experience with CI/CD technologies and techniques
•    Understanding of BT server builds, enterprise cloud and external cloud vendor hosting


Company: BT Group
Job Posted: a year ago
Job Type: Full-time
Work Mode: On-site
Experience Level: 3-7 Years
Category: Software Engineering
Locations: Bengaluru, Karnataka, India
Qualification: Bachelor

