Site Reliability Engineering Specialist


Smart Summary

Manage a team to ensure BT delivers reliable and cost-effective cloud services. Collaborate with engineering leadership to achieve engineering and operational goals. Provide technical support for optimization and problem-solving. Implement infrastructure upgrades. Coach and develop talent. Improve processes.

JOB DESCRIPTION

The Site Reliability Engineering Manager manages a team within the site reliability engineering organization, ensuring BT delivers the service performance, reliability and availability that internal and external customers expect, by contributing to cross-functional engineering discussions to achieve scalable, measurable, fault-tolerant, and cost-effective cloud services.

Accountabilities

Collaborates with engineering leadership to support the development of the architectures and practices that should be adopted to deliver on engineering and operational goals.
Acts as In-Life Manager for all CSOC / P1 and P2 incidents across the Corporate Units estate. Owns and leads retrospective and preventive actions after each high-severity production incident.
Provides technical support to product teams to optimize availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
Ensures the Corporate Units application estate has 100% monitoring coverage.
Solves problems relating to mission-critical services and builds automation to prevent problem recurrence with the goal of automated response to all non-exceptional service conditions.
Executes new builds of infrastructure tooling that improves reliability across the entire product surface area, dealing with massive distributed scale.
Coordinates work with development teams during the design phase to build and perform infrastructure upgrades that support application availability and reliability.
Oversees MTTR governance across all incidents in Corporate Units and works with the application management team to continuously improve MTTR.
Drives Incident-to-PR linkage for all reputable incidents to enable cluster analysis and drive AI-Ops adoption.
Oversees the delivery of infrastructure as code software to improve the availability, scalability, latency, and efficiency of services.
Oversees the implementation of robust monitoring and alerting systems.
Manages the queue and support processing to ensure early warning of support issues.
Champions, continuously develops, and shares with the team knowledge of emerging trends and changes in site reliability engineering best practices and industry standards.
Coaches talent, and manages others, to develop capabilities and ensure performance through upskilling, development and recruitment.
Implements ways to improve working processes within the area of site reliability engineering responsibility.

Expected experience

•   Computing infrastructure experience
•   Understanding of Unix administration, networking, protocols and hardware
•   Infrastructure as code: practical knowledge and hands-on experience using Terraform or CloudFormation
•   Orchestration of operational jobs using tools such as Rundeck (or alternatives)
•   Configuration management tools and frameworks, primarily Ansible, though skills in other tools are acceptable (Puppet, Chef, Salt, etc.)
•   General networking skills, especially in the context of a public cloud (e.g. AWS: VPC, subnets, routing tables, NAT / internet gateways, DNS, security groups)
•   Understanding of how to specify an environment for high availability
•   Understanding of the need to deploy geo-resilient application architectures
•   The ability to work in a fast-paced, high energy team environment.

Preferred:
•   Cloud experience – specifically AWS
•   Experience using databases
•   Knowledge of relevant BT security standards
•   DevOps engineering tooling such as Ansible, Rundeck, Checkmk, Docker, Kubernetes, Nagios
•   Experience with CI/CD technologies and techniques
•   Understanding of BT server builds, enterprise cloud and external cloud vendor hosting

Site Reliability Engineering Specialist role in Bengaluru, India

Company

BT Group

Job Posted

2 years ago

Job Type

Full-time

Work Mode

On-site

Experience Level

3-7 Years
