JOB DESCRIPTION
The Site Reliability Engineering Manager manages a team within the site reliability engineering organization, ensuring BT delivers the service performance, reliability and availability that internal and
external customers expect by contributing to cross-functional engineering discussions to achieve scalable, measurable, fault-tolerant, and cost-effective cloud services.
Accountabilities
- Collaborates with engineering leadership in supporting the development of the architectures and practices that should be adopted in order to deliver on engineering and operational goals.
- Acts as In-Life Manager for all CSOC/P1 and P2 incidents across the Corporate Units estate. Owns and leads retrospective and preventive actions after each high-severity production incident.
- Provides technical support to product teams to optimize availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
- Ensures the Corporate Units application estate has 100% monitoring coverage.
- Solves problems relating to mission-critical services and builds automation to prevent problem recurrence with the goal of automated response to all non-exceptional service conditions.
- Executes new builds of infrastructure tooling that improves reliability across the entire product surface area, dealing with massive distributed scale.
- Coordinates work with development teams during the design phase to build and perform infrastructure upgrades that support application availability and reliability.
- Oversees MTTR governance across all incidents in Corporate Units and works with the application management team to continuously improve MTTR.
- Drives incident-to-PR linkage for all reputable incidents, enabling cluster analysis that drives AI-Ops adoption.
- Oversees the delivery of infrastructure-as-code software to improve the availability, scalability, latency, and efficiency of services.
- Oversees the implementation of robust monitoring and alerting systems.
- Manages the queue and support processing to ensure early warning of support issues.
- Champions, continuously develops and shares with the team knowledge of emerging trends and changes in site reliability engineering best practices and industry standards.
- Coaches talent, and manages others, to develop capabilities and ensure performance through upskilling, development and recruitment.
- Implements ways to improve working processes within the site reliability engineering area of responsibility.
Expected experience
• Computing infrastructure experience
• Understanding of Unix administration, networking, protocols and hardware
• Infrastructure as code: practical, hands-on knowledge of Terraform or CloudFormation
• Orchestration of operational jobs using tools such as Rundeck (or alternatives)
• Configuration management tools and frameworks, primarily Ansible, though skills in other tools are acceptable (Puppet, Chef, Salt, etc.)
• General networking skills, especially in the context of a public cloud (e.g. AWS – VPC, subnets, routing tables, NAT / internet gateways, DNS, security groups)
• Understanding of how to specify an environment for high availability
• Understanding of the need to deploy geo-resilient application architectures
• Ability to work in a fast-paced, high-energy team environment
Preferred:
• Cloud experience – specifically AWS
• Experience using databases
• Knowledge of relevant BT security standards
• DevOps engineering tooling such as Ansible, Rundeck, Checkmk, Docker, Kubernetes, Nagios
• Experience with CI/CD technologies and techniques
• Understanding of BT server builds, enterprise cloud and external cloud vendor hosting