JOB DESCRIPTION
Why this job matters
The Mobile Systems Development unit designs, builds and maintains the UK voice and mobile communication and collaboration services for BT.
BT’s Centralised Logging Solution (CLS) is part of the Galaxy program and is a near-real-time streamed event analytics and monitoring solution. CLS collects and processes events from diverse network domain sources for EE and BT’s mobile core network. It then aggregates the events and applies any necessary transformations or formatting before storing them. Our SRE unit gathers requirements from various sources and integrates those sources with CLS. We are also responsible for developing parser scripts for each component for data enrichment and for providing visual dashboards for end users. The solution is going through a major architectural transformation to provide a common collection layer, comply with Telecom Security Act (TSA) and GDPR guidelines, and bring in industry best practices for machine learning and anomaly detection.
Site Reliability Engineering combines technical development and IT operations. We take applications and services through to live, breaking down silos and blurring the lines between traditional operations and development teams. This speeds up and automates the development pipeline.
The role holder will be responsible for ensuring system uptime in accordance with SLAs by building automation solutions which can be used in operations, maintenance, and incident scenarios, whilst working alongside developers to create new features, remove bugs and add value.
This role is also part of a 24x7 operational team, which will require joining a paid callout rota in future. The role offers hybrid working (3 days per week in the Bangalore Pritech office and 2 days wherever you like).
What you’ll be doing
You’ll be:
Partnering closely with external systems, our internal development team and product managers to build and deploy features that are highly available, performant and secure.
Developing innovative ways to smartly measure, monitor and report application and infrastructure health, capacity and performance.
Gaining deep knowledge of our data collection, ingestion and enrichment layers.
Using monitoring tools to identify and resolve latency, saturation and errors in the platform, ensuring the best application performance and customer experience.
Supporting the development and delivery of platform technology upgrades.
Taking new data pipelines to live, acting as the quality gate check for the integration.
Providing support to 1st and 2nd line operations teams, whilst interacting with developers and other stakeholders to resolve complex issues and prevent recurrence.
Responding to incidents in a timely manner and conducting post-incident reviews.
Documenting what’s been done, not because it ticks a box, but because it’s useful.
Ensuring compliance with regulatory and security standards as required.
The skills and experience you’ll need
Essential
Desirable