Site Reliability Engineering Professional

The Mobile Systems Development unit designs, builds, and maintains the UK voice and mobile communication and collaboration services. This role is responsible for ensuring system uptime, building automation solutions, and working alongside developers to add value. It also involves monitoring and resolving issues, supporting platform upgrades, and ensuring compliance. The role requires experience in deploying production systems, working with load balancers, SSL/TLS configuration, data streaming, infrastructure automation, containerization, and source control. Knowledge of incident and change management, communication skills, and familiarity with OpenStack and machine learning are desirable.

JOB DESCRIPTION

Why this job matters

The Mobile Systems Development unit designs, builds, and maintains the UK voice and mobile communication and collaboration services for BT.

BT’s Centralised Logging Solution (CLS) is part of the Galaxy program and is a near-real-time streamed event analytics and monitoring solution. The CLS collects and processes events from diverse network domain sources for EE and BT’s mobile core network. It then aggregates the events and applies any necessary transformations or formatting before storing them. Our SRE unit gathers requirements from various sources and integrates them with CLS. We are also responsible for developing the parser scripts that enrich data for each component and for providing visual dashboards for end users. The solution is going through a major architectural transformation to provide a common collection layer, comply with the Telecom Security Act (TSA) and GDPR guidelines, and bring in industry best practices for machine learning and anomaly detection.

Site Reliability Engineering combines technical development and IT operations. We take applications and services through to live, breaking down silos and blurring the lines between traditional operations and development teams. This speeds up and automates the development pipeline.

The role holder will be responsible for ensuring system uptime in accordance with SLAs by building automation solutions which can be used in operations, maintenance, and incident scenarios, whilst working alongside developers to create new features, remove bugs and add value.

This role is also part of a 24x7 operational team, which will require joining a paid callout rota in future. The role offers hybrid working (3 days per week in the Bangalore Pritech Office and 2 days wherever you like).

What you’ll be doing

You’ll be:

Partnering closely with external systems, our internal development team and product managers to build and deploy features that are highly available, performant, and secure.

Developing innovative ways to smartly measure, monitor, and report application and infrastructure health, capacity, and performance.

Gaining deep knowledge of our data collection, ingestion, and enrichment layers.

Using monitoring tools to identify and resolve latency, saturation, and errors in the platform, ensuring the best application performance and customer experience.

Supporting the development and delivery of platform technology upgrades.

Taking new data pipelines to live, acting as a quality gate check for the integration.

Providing support to 1st and 2nd line operations teams, whilst interacting with developers and other stakeholders to resolve complex issues and prevent recurrence.

Responding to incidents in a timely manner and conducting post-incident reviews.

Documenting what’s been done, not because it ticks a box, but because it’s useful.

Ensuring compliance with regulatory and security standards as deemed necessary.

The skills and experience you’ll need

Essential

Proven experience in deploying, managing, and operating medium to large-scale production systems running Linux servers.
Experience working with load balancers, executing platform integration, and establishing connectivity with external network components.
Familiarity with SSL/TLS configuration, certificate management, and security hardening.
Extensive knowledge of distributed data streaming with Apache Kafka.
Experience in infrastructure and configuration management automation with Terraform, Ansible, etc.
Hands-on work experience with Docker and Kubernetes (day-to-day operations).
Hands-on work experience with ELK stack (Elasticsearch, Logstash, and Kibana) administration.
Source control with Git and setting up CI/CD automation with GitLab Runner.
Familiarity with ServiceNow for incident and change management.
Sound communication and interpersonal skills.

Desirable

Familiarity with the OpenStack cloud ecosystem.
Fundamental understanding of machine learning and anomaly detection models.


Company

BT Group

Job Posted

2 years ago

Job Type

Full-time

WorkMode

On-site

Bengaluru, Karnataka, India

