Principal Engineer - PSRE
Design, develop, and implement scalable monitoring solutions for large distributed systems. Lead monitoring architectures, integrate monitoring tools, and support monitoring systems while driving innovation and technical design decisions. Develop R&D projects in line with the team's technical vision. Requires 10+ years of SRE experience and expertise in monitoring technologies, cloud-based systems, and distributed tracing implementation, along with experience with Kubernetes and Agile practices and strong analytical skills. Location: Hyderabad, Telangana, India/Bengaluru, Karnataka, India. Full-time, on-site opportunity.
Company Overview
Arcesium is a global financial technology firm that solves complex data-driven challenges faced by some of the world’s most sophisticated financial institutions. We constantly innovate our platform and capabilities to meet tomorrow’s challenges, anticipate the risks our clients encounter, and design advanced solutions to help our clients achieve transformational business outcomes.
Financial technology is a high-growth industry as change and innovation continue to disrupt the status quo and prompt major transformation. Arcesium is at a particularly interesting point in its own growth as we look to leverage our established market position and expand operations in pursuit of strategic new business opportunities. We value intellectual curiosity, proactive ownership, and collaboration with colleagues, and we empower you to contribute meaningfully from day one and accelerate your professional development.
What You'll Do
- Design, develop, and implement scalable and reliable monitoring solutions for large distributed systems.
- Define and implement monitoring requirements in collaboration with cross-functional teams.
- Lead the development of monitoring architectures and strategies.
- Integrate monitoring tools into existing infrastructure.
- Maintain and support monitoring systems.
- Demonstrate strong technical breadth/depth, driving innovation, evaluating new technologies, and deciphering the technical vision for engineering teams.
- Own key contributions to technical design and architecture decisions, considering trade-offs of choices, managing risk, making decisions independently where appropriate, and presenting reasoned options for decision making by others.
- Lead the way by writing exemplary code, documentation, and RFCs.
- Identify, propose, develop, deploy, and own R&D projects in accordance with the technical vision and needs of the team, turning problem statements into solutions, and operating independently as needed.
What You'll Need
- 10+ years of experience in SRE or a related field.
- Proven experience in designing, developing, and implementing monitoring solutions.
- Deep understanding of monitoring technologies and tools, including Prometheus, Grafana, Loki, and Tempo.
- Experience with cloud-based monitoring systems, such as New Relic, Datadog, and Grafana Cloud.
- Experience with log analysis tools, such as Splunk, Logstash, Fluent, and Sumo Logic.
- Experience implementing distributed tracing with OpenTelemetry and Jaeger.
- Strong understanding of SRE principles and practices.
- Experience with incident response and management.
- Reliability: exposure to chaos engineering and other reliability practices, including disaster recovery, is good to have.
- Experience with cloud computing platforms such as AWS.
- Experience with Kubernetes.
- Experience with Agile practices (Scrum).
- Excellent analytical, problem-solving, and troubleshooting skills.
- Excellent communication and presentation skills.
- Experience managing and mentoring engineers.
- Ability to work independently and as part of a team.