Software Engineer

Smart Summary
Join the Azure AI Infrastructure team at Microsoft to build cutting-edge AI infrastructure services on the Azure AI Platform. Design components for AI languages, cluster orchestration, job scheduling, storage, networking, and containerization. Lead development of scalable Microsoft Service Fabric and Kubernetes clusters. Enhance service quality, security, and performance to support AI training and inferencing workloads.
Overview

The Azure AI Infrastructure team is looking for passionate engineers to build the largest deep-learning infrastructure service at Microsoft. In this role you will build new components that bring the latest innovations in AI infrastructure onto the Azure AI Platform. You will partner with top engineering talent within Azure AI Infrastructure and across Azure to work on cluster orchestration, job scheduling, storage, networking, containerization, and operating system integration. Your work will enable various AI languages and runtimes on Azure AI Infrastructure to bring distributed deep learning training and inferencing to life. In addition, you will build the infrastructure components required to build, deploy, monitor, and service the highly available and scalable Microsoft Service Fabric and Kubernetes clusters under your care. You will lead development and customer support from the frontline and establish architecture, service excellence guidelines, and a high bar for quality.

Candidates must have a track record of delivering engineering and service excellence on a mid-to-large-scale service.

Who Are We?

We are engineers on Azure AI Infrastructure. We believe that building a planet-scale AI supercomputer from the ground up, one that addresses the fundamental pain points of data scientists and AI practitioners and takes AI to unprecedented scale, is the opportunity of a lifetime. If you share the same dream, come join us!

What Is Azure AI Infrastructure?

High-scale AI workloads are always testing the limits of the infrastructure stack. Large-scale model training and inference, with huge volumes of training data on hundreds to thousands of GPUs, is a true engineering challenge. Azure AI Infrastructure is a globally distributed, multi-tenant service that provides robust, cost-effective, and competitive AI infrastructure (compute, networking, and storage) for AI training and inferencing. By abstracting workloads from the underlying infrastructure, Azure AI Infrastructure creates a shared pool of resources that can be dynamically provisioned to fully utilize expensive GPU compute, enabling data scientists to productively build, scale, experiment with, and iterate on their models on top of a robust, performant, scalable, and cost-effective distributed infrastructure built for AI. In Azure AI Infrastructure, we are constantly seeking to apply the best ideas from AI, machine learning, distributed systems, distributed databases, information retrieval, networking, and security.

Qualifications
  1. 1-3 years of coding experience in one of C#, C, C++, Rust, or Go
  2. Experience working with the Linux operating system and Kubernetes cluster orchestration
  3. Experience with improving service operations or engineering fundamentals
  4. Excellent collaboration skills
  5. A master’s or bachelor’s degree in computer science or a related field
  6. At least 1 year of experience building and shipping production software or services

 

Responsibilities
  1. Deliver a robust container orchestration platform for Azure AI Infrastructure
  2. Design and build the scheduling sub-system that is responsible for delivering on the SLAs for AI training and inferencing workloads
  3. Design and build storage and caching system for efficient DNN training and inferencing
  4. Design and build control plane APIs for creation and management of training jobs and inference model metadata
  5. Deliver node management, fault detection and node repair as a service to improve job/model reliability
  6. Deliver world-class monitoring systems and telemetry pipelines to enhance service and job observability for both end-users and operators
  7. Codify security and compliance requirements by building and strengthening system defenses against malicious attacks and exploits
  8. Leverage performance and profiling tools to identify hot spots and bottlenecks across hardware and software boundaries (CPU, GPU, microcode, OS, and networking code) and drive end-to-end job performance

Company: Microsoft
Job Posted: 9 months ago
Job Type: Full-time
Work Mode: Hybrid
Experience Level: 0-2 Years
Category: Software Engineering
Locations: Bengaluru, Karnataka, India
Qualification: Bachelor
