Principal Software Architect - Data Center


What you’ll be doing:

Drive the system architecture for a complex server platform in a multi-functional environment.

Work directly with major customers to understand their requirements and align their roadmaps with NVIDIA’s roadmap.

Work with business partners and vendors to shape their products to meet NVIDIA’s needs.

Develop a roadmap of new technologies and protocols and drive their design and adoption.

Mentor architects and engineering teams to grow them into future leaders.

Make key technical decisions even when faced with ambiguity, and mitigate execution risks by following a shift-left strategy.


What we need to see:

Deep experience in designing architecture for scalable and performant server systems, particularly at the SW/HW interface.

Expertise in out-of-band and in-band management architectures.

Knowledge of device management protocols such as MCTP, PLDM and RDE.

Knowledge of system management protocols such as Redfish and IPMI.

Experience working with platform security experts to define tradeoffs between security and ease of use.

Demonstrable experience implementing a shift-left strategy to de-risk program execution.

Excellent written and verbal communication skills.

BS or MS degree in Computer Engineering, Computer Science, or a related field, or equivalent experience.

15+ years of experience in system architecture and design.


Ways to stand out from the crowd:

Knowledge of cloud- and cluster-level deployment and management systems.

Participation and contributions in standards bodies such as OCP and DMTF.

Familiarity with CXL architectures.

Knowledge of storage and networking technologies.


Company

NVIDIA

Job Posted

a year ago

Job Type

Full-time

Work Mode

On-site

Experience Level

13-17 years

Locations

Santa Clara, California, United States

Qualification

Bachelor


Related Jobs


Principal Technical Product Manager – CUDA

NVIDIA

Santa Clara, California, United States

Posted: a year ago

What you will be doing: Product definition and architecture: co-own NVIDIA’s strategy for positioning Grace Hopper across industries and ecosystems, working very closely with other technical architects, research, and engineering teams. Strategy: influence and participate in active discussions with executive staff and senior VPs, paving the path for the NVIDIA Grace Hopper story. Product launches: define the go-to-market strategy and contribute to the cross-functional implementation of the plan across marketing, PR (public relations), sales, and other teams. Asset creation: work with marketing to define positioning that enables the creation of technical content, including blog posts, webinars, developer tutorials, and other product value-proposition tools. You will need strong collaboration and communication skills along with a solid understanding of how technical decisions impact the business of both NVIDIA and its customers.

What we need to see: Master’s degree in Computer Science, Electrical Engineering, Applied Mathematics, or a related engineering field (Ph.D. preferred), or equivalent experience. Experience and familiarity with scientific computing and AI (artificial intelligence) applications. 6+ years of experience developing/architecting software, libraries, and SDKs (software development kits). An additional 5+ years of experience working as a technical product manager at a technology company. World-class communication skills with a proven ability to articulate a value proposition to technical and non-technical audiences.

Ways to stand out from the crowd: Experience developing/architecting system software. Experience developing/architecting parallel, heterogeneous, and/or large-scale software. Deep understanding of system software and networking. Background with CUDA or GPU computing.


Senior Software Engineer, NGC Data Platform

NVIDIA

Santa Clara, California, United States

Posted: a year ago

What you will be doing: Design and build software and cloud services for data management, including a catalog and metadata storage for datasets. Connect with other technical leaders across NVIDIA to ensure you reuse existing technologies where possible and collaborate appropriately with their systems. Collaborate with the NVIDIA research team to use new storage and compute innovations such as GPUDirect Storage and DPUs.

What we need to see: BS in Computer Science, Information Systems, or Computer Engineering (or equivalent experience). 5+ years of proven experience. Experience building robust services at scale. Experience building and maintaining high-volume, low-latency data platform services. Strong foundation in algorithms and data structures and their real-world use cases. Experience with distributed systems, databases, and big data systems (Spark, Hadoop). Experience building and shipping services around Kubernetes, cloud-native technologies, and cloud service providers. Experience with one of the leading cloud providers: AWS, GCP, or Azure. Experience collaborating with teams to write software that supports cloud services. Experience with backend systems and software engineering. Programming experience in a relevant language, e.g., Go, Python, C/C++, or Java. Understanding of standard approaches to software engineering, software architecture, and design. Ability to document software and services. Ability to break down projects into practical tasks. Ability to communicate design, status, and other complex subjects in written, visual, and oral formats. Ability and passion for working across teams and with collaborators on all sides of the project.

Ways to stand out from the crowd: Hands-on experience building and managing large-scale data platform services. Experience building products and services that solve enterprise-grade customer data analytics problems. Experience with Apache Spark, object storage, metadata management, data lake tools (Apache Iceberg), and machine learning infrastructure tooling (feature stores). A computer science background with a specialization in distributed systems.


Senior Software Technical Program Manager - Compute Platform

NVIDIA

Santa Clara, California, United States

Posted: a year ago

What you will be doing: Work closely with software development managers, engineers, product marketing, customer program management, quality assurance, and other logistical personnel to understand, define, and implement processes for developing sophisticated compute software domains for cloud service provider and OEM customers. This also includes responsibilities related to general compute software releases. Schedule and lead status meetings, remove obstacles, mitigate customer concerns, and be the focal point for building and maintaining the release schedules as well as the prioritized release plan of record. Collaborate with teams across the company to plan and drive software objectives for the team. In this role, you will collect requirements, help define priorities, and drive scheduling and planning for all phases of the software development lifecycle. Develop and maintain schedules, anticipate risks, and develop risk management solutions for the many moving parts that need to work in parallel. Lead and improve existing product development and software release processes, and collaborate with engineering management to refine the development workflow for maximum engineering efficiency. You will have the opportunity to partner with diverse technical groups spanning all organizational levels. Internally, you will translate customer requirements into achievable milestones and actions and ensure that customers are kept up to date on issue status. Partner with internal teams and third-party partners across different time zones as needed to help resolve customer issues. Manage customer releases. Drive process documentation. Work with customer PMs on software issues, including technical feedback from OEMs, CSPs, and partners. Improve and maintain all processes related to enterprise support.
What we need to see: Bachelor of Science in Electrical Engineering or Computer Science, or equivalent experience. 6+ years of proven experience in a similar or related role. Proven track record of delivering complex products to customers. Hands-on experience with software applications or system software/firmware/open-source development. Ability to work independently and proactively with minimal guidance. Proven ability to creatively resolve technical and resource issues. Ability to think strategically and tactically and to build consensus to make programs successful. Detailed knowledge of software engineering principles. Experience with industry-standard configuration management tools. Experience with productivity tools and process automation. Detail-oriented, with a demonstrated ability to multitask in a dynamic environment with shifting priorities and changing requirements. Strong communication and technical presentation skills.

Ways to stand out from the crowd: Proficiency in a modern programming language is highly desired. Experience with Agile tools in support of this role. Solid understanding of operating systems, graphics principles, and standards. Previous experience coordinating activities between HW and SW organizations. MBA/PMP certification or training is a plus.


Senior HPC Scheduler Engineer

NVIDIA

Santa Clara, California, United States

Posted: a year ago

What you’ll be doing: Provide engineering solutions and prototypes that enable efficient resource management and job scheduling for large-scale clusters, maintain technical relationships with internal and external engineering teams, and assist system architects and machine learning/deep learning engineers in building creative solutions based on NVIDIA technology. Be an internal reference for scheduling and resource management concepts and methodologies within the NVIDIA technical community. Test, evaluate, and benchmark new technologies and products, and work with vendors, partners, and peers to improve functionality and optimize performance.

What we need to see: 5+ years of experience designing and running scheduling and resource management systems in large datacenter/AI/HPC solutions. Knowledge of and experience with resource management/scheduling code bases: SLURM preferred, or other implementations (LSF, SGE, Torque, etc.). Proven understanding of performance clusters, infrastructure, and workload patterns. Experience installing and using Linux-based server platforms. C/Python/Bash/Lua programming and scripting experience. Experience working with the engineering or academic research community supporting HPC or deep learning. Strong teamwork and both verbal and written communication skills. Ability to multitask efficiently in a very dynamic environment. Action-driven, with strong analytical and troubleshooting skills. Desire to be involved in multiple diverse and innovative projects. BS in Engineering, Mathematics, Physics, or Computer Science, or equivalent experience. MS or PhD desirable.

Ways to stand out from the crowd: Experience with HPC cluster administration for AI. Experience deploying containerized services. Experience with orchestrators (e.g., Kubernetes). Demonstrated work with open-source software: building, debugging, patching, and contributing code. Experience tuning memory, storage, and networking settings for performance on Linux systems. Exposure to monitoring and telemetry systems.


Senior Performance Engineer

NVIDIA

Santa Clara, California, United States

Posted: a year ago

What you’ll be doing: Lead all aspects of implementing performance practices in large-scale infrastructure, and deliver powerful tools, methodologies, and flows to validate and improve several datacenter products in parallel. Accelerate strategic customer deployments and ensure speed-of-light bring-up and deployment of groundbreaking AI infrastructure by working hand in hand with customers, tailoring designs and speeding up processes to meet their needs. Specific responsibilities include owning the architecture of performance design and settings for datacenter-at-scale products, implemented in both FW and SW components, to ensure velocity and scale while using resources efficiently. This involves early engagement with internal and customer HW/FW/SW/platform teams, and other groups, to build end-to-end solutions and optimize datacenter product designs. As a key member, you will contribute to architecting the implementation of server- and rack-level telemetry, and collaborate to establish continuous improvements in our design flows. Participating in engagements with various SW and FW (BMC/SBIOS/OS/drivers, etc.) teams to develop best-in-class practices and tools, you will analyze, debug, and resolve critical firmware and software issues for the best AI workload performance at scale. Provide engineering solutions that enable large-scale performance strategies for Datacenter GPU Computing products and software stacks, maintain technical relationships with internal and external engineering teams, and assist systems engineers in building creative solutions based on NVIDIA technology. Be an internal reference for firmware and at-scale deployment of datacenter and large-scale GPU-accelerated system solutions within the NVIDIA technical community.

What we need to see: 5+ years of experience using accelerated computing for datacenter container computing solutions. Strong knowledge of accelerated computing software stacks (CUDA). Experience using and managing modern cloud and container-based enterprise computing architectures. C/C++/Python/Bash programming and scripting experience. Experience with CPU architecture. Experience with container technology and Linux-based OSes. Experience working with the engineering or academic research community supporting high-performance computing or deep learning. Strong verbal and written communication skills. Strong teamwork and social skills. Ability to multitask effectively in a dynamic environment. Action-driven, with strong analytical skills. Desire to be involved in multiple diverse and creative projects. BS in Engineering, Mathematics, Physics, or Computer Science (or equivalent experience). MS or PhD desirable.

Ways to stand out from the crowd: Deep learning framework skills. DL and graph compiler programming skills. Exposure to virtualization techniques and cloud platform solutions. Exposure to scheduling and resource management systems. Experience with high-performance or large-scale computing environments.