Description
Role Description:
As a Principal Engineer on the Agentforce Deployment Platform team, you will own the end-to-end architecture, strategy, and execution of our AI/ML deployment and operationalization systems. You’ll collaborate closely with software engineers, data scientists, product managers, and data teams to turn cutting-edge architecture and research into scalable, highly available, and compliant production systems.
You are not just a coder; you are a thought leader, innovator, builder, and mentor who thrives on ownership and on pushing boundaries in production MLOps, AI infrastructure, and reliable delivery in a rapidly changing, cutting-edge space.
Key Responsibilities:
* Lead the architectural vision for our global-scale ML serving, inference, and model management platform.
* Design and optimize low-latency, high-throughput model serving infrastructure and data flow for training and inference at scale.
* Strategize and implement an AI-assisted migration platform that provides proactive governance and reactive autonomous remediation by enforcing policies at every stage of the deployment lifecycle.
* Work with product and business teams to translate user needs into technical requirements, focusing on platform capabilities for rapid iteration and secure deployment.
* Set long-term technical strategy and direction, serving as a top-tier technical mentor for engineers across teams.
* Drive adoption of cutting-edge MLOps best practices for model training, secure and automated deployment, proactive monitoring, and robust governance.
* Innovate not just in model building, but in how models are packaged, delivered, and operated in a mission-critical environment.
* Make strategic technical decisions on build vs. buy, model selection, and core platform infrastructure to ensure scalability and cost-efficiency.
Required Skills:
* 15+ years of software engineering experience; 7+ years building and operating AI/ML systems at scale.
* Demonstrable Principal-level impact and ownership on large-scale engineering initiatives.
* Expertise in at least one object-oriented programming language (Java/C++/Go) and one ML-native language (Python).
* Strong experience in Applied AI, specifically focusing on the infrastructure and platform services required to operationalize deployment vehicles effectively.
* Deep experience with high-scale ML serving frameworks (e.g., TorchServe, TensorFlow Serving, NVIDIA Triton).
* Familiarity with LLMs, vector databases, and applied generative AI deployment patterns (e.g., containerization, traffic management, and cost optimization of RAG pipelines).
* Deep mastery of system design, distributed systems, and cloud-native architectures (AWS/GCP, Kubernetes, Service Mesh).
* Exceptional track record in building and scaling ML serving pipelines, real-time inference systems, and API platforms.
* Proven ability to influence and drive technical consensus across cross-functional teams and mentor senior engineers.
* Strong communication and collaboration skills across technical and non-technical teams.
* Ability to translate complex AI concepts into pragmatic and compliant engineering decisions.
* Experience in startups or high-growth tech companies.
Preferred Skills:
* Contributions to open-source AI/ML infrastructure or MLOps projects.
* Patents, papers, blogs, or other external publications related to large-scale ML deployment, observability, or governance.
* Strong platform and product-centric mindset, demonstrated by high-leverage infrastructure projects.
