Description
As an Associate Evaluations Manager on the Digital Success Data & AI team, you will help measure, monitor, and improve the performance of Agentforce on Salesforce Help. You’ll run ongoing synthetic evaluations across Answer Quality, latency, instruction adherence, and other core capabilities, translating results into clear, actionable insights that inform product decisions, operational improvements, and leadership reporting.
You will contribute to operational excellence by building clear documentation, repeatable processes, and scalable evaluation frameworks that increase confidence in agent performance. This role partners closely with Support Delivery, Engineering, Operations, and Data Science to evaluate new features, identify risks and opportunities, and ensure quality at launch.
You will also help evolve our evaluation ecosystem, refining LLM-judge prompts and ground-truth datasets, while identifying opportunities to increase automation, reliability, and efficiency. Success in this role requires strong analytical rigor, curiosity, and ownership: someone comfortable working in ambiguity, solving problems end-to-end, and consistently delivering clear, defensible insights that drive action.
Your Impact
Support the Agentforce baselining program, using synthetic evaluations and automated tooling to continuously measure and improve performance.
Analyze evaluation results independently, identifying root causes, surfacing trends, and translating insights into actionable recommendations for models, implementations, and processes.
Maintain and evolve evaluation frameworks, scoring rubrics, and guidelines to ensure consistent, defensible, and scalable assessments.
Deliver clear, influential reporting and business reviews that inform stakeholders and drive product and operational decisions.
Define, monitor, and interpret key evaluation metrics, proactively identifying risks, regressions, and improvement opportunities.
Enable internal partners on evaluation processes and findings, building trust and shared understanding across teams.
Strengthen the evaluation feedback loop across automated testing, LLM-judge prompts, and golden datasets to continuously improve testing sophistication.
Perform targeted evaluations for new features and urgent initiatives, ensuring quality and market readiness.
Audit and refine the utterance repository to keep testing relevant, high quality, and aligned with evolving product capabilities.
Synthesize customer and internal feedback into actionable insights, helping shape product direction and operational improvements.
Advocate for tooling, process, and workflow improvements that increase evaluation efficiency, scalability, and reliability.
Proactively surface risks and partner on mitigations, ensuring issues are addressed before they impact customers.
Required Skills
1+ years of professional experience working in Salesforce environments (program, analyst, operations, or product context), with a demonstrated ability to take ownership of tasks and drive outcomes independently.
Strong analytical mindset: comfortable reviewing conversational AI outputs, identifying failure patterns, conducting root cause analysis, and translating findings into actionable recommendations.
Operational rigor and attention to detail: able to execute repeatable evaluation workflows accurately and consistently in a fast-paced, ambiguous environment.
Clear written communication skills: able to document findings and processes, and communicate insights concisely for cross-functional audiences.
Comfort working with data: proficiency in spreadsheets (e.g., Google Sheets), reporting, and basic dashboard interpretation to derive insights and track trends.
High reading comprehension and critical thinking: able to evaluate nuanced generative AI responses against quality standards and expected behaviors.
Tool fluency: ability to work confidently in Salesforce reporting environments (Agentforce, Tableau, Testing Center, Observability) or quickly ramp on similar tools.
Curiosity and learning agility: resourceful in exploring new tools, understanding evolving AI behaviors, and continuously improving evaluation approaches.
Execution reliability: responsive, accountable, and dependable in delivering accurate outputs and supporting operational needs.
Preferred Skills
Experience evaluating AI-generated content or conversational systems
Familiarity with prompt engineering, acceptance criteria development, or labeling workflows
Experience supporting operational analytics, UAT, QA, or product evaluation programs
Experience documenting processes or enabling others through training materials
How We Work
Clarity: We optimize for being understood, not sounding smart. Executive-ready summaries, clean narratives, no overcomplication.
Operational rigor & integrity: Our work must be accurate, defensible, and trusted. Show your work — assumptions, data sources, and logic matter.
Curious bias for action: We ask “why” in service of forward progress. Insight only matters if it leads to better outcomes.
Customer Zero mindset: We use our own products first to surface issues before customers do — test, break, learn, improve.
Mutual support & collaboration: We operate as a team — sharing context, helping each other unblock work, and maintaining strong operational awareness.
