Evaluate AI Agent Value

Learning Objectives

After completing this unit, you’ll be able to:

  • Differentiate between technical failures and failed user experiences.
  • Triage agent failures using a heuristic priority model.

How to Evaluate Agent Value

In this unit, you learn how to move beyond technical benchmarks to understand how users actually perceive AI agents. You explore a three-tier failure taxonomy and learn practical design strategies to increase trust and adoption.

The Gap Between Working and Worth It

A split image: on one side, a screen shows a positive result; on the other, a person in the field is still confused by unclear directions. The contrast shows how what looks like technical success for an agent can still be a confusing experience for a user.

[AI-generated image using Google Docs Gemini.]

Success isn’t just about the model’s performance on a benchmark—it’s about the user’s perception of value and the trust they place in the tool. When a user says an agent doesn’t work, they aren’t usually talking about a 404 error or a system crash. AI interactions are nuanced. Users often lack the technical vocabulary to describe why a conversation felt “off,” so they resort to generic complaints.

Take this example: A user asks an agent, “What are the current inventory levels for our top-selling product?” The agent responds: “I don’t have access to the inventory database. You can find that information in the Supply Chain Dashboard.”

Technically? This is a successful interaction. The agent didn’t hallucinate; it correctly identified its limitations and redirected the user. But to the user? This is a failure. They didn’t get the answer they needed to complete their task. To them, the agent didn’t work.

Agent Quality Heuristics

Heuristics have always been the rules of thumb that help designers evaluate quality. But while traditional heuristics (like Nielsen Norman Group's 10 Usability Heuristics) were built to measure how well a human navigates a static interface, Agent Quality Heuristics measure how well an agent navigates a dynamic context.

In the world of AI, a heuristic isn’t just a usability check—it’s a performance standard. We move beyond simple system success (did the code run correctly?) to perceived success, ensuring the response is valuable, timely, and trustworthy enough for a user to rely on consistently.

This is essential because, to respond to end users at scale, a business needs a clear, shared definition of what a good user experience looks like and a way to assess experiences that fall short. Teams build that definition by examining indicators of an agent's success that center on the user's perspective, then mapping those indicators to the severity of their impact.

The Three Tiers of Agent Failure

Salesforce uses a failure points taxonomy to determine how disruptive a particular failure is for a user. This helps teams move away from “it’s broken” and toward “here’s exactly why the user is frustrated.”

| Severity | Tier | Description |
|---|---|---|
| P0 | Red Alert: System Failure | The highest severity. The agent crashes, times out, or provides a nonsensical hallucination that’s factually dangerous. |
| P1 | Missed the Mark: User Intent Not Met | The agent is functional, but it delivers an output misaligned with the user’s goal. It misunderstood the what or the why of the request. |
| P2 | Usable, Not Lovable: Limited User Value | The agent is functional and accurate, but the output is low-quality, too wordy, or requires the user to do more work to get the actual answer. |
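The three tiers can be read as an ordered set of checks: a system failure outranks a missed intent, which outranks limited value. The following is a minimal Python sketch of that ordering, not a Salesforce implementation. The `Observation` fields and the `classify` function are hypothetical names introduced here for illustration; they stand in for judgments a human reviewer would record about a single agent response.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    """Severity tiers, named as in the failure taxonomy."""
    P0 = "Red Alert: System Failure"
    P1 = "Missed the Mark: User Intent Not Met"
    P2 = "Usable, Not Lovable: Limited User Value"


@dataclass
class Observation:
    """Hypothetical reviewer judgments about one agent response."""
    system_ok: bool   # no crash, timeout, or dangerous hallucination
    intent_met: bool  # output aligned with the user's goal
    high_value: bool  # good quality, concise, no extra work for the user


def classify(obs: Observation) -> Optional[Tier]:
    """Return the most severe tier that applies, or None if nothing failed."""
    if not obs.system_ok:
        return Tier.P0
    if not obs.intent_met:
        return Tier.P1
    if not obs.high_value:
        return Tier.P2
    return None


# A response that runs fine but misses what the user wanted.
result = classify(Observation(system_ok=True, intent_met=False, high_value=True))
print(result.name, "-", result.value)
```

Checking the tiers from most to least severe means a response with several problems is always filed under its worst one, which matches how a priority scheme is normally used for triage.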

While P0s are usually caught in technical quality assurance, P1s and P2s are often where the risk of user frustration and drop-off lies. Though difficult to identify in traditional testing, these failures are painfully obvious to the end user. Each heuristic maps onto a severity tier, which lets designers translate the interactions they see in evaluations into a functional system of triage.

| Heuristic | Diagnostic Questions for Scoring | Severity Tier Mapping |
|---|---|---|
| Factual and Reliable | Is the response perceived as correct in the moment? Is it relevant and free of hallucinations, contradictions, and errors? Does the agent avoid contradicting previously established context or information? | P0 |
| Effective | Does the output meet the user’s actual intent, even if the system acts as designed? | P1 |
| Responsive | Does the agent ask proactive clarifying questions if the initial prompt is vague? | P1 |
| Memory and UI Context | Does it effectively use UI page context and information from prior turns to provide more relevant responses without requiring the user to repeat themselves? | P1 |
| Trusted | Does the agent operate within appropriate boundaries and authority? Does it avoid simply deflecting or stating limitations without offering actionable alternatives? | P1 |
| Teachable | Does the agent adjust based on negative user sentiment (such as, “That’s not what I meant”)? | P1 |
| Decisive | Does the agent move the user forward with clear direction and confidence, without exposing internal system complexity, being overly cautious, or creating decision paralysis? | P1 |
| Conversational | Does it use plain language and avoid being too noisy or verbose? | P2 |
| Consistent | Are the brand voice, terminology, and formatting consistent across turns? | P2 |
| Approachable | Is it inclusive, accessible (Web Content Accessibility Guidelines 2.2), and easy to interact with? | P2 |
| Helpful | Does it provide actionability and next steps rather than deflecting to self-service? | P2 |
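The heuristic-to-tier mapping above can be expressed as a simple lookup, which is one way a team might turn evaluation notes into a triage queue. This is an illustrative sketch only: the `HEURISTIC_TIER` dictionary copies the mapping from the table, while the `triage` function and its return shape are assumptions made for this example.

```python
# Heuristic -> severity tier, copied from the table above.
HEURISTIC_TIER = {
    "Factual and Reliable": "P0",
    "Effective": "P1",
    "Responsive": "P1",
    "Memory and UI Context": "P1",
    "Trusted": "P1",
    "Teachable": "P1",
    "Decisive": "P1",
    "Conversational": "P2",
    "Consistent": "P2",
    "Approachable": "P2",
    "Helpful": "P2",
}

TIER_ORDER = ("P0", "P1", "P2")  # most severe first


def triage(failed_heuristics):
    """Group failed heuristics by tier and report the most severe tier hit.

    Hypothetical helper: takes the names of heuristics a reviewer marked
    as failed for one interaction. Returns (worst_tier, grouped), where
    worst_tier is None when nothing failed.
    """
    grouped = {tier: [] for tier in TIER_ORDER}
    for name in failed_heuristics:
        grouped[HEURISTIC_TIER[name]].append(name)
    worst = next((tier for tier in TIER_ORDER if grouped[tier]), None)
    return worst, grouped


# A reviewer flags a wordy response that also deflected without alternatives.
worst, grouped = triage(["Conversational", "Trusted"])
print(worst)    # P1
print(grouped)  # {'P0': [], 'P1': ['Trusted'], 'P2': ['Conversational']}
```

Reporting the single worst tier gives a quick priority for the interaction, while the grouped lists preserve the detail a designer needs to explain why the user was frustrated.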

By understanding the factors that influence a user’s subjective experience with an agent, designers can plan for an agent’s initial success and ongoing improvement. Heuristics provide a shared standard for assessing agent behavior, and mapping agent failures to easily understood priority tiers makes it easier to identify and intervene in moments of friction. In the next unit, you explore how designers can apply insights from heuristic evaluations to triage agent failures.
