Evaluate AI Agent Value

Learning Objectives

After completing this unit, you’ll be able to:

Differentiate between technical failures and failed user experiences.
Triage agent failures using a heuristic priority model.

How to Evaluate Agent Value

In this unit, you learn how to move beyond technical benchmarks to understand how users actually perceive AI agents. You explore a three-tier failure taxonomy and learn practical design strategies to increase trust and adoption.

The Gap Between Working and Worth It

A screen shows a positive result on one side of a split image. On the other side, a person in the field is still confused by unclear directions. This visualizes how what looks like technical success to an agent might still be a confusing experience for a user.

[AI-generated image using Google Docs Gemini.]

Success isn’t just about the model’s performance on a benchmark—it’s about the user’s perception of value and the trust they place in the tool. When a user says an agent doesn’t work, they aren’t usually talking about a 404 error or a system crash. AI interactions are nuanced. Users often lack the technical vocabulary to describe why a conversation felt “off,” so they resort to generic complaints.

Take this example: A user asks an agent, “What are the current inventory levels for our top-selling product?” The agent responds: “I don’t have access to the inventory database. You can find that information in the Supply Chain Dashboard.”

Technically? This is a successful interaction. The agent didn’t hallucinate; it correctly identified its limitations and redirected the user. But to the user? This is a failure. They didn’t get the answer they needed to complete their task. To them, the agent didn’t work.

Agent Quality Heuristics

Heuristics have always been the rules of thumb that help designers evaluate quality. But while traditional heuristics (like Nielsen Norman Group’s 10 Usability Heuristics) were built to measure how well a human navigates a static interface, Agent Quality Heuristics measure how well an agent navigates a dynamic context.

In the world of AI, a heuristic isn’t just a usability check—it’s a performance standard. We move beyond simple system success (did the code run correctly?) to perceived success, ensuring the response is valuable, timely, and trustworthy enough for a user to rely on consistently.

This matters because, to respond to end users at scale, a business needs a clear, shared definition of what a good user experience looks like and a way to assess experiences that fall short. Teams get there by examining indicators of an agent’s success that center on the user’s perspective, then mapping each indicator to the severity of its impact.

The Three Tiers of Agent Failure

Salesforce uses a failure points taxonomy to determine how disruptive a particular failure is for a user. This helps teams move away from “it’s broken” and toward “here’s exactly why the user is frustrated.”

Severity	Tier	Description
P0	Red Alert: System Failure	The highest severity. The agent crashes, times out, or provides a nonsensical hallucination that’s factually dangerous.
P1	Missed the Mark: User Intent Not Met	The agent is functional, but it delivers an output misaligned with the user’s goal. It misunderstood the what or the why of the request.
P2	Usable, Not Lovable: Limited User Value	The agent is functional and accurate, but the output is low-quality, too wordy, or requires the user to do more work to get the actual answer.

While P0s are usually caught in technical quality assurance, P1s and P2s are often where the risk of user frustration and drop-off lies. Though difficult to identify in traditional testing, these failures are painfully obvious to the end user. Each heuristic maps onto a severity tier, which lets designers translate the interactions they see in evaluations into a functional system of triage.

Heuristic	Diagnostic Questions for Scoring	Severity Tier Mapping
Factual and Reliable	Is the response perceived as correct in the moment? Is it relevant and free of hallucinations, contradictions, and errors? Does the agent avoid contradicting previously established context or information?	P0
Effective	Does the output meet the user’s actual intent, even if the system acts as designed?	P1
Responsive	Does the agent ask proactive clarifying questions if the initial prompt is vague?	P1
Memory and UI Context	Does it effectively use UI page context and information from prior turns to provide more relevant responses without requiring the user to repeat themselves?	P1
Trusted	Does the agent operate within appropriate boundaries and authority? Does it avoid simply deflecting or stating limitations without offering actionable alternatives?	P1
Teachable	Does the agent adjust based on negative user sentiment (such as, “That’s not what I meant”)?	P1
Decisive	Does the agent move the user forward with clear direction and confidence, without exposing internal system complexity, being overly cautious, or creating decision paralysis?	P1
Conversational	Does it use plain language and avoid being too noisy or verbose?	P2
Consistent	Are the brand voice, terminology, and formatting consistent across turns?	P2
Approachable	Is it inclusive, accessible (Web Content Accessibility Guidelines 2.2), and easy to interact with?	P2
Helpful	Does it provide actionability and next steps rather than deflecting to self-service?	P2
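
The mapping in the table above can be sketched as a small lookup for triage. This is a minimal illustration, not a Salesforce tool or API: the names Severity, HEURISTIC_SEVERITY, and triage are hypothetical. It also assumes that an interaction failing several heuristics is triaged at the most severe tier among them, which is one reasonable reading of the priority model rather than a rule stated in this unit.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Failure tiers from the taxonomy; a lower number is more severe."""
    P0 = 0  # Red Alert: System Failure
    P1 = 1  # Missed the Mark: User Intent Not Met
    P2 = 2  # Usable, Not Lovable: Limited User Value


# Heuristic-to-tier mapping, copied from the table above.
HEURISTIC_SEVERITY = {
    "Factual and Reliable": Severity.P0,
    "Effective": Severity.P1,
    "Responsive": Severity.P1,
    "Memory and UI Context": Severity.P1,
    "Trusted": Severity.P1,
    "Teachable": Severity.P1,
    "Decisive": Severity.P1,
    "Conversational": Severity.P2,
    "Consistent": Severity.P2,
    "Approachable": Severity.P2,
    "Helpful": Severity.P2,
}


def triage(failed_heuristics):
    """Return the most severe tier among the failed heuristics.

    Returns None when no heuristic failed. Taking the most severe tier
    is an assumption made for this sketch.
    """
    tiers = [HEURISTIC_SEVERITY[name] for name in failed_heuristics]
    return min(tiers) if tiers else None


# Hypothetical scoring of the inventory example: an evaluator marks
# "Effective" (intent not met) and "Helpful" (deflected to self-service).
result = triage(["Effective", "Helpful"])
print(result.name if result is not None else "no failure")
```

Under that hypothetical scoring, the inventory example lands in the P1 tier. The agent behaved as designed, yet the user’s intent was not met.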

By understanding the factors that influence a user’s subjective experience with an agent, designers can plan for an agent’s initial success and ongoing improvement. Heuristics provide a shared standard for assessing agent behavior, and mapping agent failures to easily understood priority tiers makes it easier to identify and intervene in moments of friction. In the next unit, you explore how designers can apply insights from heuristic evaluations to triage agent failures.

Resources

Salesforce: Your AI Agent Works, But Do Your Users Think It’s Worth It?
