Triage Agent Responses
Learning Objectives
After completing this unit, you’ll be able to:
- Prioritize agent behavior design based on heuristic scores.
- Establish a continuous loop of evaluating and refining agent behavior.
Tackle Trouble with Triage
Agent failures are not always obvious. Unlike traditional digital experiences, agents may still produce responses while failing to resolve the user’s issue. Weak prompts, missing data, or unclear intent can all lead to answers that sound reasonable but do not help the user succeed. Experience heuristics help teams detect these subtle failures and prioritize what to fix.
Heuristic evaluation and severity mapping allow designers to understand where behavior patterns lose value, and how much value is at risk when a failure occurs. That way, they can prioritize interventions where they have the highest impact on real experiences.
This is another key shift for designers. While focusing effort where it’s needed most is nothing new, agents require a continuous process of evaluation and improvement that accounts for more than bugs and blockers. Heuristics help designers identify gaps between user expectations and the reality of agent logs. Severity mapping helps designers identify the most urgent gaps to close, and explain their process in language that other designers can understand.
Evaluations Make or Break Experiences
Evaluating an agent’s performance is an essential part of maintaining that agent’s ground truth. That is to say, ensuring that an agent conforms to the designer’s definition of “good” behavior. While agents can learn from interactions themselves, it’s design interventions that truly shape agent behavior over time. As designers step into the role of defining what failure and success look like in an agent interaction, they need to be able to adjust to unexpected or ineffective patterns of behavior.
Where a designer’s role in the past might involve assessing a technical pass or fail, the agentic designer needs to take full advantage of the rich insights in agent logs. These logs help designers see the flow of a conversation, including any points of frustration or failure where an agent could have acted differently. When examining logs and assessing heuristic factors like trust, approachability, and correctness, evaluators center on two key ideas:
- All evaluations start with and flow from user intent.
- Assessments should prioritize failure severity, not just failure rates.
With this in mind, designers can focus their effort on what matters most: delivering tangible results for users, and correcting the most impactful failures first. Let’s take a look at what the evaluation itself entails.

Apply Heuristics to Assess Agents
Using the Salesforce Lightning Design System agent heuristics and failure taxonomy, here’s how a designer might go about scoring an agent interaction log, and how that score can inform the designer’s next steps.
First, an evaluator should keep a few points in mind when scoring.
- Start with user goal: Everything flows from this.
- Evidence-based: Quote specifics from the log; don’t guess.
- Cascading failures matter: One root failure can lead to other subsequent failures.
- Look for contradictions: Does passing one heuristic, like approachability, conflict with failing another, like factuality?
- Politeness is not a pass: Focus on task value.
- Consistency: Apply the same standards across heuristics.
- All Pass/Fail/N/A designations require reasoning: Always include turn numbers, observed behavior, and impact.
- Early successes don’t excuse later failures: If the agent succeeds in turn 1 but fails the same heuristic in turn 5, that’s still a fail.
- For passes, explain what went right: What specific behavior demonstrated success?
- For fails, explain what went wrong: What should the agent have done instead? How did this impact the user experience?
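One way to internalize the checklist above is to treat each heuristic judgment as a structured record: every Pass, Fail, or N/A designation carries its turn numbers, the observed behavior, and the user impact, and every failure carries a severity tier. The sketch below is purely illustrative; the class and field names are assumptions for this example, not part of the Salesforce Lightning Design System taxonomy.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Designation(Enum):
    PASS = "pass"
    FAIL = "fail"
    NA = "n/a"    # heuristic didn't apply in this conversation

class Severity(Enum):
    P0 = 0  # critical system failure
    P1 = 1  # user intent not met
    P2 = 2  # limited value

@dataclass
class HeuristicAssessment:
    heuristic: str                  # e.g. "trust", "approachability", "correctness"
    designation: Designation
    turns: list[int]                # turn numbers where the behavior was observed
    observed_behavior: str          # what the agent actually did
    impact: str                     # effect on the user experience
    severity: Optional[Severity] = None  # required when designation is FAIL

# Example record: a fail must name its turns, behavior, impact, and tier.
assessment = HeuristicAssessment(
    heuristic="correctness",
    designation=Designation.FAIL,
    turns=[5],
    observed_behavior="Cited a return policy that doesn't exist",
    impact="User acted on incorrect information",
    severity=Severity.P1,
)
```

Making the fields required (rather than free-form notes) is one way to enforce the rule that every designation comes with reasoning.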
Next, the evaluator’s goal is to understand the context of the conversation itself. This is done by answering the following questions, as briefly and directly as possible.
- What did the user want to accomplish?
- What did the agent deliver?
- Did the user achieve their goal? (Yes, Partially, or No)
- Where was value lost? (Focus on turns and describe the outcome.)
With a grounding in the user’s intent and a good understanding of what happened over the course of the interaction, the evaluator can go on to assess each heuristic and assign a pass or fail. As a reminder, see the heuristics table in the previous unit, which explains the severity tiers associated with P0, P1, and P2 incidents.
For each heuristic, the designer chooses pass, fail, or N/A (in cases where the heuristic might not come up or apply in a particular conversation). These assessments also note the turns, the observed behavior, and the impact to the user. Once all heuristics are scored, the evaluator can move on to a final score for the conversation. This final score isn’t based on averages, because a single major failure can have a massive impact on the user experience. Instead, a score for the conversation is assigned based on the highest severity tier observed.
| Result | Final Score |
|---|---|
| Any heuristic failed with P0 tier. | P0 Critical System Failure |
| No P0 failures, but any heuristic failed with P1 tier. | P1 User Intent Not Met |
| No P0 or P1 failures, but any heuristic failed with P2 tier. | P2 Limited Value |
| All heuristics pass. | Pass |
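The “worst tier wins” rule in this table can be sketched as a small function: the conversation score is the highest-severity failure observed, never an average, so a single P0 outweighs any number of passes. This is an illustrative sketch, not Salesforce tooling; the function and variable names are assumptions.

```python
# Severity tiers ordered worst-first; a conversation scores "Pass"
# only when no heuristic failed at any tier.
SEVERITY_ORDER = ["P0", "P1", "P2"]
LABELS = {
    "P0": "P0 Critical System Failure",
    "P1": "P1 User Intent Not Met",
    "P2": "P2 Limited Value",
}

def final_score(heuristic_results):
    """heuristic_results: list of (designation, severity) pairs, where
    designation is "pass", "fail", or "n/a", and severity is "P0", "P1",
    or "P2" for failures. Returns the label for the worst tier observed."""
    failed_tiers = {sev for desig, sev in heuristic_results if desig == "fail"}
    for tier in SEVERITY_ORDER:          # check worst tier first
        if tier in failed_tiers:
            return LABELS[tier]
    return "Pass"

# Example: one P1 failure dominates several passes and a lesser P2 failure.
results = [("pass", None), ("fail", "P1"), ("fail", "P2"), ("n/a", None)]
print(final_score(results))  # → "P1 User Intent Not Met"
```

Scoring by maximum severity rather than by average mirrors the reasoning in the text: averaging would let many minor passes mask one failure that, on its own, destroyed the value of the conversation.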
Finally, once a log is scored, the evaluator documents their findings and next steps. That includes noting:
- Primary heuristic(s) driving the score
- Root failure (what went wrong first)
- Cascading failures
- User impact (value lost)
- Corrective actions
At the end of the evaluation process, designers have a much better understanding of both what happened in a conversation and the right priority for next actions. For agents, getting the best outcomes is about applying insightful design interventions, not just having a good set of initial rules. By using heuristics and severity tiers, designers center agent behavior on real success for real people, defined by aligning what agents do with what users need.
Evaluating AI agents isn’t just about spotting what’s broken. It’s about understanding what matters most to the user and prioritizing improvements accordingly. With a structured approach to triage and refinement, you can continuously shape agent behavior to deliver meaningful, reliable experiences.