Explore Agent Testing Tools and Considerations
Learning Objectives
After completing this unit, you’ll be able to:
- Explain the importance of testing agents.
- Describe the tools you can use to test your agent.
- Discuss agent testing considerations and ways to address them.
Before You Start
Before you start this module, consider completing this recommended content. These modules provide a foundation of knowledge that this module will build upon.
- Trailhead: Agentforce: Agent Planning
- Trailhead: Agentforce Builder Basics
- Trailhead: The Einstein Trust Layer
Introduction
Artificial intelligence (AI) and the rise of AI agents are reshaping how we think about software development. In many organizations, the same administrators and developers who've spent years configuring and customizing Salesforce solutions are now charged with building Agentforce agents. This requires a shift in their skills, the tools they use, and their mindset. While the familiar application lifecycle management (ALM) stages of ideation, configuration, testing, deployment, and observation also apply to the agent development lifecycle (ADL), tossing generative AI into the mix can add a few unexpected twists and turns, especially when it comes to agent testing.
In this module, you learn about tools available to test and troubleshoot your agents, considerations to help you test, and testing strategies you can use to make your agent’s responses more accurate and predictable.
The Reasons to Test
If you've earned the Agentforce: Agent Planning badge, you followed along as Nora Alami at Coral Cloud Resorts planned an agent that could create and manage customer reservations. You learned about defining criteria like your audience, scope, use cases, guardrails, and the tasks the agent performs. These specifications are exactly what your testing should validate to make sure your agent's performance aligns with the work you designed it to do.
Tools to Test and Troubleshoot Your Agent
Ensuring your agent responds accurately and predictably to user input can seem daunting, especially when you consider all of the user requests your topics, actions, and guardrails need to handle. With so many variables at play, the cause of an inaccurate response, an error message, or a hallucination might be inside an instruction, an action, data, or a permission set. That's why Agentforce Studio gives you two levels of testing so you can feel confident that your agent is ready to provide reliable and predictable responses: manual testing in Agentforce Builder, and testing at scale in Testing Center.
Agentforce Builder Testing and Troubleshooting Tools
After you build your agent in Agentforce Creator, you can begin to test it in Agentforce Builder. You can try out conversations in the Conversation Preview panel to see how your agent performs. You can review the steps it took to return the response you received by looking through the details on the plan canvas. And, you can review the agent’s event logs to see specific session and conversation details.
Conversation Preview (1): It's exciting when you reach the step in Agentforce Builder where you can begin conversing with your agent in the Conversation Preview panel. Here, you can simulate conversations your users might have with your agent to see if it responds the way you intended. The responses it generates let you see whether your agent provides helpful and relevant responses, calls the right actions, references your business processes correctly, and stays within the guardrails you set.
Plan canvas (2): Each time you enter input into the Conversation Preview chat window, the panel in the center, called the plan canvas, updates to show you how the agent came up with its response. The plan canvas shows the initial input you gave, the topic the agent selected, the actions it called, and the instructions it used. You can also see the reasoning the agent used to generate the response, and any relevant data it was permitted to use to make the response more personal and accurate.
The response and the details you receive help you pinpoint where to refine your agent so its responses stay in line with your plan. You can test an input, revise your agent, and test again. Just refresh the Conversation Preview window between inputs to apply your updates.
Enhanced Event Logs
While details of your interactions in the Conversation Preview panel disappear each time you refresh your agent, Enhanced Event Logs capture and store the interactions in an agent session so you can review the flow of a conversation and improve your agent's responses. To use Enhanced Event Logs, enable the setting in Agentforce Creator on the Customize your agent screen by selecting Keep a record of conversations with Enhanced Event Logs to review agent behavior. You can also enable Enhanced Event Logs later on the Details tab in your agent's Settings.
Having access to Enhanced Event Logs comes in handy after your agent launches, because you can review the kinds of conversation exchanges your users have with it, including the input the agent was given and how it responded. This can help you locate and fix an issue or adjust your agent to handle input you hadn't anticipated. Event logs let you know whether you need to set additional guardrails or refine your instructions or actions to give more targeted responses. Agentforce Builder stores event logs for 7 days, so you can retroactively review conversation data and session activity all in one place.
Testing Center
After you’ve refined your agent’s performance in Agentforce Builder, you’re ready to batch test it in Testing Center. To access Testing Center from Setup, search for and select Testing Center in the Quick Find box. Or from Agentforce Builder, click the Batch Test button above the Conversation Preview panel.
You might be thinking, I already tested my agent in Agentforce Builder, why do I need to batch test it in Testing Center? Well, it would take a very long time to think up every way a user could pose a question or interact with your agent, and then test them one by one in the Conversation Preview window. Testing Center simplifies the process by running dozens, or even hundreds, of test scenarios at one time. For example, you can upload a .csv file of test scenarios you've written in natural language, or you can ask Testing Center to use AI to generate test input that's relevant to the jobs your agent performs.
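To picture what such a file might contain, here's a minimal Python sketch that writes a hypothetical test-scenario .csv. The file name and column names (utterance, expected_topic, expected_actions, expected_response) are illustrative assumptions, not the official Testing Center schema, so check the Testing Center documentation for the exact format your org expects.

```python
import csv

# Hypothetical test scenarios for a reservation agent, written in natural
# language. The column names below are illustrative assumptions; confirm
# the exact schema Testing Center expects before uploading.
scenarios = [
    {
        "utterance": "Can I move my reservation to next Friday?",
        "expected_topic": "Reservation Management",
        "expected_actions": "Update Reservation",
        "expected_response": "Confirms the new date and updates the booking.",
    },
    {
        "utterance": "What's your cancellation policy?",
        "expected_topic": "Reservation Management",
        "expected_actions": "Get Cancellation Policy",
        "expected_response": "Summarizes the cancellation policy accurately.",
    },
]

# Write the scenarios to a .csv file ready for upload.
with open("agent_test_scenarios.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(scenarios[0]))
    writer.writeheader()
    writer.writerows(scenarios)
```

Keeping scenarios as data like this makes it easy to grow your test suite as users surface phrasings you hadn't anticipated.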
When a batch test runs, the results show you each input that was tested, the expected and actual topics and actions, the expected response, and whether the input passed or failed. If you need more information about why a test input failed, copy and paste the input into the Agentforce Builder Conversation Preview panel and review the path the agent took on the plan canvas. This helps you further refine your instructions, which in turn improves the user experience. For detailed information about Testing Center and writing or generating test scenarios, see Agentforce: Agent Testing.
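As a rough illustration of that triage loop, this sketch pulls just the failures out of a hypothetical results export so you know which inputs to replay in the Conversation Preview panel. The file name and column names are assumptions for this sketch; your actual Testing Center results may be structured differently.

```python
import csv

# Collect the failed rows from a hypothetical batch-test results export.
# The file name and column names here are assumptions, not an official format.
with open("batch_test_results.csv", newline="") as f:
    failures = [row for row in csv.DictReader(f)
                if row["result"].strip().lower() == "fail"]

# List each failed input so it can be replayed in Conversation Preview
# and traced on the plan canvas.
for row in failures:
    print(f"Replay in Conversation Preview: {row['utterance']}")
```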
Agent Testing Considerations
In traditional application testing, you plan every detail of your application before you even begin to build it. Success is measured by producing predictable and repeatable results: it's deterministic. Your solution either works the way it's intended, or it doesn't. Developing an agent also requires planning up front, but you refine, test, and revise your agent while you build it. Agent testing is probabilistic, meaning its results can be less predictable, unique, and, well, sometimes surprising, because generative AI doesn't follow rules-based logic. The same input can generate many different yet still correct responses, incorrect responses, or even hallucinations. It's also difficult to anticipate all the ways a user might interact with your agent, so you need to account for and test a variety of scenarios as you build. This way, you minimize responses that are inaccurate or don't match your user's input.
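To see the difference in miniature, consider this illustrative Python sketch, where call_agent is a hypothetical stand-in for sending the same utterance to a generative agent. Deterministic code would let you assert one exact output; with an agent, the same input can come back several different ways, so you inspect the spread of responses instead.

```python
import random

def call_agent(utterance: str) -> str:
    """Hypothetical stand-in for a generative agent: the same input can
    produce different, yet often still correct, responses."""
    return random.choice([
        "Your reservation is confirmed for Friday.",
        "Done! You're booked for Friday.",
        "Which reservation would you like to change?",
    ])

# A deterministic test would be a single exact-match assertion.
# With probabilistic output, send the same input repeatedly and review
# the distinct responses that come back.
responses = {call_agent("Move my reservation to Friday") for _ in range(20)}
for response in sorted(responses):
    print(response)
```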
Determine When Your Agent Is Production-Ready
The probabilistic nature of agent behavior makes determining when your agent is production-ready a little unclear. Every company needs to determine its own baseline for pass/fail rates in various scenarios. There isn’t one right answer, and the level of precision desired can vary by industry. A good place to start is to consider how accurately a human would perform the same task, for example, handling reservation questions, and use that as a baseline. Then you can strive to ensure your agent meets or exceeds that level of accuracy.
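As a back-of-the-envelope sketch, assuming you've measured (or estimated) human accuracy on the same task, you could gate production readiness on whether the agent's batch-test pass rate clears that baseline. The numbers here are purely illustrative.

```python
# Illustrative readiness check against a human baseline. All values are
# hypothetical; substitute your own measurements.
HUMAN_BASELINE = 0.92      # e.g., humans handle 92% of reservation questions correctly
passed, total = 178, 200   # pass/fail counts from a Testing Center batch run

agent_pass_rate = passed / total
if agent_pass_rate >= HUMAN_BASELINE:
    print(f"Production-ready: {agent_pass_rate:.0%} meets the {HUMAN_BASELINE:.0%} baseline.")
else:
    print(f"Keep refining: {agent_pass_rate:.0%} is below the {HUMAN_BASELINE:.0%} baseline.")
```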
Always Test in a Sandbox
Testing your agents can modify your CRM data, so always use Testing Center in a sandbox environment, never in your production environment.
Use Multiple Criteria to Evaluate Your Response
Getting the responses you want from your inputs in the Conversation Preview panel will likely take some trial and error. Building an agent is an iterative process. To account for various types of input, expect to do some revising: wordsmithing, checking permissions, validating data, or adding more detail or guardrails to your instructions. The feedback you get from the plan canvas, event logs, and Testing Center helps you home in on where to refine your agent's topics, actions, or instructions so responses get closer to your desired level of accuracy.
Here are several key things to consider as you test your agent and ways to address them.
| Testing Consideration | Ways to Refine Your Agent |
| --- | --- |
| Did the agent follow my instructions? | |
| Is the response accurate, complete, and easy to read? | |
| Is the response grounded in my data? | |
| Is the response aligned with my brand voice? | |
| How long did the response take? | |
| Is there bias or toxicity in the response? | |
| Is the response reliable every time? | |
Testing Costs
One last testing consideration is the cost of running tests. Testing your agent in Testing Center can consume Flex Credits, Conversation Credits, or Einstein Requests, and it can consume Data Cloud Credits, too. These requests and credits are billable usage metrics for generative AI that incur costs for your organization. To learn more, review the Generative AI Billable Usage Types help documentation, or speak with your account executive.
Wrap Up
Agent testing requires a different way of thinking and working than traditional application testing. When you consider all of the variables that can affect your agent's responses, it's no wonder that successful agent testing is more subjective than a traditional software test. Mastering Agentforce testing tools and understanding how to mitigate the factors that affect your agent's performance can help you quickly reach your desired level of accuracy. In the next unit, you learn the importance of creating an agent testing strategy to guide your testing.