
Consider Unstructured Data

Learning Objectives

After completing this unit, you’ll be able to:

  • Identify risks and limitations when using unstructured data to power AI agents.
  • Explain how grounding, indexing, and retrieval reduce hallucination risk.
  • Recognize common data reliability issues in unstructured content sources.

Start with Unstructured Data Sources

Luna knows that much of the data used to answer customer questions doesn’t live in traditional structured fields. Return policies, shipping rules, store hours, and warranty details often exist in what’s called unstructured data sources, like FAQs, PDFs, chat logs, emails, and knowledge articles. Generative AI can access this unstructured data, but only when it's properly prepared.

Illustration showing six types of unstructured data sources arranged in a grid with icons: Text Files & Documents, Multimedia Content, Social Media Data, Mobile and Communications Data, Machine and Sensor Data, and Historical Archives.

When Luna begins analyzing why the case deflection agent sometimes provides wrong responses, she first evaluates the unstructured data sources that the agent uses.

Unlike structured data, which can be profiled and queried, unstructured content requires special consideration. Agents must be grounded in accurate content and able to retrieve the right information in real time. Otherwise, they risk hallucinating or answering with missing, outdated, or irrelevant information.

Many enterprise AI solutions, including Agentforce, use a retrieval approach often called retrieval augmented generation (RAG). Instead of relying solely on what the model was trained on, the agent searches unstructured content available in enterprise documents, such as knowledge articles, policies, emails, and transcripts, and retrieves relevant content before generating a response.

This approach reduces hallucinations by grounding answers in trusted data. However, it also introduces new reliability risks. If the wrong documents are indexed, outdated content is retrieved, or documents are poorly structured, the agent can still provide wrong answers even though it is technically working as designed.
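The retrieve-then-generate flow can be sketched with a toy keyword retriever. This is a minimal illustration, not the Agentforce implementation: the `retrieve` and `answer` helpers, the sample documents, and the word-overlap scoring are all invented for this example; production RAG systems use vector search over an index and a large language model for generation.

```python
# Minimal sketch of retrieval augmented generation (RAG), using a toy
# keyword-overlap retriever. All names and documents here are illustrative.

def retrieve(question, documents, top_k=1):
    """Score each document by word overlap with the question; return the best matches."""
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def answer(question, documents):
    """Ground the response in retrieved content instead of model memory alone."""
    context = retrieve(question, documents)
    if not context:
        return "No approved content found; escalate to a human agent."
    return f"Based on our records: {context[0]}"

docs = [
    "Return policy: items may be returned within 30 days with a receipt.",
    "Shipping rules: standard shipping takes 5-7 business days.",
]
print(answer("What is the return policy?", docs))
```

Note that the fallback branch matters as much as the happy path: when nothing relevant is retrieved, a grounded agent should decline rather than let the model improvise.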

Luna summarizes the reliability challenge with a simple model. For unstructured data to support reliable AI responses:

  • The right documents must exist.
  • Integration processes must ensure that the latest approved version is processed.
  • The documents must be correctly indexed and searchable.
  • The agent must retrieve the correct section of those documents.

If any of these steps fail, the agent can generate unreliable responses.
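Luna's four steps can be sketched as a simple gate that reports the first failure. The readiness flags (`exists`, `is_latest_approved`, `is_indexed`, `retrievable_sections`) are hypothetical field names invented for this sketch, not part of any real schema.

```python
# Sketch of the four-step reliability check. The dictionary keys are
# assumptions for illustration, not fields from a real content system.

def reliability_gap(doc):
    """Return the first failed step, or None if the document can support reliable answers."""
    checks = [
        ("missing document", doc.get("exists", False)),
        ("stale or unapproved version", doc.get("is_latest_approved", False)),
        ("not indexed or searchable", doc.get("is_indexed", False)),
        ("no retrievable sections", bool(doc.get("retrievable_sections"))),
    ]
    for failure, passed in checks:
        if not passed:
            return failure
    return None

policy = {"exists": True, "is_latest_approved": False}
print(reliability_gap(policy))  # -> stale or unapproved version
```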

Key Risks with Unstructured Data

Luna uses the following checklist to evaluate whether unstructured content is ready for AI.

Source trust and relevance

  Considerations:
  • Is the source the official system of record?
  • Is the data current, complete, and aligned to approved policies?

  Example risks:
  • The return policy used by the agent comes from an old PDF on a shared drive, not from the repository accessible to end consumers.
  • An internal email is used as a source even though it isn’t customer-facing or approved.

Context and post-processing transparency

  Unstructured content is sometimes shortened or altered before agents use it. If you don’t understand how it was changed, the content can lose important meaning.

  Considerations:
  • Was the content AI-generated or human-authored?
  • Can you trace and explain any summarization, redaction, or transformation of the content?
  • Compare sentiment analysis of transcripts with customers’ feedback responses to validate CSAT (customer satisfaction) score assertions.

  Example risks:
  • A long call transcript is automatically summarized, but the summary omits important problem details the agent needs.

Retrieval gaps

  Considerations:
  • Can the content be effectively indexed and retrieved when the agent is asked a related question?
  • Are there document formats (like embedded PDFs or images) that limit search and retrieval?

  Example risks:
  • Audio and video files can contain relevant context, such as tone or body language, that text-only transcripts lose. These files are also often not retrievable without specialized processing, like content tagging.
  • Even when relevant documents exist, retrieval fails when the system can’t match a user’s question to the correct section of content.

Missing contextual metadata

  Considerations:
  • Does the content include date and version metadata so the latest version is used?
  • Do transcripts include participant and timestamp data?

  Example risks:
  • A return policy article has no publish date, so the agent can’t determine which version applies to the customer’s order date.

Grounding and verification

  Grounding reduces hallucinations by connecting agents to approved, authoritative content.

  Considerations:
  • Is the agent grounded to a curated index of trusted documents?
  • Are content citations returned to users so they can verify the answer?

  Example risks:
  • The agent references an internal draft instead of an approved policy.
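Several of these checks, especially source trust and contextual metadata, can be automated before content reaches the agent's index. Below is a minimal sketch; the `status` and `published` fields, the sample articles, and the approval rule are assumptions for illustration.

```python
# Sketch: curate an index down to trusted, dated content before grounding.
# Field names ("status", "published") are hypothetical, not a real schema.
from datetime import date

def curate(articles, today=date(2025, 1, 1)):
    """Keep only approved articles that carry a publish date no later than today."""
    return [
        a for a in articles
        if a.get("status") == "approved"
        and a.get("published") is not None
        and a["published"] <= today
    ]

index = [
    {"title": "Return policy v3", "status": "approved", "published": date(2024, 6, 1)},
    {"title": "Return policy draft", "status": "draft", "published": None},
    {"title": "Old PDF on shared drive", "status": "unknown", "published": date(2019, 2, 1)},
]
print([a["title"] for a in curate(index)])
```

Articles that fail the filter aren't necessarily deleted; they're simply excluded from the index the agent is grounded to, which is what prevents drafts and stale copies from surfacing in answers.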

Case Deflection Agent Risks

During the proof‑of‑value project, Luna identifies several issues related to unstructured data.

Outdated or unapproved content

  Example: Knowledge articles were manually uploaded instead of synchronized from the content management system, so older policies and draft documents were indexed alongside approved versions.

  Impact: The agent sometimes referenced outdated return or replacement policies.

Fragmented customer interaction history

  Example: Customer conversations existed across legacy chatbots, email systems, and external call-center transcripts.

  Impact: Without associating these interactions with the unified customer profile, the agent lacked full context when answering questions.

Unstructured operational data

  Example: Delivery systems stored proof-of-delivery photos, but location metadata was embedded in the image files.

  Impact: Without extracting and comparing geolocation information, NTO could not automatically identify incorrect deliveries.

These examples show how unstructured data often contains valuable information that is inaccessible to AI agents until it is properly indexed, associated with customer context, or transformed into a usable format.
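The proof-of-delivery case above can be illustrated with a small geolocation comparison. This sketch assumes the photo's geotag has already been extracted from its EXIF metadata; the coordinates and the 100-meter threshold are invented for illustration.

```python
# Sketch: flag a delivery as incorrect when the photo's geotag is too far
# from the expected delivery coordinates. Inputs here are illustrative.
import math

def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters (haversine formula)."""
    r = 6_371_000  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def delivery_ok(photo_geotag, address_coords, threshold_m=100):
    """True when the photo was taken within threshold_m of the delivery address."""
    return distance_m(*photo_geotag, *address_coords) <= threshold_m

print(delivery_ok((37.7749, -122.4194), (37.7750, -122.4195)))  # nearby -> True
```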

By reviewing sources, grounding content, and ensuring accurate retrieval, Luna helps NTO’s Case Deflection agents give consistently trustworthy answers.

Unstructured Data and the AI Reliability Framework

Unstructured data is often the first area architects evaluate in agent implementations. Many early AI projects begin with knowledge articles or policy documents because they provide quick business value.

However, reliable AI responses depend on ensuring that these documents are current and approved, properly indexed, and connected to the relevant customer context.

In the next unit, Luna investigates another key reliability factor: structured data used by the agent to understand customers, transactions, and operational records.
