Assess Unstructured Data Readiness for AI

Learning Objectives

After completing this unit, you’ll be able to:

  • Identify common reliability risks in unstructured data sources.
  • Explain how grounding and data retrieval impact AI accuracy.
  • Evaluate whether unstructured content is ready for agentic use.

Activate Unstructured Data

Luna knows that much of the information NTO uses to answer customer questions comes from unstructured content (not structured CRM fields). Return policies, shipping rules, warranty terms, chat transcripts, email threads, and other documents contain the essential information that customer support needs to respond to customer inquiries. Generative AI can now analyze this type of data, unlocking insights that were previously buried in inaccessible unstructured documents.

Content being available is not the same as content being reliable, though. If a case, transcript, or document is not properly linked to the correct customer, order, or transaction, the AI agent might retrieve it without context. Before declaring unstructured data “AI-ready,” Luna must confirm that the content is:

  • Accurate
  • Current
  • Linked to the right records

But What Exactly Is Unstructured Data?

Unstructured data is information that doesn’t follow a consistent structure with defined fields. It typically exists in free-text or non-text media formats and requires additional processing to extract meaning before it can be analyzed or used by AI. Examples include PDF files, images, audio and video files, and email messages.

[Figure: Examples of unstructured data, including text files and documents, multimedia content, social media data, mobile and communications data, machine and sensor data, and historical archives.]

The unstructured data that NTO’s Case Deflection Agent needs comes from case comments, email threads, chat transcripts, knowledge articles, and photos of proof of delivery.

Some of those unstructured documents might contain structured content that can provide important metadata to help AI agents have proper context. For example:

  • Meeting transcripts might be linked to meeting invites, with a list of invitees and a separate list of attendees. These lists are structured data that follow distinct formats, and they can be used to attribute each remark in the transcript to a speaker and a time. This level of specificity can be important for determining context for the AI agent.
  • Digital photographs might include a timestamp, geolocation data, and the type of equipment used to create the image.
  • PDF documents might contain explicit sections that capture structured data, such as name, email, title, or company name associated with eSignatures and when a signature was captured.

Important: Some document properties can be lost when a document is exported or duplicated. For example, exporting a meeting transcript as a PDF to a shared drive can remove participation details, whereas extracting the transcript directly from the meeting platform preserves the interaction details.
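
As an illustration of structured metadata living inside unstructured content, here is a minimal sketch that pulls sender, recipients, and timestamp out of a raw email using only the Python standard library. The sample message, the addresses, and the function name `extract_email_metadata` are assumptions made for this example; they are not part of NTO's systems or any Salesforce API.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical raw email, standing in for one message from a support thread.
RAW_EMAIL = """\
From: Customer Example <customer@example.com>
To: support@example.com
Cc: agent@example.com
Date: Mon, 03 Mar 2025 14:05:00 +0000
Subject: Where is my order?
Message-ID: <abc123@example.com>

Hi, my order has not arrived yet. Can you help?
"""


def extract_email_metadata(raw: str) -> dict:
    """Return the structured header values and the free-text body of an email."""
    msg = message_from_string(raw)
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    return {
        # Structured values that give the free-text body its context.
        "sender": getaddresses([msg.get("From", "")])[0][1],
        "recipients": [addr for _, addr in recipients],
        "sent_at": parsedate_to_datetime(msg["Date"]),
        "subject": msg.get("Subject", ""),
        "message_id": msg.get("Message-ID", ""),
        # The unstructured part that an agent would read.
        "body": msg.get_payload().strip(),
    }


if __name__ == "__main__":
    metadata = extract_email_metadata(RAW_EMAIL)
    for key, value in metadata.items():
        print(f"{key}: {value}")
```

Keeping these values alongside the body is what allows a transcript or email to be traced back to the right participants later. A copy of the body text alone would lose them.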

Unstructured Data Types and Considerations

As Luna begins evaluating NTO’s unstructured data, she assesses different types of unstructured content to ensure their relevance and reliability.

| Content Type | Description | Example | Data Reliability Considerations |
| --- | --- | --- | --- |
| Knowledge content | Curated informational content used by users or AI agents. | Knowledge articles, policies, FAQs | Verify with content owners or subject matter experts (SMEs) that the content is current, correct, and accurately classified. Ensure new, approved versions of the content will update prior versions. |
| Interaction content | Conversational content tied to customer or operational interactions. | Chat logs, meeting transcripts | Ensure content is associated with participants (for example, accounts, employees, customers), products, and transactional records (for example, cases, orders). |
| Business documents | Formal or semiformal business documents that are stored as files. | Business contracts, invoices | Verify document content is current. Extract and make accessible structured elements (for example, invoice date, amount, account number) when needed for consistent AI use. |
| Other media assets | Non-text content that might require transcription, tagging, or transformation. | Images, audio files, scanned documents | Ensure the content is searchable and AI-ready. Enforce privacy and access controls. |

Start with Unstructured Data Discovery

Historically, structured data powered most enterprise systems, while up to 90% of enterprise data in unstructured sources went underutilized. Generative AI changes this dynamic. Because generative AI can interpret text content, teams might be tempted to rush to upload email, PDFs, transcripts, and knowledge articles to “make everything AI-ready.”

But without clear version control, reliable attribution, and strong record associations, simply providing agents with unstructured content can introduce risk. Agents might end up relying on outdated policies, misattributed conversations, or incomplete documents. This can result in incorrect responses, compliance concerns, or inconsistent behavior.
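
To make the version-control risk concrete, the following sketch filters a set of article versions down to the latest approved version of each article, so that drafts and superseded versions never reach the agent. The `ArticleVersion` record, its `status` values, and the function name are hypothetical; a real knowledge base would expose its own versioning model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArticleVersion:
    article_id: str
    version: int
    status: str  # hypothetical values: "draft", "approved", "archived"
    body: str


def current_approved(versions: list[ArticleVersion]) -> dict[str, ArticleVersion]:
    """Keep only the highest-numbered approved version of each article."""
    latest: dict[str, ArticleVersion] = {}
    for v in versions:
        if v.status != "approved":
            continue  # drafts and archived versions are never exposed
        best = latest.get(v.article_id)
        if best is None or v.version > best.version:
            latest[v.article_id] = v
    return latest


if __name__ == "__main__":
    versions = [
        ArticleVersion("return-policy", 1, "approved", "Returns within 14 days."),
        ArticleVersion("return-policy", 2, "approved", "Returns within 30 days."),
        ArticleVersion("return-policy", 3, "draft", "Returns within 45 days."),
        ArticleVersion("shipping-rules", 1, "archived", "Ships in 10 days."),
    ]
    for article_id, v in current_approved(versions).items():
        print(article_id, "-> version", v.version)
```

In this sample, only version 2 of the return policy is selected. The shipping article has no approved version, so it is left out entirely rather than falling back to archived text.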

When consistent or repeatable processing is required, structured elements embedded within unstructured documents—such as invoice dates, customer identifiers, or policy terms—often need to be extracted into structured fields. Without this step, agents can’t follow predictable logic or act reliably with those values.
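
A minimal sketch of this extraction step, assuming a simple text invoice whose labels ("Invoice Date:", "Amount Due:", "Account Number:") are invented for illustration. Real documents vary widely and often need OCR or a dedicated document-processing service, so the regular expressions here are only one possible approach.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal
from typing import Optional

# Hypothetical patterns for one illustrative invoice layout.
DATE_RE = re.compile(r"Invoice Date:\s*(\d{4}-\d{2}-\d{2})")
AMOUNT_RE = re.compile(r"Amount Due:\s*\$?([\d,]+\.\d{2})")
ACCOUNT_RE = re.compile(r"Account Number:\s*([A-Z0-9-]+)")


@dataclass
class InvoiceFields:
    invoice_date: Optional[date]
    amount: Optional[Decimal]
    account_number: Optional[str]


def extract_invoice_fields(text: str) -> InvoiceFields:
    """Pull typed values out of free text; a field is None when not found."""
    date_m = DATE_RE.search(text)
    amount_m = AMOUNT_RE.search(text)
    account_m = ACCOUNT_RE.search(text)
    return InvoiceFields(
        invoice_date=(
            datetime.strptime(date_m.group(1), "%Y-%m-%d").date() if date_m else None
        ),
        amount=Decimal(amount_m.group(1).replace(",", "")) if amount_m else None,
        account_number=account_m.group(1) if account_m else None,
    )


def missing_fields(fields: InvoiceFields) -> list[str]:
    """List the fields that could not be extracted, for review before agent use."""
    return [name for name, value in vars(fields).items() if value is None]


if __name__ == "__main__":
    sample = (
        "Invoice Date: 2025-02-14\n"
        "Account Number: NTO-00123\n"
        "Amount Due: $1,249.50\n"
        "Thank you for your business."
    )
    fields = extract_invoice_fields(sample)
    print(fields)
    print("Missing:", missing_fields(fields))
```

Storing the date as a `date` and the amount as a `Decimal` standardizes the format, and `missing_fields` gives a simple validation hook so incomplete documents can be flagged instead of passed to the agent.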

To determine the scope of data processing needed, Luna categorizes and assesses unstructured content. She ensures that:

  • Interaction content (chat logs, meeting transcripts, email) is correctly associated with the appropriate customer, employee, product, or transaction.
  • Business documents are evaluated to determine whether structured data extraction is required for consistent AI processing.
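
The association check in the first bullet can be turned into a measurable number. The sketch below computes the share of email activities that have no link to any structured record. The `EmailActivity` class and its field names are hypothetical stand-ins for whatever the source system actually stores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmailActivity:
    activity_id: str
    # Hypothetical lookup fields; None means the link is absent.
    contact_id: Optional[str] = None
    lead_id: Optional[str] = None
    opportunity_id: Optional[str] = None
    case_id: Optional[str] = None

    def is_linked(self) -> bool:
        """True when the activity points at one or more structured records."""
        return any(
            [self.contact_id, self.lead_id, self.opportunity_id, self.case_id]
        )


def unlinked_percentage(activities: list[EmailActivity]) -> float:
    """Percentage of activities with no associated record (0.0 for an empty list)."""
    if not activities:
        return 0.0
    unlinked = sum(1 for a in activities if not a.is_linked())
    return 100.0 * unlinked / len(activities)


if __name__ == "__main__":
    sample = [
        EmailActivity("e1", contact_id="c-001"),
        EmailActivity("e2", case_id="case-042"),
        EmailActivity("e3"),  # orphaned: retrievable, but without context
        EmailActivity("e4", lead_id="l-007", opportunity_id="o-019"),
    ]
    print(f"Unlinked email activities: {unlinked_percentage(sample):.1f}%")
```

A high percentage suggests that linking work should come before the content is exposed to an agent.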

Evaluate Unstructured Data Readiness for AI

To evaluate unstructured data, Luna asks a series of questions.

| Question | Impact | What to Do Based on the Answer |
| --- | --- | --- |
| Is the unstructured content associated with the correct customer, product, or transaction? Luna assesses what percentage of email activities aren’t linked to a Contact, Lead, Opportunity, or Case. | Without proper association, agent responses might be incorrect or based on incomplete information. | Link content to the appropriate structured records (Contact, Case, Order, Product). |
| Are authorship, timestamps, and participation context preserved? Luna verifies that transcripts are associated with the participants on the call. | If context, such as attendees or timestamps, is missing, meaning can be lost, and reliability decreases. | Preserve system-of-origin metadata. Avoid copying content into generic fields without context. Maintain traceability to the source system. |
| Are we using the correct and current version of policies, procedures, or knowledge content? Luna captures the system of record for knowledge articles to minimize risk during data integration. | Without strict version control, AI can act on outdated, unofficial, or conflicting guidance. | Designate a source-of-truth system. Enforce versioning controls and restrict agent access to approved content only. |
| Could outdated, duplicated, or misattributed content influence agent responses? Luna highlights the need for ongoing monitoring to assess and retire outdated content. | Duplicate or stale content increases the risk of hallucinations and inconsistent outputs. | Archive or retire outdated documents. De-duplicate content and establish content lifecycle governance. |
| Does the document contain structured elements (such as invoice amounts, dates, and identifiers) that should be extracted into governed fields? Luna provides a list of business documents whose data should be extracted and maintained, with lineage, before agentic processing. | When repeatable reasoning or actions depend on embedded structured values, relying solely on free text makes it harder for the agent to give consistent answers. | Extract critical elements into structured fields. Standardize formats and validate extracted values before use by the agent. |

Luna’s discovery questions help ensure that unstructured data sources and content are well-understood and appropriately made available to AI solutions.
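
The duplicate and stale-content question lends itself to a simple automated check. The sketch below groups documents whose text is identical after normalizing case and whitespace, and flags documents that have not been reviewed within a chosen window. The `Document` record, the `last_reviewed` field, and the 365-day default are assumptions for illustration. Near-duplicates with reworded text would need a fuzzier comparison than the exact hash used here.

```python
import hashlib
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Document:
    doc_id: str
    text: str
    last_reviewed: date  # hypothetical governance field


def content_fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(docs: list[Document]) -> list[list[str]]:
    """Return groups of document IDs that share the same fingerprint."""
    groups: dict[str, list[str]] = {}
    for doc in docs:
        groups.setdefault(content_fingerprint(doc.text), []).append(doc.doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def find_stale(
    docs: list[Document], today: date, max_age_days: int = 365
) -> list[str]:
    """Return IDs of documents last reviewed before the cutoff date."""
    cutoff = today - timedelta(days=max_age_days)
    return [doc.doc_id for doc in docs if doc.last_reviewed < cutoff]


if __name__ == "__main__":
    docs = [
        Document("kb-1", "Returns accepted within 30 days.", date(2025, 1, 10)),
        Document("kb-2", "returns   accepted within 30 DAYS.", date(2023, 6, 1)),
        Document("kb-3", "Shipping takes 5 business days.", date(2024, 11, 20)),
    ]
    print("Duplicate groups:", find_duplicates(docs))
    print("Stale documents:", find_stale(docs, today=date(2025, 3, 1)))
```

Running a check like this on a schedule is one way to support the ongoing monitoring Luna calls for. The results are candidates for human review, not automatic deletion.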

Now that you’ve explored how to assess unstructured data, move on to the next unit where you discover the importance of structured data for AI.
