Explore Data 360 Content Parsing and Pre-Processing Methods

Learning Objectives

After completing this unit, you’ll be able to:

Describe Data 360 content parsing and pre-processing methods.
Explain how selecting the right parsing and pre-processing method grounds AI in trusted, accurate data.

Parsing Content

Parsing transforms unstructured documents, such as PDFs, manuals, and reports, into structured data that AI systems can easily retrieve and analyze. The parsing strategy you select is critical: it determines how well document structure, relationships, and visual context are preserved, which directly affects your system's answer accuracy, overall performance, and operational costs.

Data 360 offers three options for parsing content.

Default Parser: Extracts text into structured, searchable data with built-in settings.
LLM-based Parser: Extracts text, images, and other visual elements using an LLM.
Docling Parser: Extracts text and tables with layout understanding using open-source models. Combine the Docling Parser with LLM-based image processing to handle visuals such as flow charts and images.

The three parsing options available in the search index builder when you create a search index in Data 360. The default parser option is selected.

Let’s explore the options, so you know which one to choose for your use case.

Default Parser

The Default Parser is a highly scalable, cost-efficient solution designed specifically for text-heavy resources like knowledge base articles, policy documents, developer documentation, and internal wikis. It’s optimized to extract clean, linear text where meaning is primarily conveyed through paragraphs, lists, and headings. This makes it the ideal choice for large-scale ingestion of curated textual knowledge with minimal structural complexity.

LLM-based Parser

For highly complex, multimodal documents, full LLM-based parsing delivers the most comprehensive interpretation available across text, tables, and visuals. By enabling a holistic semantic understanding of diverse content types, it provides unparalleled contextual awareness. This depth of analysis is ideal for scenarios—such as advanced compliance analysis or engineering diagnostics—where maximizing comprehension is the primary objective and outweighs any additional processing costs.

Docling Parser

The Docling Parser is designed for enterprise documents where layout and structural relationships matter—such as financial reports, compliance documents, and operational reports. Unlike standard parsers, it delivers superior structural fidelity by interpreting complex elements like multilevel headers, merged cells, and nested tables. This deep layout awareness preserves the vital context needed to ensure highly accurate downstream retrieval and AI-generated answers.

When dealing with visually complex enterprise documents like system architectures, organizational charts, or governance workflows, standard text extraction falls short. Enabling Image Processing with Docling bridges this gap by intelligently interpreting both textual and visual elements. By using advanced LLMs to selectively analyze directional flows, labeled connections, and structural dependencies, this approach unlocks deeper contextual insights into visually represented processes without sacrificing processing efficiency.
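To make the trade-offs above concrete, here's a minimal sketch of the selection guidance written as a small Python helper. It isn't a Data 360 API: the `DocumentProfile` fields, the `Parser` enum, and the `suggest_parser` function are hypothetical names invented for illustration, and the decision order is one reasonable reading of the descriptions above, not an official rule.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Parser(Enum):
    # Labels mirror the three options described in this unit.
    DEFAULT = "Default Parser"
    LLM_BASED = "LLM-based Parser"
    DOCLING = "Docling Parser"


@dataclass
class DocumentProfile:
    """Hypothetical summary of a document set (not a Data 360 object)."""
    multimodal_throughout: bool = False  # text, tables, and visuals mixed across the documents
    layout_matters: bool = False         # multilevel headers, merged cells, nested tables
    has_diagrams: bool = False           # flow charts, org charts, system architectures


def suggest_parser(profile: DocumentProfile) -> Tuple[Parser, bool]:
    """Return (parser, enable_image_processing_with_docling).

    Assumption: this ordering is an interpretation of the guidance in this
    unit, not documented product behavior.
    """
    if profile.multimodal_throughout:
        # Maximum comprehension is the goal and outweighs processing cost.
        return Parser.LLM_BASED, False
    if profile.layout_matters or profile.has_diagrams:
        # Docling preserves structure; image processing adds diagram interpretation.
        return Parser.DOCLING, profile.has_diagrams
    # Text-heavy content with minimal structural complexity.
    return Parser.DEFAULT, False
```

For example, `suggest_parser(DocumentProfile(layout_matters=True, has_diagrams=True))` suggests the Docling Parser with image processing enabled, while an empty `DocumentProfile()` falls through to the Default Parser.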

Pre-Processing

You can also select LLM-based Visual Data Pre-Processing with the Default Parser to capture context from visual data using an LLM. LLM-based visual data pre-processing prepares multimodal elements, such as images and tables, for chunking. For example, pre-processing captures context from images and extracts data from tables while maintaining the context and relationships within the table data.

The LLM-based Visual Data Pre-Processing option selected in Index Builder, along with the default LLM model (GPT-4o) and a default prompt that instructs the LLM how to process the visual content.

You can't select both the LLM-based Parser and LLM-based Visual Data Pre-Processing for the same search index. Use the LLM-based Parser for documents that contain rich visual content, such as images, charts, and tables, throughout. In such cases, processing the entire document holistically provides better context and understanding.
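If you track index settings in your own tooling, the compatibility rule above can be expressed as a simple check. The following is a sketch only: `IndexParsingConfig`, its field names, and the `validate` function are hypothetical and don't correspond to any Data 360 API. The only rule encoded is the one stated above, that LLM-based parsing and LLM-based Visual Data Pre-Processing can't both be selected.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical identifiers for the three parsing options in this unit.
VALID_PARSERS = {"default", "llm", "docling"}


@dataclass
class IndexParsingConfig:
    """Illustrative settings holder; names are assumptions, not product fields."""
    parser: str = "default"
    visual_preprocessing: bool = False  # LLM-based Visual Data Pre-Processing


def validate(config: IndexParsingConfig) -> List[str]:
    """Return a list of human-readable problems; an empty list means no conflict."""
    errors: List[str] = []
    if config.parser not in VALID_PARSERS:
        errors.append(f"Unknown parser: {config.parser!r}")
    if config.parser == "llm" and config.visual_preprocessing:
        # The one constraint stated in this unit.
        errors.append(
            "LLM-based parsing and LLM-based Visual Data Pre-Processing "
            "can't both be selected for a search index."
        )
    return errors
```

Calling `validate(IndexParsingConfig(parser="default", visual_preprocessing=True))` returns an empty list, while the same call with `parser="llm"` returns one error describing the conflict.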

Summary

Parsing and pre-processing transform unstructured documents like PDFs and reports into structured data that AI systems can analyze. Data 360 provides multiple options so you can balance contextual understanding, scale, and efficiency. The strategy you choose directly impacts your system's accuracy, performance, and costs.

Take the next step to discover search index types in Data 360 and identify the right search index to build for your use case with the Search Index Types in Data 360: Quick Look module.
