Apply Data Quality Assets to Transform Your Data

Learning Objectives

After completing this unit, you’ll be able to:

Describe the purpose of Data Quality assets in Informatica Cloud.
Explain how Data Quality assets cleanse, standardize, and enhance data.

Time to Fix the Data

Jamal now understands his data quality problems. He’s profiled the data, reviewed the statistics, and defined his rules. Now it’s time to act.

This is the Apply phase—and it’s powered by Informatica Cloud Data Quality (CDQ) assets. Data Quality assets are reusable components you create in the Data Quality service. Each asset targets a specific type of data quality problem. To update the data, you apply these assets in transformations inside a mapplet or mapping in Cloud Mapping Designer.

The Data Quality Assets

The Data Quality service in Informatica Intelligent Data Management Cloud (IDMC) provides several asset types, each of which you can configure for your specific requirements.

Creating a New Data Quality Asset.

Dictionaries

A dictionary is a reference data object containing a set of approved, standard values. It acts as your organization’s source of truth for a given field—a primary list of valid country names, product categories, department codes, and so on.

At least one column must contain the standard or preferred version of a set of values (the valid column). Other columns can contain alternative or related versions, including variations that might exist in source data.

You can use your dictionaries to:

Validate whether a value exists in an approved list.
Standardize variations of the same value—for example, map “USA,” “U.S.A.,” and “United States” all to “United States”.
Enhance data by replacing nonstandard values with their preferred equivalents.
Derive data by searching for a value and returning the valid value, such as searching for a country and returning its currency.

Dictionaries are foundational—rule specifications, labelers, cleanse, and parse assets all use them, which makes them one of the most versatile building blocks in CDQ.
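
To make the validate, standardize, and derive uses concrete, here is a minimal plain-Python sketch of a dictionary lookup. It is not Informatica's API or implementation; the table contents and the names COUNTRY_DICTIONARY, build_lookup, standardize, and derive_currency are all assumptions made for illustration.

```python
# Hypothetical reference dictionary: each row holds the valid (preferred)
# value, alternative spellings that may appear in source data, and a
# related value that can be derived from a match.
COUNTRY_DICTIONARY = [
    {"valid": "United States", "alternatives": ["USA", "U.S.A.", "US"], "currency": "USD"},
    {"valid": "United Kingdom", "alternatives": ["UK", "U.K.", "Great Britain"], "currency": "GBP"},
]


def build_lookup(rows):
    """Index every valid and alternative value (case-insensitive) by its row."""
    lookup = {}
    for row in rows:
        for value in [row["valid"], *row["alternatives"]]:
            lookup[value.strip().casefold()] = row
    return lookup


LOOKUP = build_lookup(COUNTRY_DICTIONARY)


def standardize(value):
    """Return the valid value, or None when the input is not in the dictionary."""
    row = LOOKUP.get(value.strip().casefold())
    return row["valid"] if row else None


def derive_currency(value):
    """Search for a country and return its currency, or None if unknown."""
    row = LOOKUP.get(value.strip().casefold())
    return row["currency"] if row else None
```

A None result from standardize doubles as a validation failure: the value is not in the approved list.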

Rule Specifications

A rule specification asset represents the data requirements of a business rule in logical form. It defines:

What types of data a field should contain
What conditions the data must satisfy
What action to take when data passes or fails the rule

Rule specifications are reusable across multiple datasets and projects. Jamal creates a rule specification requiring every customer record to have a valid, non-null email address—then reuses that same rule across different source systems without rebuilding it each time.
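
The logic of Jamal's email rule can be sketched in plain Python. This is only an illustration of the condition-and-action idea, not how a rule specification is actually built in the Data Quality service; the function name email_rule, the status strings, and the deliberately simple pattern are assumptions.

```python
import re

# Deliberately simple pattern for illustration only: something@something.tld
# with no spaces. Real-world email validation is more involved.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def email_rule(record):
    """Apply the rule to one record and return a status string."""
    email = record.get("email")
    # Condition 1: the value must be present and non-empty.
    if email is None or not email.strip():
        return "Invalid: missing email"
    # Condition 2: the value must look like an email address.
    if not EMAIL_PATTERN.fullmatch(email.strip()):
        return "Invalid: malformed email"
    return "Valid"
```

Because the rule only depends on a field named email, the same function can be applied to records from any source system that exposes that field, which mirrors the reuse described above.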

Labeler

A labeler asset identifies the types of information in an input field and writes a label for each type to a corresponding output field. It can operate in two modes:

Token mode: A label is applied to a string of characters that indicates a type of information, such as a person name, company name, country, or product.
Character mode: Each individual character is labeled—useful for phone numbers, postal codes, dates, and national IDs such as Social Security numbers.

You can base labeling operations on dictionaries, regular expressions, or character sets. Jamal uses a labeler to identify the patterns and types of information inside free-text fields—a useful step before he standardizes or parses them.
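
The two modes can be illustrated with a short plain-Python sketch. The label symbols ("9", "A", "_"), the TOKEN_LABELS table, and both function names are assumptions chosen for this example; they are not the labels or configuration that a labeler asset uses.

```python
# Hypothetical token dictionary used for token-mode labeling.
TOKEN_LABELS = {"acme": "COMPANY", "canada": "COUNTRY"}


def label_characters(value):
    """Character mode: emit one label per character."""
    labels = []
    for ch in value:
        if ch.isdigit():
            labels.append("9")   # digit
        elif ch.isalpha():
            labels.append("A")   # letter
        elif ch.isspace():
            labels.append("_")   # whitespace
        else:
            labels.append(ch)    # keep punctuation as its own label
    return "".join(labels)


def label_tokens(value):
    """Token mode: emit one label per whitespace-separated token."""
    return [TOKEN_LABELS.get(token.casefold(), "UNKNOWN") for token in value.split()]
```

With this sketch, a phone number such as "555-0100" yields the pattern "999-9999", which makes it easy to spot values in a column that do not follow the expected shape.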

Cleanse

A cleanse asset is a set of one or more data transformation steps that standardize the form and content of data. It can:

Improve data consistency in a dataset.
Fix errors in data.
Help data comply with regulatory standards.
Prepare data for downstream data quality initiatives.

Cleansing is typically the first transformation step after profiling. Before running address verification, for example, Jamal cleanses the address data to standardize abbreviations and remove extraneous characters. This makes the verification step more effective.
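
As a rough illustration of the address example, the following plain-Python sketch removes extraneous characters and expands a few abbreviations. The ABBREVIATIONS table and the cleanse_address function are hypothetical and far simpler than a real cleanse asset; for instance, this sketch always expands "st" to "Street", even where it might mean "Saint".

```python
import re

# Hypothetical abbreviation table; a real cleanse step would typically
# draw these values from a dictionary asset.
ABBREVIATIONS = {"st": "Street", "ave": "Avenue", "rd": "Road"}


def cleanse_address(raw):
    """Strip punctuation, collapse whitespace, and expand abbreviations."""
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", raw)
    # split() with no arguments also collapses repeated whitespace.
    words = text.split()
    cleaned = [ABBREVIATIONS.get(word.casefold(), word.title()) for word in words]
    return " ".join(cleaned)
```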

Parse

A parse asset splits data from a single multi-domain input field into multiple individual, single-domain fields. This is especially useful for names and addresses stored in a single text column.

Parse supports two modes:

Prebuilt mode: Uses predefined name parsing logic for common data types, such as splitting “John A. Smith” into First Name, Middle Initial, and Last Name.
Custom mode: Lets you build custom parsing logic using dictionaries and regular expressions.

Jamal uses parse to split a Full Name field into separate fields for first name, middle name, last name, title, and suffix. Parse also generates additional data automatically—for example, a greeting term such as Mr. Jamal Booker, or an expanded formal name when you enter an abbreviated version. Parsing typically happens as part of the standardization process.
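
The idea of splitting one field into several can be sketched in plain Python. This is a simplified, position-based approach with small hypothetical TITLES and SUFFIXES sets; it is not the prebuilt parsing logic of the parse asset, and it will misplace multi-word last names.

```python
# Hypothetical lookup sets; a real parse asset would rely on dictionaries
# and regular expressions configured in the Data Quality service.
TITLES = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"jr", "sr", "ii", "iii"}


def parse_full_name(full_name):
    """Split a full name into title, first, middle, last, and suffix."""
    tokens = full_name.replace(",", " ").split()
    result = {"title": "", "first_name": "", "middle_name": "", "last_name": "", "suffix": ""}
    # A leading token such as "Dr." is treated as a title.
    if tokens and tokens[0].rstrip(".").casefold() in TITLES:
        result["title"] = tokens.pop(0)
    # A trailing token such as "Jr." is treated as a suffix.
    if tokens and tokens[-1].rstrip(".").casefold() in SUFFIXES:
        result["suffix"] = tokens.pop()
    if tokens:
        result["first_name"] = tokens.pop(0)
    if tokens:
        result["last_name"] = tokens.pop()
    # Whatever remains between first and last name is the middle name.
    result["middle_name"] = " ".join(tokens)
    return result
```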

Verifier

A verifier asset evaluates the accuracy and deliverability of postal address records.

A verifier does the following:

Compares input addresses against authoritative reference data covering 240+ countries and territories.
Corrects errors and standardizes address formats.
Enhances records with additional data, such as missing postal codes and geocodes.
Where available, supports reverse geocoding, which converts geographic coordinates into readable addresses.
Reports on the quality of each address with status codes to support business decisions.

The verifier downloads reference data to the Secure Agent for local processing. Address verification requires a separate Verifier license.

Deduplicate

A deduplicate asset uses identity matching to identify and group similar or related records that represent the same real-world entity. The deduplicate asset specifies the type of identity that the transformation looks for at run time—the type of identity determines the input fields that the transformation analyzes.

Deduplicate can:

Assess the overall level of duplication in a dataset.
Group similar records into clusters based on matching logic.
Consolidate clusters into a single, preferred master record.

When Jamal runs deduplication on the customer table, he discovers that 14% of records are duplicates. This explains the discrepancy between the marketing, sales, and finance customer counts.
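
The three capabilities (assess, group, consolidate) can be sketched in plain Python. Note the simplification: this example groups records by an exact match on a normalized key, whereas identity matching also catches near matches. The field names, the match_key function, and the "most complete record wins" survivorship rule are all assumptions for illustration.

```python
from collections import defaultdict


def match_key(record):
    """Build a normalized key; records with the same key are treated as duplicates."""
    name = " ".join(record["name"].casefold().split())
    email = record["email"].strip().casefold()
    return (name, email)


def cluster(records):
    """Group records that share a match key."""
    clusters = defaultdict(list)
    for record in records:
        clusters[match_key(record)].append(record)
    return list(clusters.values())


def consolidate(cluster_records):
    """Assumed survivorship rule: keep the record with the most non-empty fields."""
    return max(cluster_records, key=lambda r: sum(1 for v in r.values() if v))


def duplication_rate(records):
    """Share of records that are redundant copies of another record."""
    if not records:
        return 0.0
    return (len(records) - len(cluster(records))) / len(records)
```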

Put It All Together in Cloud Mapping Designer

You typically create Data Quality assets in the Data Quality service and apply them inside Cloud Mapping Designer, which is part of Cloud Data Integration. To update your data, you use assets in transformations inside a mapplet or mapping.

In Mapping Designer, Jamal builds a mapping that:

Reads customer data from the source system.
Uses a labeler to identify the types of information in free-text fields.
Applies a cleanse asset to standardize text formatting and remove noise.
Uses a parse asset to split the name into individual structured fields.
Runs the verifier to validate and enhance address data.
Applies a deduplicate transformation to consolidate duplicate records.
Writes clean, trusted data to the target system.

The result is a dataset that Jamal’s entire organization can rely on.
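
Conceptually, a mapping like this is an ordered chain of steps in which each step receives the output of the previous one. The plain-Python sketch below shows only that chaining idea, using three toy steps; run_mapping and the step functions are hypothetical and do not represent Mapping Designer transformations.

```python
def trim(records):
    """Toy cleanse step: strip surrounding whitespace from every field."""
    return [{key: value.strip() for key, value in record.items()} for record in records]


def title_names(records):
    """Toy standardize step: put the name field in title case."""
    return [{**record, "name": record["name"].title()} for record in records]


def drop_exact_duplicates(records):
    """Toy deduplicate step: keep the first copy of each identical record."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def run_mapping(records, steps):
    """Apply each step in order, passing its output to the next step."""
    for step in steps:
        records = step(records)
    return records
```

The order matters: the duplicate check only succeeds here because trimming and standardizing run first and make the two spellings identical.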

A Cloud Data Integration mapping configured to cleanse and standardize data.

Before and After Cloud Data Quality

Data Quality applies custom rules to detect, correct, and standardize raw data, which transforms inconsistent or incomplete records into reliable, validated outputs ready for business use.

Review how the data quality process standardizes common field types.

Field	Before	After
Email	john.doe@, JOHNDOE@ACME	john.doe@acme.com
Address	123 main st new york ny	123 Main Street, New York, NY 10001
Customer Name	JOHN SMITH / john smith / J. Smith	John Smith
Record Count	11,200 (with duplicates)	9,650 (deduplicated)
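
As a quick arithmetic check, and assuming the record counts in the table describe the same customer table Jamal deduplicated, the reduction works out to roughly the 14% duplication level mentioned earlier. The variable names are illustrative.

```python
records_before = 11_200  # record count with duplicates (from the table)
records_after = 9_650    # record count after deduplication (from the table)

duplicates_removed = records_before - records_after
duplication_percent = round(duplicates_removed / records_before * 100, 1)

print(duplicates_removed)   # 1550
print(duplication_percent)  # 13.8
```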

You now know how to apply Data Quality assets like parse and verifier to cleanse and standardize your data. With these foundational skills, you’re ready to take on data degradation and deliver reliable, high-quality data to your entire organization.
