Use Profiling to Discover Data Quality Issues

Learning Objectives

After completing this unit, you’ll be able to:

Summarize where data profiling fits within a data quality management process.
Describe how profile rules detect anomalies in data.
Explain the benefits of using scorecarding to monitor data quality.

Check Your Data Before You Fix It

Jamal’s organization pulls customer data from three different systems: a CRM, an ecommerce platform, and a legacy database. Before he can cleanse or standardize any of it, he needs to understand exactly what he’s working with. He needs a clear, evidence-based picture of the problems—not a vague sense that something is wrong.

That’s where data profiling comes in.

What Is Data Profiling, and Where Does It Fit?

Data profiling is the foundational step in the Discover phase of the Data Quality Management Process. It’s the practice of analyzing your data to understand its structure, content, and quality—before you apply any fixes or transformations.

Think of it like a health check for your data. Just as a doctor runs diagnostic tests before prescribing treatment, you profile your data before attempting any cleansing or transformation. Profiling analyzes the contents of all specified fields in a dataset and identifies low-quality data according to the six data quality dimensions.

In CDQ, the Data Profiling service lets you run a profile task against any connected data source. Once the profile runs, you can:

View historical and latest profile results to track changes over time.
Compare two profile runs side by side to spot emerging issues.
Inspect specific values, data types, and patterns to investigate anomalies.
Identify unique or repeating values through value frequency analysis.
Identify patterns and formats in data fields.
Export results to Microsoft Excel for further analysis.
Monitor profile job status in real time.

Reports from profiling provide direct input for the cleanse and standardize processes you apply in the next phase.
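To make the idea of comparing two profile runs concrete, here is a minimal sketch in plain Python. It is not the CDQ interface. The `compare_runs` function, the metric names, and the numbers are all hypothetical and exist only to show how a side-by-side comparison can surface a metric that changed between runs.

```python
def compare_runs(previous, current):
    """Return the metrics whose values differ between two profile runs."""
    changes = {}
    for metric in sorted(set(previous) | set(current)):
        before = previous.get(metric)
        after = current.get(metric)
        if before != after:
            changes[metric] = (before, after)
    return changes


# Hypothetical results for one column from two profile runs.
run_1 = {"null_count": 120, "distinct": 870, "max_length": 10}
run_2 = {"null_count": 185, "distinct": 870, "max_length": 12}

changes = compare_runs(run_1, run_2)
for metric, (before, after) in changes.items():
    print(f"{metric}: {before} -> {after}")
```

In this made-up example, the null count and the maximum length changed while the distinct count held steady, which is the kind of emerging issue a comparison is meant to highlight.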

What Profiles Measure

When CDQ profiles a dataset, it generates a rich set of statistics for each column.

Metric	What It Reveals
Null count	How many values are missing (Completeness)
Distinct or nondistinct	How many unique values exist versus repeated values (Uniqueness)
Min and max values	The range of values in the field
Min and max length	Shortest and longest values—useful for catching truncated data
Patterns	Common formats found in the field (Validity)
Value frequency outliers	Unusual values that appear more or less often than expected (Accuracy)
Inferred or documented data types	Whether data matches its expected type

When Jamal profiles the customer address and contact fields, he immediately sees that 12% of records have null postal codes and 8% of phone numbers don’t match the expected pattern. He now has a precise, evidence-based picture of what needs to be fixed and which dimensions are failing.
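If you want a feel for what these statistics involve, the following is a rough sketch in plain Python rather than anything CDQ exposes. The `profile_column` and `to_pattern` helpers, the pattern notation (9 for a digit, A for a letter), and the sample postal codes are assumptions made for illustration. Here, "distinct" counts unique values and "non_distinct" counts values that appear more than once, which is one reasonable reading of those metrics.

```python
from collections import Counter


def to_pattern(value):
    """Replace each digit with 9 and each letter with A; keep other characters."""
    chars = []
    for ch in value:
        if ch.isdigit():
            chars.append("9")
        elif ch.isalpha():
            chars.append("A")
        else:
            chars.append(ch)
    return "".join(chars)


def profile_column(values):
    """Compute basic profile statistics; None and empty strings count as null."""
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    lengths = [len(v) for v in present]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(present),
        "distinct": len(counts),
        "non_distinct": sum(1 for c in counts.values() if c > 1),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "patterns": Counter(to_pattern(v) for v in present),
    }


# Hypothetical sample of a postal code column.
postal_codes = ["94105", "10001", None, "9410", "", "10001", "SW1A 1AA", "60601"]
stats = profile_column(postal_codes)

print(stats["null_count"])               # 2
print(stats["min_length"])               # 4
print(stats["patterns"].most_common(1))  # [('99999', 4)]
```

Even on this tiny sample, the minimum length of 4 hints at a truncated value, and the less common patterns point to records worth a closer look.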

Sample Data Quality Profile focusing on Customer Name and Address Data.

Use Profile Rules to Detect Anomalies

Profiling statistics tell you what’s in your data. Profile rules let you define what should be in your data and automatically flag anything that falls short.

In CDQ, you create profile rules and apply them to a dataset. A rule defines a condition that data must satisfy. When you configure a rule to do so, CDQ flags records that fail it as exceptions, giving you a clear, actionable list of data that needs attention.

For example, Jamal creates a rule requiring every customer record to have a valid email address format. When the profile runs, CDQ automatically identifies all records where the email is missing, malformed, or contains an invalid domain. Instead of manually reviewing thousands of rows, Jamal gets a targeted exception report in seconds.

Profile rules are reusable across multiple datasets—a scalable way to enforce data standards organization-wide. You can output exception records to a file for manual correction when needed.
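The logic behind a rule like Jamal's can be sketched in a few lines of plain Python. This is an illustration only, not how CDQ defines or runs rules. The `valid_email` and `apply_rule` functions, the simplified regular expression, and the sample records are assumptions. The sketch checks only for missing or malformed addresses and does not verify that a domain actually exists.

```python
import re

# Simplified format check: something@something.tld (an assumption, not a full validator).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$")


def valid_email(record):
    """Rule condition: the record has a non-empty, well-formed email."""
    email = record.get("email")
    return bool(email) and EMAIL_RE.match(email) is not None


def apply_rule(records, rule, rule_name):
    """Split records into those that pass the rule and those flagged as exceptions."""
    passed, exceptions = [], []
    for record in records:
        if rule(record):
            passed.append(record)
        else:
            exceptions.append({**record, "failed_rule": rule_name})
    return passed, exceptions


# Hypothetical customer records.
customers = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "bob(at)example.com"},
    {"id": 4, "email": None},
    {"id": 5, "email": "li@example"},
]

passed, exceptions = apply_rule(customers, valid_email, "valid_email_format")
print([r["id"] for r in exceptions])  # [2, 3, 4, 5]
```

Because `apply_rule` accepts any condition function, the same rule can be applied to a different list of records without changes, which mirrors the reuse described above.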

Data Quality Profile with Rules Applied showing the percentage of Valid/Invalid Company Names in the Dataset.

Sample Data output from an Exception Management Process highlights issues with each record.

Track Quality Over Time with Scorecards

Fixing data quality issues is important—but how do you know if things are actually getting better?

That’s where scorecards come in.

Scorecards, managed through the Informatica Cloud Data Governance and Catalog (CDGC) service, provide a visual representation of data quality scores over time. They aggregate profile run results and rule evaluations into a single, easy-to-read view and include the following features.

Current data quality scores by dimension (completeness, validity, accuracy, uniqueness, and so on)
Trends over time that signal whether quality is improving, declining, or holding steady
Specific areas where quality falls below acceptable thresholds

For Jamal and his team, scorecards transform data quality from a vague concern into a measurable business metric. They set quality targets, track progress, and address problems before those problems affect downstream systems or decisions. Scorecards also support a culture of data ownership—teams know the quality of their data and take responsibility for improving it.
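The arithmetic behind a scorecard can be sketched as follows. This is a simplified illustration in plain Python, not the CDGC implementation. The `score`, `trend`, and `build_scorecard` functions, the 0.5-point tolerance, the thresholds, and both sets of run figures are hypothetical. The sketch scores each dimension as the percentage of valid records, compares the latest run with the previous one, and flags dimensions that fall below a target.

```python
def score(valid_count, total_count):
    """Percentage of records that passed, rounded to one decimal place."""
    return round(100 * valid_count / total_count, 1) if total_count else 0.0


def trend(previous, current, tolerance=0.5):
    """Classify the change between two scores."""
    delta = current - previous
    if delta > tolerance:
        return "improving"
    if delta < -tolerance:
        return "declining"
    return "steady"


def build_scorecard(runs, thresholds):
    """Summarize the latest run against the previous one (runs are oldest first)."""
    prior, latest = runs[-2], runs[-1]
    card = {}
    for dimension, (valid, total) in latest.items():
        current = score(valid, total)
        previous = score(*prior[dimension])
        card[dimension] = {
            "score": current,
            "trend": trend(previous, current),
            "below_threshold": current < thresholds[dimension],
        }
    return card


# Hypothetical (valid, total) counts per dimension for two profile runs.
runs = [
    {"completeness": (880, 1000), "validity": (920, 1000), "uniqueness": (990, 1000)},
    {"completeness": (940, 1000), "validity": (900, 1000), "uniqueness": (990, 1000)},
]
thresholds = {"completeness": 95.0, "validity": 95.0, "uniqueness": 98.0}

card = build_scorecard(runs, thresholds)
for dimension, summary in card.items():
    print(dimension, summary)
```

In this made-up data, completeness is improving but still below its target, validity is declining, and uniqueness is steady and above its target.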

Data Quality Scorecard used to monitor quality over time.

Data profiling makes it easy to verify your data’s health and pinpoint exactly where the anomalies lie. In the next unit, you take action on those insights by applying specific Data Quality assets to standardize and transform your records.
