Join Data in Data Prep

Learning Objectives

After completing this unit, you’ll be able to:

  • Describe the different join types for a dataset.
  • Evaluate lookup and join tasks for preserving the grain.

Linking Multiple Datasets

The dataset you’re proposing requires a data operation that will bring together data from two or more related data sources into one. You’ve learned about lookups in the previous unit. Another operation to examine is a join. There are many join behaviors to get the desired outcome, such as inner, left, right, and full outer join.

The datasets used in the join examples contain values that uniquely define each row, and identify common values between datasets.

Opportunities dataset with four rows (four unique opportunities).

Accounts dataset with three accounts.

 Cases dataset with three cases.

Inner Join: Matching Opportunities to Cases

An inner join selects matching records based on a common field(s). 

Inner join.

In Illustration 4, an inner join is initiated between Opportunity.AccountId and Cases.AccountId. The value Account1, found in the first two rows of both datasets, is a match. Matching rows from each dataset are added to a join result. 

Illustration 4: Inner join between Opportunity and Cases using the AccountId field.Illustration 4: Inner join between Opportunity and Cases using the AccountId field.

In Illustration 5, a match has been established and fields from the matching rows are added to an output dataset. Take a look at the Opp001 value in Opportunity. Since its AccountId matches the AccountId for Case1 and Case2 in Cases, Opp001 repeats on both rows. We didn’t merge the matching rows from Cases, but kept them separate. 

Illustration 5: Inner join between Opportunity and Cases.

Illustration 5: Inner join between Opportunity and Cases.

The step continues for Illuminations in Opportunity.Name and $500 in Opportunity.Amount. The process repeats for Opp002, which also matches Case1 and Case2 in Cases. You can also think of the same inner join concept starting from Cases and matching AccountIDs in Opportunity. 

Left Join (Full Left): Matching Opportunities to Cases

Left join outputs all rows from the left dataset, then matching rows from the right dataset are added based on a common field(s).

Left join.

In a left join between Opportunity and Cases using AccountId, all Opportunity records are output, whether they have a matching Case record or not. Next, Opportunity records are matched to Cases by AccountId, and matching rows are added. Finally, nonmatching records of Opps in Cases are output with “null” for the Case dimension fields and “0” for the case measure fields.

Left join between Opportunity and Cases using the AccountId field.

OppIDs Opp001 and Opp002 each have two matching case rows (by AccountId) resulting in four rows. Notice that the count of Opportunity records increased with the left join.

Right Join (Full Right): Matching Cases to Opportunities

Right join outputs all rows from the right dataset, then matching rows from the left dataset are added based on a common field(s).

Right join.

Using Opportunities and Cases, a right join operates similarly to a left join, but from the perspective of the Cases dataset.

Right join between Opportunity and Cases using the AccountId field.

Full Outer Join: Matching Opportunities to Cases

Full outer join returns all records whether there is a match or not between datasets. 

Full outer join.

In other words, a full outer join returns all rows from Opportunity and Cases. Nonmatching rows contain “null.”

Full outer join between Opportunity and Cases using the AccountId field.

So, what's the difference between a lookup and a join?

As a reminder, here is what an Opportunity to Cases would look like as a lookup.

Lookup between Opportunity and Cases using the AccountId field.

Now take a look at the left join with Opportunity to Accounts.

Left join between Opportunity and Cases using the AccountId field.

After the left join, Opportunity ends up with the same number of rows it started with, plus related fields from Account. The end result is similar to the Opportunity to Accounts lookup. How did this happen? The Accounts dataset does not contain repeating AccountIds as they do in Cases, hence there are no repeating records. For instances such as this, you may be led to assume a lookup and left join deliver the same outcome.

Take a look at a left join of Accounts and Opportunity to get an output of all Accounts and related Opportunity fields. In this case the result is not keeping the Account rows unique. 

Account Left Join Opportunity.

Summary

Join is a powerful transformation to combine rows from multiple datasets. Understanding the different types of joins that are available, and how they’re used can help you create the datasets to meet a business need. The bottom line is you need to assess which join behavior is the most suitable to avoid unintended outcomes in your output result.

Resources

Keep learning for
free!
Sign up for an account to continue.
What’s in it for you?
  • Get personalized recommendations for your career goals
  • Practice your skills with hands-on challenges and quizzes
  • Track and share your progress with employers
  • Connect to mentorship and career opportunities