Examine Correlation in Data
Learning Objectives
After completing this unit, you’ll be able to:
- Define correlation.
- Distinguish between strong and weak correlations.
Journey Toward Data Fluency
Data literacy is the foundation for using and communicating with data with ease.
The Data Literacy Basics module describes quantitative variables as numerically measurable characteristics, such as number of hours spent watching television each day, speed measured in miles per hour, total inches of annual rainfall in a city, sales in dollars, and amount spent on marketing.
When you are examining relationships within your data, how do you determine how closely two variables, like sales and the amount spent on marketing, are related? Can you use one variable to predict the other?
Correlation and regression are important techniques used to discover trends and make predictions. While there are other important forms used in analytics, we focus on the simplest form used in AI and analytics—linear correlation and regression.
In this unit, you gain familiarity with the concept of correlation, which describes whether and how closely two variables move in relation to each other. You gain an appreciation of how correlation measures association but doesn’t prove causation. In the next unit, you explore how linear regression can be used to calculate or predict the value of one variable based on another, in addition to measuring how well this model fits your data.
What Is Correlation?
Correlation is a technique that can show whether and how strongly pairs of quantitative variables are related.
For example, do the number of daily calories consumed and body weight have a relationship? Do people who consume more calories weigh more? Correlation can tell you how strongly people’s weights are related to their calorie intake.
The correlation between weight and calorie intake is a simple example, but sometimes the data you work with may not have the relationships that you expect. Other times, you may suspect correlations without knowing which are the strongest. Correlation analysis helps you understand your data.
When you begin your correlation analysis, you can create a scatter plot to investigate the relationship between two quantitative variables. The variables are plotted as Cartesian coordinates, marking how far along on a horizontal x-axis and how far up on a vertical y-axis each data point is. In the scatter plot below, you see the relationship between sales and the amount spent on marketing. It appears there’s a correlation: As one variable goes up, the other seems to as well.
Correlation Versus Causation
Now that you know how correlation is defined and how it is represented graphically, let's discuss how to better understand correlation.
First, it’s important to know that correlation never proves causation.
Pearson’s correlation tells us only how strongly a pair of quantitative variables are linearly related. It does not explain how or why they’re related.
For example, sales of air conditioners correlate with sales of sunscreen. People aren’t buying air conditioners because they bought sunscreen, or vice versa. The cause of both purchases is hot weather.
How Is Correlation Measured?
Pearson’s correlation, also called the correlation coefficient, is used to measure the strength and direction (positive or negative) of the linear relationship between two quantitative variables. When correlation is measured in a sample of data, the symbol used is the letter r. Pearson’s r can range from -1 to 1.
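For reference, the sample correlation coefficient is commonly written as follows. The symbols are notation introduced here, not taken from this unit: x_i and y_i are the paired observations, and x-bar and y-bar are their sample means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator measures how the two variables vary together, and the denominator rescales that quantity so the result always falls between -1 and 1.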
When r = 1, there is a perfect positive linear relationship: the data points fall exactly on a line that slopes upward, so one variable always increases as the other increases. When r = -1, there is a perfect negative linear relationship: the points fall exactly on a line that slopes downward, so one variable decreases at a constant rate as the other increases. When r = 0, there is no linear relationship between the variables, although a nonlinear relationship may still exist.
With real data, you would rarely expect to see r values of exactly -1, 0, or 1.
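A minimal sketch of how Pearson's r can be computed by hand is shown here. The function name `pearson_r` and the marketing and sales figures are hypothetical, chosen only to illustrate the perfect and near-perfect cases.

```python
import math

def pearson_r(xs, ys):
    """Return Pearson's r for two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]  # deviations from the mean of x
    dy = [y - mean_y for y in ys]  # deviations from the mean of y
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_xx = sum(a * a for a in dx)
    sum_yy = sum(b * b for b in dy)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("r is undefined when one variable never changes")
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical figures, for illustration only.
marketing = [10, 20, 30, 40, 50]
sales_perfect_up = [2 * m + 5 for m in marketing]      # exactly on a rising line
sales_perfect_down = [100 - 3 * m for m in marketing]  # exactly on a falling line
sales_realistic = [25, 48, 61, 90, 102]                # rising, but not exactly linear

r_up = pearson_r(marketing, sales_perfect_up)
r_down = pearson_r(marketing, sales_perfect_down)
r_real = pearson_r(marketing, sales_realistic)
print(r_up, r_down, r_real)
```

The first two results are 1 and -1 (up to floating-point rounding) even though the lines have different slopes, because r reflects how tightly the points follow a line, not how steep that line is. The third result is close to 1 without reaching it.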
Generally, the closer r is to 1 or to -1, the stronger the correlation, as shown in the following table.
r value | Correlation |
---|---|
0.90 to 1 or -0.90 to -1 | Very strong correlation |
0.70 to 0.89 or -0.70 to -0.89 | Strong correlation |
0.40 to 0.69 or -0.40 to -0.69 | Modest correlation |
0.20 to 0.39 or -0.20 to -0.39 | Weak correlation |
0 to 0.19 or 0 to -0.19 | Very weak or no correlation |
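The table can be expressed as a small lookup function. This is a sketch under one assumption: values that fall between the listed bands (for example, 0.895) are assigned to the lower band. The function name `describe_strength` is hypothetical.

```python
def describe_strength(r):
    """Map a Pearson's r value to the strength label used in the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)  # strength ignores the sign; the sign only gives direction
    if size >= 0.90:
        return "Very strong correlation"
    if size >= 0.70:
        return "Strong correlation"
    if size >= 0.40:
        return "Modest correlation"
    if size >= 0.20:
        return "Weak correlation"
    return "Very weak or no correlation"

for value in (0.95, -0.75, 0.5, -0.25, 0.1):
    direction = "positive" if value > 0 else "negative"
    print(f"r = {value}: {describe_strength(value)} ({direction})")
```

These cutoffs are a rule of thumb. What counts as a strong correlation can vary by field and by how noisy the data is.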
Linear Correlation Conditions
For a correlation to be meaningful, three conditions need to hold: both variables must be quantitative, the relationship between them must be linear, and the effect of any outliers must be taken into account. You should check these conditions before you run a correlation analysis.
In 1973, a statistician named Francis Anscombe developed Anscombe’s Quartet to show the importance of graphing data, as opposed to simply calculating summary statistics. The four datasets in his quartet all produce the same trend line equation. The quartet illustrates why visualizations are so important: they help us identify patterns within our data that summary statistics alone can obscure.
In the example below, only the top-left scatter plot in the quartet meets the criteria of being linear without any outliers. The top-right scatter plot does not show a linear relationship, so a nonlinear model would be more appropriate. The two scatter plots on the bottom each have an outlier, which can dramatically affect the results.
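The point of the quartet can be checked numerically. The sketch below uses the values as commonly published for Anscombe's four datasets and a hypothetical helper named `summarize`. The labels I through IV follow the usual numbering, which is assumed here to match the top-left, top-right, bottom-left, and bottom-right plots.

```python
import math

def summarize(xs, ys):
    """Return (r, slope, intercept) of the least-squares line for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    r = sum_xy / math.sqrt(sum_xx * sum_yy)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept

# Datasets I, II, and III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (r, slope, intercept) in results.items():
    print(f"Dataset {name}: r = {r:.3f}, trend line y = {intercept:.2f} + {slope:.3f}x")
```

All four datasets give nearly the same r and the same trend line, so the numbers alone cannot reveal which dataset is curved or dominated by an outlier. Only a plot shows that.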
Now that you’re more familiar with the concepts around the statistical technique of correlation, you’re ready for the next unit, where you learn about linear regression.