Discover Relationships Using Linear Regression

Learning Objectives

After completing this unit, you’ll be able to:

Define linear regression.
Differentiate between characteristics of correlation and linear regression.

What Is Linear Regression?

In the previous unit, you learned that correlation refers to the direction (positive or negative) and the strength (very strong to very weak) of the relationship between two quantitative variables.

Like correlation, linear regression also shows the direction and strength of the relationship between two numeric variables, but unlike correlation, regression uses the best-fitting straight line through the points on a scatter plot to predict Y values from X values. With correlation, the values of X and Y are interchangeable. With regression, the results of the analysis will change if X and Y are swapped.
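
To see why swapping X and Y matters for regression but not for correlation, here is a minimal Python sketch. The five data points are hypothetical and made up for illustration (they do not come from this unit), and the helper names pearson_r and fit_line are assumptions of this example rather than part of any particular tool.

```python
# Hypothetical data, made up for illustration only (not from this unit).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]


def mean(values):
    return sum(values) / len(values)


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


def fit_line(a, b):
    """Least-squares (slope, intercept) for predicting b from a."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    slope = cov / var_a
    return slope, mean_b - slope * mean_a


# Correlation is the same in both directions.
print(pearson_r(x, y), pearson_r(y, x))

# Regression is not: predicting Y from X gives a different line
# than predicting X from Y.
print(fit_line(x, y))
print(fit_line(y, x))
```

With this made-up data, the two correlation values match, while the two fitted lines have different slopes and intercepts.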

Concepts in this unit are adapted from Introduction to Statistics.

The Linear Regression Line

Just as with correlations, for regressions to be meaningful, you must:

Use quantitative variables
Check for a linear relationship
Watch out for outliers

Like correlation, linear regression is visualized on a scatter plot.

The regression line is the best-fitting straight line through the points on the scatter plot. In other words, it is the line that keeps the overall distance between the points and the line as small as possible (in the usual least-squares approach, it minimizes the sum of the squared vertical distances from each point to the line).
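
As a rough illustration of what "best-fitting" means, the sketch below assumes the usual least-squares definition: the best line is the one with the smallest sum of squared vertical distances to the points. The points and the two alternative lines are hypothetical, chosen only for this example.

```python
# Hypothetical points, made up for illustration only (not from this unit).
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0), (4.0, 4.0), (5.0, 5.0)]


def sum_squared_distances(slope, intercept, pts):
    """Sum of squared vertical distances from each point to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in pts)


def least_squares_line(pts):
    """Return (slope, intercept) of the least-squares line through pts."""
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


best_slope, best_intercept = least_squares_line(points)
best = sum_squared_distances(best_slope, best_intercept, points)

# Two arbitrary alternative lines, for comparison only.
other_a = sum_squared_distances(1.0, 1.0, points)
other_b = sum_squared_distances(0.5, 2.5, points)

print(best, other_a, other_b)
```

For these points, the least-squares line has a smaller total than either alternative line, which is what makes it the best fit under this definition.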

Why is this line useful? Once you have the equation of the regression line, you can use it to predict a Y value from a known X value.

To make this clearer, let's look at an example.

A Regression Example

Let’s say you want to predict how much you will need to spend to buy a 1,500-square-foot house. Linear regression can help you make that prediction.

Place the variable that you want to predict, home prices, on the y-axis (this is also called the dependent variable).
Place the variable you're basing your predictions on, square footage, on the x-axis (this is also called the independent variable).

Here is a scatter plot showing house prices (y-axis) and square footage (x-axis).

A scatter plot with blue marks showing house prices (y-axis) and square footage (x-axis)

The scatter plot shows homes with more square feet tend to have higher prices, but how much will you have to spend for a house that measures 1,500 square feet?

To help answer that question, create a line through the points. This is linear regression. The regression line will help you to predict what a typical house of a certain square footage will cost. In this example, you can see the equation for the regression line.

The equation for the regression line is highlighted.

The equation for the line is Y = 113*X + 98,653 (with rounding).

What does this equation mean? If you bought a place with no square footage (an empty lot, for example), the predicted price would be $98,653. Here is how the equation is solved for that case.

To find Y, multiply the value of X by 113 and then add 98,653. In this case, we are looking at no square footage, so the value of X is 0.

Y = (113 * 0) + 98,653
Y = 0 + 98,653
Y = 98,653

The value 98,653 is called the y-intercept because this is where the line crosses, or intercepts, the y-axis. It is the value of Y when X equals 0.

The number 113 is the slope of the line. Slope is a number that describes both the direction and the steepness of the line. In this case, the slope forecasts that for every additional square foot, the house price will increase by $113.

So, here’s what you can expect to spend on a 1,500-square-foot house:

Y = (113 * 1,500) + 98,653
Y = 169,500 + 98,653
Y = 268,153

The predicted price is $268,153.
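
The same arithmetic can be sketched in a few lines of Python. The slope and intercept are the rounded values from the equation in this unit; the function name predict_price is an assumption of this example.

```python
SLOPE = 113          # dollars per additional square foot (rounded)
INTERCEPT = 98_653   # predicted price in dollars when X is 0 (rounded)


def predict_price(square_feet):
    """Predict a house price from square footage using Y = 113*X + 98,653."""
    return SLOPE * square_feet + INTERCEPT


print(predict_price(0))      # 98653
print(predict_price(1500))   # 268153
```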

Take another look at this scatter plot. The blue marks are the actual data. You can see that you have data for homes between 1,100 and 2,450 square feet.

A scatter plot with blue marks, a gray regression line, and orange lines showing where X and Y meet on the regression line

Note that this equation cannot be used to predict the price of all houses. Since a 500-square-foot house and a 10,000-square-foot house are both outside of the range of the actual data, you would need to be careful about making predictions with those values using this equation.
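
One way to stay cautious about predictions outside the data is to flag them. The sketch below is a hypothetical helper, not part of any tool described here. It uses the rounded equation from this unit and the 1,100 to 2,450 square foot range of the actual data.

```python
SLOPE = 113
INTERCEPT = 98_653
DATA_MIN_SQFT = 1_100   # smallest home in the data
DATA_MAX_SQFT = 2_450   # largest home in the data


def predict_with_range_check(square_feet):
    """Return (predicted_price, within_data_range)."""
    price = SLOPE * square_feet + INTERCEPT
    within = DATA_MIN_SQFT <= square_feet <= DATA_MAX_SQFT
    return price, within


for sqft in (500, 1_500, 10_000):
    price, within = predict_with_range_check(sqft)
    note = "within the data range" if within else "outside the data range, treat with caution"
    print(f"{sqft} sq ft: ${price:,} ({note})")
```

The equation still produces a number for 500 or 10,000 square feet, but the flag is a reminder that the data gives no evidence the line holds at those sizes.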

The r-Squared Value

In addition to the equation in this example, we also see an r-squared value (also known as the coefficient of determination).

The r-squared value for the regression line is highlighted.

This value is a statistical measure of how close the data is to the regression line, or how well the model fits your observations. If the data is perfectly on the line, the r-squared value would be 1, or 100%, meaning that your model fits perfectly (all observed data points are on the line).

For our home price data, the r-squared value is 0.70, or 70%. In other words, the model explains about 70% of the variation in home prices in this data.
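
For readers who want to see how an r-squared value can be computed, here is a minimal sketch using the common definition: 1 minus the ratio of unexplained variation to total variation. The observed and predicted values are hypothetical (the unit does not list the underlying home price data), and the function name r_squared is an assumption of this example.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - (unexplained variation / total variation)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical observations and the values a fitted line predicts for them.
observed = [2.0, 4.0, 5.0, 4.0, 5.0]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]
print(r_squared(observed, predicted))

# When every point sits exactly on the line, r-squared is 1.
print(r_squared([3.0, 5.0, 7.0], [3.0, 5.0, 7.0]))
```

For this made-up data the result is about 0.6, so the line accounts for roughly 60% of the variation in the observed values.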

Linear Regression Versus Correlation

You may now be wondering how to distinguish between linear regression and correlation. The table below summarizes each concept.

Linear regression	Correlation
Shows a linear model and prediction, predicting Y from X.	Shows a linear relationship between two values.
Uses r-squared to measure the percentage of variation explained by the model.	Uses r to measure the strength and direction of the correlation.
Does not use X and Y as interchangeable values (because Y is predicted from X).	Uses X and Y as interchangeable values.

Being familiar with the statistical concepts of correlation and regression helps you to explore and understand the data you work with by examining relationships.

Resources

Book: Online Statistics Education: An Interactive Multimedia Course of Study, 2020
