Discover Relationships Using Linear Regression
Learning Objectives
After completing this unit, you’ll be able to:
- Define linear regression.
- Differentiate between characteristics of correlation and linear regression.
What Is Linear Regression?
In the previous unit, you learned that correlation refers to the direction (positive or negative) and the strength (very strong to very weak) of the relationship between two quantitative variables.
Like correlation, linear regression shows the direction and strength of the relationship between two numeric variables. Unlike correlation, regression uses the best-fitting straight line through the points on a scatter plot to predict Y values from X values. With correlation, X and Y are interchangeable: swapping them gives the same result. With regression, the results of the analysis change if X and Y are swapped.
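To make the point about swapping X and Y concrete, here is a minimal Python sketch. The data values are made up purely for illustration, and the function names (`pearson_r`, `regression_slope`) are hypothetical helpers rather than part of any particular tool. It computes the correlation and the regression slope in both directions.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Correlation coefficient r between xs and ys."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def regression_slope(xs, ys):
    """Slope of the least-squares line that predicts ys from xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Made-up values, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Correlation is the same whichever variable is called X.
print(pearson_r(xs, ys), pearson_r(ys, xs))

# The regression slope changes when X and Y are swapped.
print(regression_slope(xs, ys), regression_slope(ys, xs))
```

With these values, the two correlations match, while the slope for predicting Y from X differs from the slope for predicting X from Y.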
The Linear Regression Line
Just as with correlations, for regressions to be meaningful, you must:
- Use quantitative variables
- Check for a linear relationship
- Watch out for outliers
Like correlation, linear regression is visualized on a scatter plot.
The regression line is the best-fitting straight line through the points on the scatter plot. In other words, it is the line that keeps the overall distance between the points and the line as small as possible. (The standard method, least squares, measures that distance vertically and minimizes the sum of the squared distances.)
Why is this line useful? With the equation of the regression line, we can calculate, or predict, a Y value for any known X value.
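As a rough illustration of what "best-fitting" means, the sketch below assumes the usual least-squares definition: the line that minimizes the sum of squared vertical distances from the points to the line. The data and the function names (`fit_line`, `sum_squared_errors`) are made up for this example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_errors(xs, ys, slope, intercept):
    """Total squared vertical distance from each point to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up values, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
best = sum_squared_errors(xs, ys, slope, intercept)

# Nudging the line in any direction makes the total squared distance larger.
steeper = sum_squared_errors(xs, ys, slope + 0.1, intercept)
higher = sum_squared_errors(xs, ys, slope, intercept + 0.1)

print(slope, intercept)
print(best < steeper, best < higher)
```

The comparison at the end shows why the fitted line counts as "best": any other slope or intercept gives a larger total squared distance for the same points.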
To make this clearer, let's look at an example.
A Regression Example
Let’s say you want to predict how much you will need to spend to buy a house that is 1,500 square feet. You can use linear regression to make that prediction.
- Place the variable that you want to predict, home prices, on the y-axis (this is also called the dependent variable).
- Place the variable you're basing your predictions on, square footage, on the x-axis (this is also called the independent variable).
Here is a scatter plot showing house prices (y-axis) and square footage (x-axis).
The scatter plot shows homes with more square feet tend to have higher prices, but how much will you have to spend for a house that measures 1,500 square feet?
To help answer that question, create a line through the points. This is linear regression. The regression line will help you to predict what a typical house of a certain square footage will cost. In this example, you can see the equation for the regression line.
The equation for the line is Y = 113*X + 98,653 (with rounding).
What does this equation mean? According to the equation, if you bought a place with no square footage (an empty lot, for example), the predicted price would be $98,653. Here is how to work through the calculation.
To find Y, multiply the value of X by 113 and then add 98,653. In this case, we are looking at no square footage, so the value of X is 0.
- Y = (113 * 0) + 98,653
- Y = 0 + 98,653
- Y = 98,653
The value 98,653 is called the y-intercept because this is where the line crosses, or intercepts, the y-axis. It is the value of Y when X equals 0.
The number 113 is the slope of the line. Slope is a number that describes both the direction and the steepness of the line. In this case, the slope indicates that for every additional square foot, the predicted house price increases by $113.
So, here’s what you need to spend on a 1,500 square foot house:
Y = (113 * 1,500) + 98,653 = $268,153
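The same arithmetic can be written as a small Python function. This is only a sketch of the rounded equation above; the names `SLOPE`, `INTERCEPT`, and `predict_price` are chosen here for illustration.

```python
SLOPE = 113          # predicted dollars per additional square foot (rounded)
INTERCEPT = 98_653   # y-intercept in dollars: the value of Y when X is 0


def predict_price(square_feet):
    """Predicted price from the rounded regression equation Y = 113*X + 98,653."""
    return SLOPE * square_feet + INTERCEPT


print(predict_price(0))                           # 98653, the y-intercept
print(predict_price(1500))                        # 268153
print(predict_price(1501) - predict_price(1500))  # 113, the slope
```

The last line shows the slope at work: one extra square foot raises the predicted price by $113.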
Take another look at this scatter plot. The blue marks are the actual data. You can see that you have data for homes between 1,100 and 2,450 square feet.
Note that this equation cannot be used to predict the price of all houses. Since a 500-square-foot house and a 10,000-square-foot house are both outside of the range of the actual data, you would need to be careful about making predictions with those values using this equation.
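One way to stay careful is to flag any prediction that falls outside the range of the data. The sketch below assumes the 1,100 to 2,450 square foot range described above; the function names and the choice to return a true/false flag are illustrative, not a prescribed method.

```python
MIN_SQFT = 1_100   # smallest home in the data described above
MAX_SQFT = 2_450   # largest home in the data described above


def is_within_data_range(square_feet):
    """True if the value lies inside the range the line was fitted on."""
    return MIN_SQFT <= square_feet <= MAX_SQFT


def predict_price_with_check(square_feet):
    """Return (predicted_price, trustworthy) using the rounded equation."""
    price = 113 * square_feet + 98_653
    return price, is_within_data_range(square_feet)


for sqft in (500, 1_500, 10_000):
    price, trustworthy = predict_price_with_check(sqft)
    note = "within data range" if trustworthy else "outside data range, treat with caution"
    print(f"{sqft} sq ft: ${price:,} ({note})")
```

The equation still produces a number for 500 or 10,000 square feet, which is exactly why a check like this helps: the calculation itself gives no warning that the value is an extrapolation.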
The r-Squared Value
In addition to the equation in this example, we also see an r-squared value (also known as the coefficient of determination).
This value is a statistical measure of how close the data is to the regression line, or how well the model fits your observations. If the data is perfectly on the line, the r-squared value would be 1, or 100%, meaning that your model fits perfectly (all observed data points are on the line).
For our home price data, the r-squared value is 0.70, or 70%. In other words, the model explains about 70% of the variation in home prices in this data set.
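For readers curious how an r-squared value is calculated, here is a minimal sketch using the common definition: 1 minus the ratio of the squared prediction errors to the total squared variation around the mean. The numbers are made up for illustration and are not the home price data; the function name `r_squared` is likewise illustrative.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    # Squared distance from each actual value to its predicted value.
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared distance from each actual value to the overall mean.
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Every point sits exactly on the line, so the fit is perfect.
perfect = r_squared([3, 5, 7, 9], [3, 5, 7, 9])

# Made-up values with some scatter around the line Y = 0.6*X + 2.2.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
predicted = [0.6 * x + 2.2 for x in xs]
imperfect = r_squared(ys, predicted)

print(perfect)    # 1.0
print(imperfect)  # about 0.6
```

In the second case, roughly 60% of the variation in the made-up Y values is accounted for by the line, and the rest is scatter around it.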
Linear Regression Versus Correlation
You may now be wondering how to distinguish between linear regression and correlation. The table below summarizes each concept.
Linear regression | Correlation |
---|---|
Shows a linear model and prediction, predicting Y from X. | Shows a linear relationship between two values. |
Uses r-squared to measure the percentage of variation explained by the model. | Uses r to measure the strength and direction of the correlation. |
Does not use X and Y as interchangeable values (because Y is predicted from X). | Uses X and Y as interchangeable values. |
Being familiar with the statistical concepts of correlation and regression helps you to explore and understand the data you work with by examining relationships.