Mastering linear regression in Excel can add a versatile analytical tool to your skillset, giving you the ability to make predictions about the relationship of dependent and independent variables. Though relatively simple to implement, linear regression—also known as regression analysis—is a powerful analytical technique with wide applicability across myriad business use cases.
Running a linear regression in Excel is a relatively straightforward technique that allows you to make real-world predictions by examining linear relationships between dependent and independent variables and the effect of those variables upon one another. This guide shows you how to unlock the predictive potential hidden within your datasets in four simple steps.
Table of Contents
How Does a Linear Regression Work?
Linear regression is used for quantifying the relationship between an independent variable and a dependent variable, which enables you to make predictions for how real world scenarios might unfold. This ability serves a wide range of purposes in the business world.
For example, a sports drink company makes the reasonable assumption that its beverage sales are directly linked to the outdoor temperature of a given geographic sales region. The hotter the temperature, the more beverage units sold—in other words, regional beverage sales depend on the temperature of the given region. Based on this initial information, you could assign “beverage sales” to the y (dependent) variable—the target, or value you’re trying to predict—and “temperature” to the x (independent) variable.
Our example dataset aligns average daily temperatures with average daily sports drink sales during a given time period.
Average Temperature (°F) | Sports Drink Sales – Noon (average/day) |
---|---|
72 | 50 |
78 | 65 |
82 | 80 |
88 | 95 |
92 | 110 |
97 | 125 |
102 | 140 |
108 | 155 |
112 | 170 |
118 | 185 |
122 | 200 |
128 | 215 |
The goal is to make predictions based on historical data points by calculating and plotting out the trendline.
The Equation
Mathematically, linear regression uses the easy-to-interpret formula listed below, followed by a detailed breakdown of what each of the formula’s variables represents:
y = a + bx
- (y) Dependent Variable: This variable depends on the other measured factors to the right of the equals sign.
- (a) Regression Intercept Term: The constant value of the dependent variable when the independent variable x is set to zero—in other words, the point where the function crosses over the y-axis.
- (b) Regression Slope Coefficient: The value that determines the slope of the line (i.e., the change in the independent x variable for a unit change in the independent y variable).
- (x) Independent Variable: The variable used to predict the value of another variable; because it’s used to predict the value of the dependent variable, x is also referred to as the predictor.
The Shape of a Linear Regression
When plotted on a graph, the shape of a linear regression takes the form of a straight line with any valid slope value—essentially, the angle or direction at which the line banks:
- Positive Slope: Means that y increases as x increases, and the function banks upwards from left to right.
- Negative Slope: Means that y decreases as x increases, and the function banks downwards from left to right.
- Zero Slope: Means that y is a constant that doesn’t change; basically a horizontal line.
How to Do a Linear Regression in Excel
The following steps show how to set up and run a linear regression in recent versions of Microsoft Excel. It’s possible in other spreadsheet tools like Google Sheets or Zoho Sheets as well—the steps are generally similar, but the details will vary.
Despite its simplicity, performing linear regressions in previous versions of Excel could be challenging, but more recent versions include advanced data analysis tools that make linear regressions easy.
Step 1: Input Historical Values Into Excel
Start by inputting your historical data into Excel. For our sports drink company example, enter the information from the dataset detailed above—your results should resemble the example below.
Step 2: Plot Your Historical Data Using a Scatter Plot
Next, plot your historical data using a scatter plot.
- Select all the fields with data, including the header labels.
- Click the “Insert” tab on the top-level menu.
- Select “Scatter,” followed by the first option: “Scatter with only Markers.”
Your results should resemble the following example.
Even before placing the trendline, you can see that a highly consistent linear correlation between the variables is easily observable.
Step 3: Place Your Trendline
To complete your linear regression, you’ll need to draw the trendline against your scatter plot.
- Right-click the scatter plot chart and select “Add Trendline.”
- From the Format tab menu, choose “Linear.”
- Check the boxes for both “Display Equation on Chart” and “Display R-squared value on chart.”
Because you elected to display the linear regression equation and the R-squared value, the results—listed below—are placed directly inside of the chart.
y = 2.983x – 165.56
R² = 0.9992
Step 4: Calculate Your Prediction
Using this equation, you can now calculate a sales number prediction by directly plugging in a value for x. For this example, we can predict sports drink sales for 65 °F weather.
- Replace x in the equation y = a + bx:
y = 2.983(65) – 165.56
2. Solve the equation for y.
In this example, the value is 28, which means that in 65 °F weather we could expect sports drink sales of $28. You can verify this by plotting the new values into the original graph:
The R² value—also referred to as the coefficient of determination—indicates how well your data fits the regression model (goodness of fit). An optimal R² value typically is 0.9 or above, so your linear regression model has a high goodness of fit.
Linear Regression Disadvantages
Simplicity and elegance come at a price, and linear regression formulas are not without their caveats. The general rule of thumb regarding sample size in the linear regression formula is a minimum of 10 to 20 cases per independent variable. If three independent variables are used, for example, a minimum sample size of 30 to 60 items is required for making accurate predictions.
When using linear regression, the following five conditions must also be true for results to be valid:
- Linear Relationship: The dependent and independent variables must have a linear relationship.
- Normal Distribution: All variables, and the independent variable in particular, follow a normal distribution referred to as multivariate normality
- No Multicollinearity: Multicollinearity is when two or more independent variables are highly correlated with each other.
- No Autocorrelation: Autocorrelation is the degree to which the same variables between two successive time intervals are correlated.
- Homoscedasticity Exists: The variance between datasets must be consistent, with error being constant along the dependent variable values.
Frequently Asked Questions (FAQs)
How can I evaluate the performance of a linear regression?
You can use common metrics to assess the performance of your linear regression model, including the R² value for measuring the proportion of variance explained by your equation, as well as mean squared error (MSE) or root mean squared error (RMSE) for quantifying the average squared difference between predicted and observed values.
How do I know when to use a linear regression formula?
You should use a linear regression formula when a linear relationship between the dependent and independent variables is observed or suspected in your dataset. Also, linear regression is ideal for predicting a continuous outcome when the previously mentioned assumptions are satisfied based on the nature of your data.
What’s the difference between simple linear regression and multiple linear regression?
A simple linear regression takes one dependent variable and one independent variable and depicts their linear relationship with a straight line. On the other hand, multiple linear regression incorporates two or more independent variables to predict a single dependent variable, allowing for a more complex modeling of relationships.
How do outliers impact the results of linear regression analysis?
Outliers can significantly influence the results of a linear regression, distort the estimated coefficients, and affect the trendline. Techniques like data transformation or robust regression methods can be employed to mitigate the impact of outliers.
Bottom Line: Excel-based Linear Regression For Easy Desktop Predictions
Whether you’re looking to enhance your decision-making processes, optimize business strategies, or simply gain a deeper understanding of the data that surrounds you, linear regression in Excel allows data professionals to leverage powerful statistical tools in a ubiquitous, broadly familiar software tool. That said, you should also understand the limitations of unspecialized software; remember that while Excel is a convenient tool for basic regression analysis, it may lack some advanced features found in dedicated statistical software. Always interpret results cautiously, and consider using more specialized statistical software for in-depth analysis, if needed.
If you’re learning to perform statistical analysis using readily available spreadsheet tools, read our guide to running Monte Carlo simulations in Excel using five easy steps.