Please Note: Blog posts are not selected, edited or screened by Seeking Alpha editors.

Regression Analysis: A Concept That Every Investor Should Understand

I have written and plan on writing many articles that rely heavily upon a technique called regression analysis. Regression tests the relationship between one or more X variables (independent variables) and a single Y variable (dependent variable). Think of it like a science fair experiment: when you change X, what happens to Y? The output of a regression analysis gives many valuable pieces of information about how the variables are related and can help investors draw informed conclusions about risk and reward. For this explanation, I will use the calculation of beta as an example.

Why Use Regression?

The market contains many sets of data that we know logically should be correlated. For example, we know that the returns of the S&P 500 and most individual stocks SHOULD be correlated. But is this always the case? The answer is definitely NO. Most utility stocks have very little relationship to the market's returns, yet most financial data services will report a value for beta and leave it to you to know whether this number is useful or not. Having an understanding of regression allows you to "sanity check" these reported numbers and prevents you from accidentally making important decisions based upon useless data.

R Squared

The first sanity check that regression gives us is a value called R squared. In examining two sets of data, the regression essentially creates a scatterplot of data points on an X and Y axis, a bit like what you would see on a graphing calculator. Then, given these data points, the regression chooses the straight line that best fits them. To be exact, it chooses the line that minimizes the sum of the squared vertical distances between the data points and this "line of best fit." In other words, if you measure the vertical distance between each data point and the line, square each distance, and add them all up, the total will be lower than it would be for any other straight line. Take a look at the following regression chart for Consolidated Edison (NYSE:ED)...

The data points look more like random tosses at a dartboard than any kind of real relationship. Even so, you can see that the regression analysis was forced to pick a line of best fit. If an investor chose to rely on this line (its slope is the number financial data sites report as beta), you can see exactly how misled they might be. The line has almost no predictive power because the data points are scattered so widely around it.
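To make the line-fitting step concrete, here is a minimal sketch of how a line of best fit can be computed for one X variable. The function name fit_line and the return figures are my own illustrative assumptions (the numbers are made up, not real ED or S&P 500 data), and a real analysis would normally use a statistics package rather than hand-rolled code.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope that minimizes the sum of squared vertical distances is
    # the co-movement of X and Y divided by the spread of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly returns, invented purely for illustration.
market_returns = [0.02, -0.01, 0.03, 0.01, -0.02, 0.04]
stock_returns = [0.01, 0.00, 0.00, 0.02, -0.01, 0.01]

beta, alpha = fit_line(market_returns, stock_returns)
print(f"beta = {beta:.3f}, alpha = {alpha:.4f}")
```

In a beta regression, X is the market's return and Y is the stock's return, so the slope returned here plays the role of beta and the intercept plays the role of alpha.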

Visually, you can see that the line is useless, but this can also be quantified so that we don't even need a chart. One of the outputs of a regression analysis is a value called R squared. R squared essentially tells you what fraction of the variance in Y is explained by your model, or line of best fit. For Consolidated Edison, the R squared value of its beta regression is 0.05. In other words, when using the S&P 500's returns to predict the returns of Consolidated Edison, only 5% of the variance is explained. Thus, 95% of the variance in Consolidated Edison's returns is due to factors other than the S&P 500's returns. For most stocks, R squared is significantly higher, but you never know until you perform the regression yourself or look for a reported value of R squared for a given beta calculation.
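As a rough sketch of where that number comes from, R squared can be computed as one minus the ratio of the unexplained variation (squared distances from the line) to the total variation (squared distances from the average of Y). The function name r_squared and the sample returns below are illustrative assumptions, not the data behind the 0.05 figure.

```python
def r_squared(xs, ys):
    """Fraction of the variance in ys explained by a least-squares line on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Fit the line of best fit (slope and intercept).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Unexplained variation: squared distances from each point to the line.
    ss_residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total variation: squared distances from each point to the mean of Y.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total


# Hypothetical monthly returns, invented purely for illustration.
market_returns = [0.02, -0.01, 0.03, 0.01, -0.02, 0.04]
stock_returns = [0.01, 0.00, 0.00, 0.02, -0.01, 0.01]

print(f"R squared = {r_squared(market_returns, stock_returns):.2f}")
```

A value near 1 means the points hug the line; a value near 0 means the line explains almost nothing, which is the dartboard situation described above.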

Testing For Statistical Significance

R squared is applicable to every regression analysis, but what happens if you include multiple X variables in your regression? Some of the variables may be very effective at predicting Y, while others may be useless. The second part of a regression analysis uses probability to test each X variable. For a given X variable, it tells you how likely it is that a relationship at least as strong as the one observed would show up by pure chance if that X variable and the Y variable were actually unrelated. This probability is usually called the p-value. Normally, you decide upon a certain "critical" probability above which you will consider the variable useless. For example, in my regression analyses, if the test gives a probability of more than 5%, I consider that X variable useless for that model.
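One way to get a feel for this test is a permutation test, sketched below. Note that this is a swapped-in technique for illustration: regression software normally reports a probability based on a t-test, not on shuffling. The idea is to scramble the Y values many times, which destroys any real relationship, and count how often chance alone produces a slope at least as large as the real one. The function names, the sample data, the 10,000 trials, and the fixed seed are all my own assumptions.

```python
import random


def slope(xs, ys):
    """Slope of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def permutation_p_value(xs, ys, trials=10_000, seed=0):
    """Estimate how often shuffled (unrelated) data matches the observed slope."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break any real link between X and Y
        if abs(slope(xs, shuffled)) >= observed:
            hits += 1
    return hits / trials


# Hypothetical monthly returns, invented purely for illustration.
market_returns = [0.02, -0.01, 0.03, 0.01, -0.02, 0.04, 0.00, -0.03]
stock_returns = [0.03, -0.02, 0.04, 0.01, -0.02, 0.05, 0.01, -0.04]

p = permutation_p_value(market_returns, stock_returns)
CRITICAL = 0.05  # the 5% cutoff used in this article
print(f"p-value = {p:.4f}, keep this X variable: {p <= CRITICAL}")
```

If the estimated probability comes out above the 5% cutoff, the X variable would be dropped from the model under the rule described above.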


Coefficients

Coefficients are the heart of any regression analysis. Once all of the above sanity checks are passed and the model, as well as each X variable, is determined to be statistically significant, we can create a predictive model. For example, in Consolidated Edison's beta regression, if we pretend that R squared was much higher and the X variable was determined to be significant, the predictive model would look like this...

Y = BX + A

ED's Returns = 0.21 [beta] * S&P500's Returns + 0.001 [alpha]

Notice the term "alpha" at the end. Alpha is the value of Y when X equals 0. In other words, there is some other factor or set of factors, which we cannot explain, that adds a certain amount of return regardless of what the X value is. Alpha accounts for these factors. In general, it is assumed that this value is constantly changing and should usually be close to zero. If alpha is not constantly changing or not close to zero, we should probably search for a factor that is causing alpha to be a large part of the predictive model.
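Plugging numbers into the model is simple arithmetic, as this small sketch shows. The beta of 0.21 and alpha of 0.001 come from the example equation above; the function name predict_return and the 5% market return are hypothetical, and the R squared discussion explains why this particular model should not be trusted for real predictions.

```python
BETA = 0.21    # slope from the example equation for ED
ALPHA = 0.001  # intercept from the example equation for ED


def predict_return(market_return, beta=BETA, alpha=ALPHA):
    """Apply Y = BX + A to a single market return."""
    return beta * market_return + alpha


# When the market return (X) is zero, the prediction is just alpha.
print(predict_return(0.0))

# Hypothetical scenario: the S&P 500 returns 5% over the period.
print(predict_return(0.05))
```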

Models can include several variables, in which case there would still be only one Y term and one alpha term, but multiple X variables and a different coefficient for each X variable. In this way, you can incorporate as many X variables as you determine to be significant.
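For readers curious how several coefficients can be found at once, here is a minimal sketch that solves the standard least-squares equations by hand. The function name fit_multiple, the two X variables (a market return and an interest-rate change), and all of the numbers are hypothetical; in practice a spreadsheet or statistics package does this step, and this version will fail with a division by zero if two X variables are perfectly redundant.

```python
def fit_multiple(rows, ys):
    """Least-squares fit with several X variables.

    rows: one list of X values per observation, e.g. [[x1, x2], ...]
    Returns [alpha, b1, b2, ...].
    """
    # Prepend a constant 1 to every row so the first coefficient is alpha.
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    # Build the normal equations (X'X) b = X'y as an augmented matrix.
    aug = [
        [sum(d[i] * d[j] for d in design) for j in range(k)]
        + [sum(d[i] * y for d, y in zip(design, ys))]
        for i in range(k)
    ]
    # Solve with Gauss-Jordan elimination, using partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Hypothetical observations: [market return, change in interest rates].
x_rows = [[0.02, 0.001], [-0.01, 0.002], [0.03, -0.001],
          [0.01, 0.000], [-0.02, 0.003], [0.04, -0.002]]
stock_returns = [0.012, -0.009, 0.021, 0.006, -0.018, 0.030]

alpha, b_market, b_rates = fit_multiple(x_rows, stock_returns)
print(f"alpha = {alpha:.4f}, market coef = {b_market:.3f}, rates coef = {b_rates:.3f}")
```

The result has exactly the shape described above: one alpha term plus one coefficient for each X variable.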

Disclosure: I have no positions in any stocks mentioned, and no plans to initiate any positions within the next 72 hours.