In the last issue of CognitiveTimes, we discussed the concept of regression and how models based on this concept are used to predict variables of interest.
These could be anything, ranging from stock prices to real estate prices to weather prediction.
We looked at a very simple example and then went over the mathematics behind the concept to try and visualize and truly understand how this predictive model works.
Any prediction based on any model always contains some degree of error, and mathematicians and statisticians are on a constant quest to minimize it. The simple models we saw in the last article were not practical and rarely occur in the real world. The assumption that our variable of interest, the variable we are trying to predict, depends on just one other variable or feature about which we have information is not something that happens in the natural world.
The price of a house (our variable of interest, the one we were trying to predict) being based on just the square footage of the house is an oversimplification. It is a great example to start with and to understand the basic concepts, but to create and understand meaningful models, we must forge ahead into slightly more complicated territory.
In actuality, if we take the same example, there are multiple variables or features that the price of a house could depend on. For instance, here are some of the factors that could be responsible for the value of a house.
- Covered Area
- Neighborhood/Location
- No. of Bedrooms
- No. of Bathrooms
- Date of Construction
- Recently Renovated? Yes or No?
There are so many more things we can add:
- Does the house have a pool?
- Does it have wooden floors?
- Which school district is the house located in?
All of these factors have a bearing on the value of the house.
Let’s recall some regression basics.
The two most basic types of regression are simple linear regression and multiple linear regression. There are many more complicated regression methods, but for this tech explainer we will focus on the two mentioned above. Part 1 discussed simple linear regression in detail, where we assumed that our variable of interest, the one we want to predict, depended on just one other independent variable X.
In multiple linear regression, we assume that more than one independent variable controls our variable of interest. The general form of each type of regression is:

Simple linear regression: Y = a + bX + u

Multiple linear regression: Y = a + b1X1 + b2X2 + ... + bnXn + u
So in multiple linear regression, we assume Y depends on several X variables. As discussed above, X1 could be the covered area of the house, X2 could be the number of bedrooms, and so on and so forth.
Our Model:
• Y = the variable that we are trying to predict (dependent variable).
• X values = the variables we are using to predict Y (independent variables).
• a = the intercept.
• b = the slope coefficients (b1 through bn in the multiple regression form).
• u = the regression residual.
The reason we looked at simple linear regression in Part 1 was that it is much easier to visualize what is going on when we are looking at just two variables: the one we want to predict, Y, and the one that we feel predicts it, X. In the last article we went over how we find the equation of our model using the method of least squares.
In multiple linear regression, we do the exact same thing. The relationship is still linear, but we are now looking at more than one independent variable. This means that the example we worked out in Part 1 now takes place in an n-dimensional space, where n depends on the number of features or variables X that we are using to predict Y.
This is where calculations on a piece of paper, by hand, become near impossible, and where the incredible mathematical libraries in Python turn these calculations into a couple of lines of code.
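To make this concrete, here is a minimal sketch of what those "couple of lines of code" might look like with the scikit-learn library. The feature names and numbers are hypothetical placeholders, not data from the article:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: each row is a house, each column a feature
# (covered area in sq ft, number of bedrooms, number of bathrooms).
X = np.array([
    [1500, 3, 2],
    [2100, 4, 3],
    [1200, 2, 1],
    [1800, 3, 2],
    [2500, 4, 3],
])
# Hypothetical sale prices (in thousands) for the houses above.
y = np.array([250, 340, 190, 280, 400])

# Fit the multiple linear regression model by least squares.
model = LinearRegression()
model.fit(X, y)

print("Intercept (a):", model.intercept_)
print("Coefficients (b1, b2, b3):", model.coef_)

# Predict the price of a new, unseen house.
new_house = np.array([[2000, 3, 2]])
print("Predicted price:", model.predict(new_house))
```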
We will use the graphic below to go over each step in the multiple linear regression workflow.
Step 1: The story always begins by analyzing the data and plotting the different variables that we feel could influence our variable of interest. Sometimes this is done by subject matter experts. In our very simple real estate example, a realtor could come up with some very interesting features or variables that, from their experience, they feel would influence the price of a house. In the same way, other subject matter experts would have good insight into which variables to choose as features. Data scientists are also able to come up with a list of possible features by analyzing graphs of the data to find relationships of interest.
Step 2: After we have a list of features, we proceed to fit the model using Python or other statistical software. How do we tell if our model is a good fit? There are multiple tests to go through:
- Global F Test
- Adjusted R2
- Mean Squared Error
- Cross-Validation (CV) Criteria
Detailed explanations of the above are outside the scope of this explainer. When you use Python or other statistical software to fit your model, all these values are calculated for you and presented in a table. You then analyze these values to determine how well the model fits the data. As an example of how this is done, let’s look at the R2 value.
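For instance, Python's statsmodels library reports most of these diagnostics in a single summary table. A minimal sketch, fitted on randomly generated placeholder data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical data: 100 observations, 3 features, plus noise.
X = rng.normal(size=(100, 3))
y = 2.0 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(scale=0.5, size=100)

# statsmodels expects an explicit intercept column.
X_with_const = sm.add_constant(X)

results = sm.OLS(y, X_with_const).fit()

# The summary table includes R-squared, adjusted R-squared, the global
# F test and its p-value, and the coefficient estimates with standard errors.
print(results.summary())

# Individual statistics are also available as attributes.
print("R-squared:", results.rsquared)
print("Adjusted R-squared:", results.rsquared_adj)
print("F-statistic:", results.fvalue)
print("MSE of residuals:", results.mse_resid)
```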
The coefficient of determination (R-squared) is a statistical metric used to measure how much of the variation in the outcome can be explained by the variation in the features we have chosen. R2 always increases as more predictors or features are added to the MLR model, even when those predictors are not related to the outcome variable. This is why data science is sometimes called an art: finding the right number of features or predictors is a skill learnt with experience. It also means that R2 on its own cannot be used to identify which features should be included in a model, as the value will increase even when a feature does not influence our variable of interest. R2 can only be between 0 and 1, where 0 indicates that the outcome is not predicted by any of the features (a poor model) and 1 indicates that the outcome can be predicted without error from the features (a perfect model).
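The sketch below illustrates that inflation effect: a column of pure noise is added to a model, and R2 still does not decrease, while adjusted R2 penalizes the useless column. The data are, again, random placeholders:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Hypothetical data: y truly depends on only one feature.
x_real = rng.normal(size=(200, 1))
y = 3.0 + 2.0 * x_real[:, 0] + rng.normal(scale=1.0, size=200)
x_noise = rng.normal(size=(200, 1))  # an unrelated "feature"

small = sm.OLS(y, sm.add_constant(x_real)).fit()
large = sm.OLS(y, sm.add_constant(np.hstack([x_real, x_noise]))).fit()

# R-squared never decreases when a predictor is added,
# while adjusted R-squared penalizes the extra, useless column.
print("R2 small:", small.rsquared, "adjusted:", small.rsquared_adj)
print("R2 large:", large.rsquared, "adjusted:", large.rsquared_adj)
```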
Step 3: There are some assumptions that the data has to fulfill when we use a multiple linear regression model, so we check for several things (a short diagnostic sketch in Python follows the list below):
- 3 or more variables
- No major outliers or points of excessive influence
- Relationships between variables are linear and additive
- No autocorrelation
- No multicollinearity
- Data is homoscedastic
- Residuals have normal distribution
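The sketch below shows how a few of these checks might be run in Python, using the statsmodels and scipy libraries on randomly generated placeholder data; the thresholds in the comments are common rules of thumb, not hard requirements.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson
from scipy.stats import shapiro

rng = np.random.default_rng(2)

# Hypothetical dataset with three features.
X = rng.normal(size=(150, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.8, size=150)

X_const = sm.add_constant(X)
results = sm.OLS(y, X_const).fit()

# Multicollinearity: variance inflation factors well above ~5-10 are a warning sign.
for i in range(1, X_const.shape[1]):  # skip the constant column
    print(f"VIF for feature {i}:", variance_inflation_factor(X_const, i))

# Autocorrelation: the Durbin-Watson statistic should sit close to 2.
print("Durbin-Watson:", durbin_watson(results.resid))

# Normality of residuals: a small Shapiro-Wilk p-value suggests non-normal residuals.
stat, p_value = shapiro(results.resid)
print("Shapiro-Wilk p-value:", p_value)
```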
Step 4: Address any concerns in the assumptions. There are corrective measures that can be taken to address concerns in the data. For instance, if we have:
- Heteroscedastic data, then we can try to transform the variable we are trying to predict (a short sketch follows this list).
- If the residuals are non-normal, we can use a subset of the data, or check the data for outliers and remove them.
- If there is autocorrelation, we can remove a predictor or feature variable.
- If we have missing data, we can add dummy data there or treat the data set in other ways.
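As an example of the first corrective measure, a common fix for heteroscedastic data is to fit the model on a log-transformed target. A minimal sketch, assuming house prices as the variable being predicted and made-up placeholder numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical prices whose spread grows with the covered area
# (a classic heteroscedastic pattern).
area = np.array([1200, 1500, 1800, 2100, 2500, 3000], dtype=float)
price = np.array([190, 250, 300, 360, 450, 560], dtype=float)

# Fit on log(price) instead of price to stabilize the variance.
X = area.reshape(-1, 1)
model = LinearRegression().fit(X, np.log(price))

# Predictions come back on the log scale, so exponentiate to get prices.
predicted_price = np.exp(model.predict(np.array([[2000.0]])))
print("Predicted price:", predicted_price)
```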
Step 5: In the end, we use our test data to check our model fit. Remember, the dataset you used to create the model cannot be the dataset used to check the model. If you have limited data, split the dataset at the beginning: use one set to create the model and the other set to test it.
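In scikit-learn, splitting the data up front is a one-liner. A minimal, self-contained sketch with placeholder data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)

# Hypothetical feature matrix and target.
X = rng.normal(size=(200, 4))
y = 1.0 + X @ np.array([2.0, -1.5, 0.5, 0.0]) + rng.normal(scale=0.7, size=200)

# Hold back 25% of the rows purely for testing the fitted model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Judge the fit only on data the model has never seen.
print("Test R2:", r2_score(y_test, model.predict(X_test)))
```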
Since we want this tech explainer to be as accessible as possible, while making sure you walk away with some practical know-how of how linear regression works, let's end our discussion with a practical example.
An investment banker may want to know how the market affects the price of a stock they are interested in. Let's assume this stock is Chevron.
In this case the variable of interest, or the dependent variable, is the Chevron stock price – the variable we are trying to predict. And the predictor, feature or independent variable, is the value of the S&P 500 index.
As discussed before, in reality, more than one independent variable will influence the stock price. So in addition to the performance of the market, we could also add variables such as the price of oil, interest rates, and the price movement of oil futures. These are all variables that can affect the price of the Chevron stock, as well as other oil companies. This is a great example of where multiple linear regression can be used.
As discussed above, MLR models will examine how these multiple independent variables affect our variable of interest, the Chevron stock price. The basic equation of our model will take the form:

yi = B0 + B1xi1 + B2xi2 + B3xi3 + B4xi4 + E

where, for i = 1, ..., n observations:
• yi = dependent variable—the price of Chevron
• xi1 = interest rates
• xi2 = price of oil
• xi3 = value of the S&P 500 index
• xi4 = price of oil futures
• B0 = y-intercept at time zero
• B1 = the regression coefficient that measures the change in the Chevron stock price for a unit change in interest rates.
• B2 = the regression coefficient that measures the change in the Chevron stock price for a unit change in the oil price.
• B3 = the regression coefficient that measures the change in the Chevron stock price for a unit change in the value of the S&P 500 index.
• B4 = the regression coefficient that measures the change in the Chevron stock price for a unit change in the price of oil futures.
These regression coefficients, B0, B1, B2, ..., B4, are calculated using statistical software or programming languages such as Python.
In this example, we have used 4 predictors. If we feel more elements influence the price of the Chevron stock, we can add more predictors following the same model and equation format as above.
As we have discussed before, no model is 100% accurate, and the actual data points can differ slightly from the outcome predicted by our model. The error term, or residual value E, which is the difference between the actual and the predicted value, is therefore always included in the model equation. The goal of any model is to minimize this error term as much as possible.
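To show the mechanics, here is a minimal sketch of how the coefficients B0 through B4 could be estimated with Python's statsmodels library. The column names and every number below are made-up placeholders, not real market data, so the fitted values mean nothing; the sketch only illustrates the workflow.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 250

# Entirely synthetic placeholder columns standing in for the real series.
data = pd.DataFrame({
    "interest_rate": rng.normal(3.0, 0.5, n),
    "oil_price": rng.normal(80.0, 10.0, n),
    "sp500": rng.normal(4000.0, 200.0, n),
    "oil_futures": rng.normal(82.0, 10.0, n),
})
# Placeholder "Chevron price" built from the columns plus noise.
data["chevron"] = (
    150.0
    - 1.5 * data["interest_rate"]
    + 0.6 * data["oil_price"]
    + 0.01 * data["sp500"]
    + 0.3 * data["oil_futures"]
    + rng.normal(scale=2.0, size=n)
)

# Fit yi = B0 + B1*xi1 + B2*xi2 + B3*xi3 + B4*xi4 + E by ordinary least squares.
results = smf.ols("chevron ~ interest_rate + oil_price + sp500 + oil_futures", data=data).fit()

# The fitted B0..B4 and the R-squared appear in the output.
print(results.params)
print("R-squared:", results.rsquared)
```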
An example of the sort of output we would receive from statistical software for the example above is shown in the graphic here.
This can be interpreted to mean that, if other variables are held constant, the price of the Chevron stock will:
• Increase by 8.9% if the price of oil increases by 1%
• Decrease by 1.5% if the interest rates are increased by 1%
• Increase by 4.5% if the S&P 500 index increases by 1%
• Increase by 6% if the price of oil futures increases by 1%
The R2 term indicates that 87.8% of the variation in the Chevron stock price can be explained by our four features/predictors – oil price, price of oil futures, interest rates and the value of the S&P 500 index.
Hopefully, this example, in conjunction with the deep dive we did for simple linear regression, will make the concept of linear regression clearer for you.