ISL-Chapter-03 (Interactive Learning)
Chapter 03 - Linear Regression
- import libraries
- Simple Linear regression: Univariate Linear Regression
- Multi-variate (Multiple) Linear Regression
- Summary
- Lab
Note: The content of this post is based on my understanding of Chapter 03 of the book Introduction to Statistical Learning, where the authors introduce the concept of Linear Regression and provide a comprehensive discussion.
If you haven't read this chapter of the book, I recommend reading it first (you can find its free online version via the provided link). Please feel free to change the code and see how the results behave. You may find something interesting...
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn import utils
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from scipy import stats
You have probably heard about regression a lot. It is an important topic that has helped practitioners in many different areas. It can also help us better understand statistical models in general, and it can be considered a fundamental tool for building several more complicated models.
Why are you digging into the data?
It is important to improve our ability to ask questions and then try to find the answers! Recall the advertising data, where the features were the TV/Radio/Newspaper budgets and the target was sales. Given the data, we are asked to come up with a marketing plan for the next year.
Ok, so what kind of questions should we ask?
- Is there any evidence of a relationship between the media (features) and sales (target, output)?
- If yes, how strong is that relationship?
- Which media are associated with sales? And by how much?
- How accurately can we predict sales based on the given data?
- Is the relationship between the features and the target linear? If yes, we can use linear regression. If not, can we transform the features and/or the response to make the relationship linear? (Why linear? Because its interpretation is easy.)
- Is there any interaction between features?
For now, we assume a single feature and we want to model its relationship to y using a linear regression.
$y \approx \hat{y}=wx + b $
And, that's it! We just want to use a (linear) line with slope $w$ and y-intercept $b$. The goal is to find the $w$ and $b$ that minimize the MSE.
In the equation above, we regressed y onto x with a linear regression model; $w$ and $b$ are the coefficients.
Residual Sum of Squares (RSS): $RSS = \sum\limits_{i=1}^{n}{e_{i}^{2}}$, where $e_{i}=y_{i} - \hat{y}_{i} $
We can analytically minimize RSS (note that $MSE$ is just $\frac{1}{n}RSS$). In other words, we can find the optimal $w$ and $b$ analytically. This gives us the least squares coefficient estimates for the linear regression model.
(If you are interested in the math and how to estimate $w$ and $b$, please see Wikipedia).
EXAMPLE
df = pd.read_csv("./datasets/Advertising.csv", index_col=0)
df.head(2)
X = df['TV'].to_numpy()
X = X.reshape(-1,1)
y = df.iloc[:,-1].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)
y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)
plt.figure()
plt.title('regressing feature TV onto target sale')
plt.scatter(X_train.ravel(), y_train, color='r', label='train')
plt.scatter(X_test.ravel(), y_test, color='b', label='test')
plt.plot(X_train.ravel(), y_train_pred, color='k', linewidth=3, label='linear model')
plt.xlabel('TV budget')
plt.ylabel('sale')
plt.legend()
plt.show()
The black-line linear regression model is estimated/fitted on the training data set. It is then used to predict the y-values of the test data set.
w = regressor.coef_[0] #because we have only one variable
b = regressor.intercept_
print(f'w (slope): {w}, and b (y-intercept): {b}')
Discussion on accuracy:
It is important to note that the estimated parameters $w$ and $b$ are unbiased, i.e. if we fit a linear model to many different sets of training data, we can get closer to the true value of $w$ (and $b$) by taking the expected value of the estimates obtained for the different training sets. Therefore, a better notation might be $\hat{w}$ and $\hat{b}$, as they are estimates of the true values $w$ and $b$. However, we usually avoid using the hat.
Since the estimated value of a parameter (from one set of observations) has some deviation from the true value, we can calculate the Standard Error (SE) of the parameter to show how accurate the estimate is. Standard deviation (std) describes one set of observations, while SE is the standard deviation of a statistical parameter across sets of observations. However, we are able to calculate SE even from a single set of observations.
For more info, watch this video: std-vs-SE
SIDE NOTES
Before moving forward, I think it would be nice to review some concepts (we may study them in the future; however, I think they are important for better understanding SE, confidence intervals, etc.). You may want to watch the following videos (preferably in the order provided below).
Bootstrapping: Watch Bootstrapping main concept
Standard Error: Watch Standard Error
Confidence Interval: Watch Confidence Interval
Suggestion: a short note on p-value: MIT note
We want to test the following null hypothesis: there is no association between Y and X. So, we can repeat the experiment with new sets of data (or do bootstrapping) and find different values for the estimates of $w$ and $b$. Let us do that for the data we have:
n_bootstrap = 1000
w_vec = np.zeros(n_bootstrap)
b_vec = np.zeros(n_bootstrap)
data_train = np.c_[X_train, y_train]
for i in range(n_bootstrap):
    data_bootstrap = utils.resample(data_train, random_state=i)
    regressor = LinearRegression().fit(data_bootstrap[:,0].reshape(-1,1), data_bootstrap[:,1])
    w_vec[i] = regressor.coef_[0]
    b_vec[i] = regressor.intercept_
w_mu = np.mean(w_vec)
w_std = np.std(w_vec)
w_SE = w_std #note that Standard Error of w is basically the standard deviation of different values we got for w
print('Standard Error: ', w_SE)
For linear regression, we already have a formula that can give us the w_SE (see book page 66):
y_diff = y_train - y_train_pred
RSS = np.sum(np.square(y_diff))
#Residual Standard Error(RSE)
RSE = np.sqrt(RSS/(X_train.shape[0]-2))
#X_dev = sum((X-Xavg)^2)
X_dev = np.sum(np.square(X_train - np.mean(X_train)))
w_SE_formula = np.sqrt((RSE**2)/X_dev)
print('w_SE using formula of book: ', w_SE_formula)
As observed, the SE of $w$ calculated by bootstrapping is close to the value provided by the formula. We can also plot the distribution of $w$ based on the values collected throughout the bootstrapping process. And we can get an approximate 95% confidence interval by using the interval (mu - 2*std, mu + 2*std).
plt.figure()
plt.title('distribution of w')
plt.hist(w_vec,color='b',label='distribution', density=True)
plt.axvline(w,color='orange',label='estimated w')
plt.axvline(w_mu,color='k',label='distribution mu')
plt.axvline(w_mu-2*w_std,color='r',linestyle='--',label='2*std from mean')
plt.axvline(w_mu+2*w_std,color='r',linestyle='--')
plt.legend()
plt.show()
Also, you may want to dig deeper into the linear regression topic by measuring how strong such a linear association is. IF there is no association, we expect to see w=0. Right? But we cannot just use the value of $w$, because its magnitude can easily be changed by scaling the data. So, we should take its standard deviation into account to see how far $w$ is from zero. How do we test whether $w$ is significantly far from zero? In other words, does the result provide significant evidence that there is an association?
Here is where we should use hypothesis testing. Our null hypothesis is as follows: there is no association between x and y, i.e. the slope $w=0$. We have samples, and with samples we can only estimate our parameter w ($\hat{w}$). (Note that when we calculate $w$, we assume we are calculating $w$ in y=wx+b.) We would like $\hat{w}$ to be close to the true value $w=0$ IF THE NULL HYPOTHESIS IS TRUE. Now, we observed $\hat{w}$ to be around 0.05. Is it significantly far from 0? Well, with the null hypothesis $w=0$ and the standard deviation SE (note: the SE is the same under both the null and alternative hypotheses in linear regression), we can normalize the observed value and calculate the t-score as follows:
t = (w - 0)/w_SE
print('t is: ', t)
Now, if we use the t-distribution (with degrees of freedom that depend on n_samples; here n_samples - 2), we will see that the p-value is very small (almost zero). So, we know we can reject our null hypothesis (i.e. there is an association between X and Y).
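Rather than looking it up in a table, we can compute the two-sided p-value directly with scipy (a small sketch; for simple linear regression the degrees of freedom are n_samples - 2):
dof = X_train.shape[0] - 2  # degrees of freedom for simple linear regression
p_value = 2 * stats.t.sf(np.abs(t), df=dof)  # two-sided p-value of the t-score above
print('p-value: ', p_value)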
The next question is: "How accurate is the proposed (i.e. linear) model?" In other words, we know we should reject the no-association hypothesis. But can we trust our LINEAR model? (i.e. is it a good fit?)
We can use r2_score to measure it (more on this below):
regressor = LinearRegression().fit(X_train, y_train)
y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)
print('r2 score on train set: ', r2_score(y_train, y_train_pred))
print('r2 score on test set: ', r2_score(y_test, y_test_pred))
We can see that the r2_score is low for this model. A good r2_score is close to 1. (please continue for more detail)
$r2\_score = 1 - \frac{RSS}{TSS}$, where:
$RSS = \sum\limits_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}$, and $TSS = \sum\limits_{i=1}^{n}(y_{i} - \bar{y})^{2}$
We can think of RSS as the amount of variation left UNexplained after using the model. (Note that, initially (before modeling), the variation in Y was TSS.)
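To connect the formula to the code, we can compute r2_score by hand from RSS and TSS on the train set and compare it with sklearn's value (a quick sketch using the predictions computed above):
RSS_train = np.sum((y_train - y_train_pred) ** 2)
TSS_train = np.sum((y_train - np.mean(y_train)) ** 2)
print('manual r2 score: ', 1 - RSS_train / TSS_train)
print('sklearn r2 score: ', r2_score(y_train, y_train_pred))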
Now that we have learned the concepts of simple linear regression, we should see how we can extend it to more than one variable. Why not fit a separate simple (univariate) linear regression for each variable? Mainly because we would be ignoring two variables when using only one variable for predicting sales. We'll see that this can produce misleading results if there is a correlation between two or more (input) variables.
$y \approx \hat{y} = w_{1}x_{1} + w_{2}x_{2} + ... + w_{p}x_{p} + b = W^{T}X + b$
As stated earlier, we should avoid using one simple linear regression for each individual predictor, and instead, we should consider them together.
practice: try to fit a simple linear regression on each feature of the advertising data, and then try a multiple linear regression. Are the coefficients of simple-vs-multiple close for each predictor? If not, why? What does that mean? (Hint: you can plot the correlation matrix that shows the correlation between predictors.)
Let us first implement a multi-linear regression:
X = df.iloc[:,:-1].to_numpy()
y = df.iloc[:,-1].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)
y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)
print('the y intercept is: ', regressor.intercept_)
print('the coefficients correspond to three features are: ', regressor.coef_)
Please note that our results are slightly different from what is presented in the book because: (1) the authors may have used the whole data set, whereas we only used the training set here, and/or (2) even if the authors used a training set, they might have had different samples in their training set than the ones produced by the function train_test_split.
How can I determine the importance of each feature? Note that the scale of the coefficients depends on the scale of the values of each feature. So, it is better to standardize the features!
from sklearn.preprocessing import StandardScaler
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
regressor = LinearRegression().fit(X_train, y_train)
y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)
print('the y intercept is: ', regressor.intercept_)
print('the coefficients correspond to three features are: ', regressor.coef_)
print('r2 score on training set: ', r2_score(y_train, y_train_pred))
print('r2 score on test set: ', r2_score(y_test, y_test_pred))
This is an interesting result. It shows that the model with three predictors can be LINEAR! (We saw before that using a single predictor did not give a good LINEAR model.)
The results above indicate that there is a considerable relationship between the first two features and the target value. However, the coefficient of the last feature is almost zero, which suggests the last feature (i.e. Newspaper) has no significant impact on the target.
NOTE: Please note that bootstrapping for multiple linear regression is more involved. Also, calculating the SE of each coefficient requires some familiarity with linear algebra. We use a Python package to calculate this for us (sklearn's LinearRegression does not provide p-values).
X_train_sm = sm.add_constant(X_train) # a constant column should be added as new feature (will act as the y-intercept!)
regressor = sm.OLS(y_train, X_train_sm)
regressor = regressor.fit()
print(regressor.summary())
Now we can see that the p-value of the third feature is very large. It indicates that the NULL HYPOTHESIS "there is no relationship between x3 (i.e. newspaper) and y (i.e. sales)" cannot be rejected! Let us see if removing the feature newspaper can improve the r2_score:
X = df.iloc[:,:-2].to_numpy() #excluding newspaper!
y = df.iloc[:,-1].to_numpy()
# train_test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# scaling data (so the slopes are not impacted by the scale of the data) [Optional!]
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
# fit linear model to data
regressor = LinearRegression().fit(X_train, y_train)
# pred y of train & test using both train and test sets
y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)
# get the coef & intercept of fitted linear model
print('the y intercept is: ', regressor.intercept_)
print('the coefficients correspond to two features are: ', regressor.coef_)
print('-----')
# calculate the r2_score:
print('r2 score on training set: ', r2_score(y_train, y_train_pred))
print('r2 score on test set: ', r2_score(y_test, y_test_pred))
It seems the r2_score has slightly decreased! It is important to note that increasing the number of predictors increases the training $r2\_score$ anyway. (However, we can see here that the change is really small, which indicates that the extra predictor (newspaper) is not significant! Its inclusion may even hurt the model's performance on the test set.) Since R2 always changes with the number of predictors, we cannot directly compare two models of the same type on the same data that use different numbers of predictors. So, we should use the adjusted r2_score. (Read more here: Adjusted R2)
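A minimal sketch of the usual adjusted R2 formula (the helper function is just for illustration; n is the number of samples and p the number of predictors):
def adjusted_r2(r2, n, p):
    # adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2_train = r2_score(y_train, y_train_pred)
print('adjusted r2 (2 predictors): ', adjusted_r2(r2_train, n=len(y_train), p=2))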
But why do we get a not-close-to-zero slope for newspaper when it is used alone as the single predictor in a simple (univariate) linear regression? (In that case, which is not shown in this notebook, we also get a low p-value, which means there exists a relationship between y and newspaper.) So, what is going on?
The reason is that there is a correlation between newspaper and (probably) another predictor. So, when we increase newspaper by some amount $\Delta$, we are affecting the other (hidden) predictor by (a fraction of) $\Delta$ as well. (E.g. you can change the season from winter to summer. This changes the number of people who go to the beach, and that increases the sales of a person who sells ice cream! So, the actual relationship is between ice cream sales and the number of people at the beach. However, we can still use the season as the only predictor and get a linear model!)
X = df.iloc[:,:-1] # X is dataframe
print(X.corr())
Now we can see there is a correlation between Radio and Newspaper. So, even if we predict sales with newspaper alone, a change in newspaper is correlated with a change in Radio (so the value of Radio changes, and with it the value of the target, i.e. sales). Therefore, we might believe there is an actual relationship between y and newspaper. However, we know that is not true (as we saw before when we used all the features for predicting the target and got a large p-value for newspaper, and we also noticed only a small change in r2_score when it was excluded from the set of features).
Some interesting questions:
- Is at least one of the predictors useful for predicting the response? --> use the F-statistic
- If yes, then which subset of the predictors helps to explain Y? Is it all of them? --> Forward/Backward/Mixed selection
- Goodness-of-fit score?
- Given a new test point X, how do we predict its response (the corresponding target y value), and how accurate is that prediction?
> Q1: Is at least one of the predictors useful for predicting the response?
Similar to the p-values calculated for each predictor in multivariate linear regression, we can calculate a single F-statistic as follows:
$\text{F-statistic} = \frac{(TSS - RSS)/p}{RSS/(n-p-1)}$, where p is the number of predictors and n is the number of samples.
Recall that:
$TSS = {\sum\limits_{i=1}^{n}(y_{i} - \bar{y})^{2}}$, and $RSS = {\sum\limits_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}}$.
def f_statistics(y_true, y_pred, p):
    """
    This function calculates the F-statistic of a fitted linear regression model.
    Parameters
    ----------
    y_true: ground truth values of target y
    y_pred: predicted values of target y (using the fitted linear regression model)
    p: number of predictors
    output
    ---------
    f_statistics_value: scalar
        the value of the F-statistic.
    """
    RSS_by_TSS = 1 - r2_score(y_true, y_pred)  # RSS/TSS
    n = len(y_true)
    return ((n-p-1) / p) * (1/RSS_by_TSS - 1)  # = ((TSS-RSS)/p) / (RSS/(n-p-1))
f_statistics_val = f_statistics(y_train, y_train_pred, p = 3)
print('f_statistics value is: ', f_statistics_val)
So, what does this value tell us?
The F-statistic is a measure that helps us reject OR not reject the NULL HYPOTHESIS: "there is no relationship between y and any of the features" (the alternative hypothesis is: "there is at least one feature that is related to the target y").
If the F-statistic is close to one, we canNOT reject the null hypothesis (i.e. there might be no relationship between y and the features). However, if it gets sufficiently greater than one, we should reject the null hypothesis. How much greater? Well, if the number of samples is large, an F-statistic only a bit higher than 1 may be enough for rejecting the null hypothesis. However, if n_samples is small, we need a very large F-statistic.
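Instead of eyeballing "how much greater than one", we can turn the F-statistic into a p-value with the F distribution (a small sketch; the degrees of freedom are p and n - p - 1):
n, p = len(y_train), 3
p_value_F = stats.f.sf(f_statistics_val, dfn=p, dfd=n - p - 1)
print('p-value of the F-statistic: ', p_value_F)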
> Which subset of predictors helps to explain Y?
This is variable selection (a.k.a. feature selection). The goal is to find a subset of predictors that are associated with the response variable and drop the unnecessary features. Three classical approaches are: Forward selection, Backward selection, and Mixed selection (more on this in later notebooks).
> Goodness-of-fit score?
Two of the most common goodness-of-fit measures are $R^{2}$ and $RSE$. Below, we review their formulas:
$y \approx \hat{y}=wx + b $
$r2\_score = 1 - \frac{RSS}{TSS}$, with $RSS = \sum\limits_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}$, and $TSS = \sum\limits_{i=1}^{n}(y_{i} - \bar{y})^{2}$; (side note: $MSE = \frac{RSS}{n}$)
$RSE = \sqrt{\frac{RSS}{n-2}}$. Note that this is for univariate linear regression. For multivariate linear regression (with $p>1$ predictors), the formula is as follows:
$RSE = \sqrt{\frac{RSS}{n-p-1}}$. Why n-p-1? (Well, you should go deeper into statistics.) It can be shown that RSS has the following distribution:
$RSS \sim \sigma^{2}\,\chi^{2}_{n-p-1}$, where $\sigma$ is the standard deviation of the unexplained (inevitable/irreducible) part of the error.
It can be shown that $E(RSS)=\sigma^{2}(n-p-1)$. That means RSS gets bigger as the number of samples increases. To get an unbiased estimator, we can define $err = \frac{RSS}{n-p-1}$ to make sure the expected error is independent of the sample size.
NOTE: If $r2\_score$ is close to 1, the model explains a large portion of the variance in the response variable (because a high $r2\_score$ means a lower $RSS$, which basically means the predicted y is close to the true y). In other words, more of the variance in the response is explained by the model when r2_score is higher. RSE is not as interpretable as r2_score, because r2_score is normalized while RSE is not.
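As a quick sketch, the RSE of the two-predictor model fitted above can be computed directly from its train-set residuals (p = 2 here):
RSS_train = np.sum((y_train - y_train_pred) ** 2)
n, p = len(y_train), 2
RSE_train = np.sqrt(RSS_train / (n - p - 1))
print('RSE (in the units of sales): ', RSE_train)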
#STEP1: Fitting model to train data
# get X and y
X = df.iloc[:,:-2].to_numpy() #excluding newspaper!
y = df.iloc[:,-1].to_numpy()
# train_test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# scaling data (so the slopes are not impacted by the scale of the data) [Optional!]
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
# fit linear model to data
regressor = LinearRegression().fit(X_train, y_train)
# pred y of train & test using both train and test sets
y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)
min_feature, max_feature = -3, 3
feature_0, feature_1 = np.meshgrid(np.arange(min_feature,max_feature,0.1),
np.arange(min_feature,max_feature,0.1))
X_mesh = np.c_[feature_0.ravel(), feature_1.ravel()]
y_mesh = regressor.predict(X_mesh)
fig = plt.figure(figsize=(30,10))
#plot the fitted hyperplane that contains point (x,y,z)
#... where the z is the prediction of response variable for predictors x,y
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_mesh[:,0],X_mesh[:,1],y_mesh, c='b', alpha=0.1)
ax.scatter(X_train[:,0], X_train[:,1], y_train, c='r')
ax.set_xlabel('feature TV')
ax.set_ylabel('feature Radio')
ax.set_zlabel('target: Sale')
ax.view_init(elev=3, azim=110)
plt.show()
We can see that some of the points lie above the fitted plane and some others below it.
> Three types of errors associated with (multiple) linear regression:
- The parameters W and b are just estimates of the true W and b. We can use the concept of a confidence interval to show how close $\hat{f}$ is to $f$ (different estimations of the parameters, e.g. via bootstrapping, lead to different parameters... and thus a different $\hat{f}$).
- Bias error: because we assumed that the model is linear (which is usually not correct in real-world problems!), our model has some bias error. For now, we ignore this and assume our model form is correct.
- Even if we had the true estimator, our prediction would still carry the irreducible error, with distribution $N(0, \sigma)$. We use a Prediction Interval to show how close $\hat{Y}$ is to $Y$. For instance, let's say I want to focus on a subsample of the data and fit a linear model to it. I can say my model is closer to the actual model for this subsample. However, since the number of samples is smaller, we are dealing with higher uncertainty, and therefore the prediction interval becomes wider.
Get deeper: Confidence Interval vs Prediction Interval
Later :)
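As a quick preview (before we cover it properly), statsmodels can report both intervals for new points. A minimal sketch, refitting the simple TV-only model with sm.OLS (the budget values 50 and 200 are arbitrary examples):
X_tv = df['TV'].to_numpy().reshape(-1, 1)
y_sales = df.iloc[:, -1].to_numpy()
res = sm.OLS(y_sales, sm.add_constant(X_tv)).fit()
new_X = sm.add_constant(np.array([[50.0], [200.0]]), has_constant='add')
pred = res.get_prediction(new_X).summary_frame(alpha=0.05)
# mean_ci_* columns are the confidence interval; obs_ci_* columns are the (wider) prediction interval
print(pred[['mean', 'mean_ci_lower', 'mean_ci_upper', 'obs_ci_lower', 'obs_ci_upper']])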
Some extra considerations:
(1) Qualitative predictors: a fancy word for categorical features. Sometimes the predictors are not numerical, so we should transform (encode) the categorical features into numerical ones.
(2) Sometimes a qualitative predictor contains only two groups (e.g. it is day or night). In such a case, we can create a dummy variable and represent the labels day/night with the binary values 0/1. In machine learning, we call this one-hot encoding, which can also be used for a categorical feature with more than two labels.
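For instance, pandas can create such a dummy variable directly; a tiny sketch with a made-up two-level feature:
toy = pd.DataFrame({'time_of_day': ['day', 'night', 'day', 'day']})  # hypothetical categorical feature
print(pd.get_dummies(toy, columns=['time_of_day'], drop_first=True))  # one 0/1 dummy column is enough for two levels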
Note:
As depicted in the figure above, when both features have high values, the model underestimates the target value. That can mean the two features have some synergy in practice, i.e. they boost each other.
To model this, we define a new variable X = TV * Radio and add it to the previous ones (better to keep both TV and Radio as well). Let us see the result:
X = df.iloc[:,:-2].to_numpy() #excluding newspaper!
y = df.iloc[:,-1].to_numpy()
X = np.c_[X, X[:,0]*X[:,1]] #add TV * radio
# train_test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# scaling data (so the slopes are not impacted by the scale of the data) [Optional!]
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
# fit linear model to data
regressor = LinearRegression().fit(X_train, y_train)
# pred y of train & test using both train and test sets
y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)
# get the coef & intercept of fitted linear model
print('the y intercept is: ', regressor.intercept_)
print('the coefficients correspond to three features are: ', regressor.coef_)
print('-----')
# calculate the r2_score:
print('r2 score on training set: ', r2_score(y_train, y_train_pred))
print('r2 score on test set: ', r2_score(y_test, y_test_pred))
What we did can be considered feature engineering. We tried to engineer a feature to help the model!
TIP: We can even combine a categorical and a numerical feature. For instance, if X1 is categorical (with tags 0 and 1) and X2 is numerical, and the impact of X2 on the response might be different for different tags of X1, we can create an additional feature $X3 = X2 \times X1$.
>practice: In credit data: predict balance using income and student features
TIP: If the data shows a non-linear relationship, we can use the concept of "polynomial features": in addition to $X$, we may consider $X^{2}$, $X^{3}$, and so on. (However, at some point we may overfit the data, so we should be careful about it.)
Potential problems when using linear regression:
(1) Non-linearity of response-predictor relationship... let us see it with an example
df = pd.read_csv("./datasets/Auto_v2.csv") #there are two auto data sets; we are using Auto_v2.
df.head(3)
X = df['horsepower'].to_numpy()
y = df['mpg'].to_numpy()
#let us take a quick look at the graph
plt.figure(figsize=(20,5))
plt.scatter(X, y, color='b')
plt.xlabel('horsepower')
plt.ylabel('miles per gallon (mpg)')
plt.show()
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
poly.fit(X.reshape(-1,1))
X_poly = poly.transform(X.reshape(-1,1))
X_poly_train, X_poly_test, y_train, y_test = train_test_split(X_poly, y)
regressor_poly = LinearRegression().fit(X_poly_train, y_train)
y_train_pred_poly = regressor_poly.predict(X_poly_train)
y_test_pred_poly = regressor_poly.predict(X_poly_test)
print('r2 score with poly feature: ', r2_score(y_train, y_train_pred_poly))
print('r2 score with poly feature: ', r2_score(y_test, y_test_pred_poly))
regressor = LinearRegression().fit(X_poly_train[:,:2], y_train) #excluding degree2 (just simple linear regression)
y_train_pred = regressor.predict(X_poly_train[:,:2])
y_test_pred = regressor.predict(X_poly_test[:,:2])
print('r2 score using linear model w/o poly features: ', r2_score(y_train, y_train_pred))
print('r2 score using linear model w/o poly features: ', r2_score(y_test, y_test_pred))
plt.figure(figsize=(20,5))
plt.scatter(X, y, color='b')
plt.scatter(X, regressor.predict(X_poly[:,:2]), color='r', linewidth=3, label='simple feature')
plt.scatter(X, regressor_poly.predict(X_poly), color='orange', linewidth=3, label='poly feature (deg=2)')
plt.xlabel('horsepower')
plt.ylabel('miles per gallon (mpg)')
plt.legend()
plt.show()
residual_ytest_poly = y_test_pred_poly - y_test
residual_ytest = y_test_pred - y_test
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15,3))
axs[0].scatter(y_test, residual_ytest_poly)
axs[1].scatter(y_test, residual_ytest)
axs[0].set_title('residual w.r.t y (on test set, with poly feature)')
axs[1].set_title('residual w.r.t y (on test set, without poly feature)')
axs[0].set_xlabel('y_test')
axs[1].set_xlabel('y_test')
axs[0].set_ylabel('residual')
axs[1].set_ylabel('residual')
plt.show()
print('\n')
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15,3))
axs[0].hist(residual_ytest_poly, label=f'mu={np.mean(residual_ytest_poly)}, std={np.std(residual_ytest_poly)}')
axs[1].hist(residual_ytest, label=f'mu={np.mean(residual_ytest)}, std={np.std(residual_ytest)}')
axs[0].set_title('hist of residual on test set, w/ poly feature')
axs[1].set_title('hist of residual on test set, w/o poly feature')
axs[0].legend()
axs[1].legend()
plt.show()
And, to measure how close these two plots are to normal distribution, we can plot their QQ-plot:
fig, axs = plt.subplots(nrows=1,ncols=2, figsize=(10,5))
stats.probplot(residual_ytest_poly, plot=axs[0])
stats.probplot(residual_ytest, plot=axs[1])
axs[0].set_title('QQplot (with poly features)')
axs[1].set_title('QQplot (without poly features)')
plt.show()
It is hard to judge which one is closer to normal distribution...
(2) Correlation of Error Terms
We are assuming that the residual errors are uncorrelated. In time series, however, this assumption is often not met. In other words, the residual error at t+1 is correlated with the one at t. We can even see such correlation in the residual plot above (right figure): for y_test beyond 20, there is a clear trend in the residual values. (Solution?)
(3) Non-constant variance of error terms (a.k.a. heteroscedasticity): when the errors get wider / narrower at some response values. For instance, if the variance of the errors gets larger at higher values of y, one solution might be applying a log() or sqrt() transformation to shrink the larger values of y.
(4) One or a few outliers (here, we mean points with an unusual response value) may not have much impact on the slope/y-intercept of the linear model. However, they can drastically change the value of RSE / r2_score. Therefore, it might be a good idea to remove outliers to get a better sense of the performance of the model. How to detect them? One way is to plot the studentized residuals, i.e. $\frac{e_{i} - 0}{SE_{e}}$. In other words, we want to know how far $e_{i}$ is from a distribution with mean 0 and standard error $SE_{e}$, which is the standard deviation of e. If it goes beyond +/- 3, the point can be considered an outlier. It is important to note that a point might be flagged as an outlier just because our model was not good enough to capture it.
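A simplified sketch of that check on the test-set residuals computed above (here we just standardize the residuals by their standard deviation; statsmodels' influence tools give the properly studentized version):
standardized_residuals = residual_ytest / np.std(residual_ytest)
outlier_mask = np.abs(standardized_residuals) > 3  # flag points beyond +/- 3
print('number of flagged points: ', np.sum(outlier_mask))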
(5) High leverage points: these are points with extreme values in the features. They can strongly influence the fitted model, and therefore it is important to detect and handle them. How to detect them? It is important to note that searching for these points based on an individual dimension alone is not enough; we should consider all features together (see Fig 3.13 of the book).
So, we may have a point whose value in feature X1 and in feature X2 is each usual, but together, in 2-d space, the point may be far from the others. One way is to calculate pairwise distances and pick the point that has the largest distance to its nearest neighbor. Or, we may (somehow) find a value of k (number of clusters) and then remove a group of data points. Another way is to calculate a leverage statistic (see Eq. 3.37 of the book).
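That statistic is the leverage, i.e. the diagonal of the hat matrix H = X(X^T X)^{-1} X^T. A minimal numpy sketch (the function is illustrative and assumes a design matrix that already includes the intercept column):
def leverage_scores(X_design):
    # diagonal of the hat matrix H = X (X^T X)^{-1} X^T
    XtX_inv = np.linalg.inv(X_design.T @ X_design)
    return np.einsum('ij,jk,ik->i', X_design, XtX_inv, X_design)

# rule of thumb: points with leverage much larger than the average (p+1)/n deserve a closer look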
(6) Collinearity
df = pd.read_csv("./datasets/Credit.csv", index_col=0)
df.head(3)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
axs[0].scatter(df['Limit'], df['Rating'])
axs[1].scatter(df['Limit'], df['Age'])
axs[0].set_xlabel('Limit')
axs[1].set_xlabel('Limit')
axs[0].set_ylabel('Rating')
axs[1].set_ylabel('Age')
plt.show()
We can see a clear linear relationship between Limit and Rating (they are collinear). Since both of them tend to increase/decrease together, their impacts on the response are not easily separable, so the uncertainty in the estimates of the linear model parameters increases. Therefore, we should detect and remove collinear features.
- we may use the correlation matrix.
- a better way (which works for multicollinearity as well) is to measure the VIF (variance inflation factor) as follows:
$VIF = \frac{1}{1 - R^{2}(X_{j}|X_{-j})}$, where:
$R^{2}(X_{j}|X_{-j})$ is the r2_score from a regression of $X_{j}$ onto $X_{-j}$ (all predictors other than $X_{j}$). So, if a particular predictor can easily be predicted from the other predictors (i.e. high r2_score and thus high VIF), we do not need that particular predictor, as it can be obtained from the others! (Rule of thumb: a VIF exceeding 5 or 10 indicates a high level of collinearity.)
Solution: drop one of the variables with high VIF.
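As a side note, statsmodels ships a helper for this; a short sketch on three numeric columns of the Credit data (it expects the design matrix to include the constant column):
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_design = sm.add_constant(df[['Limit', 'Rating', 'Age']].to_numpy())
vif_values = [variance_inflation_factor(X_design, i) for i in range(1, X_design.shape[1])]
print('VIF of (Limit, Rating, Age): ', vif_values)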
Linear Regression-vs-KNN
If the relationship is linear, linear regression is better (and even better than KNN with large k). However, if the relationship between x and y is non-linear, then KNN (with lower k) can perform better. HOWEVER, one should note that this is not always the case! In fact, in high dimensions, when the relationship between x and y is non-linear, KNN usually performs worse, particularly when n is small relative to p. (Why? Because the data becomes sparse in the space. Think about 10 points on the [0,1] interval in one dimension, and then think about 10 points in the 2-d space $[0,1]^{2}$.)
So, parametric models (like linear regression) outperform non-parametric ones when the number of samples is low and the dimension is high!
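A quick way to see this trade-off on our data is to compare the two models directly; a small sketch on the first two Advertising features (KNN would also benefit from scaling, which is skipped here for brevity):
from sklearn.neighbors import KNeighborsRegressor

adv = pd.read_csv("./datasets/Advertising.csv", index_col=0)
X_adv = adv.iloc[:, :2].to_numpy()  # TV and radio budgets
y_adv = adv.iloc[:, -1].to_numpy()
X_tr, X_te, y_tr, y_te = train_test_split(X_adv, y_adv, random_state=0)
for name, model in [('linear regression', LinearRegression()),
                    ('KNN (k=5)', KNeighborsRegressor(n_neighbors=5))]:
    model.fit(X_tr, y_tr)
    print(name, 'test r2: ', r2_score(y_te, model.predict(X_te)))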
- Questions to ask about our data: Let us assume we have data with predictors X and the response variable y...
- Is there a relationship between X and y? If yes, how strong is that relationship?
- Which predictor(s) are associated with y, and by how much?
- Is the relationship between X and y linear? If yes, then do linear regression. If not, can we transform the data such that a linear regression model becomes applicable?
- Is there any interaction (synergy) between predictors? (i.e. two predictors together may have an impact different from their simple sum)
- Linear model: $\hat{f} = \hat{w}^{T}X + \hat{b}$
- Side Note: see table / summary of data: df.head() df.describe()
- What are the optimal values for $\hat{w}$ and $\hat{b}$? The optimal values of $\hat{w}$ and $\hat{b}$ are those that result in minimum MSE on the train set.
Reminder: our ultimate goal is to get a low MSE on the test set using the model fitted on the train set.
- MSE --> RSS (!) Let us assume the number of samples in the train set is n. Then, we know that $MSE = \frac{1}{n}\sum\limits_{i=1}^{n}{(y_{i} - \hat{y}_{i})^{2}}$ (note that $y_{i}$ is the true value of y for the i-th sample, and $\hat{y}_{i}$ is the predicted value of y for that sample). Now, if we just focus on the sum (and not the average), we have RSS (Residual Sum of Squares), and that's it! --> So, we have $RSS = \sum\limits_{i=1}^{n}{(y_{i} - \hat{y}_{i})^{2}}$, and the optimization problem is to minimize RSS over the train-set samples (we could just as well minimize MSE; there is no difference!)
- Discussion on the estimated values $\hat{w}$ and $\hat{b}$: Please note that the estimated values $\hat{w}$ and $\hat{b}$ are obtained from the current train set. So, if we change the train set, or split the data into train/test in a different way, we may get different values for $\hat{w}$ and $\hat{b}$. All of these values are just estimates, and not the true-but-unknown $w$ and $b$. So, what can we do? We can plot the distribution of each parameter and see how certain we are about our estimate. But how to get that distribution? We can do bootstrapping on the given train set (e.g. create 1000 new train sets) and then find $\hat{w}$ and $\hat{b}$ 1000 times. Now, we can create the histogram of the obtained values. Let us focus on the parameter $\hat{w}$ (the same discussion holds for $\hat{b}$). The distribution shows the likelihood of getting different values for $\hat{w}$. The standard deviation of this distribution is the standard deviation of the estimated value $\hat{w}$, known as the Standard Error (SE) of $\hat{w}$. Now, we can see how far we are from zero. If we are significantly far from 0, it means that y is related to the predictor X (we can investigate this with the help of the p-value). Please note that a low p-value does not mean the model should be linear. In other words, a low p-value just says there is a relationship between y and x. Now, we should see whether a "linear model" is suitable here or not!
- Measuring goodness-of-fit of regression: As discussed before, we can just use MSE. However, MSE can get bigger when the values of y are larger, and therefore we need a way to normalize it. So, we use $r2\_score$ as follows: $r2\_score = 1 - \frac{RSS}{TSS}$, where RSS (TSS) is the Residual (Total) Sum of Squares: $RSS = \sum\limits_{i=1}^{n}{(y_{i} - \hat{y}_{i})^{2}}$ and $TSS = \sum\limits_{i=1}^{n}{(y_{i} - y_{avg})^{2}}$. The closer $r2\_score$ is to one, the better! (We can see r2_score as the proportion of variance explained by the model.)
- Scaling data in linear regression --> Although scaling the data in linear regression does not change its accuracy, it can provide better insight into the coefficients. If the features X are not scaled to $\mu = 0$ and $\sigma = 1$, the coefficient of a feature can be impacted by the magnitude of its values, making it difficult to compare the coefficients of different features with each other. However, if we standardize each feature, the estimated parameters corresponding to the features can be compared with each other!
- F-statistic in linear regression --> In multivariate linear regression, we can get a p-value for each parameter, each corresponding to a single feature. We can also calculate a single score called the F-statistic that can help us test the null hypothesis "there is no relationship between X and y". To see the F-statistic and the p-values of the parameters, we can use statsmodels as follows:
import statsmodels.api as sm
X_train_sm = sm.add_constant(X_train)
regressor = sm.OLS(y_train, X_train_sm) #OLS: Ordinary Least Squares
regressor = regressor.fit()
print(regressor.summary())
- Side Note: p-value and F-statistic: please note that the p-value and the F-statistic can be calculated for parametric methods. Also, getting a low p-value cannot guarantee our model is good, as we still need to measure its goodness-of-fit. Calculating the p-value and the F-statistic can help us better understand some aspects of the relationship between X and y. They are useful when the goal is inference and understanding what is happening.
- RSE: We can get RSE as follows: $RSE = \sqrt{\frac{RSS}{n-p-1}}$, where $p$ is the number of features. RSE can be interpreted as the standard deviation of the unexplained variation, and it has the useful property of being in the same units as the response variable.
- Should I use all predictors? --> No... because sometimes a predictor may have a very insignificant impact and can unnecessarily make the model harder to interpret. Furthermore, two highly correlated features may cause trouble in linear regression, because it is not clear which of them should have the higher coefficient and how each contributes to the prediction of the response variable. To select features, we should run a feature-selection process. Three common types of feature selection are: (1) Forward selection, (2) Backward selection, and (3) Mixed selection. We can use the p-value as a measure to help us decide whether we should keep a feature or not (read more on this later! Also, take a look at scikit-learn).
- Side Note: Confidence Interval-vs-Prediction Interval --> The Confidence Interval (CI) relates to the estimated values of the parameters, and it is a way to help us understand how close $\hat{f}(x)$ is to $f(x)$. However, we also have the irreducible error $\epsilon$. To measure how close our $\hat{y}=\hat{f}(x)$ is to the true $y = f(x) + \epsilon$, we use the Prediction Interval (PI). Therefore, the PI is wider (see ... for further detail).
- Interaction (synergy) between predictors --> Visualizing the fitted model and the data can be a good way to see how the model behaves in different cases. For instance, in the advertising data set, after using multiple linear regression to model the data, we can see that the model underestimates y when the values of the predictors TV ad budget and radio ad budget are both high. So, they have some interaction effect. We can create a new variable $X_{TV}.X_{radio}$. We can see that using this new feature, in addition to the two features TV and radio, can greatly increase the r2_score.
- Some Notes: At the end, I would like to cover some notes.
(1) If the data shows quadratic behavior: use polynomial features (and then apply linear regression). We can plot the residuals against the y values; we would like to see the residual values scattered around the horizontal line residual=0. We can also create a QQ-plot with scipy.stats.probplot(residual_values, plot=plt). This gives us an idea of how close the residual values are to a normal distribution.
(2) If there is correlation between $\epsilon$ values: Our assumption was that all errors are independent of each other. However, there might be correlation (i.e. $\epsilon(X)$ may give some information about $\epsilon(X+1)$). In this case, we need some way to handle it. (Feel free to search online, or wait and we may cover it later!)
(3) Heteroscedasticity (the variance of $\epsilon$ changes): If we notice the errors getting wider around the horizontal line residual=0, that means the errors do not have a constant $\sigma$. We can use a $log(..)$ or $\sqrt{..}$ transformation of y to resolve this issue in cases where the error gets wider for larger values of y.
(4) Outliers and high leverage points: Here, when we talk about an outlier, we mean its y value is not in the usual range. Although this may not impact the parameters of the model, it can have a drastic impact on r2_score or other indices and may therefore mislead the user. A high leverage point is an observation whose features have surprising values. It is important to note that we should consider all features together to detect high leverage points (a.k.a. outliers in feature space), because it is possible that each single feature has a normal, in-range value while the data point (i.e. all features of the observation considered together) is far from the others. This can affect the parameters of the model, so it is important to handle such points. But how to find them? One way is to use pairwise distances and find the point that is far from the others: we can find the nearest neighbor of each point and select the one with the largest distance to its NN. Please note that the discovered point only "might" be a high leverage point (because, for any set of samples, we can always find a point that has the largest distance to its NN, but that does not mean the point is a high leverage point). Another way is to cluster the data points (if possible) and then remove the group that has the largest distance to its closest group. (But what if that group is just a bunch of observations with particular values for their features? If the group has 1000 members, should we consider them high leverage points? This needs further investigation!)
(5) (Multi)collinearity: this is when two (or more) features are highly correlated with each other. They can confuse the model, as it is not clear how much actual impact each of them has on the prediction. A solution is to keep only one feature from a set of features that are highly correlated with each other. A better approach, which checks multicollinearity as well, is to calculate the VIF score as follows: $VIF = \frac{1}{1 - R^{2}(X_{j}|X_{-j})}$, where $R^{2}(X_{j}|X_{-j})$ is the r2_score from regressing $X_{j}$ (considering the j-th feature as the response variable) on $X_{-j}$ (all features excluding the j-th feature). Therefore, a high VIF means the r2_score in the denominator is high, which means $X_{j}$ can be predicted pretty well from the other features; so it provides redundant information. (Rule of thumb: drop a feature if its VIF gets above 5 or 10.) Important: we should standardize the data before calculating the VIF score of each feature.
- Linear Regression-vs-KNN --> If the relationship is linear, linear regression is better (and even better than KNN with large k; recall that a larger k makes KNN closer to a linear model). However, if the relationship between x and y is non-linear, then KNN (with lower k) can perform better. HOWEVER, one should note that this is not always true! In fact, in high dimensions, when the relationship between x and y is non-linear, KNN usually performs worse! So, in short, parametric models (like linear regression) outperform non-parametric ones when the number of samples is low and the dimension is high!
df = pd.read_csv("./datasets/Boston.csv", index_col=0)
df.head(3)
df = df.drop('black' ,axis='columns')
df.describe(include='all')
_ = df.hist(figsize=(24,15))
X = df['lstat'].to_numpy()
y = df.iloc[:,-1].to_numpy()
X = X.reshape(-1,1)
print('shape of X: ', X.shape)
print('shape of y: ', y.shape)
# if we want to get some statistical information, we should use `statsmodels`
import statsmodels.api as sm
X_sm = sm.add_constant(X)
regressor = sm.OLS(y, X_sm)
regressor = regressor.fit()
print(regressor.summary())
remarks:
(1) We can quickly see that the p-value is (almost) 0. This suggests that there is a relationship between the response variable and the predictor lstat (we can reach the same conclusion by looking at the F-statistic). Note that a low p-value does not mean our model is good. If we look at the R-squared, we can see that the score is close to 0.5, which shows that the linear model with a single predictor cannot appropriately explain the variance in the target value.
(2) We have the Standard Error (SE) of the parameters, and thus we can say the approximate Confidence Interval (CI) is between $\mu - 2\sigma$ and $\mu + 2\sigma$ (where $\sigma$ is the SE). So, for the coefficient of x1, we can see the CI is [-0.9500 - 2*0.039, -0.9500 + 2*0.039].
(3) How to find CI and PI for the response variable y ?!?
(4) let us plot x-y (see below)
plt.scatter(X.ravel(), y, color = 'b')
plt.plot(X.ravel(), regressor.predict(X_sm), color = 'r')
plt.title('data and fitted simple linear model')
plt.xlabel('single feature: lstat')
plt.ylabel('response variable: medv')
plt.show()
side note: it can be seen that there is a quadratic behavior in the data... so, using polynomial features might help.
(5-I) let us plot residual-y
y_residual = regressor.predict(X_sm) - y
plt.scatter(y, y_residual, color='b')
plt.xlabel('y value')
plt.ylabel('y_residual')
plt.axhline(y=0, color='r')
plt.show()
plt.hist(y_residual)
plt.title('histogram of y_residual')
plt.show()
stats.probplot(y_residual, plot=plt)
plt.title('QQplot for y_residual')
plt.show()
We can see that the QQ-plot deviates from the straight line. Therefore, the residual values do not follow a normal distribution. The scatter plot of residual-vs-y shows structure at higher values of y; in fact, we can see that the residuals follow a trend for y > 30.
(6-II) let us plot studentized residual-y (?!) --> LATER!
(7) Find outliers and high leverage point (How??)
Recall that outliers are in y-space, and a high leverage point is an outlier in feature space.
from sklearn.metrics import pairwise_distances
D_y = pairwise_distances(y.reshape(-1,1))
np.fill_diagonal(D_y, np.inf) #Because of np.min(...) in next line
dist_to_nn = np.min(D_y, axis=1)
outlier_idx = np.argmax(dist_to_nn)
outlier_idx
D_x = pairwise_distances(X.reshape(-1,1))
np.fill_diagonal(D_x, np.inf) #Because of np.min(...) in next line
dist_to_nn = np.min(D_x, axis=1)
leverage_idx = np.argmax(dist_to_nn)
leverage_idx
plt.scatter(X.ravel(), y, color = 'b', label='data')
plt.plot(X.ravel(), regressor.predict(X_sm), color = 'r', label='fitted model')
plt.scatter(X[outlier_idx], y[outlier_idx], color='orange', label='outlier')
plt.scatter(X[leverage_idx], y[leverage_idx], color='black', label='leverage')
plt.title('data and fitted simple linear model')
plt.xlabel('single feature: lstat')
plt.ylabel('response variable: medv')
plt.legend()
plt.show()
(8) Let us use all predictors
X = df.iloc[:,:-1].to_numpy()
y = df.iloc[:,-1].to_numpy()
X_sm = sm.add_constant(X)
regressor = sm.OLS(y, X_sm)
regressor = regressor.fit()
print(regressor.summary())
remarks:
- The F-statistic is large, which shows there is a relationship between the features and y.
- The p-values for x3 and x7 are large, which means there is little evidence of a relationship between the response variable y and these two features.
- How do we know if there is (multi)collinearity? Let us calculate the VIF as follows:
def _get_VIF(X):
    """
    This function calculates the VIF of each feature and returns them.
    Parameters
    ----------
    X: numpy.ndarray
        feature matrix of shape (n,p) where n is the number of observations and p is the number of features
    Returns
    ---------
    a vector of size p that contains the VIF of each feature
    """
    if X.shape[1] <= 1:
        raise ValueError('the second dimension of X must be at least 2')
    p = X.shape[1]
    VIF_scores = np.full(p, -1, dtype=np.float64)
    for i in range(p):
        predictors = np.delete(X, i, axis=1)  # all features except the i-th one
        target = X[:,i].reshape(-1,1)  # the i-th feature becomes the response
        regressor = LinearRegression().fit(predictors, target)
        r_squared_score = r2_score(target, regressor.predict(predictors))
        VIF_scores[i] = 1 / (1 - r_squared_score + 1e-6)  # VIF = 1/(1 - R^2); the small epsilon avoids division by zero
    return VIF_scores
X = df.iloc[:,:-1].to_numpy()
y = df.iloc[:,-1].to_numpy()
VIF_scores = _get_VIF(X)
VIF_scores
remarks:
The VIF score is high for features no. 8 and 9 (in numpy indexing). This shows that these two variables can largely be obtained from the others. Although we see two features with high VIF, it is possible that one of them is mainly driven by the other! In other words, removing both of them may cause a loss of information. Let us take a look at the correlation matrix:
df_X = pd.DataFrame(X)
corr = df_X.corr()
corr.style.background_gradient(cmap='coolwarm')
The correlation matrix shows a high correlation (more than 0.9) between features 8 and 9. Therefore, it is possible that removing feature 9 makes the VIF score of feature 8 lower.
In addition, we will delete features 2 and 6 (numpy indexing) as they have high p-values. Let us take a look at the model again:
X_clean_v1 = np.delete(X, [2, 6, 9], axis=1)
X_sm = sm.add_constant(X_clean_v1)
regressor = sm.OLS(y, X_sm)
regressor = regressor.fit()
print(regressor.summary())
VIF_scores = _get_VIF(X_clean_v1)
VIF_scores
#The reason is that when a feature is not important, its coef is almost zero. And, if there is multicollinearity, the coef
#is divided among the correlated features.
#However, removing redundant features can make interpretation easier!
(9) Add an interaction between lstat and age (here, we are using the unclean version of X, not the one after removing features with high VIF or high p-value):
X_df = df.iloc[:,:-1]
y_df = df.iloc[:,-1]
X_df['interaction'] = X_df['age'] * X_df['lstat']
X_df
X = X_df.to_numpy()
y = y_df.to_numpy()
#clean it
X_clean_v1 = np.delete(X, [2, 6, 9], axis=1)
#let us see the result:
X_sm = sm.add_constant(X_clean_v1)
regressor = sm.OLS(y, X_sm)
regressor = regressor.fit()
print(regressor.summary())
Did we see any difference in the R-squared score? No. That means there might be no interaction between the features age and lstat (or should we scale the data first?!?)
(10) Recall that the relationship between y and lstat was not linear; it was quadratic-ish! So, let us apply regression while also considering an $X^{2}$ feature:
X = df['lstat'].to_numpy()
y = df.iloc[:,-1].to_numpy()
from sklearn.preprocessing import PolynomialFeatures
X_poly = PolynomialFeatures(degree=2).fit_transform(X.reshape(-1,1))
X_poly
regressor = sm.OLS(y, X_poly)
regressor = regressor.fit()
print(regressor.summary())
Remarks:
- The p-values are all close to zero, which suggests the features are related to the response variable.
- The R-squared value has increased by ~20%.
plt.scatter(X, y, color='b', label='data')
plt.scatter(X, regressor.predict(X_poly), color='r', label='fitted model')
plt.xlabel('feature: lstat')
plt.ylabel('response variable')
plt.legend()
plt.show()
_get_VIF(X_poly[:,1:])
Now we can see the VIF score is even more than 10; however, we noticed an increase in the R-squared value after adding X^2.
The question is: should we trust VIF and always remove a feature with high VIF? OK, let us drop the original feature and just keep its squared version:
regressor = sm.OLS(y, X_poly[:,[0,2]]) #here we are skipping [:,1] as that corresponds to original feature
regressor = regressor.fit()
print(regressor.summary())
Now we can see that we should have kept the original feature. So, if that is the case, why did we get high VIF values earlier?
possible solution: stackoverflow (needs further explanation)
X = df['lstat'].to_numpy()
y = df.iloc[:,-1].to_numpy()
X_scaled = StandardScaler().fit_transform(X.reshape(-1,1))
X_poly = PolynomialFeatures(degree=2).fit_transform(X_scaled)
regressor = sm.OLS(y, X_poly)
regressor = regressor.fit()
print(regressor.summary())
_get_VIF(X_poly[:,1:])
NOTE: Now we can see the VIF values are low. So, we recommend scaling your data even in linear regression, because it makes sense to have low VIF values. (Feel free to go back and review the previous questions and try to scale X first.)
(11) Handling categorical features
df = pd.read_csv('./datasets/Carseats.csv', index_col = 0)
df.head(3)
_ = df.hist(figsize=(10,10))
cat_features = list(set(df.columns) - set(df._get_numeric_data()))
cat_features
from collections import Counter
for feature in cat_features:
    tags_counts = Counter(df[feature])
    plt.bar(list(tags_counts.keys()), tags_counts.values())
    plt.show()
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('ohe', OneHotEncoder(),cat_features)], remainder='passthrough')
X = ct.fit_transform(df.iloc[:,1:]) #note that the first column of df is actually the response variable
X
ct.get_feature_names_out()
y = df.iloc[:,0]
regressor = sm.OLS(y, X).fit()
print(regressor.summary())
Remarks:
We can see large p-values (i.e. > 0.05) for x1, x11 and x14... For now, we prefer to remove only x11 and x14, because the p-value of x1 is close to the common 0.05 threshold.
X = np.delete(X, obj=[10,13], axis=1)
regressor = sm.OLS(y, X).fit()
print(regressor.summary())
X_scaled = StandardScaler().fit_transform(X)
_get_VIF(X_scaled)
#because if we have other encoded tags, we can predict the remaining one!
#so, let us drop the first tag as we do encoding...
ct = ColumnTransformer([('ohe', OneHotEncoder(drop='first'),cat_features)], remainder='passthrough')
X = ct.fit_transform(df.iloc[:,1:]) #note that the first column of df is actually the response variable
y = df.iloc[:,0]
regressor = sm.OLS(y, X).fit()
print(regressor.summary())
Reducing the number of dummy variables by dropping one of them (e.g. in the binary case, we can just keep one column; if it is not 1, it must be 0!) increased the r2_score! (Please note that this approach leads to asymmetric dummy variables, which may cause problems (??? further investigation can be conducted and presented here).)
We can see that the p-value is not large anymore. Let us check out VIF:
X_scaled = StandardScaler().fit_transform(X)
_get_VIF(X_scaled)
now we can see the VIF scores are low!... All good :)