Note: The content of this post is based on my understanding of Chapter 03 of the book Introduction to Statistical Learning, where the authors introduce the concept of Linear Regression and provide a comprehensive discussion.

If you haven't read this chapter of the book, I recommend reading it first (you can find its free online version via the provided link). Please feel free to change the code and see how the results behave. You may find something interesting...

import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn import utils
from sklearn.preprocessing import StandardScaler

import statsmodels.api as sm
from scipy import stats

You have probably heard about regression a lot. That alone shows it is an important topic that has helped practitioners in many different areas. It can also help us better understand statistical models in general, and it can be considered a fundamental tool for building several more complicated models.

Why are we digging into the data?

It is important to improve our skill in asking questions and then trying to find the answers! Recall the Advertising data, where the features were the TV/Radio/Newspaper budgets and the target was Sales. Given the data, we are asked to come up with a marketing plan for the next year.
Ok, so what kind of questions should we ask?

  • Is there any evidence of a relationship between the media (features) and Sales (target, output)?
  • If yes, how strong is that relationship?
  • Which media are associated with Sales, and by how much?
  • How accurately can we predict Sales based on the given data?
  • Is the relationship between the features and the target linear? If yes, we can use linear regression. If not, can we transform the features and/or the response to make the relationship linear? (Why linear? Because its interpretation is easy.)
  • Is there any interaction between the features?

Simple Linear regression: Univariate Linear Regression

For now, we assume a single feature x, and we want to model its relationship to y using linear regression.

$y \approx \hat{y}=wx + b $

And that's it! We just want to fit a line with slope w and y-intercept b. The goal is to find w and b that minimize the MSE.

In the equation above, we regressed y onto x with a linear regression model; w and b are the coefficients.

Residual Sum of Squares (RSS): $RSS = \sum\limits_{i=1}^{n}{e_{i}^{2}}$, where $e_{i}=y_{i} - \hat{y}_{i} $

We can minimize RSS analytically (note that $MSE$ is just $\frac{1}{n} {RSS}$). In other words, we can find the optimal w and b in closed form. This gives us the least squares coefficient estimates for the linear regression model.

(If you are interested in the math behind estimating w and b, please see Wikipedia.)
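As a quick, hedged illustration of what "analytically" means here, the sketch below computes the closed-form least-squares estimates $\hat{w} = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^{2}}$ and $\hat{b} = \bar{y} - \hat{w}\bar{x}$ on a small synthetic data set (the toy data and variable names are my own, not part of the Advertising example):

import numpy as np

rng = np.random.default_rng(0)
x_toy = rng.uniform(0, 10, size=100)                      # synthetic feature (toy data, not the Advertising set)
y_toy = 3.0 * x_toy + 2.0 + rng.normal(0, 1.0, size=100)  # generated from a known line plus noise

# closed-form least-squares estimates
w_hat = np.sum((x_toy - x_toy.mean()) * (y_toy - y_toy.mean())) / np.sum((x_toy - x_toy.mean()) ** 2)
b_hat = y_toy.mean() - w_hat * x_toy.mean()

print(w_hat, b_hat)  # should be close to the true slope 3 and intercept 2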

EXAMPLE

df = pd.read_csv("./datasets/Advertising.csv", index_col=0)
df.head(2)
TV Radio Newspaper Sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
X = df['TV'].to_numpy()
X = X.reshape(-1,1)

y = df.iloc[:,-1].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)

y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)
plt.figure()
plt.title('regressing target sale onto feature TV')

plt.scatter(X_train.ravel(), y_train, color='r', label='train')
plt.scatter(X_test.ravel(), y_test, color='b', label='test')
plt.plot(X_train.ravel(), y_train_pred, color='k', linewidth=3, label='linear model')

plt.xlabel('TV budget')
plt.ylabel('sale')
plt.legend()
plt.show()

The black line is the linear-regression model estimated/fitted on the training set. It is then used to predict the y-values of the test set.

w = regressor.coef_[0] #because we have only one variable
b = regressor.intercept_ 
print(f'w(slope): {w}, and b(y-intercept): {b}')
w(slope): 0.04614340753674499, and b(y-intercept): 7.248909283005855

Discussion on accuracy:
It is important to note that the estimated parameters w and b are unbiased: if we fit a linear model to many different training sets, the average of the estimates of w (and b) across those training sets gets closer to the true value of w (and b). Therefore, a better notation might be $\hat{w}$ and $\hat{b}$, since they are estimates of the true values $w$ and $b$. However, we usually avoid writing the hat.

Since the estimated value of a parameter (from one set of observations) deviates from the true value, we can calculate the Standard Error (SE) of the parameter to show how accurate the estimate is. The standard deviation (std) describes one set of observations, while the SE is the standard deviation of a statistical parameter computed across sets of observations. However, we are able to estimate the SE even from a single set of observations.

For more info, watch this video: std-vs-SE

SIDE NOTES

Before moving forward, I think it would be nice to review some concepts (we may study them in more depth in the future, but they are important for understanding SE, confidence intervals, and so on). You may want to watch the following videos (preferably in the order provided below).

Bootstrapping: Watch Bootstrapping main concept
Standard Error: Watch Standard Error
Confidence Interval: Watch Confidence Interval

Suggestion: a short note on p-value: MIT note

We want to test the following null hypothesis: there is no association between Y and X. So, we can repeat the experiment with new sets of data (or do bootstrapping) and find different values for the estimates of $w$ and $b$. Let us do that for the data we have:

n_bootstrap = 1000

w_vec = np.zeros(n_bootstrap)
b_vec = np.zeros(n_bootstrap)

data_train = np.c_[X_train, y_train]
for i in range(n_bootstrap):
    data_bootstrap = utils.resample(data_train, random_state=i)
    regressor = LinearRegression().fit(data_bootstrap[:,0].reshape(-1,1), data_bootstrap[:,1])
    
    w_vec[i] = regressor.coef_[0]
    b_vec[i] = regressor.intercept_
w_mu = np.mean(w_vec)
w_std = np.std(w_vec)

w_SE = w_std #note that Standard Error of w is basically the standard deviation of different values we got for w   
print('Standard Error: ', w_SE)
Standard Error:  0.003279658392741697

For linear regression, we already have a formula that can give us the w_SE (see book page 66):

y_diff = y_train - y_train_pred
RSS = np.sum(np.square(y_diff))

#Residual Standard Error(RSE)
RSE = np.sqrt(RSS/(X_train.shape[0]-2))

#X_dev = sum((X-Xavg)^2)
X_dev = np.sum(np.square(X_train - np.mean(X_train)))

w_SE_formula = np.sqrt((RSE**2)/X_dev)
print('w_SE using formula of book: ', w_SE_formula)
w_SE using formula of book:  0.003249284898592108

As observed, the SE of $w$ calculated by bootstrapping is close to the one provided by the formula. We can also plot the distribution of $w$ based on the values collected throughout the bootstrapping process, and we can get an approximate 95\% confidence interval by using the interval (mu - 2*std, mu + 2*std).

plt.figure()
plt.title('distribution of w')

plt.hist(w_vec,color='b',label='distribution', density=True)
plt.axvline(w,color='orange',label='estimated w')
plt.axvline(w_mu,color='k',label='distribution mu')
plt.axvline(w_mu-2*w_std,color='r',linestyle='--',label='2*std from mean')
plt.axvline(w_mu+2*w_std,color='r',linestyle='--')

plt.legend()
plt.show()

Also, you may want to dig deeper into the linear regression topic by calculating how strong this linear association is. If there is no association, we expect to see w = 0, right? But we cannot just look at the value of $w$, because its magnitude can easily be changed by scaling the data. So, we should take its standard deviation into account to see how far $w$ is from zero. How do we test whether w is significantly far from zero? In other words, does the result provide significant evidence that there is an association?

Here is where we should use hypothesis testing. Our null hypothesis is: there is no association between x and y, i.e. the slope $w=0$. We have samples, and with samples we can only estimate our parameter w (as $\hat{w}$). (Note that when we calculate $w$, we assume we are calculating the $w$ in y = wx + b.) We would expect $\hat{w}$ to be close to the true value $w=0$ IF THE NULL HYPOTHESIS IS TRUE. Now, we observed $\hat{w}$ to be around 0.05. Is it significantly far from 0? Well, with the null hypothesis $w=0$ and the standard deviation SE (note: the SE is the same under the null and alternative hypotheses in linear regression), we can normalize the observed value and calculate the t-score as follows:

t = (w - 0)/w_SE
print('t is: ', t) 
t is:  14.069577380030264

Now, if we use a t-distribution table (with degrees of freedom that depend on the number of samples; here n - 2), we will see that the p-value is very small (almost zero). So we can reject our null hypothesis (i.e. there is an association between X and Y).
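Instead of looking the value up in a t-table, we can get the two-sided p-value directly from scipy. A small sketch, assuming the simple-regression degrees of freedom n - 2 and reusing the t computed above:

dof = X_train.shape[0] - 2                   # degrees of freedom for simple linear regression
p_value = 2 * stats.t.sf(np.abs(t), df=dof)  # two-sided p-value from the t-distribution
print('p-value: ', p_value)                  # expected to be essentially zero here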

The next question is: "How accurate is the proposed (i.e. linear) model?" In other words, we know we should reject the no-association hypothesis. But can we trust our LINEAR model? (i.e. is it a good fit?)
We can use r2_score to measure it (more on this below).

regressor = LinearRegression().fit(X_train, y_train)

y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)

print('r2 score on train set: ', r2_score(y_train, y_train_pred))
print('r2 score on test set: ', r2_score(y_test, y_test_pred))
r2 score on train set:  0.5767451599400782
r2 score on test set:  0.6902574858002379

We can see that the r2_score is low for this model. A good r2_score is close to 1. (please continue for more detail)

$r2\_score = 1 - \frac{RSS}{TSS}$, where:

$RSS = \sum\limits_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}$, and $TSS = \sum\limits_{i=1}^{n}(y_{i} - \bar{y})^{2}$

We can think of RSS as the amount of variation left UNexplained after using the model. (Note that initially, before modeling, the variation in Y was TSS.)
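To make the RSS/TSS decomposition concrete, here is a quick check (reusing the train predictions from the fit above) that recomputing $1 - \frac{RSS}{TSS}$ by hand matches sklearn's r2_score:

RSS_manual = np.sum((y_train - y_train_pred) ** 2)      # variation left unexplained by the model
TSS_manual = np.sum((y_train - np.mean(y_train)) ** 2)  # total variation in y before modeling

print('manual r2_score:  ', 1 - RSS_manual / TSS_manual)
print('sklearn r2_score: ', r2_score(y_train, y_train_pred))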

Multi-variate (Multiple) Linear Regression

Now that we have learned the concepts of linear regression, we should see how to extend it to more than one variable. Why not fit a simple (univariate) linear regression for each variable? Mainly because we would be ignoring two variables when using only one variable to predict Sales. We'll see that this can produce misleading results if there is a correlation between two or more (input) variables.

$y \approx \hat{y} = w_{1}x_{1} + w_{2}x_{2} + ... + w_{p}x_{p} + b = W^{T}X + b$

As stated earlier, we should avoid using one simple linear regression for each individual predictor, and instead, we should consider them together.

practice: try to fit a simple linear regression on each feature of the advertising data, and then try a multiple linear regression. Are the coefficients of the simple and multiple fits close for each predictor? If not, why? What does that mean? (Hint: you can plot the correlation matrix that shows the correlation between predictors.)

Let us first implement a multi-linear regression:

X = df.iloc[:,:-1].to_numpy()
y = df.iloc[:,-1].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

regressor = LinearRegression().fit(X_train, y_train)

y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)
print('the y intercept is: ', regressor.intercept_)
print('the coefficients corresponding to the three features are: ', regressor.coef_)
the y intercept is:  2.89257005115115
the coefficients corresponding to the three features are:  [0.04416235 0.19900368 0.00116268]

please note that our results are slightly different from those presented in the book because: (1) the authors may have used the whole data set, whereas we only used the training set here, and/or (2) even if the authors used a training set, their split might contain different samples than the one obtained by the function train_test_split.

How can I determine the importance of each feature? Note that the scale of the coefficients depends on the scale of the values of each feature, so it is better to standardize the features!

from sklearn.preprocessing import StandardScaler
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

regressor = LinearRegression().fit(X_train, y_train)

y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)

print('the y intercept is: ', regressor.intercept_)
print('the coefficients corresponding to the three features are: ', regressor.coef_)
the y intercept is:  14.351333333333333
the coefficients corresponding to the three features are:  [3.72650688 2.94496827 0.02370631]
print('r2 score on training set: ', r2_score(y_train, y_train_pred))
print('r2 score on test set: ', r2_score(y_test, y_test_pred))
r2 score on training set:  0.9072183330817297
r2 score on test set:  0.8576396745320893

This is an interesting result. It shows that the model with three predictors can be LINEAR! (We saw before that using a single predictor did not give a good LINEAR model.)

The results above indicate that there is a considerable relationship between the first two features and the target value. However, the coefficient for the last feature is almost zero, which suggests that the last feature (i.e. Newspaper) has no significant impact on the target.

NOTE: Bootstrapping a multiple linear regression is more involved, and calculating the SE of each coefficient requires some familiarity with linear algebra. We therefore use a Python package to calculate this for us (sklearn's LinearRegression does not provide p-values).

X_train_sm = sm.add_constant(X_train) # a constant column should be added as new feature (will act as the y-intercept!) 

regressor = sm.OLS(y_train, X_train_sm)
regressor = regressor.fit()
print(regressor.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.907
Model:                            OLS   Adj. R-squared:                  0.905
Method:                 Least Squares   F-statistic:                     475.9
Date:                Sun, 27 Feb 2022   Prob (F-statistic):           3.89e-75
Time:                        15:11:46   Log-Likelihood:                -279.71
No. Observations:                 150   AIC:                             567.4
Df Residuals:                     146   BIC:                             579.5
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         14.3513      0.129    111.038      0.000      14.096      14.607
x1             3.7265      0.131     28.543      0.000       3.468       3.985
x2             2.9450      0.135     21.750      0.000       2.677       3.213
x3             0.0237      0.137      0.174      0.862      -0.246       0.294
==============================================================================
Omnibus:                       12.292   Durbin-Watson:                   2.055
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               13.534
Skew:                          -0.736   Prob(JB):                      0.00115
Kurtosis:                       3.001   Cond. No.                         1.40
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Now, we can see that the p-value of the third feature is very large. It indicates that the NULL HYPOTHESIS "there is no relationship between x3 (i.e. newspaper) and y (i.e. sales)" cannot be rejected! Let us see if removing the feature newspaper can enhance the r2_score:

X = df.iloc[:,:-2].to_numpy() #excluding newspaper!
y = df.iloc[:,-1].to_numpy()

# train_test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# scaling data (so the slopes are not impacted by the scale of the data) [Optional!]
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

# fit linear model to data
regressor = LinearRegression().fit(X_train, y_train)

# pred y of train & test using both train and test sets
y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)

# get the coef & intercept of fitted linear model
print('the y intercept is: ', regressor.intercept_)
print('the coefficients corresponding to the two features are: ', regressor.coef_)
print('-----')

# calculate the  r2_score:
print('r2 score on training set: ', r2_score(y_train, y_train_pred))
print('r2 score on test set: ', r2_score(y_test, y_test_pred))
the y intercept is:  14.351333333333333
the coefficients corresponding to the two features are:  [3.72945528 2.95186219]
-----
r2 score on training set:  0.9071991785449653
r2 score on test set:  0.8580883042722334

It seems the training r2_score has slightly decreased! It is important to note that increasing the number of predictors increases the training $r2\_score$ anyway. (However, we can see here that the change is really small, which indicates that the extra predictor is not significant; its inclusion may even result in worse performance of the model on the test set.) Since R2 depends on the number of predictors, we cannot fairly compare two models of the same type on the same data that use different numbers of predictors; instead, we should use the adjusted r2\_score. (Read more here: Adjusted R2)
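As a small sketch of the adjusted score mentioned above, using the standard formula $\bar{R}^{2} = 1 - (1 - R^{2})\frac{n-1}{n-p-1}$ (the helper name is mine):

def adjusted_r2(y_true, y_pred, p):
    """Adjusted R^2, which penalizes the number of predictors p."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# e.g. for the two-predictor model (TV and Radio) fitted above:
print('adjusted r2 on train set: ', adjusted_r2(y_train, y_train_pred, p=2))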

But why do we get a not-close-to-zero slope for Newspaper when it is used alone as the single predictor in a simple (univariate) linear regression? (In that case, which is not shown in this notebook, we also get a low p-value, which would suggest there is a relationship between y and Newspaper.) So, what is going on?

The reason is that there is a correlation between Newspaper and another predictor. So, when we increase Newspaper by some amount $\Delta$, we are also affecting the other (hidden) predictor by (a fraction of) $\Delta$. (For example, changing the season from winter to summer changes the number of people who go to the beach, and that increases the sales of a person who sells ice cream. The actual relationship is between the ice-cream sales and the number of people on the beach; however, we can still use the season as the only predictor and get a linear model!)

X = df.iloc[:,:-1] # X is dataframe
print(X.corr())
                 TV     Radio  Newspaper
TV         1.000000  0.054809   0.056648
Radio      0.054809  1.000000   0.354104
Newspaper  0.056648  0.354104   1.000000

Now we can see there is a correlation between Radio and Newspaper. So, even if we predict Sales with Newspaper alone, a change in Newspaper is correlated with a change in Radio (so the value of Radio changes, and so does the value of the target, i.e. Sales). Therefore we might believe that there is an actual relationship between y and Newspaper. However, we know that is not true (as we saw before: when we used all the predictors together, the p-value for Newspaper was large, and we noticed only a small change in r2_score when it was excluded from the set of features).

Some interesting questions:

  • Is at least one of the predictors useful for predicting the response? --> use the F-statistic
  • If yes, which subset of predictors helps to explain Y? Is it all of them? --> Forward/Backward/Mixed selection
  • Goodness-of-fit score?
  • Given a new test point X, how do we predict its response (the corresponding target value y), and how accurate is that prediction?

> Q1: Is at least one of predictors useful for predicting the response?

Similar to the p-value calculated for each predictor in multivariate linear regression, we can calculate a single F-statistic as follows:

$\text{F-statistic} = \frac{(TSS - RSS)/p}{RSS/(n-p-1)}$, where p is the number of predictors and n is the number of samples.

Recall that:
$TSS = {\sum\limits_{i=1}^{n}(y_{i} - \bar{y})^{2}}$, and $RSS = {\sum\limits_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}}$.

def f_statistics(y_true, y_pred, p):
    """
    This function calculates the F-statistics of a fitted linear regression model.
    
    Parameters
    ----------
    y_true: ground truth values of target y
    y_pred: predicted values of target y (using the fitted linear regression model)
    p: number of predictors used in the model
    
    output
    ---------
    f_statistics_value: scalar
    shows the value of F-statistics.
    """
    RSS_by_TSS = 1 - r2_score(y_true, y_pred)
    
    n = len(y_true)
    
    return ((n-p-1) / p) * (1/RSS_by_TSS - 1)
f_statistics_val  = f_statistics(y_train, y_train_pred, p = 3)
print('f_statistics value is: ', f_statistics_val)
f_statistics value is:  475.7539785778091

So, what does this value tell us?

The F-statistic is a measure that helps us reject OR fail to reject the NULL HYPOTHESIS: "there is no relationship between y and any of the features" (the alternative hypothesis is: "there is at least one feature that is related to the target y").

If the F-statistic is close to one, we canNOT reject the null hypothesis (i.e. there might be no relationship between y and the features). However, if it gets well above one, we should reject the null hypothesis. How much greater? Well, if the number of samples is large, an F-statistic only slightly higher than 1 may be enough to reject the null hypothesis. However, if the number of samples is small, we need a much larger value of the F-statistic.
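Rather than judging "how much greater than one" by eye, we can convert the F-statistic into a p-value using the F-distribution with (p, n-p-1) degrees of freedom. A quick sketch reusing f_statistics_val from above:

n_samples, p = len(y_train), 3
p_value_F = stats.f.sf(f_statistics_val, dfn=p, dfd=n_samples - p - 1)  # P(F > observed value)
print('p-value of the F-statistic: ', p_value_F)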

> Which subset of predictors helps to explain Y?
This is variable selection (a.k.a. feature selection). The goal is to find a subset of predictors that are associated with the response variable and drop the unnecessary features. Three classical approaches are: Forward selection, Backward selection, and Mixed selection (more on this in later notebooks).

> Goodness-of-fit score?
Two of the most common goodness-of-fit measures are $R^{2}$ and $RSE$. Below, we review their formulas:

$y \approx \hat{y}=wx + b $

$r2\_score = 1 - \frac{RSS}{TSS}$, with $RSS = \sum\limits_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}$, and $TSS = \sum\limits_{i=1}^{n}(y_{i} - \bar{y})^{2}$; (side note: $MSE = \frac{RSS}{n}$)

$RSE = \sqrt{\frac{RSS}{n-2}}$. Note that this is for univariate linear regression. For multivariate linear regression (with $p>1$ predictors), the formula is as follows:

$RSE = \sqrt{\frac{RSS}{n-p-1}}$. Why n-p-1? Well, you have to go deeper into statistics; it can be shown that RSS has the following distribution:

$RSS \sim \sigma^{2}\,\chi^{2}_{n-p-1}$, where $\sigma^{2}$ is the variance of the irreducible (inevitable) part of the error.

It can be shown that $E(RSS)=\sigma^{2}(n-p-1)$. That means RSS gets bigger as the number of samples increases. To get an unbiased estimator, we can define $err = \frac{RSS}{n-p-1}$ to make sure the expected error is independent of the number of samples.

NOTE: If $r2\_score$ is close to 1, the model explains a large portion of the variance in the response variable (a high $r2\_score$ means a low $RSS$, which basically means the predicted y is close to the true y). In other words, the higher the r2_score, the more of the variance in the response is explained by the model. RSE is not as directly interpretable as r2_score, because r2_score is normalized while RSE is not (RSE is measured in the units of y).
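As a quick sketch (before refitting below), here is the RSE of the two-predictor model fitted earlier, using the $n-p-1$ denominator; it is roughly the typical size of a residual, in the same units as Sales:

n_samples, p = len(y_train), 2
RSS_two = np.sum((y_train - y_train_pred) ** 2)
RSE_two = np.sqrt(RSS_two / (n_samples - p - 1))
print('RSE of the two-predictor model: ', RSE_two)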

#STEP1: Fitting model to train data
# get X and y
X = df.iloc[:,:-2].to_numpy() #excluding newspaper!
y = df.iloc[:,-1].to_numpy()

# train_test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# scaling data (so the slopes are not impacted by the scale of the data) [Optional!]
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

# fit linear model to data
regressor = LinearRegression().fit(X_train, y_train)

# pred y of train & test using both train and test sets
y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)
min_feature, max_feature = -3, 3

feature_0, feature_1 = np.meshgrid(np.arange(min_feature,max_feature,0.1),
                                   np.arange(min_feature,max_feature,0.1))

X_mesh = np.c_[feature_0.ravel(), feature_1.ravel()]
y_mesh = regressor.predict(X_mesh)

fig = plt.figure(figsize=(30,10))
#plot the fitted hyperplane that contains point (x,y,z) 
#... where the z is the prediction of response variable for predictors x,y
ax = fig.add_subplot(111, projection='3d')

ax.scatter(X_mesh[:,0],X_mesh[:,1],y_mesh, c='b', alpha=0.1)
ax.scatter(X_train[:,0], X_train[:,1], y_train, c='r')

ax.set_xlabel('feature TV')
ax.set_ylabel('feature Radio')
ax.set_zlabel('target: Sale')
    
ax.view_init(elev=3, azim=110)
plt.show()

We can see that some of the points lie above the fitted plane and some others below it.

> Three types of errors associated with (multiple) linear regression:

  • The parameters W and b are just estimates of the true W and b. We can use the concept of a confidence interval to show how close $\hat{f}$ is to $f$ (different estimates of the parameters, e.g. via bootstrapping, lead to different parameters and thus a different $\hat{f}$).

  • Bias error: we assumed that the model is linear, which is usually not exactly correct in real-world problems, so our model has some bias error. For now, we ignore this and assume our model form is correct.

  • Even if we had the true estimator, our prediction would still have the irreducible error, with distribution $N(0, \sigma)$. We use a Prediction Interval to show how close $\hat{Y}$ is to $Y$. For instance, say I focus on a subsample of the data and fit a linear model to it. My model may be closer to the actual model for this subsample; however, since the number of samples is smaller, we are dealing with higher uncertainty, and therefore the prediction interval becomes wider.

Get deeper: Confidence Interval vs Prediction Interval
Later :)

Some extra considerations:
(1) Qualitative predictors: a fancy term for categorical features. Sometimes the predictors are not numerical, so we should transform (encode) the categorical features into numerical ones.

(2) Sometimes a qualitative predictor contains only two groups (e.g. day or night). In that case, we can create a dummy variable and represent the labels day/night with the binary values 0/1. In machine learning this is called one-hot encoding, which can also be used for a categorical feature with more than two labels.
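A minimal sketch of such an encoding with pandas, using a hypothetical two-level column time_of_day (not part of the Advertising data):

df_toy = pd.DataFrame({'time_of_day': ['day', 'night', 'day', 'day'],
                       'budget': [10.0, 12.5, 7.0, 9.0]})

# a two-level feature becomes a single 0/1 dummy column; drop_first avoids a redundant column
df_encoded = pd.get_dummies(df_toy, columns=['time_of_day'], drop_first=True)
print(df_encoded)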

Note:
As depicted in the figure above, when both features have high values, the model underestimates the target value. That can mean the two features have some synergy in practice, i.e. they boost each other.

To model this, we define a new variable X = TV * Radio and add it to the previous ones (it is better to keep both TV and Radio as well). Let us see the result:

X = df.iloc[:,:-2].to_numpy() #excluding newspaper!
y = df.iloc[:,-1].to_numpy()

X = np.c_[X, X[:,0]*X[:,1]] #add TV * radio

# train_test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# scaling data (so the slopes are not impacted by the scale of the data) [Optional!]
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

# fit linear model to data
regressor = LinearRegression().fit(X_train, y_train)

# pred y of train & test using both train and test sets
y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)

# get the coef & intercept of fitted linear model
print('the y intercept is: ', regressor.intercept_)
print('the coefficients corresponding to the three features are: ', regressor.coef_)
print('-----')

# calculate the  r2_score:
print('r2 score on training set: ', r2_score(y_train, y_train_pred))
print('r2 score on test set: ', r2_score(y_test, y_test_pred))
the y intercept is:  14.351333333333335
the coefficients corresponding to the three features are:  [1.57858188 0.49099955 3.60921627]
-----
r2 score on training set:  0.9739792008933047
r2 score on test set:  0.9481495211786501

What we did can be considered feature engineering: we engineered a feature to help the model!

TIP: We can even combine a categorical and a numerical feature. For instance, if X1 is categorical (with labels 0 and 1) and X2 is numerical, and the impact of X2 on the response might be different for the different labels of X1, we can create an additional feature $X3 = X2 \times X1$.

> practice: In the Credit data, predict balance using the income and student features.

TIP: If the data shows a non-linear relationship, we can use the concept of "polynomial features": in addition to $X$, we may consider $X^{2}$, $X^{3}$, and so on. (However, at some point we may overfit the data, so we should be careful.)

Potential problems when using linear regression:
(1) Non-linearity of response-predictor relationship... let us see it with an example

df = pd.read_csv("./datasets/Auto_v2.csv") #there two auto data sets. We are using auto_2.
df.head(3)
Unnamed: 0 mpg cylinders displacement horsepower weight acceleration year origin name
0 1 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 2 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 3 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
X = df['horsepower'].to_numpy()
y = df['mpg'].to_numpy()

#let us take a quick look at the graph
plt.figure(figsize=(20,5))
plt.scatter(X, y, color='b')
plt.xlabel('horsepower')
plt.ylabel('miles per gallon (mpg)')
plt.show()
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
poly.fit(X.reshape(-1,1))

X_poly = poly.transform(X.reshape(-1,1))

X_poly_train, X_poly_test, y_train, y_test = train_test_split(X_poly, y)
regressor_poly = LinearRegression().fit(X_poly_train, y_train)
y_train_pred_poly  = regressor_poly.predict(X_poly_train)
y_test_pred_poly  = regressor_poly.predict(X_poly_test)

print('r2 score with poly feature (train): ', r2_score(y_train, y_train_pred_poly))
print('r2 score with poly feature (test): ', r2_score(y_test, y_test_pred_poly))
r2 score with poly feature (train):  0.677765663462048
r2 score with poly feature (test):  0.7126347880982686
regressor = LinearRegression().fit(X_poly_train[:,:2], y_train) #excluding degree2 (just simple linear regression)
y_train_pred  = regressor.predict(X_poly_train[:,:2])
y_test_pred  = regressor.predict(X_poly_test[:,:2])

print('r2 score using linear model w/o poly features (train): ', r2_score(y_train, y_train_pred))
print('r2 score using linear model w/o poly features (test): ', r2_score(y_test, y_test_pred))
r2 score using linear model w/o poly features (train):  0.599786154466244
r2 score using linear model w/o poly features (test):  0.6200928542447959
plt.figure(figsize=(20,5))
plt.scatter(X, y, color='b')

plt.scatter(X, regressor.predict(X_poly[:,:2]), color='r', linewidth=3, label='simple feature')
plt.scatter(X, regressor_poly.predict(X_poly), color='orange',  linewidth=3, label='poly feature (deg=2)')

plt.xlabel('horsepower')
plt.ylabel('miles per gallon (mpg)')

plt.legend()
plt.show()
residual_ytest_poly = y_test_pred_poly - y_test
residual_ytest = y_test_pred - y_test

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15,3))
axs[0].scatter(y_test, residual_ytest_poly)
axs[1].scatter(y_test, residual_ytest)

axs[0].set_title('residual w.r.t y (on test set, with poly feature)')
axs[1].set_title('residual w.r.t y (on test set, without poly feature)')


axs[0].set_xlabel('y_test')
axs[1].set_xlabel('y_test')

axs[0].set_ylabel('residual')
axs[1].set_ylabel('residual')
plt.show()

print('\n')

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15,3))
axs[0].hist(residual_ytest_poly, label=f'mu={np.mean(residual_ytest_poly)}, std={np.std(residual_ytest_poly)}')
axs[1].hist(residual_ytest, label=f'mu={np.mean(residual_ytest)}, std={np.std(residual_ytest)}')

axs[0].set_title('hist of residual on test set, w/ poly feature')
axs[1].set_title('hist of residual on test set, w/o poly feature')

axs[0].legend()
axs[1].legend()
plt.show()

And, to measure how close these two plots are to normal distribution, we can plot their QQ-plot:

fig, axs = plt.subplots(nrows=1,ncols=2, figsize=(10,5))
stats.probplot(residual_ytest_poly, plot=axs[0])
stats.probplot(residual_ytest, plot=axs[1])

axs[0].set_title('QQplot (with poly features)')
axs[1].set_title('QQplot (without poly features)')
plt.show()

It is hard to judge which one is closer to normal distribution...

(2) Correlation of Error Terms

We are assuming that the residual errors are uncorrelated. In time series, however, this assumption is often violated: the residual error at time t+1 is correlated with the one at time t. We can even see a hint of such structure in the residual plot above (right figure): for y_test beyond 20, there is a clear trend in the residual values. (Solution?)
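One common diagnostic here is the Durbin-Watson statistic (values near 2 suggest little autocorrelation; it also appears in the statsmodels OLS summaries above). The sketch below applies it to the residuals computed earlier, keeping in mind that the test assumes a meaningful ordering of the observations (e.g. time), so here it is only illustrative:

from statsmodels.stats.stattools import durbin_watson

print('Durbin-Watson (poly model residuals):   ', durbin_watson(residual_ytest_poly))
print('Durbin-Watson (linear model residuals): ', durbin_watson(residual_ytest))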

(3) Non-constant variance of the error terms (a.k.a. heteroscedasticity): the errors get wider or narrower at some response values. For instance, if the variance of the errors gets larger at higher values of y, one solution might be to apply a log() or sqrt() transformation to shrink the larger values of y.

(4) One or more (a few) outliers (here, we mean points with an unusually high response value) may not have much impact on the slope/y-intercept of the linear model. However, they can drastically change the value of RSE / r2_score. Therefore, it might be a good idea to remove outliers to get a better sense of the performance of the model. How do we detect them? One way is to plot studentized residuals, i.e. $\frac{e_{i} - 0}{SE_{e}}$: we want to know how far $e_{i}$ is from a distribution with mean 0 and standard error $SE_{e}$ (the standard deviation of the residuals). If it goes beyond +/- 3, the point can be considered an outlier. It is important to note that a point might be flagged as an outlier simply because our model was not good enough to capture it.
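A sketch of how studentized residuals can be obtained from a statsmodels OLS fit via its influence object (here refitting mpg onto horsepower from the Auto data used above); points beyond roughly +/- 3 are candidate outliers:

auto_ols = sm.OLS(y, sm.add_constant(X.reshape(-1, 1))).fit()        # mpg regressed onto horsepower
student_resid = auto_ols.get_influence().resid_studentized_internal  # internally studentized residuals
print('candidate outliers (|studentized residual| > 3): ', np.where(np.abs(student_resid) > 3)[0])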

(5) High leverage points: these are points with extreme values in the features. They can strongly influence the fitted model, and therefore it is important to detect and handle them. How do we detect them? Note that searching for these points along an individual dimension is not enough, as we should consider all features together (see Fig 3.13 of the book).

So, we may have a point whose value on feature X1 is usual and whose value on feature X2 is usual, but together, in the 2-D feature space, the point may be far from the others. One way is to calculate pairwise distances and choose the point that has the largest distance to its nearest neighbor. Or, we may (somehow) choose the number of clusters k and then remove an isolated group of data points. Another way is to calculate the leverage statistic (see Eq. 3.37 in the book).

(6) Collinearity

df = pd.read_csv("./datasets/Credit.csv", index_col=0)
df.head(3)
Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance
1 14.891 3606 283 2 34 11 Male No Yes Caucasian 333
2 106.025 6645 483 3 82 15 Female Yes Yes Asian 903
3 104.593 7075 514 4 71 11 Male No No Asian 580
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
axs[0].scatter(df['Limit'], df['Rating'])
axs[1].scatter(df['Limit'], df['Age'])

axs[0].set_xlabel('Limit')
axs[1].set_xlabel('Limit')

axs[0].set_ylabel('Rating')
axs[1].set_ylabel('Age')

plt.show()

We can see a clear linear relationship between Limit and Rating (they are collinear). Since both of them tend to increase/decrease together, their individual impacts on the response are not easily separable. Therefore, the uncertainty in the estimation of the linear model parameters increases, and it is better to detect and remove collinear features.

  • we may use the correlation matrix.
  • a better way (which also works for multicollinearity) is to measure the VIF (Variance Inflation Factor) as follows:

$VIF = \frac{1}{1 - R^{2}(X_{j}|X_{-j})}$

$R^{2}(X_{j}|X_{-j})$ is the r2_score from a regression of $X_{j}$ onto $X_{-j}$ (all predictors other than $X_{j}$). So, if a particular predictor can be easily predicted from the other predictors (i.e. high r2_score and thus high VIF), we do not really need that particular predictor, as it can be obtained from the others. (Rule of thumb: a VIF exceeding 5 or 10 indicates a high level of collinearity.)

Solution: drop one of the variables with a high VIF.
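statsmodels also provides a ready-made VIF helper; the sketch below applies it to the numeric predictors of the Credit data shown above (assuming the column names from df.head()), and we would expect Limit and Rating to stand out:

from statsmodels.stats.outliers_influence import variance_inflation_factor

num_cols = ['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education']  # numeric predictors only
X_vif = sm.add_constant(df[num_cols].to_numpy().astype(float))         # design matrix with a constant column

for j, col in enumerate(num_cols, start=1):  # index 0 is the constant, so start at 1
    print(col, variance_inflation_factor(X_vif, j))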

Linear Regression-vs-KNN
If the relationship is linear, linear regression is better (even better than KNN with a large k). However, if the relationship between x and y is non-linear, then KNN (with a smaller k) can perform better. HOWEVER, this is not always the case! In fact, in high dimensions, when the relationship between x and y is non-linear, KNN usually performs worse, particularly when n << p. (Why? Because the data becomes sparse in the space. Think about 10 points on the [0,1] interval in 1-D, and then think about them in the 2-D space $[0,1]^{2}$.)

So, parametric models (like linear regression) outperform non-parametric ones when the number of samples is low and the dimension is high!
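A small sketch of this comparison on synthetic one-dimensional data with a clearly non-linear signal (the data and variable names are made up for illustration):

from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
x_syn = rng.uniform(-3, 3, size=(200, 1))
y_syn = np.sin(x_syn).ravel() + rng.normal(0, 0.1, size=200)  # non-linear relationship plus noise

x_tr, x_te, y_tr, y_te = train_test_split(x_syn, y_syn, random_state=0)

lin = LinearRegression().fit(x_tr, y_tr)
knn = KNeighborsRegressor(n_neighbors=5).fit(x_tr, y_tr)

print('linear regression r2 (test): ', r2_score(y_te, lin.predict(x_te)))
print('KNN (k=5) r2 (test):         ', r2_score(y_te, knn.predict(x_te)))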

Summary

  • Questions to ask about our data: let us assume we have data with predictors X and response variable y...
    • Is there a relationship between X and y? If yes, how strong is that relationship?
    • Which predictor(s) are associated with y, and by how much?
    • Is the relationship between X and y linear? If yes, then do linear regression. If not, can we transform the data such that the linear regression model becomes applicable?
    • Is there any interaction (synergy) between the predictors? (i.e. two predictors may have a combined impact different from the simple sum of their individual impacts)
  • Linear model: $\hat{f} = \hat{w}^{T}X + \hat{b}$
  • Side Note: see table / summary of data: df.head() df.describe()
  • What are the optimal values for $\hat{w}$ and $\hat{b}$? The optimal values of $\hat{w}$ and $\hat{b}$ are the ones that result in the minimum MSE on the train set.
    Reminder: our ultimate goal is to get low MSE on test set using the model fitted on train set.
  • MSE --> RSS (!) Let us assume the number of samples in the train set is n. Then, we know that $MSE = \frac{1}{n}\sum\limits_{i=1}^{n}{(y_{i} - \hat{y}_{i})^{2}}$ (note that $y_{i}$ is the true value of y for the i-th sample, and $\hat{y}_{i}$ is the predicted value of y for that sample). Now, if we just focus on the sum (and not the average), we have RSS (Residual Sum of Squares), and that's it! So we have $RSS = \sum\limits_{i=1}^{n}{(y_{i} - \hat{y}_{i})^{2}}$, and the optimization problem is to minimize RSS over the train-set samples (we could just as well minimize MSE; there is no difference).
  • Discussion on the estimated values $\hat{w}$ and $\hat{b}$: please note that the estimated values $\hat{w}$ and $\hat{b}$ are obtained from the current train set. So, if we change the train set or split the data into train/test in a different way, we may get different values for $\hat{w}$ and $\hat{b}$. All of these values are just estimates, not the true-but-unknown $w$ and $b$. So, what can we do? We can plot the distribution of each parameter and see how certain we are about our estimate. But how do we get that distribution? We can bootstrap the given train set (e.g. create 1000 new train sets) and then find $\hat{w}$ and $\hat{b}$ 1000 times. Now we can create a histogram of the obtained values. Let us focus on the parameter $\hat{w}$ (the same discussion applies to $\hat{b}$). The distribution shows the likelihood of getting different values for $\hat{w}$. The standard deviation of this distribution is the standard deviation of the estimate $\hat{w}$, known as the Standard Error (SE) of $\hat{w}$. Now we can see how far we are from zero. If we are significantly far from 0, it means that y is related to the predictor X (we can investigate this with the help of the p-value). Please note that a low p-value does not mean the model should be linear; it just says there is a relationship between y and x. We should still check whether a "linear model" is suitable here or not!
  • Measuring goodness-of-fit of a regression: as discussed before, we could just use MSE. However, MSE can get bigger when the values of y are larger, and therefore we need a way to normalize it. So, we use the $r2\_score$: $r2\_score = 1 - \frac{RSS}{TSS}$, where RSS (TSS) is the Residual (Total) Sum of Squares: $RSS = \sum\limits_{i=1}^{n}{(y_{i} - \hat{y}_{i})^{2}}$ and $TSS = \sum\limits_{i=1}^{n}{(y_{i} - y_{avg})^{2}}$. The closer $r2\_score$ is to one, the better! (We can see r2_score as the proportion of variance explained by the model.)
  • Scaling data in Linear Regression --> although scaling the data in linear regression does not change its accuracy, it can provide better insight into the coefficients. If the features X are not scaled to $\mu = 0$ and $\sigma = 1$, the coefficient of a feature can be impacted by the magnitude of its values; therefore, it is difficult to compare the coefficients of different features with each other. However, if we standardize each feature, the estimated parameter values corresponding to the features can be compared with each other!
  • F-statistic in Linear Regression --> in multivariate linear regression, we can get a p-value for each parameter, each corresponding to a single feature. We can also calculate a single score called the F-statistic that helps us test the null hypothesis "there is no relationship between X and y". To see the F-statistic and the p-values of the parameters, we can use statsmodels as follows:
    import statsmodels.api as sm
    #####
    X_train_sm = sm.add_constant(X_train)
    regressor = sm.OLS(y_train, X_train_sm) #OLS: Ordinary Least Squares
    regressor = regressor.fit()
    #####
    print(regressor.summary())
  • Side Note: p-value and F-statistic: please note that the p-value and F-statistic can be calculated for parametric methods. Also, getting a low p-value does not guarantee our model is good, as we still need to measure its goodness-of-fit. Calculating the p-value and F-statistic can help us better understand some aspects of the relationship between X and y. They are useful when the goal is inference and understanding what is happening.
  • RSE: We can get RSE as follows: $RSE = \sqrt{\frac{RSS}{n-p-1}}$, where $p$ is the number of features. RSE can be interpreted as the standard deviation of the unexplained variance, and has the useful property of being in the same units as the response variable.
  • Should I use all predictors? --> No... because sometimes a predictor may have a very insignificant impact and can unnecessarily make the model harder to interpret. Furthermore, two highly-correlated features may cause trouble in linear regression, because it is not clear which of them should have the higher coefficient and how each contributes to the prediction of the response variable. To select features, we should run a feature-selection process. Three common types of feature selection are: (1) Forward selection, (2) Backward selection, and (3) Mixed selection. We can use the p-value as a measure to help us decide whether we should keep a feature or not (read more on this later! Also, take a look at scikit-learn).
  • Side Note: Confidence Interval-vs-Prediction Interval --> the Confidence Interval (CI) is related to the estimated values of the parameters, and it is a way to help us understand how close $\hat{f}(x)$ is to $f(x)$. However, we also have the irreducible error $\epsilon$. To measure how close our $\hat{y}=\hat{f}(x)$ is to the true $y = f(x) + \epsilon$, we use the Prediction Interval (PI). Therefore, the PI is wider (see ... for further detail).
  • Interaction (synergy) between predictors --> visualizing the fitted model and the data can be a good way to see how the model behaves in different cases. For instance, in the advertising data set, after using multiple linear regression to model the data, we can see that the model underestimates y when the values of the predictors TV ad budget and radio ad budget are both high. So, they have some interaction effect. We can create a new variable $X_{TV}.X_{radio}$; using this new feature in addition to the two features TV and radio greatly increases the r2_score.
  • Some Notes: at the end, I would like to cover a few final points
    (1) If the data shows quadratic behavior: use polynomial features (and then apply linear regression). We can plot the residuals against the y values; we would like to see the residual values scattered around the horizontal line residual=0. We can also create a QQ plot with scipy.stats.probplot(residual_values, plot=plt). This gives us an idea of how close the residual values are to a normal distribution.
    (2) If there is correlation between $\epsilon$ values: Our assumption was that all errors are independent from each other. However, there might be correlation (i.e. $\epsilon(X)$ may give some information about the $\epsilon(X+1)$). In this case, we need some way to handle it. (feel free to search online, or wait and we may cover it later!)
    (3) Heteroscedasticity (the variance of $\epsilon$ changes): if we notice the errors getting wider around the horizontal line residual=0, that means the errors do not have a constant $\sigma$. We can use a $log(..)$ or $\sqrt{..}$ transformation on y to resolve this issue in cases where the error gets wider for larger values of y.
    (4) Outliers -and- high leverage points: here, when we talk about an outlier, we mean its y value is not in the usual range. Although this may not impact the parameters of the model, it can have a drastic impact on r2_score or other indices and may therefore mislead the user. A high leverage point is one whose features have surprising values. It is important to note that we should consider all features together to detect high leverage points (a.k.a. outliers in feature space), because it is possible that each single feature has a normal, in-range value but the data point (i.e. all features of an observation considered together) is still far from the others. This can affect the parameters of the model, and thus it is important to handle such points. But how do we find them? One way is to use pairwise distances: find the nearest neighbor of each point and select the point that has the largest distance to its nearest neighbor. Note that the discovered point only "might" be a high leverage point (for any set of samples, we can always find a point with the largest distance to its nearest neighbor, but that does not make it a high leverage point). Another way is to cluster the data points (if possible) and then remove the group that has the largest distance to its closest group (but what if that group is just a bunch of observations with some particular values for their features? If the group has 1000 members, should I consider them high leverage points? This needs further investigation!)
    (5) (Multi)collinearity: this is when two (or more) features are highly correlated with each other. They can confuse the model, as it is not clear how much actual impact each of them has on the prediction. A solution is to keep only one feature from a set of features that are highly correlated with each other. A better approach, which also checks for multicollinearity, is to calculate the VIF score: $VIF = \frac{1}{1 - R^{2}(X_{j}|X_{-j})}$, where $R^{2}(X_{j}|X_{-j})$ is the r2_score from regressing $X_{j}$ (considering the j-th feature as the response variable) on $X_{-j}$ (all features excluding the j-th feature). Therefore, if we get a high VIF, the r2_score in the denominator is high, which means $X_{j}$ can be predicted pretty well from the other features, so it provides redundant information. (Rule of thumb: drop a feature if its VIF gets above 5 or 10.) Important: we should standardize the data before calculating the VIF score for each feature.
  • Linear Regression-vs-KNN --> if the relationship is linear, linear regression is better (even better than KNN with a large k; recall that a larger k makes KNN closer to a linear model). However, if the relationship between x and y is non-linear, then KNN (with a smaller k) can perform better. HOWEVER, this is not always true! In fact, in high dimensions, when the relationship between x and y is non-linear, KNN usually performs worse. So, in short, parametric models (like linear regression) outperform non-parametric ones when the number of samples is low and the dimension is high!

Lab

df = pd.read_csv("./datasets/Boston.csv", index_col=0)
df.head(3)
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
1 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
3 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
df = df.drop('black' ,axis='columns')
df.describe(include='all')
crim zn indus chas nox rm age dis rad tax ptratio lstat medv
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 12.653063 22.532806
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 11.360000 21.200000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 37.970000 50.000000
_ = df.hist(figsize=(24,15))
X = df['lstat'].to_numpy()
y = df.iloc[:,-1].to_numpy()

X = X.reshape(-1,1) 

print('shape of X: ', X.shape)
print('shape of y: ', y.shape)
shape of X:  (506, 1)
shape of y:  (506,)
# if we want to get some statistical information, we should use  `statsmodels`

import statsmodels.api as sm
X_sm = sm.add_constant(X)
regressor = sm.OLS(y, X_sm) 
regressor = regressor.fit()
print(regressor.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.544
Model:                            OLS   Adj. R-squared:                  0.543
Method:                 Least Squares   F-statistic:                     601.6
Date:                Sun, 27 Feb 2022   Prob (F-statistic):           5.08e-88
Time:                        15:11:50   Log-Likelihood:                -1641.5
No. Observations:                 506   AIC:                             3287.
Df Residuals:                     504   BIC:                             3295.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         34.5538      0.563     61.415      0.000      33.448      35.659
x1            -0.9500      0.039    -24.528      0.000      -1.026      -0.874
==============================================================================
Omnibus:                      137.043   Durbin-Watson:                   0.892
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              291.373
Skew:                           1.453   Prob(JB):                     5.36e-64
Kurtosis:                       5.319   Cond. No.                         29.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

remarks:
(1) we can quickly see that the p-value is (almost) 0. This suggests that there is a relationship between the response variable and the predictor lstat (we reach the same conclusion by looking at the F-statistic). Note that a low p-value does not mean our model is good: looking at the R-squared, the score is close to 0.5, which shows that a linear model with a single predictor cannot appropriately explain the variance in the target value.

(2) we have the Standard Error (SE) of the parameters, so we can approximately say the Confidence Interval (CI) is between $\mu - 2 \sigma$ and $\mu + 2\sigma$ (where $\sigma$ is the SE). So, for the coefficient of x1, the CI is roughly [-0.9500 - 2*0.039, -0.9500 + 2*0.039].

(3) How to find CI and PI for the response variable y ?!?
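One possible answer, sketched with statsmodels: get_prediction() returns both intervals, where the mean_ci_* columns form the confidence interval for the mean response and the obs_ci_* columns form the (wider) prediction interval for an individual observation:

pred = regressor.get_prediction(X_sm)          # regressor is the fitted OLS model from above
print(pred.summary_frame(alpha=0.05).head(3))  # mean_ci_lower/upper (CI) and obs_ci_lower/upper (PI)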

(4) let us plot x-y (see below)

plt.scatter(X.ravel(), y, color = 'b')
plt.plot(X.ravel(), regressor.predict(X_sm), color = 'r')

plt.title('data and fitted simple linear model')
plt.xlabel('single feature: lstat')
plt.ylabel('response variable: medv')
plt.show()

side note: it can be seen that there is a quadratic behavior in the data... so, using polynomial features might help.

(5-I) let us plot residual-y

y_residual = regressor.predict(X_sm) - y

plt.scatter(y, y_residual, color='b')
plt.xlabel('y value')
plt.ylabel('y_residual')
plt.axhline(y=0, color='r')
plt.show()


plt.hist(y_residual)
plt.title('histogram of y_residual')
plt.show()

stats.probplot(y_residual, plot=plt)
plt.title('QQplot for y_residual')
plt.show()

we can see that the QQ plot deviates from the straight line; therefore, the residuals do not follow a normal distribution. The residual-vs-y scatter plot also shows a pattern at higher values of y: in fact, the residuals follow a clear trend for y > 30.

(6-II) let us plot studentized residual-y (?!) --> LATER!

(7) Find outliers and high leverage point (How??)
Recall that outliers are unusual in the y-space, while a high leverage point is an outlier in the feature space.

from sklearn.metrics import pairwise_distances
D_y = pairwise_distances(y.reshape(-1,1))
np.fill_diagonal(D_y, np.inf) #Because of np.min(...) in next line
dist_to_nn = np.min(D_y, axis=1)
outlier_idx = np.argmax(dist_to_nn)
outlier_idx
180
D_x = pairwise_distances(X.reshape(-1,1))
np.fill_diagonal(D_x, np.inf) #Because of np.min(...) in next line
dist_to_nn = np.min(D_x, axis=1)
leverage_idx = np.argmax(dist_to_nn)
leverage_idx
387
plt.scatter(X.ravel(), y, color = 'b', label='data')
plt.plot(X.ravel(), regressor.predict(X_sm), color = 'r', label='fitted model')

plt.scatter(X[outlier_idx], y[outlier_idx], color='orange', label='outlier')
plt.scatter(X[leverage_idx], y[leverage_idx], color='black', label='leverage')

plt.title('data and fitted simple linear model')
plt.xlabel('single feature: lstat')
plt.ylabel('response variable: medv')

plt.legend()
plt.show()

(8) Let us use all predictors

X = df.iloc[:,:-1].to_numpy()
y = df.iloc[:,-1].to_numpy()

X_sm = sm.add_constant(X)

regressor = sm.OLS(y, X_sm) 
regressor = regressor.fit()

print(regressor.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.734
Model:                            OLS   Adj. R-squared:                  0.728
Method:                 Least Squares   F-statistic:                     113.5
Date:                Sun, 27 Feb 2022   Prob (F-statistic):          2.23e-133
Time:                        15:11:51   Log-Likelihood:                -1504.9
No. Observations:                 506   AIC:                             3036.
Df Residuals:                     493   BIC:                             3091.
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         41.6173      4.936      8.431      0.000      31.919      51.316
x1            -0.1214      0.033     -3.678      0.000      -0.186      -0.057
x2             0.0470      0.014      3.384      0.001       0.020       0.074
x3             0.0135      0.062      0.217      0.829      -0.109       0.136
x4             2.8400      0.870      3.264      0.001       1.131       4.549
x5           -18.7580      3.851     -4.870      0.000     -26.325     -11.191
x6             3.6581      0.420      8.705      0.000       2.832       4.484
x7             0.0036      0.013      0.271      0.787      -0.023       0.030
x8            -1.4908      0.202     -7.394      0.000      -1.887      -1.095
x9             0.2894      0.067      4.325      0.000       0.158       0.421
x10           -0.0127      0.004     -3.337      0.001      -0.020      -0.005
x11           -0.9375      0.132     -7.091      0.000      -1.197      -0.678
x12           -0.5520      0.051    -10.897      0.000      -0.652      -0.452
==============================================================================
Omnibus:                      171.096   Durbin-Watson:                   1.077
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              709.937
Skew:                           1.477   Prob(JB):                    6.90e-155
Kurtosis:                       7.995   Cond. No.                     1.17e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.17e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Remarks:

  • The F-statistic is large, which indicates there is a relationship between the features and y.
  • The p-values for x3 and x7 are large, which suggests these two features have little relationship with the response variable y.
  • How can we check for (multi)collinearity? Let us calculate the VIF (variance inflation factor) as follows:
def _get_VIF(X):
    """
    Calculate the VIF (variance inflation factor) of each feature.
    
    Parameters
    ----------
    X: numpy.ndarray
        feature matrix of shape (n,p) where n is the number of observations and p is the number of features
    
    
    Returns
    ---------
    a vector of size p that contains the VIF of each feature 
    """
    if X.shape[1] <= 1:
        raise ValueError('the second dimension of X must be at least 2')
    
    p = X.shape[1]
    
    VIF_scores = np.full(p, -1, dtype=np.float64)
    
    for i in range(p):
        # regress feature i on all the remaining features
        predictors = np.delete(X, i, axis=1)
        target = X[:,i].reshape(-1,1)
        
        regressor = LinearRegression().fit(predictors, target)
        r_squared_score = r2_score(target, regressor.predict(predictors))
        
        # VIF_i = 1 / (1 - R_i^2); the small epsilon avoids division by zero
        # when feature i is perfectly explained by the other features
        VIF_scores[i] = 1 / (1 - r_squared_score + 1e-6)
    
    return VIF_scores
X = df.iloc[:,:-1].to_numpy()
y = df.iloc[:,-1].to_numpy()

VIF_scores = _get_VIF(X)
VIF_scores
array([1.23236186, 1.46873125, 2.2794307 , 1.00443276, 2.46684754,
       1.29475809, 1.84240814, 2.26320274, 3.99063224, 4.76575827,
       1.24490007, 1.73811057])
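As a cross-check (a sketch), statsmodels ships its own VIF helper, which runs the same auxiliary regressions. Adding a constant column first keeps those regressions centred, so the values are on the same footing as _get_VIF:

from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)
[variance_inflation_factor(X_const, i) for i in range(1, X_const.shape[1])]  # skip the constant itself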

Remarks:
The VIF is noticeably higher for features 8 and 9 (in numpy indexing) than for the rest, which suggests these two variables can largely be explained by the others. Although both have a high VIF, it is likely that each one is mainly inflated by the other; in other words, removing both of them may cause a loss of information. Let us take a look at the correlation matrix:

df_X = pd.DataFrame(X)
corr = df_X.corr()
corr.style.background_gradient(cmap='coolwarm')
  0 1 2 3 4 5 6 7 8 9 10 11
0 1.000000 -0.200469 0.406583 -0.055892 0.420972 -0.219247 0.352734 -0.379670 0.625505 0.582764 0.289946 0.455621
1 -0.200469 1.000000 -0.533828 -0.042697 -0.516604 0.311991 -0.569537 0.664408 -0.311948 -0.314563 -0.391679 -0.412995
2 0.406583 -0.533828 1.000000 0.062938 0.763651 -0.391676 0.644779 -0.708027 0.595129 0.720760 0.383248 0.603800
3 -0.055892 -0.042697 0.062938 1.000000 0.091203 0.091251 0.086518 -0.099176 -0.007368 -0.035587 -0.121515 -0.053929
4 0.420972 -0.516604 0.763651 0.091203 1.000000 -0.302188 0.731470 -0.769230 0.611441 0.668023 0.188933 0.590879
5 -0.219247 0.311991 -0.391676 0.091251 -0.302188 1.000000 -0.240265 0.205246 -0.209847 -0.292048 -0.355501 -0.613808
6 0.352734 -0.569537 0.644779 0.086518 0.731470 -0.240265 1.000000 -0.747881 0.456022 0.506456 0.261515 0.602339
7 -0.379670 0.664408 -0.708027 -0.099176 -0.769230 0.205246 -0.747881 1.000000 -0.494588 -0.534432 -0.232471 -0.496996
8 0.625505 -0.311948 0.595129 -0.007368 0.611441 -0.209847 0.456022 -0.494588 1.000000 0.910228 0.464741 0.488676
9 0.582764 -0.314563 0.720760 -0.035587 0.668023 -0.292048 0.506456 -0.534432 0.910228 1.000000 0.460853 0.543993
10 0.289946 -0.391679 0.383248 -0.121515 0.188933 -0.355501 0.261515 -0.232471 0.464741 0.460853 1.000000 0.374044
11 0.455621 -0.412995 0.603800 -0.053929 0.590879 -0.613808 0.602339 -0.496996 0.488676 0.543993 0.374044 1.000000

The correlation matrix shows a high correlation (more than 0.9) between features 8 and 9. Therefore, removing feature 9 alone should already bring the VIF score of feature 8 down.
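We can verify that quickly with the _get_VIF helper defined above (a small sketch):

# drop only feature 9 and recompute the VIF scores
_get_VIF(np.delete(X, 9, axis=1))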

In addition, we will delete features 2 and 6 (numpy indexing; these are x3 and x7 in the summary above), as they have high p-values. Let us take a look at the model again:

X_clean_v1 = np.delete(X, [2, 6, 9], axis=1)

X_sm = sm.add_constant(X_clean_v1)
regressor = sm.OLS(y, X_sm) 
regressor = regressor.fit()

print(regressor.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.727
Model:                            OLS   Adj. R-squared:                  0.722
Method:                 Least Squares   F-statistic:                     146.9
Date:                Sun, 27 Feb 2022   Prob (F-statistic):          8.23e-134
Time:                        15:11:51   Log-Likelihood:                -1511.5
No. Observations:                 506   AIC:                             3043.
Df Residuals:                     496   BIC:                             3085.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         39.9841      4.945      8.085      0.000      30.268      49.700
x1            -0.1185      0.033     -3.559      0.000      -0.184      -0.053
x2             0.0366      0.014      2.695      0.007       0.010       0.063
x3             3.1394      0.870      3.610      0.000       1.431       4.848
x4           -21.3757      3.501     -6.106      0.000     -28.254     -14.497
x5             3.8506      0.411      9.368      0.000       3.043       4.658
x6            -1.4508      0.189     -7.674      0.000      -1.822      -1.079
x7             0.1046      0.041      2.569      0.010       0.025       0.185
x8            -1.0018      0.130     -7.677      0.000      -1.258      -0.745
x9            -0.5535      0.048    -11.537      0.000      -0.648      -0.459
==============================================================================
Omnibus:                      160.486   Durbin-Watson:                   1.085
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              623.744
Skew:                           1.400   Prob(JB):                    3.59e-136
Kurtosis:                       7.663   Cond. No.                         774.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
VIF_scores = _get_VIF(X_clean_v1)
VIF_scores
array([1.23099938, 1.40259561, 1.00220154, 2.06024597, 1.24327435,
       1.99655042, 1.65730221, 1.21072661, 1.57340251])
#Note that the R-squared barely changed (0.734 -> 0.727) after dropping three features.
#The reason is that when a feature is not important, its coefficient is almost zero, and when there is
#multicollinearity, the effect is shared among the correlated features.

#However, removing redundant features makes the model easier to interpret!
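To see the "effect is shared among correlated features" point concretely, here is a tiny simulation on synthetic data (an illustrative sketch, not part of the dataset above):

# two noisy copies of the same underlying variable, and a response whose true coefficient on it is 3
rng = np.random.default_rng(0)
z = rng.normal(size=1000)
x1 = z + 0.1 * rng.normal(size=1000)
x2 = z + 0.1 * rng.normal(size=1000)
y_sim = 3 * z + rng.normal(size=1000)

coefs = LinearRegression().fit(np.column_stack([x1, x2]), y_sim).coef_
print(coefs, coefs.sum())  # each coefficient is roughly 1.5, and their sum stays close to 3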

(9) Add interaction between lstat and age
(here, we are using the unclean version of X, i.e. not the one obtained after removing the features with high VIF or high p-values)

X_df = df.iloc[:,:-1].copy()  # .copy() avoids a SettingWithCopyWarning when adding the new column
y_df = df.iloc[:,-1]

X_df['interaction'] = X_df['age'] * X_df['lstat']
X_df
crim zn indus chas nox rm age dis rad tax ptratio lstat interaction
1 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 324.696
2 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 721.146
3 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 246.233
4 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 134.652
5 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 288.886
... ... ... ... ... ... ... ... ... ... ... ... ... ...
502 0.06263 0.0 11.93 0 0.573 6.593 69.1 2.4786 1 273 21.0 9.67 668.197
503 0.04527 0.0 11.93 0 0.573 6.120 76.7 2.2875 1 273 21.0 9.08 696.436
504 0.06076 0.0 11.93 0 0.573 6.976 91.0 2.1675 1 273 21.0 5.64 513.240
505 0.10959 0.0 11.93 0 0.573 6.794 89.3 2.3889 1 273 21.0 6.48 578.664
506 0.04741 0.0 11.93 0 0.573 6.030 80.8 2.5050 1 273 21.0 7.88 636.704

506 rows × 13 columns

X = X_df.to_numpy()
y = y_df.to_numpy()


#clean it
X_clean_v1 = np.delete(X, [2, 6, 9], axis=1)

#let us see the result:
X_sm = sm.add_constant(X_clean_v1)

regressor = sm.OLS(y, X_sm) 
regressor = regressor.fit()

print(regressor.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.728
Model:                            OLS   Adj. R-squared:                  0.722
Method:                 Least Squares   F-statistic:                     132.2
Date:                Sun, 27 Feb 2022   Prob (F-statistic):          7.69e-133
Time:                        15:11:51   Log-Likelihood:                -1511.2
No. Observations:                 506   AIC:                             3044.
Df Residuals:                     495   BIC:                             3091.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         41.1792      5.189      7.936      0.000      30.985      51.374
x1            -0.1208      0.033     -3.611      0.000      -0.187      -0.055
x2             0.0352      0.014      2.574      0.010       0.008       0.062
x3             3.1276      0.870      3.594      0.000       1.418       4.837
x4           -22.1390      3.642     -6.079      0.000     -29.295     -14.983
x5             3.7727      0.424      8.905      0.000       2.940       4.605
x6            -1.4165      0.194     -7.287      0.000      -1.798      -1.035
x7             0.1079      0.041      2.634      0.009       0.027       0.188
x8            -1.0122      0.131     -7.711      0.000      -1.270      -0.754
x9            -0.6372      0.120     -5.324      0.000      -0.872      -0.402
x10            0.0008      0.001      0.764      0.445      -0.001       0.003
==============================================================================
Omnibus:                      159.161   Durbin-Watson:                   1.094
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              608.497
Skew:                           1.394   Prob(JB):                    7.35e-133
Kurtosis:                       7.592   Cond. No.                     3.43e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.43e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Did we see any difference in the R-squared score? No (0.727 vs. 0.728), and the interaction term (x10) has a large p-value of 0.445. That suggests there is no meaningful interaction between age and lstat. (Or should we scale the data first? See the sketch below.)
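As a quick check of the scaling question, here is a sketch that standardizes all 12 features before forming the age*lstat interaction. Since age and lstat both stay in the design, the standardized interaction spans exactly the same column space as the raw one, so scaling cannot change the fit:

# standardize first, then build the interaction term
X_all = df.iloc[:, :-1].to_numpy()
X_all_scaled = StandardScaler().fit_transform(X_all)

age_idx = df.columns.get_loc('age')
lstat_idx = df.columns.get_loc('lstat')
inter_scaled = (X_all_scaled[:, age_idx] * X_all_scaled[:, lstat_idx]).reshape(-1, 1)

fit = sm.OLS(y, sm.add_constant(np.hstack([X_all_scaled, inter_scaled]))).fit()
print(fit.rsquared)  # same R-squared as using the raw interaction with the full feature set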

(10) Recall that the relationship between y and lstat was not linear; it looked quadratic-ish! So let us run the regression again, this time adding an $X^{2}$ feature:

X = df['lstat'].to_numpy()
y = df.iloc[:,-1].to_numpy()
from sklearn.preprocessing import PolynomialFeatures
X_poly = PolynomialFeatures(degree=2).fit_transform(X.reshape(-1,1))
X_poly
array([[ 1.    ,  4.98  , 24.8004],
       [ 1.    ,  9.14  , 83.5396],
       [ 1.    ,  4.03  , 16.2409],
       ...,
       [ 1.    ,  5.64  , 31.8096],
       [ 1.    ,  6.48  , 41.9904],
       [ 1.    ,  7.88  , 62.0944]])
regressor = sm.OLS(y, X_poly) 
regressor = regressor.fit()

print(regressor.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.641
Model:                            OLS   Adj. R-squared:                  0.639
Method:                 Least Squares   F-statistic:                     448.5
Date:                Sun, 27 Feb 2022   Prob (F-statistic):          1.56e-112
Time:                        15:11:51   Log-Likelihood:                -1581.3
No. Observations:                 506   AIC:                             3169.
Df Residuals:                     503   BIC:                             3181.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         42.8620      0.872     49.149      0.000      41.149      44.575
x1            -2.3328      0.124    -18.843      0.000      -2.576      -2.090
x2             0.0435      0.004     11.628      0.000       0.036       0.051
==============================================================================
Omnibus:                      107.006   Durbin-Watson:                   0.921
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              228.388
Skew:                           1.128   Prob(JB):                     2.55e-50
Kurtosis:                       5.397   Cond. No.                     1.13e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Remarks:

  • The p-values are all close to zero, which suggests both features are related to the response variable.
  • The R-squared value increased by ~20% compared with the simple linear fit.
plt.scatter(X, y, color='b', label='data')
plt.scatter(X, regressor.predict(X_poly), color='r', label='fitted model')

plt.xlabel('feature: lstat')
plt.ylabel('response variable')

plt.legend()
plt.show()
 
_get_VIF(X_poly[:,1:])
array([6.72828883, 6.72828883])

Now we can see that the VIF scores for these two features are high; however, we also noticed an increase in the R-squared value after adding $X^{2}$.

The question is: should we trust the VIF and always remove a feature with a high VIF? OK, let us drop the original feature and keep only its squared version:

regressor = sm.OLS(y, X_poly[:,[0,2]]) #here we are skipping [:,1] as that corresponds to original feature 
regressor = regressor.fit()

print(regressor.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.387
Model:                            OLS   Adj. R-squared:                  0.386
Method:                 Least Squares   F-statistic:                     318.3
Date:                Sun, 27 Feb 2022   Prob (F-statistic):           1.50e-55
Time:                        15:11:51   Log-Likelihood:                -1716.4
No. Observations:                 506   AIC:                             3437.
Df Residuals:                     504   BIC:                             3445.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         27.6474      0.430     64.308      0.000      26.803      28.492
x1            -0.0242      0.001    -17.842      0.000      -0.027      -0.022
==============================================================================
Omnibus:                      148.413   Durbin-Watson:                   0.848
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              323.708
Skew:                           1.568   Prob(JB):                     5.10e-71
Kurtosis:                       5.349   Cond. No.                         425.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Now we can see from the drop in R-squared (0.641 → 0.387) that we should have kept the original feature. So, if that is the case, why did we get high VIF values earlier?

Possible explanation (discussed on Stack Overflow; it needs further explanation): this is "structural" collinearity, created by computing $X^{2}$ from an uncentered $X$. Centering (or standardizing) the feature before forming the polynomial term removes most of it:
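A tiny illustration of why this works (a sketch; note it is the centering step of StandardScaler that matters here, since rescaling alone does not change correlations):

# correlation between lstat and its square, before and after centering
lstat = df['lstat'].to_numpy()
print(np.corrcoef(lstat, lstat**2)[0, 1])        # very high

lstat_c = lstat - lstat.mean()
print(np.corrcoef(lstat_c, lstat_c**2)[0, 1])    # substantially lower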

X = df['lstat'].to_numpy()
y = df.iloc[:,-1].to_numpy()

X_scaled = StandardScaler().fit_transform(X.reshape(-1,1))
X_poly = PolynomialFeatures(degree=2).fit_transform(X_scaled)

regressor = sm.OLS(y, X_poly)
regressor = regressor.fit()

print(regressor.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.641
Model:                            OLS   Adj. R-squared:                  0.639
Method:                 Least Squares   F-statistic:                     448.5
Date:                Sun, 27 Feb 2022   Prob (F-statistic):          1.56e-112
Time:                        15:11:51   Log-Likelihood:                -1581.3
No. Observations:                 506   AIC:                             3169.
Df Residuals:                     503   BIC:                             3181.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         20.3165      0.311     65.357      0.000      19.706      20.927
x1            -8.7807      0.300    -29.273      0.000      -9.370      -8.191
x2             2.2163      0.191     11.628      0.000       1.842       2.591
==============================================================================
Omnibus:                      107.006   Durbin-Watson:                   0.921
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              228.388
Skew:                           1.128   Prob(JB):                     2.55e-50
Kurtosis:                       5.397   Cond. No.                         3.16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
_get_VIF(X_poly[:,1:])
array([1.12205353, 1.12205353])

NOTE: Now we can see that the VIF values are low. So, we recommend scaling your data even in linear regression, because it keeps the VIF values meaningful. (Feel free to go back to the previous questions and try scaling X first.)

(11) Handling categorical features

df = pd.read_csv('./datasets/Carseats.csv', index_col = 0)
df.head(3)
Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US
1 9.50 138 73 11 276 120 Bad 42 17 Yes Yes
2 11.22 111 48 16 260 83 Good 65 10 Yes Yes
3 10.06 113 35 10 269 80 Medium 59 12 Yes Yes
_ = df.hist(figsize=(10,10))
cat_features = list(set(df.columns) - set(df._get_numeric_data()))
cat_features
['US', 'ShelveLoc', 'Urban']
from collections import Counter

for feature in cat_features:
    tags_counts = Counter(df[feature])
    
    plt.bar(list(tags_counts.keys()), tags_counts.values())
    plt.show()
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


ct = ColumnTransformer([('ohe', OneHotEncoder(),cat_features)], remainder='passthrough')
X = ct.fit_transform(df.iloc[:,1:]) #note that the first column of df is actually the response variable
X
array([[  0.,   1.,   1., ..., 120.,  42.,  17.],
       [  0.,   1.,   0., ...,  83.,  65.,  10.],
       [  0.,   1.,   0., ...,  80.,  59.,  12.],
       ...,
       [  0.,   1.,   0., ..., 159.,  40.,  18.],
       [  0.,   1.,   1., ...,  95.,  50.,  12.],
       [  0.,   1.,   0., ..., 120.,  49.,  16.]])
ct.get_feature_names_out()
array(['ohe__US_No', 'ohe__US_Yes', 'ohe__ShelveLoc_Bad',
       'ohe__ShelveLoc_Good', 'ohe__ShelveLoc_Medium', 'ohe__Urban_No',
       'ohe__Urban_Yes', 'remainder__CompPrice', 'remainder__Income',
       'remainder__Advertising', 'remainder__Population',
       'remainder__Price', 'remainder__Age', 'remainder__Education'],
      dtype=object)
y = df.iloc[:,0]
regressor = sm.OLS(y, X).fit()
print(regressor.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Sales   R-squared:                       0.873
Model:                            OLS   Adj. R-squared:                  0.870
Method:                 Least Squares   F-statistic:                     243.4
Date:                Sun, 27 Feb 2022   Prob (F-statistic):          1.60e-166
Time:                        15:11:53   Log-Likelihood:                -568.99
No. Observations:                 400   AIC:                             1162.
Df Residuals:                     388   BIC:                             1210.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             3.0542      0.232     13.141      0.000       2.597       3.511
x2             2.8701      0.235     12.196      0.000       2.407       3.333
x3            -0.2942      0.170     -1.734      0.084      -0.628       0.039
x4             4.5560      0.172     26.506      0.000       4.218       4.894
x5             1.6625      0.161     10.319      0.000       1.346       1.979
x6             2.9007      0.228     12.702      0.000       2.452       3.350
x7             3.0236      0.229     13.208      0.000       2.573       3.474
x8             0.0928      0.004     22.378      0.000       0.085       0.101
x9             0.0158      0.002      8.565      0.000       0.012       0.019
x10            0.1231      0.011     11.066      0.000       0.101       0.145
x11            0.0002      0.000      0.561      0.575      -0.001       0.001
x12           -0.0954      0.003    -35.700      0.000      -0.101      -0.090
x13           -0.0460      0.003    -14.472      0.000      -0.052      -0.040
x14           -0.0211      0.020     -1.070      0.285      -0.060       0.018
==============================================================================
Omnibus:                        0.811   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.667   Jarque-Bera (JB):                0.765
Skew:                           0.107   Prob(JB):                        0.682
Kurtosis:                       2.994   Cond. No.                     1.23e+19
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.23e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
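Since the statsmodels summary only labels the columns x1..x14, it can help to refit with named columns. A sketch using pd.get_dummies (an alternative to ColumnTransformer that produces the same dummy columns, just in a different order):

# same encoding via pandas, keeping the column names for a labelled output
X_named = pd.get_dummies(df.iloc[:, 1:], columns=cat_features, dtype=float)
named_fit = sm.OLS(df['Sales'], X_named).fit()
print(named_fit.params)  # same coefficient values as above, now with readable names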

Remarks:
We can see large p-values (i.e. > 0.05) for x3 (ShelveLoc_Bad), x11 (Population) and x14 (Education). For now, we prefer to remove only x11 and x14, because the p-value of x3 (0.084) is close to the common 0.05 threshold.

X = np.delete(X, obj=[10,13], axis=1)
regressor = sm.OLS(y, X).fit()
print(regressor.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Sales   R-squared:                       0.873
Model:                            OLS   Adj. R-squared:                  0.870
Method:                 Least Squares   F-statistic:                     297.6
Date:                Sun, 27 Feb 2022   Prob (F-statistic):          1.20e-168
Time:                        15:11:53   Log-Likelihood:                -569.82
No. Observations:                 400   AIC:                             1160.
Df Residuals:                     390   BIC:                             1200.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             2.9755      0.194     15.316      0.000       2.594       3.357
x2             2.7886      0.211     13.234      0.000       2.374       3.203
x3            -0.3474      0.149     -2.336      0.020      -0.640      -0.055
x4             4.5046      0.155     29.080      0.000       4.200       4.809
x5             1.6069      0.142     11.304      0.000       1.327       1.886
x6             2.8205      0.195     14.484      0.000       2.438       3.203
x7             2.9435      0.199     14.762      0.000       2.552       3.336
x8             0.0925      0.004     22.407      0.000       0.084       0.101
x9             0.0159      0.002      8.621      0.000       0.012       0.019
x10            0.1247      0.011     11.810      0.000       0.104       0.145
x11           -0.0953      0.003    -35.722      0.000      -0.101      -0.090
x12           -0.0462      0.003    -14.528      0.000      -0.052      -0.040
==============================================================================
Omnibus:                        0.610   Durbin-Watson:                   2.007
Prob(Omnibus):                  0.737   Jarque-Bera (JB):                0.639
Skew:                           0.094   Prob(JB):                        0.727
Kurtosis:                       2.944   Cond. No.                     1.97e+18
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.84e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
 
X_scaled = StandardScaler().fit_transform(X)
_get_VIF(X_scaled)
array([1.00000000e+06, 1.00000000e+06, 1.00000000e+06, 1.00000000e+06,
       1.00000000e+06, 1.00000000e+06, 1.00000000e+06, 1.13260045e+00,
       1.00035432e+00, 1.28567546e+00, 1.13675082e+00, 1.00020284e+00])
#The 1e6 values belong to the dummy columns: given the other encoded levels of a categorical feature, the
#remaining one is fully determined (the dummy-variable trap), so these columns are perfectly collinear.
#So, let us drop the first level of each categorical feature as we do the encoding...
ct = ColumnTransformer([('ohe', OneHotEncoder(drop='first'),cat_features)], remainder='passthrough')
X = ct.fit_transform(df.iloc[:,1:]) #note that the first column of df is actually the response variable
y = df.iloc[:,0]
regressor = sm.OLS(y, X).fit()
print(regressor.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                  Sales   R-squared (uncentered):                   0.981
Model:                            OLS   Adj. R-squared (uncentered):              0.980
Method:                 Least Squares   F-statistic:                              1801.
Date:                Sun, 27 Feb 2022   Prob (F-statistic):                        0.00
Time:                        15:11:53   Log-Likelihood:                         -609.87
No. Observations:                 400   AIC:                                      1242.
Df Residuals:                     389   BIC:                                      1286.
Df Model:                          11                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.0333      0.165     -0.202      0.840      -0.357       0.291
x2             5.0276      0.168     29.914      0.000       4.697       5.358
x3             2.1366      0.138     15.497      0.000       1.866       2.408
x4             0.2296      0.124      1.847      0.066      -0.015       0.474
x5             0.1159      0.004     31.356      0.000       0.109       0.123
x6             0.0209      0.002     10.738      0.000       0.017       0.025
x7             0.1162      0.012      9.462      0.000       0.092       0.140
x8             0.0012      0.000      3.190      0.002       0.000       0.002
x9            -0.0950      0.003    -32.156      0.000      -0.101      -0.089
x10           -0.0356      0.003    -10.795      0.000      -0.042      -0.029
x11            0.0688      0.019      3.608      0.000       0.031       0.106
==============================================================================
Omnibus:                        0.981   Durbin-Watson:                   2.075
Prob(Omnibus):                  0.612   Jarque-Bera (JB):                1.009
Skew:                          -0.007   Prob(JB):                        0.604
Kurtosis:                       2.754   Cond. No.                     1.20e+03
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 1.2e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Reducing the number of dummy variables by dropping one level per categorical feature (e.g. for a binary feature we can keep a single column: if it is not 1, it must be 0) removes the perfect collinearity among the dummies. Note, however, that the reported R-squared (0.981) is the uncentered version, because this model no longer contains a constant term (see note [1] above), so it is not directly comparable with the earlier 0.873. (Also note that dropping a level makes the dummy encoding asymmetric with respect to the dropped category; further investigation could be presented here.)

Most of the p-values are small now, although x1 still has a large one (0.840) and x4 is borderline (0.066). Let us check out the VIF:

X_scaled = StandardScaler().fit_transform(X)
_get_VIF(X_scaled)
array([1.32477648, 1.1292939 , 1.13174355, 1.00049212, 1.14583405,
       1.0005818 , 1.3795381 , 1.01640417, 1.13906471, 1.00042423,
       1.00065819])

Now we can see that the VIF scores are all low... All good :)
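One last sketch (an addition, not run in the original): with drop='first' it is more conventional to also add an explicit intercept, so that the dropped levels are absorbed into the constant and statsmodels reports the usual centered R-squared:

# drop='first' encoding plus an explicit intercept
X_const = sm.add_constant(X)
final_fit = sm.OLS(y, X_const).fit()
print(final_fit.rsquared, final_fit.rsquared_adj)  # the centered R-squared should match the 0.873 seen earlier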