Note: The content of this post is based on my understanding of Chapter 02 of the book Introduction to Statistical Learning, where the authors discuss basic concepts in statistical learning such as parametric vs. non-parametric methods, bias vs. variance, etc.

If you haven't read this chapter of the book, I recommend reading it first (you can find its free online version via the provided link). Please feel free to change the code and see how the results behave. You may find something interesting...

Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score

What is "Statistical Learning"?

Let us start with the Advertising dataset.

df = pd.read_csv("./datasets/Advertising.csv", index_col=0)
df.head(3)
TV Radio Newspaper Sales
1 230.1 37.8 69.2 22.1
2 44.5 39.3 45.1 10.4
3 17.2 45.9 69.3 9.3

We can think of the first three columns as features (predictors, independent variables) and the last column as the output (target, dependent variable, also known as response variable).
NOTE: Since the target value is continuous, this can be considered a regression problem. In regression, the goal is to predict a continuous variable.

X = df.iloc[:,:-1]
y = df.iloc[:,-1]
axs = X.hist(figsize=(20,5), layout=(1,3)) 

# let us change the title of figures
for ax in axs[0]:
    ax.set_title(ax.get_title() + ' ' + 'ad cost') 

We can plot the values of each feature against the target, and see if there is any particular relationship between them.

I am going to use sklearn.linear_model.LinearRegression to fit the best line to the data, considering only one feature at a time! (LinearRegression will be discussed comprehensively in the next chapter.)

fig, axs = plt.subplots(1,3, figsize=(20,5))
for i, feature in enumerate(X.columns):
    X_feature = X[feature].to_numpy()
    regressor = LinearRegression().fit(X_feature.reshape(-1,1), y) 
    
    axs[i].scatter(X_feature, y, color='b')
    axs[i].plot(X_feature, regressor.predict(X_feature.reshape(-1,1)), color='r')
    
    axs[i].set_title(f'Sales-{feature}')
    axs[i].set_xlabel(f'{feature} ad cost')
    axs[i].set_ylabel('Sales')

plt.show()
  • As TV ad cost increases, there is an increase in the values of Sales (i.e. the response variable).
  • For Newspaper, the fitted line has a lot of error, so it cannot be a good model for the data.

Output Target = f(Input Variables) + Error, i.e. $Y = f(X) + \epsilon$, where $f$ is a fixed but unknown function.
NOTE: The error $\epsilon$ is considered to be independent of the input, and it has a mean of 0.

Statistical Learning is a set of approaches for estimating $f$ (the estimate of $f$ is denoted $\hat{f}$).

$y = f(X) + \epsilon$, and our estimate is: $\hat{y} = \hat{f}(X)$.

$E[(Y-\hat{Y})^{2}] = E[(f(X) + \epsilon - \hat{f}(X))^{2}] = [f(X) - \hat{f}(X)]^{2} + Var(\epsilon)$

As shown above, finding a better approximation of $f$ can reduce $[f(X) - \hat{f}(X)]^{2}$ (the reducible error); however, it cannot reduce $Var(\epsilon)$. That is why $Var(\epsilon)$ is called the irreducible error.
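To make this concrete, here is a small simulation sketch (toy data of my own, not from the book): we generate data from a known f with noise of variance 4, fit a linear $\hat{f}$, and check that the error does not drop below $Var(\epsilon)$.

# hypothetical simulation: even a good estimate of f cannot beat Var(eps)
rng = np.random.default_rng(0)

n = 10_000
X_sim = rng.uniform(0, 10, size=(n, 1))
eps = rng.normal(0, 2, size=n)                 # irreducible error, Var(eps) = 4
y_sim = 3 * X_sim[:, 0] + 2 + eps              # the "fixed but unknown" f(X) = 3X + 2

f_hat = LinearRegression().fit(X_sim, y_sim)   # our estimate of f
mse = np.mean((y_sim - f_hat.predict(X_sim)) ** 2)
print(f'MSE of f_hat: {mse:.2f} (it cannot go much below Var(eps) = 4)')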

NOTE: Some machine learning models are considered black boxes. However, if we are interested in understanding the relationship (association) between the output and the input variables (i.e. inference), we need an explicit, interpretable estimate of f. Otherwise, all we can see is just input -> output.

What kind of questions can we ask about our data?

  • Which input variables (predictors) are (more) associated with the output (i.e. response variable/target)?
  • Can the relationship between Y and each predictor be modeled with a linear equation? Why a linear function? Because it is easy to understand. So, it might be a good idea to try a simple, more interpretable method before jumping to complicated ones.

Parametric vs. Non-parametric Methods:

In parametric methods, an assumption is made about the functional form of f (e.g. linear), and the goal is to estimate its parameters. (Drawback? The assumption that is made about f!)
In contrast, non-parametric methods do not assume any particular form for the function f. (Drawback? They may not perform well when there are only a few samples.)
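To make the contrast concrete, here is a minimal sketch on toy data I made up (KNeighborsRegressor is not in the imports above): a parametric fit that assumes a linear f, and a non-parametric KNN fit that assumes nothing about the form of f.

from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
x_toy = np.sort(rng.uniform(0, 10, 100)).reshape(-1, 1)
y_toy = np.sin(x_toy).ravel() + rng.normal(0, 0.2, 100)   # the true f is non-linear

# parametric: assume f is linear, so only a slope and an intercept must be estimated
lin = LinearRegression().fit(x_toy, y_toy)

# non-parametric: no assumption about the form of f, but it needs enough samples around each point
knn_reg = KNeighborsRegressor(n_neighbors=5).fit(x_toy, y_toy)

print('linear train MSE:', np.mean((y_toy - lin.predict(x_toy)) ** 2))
print('knn    train MSE:', np.mean((y_toy - knn_reg.predict(x_toy)) ** 2))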

Prediction Accuracy vs. Model Interpretability

If your goal is to maximize prediction accuracy, you may ignore your model's interpretability. However, if you (or your business client) are interested in understanding the relationship between input and output, then your goal is inference. You may lose some accuracy in favor of better interpretability!

Supervised vs. Unsupervised Learning, and others...

Supervised: there is a response variable in the data that we want to predict (if the response variable is continuous, it is known as a regression problem, and if it is discrete, it is a classification problem).

Unsupervised: there is no response variable in the data, and the goal is to group similar data and determine whether there is some underlying group structure in the data.

(Note: sometimes we may use both supervised and unsupervised learning in a problem. So, if the data has labels and we decide to use unsupervised learning at some stage, we usually ignore the response variable and just apply the unsupervised learning method on the feature space.)

Semi-supervised learning: a portion of the dataset is labeled, and there are some unlabeled observations as well. The goal is to use both (i.e. labeled and unlabeled data).

NOTE: Logistic Regression, in spite of its name, is a classification method. (Given a sample, it predicts the probability of that sample being a member of class 0 or 1.)
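As a quick illustration on made-up data (LogisticRegression is not in the imports above), note that the model outputs class probabilities despite the "regression" in its name:

from sklearn.linear_model import LogisticRegression

# toy binary-classification data (hypothetical): one feature, two classes
X_lr = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y_lr = np.array([0, 0, 0, 1, 1, 1])

log_reg = LogisticRegression().fit(X_lr, y_lr)
print(log_reg.predict_proba([[2.0]]))   # probabilities of class 0 and class 1 for a new sample
print(log_reg.predict([[2.0]]))         # predicted class (here, the one with probability > 0.5)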

How to measure the performance of a "Statistical Learning" method ?

In regression, Mean Square Error (MSE) is usually used to measure the quality of fit. It can be calculated as follows:

$MSE = {\frac{1}{n}}{\sum\limits_{i=1}^{n} {(y_{i} - \hat{y}_{i})}^{2}}$

We want to select a model (i.e.a statistical learning model) that results in lower $MSE$.
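In code, the MSE can be computed directly with NumPy, or with sklearn.metrics.mean_squared_error (not in the imports above); here is a tiny sketch on made-up vectors:

from sklearn.metrics import mean_squared_error

# hypothetical true and predicted values, just to show the computation
y_true_ex = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_ex = np.array([2.5,  0.0, 2.0, 8.0])

mse_manual = np.mean((y_true_ex - y_pred_ex) ** 2)
mse_sklearn = mean_squared_error(y_true_ex, y_pred_ex)
print(mse_manual, mse_sklearn)   # both give the same value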

You may have heard about splitting data into train/test sets. We can increase the flexibility of $\hat{f}$ (a.k.a. its degrees of freedom (dof), or un-smoothness!) and fit the training data as well as we can (by fitting, we mean estimating the function f). However, at some point, increasing the flexibility causes overfitting, as the model becomes less generalizable. So, we can plot MSE-vs-dof and see how the model behaves for different dof.

As we increase flexibility, the MSE of the fitted model on the test set may decrease, and then at some point, it will increase. How can we find a well-fitted model? One solution is to use cross-validation to estimate the test MSE using the training data (more on this later).

NOTE: We usually expect the test MSE to be higher than the train MSE, because the model is usually selected based on its score on the train set. (So, if we have a very high MSE on the train set, that means we are underfitting: we cannot even fit the train set, let alone the test set!)
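As a rough sketch of the cross-validation idea (using cross_val_score, which is not in the imports above, and the Advertising X and y defined earlier): we estimate the test MSE from the training data alone by averaging the error over several folds.

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation; sklearn reports the negative MSE, so we flip the sign
cv_scores = cross_val_score(LinearRegression(), X, y, scoring='neg_mean_squared_error', cv=5)
print('estimated test MSE from cross-validation:', -cv_scores.mean())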

Bias vs. Variance (trade-off)

For a particular $\hat{f}$, if we fit it to a large number of training sets and calculate the expected value of the error of $\hat{f}$ at a test point $x_{0}$, we will see the following relationship:

$E[(y_{0}-\hat{y}_{0})^{2}] = Var(\hat{f}(x_{0})) + [Bias(\hat{f}(x_{0}))]^{2} + Var(\epsilon)$

  • What is $Var(\hat{f}(x_{0}))$?: This shows how much our prediction $\hat{f}(x_{0})$ varies. So, if we OVER-fit our model to each training set, it can easily vary from one training set to another.

  • What is $[Bias(\hat{f}(x_{0}))]^{2}$? We can think of "bias" as the error introduced by our bias towards a simpler model (e.g. a linear one) when the true f may be more complex. The more flexible $\hat{f}$ is, the less biased it is.

  • What is $Var(\epsilon)$? This is irreducible error.

Now, let us think about two extreme models:

  • Linear model: high bias and low variance (because $\hat{f}$ changes very little if we change one data point in the training set)
  • VERY flexible model: low bias and high variance (because a little change in the train set can significantly change $\hat{f}$ because of over-fitting)

Thus, a good model is a model that has low variance and low bias.
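To see the trade-off numerically, here is a rough simulation sketch of my own (not from the book): we repeatedly draw training sets from a known data-generating process, refit $\hat{f}$, and estimate $Var(\hat{f}(x_{0}))$ and $[Bias(\hat{f}(x_{0}))]^{2}$ at a single test point $x_{0}$.

# rough bias/variance estimate at a single point x0, by refitting on many simulated training sets
rng = np.random.default_rng(2)
x0, sigma = 2.0, 0.3                         # test point and noise std, so Var(eps) = sigma**2

preds = []
for _ in range(500):
    x_tr = rng.uniform(0, 6, (50, 1))
    y_tr = np.sin(x_tr).ravel() + rng.normal(0, sigma, 50)   # true f is sin(x)
    model = LinearRegression().fit(x_tr, y_tr)               # swap in a more flexible model to see the trade-off
    preds.append(model.predict([[x0]])[0])

preds = np.array(preds)
print('Var(f_hat(x0))   :', preds.var())
print('Bias^2(f_hat(x0)):', (preds.mean() - np.sin(x0)) ** 2)
print('Var(eps)         :', sigma ** 2)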

Classification setting

We can transfer the aforementioned concepts to the "classification" setting:

$\text{Training Error} = \frac{1}{n} { \sum\limits_{i=1}^{n}{ I(y_{i} \neq \hat{y}_{i}) }} = 1 - accuracy $

$accuracy = \frac{TP+TN}{n}$, where TP and TN are the numbers of true positives/true negatives (i.e. correct predictions when the label is 1 or 0, respectively), and n is the total number of samples.
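In code (a tiny made-up example), the error rate is just the fraction of misclassified samples, which equals 1 minus the accuracy:

# hypothetical labels and predictions, just to show the computation
y_true_cls = np.array([1, 0, 1, 1, 0, 1])
y_pred_cls = np.array([1, 0, 0, 1, 0, 1])

error_rate = np.mean(y_true_cls != y_pred_cls)                 # fraction of misclassified samples
print(error_rate, 1 - accuracy_score(y_true_cls, y_pred_cls))  # same value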

NOTE: as we discussed before, our goal is to minimize the error on test set.

Bayes Classifier (vs. Naive Bayes Classifier!!)

The Bayes Classifier is the gold standard! In other words, it gives us the LOWEST possible error. (If you are thinking about the Naive Bayes Classifier, that is a little bit different; we will cover it shortly and show you the difference!)

The idea is simple: the class of X is "j" iff $P(Y_{j}|X)$ is maximum. To put it simply, we assign X to the class that has the highest conditional probability given X.

Ok, but how to calculate $P(Y_{j}|X)$?
$P(Y_{j}|X) = \frac{P(X|Y_{j})P(Y_{j})}{P(X)}$ (Bayes' theorem); and to compare $P(Y_{0}|X)$ and $P(Y_{1}|X)$ in binary classification (you can generalize it to more than two classes), we just need to calculate (and compare) the numerators. So, IF we know the probability distribution of X in the two classes, we can calculate it.

HOWEVER, in the real world, we do not have such information. We can make an assumption and calculate it. For instance, we can say the variables (features) are independent; therefore, $P(X|Y_{j}) = P(x_{1}|Y_{j})...P(x_{p}|Y_{j})$ when we have p features. And we can assume each variable comes from a normal distribution fitted on the training set (so, we can fit a normal distribution to the $x_{1}$ values of all training points that belong to $Y_{j}$). This is the Naive Bayes Classifier, because we made the "naive" assumption that the variables are independent of each other.
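A minimal sketch of this idea using scikit-learn's GaussianNB (not in the imports above), which fits one normal distribution per feature per class, exactly the naive assumption described here; the data below is made up:

from sklearn.naive_bayes import GaussianNB

# toy data (hypothetical): two features, two classes
X_nb = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.2],
                 [3.0, 0.5], [3.2, 0.7], [2.8, 0.4]])
y_nb = np.array([0, 0, 0, 1, 1, 1])

nb = GaussianNB().fit(X_nb, y_nb)        # fits a normal distribution per feature per class
print(nb.predict_proba([[2.0, 1.0]]))    # P(class | x) under the naive independence assumption
print(nb.predict([[2.0, 1.0]]))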

How does non-parametric method work in this context?

In K-Nearest Neighbors (KNN), we estimate $P(Y_{j}|X)$ based on the information of the k closest points to X. So: $\hat{P}(Y_{j}|X) = \frac{1}{k} { \sum\limits_{i=1}^{k}{ I(y_{i} = j) }}$ (a tiny code sketch of this fraction follows the two cases below). Now, let us consider the two following extreme cases:

  • k is very small (e.g. k=1): We are just using the closest point! Maybe a point happens to be close just because of some random noise or other cause and not the actual behavior. So, our model OVER-fits our data. This is similar to the case of regression where we have a lot of flexibility (high degree of freedom) and thus, we have high variance and low bias.

  • k is very large (e.g. k=100): We are generalizing our model "too much"! So, we will have low variance but high bias, as we are not really considering what is going on in the data and we ignore its underlying structure. (E.g. suppose $k = n_{samples} - 1$. In this case, we are considering almost all points when we want to make a decision about the output of a single point.)
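The fraction above is exactly what KNeighborsClassifier.predict_proba returns; here is a tiny sketch on made-up data:

# toy data (hypothetical): predict_proba returns the fraction of the k nearest neighbors in each class
X_knn = np.array([[0.0], [0.5], [1.0], [2.0], [2.5], [3.0]])
y_knn = np.array([0, 0, 0, 1, 1, 1])

knn_clf = KNeighborsClassifier(n_neighbors=3).fit(X_knn, y_knn)
print(knn_clf.predict_proba([[1.4]]))   # [2/3, 1/3]: two of the three nearest points are in class 0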

Applying a non-parametric model (KNN) to the breast cancer data

DATA = datasets.load_breast_cancer() #loading from sklearn.datasets
X = DATA.data
y = DATA.target

print('number of observations: ', X.shape[0])
print('number of features: ', X.shape[1])
print('number of classes: ', len(set(y)))
number of observations:  569
number of features:  30
number of classes:  2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
n_plots = 5

fig, axs = plt.subplots(1,n_plots, figsize=(25,5))
for i in range(n_plots):
    axs[i].set_title(f'histogram at feature index: {i}')
    axs[i].hist(X_train[:,i], color='r', label='train')
    axs[i].hist(X_test[:,i], color='b', label='test')
    axs[i].legend()

plt.show()

Now, we can see that the shape of the distribution of each feature is similar in the train and test sets. How about the histogram of the classes?

plt.figure()
plt.title('histogram of labels in train/test sets')
plt.hist(y_train, color='r', label='train')
plt.hist(y_test, color='b', label='test')
plt.show()

As observed, the classes are balanced (i.e. each class has a fair share of the total observations, and the classes in the test set are still balanced).

It might be a good idea to reduce the dimensionality using PCA (so we can visualize the data more easily). Please note that we use all the features for fitting the model, and only use the PCA-reduced version for visualization!

scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test) #we apply the scaler already fitted to train onto the test set.

#apply pca:
pca = PCA(n_components=2).fit(X_train)
X_train_reduced = pca.transform(X_train)
X_test_reduced = pca.transform(X_test)

print('total variance explained by pca: ', sum(pca.explained_variance_ratio_))
total variance explained by pca:  0.6280476169407645

We notice that the total variance explained by the first two principal axes is quite low; we would usually like to cover around 90% of the variance. However, we are okay with moving forward for the sake of this tutorial.

colors_vec = ['r','b'] #each color is for each class

plt.figure(figsize=(20,5))
plt.title('train (circle) and test data(square), using first two components of PCA')

plt.scatter(X_train_reduced[:,0], X_train_reduced[:,1], marker='o', c=np.where(y_train==0, 'r', 'b'))
plt.scatter(X_test_reduced[:,0], X_test_reduced[:,1], marker='s', c=np.where(y_test==0, 'maroon', 'c'))

plt.xlabel('pca 1st component')
plt.ylabel('pca 2nd component')
plt.show()

Now, we use KNN with different values of k, and plot the error of the model versus k (recall that a lower k, e.g. k=1, corresponds to higher flexibility, i.e. a higher degree of freedom, and thus higher variance).

def explore_K_in_KNN(X_train, X_test, y_train, y_test, min_k=1, max_k=20):
    """
    This function tries different values of k in KNN, and returns the error on the train set, the error on the test set, and the fitted models!
    
    Parameters
    ----------
    X_train: features in train set 
    X_test: features in test set
    y_train: labels in train set
    y_test: labels in test set
    min_k: minimum number of k (default: 1)
    max_k: maximum number of k (default: 20)
    
    Output
    ---------
    train_err: a vector of size max_k-min_k+1 that stores the error (1 - accuracy) on the train set for each value of k
    test_err: a vector of size max_k-min_k+1 that stores the error (1 - accuracy) on the test set for each value of k
    models: a list that contains the fitted model for each value of k
    """
    n = max_k - min_k + 1
    train_err = np.zeros(n)
    test_err = np.zeros(n)
    
    models = []
    for i, k in enumerate(range(min_k, max_k + 1)):
        classifier = KNeighborsClassifier(n_neighbors = k).fit(X_train, y_train)
        models.append(classifier)
        
        y_train_pred = classifier.predict(X_train)
        y_test_pred = classifier.predict(X_test)
        
        train_err[i] = 1 - accuracy_score(y_train, y_train_pred)
        test_err[i] = 1 - accuracy_score(y_test, y_test_pred)
    
    return train_err, test_err, models
min_k = 1
max_k = 50

train_err, test_err, models = explore_K_in_KNN(X_train, X_test, y_train, y_test, min_k, max_k)
plt.figure(figsize=(10,3))
plt.title('error-vs-k in KNN (train/test sets)')

plt.plot(train_err,'r',label='train err')
plt.plot(test_err,'b',label='test err')

plt.xticks(ticks=np.arange(0, max_k - min_k + 1, 5), labels=np.arange(min_k, max_k + 1, 5))
plt.xlabel('k')
plt.ylabel('error')
plt.legend()
plt.show()

It seems k=40 should provide a good model, as the error on the test set is (close to) minimum around k=40.

Now, let us plot the regions of classes for three different cases: k=1, k=40 (optimal), and k=50.

k_values =  [1, 40, 50]

#I set the values based on the figure that shows data points with their first two PCA components...
min_x_reduced, max_x_reduced = -7, 17 
min_y_reduced, max_y_reduced = -8, 13

#now, we can create a meshgrid of values in PCA-ed feature (2-dim) space, and then 
#we inverse_transform these values to original space.
xx, yy = np.meshgrid(np.arange(min_x_reduced, max_x_reduced, 0.1), np.arange(min_y_reduced, max_y_reduced, 0.1))
fig, axs = plt.subplots(1, len(k_values), figsize=(15,5))

for i, k in enumerate(k_values):
    X_mesh_reduced = np.c_[xx.ravel(), yy.ravel()]
    X_mesh = pca.inverse_transform(X_mesh_reduced)
    
    y_mesh_pred = models[k-1].predict(X_mesh)
    y_mesh_pred = y_mesh_pred.reshape(xx.shape) #we want y to have the same configuration (shape) as xx
    
    axs[i].set_title(f'regions for k={k}')
    
    axs[i].contourf(xx, yy, y_mesh_pred, alpha=0.4)
    axs[i].scatter(X_train_reduced[:,0], X_train_reduced[:,1], c=y_train)
    
    axs[i].set_xlabel('pca 1st component')
    axs[i].set_ylabel('pca 2nd component')

plt.tight_layout()
plt.show()

As illustrated, for $k=1$, the decision boundary is very flexible. On the other hand, when $k=50$, the decision boundary is almost a straight line (high bias and low variance), as if it doesn't care how the data changes. $k=40$ is (close to) the optimal value, where the decision boundary tries to capture the behavior of the data in the classes.

Brain teaser:
What does KNN learn from the train set that helps it predict the test set? For instance, in the linear regression mentioned earlier, we need to estimate its parameters. In other words, the machine learns the parameters from the data. What about KNN? :)

Summary

  • $y = f(X) + \epsilon$ --> Statistical learning refers to a set of approaches for estimating the function f. By using better models, we can decrease the "reducible error"; however, we cannot get rid of the error $\epsilon$ in the formula above, because $\epsilon$ is independent of the variables and is irreducible.
  • $E[(Y-\hat{Y})^{2}] = E[(f(X)+\epsilon-\hat{f}(X))^{2}] = [f(X)-\hat{f}(X)]^{2} + Var(\epsilon) = \text{Reducible Error} + \text{Irreducible Error}$
  • Statistical Learning methods: Parametric vs. Non-parametric methods --> In parametric methods, we make an assumption about the function f. For instance, we can say f is linear: $f(X) = w^{T}X + b$. Now, we just need to estimate the parameters w and b. Since we assume the form of f in parametric methods, the estimated function may not be a good fit. In non-parametric methods, we do not pre-assume any particular form for the function f. However, we should enforce some level of smoothness to avoid overfitting.
  • Prediction vs. Inference: For inference, we usually prefer an easy-to-understand model. So, if we get a somewhat lower accuracy but the model can be interpreted more easily, that is fine when our goal is inference, since we can then understand the relationship between the response variable and the features (predictors).
  • Side Note: Logistic Regression --> In spite of its name, it is for classification. It predicts the probability of a test point being a member of a class. If that probability is greater than a pre-specified value (say 0.5), then the point is assigned to that class.
  • Goodness-of-fit in regression: can be measured by MSE (Mean Squared Error) --> $MSE = \frac{1}{n}\sum_{i}(y_{true,i}-y_{pred,i})^{2}$ (in NumPy: np.mean((y_true - y_pred)**2)). Note that we can measure MSE on both the train & test sets! MSE on the test set is usually higher than MSE on the train set, and we usually care about the test set. If MSE is high on the train set, it means we are underfitting, and we should probably see a high MSE on the test set as well. However, if MSE on the train set is low, that does not guarantee a low MSE on the test set. Our goal is to get a low MSE on the test set! If we get a high MSE on the test set but a low MSE on the train set, that means we are over-fitting the data, i.e. we are doing so well on the train set that the model takes the noise into account and cannot reflect the true behavior of the data --> we can use cross-validation to estimate the test MSE using the training data (more on this later!)
  • Bias vs. Variance Trade-Off (Part I): For a particular $\hat{f}$, we can fit it to a large number of training sets and calculate the expected value of the error as follows: $E[(y_{0}-\hat{y}_{0})^{2}] = Var(\hat{f}(x_{0})) + [Bias(\hat{f}(x_{0}))]^{2} + Var(\epsilon)$ --> please note that we are focusing on the expected value of the error at a single data point $x_{0}$. As you can see, the error has the term $Var(\epsilon)$ that we cannot get rid of! We also have $Var(\hat{f}(x_{0}))$, which shows how much the estimated value of f at $x_{0}$ changes from one training set to another. So, if we over-fit our data (i.e. use a highly flexible function, i.e. a large degree of freedom, i.e. low smoothness), our $\hat{f}(x_{0})$ can easily change when the train set changes. The bias term $Bias(\hat{f}(x_{0}))$ is caused by our bias towards a simpler (e.g. linear) model. So, if we assume our model is linear, we are inserting a bias into our model. To reduce the bias, we should add flexibility and consider a higher degree of freedom, which on the other hand increases $Var(\hat{f}(x_{0}))$. Now we can feel the trade-off between Bias and Variance! --> A good model is a model that has LOW bias and LOW variance.
  • Bias -vs- Variance TradeOff (Part II)
    (1) Linear model: high bias and low variance (because $\hat{f}$ changes very little if we change one data point in the training set)
    (2) VERY flexible model: low bias and high variance (because a little change in the train set can significantly change $\hat{f}$ because of over-fitting)
  • Error in classification: np.mean(y_true != y_pred), i.e. the average of a boolean vector that is TRUE when $y_{true} \neq y_{pred}$ and FALSE otherwise. Our goal is to have a low error on the test set.
  • Bayes Classifier vs. Naive Bayes Classifier: The Bayes Classifier is the gold-standard model with the lowest possible error. In binary classification, we find $P(class_{0} | x_{0})$ and $P(class_{1} | x_{0})$, and assign $x_{0}$ to the class with the higher probability. However, since we do not know the conditional distribution, we cannot do this directly. A NAIVE way to do it is to assume all predictors are independent of each other. So, we first use Bayes' theorem: $P(class_{0} | x_{0}) \propto P(x_{0} | class_{0}) \times P(class_{0})$, and then to calculate $P(x_{0} | class_{0})$, we can say: $P(x_{0} | class_{0}) \propto P(x^{(feature\#1)}_{0} | class_{0}) \times ... \times P(x^{(feature\#F)}_{0} | class_{0})$ (because we assume they are independent), and then we can simply fit a pdf for each feature on the train set and use it to calculate $P(x^{(feature\#1)}_{0} | class_{0})$ on the test set.
  • Using KNN as a classifier / regressor: When k=1, we consider ONLY the closest point. In other words, we are over-fitting; this is similar to having a highly flexible model. As we increase k, the model becomes smoother and less flexible. So, when k is really, really large, we are dealing with high bias.
  • MSE vs. Flexibility: Please note that as we increase flexibility (by adding degrees of freedom to our model), the MSE on the train set decreases. On the test set, however, the MSE follows a U-shaped trend: it decreases, and at some point, when we reach an over-fitting state on the train set, the test MSE goes up again! (NOTE: IF the data really comes from a linear model, we should see the test MSE go up right from the beginning, where the degree of freedom is lowest.)
  • Plot region boundaries in classification: To do so, we need to generate a bunch of coordinates that cover most of the space and then predict their classes using the fitted classifier. We can do it as follows:
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
    X_mesh = np.c_[xx.ravel(), yy.ravel()] 
    ####
    #Then use fitted model on X_mesh.
    ####
    #NOTE: If the features are high-dimensional, we may use PCA to summarize them in two dimensions, then get `X_mesh` in the PCA space and apply the PCA inverse transform to map X_mesh from the 2D PCA space back to the original dimension. Then, use the fitted model to predict their y.