This notebook covers Chapter 4 (Classification) of the book ISL (An Introduction to Statistical Learning).

Classification: Predicting the qualitative (categorical) response variable based on a set of predictors

How does classification work? Classification methods usually predict the probability that a sample (observation) belongs to each class, and then, based on a (user-defined) threshold, they predict its class.

NOTE: This chapter covers some widely-used classification methods. (In later chapters, some other classification methods will be covered as well.)

In this chapter, the dataset Default will be used. Let us take a look at it...

import libraries

import math
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

#sklearn
from sklearn.model_selection import  train_test_split

from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import PoissonRegressor
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.naive_bayes import GaussianNB 
from sklearn.neighbors import KNeighborsClassifier as KNNclassifier

from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from  sklearn.metrics import accuracy_score, roc_curve,roc_auc_score

#statsmodel
import statsmodels.api as sm

data: default

df = pd.read_csv('./datasets/Default.csv', index_col=0)
df.head(3)
default student balance income
1 No No 729.526495 44361.625074
2 No Yes 817.180407 12106.134700
3 No No 1073.549164 31767.138947

The column default is the response variable and the goal is to predict Yes/No based on its predictors.

NOTE: a person with "default: yes" is a person who cannot pay back their debt.

_ = df.hist()
plt.figure()
plt.title('histogram of student')
student_tags_counts = Counter(df['student'])
plt.bar(student_tags_counts.keys(), student_tags_counts.values())
plt.show()
num_features = df._get_numeric_data().columns

fig, axs = plt.subplots(nrows=1, ncols=len(num_features), figsize=(15,5))
for i_axs, feature in enumerate(num_features):
    groups = df[feature].groupby(df['default'])
    for i, (key, item) in enumerate(groups):
        axs[i_axs].boxplot(item, positions=[i], labels=[key])
    
    axs[i_axs].set_xlabel('response variable: default')
    axs[i_axs].set_ylabel(feature)

plt.show()

The figure shows that there is a strong relationship between the predictor balance and the response variable default. In other words, people with a higher balance are more likely to have default=yes.

NOTE: As discussed in the book, we rarely see such a clear relationship in real-world problems.

BONUS:
How about the categorical predictor student? How can we see its relationship with the response variable?
One solution is to show the relationship with the help of bar charts...

groups = df['default'].groupby(df['student']) 

fig, ax = plt.subplots()
bar_width = 0.8
xticklabels_lst = []
for i, (key, item) in enumerate(groups):
    xticklabels_lst.append(key)
    counts = Counter(item)
    ax.bar(5*i - 0.5*bar_width, counts['No'], color = 'r', label='default: No' if i == 0 else "")
    ax.bar(5*i + 0.5*bar_width, counts['Yes'], color='b', label='default: Yes' if i == 0 else "")


ax.set_xticks([5*x for x in range(len(groups))])
ax.set_xticklabels(xticklabels_lst)
plt.xlabel('student')
plt.ylabel('count')
plt.legend()
plt.show()
# Show the relationship between the two numerical predictors and 
# ... the response variable

num_features = df._get_numeric_data().columns 
#here, num stands for numerical

colors_lst = ['b', 'r']
groups = df[num_features].groupby(df['default'])
for i, (key, item) in enumerate(groups):
    plt.scatter(item.iloc[:,0], item.iloc[:,1], color=colors_lst[i], label=f'default: {key}')

plt.xlabel(num_features[0])
plt.ylabel(num_features[1])
plt.legend()
plt.show()

We can notice that people with a higher balance are more likely to have default=yes (moving from left to right along the x-axis). However, the income value does not help us that much in predicting the response variable default (moving from bottom to top along the y-axis).

Why not using linear regression?

To use linear regression, we first need to assign some ordinal numbers to the classes. This can become a problem if we have more than two classes. For instance, let us assume we have four classes in our response variable: Programming, Machine Learning, Statistics, and Others.

We can convert these tags to the numbers 1, 2, 3, and 4, and then apply linear regression. But how should we convert these tags to ordinal numbers? Should we assign 1 to Programming or to Machine Learning? Different arrangements can lead to different results. Also, if we go with this approach, we are making the assumption that the difference between label 1 and label 2 is the same as that between label 2 and label 3, and so on. However, such an assumption is usually not correct in the classification context.

If we have two classes (i.e. binary classes), we can use a dummy variable and then apply linear regression (and, if we swap 0 and 1, it results in an equivalent model). There is still one problem: the value of y (from the fitted model) should represent the probability of an observation being a member of a class given its predictors. However, the y value in linear regression is not restricted. In other words, we might get values below 0 or above 1, neither of which is meaningful when treated as a probability. Although we could clip them, it would be better to have a method that is designed for classification and produces probabilities between 0 and 1 in the first place.

Thus, we should use methods that are designed for qualitative response variable.

Logistic Regression

  • A method that is good for binary classification (and it can be extended to problems with more than two classes)
  • It calculates the probability of the response variable given its predictors. For example, the goal might be calculating P(Y=yes | balance), which is the probability of an observation being in class yes given the predictor balance. (If this probability is bigger than a threshold, say 0.5, then we can say the observation belongs to the class yes.) A company might prefer to lower the False Negative error (i.e. the class is actually positive, but it is predicted as negative). In this case, the user might choose a small threshold, like 0.1. Then, if P(Y=yes | balance) becomes bigger than 0.1, the prediction is yes.
  • P(X) denotes P(Y=1|X) in this chapter unless stated otherwise.
  • Goal: to model P(X) such that it gives values between 0 and 1
  • Solution: we can use the logistic function: $P(X) = \frac{1}{1 + e^{-(w^{T}X+b)}}$. In contrast to linear regression, where the slope is constant, $\frac{dP}{dX}$ is not constant (it depends on the value of X).
  • In the single-predictor case: if w>0 (w<0), then P(X) is increasing (decreasing). You can take the derivative of P(X) and verify this yourself.
  • We want to find parameters w and b such that P(X) becomes close to 1 when y=1, and close to 0 when y=0. In regression, we had MSE, and minimizing MSE gave us the estimates of w and b. In classification, we need to define an objective first: in logistic regression, the idea is to maximize the following objective function: $L(w,b) = \prod\limits_{\forall y_{i}=1}{P(X_{i})} \prod\limits_{\forall y_{i}=0}{(1 - P(X_{i}))}$ --> NOTE: this is the overall likelihood based on a binomial distribution (read more here: Logistic_regression).
  • We can use libraries to fit this model and find the parameters.
  • For a probability p, its corresponding odds are defined as $\frac{p}{1-p}$.
  • Also note that $\log(\frac{P(X)}{1-P(X)}) = wX + b$ (single-predictor case), where the left-hand side is called the log odds (or logit). (A quick numeric check of this identity follows the list.)
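
Here is that check — a minimal sketch (not part of the book's lab) using arbitrary, hypothetical values for w and b:

#sanity check: the log odds of the logistic function recover the linear part w*x + b
w, b = 0.005, -10.0                      #hypothetical parameter values
x_grid = np.linspace(500, 2500, 5)
p = 1 / (1 + np.exp(-(w * x_grid + b)))  #logistic function P(X)
log_odds = np.log(p / (1 - p))           #logit
print(np.allclose(log_odds, w * x_grid + b))  #True
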
#preparing feature
X = df['balance'].to_numpy()
X = sm.add_constant(X)

#preparing response variable
y = df['default'].to_numpy()
label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)

print('classes transformed by label_encoder: ', label_encoder.classes_)
classes transformed by label_encoder:  ['No' 'Yes']
classifier = sm.Logit(y, X).fit()
print(classifier.summary())
Optimization terminated successfully.
         Current function value: 0.079823
         Iterations 10
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9998
Method:                           MLE   Df Model:                            1
Date:                Thu, 14 Apr 2022   Pseudo R-squ.:                  0.4534
Time:                        21:33:13   Log-Likelihood:                -798.23
converged:                       True   LL-Null:                       -1460.3
Covariance Type:            nonrobust   LLR p-value:                6.233e-290
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -10.6513      0.361    -29.491      0.000     -11.359      -9.943
x1             0.0055      0.000     24.952      0.000       0.005       0.006
==============================================================================

Possibly complete quasi-separation: A fraction 0.13 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

Cool! We reproduced Table 4.1! Now, we can see $w>0$, which means P(X) is an increasing function, and thus a higher balance means a higher probability for an observation to be in class Y=1 (i.e. default=Yes). Now, you may want to take another look at the scatterplot and boxplot provided earlier in this notebook :)
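
As a quick follow-up (a sketch, assuming the fitted classifier from the cell above is still in scope), we can plug balance values into the fitted model; the book reports a default probability of roughly 0.6% for a balance of $1,000 and about 59% for $2,000.

#predicted default probabilities at balance = 1000 and 2000
#(the exog must include the constant column, matching sm.add_constant above)
probs = classifier.predict([[1.0, 1000.0], [1.0, 2000.0]])
print(probs)  #expected to be roughly [0.006, 0.59], as reported in the book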

Practice: We can use the feature student as the only predictor and see the result when it is used in logistic regression for classifying the response variable default (try to reproduce Table 4.2).

How about using all predictors? (Multi-variate Logistic Regression)

X = df.iloc[:,1:] #skipping the first column as the first column is response variable `default`
X['student'] = label_encoder.fit_transform(X['student']) #encoding student to binary
X = sm.add_constant(X.to_numpy())

#preparing response variable
y = df['default'].to_numpy()
label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)

classifier = sm.Logit(y, X).fit()
print(classifier.summary())
Optimization terminated successfully.
         Current function value: 0.078577
         Iterations 10
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9996
Method:                           MLE   Df Model:                            3
Date:                Thu, 14 Apr 2022   Pseudo R-squ.:                  0.4619
Time:                        21:33:16   Log-Likelihood:                -785.77
converged:                       True   LL-Null:                       -1460.3
Covariance Type:            nonrobust   LLR p-value:                3.257e-292
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -10.8690      0.492    -22.079      0.000     -11.834      -9.904
x1            -0.6468      0.236     -2.738      0.006      -1.110      -0.184
x2             0.0057      0.000     24.737      0.000       0.005       0.006
x3          3.033e-06    8.2e-06      0.370      0.712    -1.3e-05    1.91e-05
==============================================================================

Possibly complete quasi-separation: A fraction 0.15 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

Note that the order of columns is different here:
predictors --> x1: student, x2: balance, x3: income

Brain teaser: Try using x1: student as the only predictor and look at its coefficient. You will notice that it is positive. How is that possible? (In the table above, from the multi-variate logistic regression, the coefficient of x1: student is negative.)

Answer: check out the book!
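
If you want to check this numerically, here is a minimal sketch (assuming df is still loaded) that fits a logistic regression on student alone; per Table 4.2 of the book, the student coefficient should come out positive (around 0.40):

#fit default ~ student only; the single-predictor coefficient of student
#is positive even though it is negative in the multi-variate fit above
student_encoded = LabelEncoder().fit_transform(df['student'])  #No->0, Yes->1
X_student = sm.add_constant(student_encoded.astype(float))
y_default = LabelEncoder().fit_transform(df['default'])
print(sm.Logit(y_default, X_student).fit(disp=0).params)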

Multi-class Logistic Regression (a.k.a Multinomial Logistic Regression)

Please note that here we would like to talk about cases that have more than two classes. (Do not confuse this with the multi-variate case, where we have more than one predictor.)

In the book, you can see how the two-class P(X) can be extended to multi-class logistic regression. Here, I skip that part and just review the softmax method. (Note that the extended version of logistic regression and the softmax coding are equivalent in terms of fitted values, log odds between any pair of classes, and other key outputs.)

So, let us see how it works:

Let us suppose that we have K classes:
$ P(Y=k | X) = \frac{e^{z_{k}}}{\sum\limits_{i=1}^{K}{e^{z_{i}}}}$

And, we can easily see that: $\sum\limits_{i=1}^{K}{P(Y=i|X)} = 1$

$z_{i} = w_{i}^{T}X + b_{i}$ --> so each $z_{i}$ is a linear function of X, and the softmax converts these scores into the probabilities P(Y=i|X).

Practice: Try to calculate the log odds $\log\frac{P(Y=j|X=x)}{P(Y=i|X=x)}$ (it simplifies to a linear function of x).
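
Here is a minimal sketch (not from the book's lab) of the softmax computation for a single observation, using hypothetical scores z; it also shows that the log odds between two classes is just the difference of their scores:

#softmax over hypothetical linear scores z_i = w_i^T x + b_i for K = 3 classes
z = np.array([1.5, -0.3, 0.7])
probs = np.exp(z) / np.exp(z).sum()
print(probs, probs.sum())                        #probabilities sum to 1
print(np.log(probs[0] / probs[1]), z[0] - z[1])  #log odds = difference of scores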

Generative Model for Classification

In the previous part, we introduced the likelihood function, where the goal was to estimate the parameters $w$ and $b$ by maximizing the objective function, and then, for a given point X, predict its probability of being in class 1 as $P(Y=1|X)=P(X)=\frac{1}{1+e^{-z}}$, where $z = \hat{w}^{T}X + \hat{b}$. In the K-class case, we have K pairs of parameters $(w,b)$ that should be estimated.

Now, we would like to propose an alternative approach for predicting $P(Y=i|X)$ in a K-class classification problem. Instead of using $\frac{1}{1+e^{-(w^{T}X + b)}}$ as P(X), we now try to calculate it with Bayes' theorem:

$P(Y=k|X) = \frac{P(X|Y=k) P(Y=k)}{P(X)} = \frac{P(X|Y=k) P(Y=k)}{\sum\limits_{i=1}^{K}{P(X|Y=i) P(Y=i)}}$

So, for a given X, we would like to calculate P(Y=i|X) for $i \in [1,K]$. The class with the maximum posterior probability is the predicted class of X.

Let us dig into the terms that appear in the fraction above:

  • $P(X|Y=k)$: the likelihood, i.e. the density of X given that it comes from the k-th class
  • $P(Y=k)$: the prior probability, i.e. the probability that a randomly-chosen observation comes from class k
  • Denominator: a normalizing constant

Therefore, to compare the probabilities of X belonging to the different classes, we only need to compare the numerators. The prior probability $P(Y=k)$ is simple: we can say $P(Y=k) = \frac{\text{# of observations with label k}}{\text{total # of observations}}$. The challenge is $P(X|Y=k)$: how can we calculate the probability of X given that it belongs to class k? (In other words, how can we model the distribution of X in the k-th class?)

Here are a few methods:

  • Linear Discriminant Analysis
  • Quadratic Discriminant Analysis
  • Naive Bayes

Each of the methods above makes its own assumption from which $P(X|Y=k)$ can be obtained; we can then calculate the posterior probability $P(Y=k|X)$.

Let's go through them!

Note: Why should I use these instead of logistic regression? These methods are good when there is a substantial separation between classes (in that case, logistic regression gives unstable parameter estimates).

Linear Discriminant Analysis (the following discussion is for p=1, where p is the number of predictors)
Assumption: data points in the k-th class come from $N(\mu_{k}, \sigma)$, where $\mu_{k} = avg(X \in \text{class k})$ is class-specific and $\sigma$ is a standard deviation common to all classes, estimated as:
$\hat{\sigma}^{2} = \frac{1}{n-K} \sum\limits_{k=1}^{K}\sum\limits_{i: y_{i}=k}{(x_{i} - \hat{\mu}_{k})^{2}}$.

Now, we can plug these values into the PDF of the normal distribution and calculate the posterior probability for each class. Why do we call it Linear Discriminant Analysis? Because, if we break down the posterior probability, we can see that $P(Y=k|X)$ becomes maximum when the discriminant function $\delta_{k}(X)$ becomes maximum (see the book for the formula of $\delta_{k}(X)$), and $\delta_{k}(X)$ is LINEAR in $X$. That is why it is called Linear Discriminant Analysis.

How about p>1 (p: number of predictors)? In that case, we use the multivariate normal distribution, where $\mu$ is a vector estimated separately for each class, and the covariance matrix, $\Sigma$, is common to all classes, with:
$\Sigma_{ij} = \mathrm{Cov}(X_{i},X_{j})$

mean_vec = np.array([5,5])
cov_mat = np.array([[0.01, 0.08],[0.08, 1]])
#why these numbers? 
#I decided to consider same value for mean of each variable. mean shifts the distribution and I do not care about that here.
#I just want to see how the shape looks like... 
#In 2-by-2 cov matrix, the diagonal values are variance of variables (features). So, cov[0,0] is variance of first feature
#cov[1,1] is variance of second feature.
#cov[0,1]=cov[1,0]=E[(X0 - X0_mu)(X1- X1_mu)] = corr*std(X0)*std(X1) => I decided to choose the value 0.8 for corr, so I
#can see the behavior when there is high correlation, and then based on that, I found the cov[0,1]...

np.random.seed(seed=0)
toy_data = np.random.multivariate_normal(mean_vec, cov_mat, size=100000)
plt.scatter(toy_data[:,0], toy_data[:,1])
plt.xlabel('X_0') #first feature
plt.ylabel('X_1') #second feature
plt.show()

We can see that there is a strong linear correlation between X_0 and X_1...

def multi_nomial_dist_pdf(X, mean_vec, cov_mat):
    """
    This function returns f(X), where f is the PDF of the multivariate normal distribution
    with mean vector `mean_vec` and covariance matrix `cov_mat`.
    (Despite the function name, this is the multivariate normal density, eq. 4.23 of the book.)
    
    Parameters
    ----------
    X: numpy.ndarray
        an array with shape (m,), where m is the dimension (i.e. number of features) of the data.
    
    mean_vec: numpy.ndarray
        an array with shape (m,): the mean parameter of the multivariate normal distribution
    
    cov_mat: numpy.ndarray
        an array with shape (m,m): the covariance-matrix parameter of the multivariate normal distribution
    
    Returns
    ----------
    likelihood: scalar
         the value of f(X)
    """
    m = X.shape[0]
    
    if (len(mean_vec) != m):
        raise ValueError(f"length of mean_vec, {len(mean_vec)}, is not equal to the dimension(i.e number of features) of data, {m}")
    
    if cov_mat.ndim != 2:
        raise ValueError(f"size of cov_mat, {cov_mat.ndim}, is not 2.")
    
    if not np.allclose(cov_mat, cov_mat.T):
        raise ValueError(f"cov_mat is not symmetic")
    
    if cov_mat.shape[0] != m:
        raise ValueError(f"shape of cov_mat, {cov_mat.shape}, is not {(m,m)}")
    
    #eq 4.23 of book
    coef = math.sqrt(1/(np.linalg.det(cov_mat) * math.pow(2*math.pi, m)))
    
    X_offset_removed = X - mean_vec
    exp_term = math.exp(-1/2 * np.matmul(np.matmul(X_offset_removed.T, np.linalg.inv(cov_mat)), X_offset_removed))
    
    return coef * exp_term                  
data_likelihood = np.array([multi_nomial_dist_pdf(x, mean_vec, cov_mat) for x in toy_data])
data_likelihood.shape
(100000,)
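
As a sanity check (a sketch, assuming SciPy is available in the environment), the hand-rolled density above can be compared against scipy.stats.multivariate_normal:

#compare the manual density with SciPy's multivariate normal PDF
from scipy.stats import multivariate_normal
ref_likelihood = multivariate_normal(mean=mean_vec, cov=cov_mat).pdf(toy_data)
print(np.allclose(data_likelihood, ref_likelihood))  #expected: True
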
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(projection='3d')
ax.scatter(toy_data[:,0], toy_data[:,1], data_likelihood)
plt.xlabel('X_0')
plt.ylabel('X_1')
plt.show()

Note that the graph above gives the likelihood, $P(X|Y=i)$; in other words, it shows the pdf of X within class i. However, what we are looking for is $P(Y=i|X)$. Thus, we use Bayes' theorem to calculate the posterior probability $P(Y=i|X)$.

The discriminant function $\delta_{k}(x)$ (see eq. 4.24 in the book) is still linear in x when p>1, and that is why the method is called Linear Discriminant Analysis.
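
For intuition, here is a minimal sketch (with hypothetical two-class, two-feature numbers, not the Default data) of the linear discriminant $\delta_{k}(x) = x^{T}\Sigma^{-1}\mu_{k} - \frac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_{k} + \log{\pi_{k}}$: the predicted class is simply the one with the larger discriminant value.

#manual evaluation of the LDA discriminant functions for a toy 2-class problem
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])           #shared covariance (assumed)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]   #class means (assumed)
priors = [0.7, 0.3]                                  #class priors (assumed)
x_new = np.array([1.5, 0.5])
Sigma_inv = np.linalg.inv(Sigma)
deltas = [x_new @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)
          for mu, pi in zip(mus, priors)]
print(deltas, 'predicted class:', int(np.argmax(deltas)))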

#note: student is qualitative and violates the normality assumption of LDA.
#However, LDA is pretty robust to model violations.
#alternative option: Naive Bayes, which does NOT require the normality assumption

y = df['default']
X = df[['student','balance']]
###############################

le = LabelEncoder().fit(y)
y = le.transform(y)
print('classes of target encoded by LabelEncoder: ', le.classes_)
print('---------------------------')

ct = ColumnTransformer([('ohe',OneHotEncoder(drop='first'),['student'])], remainder = 'passthrough')
X = ct.fit_transform(X)
print('the slices each transformer affect: ', ct.output_indices_) 
classes of target encoded by LabelEncoder:  ['No' 'Yes']
---------------------------
the slices each transformer affect:  {'ohe': slice(0, 1, None), 'remainder': slice(1, 2, None)}
print(X) #see X
[[0.00000000e+00 7.29526495e+02]
 [1.00000000e+00 8.17180407e+02]
 [0.00000000e+00 1.07354916e+03]
 ...
 [0.00000000e+00 8.45411989e+02]
 [0.00000000e+00 1.56900905e+03]
 [1.00000000e+00 2.00922183e+02]]

NOTE: If you get confused by slice(), take a look at the document: Built-in functions: slice

NOTE: default=yes means failure to repay debt

transformer = ct.transformers_ #this is list of transformers for OneHotEncoder
transformer
[('ohe', OneHotEncoder(drop='first'), ['student']),
 ('remainder', 'passthrough', [1])]
ohe = transformer[0][1]
ohe.categories_ #this shows `No` is converted to 0 and `Yes` is converted to 1... 
[array(['No', 'Yes'], dtype=object)]
X_scaled = StandardScaler().fit_transform(X) #not required in LDA classification... 
#However, I prefer to standardize it anyway unless I know I must not!!!

classifier = LDA().fit(X_scaled,y)
y_pred = classifier.predict(X_scaled)

conf_mat = confusion_matrix(y, y_pred)
conf_mat
array([[9644,   23],
       [ 252,   81]], dtype=int64)

Here, we reproduced Table 4.4. Please note that the confusion matrix, c, is constructed as follows: c[i,j] is the number of observations that have true label i but are predicted as j. That is why the two numbers 252 and 23 are swapped compared to what is provided in Table 4.4 of the book (in Table 4.4 of the book, the predicted labels are in rows and the true labels are in columns).

SIDE NOTE: please note that we can measure different kinds of error here. A quick search gives you their definition...

  • error
  • Type I error
  • Type II error
  • Precision
  • Recall
  • Sensitivity

$error = 1 - Accuracy = 1 - \frac{\text{TP}+\text{TN}}{\text{Total # of samples}}$, where TP (TN) is the number of true positive (true negative) predictions.

error = 1 - accuracy_score(y, y_pred)
print('error is: ', error) #so, approx. 2.75%
error is:  0.02749999999999997

What kinds of issues should we worry about?

  • error on the test set => please note that the error we got is the error of the model on the training set. However, the accuracy of the model on the test set is what actually matters, and the test error is usually higher than the training error. One way to see whether we need to worry about the model's test performance is to do a train/test split and measure the performance on the test set. In this case, we can see we should not worry about overfitting: the number of parameters is p=2 and the number of samples is n=10,000. When p << n, the ratio p/n is very small and we do not expect an overfitting problem. Therefore, if we have good performance on the training set, we can expect good performance on the test set as well.

  • unbalanced classes: we have 333 out of 10,000 samples with default='Yes'. In other words, only 3.33% of samples have default='Yes', and the rest have default='No'. So, a trivial null classifier that simply predicts default=No for every sample gets an error of 3.33%, which is only slightly higher than the error achieved by LDA.

In binary classification, we usually have two types of errors: FP (type I) and FN (type II). Let us take a look at the confusion matrix again:

cm_display = ConfusionMatrixDisplay(conf_mat).plot()

Recall that label 0 refers to not default while 1 refers to default. According to the confusion matrix above, 23 are False Positives, i.e. observations that are predicted to default by mistake. So, out of 9667 samples with true label 0, only 23 are predicted wrongly, which is a low error.

Now, if we look at the ones who actually defaulted, we can see that 252 out of 333 observations have a wrong prediction. That is a huge error. Therefore, the total error is low, but the error in predicting the observations who default is high.

So, for a credit card company that is interested in identifying (predicting) the individuals who will default (i.e. default=yes), having such a high error on the defaulters is not good.

$sensitivity = \frac{TP}{TP+FN} = \frac{TP}{\text{actual }P}$.

In this example, the sensitivity is $\frac{81}{333}$, which is equivalent to about 24.3%.

How can we improve this? Recall that LDA assigns label 1 if P(Y=1|X)>0.5. We want to be more sensitive, in the sense that if we get even a 20% probability, we say it is label 1! So, we can use a lower threshold, say 0.2.

classifier = LDA().fit(X_scaled,y)
y_predict_proba = classifier.predict_proba(X_scaled)

sensitivity_threshold = 0.2
y_pred = np.where(y_predict_proba[:,1]>sensitivity_threshold, 1, 0)
print(y_pred)
[0 0 0 ... 0 0 0]
conf_mat = confusion_matrix(y,y_pred)
cm_display = ConfusionMatrixDisplay(conf_mat).plot()

We reproduced Table 4.5 of the book. We can see the sensitivity is now $\frac{195}{333}$, which is about 58.6%. So, we now miss only 41.4% of the defaulters (down from 75.7% before).

However, we pay for this elsewhere. If you look at the FP, you can see that it increases from 23 to 235. The total error also increases to 3.73%, which may be acceptable to the company since, in return, it gets a much better prediction for those who default.

Let us see how the error changes w.r.t the threshold...

classifier = LDA().fit(X_scaled,y)
y_predict_proba = classifier.predict_proba(X_scaled)

threshold_values = np.arange(0,0.51,0.01)

err_total=[]
err_type_I = [] #fp/actually negative
err_type_II = [] #fn/actually positive

for threshold in threshold_values:
    y_pred = np.where(y_predict_proba[:,1]>threshold, 1, 0)
    tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
    
    err_total.append(1 - accuracy_score(y, y_pred))
    err_type_I.append(fp/(fp+tn))
    err_type_II.append(fn/(fn+tp))
    
plt.plot(threshold_values, err_total, color='k', label='err_total')
plt.scatter(threshold_values, err_type_I, color='r', marker='.', label='err_type_I')
plt.scatter(threshold_values, err_type_II, color='b', marker='.', label='err_type_II')

plt.title('performance of classifier with different thresholds')
plt.xlabel('threshold')
plt.ylabel('error')
plt.legend()
plt.show()

Lowering the threshold down to ~0.1 has a minor impact on the total error and err-type-I, while it hugely reduces err-type-II. Err-type-II is the ratio $\frac{FN}{\text{actual }P}$; in other words, of all the samples with true label 1, it shows how many are predicted wrongly. In our data, label 1 refers to default, so this error is the misclassification rate on the samples that actually default. If the goal is to better detect people who default and we do not care that much about misclassifying non-defaulters (i.e. we do not care much about FP), then we may prefer a lower threshold. The optimal threshold can be chosen by considering domain knowledge and the cost of each type of error.

Some measures in classification:

  • Type-I error = $\frac{FP}{\text{actual }N}$ = False Positive Rate ---> Specificity = 1 - Type-I error
  • Type-II error = $\frac{FN}{\text{actual }P}$ ---> Power = Recall = Sensitivity = 1 - Type-II error = $\frac{TP}{\text{actual }P}$ = True Positive Rate
  • Precision = $P(\text{actually }P \mid \text{predicted }P)$ = $\frac{TP}{\text{predicted }P}$ (a small helper computing these measures from a confusion matrix is sketched right after this list)
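
Here is that helper (a sketch, not part of the book's lab), applied to the threshold-0.2 confusion matrix computed above:

#compute common binary-classification measures from a confusion matrix
def binary_metrics(cm):
    tn, fp, fn, tp = cm.ravel()  #sklearn layout: rows = true labels, columns = predicted labels
    return {'type_I_error (FPR)': fp / (fp + tn),
            'specificity':        tn / (fp + tn),
            'type_II_error':      fn / (fn + tp),
            'sensitivity (TPR)':  tp / (fn + tp),
            'precision':          tp / (tp + fp)}

print(binary_metrics(conf_mat))  #conf_mat is from the threshold-0.2 cell above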

We would like to have a high True Positive Rate and a low False Positive Rate. The ROC curve is a good way to get a sense of the classifier's performance on these two measures (similar curves can be drawn for other pairs of measures).

In the ROC curve, we vary the threshold, and for each threshold value we calculate the True Positive Rate and the False Positive Rate and plot them.

classifier = LDA().fit(X_scaled,y)
y_predict_proba = classifier.predict_proba(X_scaled) 
# note that we get the posterior probability of X, as the roc_curve function (see below) 
# needs probabilities so it can derive predicted labels for different values of the threshold.

y_score = y_predict_proba[:,1] #prob of samples on positive class
fpr, tpr, _ = roc_curve(y, y_score)

plt.plot(fpr, tpr)
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.title(f'with auc score: {roc_auc_score(y, y_score)}')
plt.show()

The area under the curve (AUC) summarizes the performance of the classifier over all thresholds. It should be larger than that of a random classifier, which has an AUC of 0.5. In our example, the ROC AUC of the classifier is ~0.95, which shows the model performs well.

Practice: Find the threshold that results in the elbow of the curve.
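
One common heuristic for the "elbow" (a sketch, not the only possible definition) is to pick the threshold that maximizes Youden's J statistic, i.e. TPR - FPR:

#threshold maximizing Youden's J = TPR - FPR along the ROC curve
fpr, tpr, thresholds = roc_curve(y, y_score)
best = np.argmax(tpr - fpr)
print('threshold:', thresholds[best], 'TPR:', tpr[best], 'FPR:', fpr[best])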

Quadratic Discriminant Analysis: QDA
In contrast to LDA, which assumes a single covariance matrix for all classes, QDA assumes each class has its own covariance matrix. Why "Quadratic"? Because the discriminant function, $\delta_{k}(x)$, in QDA contains terms of the form $x^{2}$.

LDA-vs-QDA:
This is a bias-variance trade-off. QDA needs to estimate more parameters and is therefore more flexible: it has lower bias but higher variance.

Naive Bayes:
We assume all predictors are independent of each other within each class. Such an assumption simplifies the calculation of $f_{k}(X)$, because we can now simply say $f_{k}(X) = f_{k}(X_{1})\cdots f_{k}(X_{p})$, where $X_{i}$ refers to the i-th feature. Although we are ignoring the dependence (association) between the predictors, the classifier can still do a good job, particularly when n (the number of samples) is small. (We would need a large number of samples to estimate the joint dependence structure in the multivariate pdf $f_{k}(X)$; when n is small, such an estimate can have significant error, so we may be better off ignoring the dependence altogether.)

How should we estimate $f_{k}(X_{i})$ in the Naive Bayes classifier? (see page 156 of the book for more details)

  • We can say $X^{(k)}_{i}$ (i.e. the i-th feature in class k) comes from $N(\mu_{ik}, \sigma_{ik})$, where $\mu_{ik}$ and $\sigma_{ik}$ are estimated from the $X_{i}$ values of the samples in class k.

  • Another option is to use a histogram. We need to choose the number/size of bins (a hyperparameter). Then, for a new observation X and for each class k, we find the bin its j-th feature falls into and use the fraction of class-k observations in that bin as the estimate. Alternatively, we can use KDE (Kernel Density Estimation) to smooth the histogram of the j-th feature in class k into a pdf f and then estimate the likelihood as $f(X_{j})$.

  • How about a qualitative feature? Similar to the previous approach, but instead of a histogram, we use a bar chart, i.e. the counts (proportions) of each level of the j-th feature within class k. (A per-feature Gaussian sketch of Naive Bayes follows this list.)
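
Here is that sketch — a hand-rolled Gaussian Naive Bayes posterior for one observation (assuming X_scaled and y from the LDA cells above are still in scope, and that SciPy is available); GaussianNB below does essentially the same thing, up to numerical details:

#hand-rolled Gaussian Naive Bayes (un-normalized log posterior) for a single observation x0
from scipy.stats import norm
x0 = X_scaled[0]
log_posteriors = []
for k in np.unique(y):
    Xk = X_scaled[y == k]
    prior = Xk.shape[0] / X_scaled.shape[0]
    #independence assumption: sum of per-feature Gaussian log-densities
    log_lik = norm.logpdf(x0, loc=Xk.mean(axis=0), scale=Xk.std(axis=0)).sum()
    log_posteriors.append(np.log(prior) + log_lik)
print(log_posteriors, 'predicted class:', int(np.argmax(log_posteriors)))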

classifier_LDA = LDA().fit(X_scaled, y)
classifier_QDA = QDA().fit(X_scaled, y) 
classifier_GaussianNB = GaussianNB().fit(X_scaled, y) 

classifier_dict = {'LDA': classifier_LDA, 'QDA':classifier_QDA, 'NB':classifier_GaussianNB}
plt.figure()
for key, classifier in classifier_dict.items():
    y_predict_proba = classifier.predict_proba(X_scaled)
    y_score = y_predict_proba[:,1] #prob of samples on positive class
    
    
    fpr, tpr, _ = roc_curve(y, y_score)
    auc_score = np.round(roc_auc_score(y, y_score), decimals=3)
    
    plt.plot(fpr, tpr, label=f"{key}, auc={auc_score}")

plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.legend()
plt.show()

While they show similar ROC curves, their confusion matrices are different...

threshold = 0.2

for key, classifier in classifier_dict.items():
    y_predict_proba = classifier.predict_proba(X_scaled)
    y_score = y_predict_proba[:,1] #prob of samples on positive class
    y_pred = np.where(y_score>threshold, 1, 0)
    
    cm = confusion_matrix(y, y_pred)
    cm_display = ConfusionMatrixDisplay(cm).plot()
    ax = cm_display.ax_
    ax.set_title(key)

In this data set n>>p, so we should not be worried about a variance problem. Therefore, it is not worth trying to reduce variance with the help of Naive Bayes.

Side Note: LDA, QDA, and Naive Bayes are all parametric models. Non-parametric models, such as KNN, need a lot of observations (n >> p); otherwise (i.e. when n is small and p is high), a non-parametric model can end up with high variance.

Section 4.5 is important and should be reviewed from the book. We will get back to it when we study cross-validation in chapter 5.

"counts"-type data

This is when the target value is neither quantitative nor qualitative: it is counts!

Example: let us take a look at the bikeshare data, where the goal is to predict the number of people who rent a bike.

Here the target is not continuous (so it is not a standard regression problem), but it is also not qualitative. When the type of the target value (i.e. response) is counts, the target takes non-negative integer values... Possible ways to predict y for a given X:

  • Linear Regression: we can do it; however, the challenge is that we might get negative predictions for some X, which is meaningless. Also, if for some X the count is small (few people rent a bike), the standard deviation should be small, while for a large count we should expect a higher standard deviation. This violates the assumption of normally-distributed errors with constant variance in linear regression.

  • Another solution: we can model log(y) = wx+b. So, instead of modeling y with linear regression, we model log(y). Thus, we avoid negative predictions and we may satisfy the error assumptions. HOWEVER, this method is harder to interpret, and it cannot work when y is 0.

  • Alternative solution? See below!

Alternative solution: using Poisson Regression

First, let us take a look at the Poisson distribution:
$P(Y=k) = \frac{e^{-\lambda}\lambda^k}{k!}$ (k is a non-negative integer).

In the Poisson distribution, $\mu = \sigma^{2} = \lambda$. In other words, the mean ($\mu$) and the variance ($\sigma^{2}$) are equal. So, if the mean gets larger, the variance (and hence the standard deviation) gets larger too. To better understand how Poisson regression works, let us take a look at linear regression again:
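
A quick numeric illustration (a sketch with an arbitrary $\lambda$) of this mean-variance property:

#for Poisson samples, the sample mean and sample variance both approximate lambda
rng = np.random.default_rng(0)
samples = rng.poisson(lam=4.0, size=100_000)
print(samples.mean(), samples.var())  #both should be close to 4.0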

In linear regression, we have $y = wx+b + \epsilon$, where $\epsilon \sim N(0,\sigma)$. In other words, what we predict with our model (i.e. $\hat{y} = wx+b$) is the expected value of the response; the response comes from the distribution $N(\hat{y}, \sigma)$.

How can we use "Poission Regression" model?
We fit this model by maximimzing likelihood below:

$likelihood = \prod\limits_{i=0}^{K}{P(Y=i)} = \prod {\frac{e^{-\lambda}\lambda^i}{i!}}$.

So, we can say each data point comes from a Poisson distribution, where $\lambda$ is the expected value of the count we want to predict. Since the counts change for different X, we allow a different value of $\lambda$ for each X: we model $\lambda$ based on the data as $\log(\lambda(X)) = wX + b$. Why log? So we don't get a negative value for $\lambda$.

After fitting the model to the data (i.e. finding w and b such that the likelihood is maximized), for a given X we predict the expected value of the response as $E(\hat{y}) = \lambda = e^{wX+b}$.

 
df = pd.read_csv('./datasets/Bikeshare_hour.csv', index_col = 0)
df.head()
dteday season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
instant
1 2011-01-01 1 0 1 0 0 6 0 1 0.24 0.2879 0.81 0.0 3 13 16
2 2011-01-01 1 0 1 1 0 6 0 1 0.22 0.2727 0.80 0.0 8 32 40
3 2011-01-01 1 0 1 2 0 6 0 1 0.22 0.2727 0.80 0.0 5 27 32
4 2011-01-01 1 0 1 3 0 6 0 1 0.24 0.2879 0.75 0.0 3 10 13
5 2011-01-01 1 0 1 4 0 6 0 1 0.24 0.2879 0.75 0.0 0 1 1
predictors = ['mnth','hr','workingday','weathersit','temp']
response = ['cnt']

X = df[predictors]
y = df[response].to_numpy().ravel()

As opposed to the book, we only consider weathersit as a qualitative feature.

ct = ColumnTransformer([('ohe',OneHotEncoder(),['weathersit'])], remainder='passthrough')
ct.fit(X)
X_transformed = ct.transform(X)
X_transformed
array([[ 1.  ,  0.  ,  0.  , ...,  0.  ,  0.  ,  0.24],
       [ 1.  ,  0.  ,  0.  , ...,  1.  ,  0.  ,  0.22],
       [ 1.  ,  0.  ,  0.  , ...,  2.  ,  0.  ,  0.22],
       ...,
       [ 1.  ,  0.  ,  0.  , ..., 21.  ,  1.  ,  0.26],
       [ 1.  ,  0.  ,  0.  , ..., 22.  ,  1.  ,  0.26],
       [ 1.  ,  0.  ,  0.  , ..., 23.  ,  1.  ,  0.26]])
ct.transformers_
[('ohe', OneHotEncoder(), ['weathersit']),
 ('remainder', 'passthrough', [0, 1, 2, 4])]
ct.transformers_[0][1].categories_
[array([1, 2, 3, 4], dtype=int64)]
X_transformed = StandardScaler().fit_transform(X_transformed)
regressor = PoissonRegressor(alpha=0).fit(X_transformed, y)
y_pred = regressor.predict(X_transformed)

#let us plot y_pred and y together:
plt.scatter(y, y_pred)
plt.xlabel('y groundtruth')
plt.ylabel('y predict')
plt.show()

Instead of seeing the line $y=x$, it seems our model could not predict values of $y>600$ well. (Why?! Maybe we should have treated mnth and hr as qualitative features too... Try it! A sketch follows.)
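
If you want to try it, here is a minimal sketch (the same pipeline as above, just extending the one-hot encoding to mnth and hr as well); convergence warnings from the solver are possible, and the result is only meant for a quick visual comparison:

#treat mnth, hr, and weathersit as qualitative features and refit
ct2 = ColumnTransformer([('ohe', OneHotEncoder(), ['mnth', 'hr', 'weathersit'])],
                        remainder='passthrough', sparse_threshold=0)
X_transformed2 = StandardScaler().fit_transform(ct2.fit_transform(X))
regressor2 = PoissonRegressor(alpha=0, max_iter=1000).fit(X_transformed2, y)

plt.scatter(y, regressor2.predict(X_transformed2))
plt.xlabel('y groundtruth')
plt.ylabel('y predict')
plt.show()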

4.7.1: Stock Market Data

df = pd.read_csv('./datasets/Smarket.csv', index_col=0)
df.head()
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
ID
1 2001 0.381 -0.192 -2.624 -1.055 5.010 1.1913 0.959 Up
2 2001 0.959 0.381 -0.192 -2.624 -1.055 1.2965 1.032 Up
3 2001 1.032 0.959 0.381 -0.192 -2.624 1.4112 -0.623 Down
4 2001 -0.623 1.032 0.959 0.381 -0.192 1.2760 0.614 Up
5 2001 0.614 -0.623 1.032 0.959 0.381 1.2057 0.213 Up

The target is "Direction" that has labels Up and Down. Also, note that we cannot use the feature Today as it is the same as Direction when we encode positive values to Up, and negative ones to Down. Therefore, we assume our features are Lag1...Lg5 and volume.

X = df.iloc[:,:-2].to_numpy()
y = df.iloc[:,-1].to_numpy()
np.corrcoef(X.T) #note that in np.corrcoef, each column (not each row) is one observation
array([[ 1.        ,  0.02969965,  0.03059642,  0.03319458,  0.03568872,
         0.02978799,  0.53900647],
       [ 0.02969965,  1.        , -0.02629433, -0.0108034 , -0.00298591,
        -0.00567461,  0.04090991],
       [ 0.03059642, -0.02629433,  1.        , -0.02589667, -0.01085353,
        -0.00355795, -0.04338321],
       [ 0.03319458, -0.0108034 , -0.02589667,  1.        , -0.02405104,
        -0.01880834, -0.04182369],
       [ 0.03568872, -0.00298591, -0.01085353, -0.02405104,  1.        ,
        -0.02708364, -0.04841425],
       [ 0.02978799, -0.00567461, -0.00355795, -0.01880834, -0.02708364,
         1.        , -0.02200231],
       [ 0.53900647,  0.04090991, -0.04338321, -0.04182369, -0.04841425,
        -0.02200231,  1.        ]])
plt.matshow(np.corrcoef(X.T))
plt.colorbar()
plt.show()

This is just to give us some idea of how the features are correlated with each other.

What now? Let us fit different models to predict Direction.

>>> preprocessing

label_encoder = LabelEncoder()
label_encoder.fit(y)
y = label_encoder.transform(y)

print('classes in y that would be encoded: ', label_encoder.classes_)
classes in y that would be encoded:  ['Down' 'Up']
X_train, X_test, y_train ,y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

>>>Logistic Regression

regressor = LogisticRegression(penalty='none', solver='newton-cg').fit(X_train, y_train)

y_train_pred_proba = regressor.predict_proba(X_train)
y_test_pred_proba = regressor.predict_proba(X_test)

fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred_proba[:,1])
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred_proba[:,1])

plt.title('ROC with Logistic Regression')
plt.plot(fpr_train, tpr_train, label='roc train')
plt.plot(fpr_test, tpr_test, label='roc test')
plt.xlabel('False Positive Rate (FP/ GroundTruth_N)')
plt.ylabel('True Positive Rate (TP/ GroundTruth_P)')
plt.legend()
plt.show()

The ROC shows that the classes are being predicted almost at random! We might be able to improve the accuracy to some extent with the help of feature selection; say, we could improve the TP rate up to 60%. One needs to dig further and make sure that the obtained result is not due to chance (e.g. we can perform cross-validation and average the accuracy across several splits).
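
For instance, here is a minimal sketch (using sklearn's cross_val_score, not part of the original lab, and ignoring the time ordering of the data for simplicity) that averages accuracy over several splits:

#cross-validated accuracy of logistic regression on the Smarket features
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(penalty='none', solver='newton-cg'))
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(scores, scores.mean())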

threshold = 0.5
y_test_pred = np.where(y_test_pred_proba[:,1]>threshold, 1, 0)
cm = confusion_matrix(y_test, y_test_pred)

cm_display = ConfusionMatrixDisplay(cm).plot()
ax = cm_display.ax_
ax.set_title('Logistic Regression on test set (threshold: 0.5)')
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Ground Truth Labels')
plt.show()

print('accuracy: ', accuracy_score(y_test, y_test_pred))
accuracy:  0.5079872204472844
regressor = LDA().fit(X_train, y_train)
y_train_pred_proba = regressor.predict_proba(X_train)
y_test_pred_proba = regressor.predict_proba(X_test)

fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred_proba[:,1])
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred_proba[:,1])

plt.title('ROC with Linear Discriminant Analysis')
plt.plot(fpr_train, tpr_train, label='roc train')
plt.plot(fpr_test, tpr_test, label='roc test')
plt.xlabel('False Positive Rate (FP/ GroundTruth_N)')
plt.ylabel('True Positive Rate (TP/ GroundTruth_P)')
plt.legend()
plt.show()
threshold = 0.5
y_test_pred = np.where(y_test_pred_proba[:,1]>threshold, 1, 0)
cm = confusion_matrix(y_test, y_test_pred)

cm_display = ConfusionMatrixDisplay(cm).plot()
ax = cm_display.ax_
ax.set_title('LDA on test set (threshold: 0.5)')
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Ground Truth Labels')
plt.show()

As we saw before, the outcomes of LDA and Logistic Regression are usually almost identical. In our case, they are exactly the same when used to predict the y of the test set.

>>>LDA on two features (Lag1 and Lag2)

To keep things simple, we avoid doing feature selection here. For now, we assume we have already done some feature selection and that Lag1 and Lag2 were selected as the result of that process.

X = df[['Lag1','Lag2']].to_numpy()
y = df.iloc[:,-1].to_numpy()

label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
#####################################

X_train, X_test, y_train ,y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

classifier = LDA().fit(X_train, y_train)
y_train_pred_proba = classifier.predict_proba(X_train)
y_test_pred_proba = classifier.predict_proba(X_test)

####################################

fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred_proba[:,1])
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred_proba[:,1])

plt.title('ROC with Linear Discriminant Analysis')
plt.plot(fpr_train, tpr_train, label='roc train')
plt.plot(fpr_test, tpr_test, label='roc test')
plt.xlabel('False Positive Rate (FP/ GroundTruth_N)')
plt.ylabel('True Positive Rate (TP/ GroundTruth_P)')
plt.legend()
plt.show()

#let us plot the confusion matrix
threshold = 0.5
y_test_pred = np.where(y_test_pred_proba[:,1]>threshold, 1, 0)
cm = confusion_matrix(y_test, y_test_pred)

cm_display = ConfusionMatrixDisplay(cm).plot()
ax = cm_display.ax_
ax.set_title('LDA on test set (threshold: 0.5)')
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Ground Truth Labels')
plt.show()
classifier.coef_
array([[-0.06404648, -0.06901402]])

We can see that both coefficients are negative... Therefore, if Lag1 and Lag2 are both positive, the decision function $\delta(X)$ tends to become smaller, and the model assigns class 0 to X. However, if the values of both features are negative, then $\delta(X)$ becomes larger, which can result in assigning class 1 to X.

>>> Quadratic Discriminant Analysis (QDA)

# we assume we have done some feature selection and we noticed Lag1 and Lag2 are selected features.

X = df[['Lag1','Lag2']].to_numpy()
y = df.iloc[:,-1].to_numpy()

label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
#####################################

X_train, X_test, y_train ,y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

classifier = QDA().fit(X_train, y_train)
y_train_pred_proba = classifier.predict_proba(X_train)
y_test_pred_proba = classifier.predict_proba(X_test)

####################################

fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred_proba[:,1])
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred_proba[:,1])

plt.title('ROC with Quadratic Discriminant Analysis')
plt.plot(fpr_train, tpr_train, label='roc train')
plt.plot(fpr_test, tpr_test, label='roc test')
plt.xlabel('False Positive Rate (FP/ GroundTruth_N)')
plt.ylabel('True Positive Rate (TP/ GroundTruth_P)')
plt.legend()
plt.show()

#let us plot the confusion matrix
threshold = 0.5
y_test_pred = np.where(y_test_pred_proba[:,1]>threshold, 1, 0)
cm = confusion_matrix(y_test, y_test_pred)

cm_display = ConfusionMatrixDisplay(cm).plot()
ax = cm_display.ax_
ax.set_title('QDA on test set (threshold: 0.5)')
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Ground Truth Labels')
plt.show()

print('with accuracy on test: ', accuracy_score(y_test, y_test_pred))
with accuracy on test:  0.5079872204472844

>>> NaiveBayes

# we assume we have done some feature selection and we noticed Lag1 and Lag2 are selected features.

X = df[['Lag1','Lag2']].to_numpy()
y = df.iloc[:,-1].to_numpy()

label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
#####################################

X_train, X_test, y_train ,y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

classifier = GaussianNB().fit(X_train, y_train)
y_train_pred_proba = classifier.predict_proba(X_train)
y_test_pred_proba = classifier.predict_proba(X_test)

####################################

fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred_proba[:,1])
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred_proba[:,1])

plt.title('ROC with NaiveBayes (Gaussian)')
plt.plot(fpr_train, tpr_train, label='roc train')
plt.plot(fpr_test, tpr_test, label='roc test')
plt.xlabel('False Positive Rate (FP/ GroundTruth_N)')
plt.ylabel('True Positive Rate (TP/ GroundTruth_P)')
plt.legend()
plt.show()

#let us plot the confusion matrix
threshold = 0.5
y_test_pred = np.where(y_test_pred_proba[:,1]>threshold, 1, 0)
cm = confusion_matrix(y_test, y_test_pred)

cm_display = ConfusionMatrixDisplay(cm).plot()
ax = cm_display.ax_
ax.set_title('NaiveBayes(Gaussian) on test set (threshold: 0.5)')
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Ground Truth Labels')
plt.show()

print('with accuracy on test: ', accuracy_score(y_test, y_test_pred))
with accuracy on test:  0.5079872204472844

K-NearestNeighbor (k=1)

k = 1
# we assume we have done some feature selection and we noticed Lag1 and Lag2 are selected features.

X = df[['Lag1','Lag2']].to_numpy()
y = df.iloc[:,-1].to_numpy()

label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
#####################################

X_train, X_test, y_train ,y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

classifier = KNNclassifier(n_neighbors=k).fit(X_train, y_train)
y_train_pred_proba = classifier.predict_proba(X_train)
y_test_pred_proba = classifier.predict_proba(X_test)

####################################

fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred_proba[:,1])
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred_proba[:,1])

plt.title('ROC with KNN(k=1)')
plt.plot(fpr_train, tpr_train, label='roc train')
plt.plot(fpr_test, tpr_test, label='roc test')
plt.xlabel('False Positive Rate (FP/ GroundTruth_N)')
plt.ylabel('True Positive Rate (TP/ GroundTruth_P)')
plt.legend()
plt.show()

#let us plot the confusion matrix
threshold = 0.5
y_test_pred = np.where(y_test_pred_proba[:,1]>threshold, 1, 0)
cm = confusion_matrix(y_test, y_test_pred)

cm_display = ConfusionMatrixDisplay(cm).plot()
ax = cm_display.ax_
ax.set_title(f'KNN(k={k}) on test set (threshold: 0.5)')
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Ground Truth Labels')
plt.show()

print('with accuracy on test: ', accuracy_score(y_test, y_test_pred))
with accuracy on test:  0.5079872204472844

While k=1 provides very good performance on the training set, it behaves almost randomly on the test set!

K-NearestNeighbor (k=3)

k = 3
# we assume we have done some feature selection and we noticed Lag1 and Lag2 are selected features.

X = df[['Lag1','Lag2']].to_numpy()
y = df.iloc[:,-1].to_numpy()

label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
#####################################

X_train, X_test, y_train ,y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

classifier = KNNclassifier(n_neighbors=k).fit(X_train, y_train)
y_train_pred_proba = classifier.predict_proba(X_train)
y_test_pred_proba = classifier.predict_proba(X_test)

####################################

fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred_proba[:,1])
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred_proba[:,1])

plt.title(f'ROC with KNN(k={k})')
plt.plot(fpr_train, tpr_train, label='roc train')
plt.plot(fpr_test, tpr_test, label='roc test')
plt.xlabel('False Positive Rate (FP/ GroundTruth_N)')
plt.ylabel('True Positive Rate (TP/ GroundTruth_P)')
plt.legend()
plt.show()

#let us plot the confusion matrix
threshold = 0.5
y_test_pred = np.where(y_test_pred_proba[:,1]>threshold, 1, 0)
cm = confusion_matrix(y_test, y_test_pred)

cm_display = ConfusionMatrixDisplay(cm).plot()
ax = cm_display.ax_
ax.set_title(f'KNN(k={k}) on test set (threshold: 0.5)')
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Ground Truth Labels')
plt.show()

print('with accuracy on test: ', accuracy_score(y_test, y_test_pred))
with accuracy on test:  0.5079872204472844

According to the book, there is some improvement in performance when KNN (k=1, 3) is chosen as the classifier. However, our results on the test set do not show such an improvement!
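
To explore this a bit further, here is a minimal sketch (reusing the train/test split from the cell above) that sweeps over a few values of k and reports the test accuracy; the outcome will of course depend on the split:

#compare KNN test accuracy for several values of k
for k in [1, 3, 5, 10, 20, 50]:
    knn = KNNclassifier(n_neighbors=k).fit(X_train, y_train)
    print(f'k={k:2d}  test accuracy: {knn.score(X_test, y_test):.3f}')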