ISLChapter04 (Interactive Learning)
Chapter 04  Classification
 import libraries
 Why not using linear regression?
 Logistic Regression
 Generative Model for Classification
 "counts"type data
 4.7.1: Stock Market Data
This notebook covers chapter 04: classification of the book ISL.
Classification: Predicting the qualitative (categorical) response variable based on a set of predictors
How does classification work? The classification methods usually predict the probability of a sample (observation) belongs to each class, and then, based on a (userdefine) threshold, they try to predict its class.
NOTE: This chapter covers some of widelyused classification methods. (In later chapters, some other classification methods will be covered as well.)
In this chapter, the dataset Default
will be used, let us take a look at it...
import math
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
#sklearn
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import PoissonRegressor
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier as KNNclassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, roc_curve,roc_auc_score
#statsmodel
import statsmodels.api as sm
data: default
df = pd.read_csv('./datasets/Default.csv', index_col=0)
df.head(3)
The column default is the response variable and the goal is to predict Yes/No
based on its predictors.
NOTE: a person with "default: yes" is a person who cannot pay back their debt.
_ = df.hist()
plt.figure()
plt.title('histogram of student')
student_tags_counts = Counter(df['student'])
plt.bar(student_tags_counts.keys(), student_tags_counts.values())
plt.show()
num_features = df._get_numeric_data().columns
fig, axs = plt.subplots(nrows=1, ncols=len(num_features), figsize=(15,5))
for i_axs, feature in enumerate(num_features):
groups = df[feature].groupby(df['default'])
for i, (key, item) in enumerate(groups):
axs[i_axs].boxplot(item, positions=[i], labels=[key])
axs[i_axs].set_xlabel('response variable: default')
axs[i_axs].set_ylabel(feature)
plt.show()
The figure shows that there is strong relationship between the predictor balance
and the response variable default
. In other words, people with higher balance are more likely to have default=yes
.
NOTE: As discussed in the book, we rarely see such clear relationship in realworld problems)
BONUS:
How about the categorical predictor student
? How can we see its relationship with the response variable?
One solution is to show the relationship with help of bar charts...
groups = df['default'].groupby(df['student'])
fig, ax = plt.subplots()
bar_width = 0.8
xticklabels_lst = []
for i, (key, item) in enumerate(groups):
xticklabels_lst.append(key)
counts = Counter(item)
ax.bar(5*i  0.5*bar_width, counts['No'], color = 'r', label='default: No' if i == 0 else "")
ax.bar(5*i + 0.5*bar_width, counts['Yes'], color='b', label='default: Yes' if i == 0 else "")
ax.set_xticks([5*x for x in range(len(groups))])
ax.set_xticklabels(xticklabels_lst)
plt.xlabel('student')
plt.ylabel('default')
plt.legend()
plt.show()
# Show the relationship between two numerical predictors and
# ... respones variable
num_features = df._get_numeric_data().columns
#here, num stands for numerical
colors_lst = ['b', 'r']
groups = df[num_features].groupby(df['default'])
for i, (key, item) in enumerate(groups):
plt.scatter(item.iloc[:,0], item.iloc[:,1], color=colors_lst[i], label=f'default: {key}')
plt.xlabel(num_features[0])
plt.ylabel(num_features[1])
plt.legend()
plt.show()
We can notice that people with higher balance, are probably have default=yes
(when we move from left to right on xaxis). However, if the income value changes from one person to another, it cannot help us that much to predict the response variable default
(when we move from bottom to top on yaxis).
Why not using linear regression?
To use linear regression, we first need to consider some ordinal numbers for the classes. This can become a problem if we have more than two classes. For instance, let us assume we have four classes in our response variable: Programming
, Machine Learning
, Statistics
, and others
.
We can convert these tags to numbers 1,2,3, and 4, and then apply linear regression. But, how should we convert these tags to ordinal numbers? Should we assign 1 to programming
or to Machine Learning
? Different arrangement can lead to different results. Also, if we go with this approach, we are making the assumption that the difference between label1 and label2, is the same as label2 and label3, and so on. However, such assumption is usually not correct in the classification context.
If we have two classes (i.e. binary classes), we can use dummy variable and then we can use linear regression (and, if we swap 0 and 1, it results in an equivalent model). There is still one problem: the value of y (from the fitted model) should represent the probability of an observaton being a member of a class given its predictors. However, the y value in linear regression in not restricted. In other words, we might get some value below 0 or above 1, both of which are not meaningful when they are treated as the probability. Although we can restrict them, it would be better if we could have a method that is designed for classification and produces probability between 0 and 1 in the first place.
Thus, we should use methods that are designed for qualitative response variable.
Logistic Regression
 A method that is good for binary classification (and it can be extended for morethantwoclasses problems)
 It calculates the probability of response variable given its predictors. For example: the goal might be calculatig
P(Y=yes  balance)
, which means the probability of an observation being in classyes
given the predictorbalance
. (If this probability is bigger than a threshold, say 0.5, then we can say the observation belongs to the classyes
.) A company might prefer to lower False Negative Error. (i.e. the class is actually positive, but it is predicted as negative.) In this case, the user might choose small threshold, like 0.1. Then, ifP(Y=yes  balance)
becomes bigger than 0.1, the prediction isyes
.  P(X) denotes P(Y=1X) in this chapter unless stated otherwise.
 Goal: to model P(X) such that it gives value between 0 and 1
 Solution: we can use logistic function: $P(X) = \frac{1}{1 + e^{(w^{T}X+b)}}$: In contrast to linear regression where the slope was constant, the $\frac{dP}{dX}$ is not constast (it depends on the value of X).
 In singlepredictor case: If w>0 (w<0), then P(x) is increasing (decreasing): You can get the derivative of P(X) and check it out yourself
 We want to find parameters w and b such that P(X) becomes close to 1 when y=1, and becomes close to 0 when y=0. In regression, we had MSE and minimizing MSE gave us the estimation of w and b. In classification, we need to define an objective first: in logistic regression, the idea is to maximimze the following objective function: $L(w,b) = \prod\limits_{\forall y_{i}=1}{P(X_{i})} \prod\limits_{\forall y_{i}=0}{(1  P(X_{i}))}$ > NOTE: this is the overal likelihood based on a binomial distribution (read more here: Logistic_regression).
 we can use libraries to fit this model and find the parameters.
 For a probability p, its corresponding odd is defined as: $\frac{p}{1p}$
 Also note that: $\log(\frac{P(X)}{1P(X)}) = wX + b$ (single predictor case), where the lefthand side is called log odds (or logit).
#preparing feature
X = df['balance'].to_numpy()
X = sm.add_constant(X)
#preparing response variable
y = df['default'].to_numpy()
label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
print('classes transformed by label_encoder: ', label_encoder.classes_)
classifier = sm.Logit(y, X).fit()
print(classifier.summary())
Cool! We reproduced Table 4.1! Now, we can see $w>0$, which means P(X) is an increasing function, and thus higher balance means higher probability for an observation to be in class Y=1 (i.e. default=Yes
). Now, you may want to take a look again at the scatterplot and boxplot provdided earlier in this notebook :)
Practice: We can use the feature student
as the only predictor and see the result when is used in logsitic regression for classifying the response variable default
(try to reproduce table 4.2)
How about using all predictors? (Multivariate Logistic Regression)
X = df.iloc[:,1:] #skipping the first column as the first column is response variable `default`
X['student'] = label_encoder.fit_transform(X['student']) #encoding student to binary
X = sm.add_constant(X.to_numpy())
#preparing response variable
y = df['default'].to_numpy()
label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
classifier = sm.Logit(y, X).fit()
print(classifier.summary())
Note that the order of columns is different here:
predictors > x1: student, x2: balance, x3: income
Brain teaser: Try to use x1: student
as the only predictor and see its coefficient. You will notice that it is positive. How is that possible? (because in the table above, in multivariate logistic regression, the coefficient of x1:student
is negative.
Answer: checkout the book!
Please note that here we would like to talk about cases that have more than two classes (Do not confuse it with multivariate case where we have more than one predictors)
In the book, you can see how P(X) of twoclass case can be extended to Multiclass Logistic Regression. Here, I skip that part and just review the softmax method. (Note that the extended version of Logistic Regression and softmax are equivalent in terms of predicted value, log odds, and key outputs)
So, let us see how it works:
Let us suppose that we have K classes:
$ P(Y=k  X) = \frac{e^{z_{k}}}{\sum\limits_{i=1}^{K}{e^{z_{i}}}}$
And, we can easily see that: $\sum\limits_{i=1}^{K}{P(Y=iX)} = 1$
$z_{i} = w_{i}^{T}X + b_{i}$ > so, we use a linear regression to predict z, that will be used in softmax to predict P(Y=yX).
Practice: Try to calculate log odds: $\frac{P(Y=jX=x)}{P(Y=iX=x)}$
In previous part, we introduced the likelihood function, where the goal was to estimate parameters $w$ and $b$ by maximizing the objective function, and then, for a given point X, predict its probability of being in class 1 as: $P(Y=1X)=P(X)=\frac{1}{1+e^{z}}$, where, $z = \hat{w}^{T}X + \hat{b}$. And, in Kclass case, we have K pairs of parameters $(w,b)$ that should be estimated.
Now, we would like to propose an alternative approach for predicting $P(Y=iX)$ for Kclass classification problem. Instead of using $\frac{1}{1+e^{(w^{T}X + b)}}$ as the P(X), we now try to calculate it based on the bayes theorem:
$P(Y=kX) = \frac{P(XY=k) P(Y=k)}{P(X)} = \frac{P(XY=k) P(Y=k)}{\sum\limits_{i=1}^{K}{P(XY=i) P(Y=i)}}$
So, for a given X, we would like to calculate P(Y=iX) for $i \in [1,K]$. The class with maximum P(Y=?X) is the class of X.
Let us dig into the terms appeared in the fraction above:
 $P(XY=k)$: likelihood, the probability of X belongs to the kth class (this is density function of X if it comes from kth class)
 $P(Y=k)$: prior probability, the probability of a randomlychosen observation comes from class k
 Denominator: Normalizing constant
Therefore, to compare probability of X belonging to a particular class we need to compare the numerator. The prior probability $P(Y=k)$ is simple: we can say, $P(Y=k) = \frac{\text{# of observations with label k}}{\text{total # of observations}}$. The challenge is $P(XY=k)$? How can I calculate probability of x given that it belongs to class k? (In other words, how can I model the distribution of X in kth class?)
Here are a few methods:
 Linear Discriminant Analysis
 Quadratic Discriminant Analysis
 Naive Bayes
Each of the methods above has its own assumption based on which the $P(XY=k)$ can be obtained, then we can calculate the posterior probability $P(Y=kX)$.
Let's go through them!
Note: Why should I use these instead of logistic regression? these methods are good when there is a substantial seperation between classes (in this case, logisitc regresstion gives unstable parameters)
Linear Discriminant Analysis (following discussion is for p=1, where p
is the number of predictors)
Assumption: data points in the kth class are from $N(\mu_{k}, \sigma)$. $\mu_{k} = avg(X \in \text{class k})$, which is classspecific and $\sigma$ is the standard deviation common in all classes, and it is:
$\sigma^{2} = \frac{1}{nK} \times \text{sum(sum of squares of deviation of X from its corresponding }\mu)$.
Now, we can plug in these values in the PDF of normal distribution and calculate the posterior probability for each class. Why do we call it Linear Discriminant Analysis
? Because, if we break down the posterior probability, we can see $P(Y=kX)$ becomes maximum when the discriminant function $\sigma_{k}(X)$ becomes maximum (see book for formula of $\sigma_{k}(X)$). $\sigma_{k}(X)$ is LINEARLY related to $X$. That is why it is called Linear Discriminant Analysis.
How about p>1 (p: number of predictors)? In that case, we use multivariate normal distribution, where $\mu$ is a vector that should be separately calculated for each class, and covariance matrix,$\text{$\sum$}$, is common for all classes. And:
$\text{$\sum[i,j]$} = COV(X_{i},X_{j})$
mean_vec = np.array([5,5])
cov_mat = np.array([[0.01, 0.08],[0.08, 1]])
#why these numbers?
#I decided to consider same value for mean of each variable. mean shifts the distribution and I do not care about that here.
#I just want to see how the shape looks like...
#In 2by2 cov matrix, the diagonal values are variance of variables (features). So, cov[0,0] is variance of first feature
#cov[1,1] is variance of second feature.
#cov[0,1]=cov[1,0]=E[(X0  X0_mu)(X1 X1_mu)] = corr*std(X0)*std(X1) => I decided to choose the value 0.8 for corr, so I
#can see the behavior when there is high correlation, and then based on that, I found the cov[0,1]...
np.random.seed(seed=0)
toy_data = np.random.multivariate_normal(mean_vec, cov_mat, size=100000)
plt.scatter(toy_data[:,0], toy_data[:,1])
plt.xlabel('X_0') #first feature
plt.ylabel('X_1') #second feature
plt.show()
we can see that there is a good linear correlation between X_0 and X_1...
def multi_nomial_dist_pdf(X, mean_vec, cov_mat):
"""
This function returns f(x) for each row x in data, where f is PDF of mutltinomial distribution
with mean vector `mean_vec` and covariance matrix `cov_mat`
Parameters

data: numpy.ndarray
a array with shape (m,) where m is the dimension of data.
mean_vec: numpy.ndarray
array with shape (m,): the mean parameter of multinomial distribution
cov_mat: numpy.ndarray
array with shape (m,m): the `cov matrix` parameter of multinomial distribution
Returns

likelihood: scalar
the value of f(data[i]), where f is the multinomial distribution
"""
m = X.shape[0]
if (len(mean_vec) != m):
raise ValueError(f"length of mean_vec, {len(mean_vec)}, is not equal to the dimension(i.e number of features) of data, {m}")
if cov_mat.ndim != 2:
raise ValueError(f"size of cov_mat, {cov_mat.ndim}, is not 2.")
if not np.allclose(cov_mat, cov_mat.T):
raise ValueError(f"cov_mat is not symmetic")
if cov_mat.shape[0] != m:
raise ValueError(f"shape of cov_mat, {cov_mat.shape}, is not {(m,m)}")
#eq 4.23 of book
coef = math.sqrt(1/(np.linalg.det(cov_mat) * math.pow(2*math.pi, m)))
X_offset_removed = X  mean_vec
exp_term = math.exp(1/2 * np.matmul(np.matmul(X_offset_removed.T, np.linalg.inv(cov_mat)), X_offset_removed))
return coef * exp_term
data_likelihood = np.array([multi_nomial_dist_pdf(x, mean_vec, cov_mat) for x in toy_data])
data_likelihood.shape
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(projection='3d')
ax.scatter(toy_data[:,0], toy_data[:,1], data_likelihood)
plt.xlabel('X_0')
plt.ylabel('X_1')
plt.show()
Note that the current graph gives likelihood. It gives $P(XY=i)$. In other words, it shows pdf of X in the class i
. However, what we are looking for is $P(Y=iX)$. Thus, we can use bayes theorem to calculate the posterior probability $P(Y=iX)$.
The relationship between $\delta(x)$, discriminant function (see eq 4.24 in the book), and x is still linear when p>1
, and that is why it is called Linear Discriminant Analysis.
#note: student is qualitative and violates the normality assumption of LDA.
#However, LDA is prettery robust to model violation.
#alternative option: Naive Bayes, which does NOT require normality assumption
y = df['default']
X = df[['student','balance']]
###############################
le = LabelEncoder().fit(y)
y = le.transform(y)
print('classes of target encoded by LabelEncoder: ', le.classes_)
print('')
ct = ColumnTransformer([('ohe',OneHotEncoder(drop='first'),['student'])], remainder = 'passthrough')
X = ct.fit_transform(X)
print('the slices each transformer affect: ', ct.output_indices_)
print(X) #see X
NOTE: If you get confused by slice(), take a look at the document: Builtin functions: slice
NOTE: default=yes
means failure to repay debt
transformer = ct.transformers_ #this is list of transformers for OneHotEncoder
transformer
ohe = transformer[0][1]
ohe.categories_ #this shows `No` is converted to 0 and `Yes` is converted to 1...
X_scaled = StandardScaler().fit_transform(X) #not required in LDA classification...
#However, I prefer to standardize it anyway unless I know I must not!!!
classifier = LDA().fit(X_scaled,y)
y_pred = classifier.predict(X_scaled)
conf_mat = confusion_matrix(y, y_pred)
conf_mat
here, we reproduced TABLE 4.4. Please note that the confusion_matrix, c
, is constructed as follows:
c[i,j]
: number of observations that have true label of i
but predicted as j
. That is why we see the two numbers 252 and 23 are swapped compared to what provied in TABLE 4.4 of the book (In the TABLE 4.4 of the book, the predicted labels are in rows and the true labels are in columns.)
SIDE NOTE: please note that we can measure different kinds of error here. A quick search gives you their definition...
 error
 Type I error
 Type II error
 Precision
 Recall
 Sensitiviy
$error = 1  Accuracy = 1  \frac{\text{TP}+\text{TN}}{\text{Total # of samples}}$, where TP (TN) is the number of true positive (true negative) prediction of samples.
error = 1  accuracy_score(y, y_pred)
print('error is: ', error) #so, approx. 2.75%
What kind of issues we should be worry about?

error on test set => please note that the error we got is the error of model on train set. However, we know that the accuracy of model on test set is what actually matters! And, we know that the error of model on test set is ususally higher than the error of train set! Of course, one way to see whether or not we need to worry about the model's performance on test set is to do train/test split and measure the performance of the model on the test set. In this case, we can see we should not be worry about overfitting problem. The
number of parameters (p)=2
, and thenumber of samples (n) = 10,000
. In such cases wherep << n
, the ratio ofp/n
is vary small that we do not expect to see an over fitting problem. Therefore, if we have a good performance on train set, we might have a good perfomance on test set as well. 
unbalanced classes: we have
333 out of 10000 samples
withdefault='yes'
. In other words, only 3.33% of samples havedefault='yes'
, and others havedefault='No'
. So, if we use atrivial null classifier
that simply predictdefault=No
for any sample, it can get an error of 3.33%, which is just a little higher than the one achieved by LDA.
In binary classification, we usually have two types of errors: FP (type I)
and FN (type II)
. Let us take at confusion matrix again:
cm_display = ConfusionMatrixDisplay(conf_mat).plot()
Recall that label 0 refers to not default
while 1 refers to default
. According to the confusion matrix above, 23 are False Positive, which is the number of observations that are predicted to be default
by mistake. So, out of 9667 sampels with true label 0
, 23 of them got predicted wrongly, which is low error.
Now, if we take a look at the ones who are default
ed, we can see that 252 out of 333 observations have wrong prediction. That is a huge error. Therefore, the total error is low, but the error of predicting observations who are default
is high.
So, for a credit card company who is interested in identifying (predicting) the individuals who will default
(i.e. default=yes
), having high error in predicting default
is not good.
$ sensitivity = \frac{TP}{TP+FN} = \frac{\text{predicted }TP}{true\text{ }P}$.
In this example, sensitivity is $\frac{81}{333}$, which is equivelent to 24.3%.
How can we improve this? Recall that LDA assign label 1 if P(Y=1X)>0.5. We want to be more sensitive in a sense that if we, for instance, get even 20% probability, we say it is label 1! So, we can use lower threshold, say 0.2.
classifier = LDA().fit(X_scaled,y)
y_predict_proba = classifier.predict_proba(X_scaled)
sensitivity_threshold = 0.2
y_pred = np.where(y_predict_proba[:,1]>sensitivity_threshold, 1, 0)
print(y_pred)
conf_mat = confusion_matrix(y,y_pred)
cm_display = ConfusionMatrixDisplay(conf_mat).plot()
We reproduced Table 4.5 of the book. Now, we can see the sensitivity is now $\frac{195}{333}$, which is about 58.56%. So, we only missed 41.44% in predicting the default
.
However, we pay it off somewhere else.If you look at the FP, you can see that it increases from 23 to 235. The total error also increase to 3.73%, which is acceptable by company as, in return, it gets a much better prediction for those who are default.
Let us see how the error changes w.r.t the threshold...
classifier = LDA().fit(X_scaled,y)
y_predict_proba = classifier.predict_proba(X_scaled)
threshold_values = np.arange(0,0.51,0.01)
err_total=[]
err_type_I = [] #fp/actually negative
err_type_II = [] #fn/actually positive
for threshold in threshold_values:
y_pred = np.where(y_predict_proba[:,1]>threshold, 1, 0)
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
err_total.append(1  accuracy_score(y, y_pred))
err_type_I.append(fp/(fp+tn))
err_type_II.append(fn/(fn+tp))
plt.plot(threshold_values, err_total, color='k', label='err_total')
plt.scatter(threshold_values, err_type_I, color='r', marker='.', label='err_type_I')
plt.scatter(threshold_values, err_type_II, color='b', marker='.', label='err_type_II')
plt.title('performance of classifer with different threshold')
plt.xlabel('threshold')
plt.ylabel('error')
plt.legend()
plt.show()
Lowering the threshold upto ~0.1 has minor impact on the total error and errtypeI while it hugely reduces the errtypeII. The errtypeII shows the ratio of $\frac{FN}{\text{actual }P}$. In other words, it shows of the all samples with true label 1, how many are predicted wrong. So, in our data default
, the label 1 refers to default
. So, the error shows the misclassification error on samples whose default
is 1. If the goal is to better detect people with default
and we do not care that much abou misclassification on notdefault
people (i.e. we do not care about fp), then we may prefer lower threshold. The optimal threshold can be calculated by considering the domain knowledge and the cost of each error.
Some measures in classification:
 TypeIerror = $\frac{FP}{\text{actually N}}$ = False Positive Rate > Specificity = 1  TypeIerror
 TypeIIerror = $\frac{FN}{\text{actually P}}$> Power = Recall = Sensitivity = 1  TypeIIerr = $\frac{TP}{actually P}$ = True Positive Rate
 Precision = p(TPpredicted P) = $\frac{TP}{predicted P}$
we would like to have high True Positive Rate
and low False positive Rate
. ROC curve is a good way to get some sense about the performance of classifier on these two measures (note that you may use any other measures to plot ROC).
In ROC, we change the threshold, and for each value of threshold, we calculate True Positive Rate
and False positive Rate
and plot them.
classifier = LDA().fit(X_scaled,y)
y_predict_proba = classifier.predict_proba(X_scaled)
# note that we get the posterier probailty of X as the roc_curve function (seee below)
# needs probability so it can predict labels for different values of threshold.
y_score = y_predict_proba[:,1] #prob of samples on positive class
fpr, tpr, _ = roc_curve(y, y_score)
plt.plot(fpr, tpr)
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.title(f'with auc score: {roc_auc_score(y, y_score)}')
plt.show()
The area under curve (AUC) shows the performance of classifier overl all thresholds. it should be larger than a randomlyassigning classifier which has AUC of 0.5. In our example, the roc_auc of classifier is ~0.95 which shows the model has good performance.
Practice: Find the threshold that results in the elbow of the curve.
Quadratic Discriminant Analysis: QDA
In contrast to LDA that assume a single covariance matrix for all classes, QDA assumes each class has its own covariance matrix. Why "Quadratic"? because the discriminant funciton, $\sigma(x)$, in QDA has some form of $x^{2}$.
LDAvsQDA:
This is biasvariane tradeoff. QDA needs to estimate more parameters, therefore it is more flexible. It has lower bias but higher variance.
Naive Bayes:
we assume all predictors are indepenedent from each other. such assumption can simplify the caculation of $f_{k}(X)$...because, we can now simply say: $f_{k}(X) = f_{k}(X_{0})...f_{k}(X_{P})$, where $X_{i}$ refers to the ith feature. Although we are ignoring the marginal distribution, the classifier can still do a good job particularly when n (number of samples) is small. (Because, we need large number of samples to take into account marginal distribution when estimating the multivariate pdf $f_{k}(X)$. So, when n is small, our estimation might have a significant error and therefore, we may be better off by just ignoring the existance of marginal distribution.)
How should we estimate $f_{k}(X_{i})$ in Naive Bayes Classifer? (see page 156 of the book for more details)

We can say: $X^{(k)}_{i}$ (i.e. the ith feature data in class k) is from $N(\mu, \sigma)$, where $\mu$ and $\sigma$ can be calculated using the $X_{i}$ data of samples in class k.

Another option is to use historgram. Then, we need to choose number/size of bins (hyperparameter). Then, we can see for a new data X, and for each class k, find the bin $X^{(k)}_{j}$ belongs to and calculate the fraction of observations of that class k that is in that bin. Another way is to use KDE (Kernel Density Estimation) to estimate histogram of jth feature in class k with pdf f and then calculate likelihood as $f(X^{(k)}_{j})$

How about qualitative feature? Similar to previous approach, but instead of histogram, we have bar charts that show the counts of each tag for $X^{(k)}_{j}$
classifier_LDA = LDA().fit(X_scaled, y)
classifier_QDA = QDA().fit(X_scaled, y)
classifier_GaussianNB = GaussianNB().fit(X_scaled, y)
classifier_dict = {'LDA': classifier_LDA, 'QDA':classifier_QDA, 'NB':classifier_GaussianNB}
plt.figure()
for key, classifier in classifier_dict.items():
y_predict_proba = classifier.predict_proba(X_scaled)
y_score = y_predict_proba[:,1] #prob of samples on positive class
fpr, tpr, _ = roc_curve(y, y_score)
auc_score = np.round(roc_auc_score(y, y_score), decimals=3)
plt.plot(fpr, tpr, label=f"{key}, auc={auc_score}")
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.legend()
plt.show()
While they show similar ROC curve, their confusion matrix is different...
threshold = 0.2
for key, classifier in classifier_dict.items():
y_predict_proba = classifier.predict_proba(X_scaled)
y_score = y_predict_proba[:,1] #prob of samples on positive class
y_pred = np.where(y_score>threshold, 1, 0)
cm = confusion_matrix(y, y_pred)
cm_display = ConfusionMatrixDisplay(cm).plot()
ax = cm_display.ax_
ax.set_title(key)
In this data set n>>p, so we should not be worried about variance problem. Therefore, it is not worth it to reduce variance with help of Naive Bayes.
Side Note: LDA, QDA, Naive Baysian are all parametric models. In nonparametric models, such as KNN, we need a lot of observations (n >> p), otherwise (i.e. when n is small, p is high), the nonparametric model can result in highvariance model.
Section 4.5 is important and should be reviewed from the book. We will get back to it when when study crossvalidation in chapter 5.
This is when the target value is neither quantitative nor qualitative! It is counts!
Example: let us take a look at bikeshare data: to predict # of people rent a bike
Here the data is not continuous (not regression) but it is also not qualitative! When type of target value (i.e. response) is counts
, that means target value is nonnegative integer numbers... Possible ways to predict y for a given X:

Linear Regression: we can do it...however, the challenge is that we might get some negative prediction for some X which is meaningless. Also, if, for some X, number of counts is small, that means few people would rent a bike, and thus std should be small (In oppositve, when number of counts is large, we should expect higher std). This is violation of normallydistirbuted error assumption in linear regression.

Another solution: we can do
log(y) = wx+b
. So, instead of modeling y with linear regression, we model log(y). Thus, we can avoid negative prediction and we may statisfy the normallydistribution error assumption. HOWEVER, this method is hard to interpret and it cannot work when y is 0. 
Alternative solution? See below!
Alternative solution: using Poission Regression
First, let us take a look at Poission Distribution:
$P(Y=k) = \frac{e^{\lambda}\lambda^k}{k!}$ (k is nonnegative integer).
In poission distribution, $\mu = \sigma = \lambda$. In other words, the mean ($\mu$) and standard deviation ($\sigma$) are requal. So, if mean gets higher, the standard deviation is getting higher. To better understand how poission works, let us take a look at linear regression again:
In linear regression, we have $y = wx+b + \epsilon$, where $\epsilon \in ~ N(0,\sigma)$. In other words, what we predict with our model (i.e. $\hat{y} = wx+b$ is expected value of response. The response comes from a distribution $N(\hat{y}, \sigma)$.
How can we use "Poission Regression" model?
We fit this model by maximimzing likelihood below:
$likelihood = \prod\limits_{i=0}^{K}{P(Y=i)} = \prod {\frac{e^{\lambda}\lambda^i}{i!}}$.
So,we can say each point is coming from poission distribution where we expect $\lambda$ to be the expected value of counts we want to predict. Since the counts change for different X, we would like to consider diffent value of $\lambda$. So, we model $\lambda$ based on data: $log(\lambda) = wx + b$. why log? so we don't get negative value for $\lambda$.
After fitting model to data (i.e. finding w and b such that likelihood becomes maximum), then for a given X, we predict the expected value of response as: $E(\hat{y}) = \lambda = e^{wX+b}$
df = pd.read_csv('./datasets/Bikeshare_hour.csv', index_col = 0)
df.head()
predictors = ['mnth','hr','workingday','weathersit','temp']
response = ['cnt']
X = df[predictors]
y = df[response].to_numpy().ravel()
As opposed to the book, we only consider weathersit
as qualitative feature.
ct = ColumnTransformer([('ohe',OneHotEncoder(),['weathersit'])], remainder='passthrough')
ct.fit(X)
X_transformed = ct.transform(X)
X_transformed
ct.transformers_
ct.transformers_[0][1].categories_
X_transformed = StandardScaler().fit_transform(X_transformed)
regressor = PoissonRegressor(alpha=0).fit(X_transformed, y)
y_pred = regressor.predict(X_transformed)
#let us plot y_pred and y together:
plt.scatter(y, y_pred)
plt.xlabel('y groundtruth')
plt.ylabel('y predict')
plt.show()
Instead of seeing a linear line $y=x$, it seems our model could not predict of $y>600$. (Why?! Maybe..we should have considered month
and hr
as qualitative features... Try it!)
df = pd.read_csv('./datasets/Smarket.csv', index_col=0)
df.head()
The target is "Direction" that has labels Up
and Down
. Also, note that we cannot use the feature Today
as it is the same as Direction
when we encode positive values to Up
, and negative ones to Down
. Therefore, we assume our features are Lag1...Lg5 and volume.
X = df.iloc[:,:2].to_numpy()
y = df.iloc[:,1].to_numpy()
np.corrcoef(X.T) #note that in np.corrcoef, each column (not each row) is one observation
plt.matshow(np.corrcoef(X.T))
plt.colorbar()
plt.show()
This is just to give us some idea about how features correlated with each other.
What now? Let us fit differenet models to predict Direction
>>> preprocessing
label_encoder = LabelEncoder()
label_encoder.fit(y)
y = label_encoder.transform(y)
print('classes in y that would be encoded: ', label_encoder.classes_)
X_train, X_test, y_train ,y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
>>>Logistic Regression
regressor = LogisticRegression(penalty='none', solver='newtoncg').fit(X_train, y_train)
y_train_pred_proba = regressor.predict_proba(X_train)
y_test_pred_proba = regressor.predict_proba(X_test)
fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred_proba[:,1])
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred_proba[:,1])
plt.title('ROC with Logistic Regression')
plt.plot(fpr_train, tpr_train, label='roc train')
plt.plot(fpr_test, tpr_test, label='roc test')
plt.xlabel('False Positive Rate (FP/ GroundTruth_N)')
plt.ylabel('True Positive Rate (TP/ GroundTruth_P)')
plt.legend()
plt.show()
The ROC shows that the classes are being predict almost on a random basis! We might be able to improve the accuracy to some extent with help of feature selection. Say, we could improve it upto 60% for TP rate. One needs to dig further and make sure that the obtained result is not by chance (e.g. we can perform crossvalidation and take average of accuracy across several data sets)
threshold = 0.5
y_test_pred = np.where(y_test_pred_proba[:,1]>threshold, 1, 0)
cm = confusion_matrix(y_test, y_test_pred)
cm_display = ConfusionMatrixDisplay(cm).plot()
ax = cm_display.ax_
ax.set_title('Logistic Regression on test set (threshold: 0.5)')
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Ground Truth Labels')
plt.show()
print('accuracy: ', accuracy_score(y_test, y_test_pred))
regressor = LDA().fit(X_train, y_train)
y_train_pred_proba = regressor.predict_proba(X_train)
y_test_pred_proba = regressor.predict_proba(X_test)
fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred_proba[:,1])
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred_proba[:,1])
plt.title('ROC with Linear Discriminant Analysis')
plt.plot(fpr_train, tpr_train, label='roc train')
plt.plot(fpr_test, tpr_test, label='roc test')
plt.xlabel('False Positive Rate (FP/ GroundTruth_N)')
plt.ylabel('True Positive Rate (TP/ GroundTruth_P)')
plt.legend()
plt.show()
threshold = 0.5
y_trest_pred = np.where(y_test_pred_proba[:,1]>threshold, 1, 0)
cm = confusion_matrix(y_test, y_trest_pred)
cm_display = ConfusionMatrixDisplay(cm).plot()
ax = cm_display.ax_
ax.set_title('LDA on test set (threshold: 0.5)')
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Ground Truth Labels')
plt.show()
As we saw before, the outcome of LDA and Logistic Regression are usually almost identical. In our case, they are exactly the same when they are used to predict the y of test set.
>>>LDA on two features (Lag1 and Lag2)
To keep things simple, we avoid using feature selection here. For now, we assume we alrady have done some feature selection and we selected Lag1 and Lag2 as the reults of featureselection process.
X = df[['Lag1','Lag2']].to_numpy()
y = df.iloc[:,1].to_numpy()
label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
#####################################
X_train, X_test, y_train ,y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
classifier = LDA().fit(X_train, y_train)
y_train_pred_proba = classifier.predict_proba(X_train)
y_test_pred_proba = classifier.predict_proba(X_test)
####################################
fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred_proba[:,1])
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred_proba[:,1])
plt.title('ROC with Linear Discriminant Analysis')
plt.plot(fpr_train, tpr_train, label='roc train')
plt.plot(fpr_test, tpr_test, label='roc test')
plt.xlabel('False Positive Rate (FP/ GroundTruth_N)')
plt.ylabel('True Positive Rate (TP/ GroundTruth_P)')
plt.legend()
plt.show()
#let us plot the confusion matrix
threshold = 0.5
y_trest_pred = np.where(y_test_pred_proba[:,1]>threshold, 1, 0)
cm = confusion_matrix(y_test, y_trest_pred)
cm_display = ConfusionMatrixDisplay(cm).plot()
ax = cm_display.ax_
ax.set_title('LDA on test set (threshold: 0.5)')
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Ground Truth Labels')
plt.show()
classifier.coef_
We can see that both coefs are negative... Therefore, if Lag1 and Lag2 are positive, the function $\delta(X)$ inclines to become smaller for positivevalues X, and thus it assign class 0 to X. However, if the value of features are both negative....then $\delta(X)$ becomes larger, which can result in assigning class 1 to X.
>>> Quadratic Discriminant Analysis (QDA)
# we assume we have done some feature selection and we noticed Lag1 and Lag2 are selected features.
X = df[['Lag1','Lag2']].to_numpy()
y = df.iloc[:,1].to_numpy()
label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
#####################################
X_train, X_test, y_train ,y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
classifier = QDA().fit(X_train, y_train)
y_train_pred_proba = classifier.predict_proba(X_train)
y_test_pred_proba = classifier.predict_proba(X_test)
####################################
fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred_proba[:,1])
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred_proba[:,1])
plt.title('ROC with Quadratic Discriminant Analysis')
plt.plot(fpr_train, tpr_train, label='roc train')
plt.plot(fpr_test, tpr_test, label='roc test')
plt.xlabel('False Positive Rate (FP/ GroundTruth_N)')
plt.ylabel('True Positive Rate (TP/ GroundTruth_P)')
plt.legend()
plt.show()
#let us plot the confusion matrix
threshold = 0.5
y_trest_pred = np.where(y_test_pred_proba[:,1]>threshold, 1, 0)
cm = confusion_matrix(y_test, y_trest_pred)
cm_display = ConfusionMatrixDisplay(cm).plot()
ax = cm_display.ax_
ax.set_title('QDA on test set (threshold: 0.5)')
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Ground Truth Labels')
plt.show()
print('with accuracy on test: ', accuracy_score(y_test, y_test_pred))
>>> NaiveBayes
# we assume we have done some feature selection and we noticed Lag1 and Lag2 are selected features.
X = df[['Lag1','Lag2']].to_numpy()
y = df.iloc[:,1].to_numpy()
label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
#####################################
X_train, X_test, y_train ,y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
classifier = GaussianNB().fit(X_train, y_train)
y_train_pred_proba = classifier.predict_proba(X_train)
y_test_pred_proba = classifier.predict_proba(X_test)
####################################
fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred_proba[:,1])
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred_proba[:,1])
plt.title('ROC with NaiveBayes (Gaussian)')
plt.plot(fpr_train, tpr_train, label='roc train')
plt.plot(fpr_test, tpr_test, label='roc test')
plt.xlabel('False Positive Rate (FP/ GroundTruth_N)')
plt.ylabel('True Positive Rate (TP/ GroundTruth_P)')
plt.legend()
plt.show()
#let us plot the confusion matrix
threshold = 0.5
y_trest_pred = np.where(y_test_pred_proba[:,1]>threshold, 1, 0)
cm = confusion_matrix(y_test, y_trest_pred)
cm_display = ConfusionMatrixDisplay(cm).plot()
ax = cm_display.ax_
ax.set_title('NaiveBayes(Gaussian) on test set (threshold: 0.5)')
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Ground Truth Labels')
plt.show()
print('with accuracy on test: ', accuracy_score(y_test, y_test_pred))
KNearestNeighbor (k=1)
k = 1
# we assume we have done some feature selection and we noticed Lag1 and Lag2 are selected features.
X = df[['Lag1','Lag2']].to_numpy()
y = df.iloc[:,1].to_numpy()
label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
#####################################
X_train, X_test, y_train ,y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
classifier = KNNclassifier(n_neighbors=k).fit(X_train, y_train)
y_train_pred_proba = classifier.predict_proba(X_train)
y_test_pred_proba = classifier.predict_proba(X_test)
####################################
fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred_proba[:,1])
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred_proba[:,1])
plt.title('ROC with KNN(k=1)')
plt.plot(fpr_train, tpr_train, label='roc train')
plt.plot(fpr_test, tpr_test, label='roc test')
plt.xlabel('False Positive Rate (FP/ GroundTruth_N)')
plt.ylabel('True Positive Rate (TP/ GroundTruth_P)')
plt.legend()
plt.show()
#let us plot the confusion matrix
threshold = 0.5
y_trest_pred = np.where(y_test_pred_proba[:,1]>threshold, 1, 0)
cm = confusion_matrix(y_test, y_trest_pred)
cm_display = ConfusionMatrixDisplay(cm).plot()
ax = cm_display.ax_
ax.set_title(f'KNN(k={k}) on test set (threshold: 0.5)')
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Ground Truth Labels')
plt.show()
print('with accuracy on test: ', accuracy_score(y_test, y_test_pred))
while k=1 provides a very good performance on train set, it behaves almost randomly on test set!
KNearestNeighbor (k=3)
k = 3
# we assume we have done some feature selection and we noticed Lag1 and Lag2 are selected features.
X = df[['Lag1','Lag2']].to_numpy()
y = df.iloc[:,1].to_numpy()
label_encoder = LabelEncoder().fit(y)
y = label_encoder.transform(y)
#####################################
X_train, X_test, y_train ,y_test = train_test_split(X, y, test_size=0.25, random_state=0)
sc = StandardScaler().fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
classifier = KNNclassifier(n_neighbors=k).fit(X_train, y_train)
y_train_pred_proba = classifier.predict_proba(X_train)
y_test_pred_proba = classifier.predict_proba(X_test)
####################################
fpr_train, tpr_train, _ = roc_curve(y_train, y_train_pred_proba[:,1])
fpr_test, tpr_test, _ = roc_curve(y_test, y_test_pred_proba[:,1])
plt.title('ROC with KNN(k=1)')
plt.plot(fpr_train, tpr_train, label='roc train')
plt.plot(fpr_test, tpr_test, label='roc test')
plt.xlabel('False Positive Rate (FP/ GroundTruth_N)')
plt.ylabel('True Positive Rate (TP/ GroundTruth_P)')
plt.legend()
plt.show()
#let us plot the confusion matrix
threshold = 0.5
y_trest_pred = np.where(y_test_pred_proba[:,1]>threshold, 1, 0)
cm = confusion_matrix(y_test, y_trest_pred)
cm_display = ConfusionMatrixDisplay(cm).plot()
ax = cm_display.ax_
ax.set_title(f'KNN(k={k}) on test set (threshold: 0.5)')
ax.set_xlabel('Predicted Labels')
ax.set_ylabel('Ground Truth Labels')
plt.show()
print('with accuracy on test: ', accuracy_score(y_test, y_test_pred))
According to the book, there has been some improvement in performance when KNN is chosen as classifier (k=1,3). However, our results on test test do not show such thing!)