Machine Learning is a branch of Artificial Intelligence in which computer systems are given the ability to learn from data and make predictions without being explicitly programmed and without the need for human intervention.
I’ve discussed Machine Learning in depth in this post and regression in this post.
In this post, I would like to go over common Machine Learning classification techniques. However, first, let’s answer the question: what is classification?
In classification, we predict the category of a data point, unlike regression, where we predict continuous real values.
There are many classification techniques, but in this article we will look into the following:
Let’s look into each one of them individually beginning with
Logistic Regression is a linear classifier built on the following formula:
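For a single feature $x$, the standard form (with coefficients $b_0$ and $b_1$) is:

$$p = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$$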
The above equation is the output of applying the Sigmoid function to a Linear Equation (details of which are outside the scope of this tutorial).
Let’s understand this concept intuitively by taking an example.
Suppose we have Age vs. Action data, and we want to predict whether an intended action (e.g., clicking on a mail) is performed by a customer of a particular age.
Plotting the data, it looks something like this:
Based on the age, we want to predict the probability, or likelihood, that a person will click on the mail (the action) or not.
Applying the logistic regression formula to the above data, we get a curve which looks like this:
The green curve is the logistic regression fit that best describes the data set. We can use that curve to predict the probability for a particular age.
Let’s try predicting the probabilities for x = 20, 30, 40, 50.
Based on the projections for x = 20, 30, 40, 50 we get probabilities of .07, .2, .85, and .9 respectively.
To convert those probabilities into predictions, let’s choose a threshold of .5 (i.e., 50%): any probability of .5 or above is classified as 1 (will click), and anything below as 0 (will not click).
Based on the above threshold, x = 20, 30 will not click on the email and x = 40, 50 will click on the email. In other words, we predicted the likelihood of a user clicking on a mail.
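As a tiny illustrative sketch of that thresholding step (the probabilities here are just the ones we read off the curve above):

import numpy as np
ages = np.array([20, 30, 40, 50])
probs = np.array([0.07, 0.2, 0.85, 0.9])  # probabilities read off the fitted curve
# Apply the 0.5 threshold: 1 = will click, 0 = will not click
for age, p in zip(ages, probs):
    print(f"Age {age}: p = {p:.2f} -> prediction = {1 if p >= 0.5 else 0}")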
To dive further into Logistic Regression, let’s create a model where we predict whether a person will buy a particular product (e.g., a newly launched SUV) based on Age and Estimated Salary.
A sample from the complete data set looks like this:
User ID | Gender | Age | EstimatedSalary | Purchased |
---|---|---|---|---|
15624510 | Male | 19 | 19000 | 0 |
15810944 | Male | 35 | 20000 | 0 |
15668575 | Female | 26 | 43000 | 0 |
15603246 | Female | 27 | 57000 | 0 |
15804002 | Male | 19 | 76000 | 0 |
15728773 | Male | 27 | 58000 | 0 |
15598044 | Female | 27 | 84000 | 0 |
15694829 | Female | 32 | 150000 | 1 |
15600575 | Male | 25 | 33000 | 0 |
15727311 | Female | 35 | 65000 | 0 |
15570769 | Female | 26 | 80000 | 0 |
15606274 | Female | 26 | 52000 | 0 |
15746139 | Male | 20 | 86000 | 0 |
15704987 | Male | 32 | 18000 | 0 |
Click here to get Full data.
Data Credits: All the data used in this tutorial is taken from the Superdatascience data set.
Below is a step-by-step approach for analyzing the data, beginning with
Step 1: Loading and processing the data
# Logistic Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values # Extracting Age and Estimated Salary
y = dataset.iloc[:, 4].values # Extracting Purchased
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
Step 2: Fitting Logistic Regression to training data
# Fitting Logistic Regression to Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
Step 3: Predicting the Test set results
# Predicting the Test set results
y_pred = classifier.predict(X_test)
Step 4: Making Confusion Matrix
Before diving into the Confusion Matrix, let’s understand the following terms:
A Confusion Matrix basically tells us the numbers of True Positives, False Positives, True Negatives, and False Negatives.
Code to calculate Confusion Matrix
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
The output of the Confusion Matrix looks like this:
array([[65, 3],
[ 8, 24]])
Above array can also be visualized as
 | Y Predicted = 0 | Y Predicted = 1 |
---|---|---|
Y Actual = 0 | True Negative | False Positive (Type I) |
Y Actual = 1 | False Negative (Type II) | True Positive |
Hence, we have 65 + 24 = 89 correct predictions and 8+3 = 11 incorrect predictions.
We can also calculate a few important Metrics like
Since we are discussing Metrics, let’s look into
AUC - Area Under Curve
F1 Score
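As a rough sketch of how these metrics (along with accuracy, precision, and recall) can be computed with scikit-learn, reusing the y_test, y_pred, X_test and classifier from the steps above:

# Computing common classification metrics (sketch)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
print('Accuracy :', accuracy_score(y_test, y_pred))   # (TP + TN) / total
print('Precision:', precision_score(y_test, y_pred))  # TP / (TP + FP)
print('Recall   :', recall_score(y_test, y_pred))     # TP / (TP + FN)
print('F1 Score :', f1_score(y_test, y_pred))         # harmonic mean of precision and recall
# AUC needs predicted probabilities of the positive class, not hard 0/1 labels
print('ROC AUC  :', roc_auc_score(y_test, classifier.predict_proba(X_test)[:, 1]))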
Step 5: Visualisation of Logistic Regression
Step 5a: Visualising Training Data
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.45, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Training Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Step 5b: Visualising Test Data
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.45, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Logistic Regression (Test Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
The goal of the Logistic Regression classifier is to classify users into the right category. For a new user, our classifier will predict whether they belong to the green region or the red region based on their age and estimated salary.
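For example, here is a quick sketch of scoring a brand-new user (the age and salary values are made up for illustration; note that we must scale them with the same sc_X fitted on the training set):

# Hypothetical new user: 30 years old, earning 87,000
new_user = sc_X.transform([[30, 87000]])   # scale with the already-fitted scaler
print(classifier.predict(new_user))        # 0 = won't purchase, 1 = will purchase
print(classifier.predict_proba(new_user))  # class probabilities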
Below is the detailed explanation of the above graphs
I think that is enough information on Logistic Regression; let’s look into
Suppose our data points look like,
Also, we have a new point marked as blue as shown below.
We want to predict whether it belongs to the Red set or the Green set. We will apply the K-NN algorithm in the following steps.
Step 1: Choose the K number of neighbors. Suppose K = 5
Step 2: Take the K nearest neighbors of the new data point, according to Euclidean distance
Step 3: Among these K neighbors, count the number of data points in each category
Step 4: Assign the new data point to the category where we have counted the most neighbors
Analyzing the above data, the blue dot has three neighbors in the Red category and two in the Green category. Hence, by applying the K-Nearest Neighbors algorithm, it belongs to the Red category, as shown below.
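To make those four steps concrete, here is a minimal from-scratch sketch for a single new point (the toy coordinates are made up purely for illustration; below we will use scikit-learn on the real dataset):

# A tiny hand-rolled K-NN classification (illustrative only)
import numpy as np
from collections import Counter
points = np.array([[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6]], dtype=float)  # toy data
labels = np.array(['Red', 'Red', 'Red', 'Green', 'Green', 'Green'])
new_point = np.array([3.0, 3.0])
k = 5
# Step 2: Euclidean distance to every point, keep the K nearest
distances = np.linalg.norm(points - new_point, axis=1)
nearest = labels[np.argsort(distances)[:k]]
# Steps 3 & 4: count categories among the K neighbours and take the majority vote
print(Counter(nearest).most_common(1)[0][0])  # 'Red' (3 Red vs. 2 Green neighbours)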
Let’s try analyzing the same dataset we used for Logistic Regression with K-Nearest Neighbors classification; the code looks like this:
# K-Nearest Neighbors Classification (K-NN)
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values # Extracting Age and Estimated Salary
y = dataset.iloc[:, 4].values # Extracting Purchased
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# Fitting K-Nearest Neighbors to Training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
## p = 2 for euclidean distance
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.45, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('K-Nearest Neighbors (Training Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.45, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('K-Nearest Neighbors (Test Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
The Confusion Matrix looks like this:
array([[64, 4],
[ 3, 29]])
Here we have only 7 (4+3) incorrect predictions, which is an improvement over Logistic Regression’s 11 wrong predictions.
Visualising Training Data
Visualising Test Data
Observations from the above graph
Moving on let’s look into
Suppose our data points look like,
We want to separate the data points linearly by drawing a line between them. However, wait: there can be any number of lines drawn between Category 1 and Category 2. Which one should we choose?
Here SVM comes to the rescue: SVM helps us find the best line, or best decision boundary, that separates our space into classes.
SVM searches for the line having the maximum margin from the closest points of each class; graphically it looks like this:
In the above graph,
Let’s try analyzing the same dataset with a Support Vector Machine; the code looks like this:
# Support Vector Machine (SVM)
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values # Extracting Age and Estimated Salary
y = dataset.iloc[:, 4].values # Extracting Purchased
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# Fitting Support Vector Machine (SVM) to Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.45, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Support Vector Machine (SVM) (Training Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.45, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Support Vector Machine (SVM) (Test Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
The Confusion Matrix looks like this:
array([[66, 2],
[ 8, 24]])
Here we have only 10 (8+2) incorrect predictions, which is pretty decent.
Visualising Training Data
Visualising Test Data
Observations from the above graph
Can we improve the prediction accuracy of SVM?
Maybe we can, by trying different kernels. Let’s do that in
Suppose our data points look like,
The data can’t be separated by a straight line, implying it is not linearly separable. However, the assumption, or prerequisite, of using SVM is that the data is linearly separable.
So, how can we apply SVM in this scenario?
A simple way out would be to add an extra dimension to our data to make it linearly separable. We can do that in the following two steps:
Applying both steps, we have a circle separating our data, which looks like this:
The caveat of the above approach is that mapping to a higher-dimensional space can be compute-intensive.
Can we do better, i.e., can we apply SVM without going to a higher dimension?
Yes, we can by applying The Kernel Trick.
The detailed explanation of the Kernel Trick is beyond the scope of this tutorial. Just understand that we take a Gaussian RBF Kernel, the formula of which looks like this:
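Written out in its standard form (where $\vec{l}$ is a landmark point and $\sigma$ controls the width of the kernel):

$$K(\vec{x}, \vec{l}) = e^{-\frac{\lVert \vec{x} - \vec{l} \rVert^{2}}{2\sigma^{2}}}$$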
We use this kernel to separate our dataset, i.e., build the decision boundary.
Basically, by figuring out the optimum values of the landmark $\vec{l}$ and $\sigma$, we obtain a boundary which separates our dataset. (A detailed explanation of how this is achieved is beyond the scope of this tutorial.)
Remember, by using the Gaussian RBF kernel trick we aren’t doing any computation in the higher-dimensional space.
Also, the Gaussian RBF kernel is not the only kernel we can choose; we have other options as well, as shown below.
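As a quick sketch, swapping kernels in scikit-learn is a one-line change; 'linear', 'poly', 'sigmoid' and 'rbf' are the kernel names SVC accepts (X_train, y_train, X_test, y_test come from the earlier steps):

# Comparing a few SVC kernels on the same split (sketch)
from sklearn.svm import SVC
for kernel in ['linear', 'poly', 'sigmoid', 'rbf']:
    clf = SVC(kernel=kernel, random_state=0)
    clf.fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))  # test-set accuracy for each kernel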
I guess that’s enough theory.
Let’s try analyzing the same dataset with Kernel SVM; the code looks like this:
# Kernel SVM
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values # Extracting Age and Estimated Salary
y = dataset.iloc[:, 4].values # Extracting Purchased
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# Fitting Kernel SVM to Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
## RBF is a Gaussian Kernel
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.45, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Kernel SVM (Training Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.45, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Kernel SVM (Test Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
The Confusion Matrix looks like this:
array([[64, 4],
[ 3, 29]])
Here we have only 7 (4+3) incorrect predictions, which is a clear improvement over linear SVM (10 wrong predictions) and Logistic Regression (11 wrong predictions), and matches K-NN (7 wrong predictions).
Visualising Training Data
Visualising Test Data
Observations from the above graph
Moving on let’s explore more and dig into
To understand the Naive Bayes classifier, let’s look at the data in the graph below.
We have a Salary vs. Age plot in which the Red dots are people who walk to the office and the Green dots are people who drive to the office.
Suppose we have a new data point (marked in grey in the above graph); we want to classify whether this person walks to work or drives to work.
Let’s try solving this problem using Bayes’ theorem. However, before that, a quick side note.
Side Note: Bayes’ Theorem is represented by the following equation (a detailed explanation of how it is derived is beyond the scope of this tutorial):
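$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

In our setting, $A$ will be the class (Walks or Drives) and $B$ will be the observed features $X$.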
Naive Bayes will broadly perform three steps for our problem. Let’s look into them, starting with
Step 1: Finding the likelihood that a person Walks to the office given features X, i.e., the posterior probability

$$P(\text{Walks} \mid X) = \frac{P(X \mid \text{Walks}) \cdot P(\text{Walks})}{P(X)}$$

Here $X$ represents the features of a person; in our scenario these are Age and Salary.

Where,

$P(\text{Walks})$ = Prior Probability

It represents the probability that a person walks to work, i.e.

$$P(\text{Walks}) = \frac{\text{Number of Walkers}}{\text{Total Observations}}$$

$P(X)$ = Marginal Likelihood

To calculate the Marginal Likelihood, let’s draw a circle of an arbitrary radius of our choice around the new observation, like in the figure below.

Look at all the points inside the circle; in our case there are 4 (3 Red, 1 Green). We will deem all those points similar in features to the new (Grey) data point.

$P(X)$ will be the probability that any random point falls into this circle, i.e.

$$P(X) = \frac{\text{Number of Similar Observations}}{\text{Total Observations}}$$

$P(X \mid \text{Walks})$ = Likelihood

$P(X \mid \text{Walks})$ means: what is the probability that a person who walks to the office exhibits features X?

Within the same circle, $P(X \mid \text{Walks})$ will be the probability that a random point falls into the circle given that the person walks, i.e.

$$P(X \mid \text{Walks}) = \frac{\text{Number of Similar Observations Among Those Who Walk}}{\text{Total Number of Walkers}}$$

In essence, the Number of Similar Observations Among Those Who Walk is just the number of Red points falling inside the circle, because we are only considering people who walk to work, i.e., Red points.

$P(\text{Walks} \mid X)$ = Posterior Probability

Which is equal to

$$P(\text{Walks} \mid X) = \frac{P(X \mid \text{Walks}) \cdot P(\text{Walks})}{P(X)}$$
Step 2: Finding the likelihood that a person Drives to the office given X, i.e., $P(\text{Drives} \mid X)$.
Calculating all the parts of the RHS in the same way (this time counting the Green points inside the circle and the total number of drivers), we get $P(\text{Drives} \mid X)$.
Step 3: Compare both probabilities and decide, i.e., $P(\text{Walks} \mid X)$ vs. $P(\text{Drives} \mid X)$. Since more of the similar points inside the circle are Red than Green (3 vs. 1), we find that

$$P(\text{Walks} \mid X) > P(\text{Drives} \mid X)$$
Hence, we classify the new data point as: this person will walk to work.
That’s a lot of theory, I know.
Let’s try analyzing the same dataset with Naive Bayes; the code looks like this:
# Naive Bayes
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values # Extracting Age and Estimated Salary
y = dataset.iloc[:, 4].values # Extracting Purchased
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# Fitting Naive Bayes to Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.45, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Naive Bayes (Training Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.45, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Naive Bayes (Test Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
The Confusion Matrix looks like this:
array([[65, 3],
[ 7, 25]])
Here we have 10 (7+3) incorrect predictions.
Visualising Training Data
Visualising Test Data
Observations from the above graph
Smooth, right? Only two more algorithms left.
Let’s look into the penultimate algorithm, which is
Decision Trees are very good for interpretation and can be used to analyze data which looks like this:
A Decision Tree basically cuts our data into slices, which looks like this:
But how are those splits selected?
Splits are chosen so that each one maximizes the number of data points of a single category in the resulting regions; in other words, each split tries to *minimize entropy* (details of which are beyond the scope of this post).
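As a rough illustration of what minimizing entropy means (a small sketch, not the library’s internal implementation):

# Shannon entropy of a set of class labels: high when mixed, zero when pure
import numpy as np
def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
print(entropy(['Red', 'Red', 'Green', 'Green']))  # 50/50 mix -> 1 bit, maximally impure
print(entropy(['Red', 'Red', 'Red', 'Red']))      # single class -> 0 bits, perfectly pure
# A good split produces child regions whose weighted entropy is as low as possible.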
Let’s construct a decision tree based on the above split, which looks like
Based on the above graph, the terminal leaves predict whether a point belongs to the Red class or the Green class.
Let’s try analyzing the same dataset with Decision Tree classification; the code looks like this:
# Decision Tree Classification
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values # Extracting Age and Estimated Salary
y = dataset.iloc[:, 4].values # Extracting Purchased
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
## We don't need feature scaling in case of Decision Tree because it's not based on Euclidean distance.
## But since we are plotting with the higher resolution we will keep this step
# Fitting Decision Tree to Training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion= 'entropy', random_state=0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.45, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Decision Tree (Training Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.45, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Decision Tree (Test Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
The Confusion Matrix looks like this:
array([[62, 6],
[ 3, 29]])
Here we have 9 (6+3) incorrect predictions.
Visualising Training Data
Visualising Test Data
Observations from the above graph
Can we improve this model further?
Imagine if we had used 500 Decision Trees for prediction.
Let’s look into the last algorithm, which is
Random Forest is a form of Ensemble Learning.
In Ensemble Learning, we use several algorithms, or the same algorithm multiple times, to build something more powerful.
Below are the steps of building a Random Forest
Step 1: Pick K random data points from the training set.
Step 2: Build a Decision Tree associated with these K data points.
Step 3: Repeat Step 1 and Step 2 N times and build N trees.
Step 4: Use all N Trees to predict the category of the new data point. Then choose the category that wins the majority vote.
We can see that this improves the accuracy of our predictions because we take into account the votes of N trees; a rough sketch of this idea follows below.
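A minimal sketch of that sample-then-vote idea, built by hand from a few decision trees on bootstrap samples (the RandomForestClassifier used below does all of this, and more, internally; X_train, y_train, X_test come from the earlier steps):

# Hand-rolled 'forest': bootstrap samples + majority vote (illustrative only)
import numpy as np
from sklearn.tree import DecisionTreeClassifier
rng = np.random.RandomState(0)
forest = []
for _ in range(10):                                            # Steps 1-3: N = 10 trees
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
    forest.append(tree.fit(X_train[idx], y_train[idx]))
votes = np.array([tree.predict(X_test) for tree in forest])    # Step 4: every tree votes
y_pred_manual = (votes.mean(axis=0) >= 0.5).astype(int)        # majority wins (labels are 0/1)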
Microsoft Kinect is a great example of a product that uses the Random Forest algorithm to sense the movement of body parts.
Let’s try analyzing the same dataset with Random Forest classification; the code looks like this:
# Random Forest Classification
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values # Extracting Age and Estimated Salary
y = dataset.iloc[:, 4].values # Extracting Purchased
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
## We don't need feature scaling in case of Random Forest because it's not based on Euclidean distance.
## But since we are plotting with the higher resolution we will keep this step
# Fitting Random Forest to Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion= 'entropy', random_state=0)
## n_estimators => Number of trees in forest
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.45, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest (Training Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.45, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest (Test Set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
The Confusion Matrix looks like this:
array([[63, 5],
[ 3, 29]])
Here we have 8 (5+3) incorrect predictions, which is an improvement over Decision Tree classification.
Visualising Training Data
Visualising Test Data
Observations from the above graph
Conclusion
In conclusion, we need to choose the model with the maximum number of correct predictions while avoiding overfitting.
Looking at all the classification models and analyzing their confusion matrices for accuracy, as well as their decision boundaries, it seems we should choose Kernel SVM for our problem.
Last but not least, we may consider the following guidelines before picking a model:
That’s it, folks; that’s how classification in Machine Learning rocks.