Machine Learning, a term coined by Arthur Samuel in 1959, is a branch of Artificial Intelligence in which computer systems are given the ability to learn from data and make predictions without being explicitly programmed and without human intervention.
In simple words, Machine Learning is the science of devising models from data, which are then used by data scientists, research engineers, etc. to make predictions.
Machine Learning can be applied in a variety of fields.
The purpose of this Machine Learning tutorial is to provide a brief introduction to machine learning and equip readers with basic machine learning tools. We will use Python as the programming language to process data and build models.
But before moving any further, let's ponder over why we need machine learning in the first place.
We need machine learning for computing tasks where designing and programming explicit algorithms based on static program instructions is infeasible, if not impossible.
A classical example of such a task is email filtering. Each individual has their own way of classifying emails as trash or important (there can be other categories as well), so an app with hard-coded filters will not work for every individual.
A good app should learn from how a user interacts with their inbox and then assist them in classifying future mails.
Sounds interesting!!!
A machine learning algorithm will learn from the user's behaviour and dynamically help them classify future emails into appropriate categories.
Moving on !!!
Based on the data being fed to a machine learning model, we have the following types of machine learning.
Supervised Machine Learning In this method, the computer is given a complete and labeled training set of inputs and desired outputs. The computer then derives a correlation between input and output to make predictions. There is a long list of supervised learning algorithms, which is outside the scope of this tutorial.
Semi-supervised Machine Learning In this method, the computer is given an incomplete training set of inputs, where outputs are missing for a few (sometimes most) records.
Unsupervised Machine Learning In this method, the computer is given an incomplete and unlabeled training set of inputs. The computer is left on its own to find structure in that data.
Reinforcement Machine Learning In this method, the computer is given feedback on its predictions in the form of rewards or punishments. This feedback helps improve the accuracy of future predictions made by the model.
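To make the distinction concrete, here is a minimal sketch (with toy data invented purely for illustration) contrasting supervised learning, where we provide the labels, with unsupervised learning, where the algorithm must find structure on its own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Six toy observations: three small values and three large values
X = np.array([[1.0], [1.2], [0.9], [8.0], [8.5], [7.9]])

# Supervised: we also provide the desired outputs (labels)
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.1], [8.2]]))  # predictions learned from labeled examples

# Unsupervised: only inputs; the algorithm finds the two clusters by itself
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster membership discovered without any labels
```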
I guess that's enough theory; let's quickly look into building a machine learning model. But before that, we need to install some basic tools and software. Starting with,
Installing Anaconda
Anaconda comes with a bundle of useful tools. It installs Python (I'm using Python 3), IDEs for Python such as Spyder, and useful packages like NumPy and pandas that we will use for Machine Learning, among other things.
You can use this link to install Anaconda
After installing, search for Anaconda Navigator and start it.
It looks like the following,
After we are done installing the basic tools, let's launch Spyder from Anaconda Navigator.
This is how it looks in the default layout.
Tips
You can reset to the Spyder default layout from View > Windows layout.
You can configure editor behaviour under Preferences > Editor.
You can toggle individual panes from View > Panes.
This is what I have for myself,
Let's test everything by running a basic print command. Type
print("Hello from Spyder")
in the text editor and save it. Run it with Shift+Enter (on Mac) and you should see
Hello from Spyder
in the IPython console.
Moving On …
Let's learn the basics of how to load and process data before fitting it into a machine learning model.
The data looks something like this,
Country Age Salary Purchased
France 44 72000 No
Spain 27 48000 Yes
Germany 30 54000 No
Spain 38 61000 No
Germany 40 Yes
France 35 58000 Yes
Spain 52000 No
France 48 79000 Yes
Germany 50 83000 No
France 37 67000 Yes
Data Credits: All the data used in this tutorial is taken from the Superdatascience data set.
In the above data,
Country, Age and Salary are the details of the customer. They are called Independent Variables because predictions will be made by analysing them.
Purchased tells us whether the customer bought the company's product. This is the Dependent Variable, or predicted output.
In machine learning models, we use independent variables to predict dependent variables. So, using Country, Age and Salary, we are going to predict whether the customer bought the company's product.
Once we have the data in hand, we will now process it step by step. Beginning with,
Step 1: Importing the Libraries
import numpy as np # Contains mathematical tools
import matplotlib.pyplot as plt # Tools for plotting Charts
import pandas as pd # Helps in importing data sets and managing data sets
To load the libraries, select the code and execute it (Shift+Enter on Mac).
Step 2: Importing the Data Sets
Set the working directory to the folder containing the data file (you can use Spyder's file explorer), then run:
dataset = pd.read_csv('Data.csv') # read_csv is a pandas function used to import the data set
In variable explorer you can see something like this,
Step 3: Segregate Independent Variables and Dependent Variables
X = dataset.iloc[:, :-1].values
"""
In iloc, left of the comma are the rows; (:) means we take all the rows.
Right of the comma are the columns; (:-1) means we take all the columns except the last one.
"""
Y = dataset.iloc[:, 3].values
# 3 is the index of the Purchased column
The above code creates two variables, X and Y.
X will contain the data of the independent variables and Y will contain the data of the dependent variable.
If we type X in IPython Console and press enter, we will see something like
array([['France', 44.0, 72000.0],
['Spain', 27.0, 48000.0],
['Germany', 30.0, 54000.0],
['Spain', 38.0, 61000.0],
['Germany', 40.0, nan],
['France', 35.0, 58000.0],
['Spain', nan, 52000.0],
['France', 48.0, 79000.0],
['Germany', 50.0, 83000.0],
['France', 37.0, 67000.0]], dtype=object)
In Y, we have
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
dtype=object)
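As a self-contained sketch, we can rebuild the first few rows of the data set in memory (so the snippet runs without Data.csv) and apply the same iloc slicing:

```python
import pandas as pd

# The same first three rows as Data.csv, built in memory for illustration
dataset = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany'],
    'Age': [44.0, 27.0, 30.0],
    'Salary': [72000.0, 48000.0, 54000.0],
    'Purchased': ['No', 'Yes', 'No'],
})

X = dataset.iloc[:, :-1].values  # all rows, all columns except the last
Y = dataset.iloc[:, 3].values    # all rows, the Purchased column (index 3)
print(X)
print(Y)
```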
Step 4: Fixing the missing data
We have missing data in both the Age and Salary columns. So we have two options: remove those records, or fill the missing values with something sensible, such as the mean of the column. We will go with the second option.
# Fixing the missing data
from sklearn.preprocessing import Imputer # importing Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
"""
: => all the rows
1:3 => the Age and Salary columns; 3 is the upper bound and is excluded
"""
X[:, 1:3] = imputer.transform(X[:, 1:3])
After running the above code, the X variable looks like
Out[4]:
array([['France', 44.0, 72000.0],
['Spain', 27.0, 48000.0],
['Germany', 30.0, 54000.0],
['Spain', 38.0, 61000.0],
['Germany', 40.0, 63777.77777777778],
['France', 35.0, 58000.0],
['Spain', 38.77777777777778, 52000.0],
['France', 48.0, 79000.0],
['Germany', 50.0, 83000.0],
['France', 37.0, 67000.0]], dtype=object)
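A note on versions: in newer releases of scikit-learn the Imputer class used above was removed, and SimpleImputer from sklearn.impute is its replacement. A minimal sketch of the same mean-imputation on a hand-built array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A few rows with missing Age/Salary values, built in memory for illustration
X = np.array([[44.0, 72000.0],
              [27.0, 48000.0],
              [40.0, np.nan],      # missing Salary
              [np.nan, 52000.0]])  # missing Age

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)  # each NaN becomes the mean of its column
print(X)
```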
Step 5: Encoding the categorical variables
We have two categorical variables in our dataset: Country and Purchased.
Machine learning models are based on mathematical equations, so keeping text in a categorical variable would create problems, as we only want numbers in the equations.
So we need to encode the categorical variables into numbers.
Below code does exactly that
# Encoding the categorical variables
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
"""
0 is index of country column
fit_transform() will return the encoded version of country column
"""
Above code assigns an encoding like France = 0, Germany = 1, Spain = 2.
Which is deeply problematic.
Assigning 2 to Spain gives it more precedence than France and Germany, which is simply not the case. We need to make sure that the encoding does not attribute an order to the categorical variable.
We can achieve this by creating three separate columns for France, Germany and Spain and using 1 or 0 to denote which category a row belongs to.
Something like
France Germany Spain
1 0 0
0 0 1
0 1 0
0 0 1
0 1 0
1 0 0
0 0 1
1 0 0
0 1 0
1 0 0
Below code does exactly that
# Encoding the categorical variables
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
X looks like
France Germany Spain Age Salary
1 0 0 44 72000
0 0 1 27 48000
0 1 0 30 54000
0 0 1 38 61000
0 1 0 40 63777.8
1 0 0 35 58000
0 0 1 38.7778 52000
1 0 0 48 79000
0 1 0 50 83000
1 0 0 37 67000
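A note on versions: the categorical_features argument of OneHotEncoder was removed in newer releases of scikit-learn. A minimal sketch of the modern equivalent, which uses ColumnTransformer to one-hot encode only the country column:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# A few rows of the data, built in memory for illustration
X = np.array([['France', 44.0, 72000.0],
              ['Spain', 27.0, 48000.0],
              ['Germany', 30.0, 54000.0]], dtype=object)

# One-hot encode column 0, keep the remaining columns unchanged
ct = ColumnTransformer([('country', OneHotEncoder(), [0])],
                       remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X)
print(X)  # columns: France, Germany, Spain, Age, Salary
```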
To encode Y we can still use LabelEncoder, since it is the dependent variable and the machine learning model will know that it is a category with no order between the two values.
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Y looks like
Purchased
0
1
0
0
1
1
0
1
0
1
Step 6: Splitting data set into Training set and Test set
We want to create training set and test set from our data set to check the correctness and performance of our model.
Training set is defined as data on which we build the machine learning model.
Test set is defined as data on which we test the performance of machine learning model.
We build machine learning model on training set by establishing correlation between independent variable and dependent variable in train set.
Once our machine learning model understands the correlation between the independent and dependent variables, we test whether it can apply the correlations learned from the training set to the test set, i.e. we check the accuracy of its predictions on the test set.
Below code does exactly that
# splitting the dataset into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
"""
test_size = 0.2 => we get an 80% training set and a 20% test set
"""
The result of the above code looks like,
X_train
France Germany Spain Age Salary
0 1 0 40 63777.8
1 0 0 37 67000
0 0 1 27 48000
0 0 1 38.7778 52000
1 0 0 48 79000
0 0 1 38 61000
1 0 0 44 72000
1 0 0 35 58000
X_test
France Germany Spain Age Salary
0 1 0 30 54000
0 1 0 50 83000
Similarly, we have 8 and 2 observations in Y_train and Y_test respectively.
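As a quick self-contained sanity check of the 80/20 split behaviour (on toy data, not the customer data set):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(10, 1)  # 10 toy observations
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))  # 8 2
```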
Step 7: Variable Scaling
The Age and Salary columns contain numerical values that are not on the same scale: Age goes from 27 to 50, while Salary goes from roughly 48k to 83k.
A lot of machine learning models are based on Euclidean distance. The Euclidean distance along the Salary axis will dominate the distance along the Age axis, because the range of Salary is much larger than the range of Age.
To mitigate this, we need to bring both onto the same scale / range, e.g. -1 to +1.
We can achieve this with either of the following methods:
Standardisation: X(stand) = (X - mean(X)) / standard deviation(X)
Normalisation: X(norm) = (X - min(X)) / (max(X) - min(X))
Below code does exactly that
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# we only transform the test set, using the scaler already fitted on the training set
The result of the above code looks like: X_train
France Germany Spain Age Salary
-1 2.64575 -0.774597 0.263068 0.123815
1 -0.377964 -0.774597 -0.253501 0.461756
-1 -0.377964 1.29099 -1.9754 -1.53093
-1 -0.377964 1.29099 0.0526135 -1.11142
1 -0.377964 -0.774597 1.64059 1.7203
-1 -0.377964 1.29099 -0.0813118 -0.167514
1 -0.377964 -0.774597 0.951826 0.986148
1 -0.377964 -0.774597 -0.597881 -0.482149
X_test
France Germany Spain Age Salary
-1 2.64575 -0.774597 -1.45883 -0.901663
-1 2.64575 -0.774597 1.98496 2.13981
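For reference, the two formulas given above correspond to scikit-learn's StandardScaler (standardisation) and MinMaxScaler (normalisation); a minimal sketch on a few of the Age values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

ages = np.array([[44.0], [27.0], [30.0], [38.0]])  # a few Age values from the data

standardised = StandardScaler().fit_transform(ages)  # (X - mean(X)) / std(X)
normalised = MinMaxScaler().fit_transform(ages)      # (X - min(X)) / (max(X) - min(X))
print(standardised.ravel())
print(normalised.ravel())
```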
Full Code for Data Preprocessing
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 3].values
# Fixing the missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
# Encoding the categorical variables
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
# splitting the dataset into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
Phew !! That’s it !!!
We have looked into all the data processing steps.
One thing to note is that we may not need all of these steps for every dataset; which processing steps are required before applying a machine learning algorithm depends highly on the format of the data.
Let's move on to building models, beginning with regression.
A regression model is used to predict real values like salary (dependent variable) based on time (independent variable). There are multiple regression techniques; we will start with Simple Linear Regression.
The Simple Linear Regression technique is basically the following formula,
y = b0 + b1*x1
where,
y = Dependent Variable (something we are trying to explain)
x1 = Independent Variable
b1 = Coefficient which determines how a unit change in x1 will cause a change in y
b0 = constant
Suppose we have Salary vs Experience data and we want to predict Salary based on Experience. Plotting the data, it looks something like,
In our scenario the regression equation looks like
salary = b0 + b1*Experience
where
b0 = salary at zero experience
b1 = change in salary per unit increase in experience. The higher b1 (the slope), the faster salary grows with experience
We want to find the line that best fits the observations marked as (+).
How to find that best fit line?
In the above diagram, let L1 be the line representing the simple linear regression model. We have drawn green lines from the actual observations (+) to the model.
a1 = the salary the person should have according to the model, i.e. the model's prediction
a2 = the actual salary of the person
green line = the difference between what the person is actually earning and what they should earn according to the model.
To find the best fitting line, we do the following: for each observation we take the squared difference
(a1-a2)²
then sum it over all observations
Σ(a1-a2)²
and choose the line that minimises this sum
min(Σ(a1-a2)²)
This approach is known as the Ordinary Least Squares method.
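The minimisation above has a closed-form solution. As a sketch, we can compute the coefficients by hand with NumPy, using a handful of (Experience, Salary) observations from the data set:

```python
import numpy as np

# A few (Experience, Salary) observations from the data set
x = np.array([1.1, 2.0, 3.2, 4.0, 5.1])
y = np.array([39343.0, 43525.0, 54445.0, 55794.0, 66029.0])

# Closed-form ordinary least squares solution of min Σ(a1 - a2)²
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # intercept and slope of the best-fit line
```

This is exactly what LinearRegression computes for us later in the tutorial.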
Let’s quickly create a model based on data, which looks like
YearsExperience Salary
1.5 37731.0
1.1 39343.0
2.2 39891.0
2.0 43525.0
1.3 46205.0
3.2 54445.0
4.0 55794.0
2.9 56642.0
4.0 56957.0
4.1 57081.0
3.7 57189.0
3.0 60150.0
4.5 61111.0
3.9 63218.0
3.2 64445.0
5.1 66029.0
4.9 67938.0
5.9 81363.0
5.3 83088.0
6.8 91738.0
6.0 93940.0
7.1 98273.0
7.9 101302.0
9.0 105582.0
8.7 109431.0
9.6 112635.0
8.2 113812.0
9.5 116969.0
10.5 121872.0
10.3 122391.0
Step 1: Loading and processing the data
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values # Independent Variable
y = dataset.iloc[:, 1].values # Dependent Variable
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
# We don't need feature scaling because the LinearRegression library will take care of it
Step 2: Fitting Simple Linear Regression to training data
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
linearRegressor = LinearRegression() # creating linearRegressor object
linearRegressor.fit(X_train, y_train) # Fitting model with training set
The next step is to check how well our simple linear regression model learned the correlation in the training set, by looking at its predictions on the test set observations.
Step 3: Creating a Vector of Predicted Values
# Predicting the Test set results
prediction = linearRegressor.predict(X_test)
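Beyond visual inspection, one common way to quantify how well the line fits is the R² score from sklearn.metrics; a self-contained sketch on a few observations of the same data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# A handful of (Experience, Salary) observations from the data set
X = np.array([[1.1], [2.0], [3.2], [4.0], [5.1], [6.0], [7.1], [9.0]])
y = np.array([39343.0, 43525.0, 54445.0, 55794.0,
              66029.0, 93940.0, 98273.0, 105582.0])

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))
print(r2)  # 1.0 would be a perfect fit
```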
Finally, let's plot the predictions of the linear regression model against the real observations.
Step 4: Visualization of Model w.r.t. training set
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, linearRegressor.predict(X_train), color = 'blue')
# Y coordinate is the prediction of train set
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.show()
It looks something like
In the above graph, the real values are the red dots and the predicted values lie on the blue simple linear regression line.
Step 5: Visualization of Model w.r.t. test set
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, linearRegressor.predict(X_train), color = 'blue')
# We will obtain same linear regression line by plotting it with either train set or test set
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.show()
It looks something like
In the above graph, the red dots are the observations of the test set and the predicted values lie on the blue simple linear regression line.
That's all folks, we have now built our very first machine learning model and made some decent predictions.
Conclusion
I would like to conclude this article by highlighting the fact that Machine Learning is truly opening new prospects from the petabytes of data that organisations have piled up.
One fine example would be Amazon Recommendation system. This article published on Forbes beautifully highlights the potential of Machine Learning in driving revenue in the enterprise.