Amazon SageMaker

Machine Learning is a branch of Artificial Intelligence in which computer systems are given the ability to learn from data and make predictions without being explicitly programmed or requiring constant human intervention.


I’ve discussed Machine Learning in depth in this post, regression algorithms in this post, and classification algorithms in this post.


In my previous post we looked into how Amazon Machine Learning works. In this post, I would like to go over how we can use Amazon SageMaker to train, evaluate and deploy our model for batch and real-time predictions.


As per Amazon, Amazon SageMaker Service is

A fully managed machine learning service for data scientists and developers where they can quickly build and train machine learning models and deploy them into a production-ready hosted environment.


Amazon SageMaker comes with a Jupyter notebook instance to access data for exploration and analysis. It also supports a variety of built-in algorithms.

You may look into this page for the list of algorithms currently supported by SageMaker.


Before moving any further, let’s define a few ML terms that we will use in this post.


I think that’s enough of an introduction; let’s look into the real question:

How does Amazon SageMaker work?

To answer it, let’s deep dive into Amazon SageMaker and build a model from scratch.


Below is a step-by-step guide on how to use the Amazon SageMaker service.


Step 1: Preparing Data Source

Let’s create a model that predicts the class of an Iris flower based on sepal length, sepal width, petal length, and petal width.


A sample from the complete data set looks like this:

sepal_length  sepal_width  petal_length  petal_width  class
5.1           3.5          1.4           0.2          Iris-setosa
4.9           3.0          1.4           0.2          Iris-setosa
4.7           3.2          1.3           0.2          Iris-setosa
7.0           3.2          4.7           1.4          Iris-versicolor
6.4           3.2          4.5           1.5          Iris-versicolor
6.3           3.3          6.0           2.5          Iris-virginica
5.8           2.7          5.1           1.9          Iris-virginica
7.1           3.0          5.9           2.1          Iris-virginica


Click here to get the full data.
Data credits: The data used in this tutorial is taken from the UCI Machine Learning Repository.


Amazon SageMaker needs a data source to train a model, so let’s upload the iris_all.csv data to an S3 bucket. (You may refer to this AWS tutorial for it.)


Let’s name our bucket as bornshrewd-aws-sagemaker-demo.
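If you prefer to upload the file programmatically instead of through the console, here is a minimal sketch using boto3 (it assumes your AWS credentials are already configured and the bucket has been created):

import boto3

# Upload the local iris_all.csv to the bucket created above
s3 = boto3.Session().resource('s3')
s3.Bucket('bornshrewd-aws-sagemaker-demo').upload_file('iris_all.csv', 'iris_all.csv')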


It has iris_all.csv as shown in the following picture. Amazon Data Bucket


We will use the same bucket to upload the Model Artifact as well.


In our case, we will be using the XGBoost library to build a model.


XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework.


In short, gradient boosting builds an ensemble of shallow decision trees sequentially, where each new tree is trained to correct the errors made by the trees built so far; XGBoost adds regularization and efficient, parallelizable tree construction on top of this idea.

You may look into this page for more details on XGBoost.


On a side note, Amazon SageMaker supports a variety of data formats for training the model, depending on the algorithm. For example, the built-in XGBoost algorithm accepts input in libsvm or CSV format; for CSV, the class label must be in the first column and the file must not contain a header row.

You may look into this page for the list of algorithms with their supported data types.
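For instance, two rows of training data in the CSV format expected by the built-in XGBoost algorithm would look like this (encoded class label first, no header row; the values are taken from the Iris sample above, and we produce exactly this layout in the cloud section later):

0,5.1,3.5,1.4,0.2
2,6.3,3.3,6.0,2.5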


Similarly, each algorithm supports a specific set of output formats for the predictions returned to a client application.


Step 2: Launching Notebook Instance
Go to the Amazon SageMaker service; you will see a screen like this:

Amazon SageMaker Homepage


Click on Notebook instances.


You will see the Notebook instances dashboard. In the top right corner you will see Create Notebook Instance. Click on it and you will be directed to the Create Notebook instance page.


Fill in the form details, viz. instance name, instance type, and IAM role, and click on Create Notebook Instance as seen below. Create Notebook Instance


PS: Let’s create a new IAM role to give the notebook instance permission to call other services, e.g. SageMaker and S3, as shown below. Create Role


We will be redirected to the Notebook Instance dashboard. Once the status says InService, click on Open Jupyter as seen below

Open Jupyter


The Jupyter instance will look like this: Jupyter


Click on New > Folder. It will create a new folder named Untitled Folder; rename it to iris.


Step 3: Creating an IPython Notebook and Installing XGBoost Locally

Let’s first make predictions by installing XGBoost locally in the notebook instance and building the model there.


Click on New > conda_python3. It will create a new notebook named Untitled; rename it to iris-classification-local.ipynb.


Installing XGBoost locally

!conda install -y -c conda-forge xgboost
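To confirm the installation worked, a quick sanity check of the installed version (this check is my addition, not part of the original walkthrough):

import xgboost
print(xgboost.__version__)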


Step 4: Importing Libraries and Iris Dataset from S3 bucket

import boto3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Defining utility methods
# Reference: http://boto3.readthedocs.io/en/latest/guide/s3.html
# bucket: Name of bucket
# key: File name stored in S3

def write_to_s3(filename, bucket, key):
    with open(filename,'rb') as f: # Read in binary mode
        return boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(f)


def download_from_s3(filename, bucket, key):
    with open(filename,'wb') as f:
        return boto3.Session().resource('s3').Bucket(bucket).Object(key).download_fileobj(f)


# Downloading file from S3
download_from_s3('iris_all.csv', 'bornshrewd-aws-sagemaker-demo', 'iris_all.csv')

# Reading CSV File
df = pd.read_csv('iris_all.csv')

# Let's see what our data looks like
df.head(2)


Step 5: Data Preprocessing
We need to predict the class of iris (the dependent variable) from sepal_length, sepal_width, petal_length, and petal_width (the independent variables). As we can see, class is a categorical variable with the values Iris-setosa, Iris-versicolor, and Iris-virginica, so we need to convert it to equivalent numerical values.


We will use the preprocessing module from sklearn to achieve this:


le = preprocessing.LabelEncoder()
le.fit(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
df['encoded_class'] = le.transform(df['class'])
df.head(2)


  sepal_length  sepal_width  petal_length  petal_width  class        encoded_class
0 5.1           3.5          1.4           0.2          Iris-setosa  0
1 4.9           3.0          1.4           0.2          Iris-setosa  0


Step 6: Splitting Data into Training, Test and Validation Sets

X = df.iloc[:, :4]   # Independent variables: sepal and petal measurements
y = df.iloc[:, 5]    # Encoded class (dependent variable)

# Hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Carve a validation set out of the remaining training data
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)


Step 7: Creating XGBoost Classifier and Fitting the model

classifier = xgb.XGBClassifier(max_depth=5, objective="multi:softmax", num_class=3)
classifier.fit(X_train, y_train,
               eval_set=[(X_train, y_train), (X_val, y_val)],
               eval_metric=['merror', 'mlogloss'])


While fitting the model, XGBoost reports merror and mlogloss for the training and validation sets. Both should decrease with each iteration, implying that the model is improving.


We can even plot mlogloss for training and validation.

eval_result = classifier.evals_result()
training_rounds = range(len(eval_result['validation_0']['mlogloss']))

# Plotting
plt.scatter(x=training_rounds,y=eval_result['validation_0']['mlogloss'],label='Training Error')
plt.scatter(x=training_rounds,y=eval_result['validation_1']['mlogloss'],label='Validation Error')
plt.grid(True)
plt.xlabel('Iteration')
plt.ylabel('LogLoss')
plt.title('Training Vs Validation Error')
plt.legend()


The graph looks like this: Training vs Validation mlogloss

We can also plot the importance of features

xgb.plot_importance(classifier)


The graph looks like this: Feature Importance


Note: Click here for detailed information about the hyperparameters and evaluation metrics such as merror and mlogloss.


Step 8: Running Predictions
Let’s run predictions on our test set.

y_pred = classifier.predict(X_test)

# Converting y_pred and y_test to original class names
pred = le.inverse_transform(y_pred)
test = le.inverse_transform(y_test)


Let’s print the confusion matrix to see how our model is faring.

# Printing the confusion matrix
pd.crosstab(test, pred)


actual \ predicted   Iris-setosa  Iris-versicolor  Iris-virginica
Iris-setosa          11           0                0
Iris-versicolor      0            12               1
Iris-virginica       0            0                6


Observations:

The model classifies every Iris-setosa and Iris-virginica sample in the test set correctly; only one Iris-versicolor sample is misclassified as Iris-virginica.
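To put a number on it, here is a quick accuracy check with sklearn (this snippet is an addition to the notebook above):

from sklearn.metrics import accuracy_score

# 29 of the 30 test samples are classified correctly
print(accuracy_score(y_test, y_pred))  # ~0.97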


Let’s run predictions on the entire data set and print the confusion matrix.

# Let's run prediction for the entire dataset
df = pd.read_csv('iris_all.csv')
X = df.iloc[:, :-1]  # Taking all independent variables
prediction = classifier.predict(X)
df['predicted_class'] = le.inverse_transform(prediction)

# Printing the confusion matrix
pd.crosstab(df['class'], df['predicted_class'])


class \ predicted_class   Iris-setosa  Iris-versicolor  Iris-virginica
Iris-setosa               50           0                0
Iris-versicolor           0            49               1
Iris-virginica            0            0                50


Observations:

Over the full data set, only one Iris-versicolor sample is misclassified as Iris-virginica; the remaining 149 of 150 samples are classified correctly (roughly 99% accuracy).


Last but not least, let’s print the classification report.

# Printing the classification report
import sklearn.metrics as metrics
print(metrics.classification_report(df['class'], df['predicted_class']))


It looks like this:

                 precision  recall  f1-score  support
Iris-setosa      1.00       1.00    1.00      50
Iris-versicolor  1.00       0.98    0.99      50
Iris-virginica   0.98       1.00    0.99      50

avg / total      0.99       0.99    0.99      150


So far we have used XGBoost locally and verified that it solves our problem well.


Now let’s make predictions using the cloud version of XGBoost (the SageMaker built-in algorithm) to build the model.
We will follow these steps.


Step 1: Creating an IPython Notebook
Click on New > conda_python3. It will create a new notebook named Untitled; rename it to iris-classification-cloud.ipynb.


Step 2: Importing Libraries and Iris Dataset from S3 bucket

import boto3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import xgboost as xgb

# Defining utility methods
# Reference: http://boto3.readthedocs.io/en/latest/guide/s3.html
# bucket: Name of bucket
# key: File name stored in S3

def write_to_s3(filename, bucket, key):
    with open(filename,'rb') as f: # Read in binary mode
        return boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(f)


def download_from_s3(filename, bucket, key):
    with open(filename,'wb') as f:
        return boto3.Session().resource('s3').Bucket(bucket).Object(key).download_fileobj(f)


# Downloading file from S3
download_from_s3('iris_all.csv', 'bornshrewd-aws-sagemaker-demo', 'iris_all.csv')

# Reading CSV File
df = pd.read_csv('iris_all.csv')

# Let's see what our data looks like
df.head(2)


Step 3: Data Preprocessing
We need to predict the class of iris (the dependent variable) from sepal_length, sepal_width, petal_length, and petal_width (the independent variables). As we can see, class is a categorical variable with the values Iris-setosa, Iris-versicolor, and Iris-virginica, so we need to convert it to equivalent numerical values.


We will use the preprocessing module from sklearn to achieve this


le = preprocessing.LabelEncoder()
le.fit(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
df['encoded_class'] = le.transform(df['class'])
df.head(2)


  sepal_length  sepal_width  petal_length  petal_width  class        encoded_class
0 5.1           3.5          1.4           0.2          Iris-setosa  0
1 4.9           3.0          1.4           0.2          Iris-setosa  0


Step 4: Splitting Data into Training and Validation Sets and Uploading Them to S3

Splitting the data into Training and Validation Set

columns = ['encoded_class','sepal_length','sepal_width','petal_length','petal_width']

# Randomising the dataset
np.random.seed(5)
l = list(df.index)
np.random.shuffle(l)
df = df.iloc[l]

# Generating the training/validation split (70/30)
rows = df.shape[0]
train = int(.7 * rows)   # Number of training rows; the remainder forms the validation set

# Write the training set to a file without a header
# (label in the first column, as SageMaker's built-in XGBoost expects for CSV)
df[:train].to_csv('iris_train.csv',
                  index=False, header=False,
                  columns=columns)

# Write the validation set to a file without a header
df[train:].to_csv('iris_validation.csv',
                  index=False, header=False,
                  columns=columns)

# Write Column List
with open('iris_train_column_list.txt','w') as f:
    f.write(','.join(columns))


The SageMaker training instances read data from S3, so let’s upload our training and validation data to S3.


bucket_name = 'bornshrewd-aws-sagemaker-demo'
training_file_key = 'iris/iris_train.csv'
validation_file_key = 'iris/iris_validation.csv'

s3_model_output_location = r's3://{0}/iris/model'.format(bucket_name)
s3_training_file_location = r's3://{0}/{1}'.format(bucket_name,training_file_key)
s3_validation_file_location = r's3://{0}/{1}'.format(bucket_name,validation_file_key)

# Uploading data to S3
write_to_s3('iris_train.csv',bucket_name,training_file_key)
write_to_s3('iris_validation.csv',bucket_name,validation_file_key)


After the above step, we can see our data on S3 as shown below,

Data On S3


Step 5: Creating XGBoost Classifier and Fitting the model

The AWS team has packaged its machine learning algorithms as Docker containers, which are stored in a container registry. Each container has a unique entry known as its registry path. We need to provide this registry path to the SageMaker training job to indicate which algorithm to use for training.


Let’s create a SageMaker Estimator

import sagemaker
from sagemaker import get_execution_role

# containers is a dictionary mapping each region to the XGBoost container registry path
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'}

# The role that we gave while launching the notebook instance to grant required permission to the instance
role = get_execution_role()

# Establishing a SageMaker Session
sess = sagemaker.Session()

# Creating the estimator: Reference: http://sagemaker.readthedocs.io/en/latest/estimators.html
# role: passing the role that estimator can assume so that it can access our data files and resources
# train_instance_count: Specifying how many instances to use for distributed training 
# train_instance_type: what type of machine to use
# output_path: specify where the trained model artifacts need to be stored
# base_job_name: Giving a name to the training job

estimator = sagemaker.estimator.Estimator(containers[boto3.Session().region_name],
                                       role, 
                                       train_instance_count=1, 
                                       train_instance_type='ml.m4.xlarge',
                                       output_path=s3_model_output_location,
                                       sagemaker_session=sess,
                                       base_job_name ='xgboost-iris-v1')


# Specifying hyperparameters that are appropriate for the training algorithm
# XGBoost Training Parameter Reference: https://github.com/dmlc/xgboost/blob/master/doc/parameter.md

estimator.set_hyperparameters(max_depth=5,
                              objective="multi:softmax",
                              num_class=3,
                              num_round=50)

estimator.hyperparameters()

# Creating the input channel configurations; the content type can be libsvm or csv for XGBoost
training_input_config = sagemaker.session.s3_input(s3_data=s3_training_file_location,content_type="csv")
validation_input_config = sagemaker.session.s3_input(s3_data=s3_validation_file_location,content_type="csv")

# XGBoost supports "train", "validation" channels
# Reference: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html

# Training the model
estimator.fit({'train':training_input_config, 'validation':validation_input_config})


After we issue the fit command to the estimator, the training process is initiated and we can see a training job in progress on a new compute instance, as shown below. Training Job


When it completes, we can see a completed status as shown below Training Job Completed
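If you prefer to check on the job programmatically rather than through the console, here is a small sketch using boto3’s SageMaker client (the attribute used to read the job name comes from the SageMaker Python SDK; you could equally copy the full job name from the console):

# The full job name is the base_job_name plus a timestamp suffix
job_name = estimator.latest_training_job.job_name

details = boto3.client('sagemaker').describe_training_job(TrainingJobName=job_name)
print(details['TrainingJobStatus'])                   # e.g. 'Completed'
print(details['ModelArtifacts']['S3ModelArtifacts'])  # S3 location of the model artifact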


We can also see that the model artifact is generated and uploaded to the S3 bucket. Model on S3


Step 6: Running Predictions
To make predictions, we need to first deploy the model.


The estimator helps us deploy the model using the following code.

# Ref: http://sagemaker.readthedocs.io/en/latest/estimators.html
# initial_instance_count: Number of compute instance for hosting the model
# instance_type: Type of instance
# endpoint_name: Name of endpoint to be created

# Deploying the model
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge',
                             endpoint_name = 'xgboost-iris-v1')


When we deploy the model, SageMaker launches the instance(s) that we requested and hosts the model on them.


We can see the created model as shown below. Model on Sagemaker


If you click on the model, you can see the model configuration


You can also see the Endpoint Configuration, as shown below. Endpoint Configuration on Sagemaker


And an endpoint creation in progress as shown below, Endpoint on Sagemaker


Once the endpoint is created, you can see the InService status. Endpoint Creation Completed
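As with the training job, the endpoint status can also be checked programmatically; a small optional sketch, assuming the same endpoint name:

status = boto3.client('sagemaker').describe_endpoint(EndpointName='xgboost-iris-v1')['EndpointStatus']
print(status)  # 'InService' once the deployment has finished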


Once the model is deployed, let’s make predictions using the following code

# Run Predictions
from sagemaker.predictor import csv_serializer

predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.deserializer = None

predictor.predict([[4.8,3.4,1.6,0.2],[5.8,2.7,4.1,1.0]])


Output of Prediction

b'0.0,1.0'

So for the first data sample ([4.8,3.4,1.6,0.2]) it predicts 0, which is Iris-setosa, and for the second data sample ([5.8,2.7,4.1,1.0]) it predicts 1, which is Iris-versicolor.
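Since the deserializer is set to None, predict() returns the raw byte string shown above. Here is a small sketch (my own helper, not part of the SageMaker SDK) that maps the response back to class names using the LabelEncoder fitted earlier:

result = predictor.predict([[4.8, 3.4, 1.6, 0.2], [5.8, 2.7, 4.1, 1.0]])

# Decode the comma-separated predictions and map them back to class labels
predicted_indices = [int(float(v)) for v in result.decode('utf-8').split(',')]
print(le.inverse_transform(predicted_indices))
# ['Iris-setosa' 'Iris-versicolor']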


Don’t forget to clean up once you are done making predictions. First select a resource, then choose Actions > Delete.
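The endpoint can also be deleted from the notebook itself; a minimal sketch using the SageMaker session created earlier (this removes the hosted endpoint so it stops incurring charges, while the model artifact remains in S3):

# Delete the hosted endpoint created by estimator.deploy()
sess.delete_endpoint('xgboost-iris-v1')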


You may look at my notebook using this link.


Conclusion
In conclusion, I would like to say that the Amazon SageMaker service can accelerate the delivery of a machine learning project by letting you build an end-to-end pipeline in minimal time.


It also gives us far more control than the Amazon Machine Learning service, which has limited options. Do check it out.
