In addition to powerful manifold learning and network graphing algorithms, the SliceMatrix-IO platform contains several classification algorithms. Classification is one of the foundational tasks of machine learning: given an input data vector, a classifier attempts to guess the correct class label. Today we will look at two supervised classifiers: the K-Nearest Neighbor Classifier and the Probabilistic Neural Network Classifier.
The dataset we are going to examine today was originally generated to extract rules for determining the conditions under which an autolanding would be preferable to manual control of a spacecraft.
The task is to decide what type of control of the vessel should be employed.
The shuttle dataset contains 9 attributes, all of which are numerical. There are 7 possible values for the class label.
To do this, we begin by importing the SliceMatrix-IO Python client.
If you haven't installed the client yet, the easiest way is with pip:
pip install slicematrixIO
Next, let's import slicematrixIO and create our client, which will do the heavy lifting. Make sure to replace the API key below with your own key.
Don't have a key yet? Get your api key here
from slicematrixIO import SliceMatrix
api_key = "insert your api key here"
sm = SliceMatrix(api_key)
Next we'll import pandas and numpy
import pandas as pd
import numpy as np
The dataset is broken into two parts: training and test. The training data is used to fit the models, while the test data is held back so we can evaluate how well the models generalize to points they have never seen.
training_data = pd.read_csv("notebook_files/shuttle.trn",index_col = 0)
testing_data = pd.read_csv("notebook_files/shuttle.tst",index_col = 0)
training_data.head()
Note: this dataset can be obtained from the UCI Machine Learning Repo
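Before training, it's worth a quick look at how the class labels are distributed, since a heavily skewed label column changes how we should read raw accuracy numbers. A minimal sketch with pandas, using a small hypothetical frame in place of the real training_data (the real one has the same "class" column):

```python
import pandas as pd

# Hypothetical stand-in for training_data, with the same "class" column layout
training_data = pd.DataFrame({
    "f1": [0.1, 0.4, 0.2, 0.9, 0.7, 0.3],
    "f2": [1.0, 0.8, 0.9, 0.1, 0.2, 0.95],
    "class": [1, 1, 1, 4, 4, 1],
})

# Count how many rows carry each class label
counts = training_data["class"].value_counts()
print(counts)
```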
Now we can train our two classifiers. While these algorithms have different architectures, both operate on the principle of looking at the similarity between training data points and new input vectors. Which one will triumph in this case?
knn = sm.KNNClassifier(K = 10, dataset = training_data, class_column = "class")
pnn = sm.PNNClassifier(sigma = 0.12, dataset = training_data, class_column = "class")
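To build intuition for what these two models are doing, both can be sketched locally in a few lines of numpy. This is an illustrative sketch only, not the SliceMatrix-IO implementation: the KNN part takes a majority vote among the K nearest training points, while the PNN part sums a Gaussian kernel of width sigma over each class and picks the class with the largest total activation.

```python
import numpy as np

def knn_predict(X_train, y_train, x, K=3):
    """Majority vote among the K nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest_labels = y_train[np.argsort(dists)[:K]]
    labels, votes = np.unique(nearest_labels, return_counts=True)
    return labels[np.argmax(votes)]

def pnn_predict(X_train, y_train, x, sigma=0.5):
    """Sum a Gaussian kernel over each class; return the class with the largest total."""
    dists2 = np.sum((X_train - x) ** 2, axis=1)
    kernel = np.exp(-dists2 / (2.0 * sigma ** 2))
    classes = np.unique(y_train)
    scores = [kernel[y_train == c].sum() for c in classes]
    return classes[np.argmax(scores)]

# Toy 2-class data: one cluster near the origin, one near (5, 5)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
y = np.array([1, 1, 1, 2, 2, 2])

print(knn_predict(X, y, np.array([0.3, 0.3])))  # point near the first cluster
print(pnn_predict(X, y, np.array([4.8, 5.0])))  # point near the second cluster
```

Note how K and sigma play analogous roles: both control how local the decision is, with small values making the classifier more sensitive to individual training points.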
With those two calls we have two fully trained models living in the cloud that we can use to predict the class of new data points. In order to test how good the models are, however, we will now use the testing dataset to predict some test classes. Then we can compare the predictions against the ground truth and see how well the models perform against each other.
SliceMatrix-IO was designed for both online and batch processing. The code below uses a single loop to make the predictions, but since this is a validation task (i.e. off-line / historical) we could easily parallelize it using Python's multiprocessing package. That would yield big speed gains, but unfortunately the way I use multiprocessing doesn't seem to play nicely with the notebook.
testing_predictions = {'knn':[], 'pnn':[]}
chunk = 1000
cindex = 0
while cindex < testing_data.shape[0]:
    eindex = cindex + chunk
    print(cindex, eindex)
    # Drop the class column, then take the current chunk of feature rows
    c_features = testing_data.drop("class", axis = 1).values[cindex:eindex].tolist()
    knn_preds = knn.predict(c_features)
    pnn_preds = pnn.predict(c_features)
    testing_predictions['knn'].extend(knn_preds)
    testing_predictions['pnn'].extend(pnn_preds)
    cindex += chunk
Finally, let's compare the models' predictions to the ground truth:
knn_correct = 1. * np.sum(np.equal(testing_predictions['knn'], testing_data['class'])) / len(testing_predictions['knn'])
print("knn % correct = ", knn_correct)
pnn_correct = 1. * np.sum(np.equal(testing_predictions['pnn'], testing_data['class'])) / len(testing_predictions['pnn'])
print("pnn % correct = ", pnn_correct)
Note: to run these examples you'll need an API key. Don't have a SliceMatrix-IO API key yet? Get your API key here