SliceMatrix-IO

KNN vs PNN Classification: Shuttle Dataset

In addition to powerful manifold learning and network graphing algorithms, the SliceMatrix-IO platform contains several classification algorithms. Classification is one of the foundational tasks of machine learning: given an input data vector, a classifier attempts to guess the correct class label. Today we will look at two supervised classifiers: the K-Nearest Neighbor (KNN) classifier and the Probabilistic Neural Network (PNN) classifier.

The dataset we are going to examine today was originally generated to extract rules for determining the conditions under which an autolanding would be preferable to manual control of a spacecraft.

The task is to decide what type of control of the vessel should be employed.

The shuttle dataset contains 9 attributes, all of which are numerical. There are 7 possible values for the class label:

  • 1: Rad Flow
  • 2: Fpv Close
  • 3: Fpv Open
  • 4: High
  • 5: Bypass
  • 6: Bpv Close
  • 7: Bpv Open
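For readability later on, the numeric labels above can be mapped to their names with a small lookup table. This is only a convenience sketch: the names `SHUTTLE_CLASS_NAMES` and `label_name` are illustrative and are not part of the SliceMatrix-IO client.

```python
# Hypothetical lookup table built from the class list above.
SHUTTLE_CLASS_NAMES = {
    1: "Rad Flow",
    2: "Fpv Close",
    3: "Fpv Open",
    4: "High",
    5: "Bypass",
    6: "Bpv Close",
    7: "Bpv Open",
}


def label_name(label):
    """Return the readable name for a numeric class label."""
    return SHUTTLE_CLASS_NAMES[int(label)]


print(label_name(4))  # High
```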

To tackle this task, we begin by importing the SliceMatrix-IO Python client.

If you haven't installed the client yet, the easiest way is with pip:

pip install slicematrixIO

Next, let's import slicematrixIO and create our client, which will do the heavy lifting. Make sure to replace the API key below with your own key.

Don't have a key yet? Get your api key here

In [1]:
from slicematrixIO import SliceMatrix

api_key = "insert your api key here"
sm = SliceMatrix(api_key)

Next we'll import pandas and numpy:

In [2]:
import pandas as pd
import numpy as np

The dataset is broken into two parts: training and test. We load each part from its own file:

In [3]:
training_data = pd.read_csv("notebook_files/shuttle.trn",index_col = 0)
testing_data  = pd.read_csv("notebook_files/shuttle.tst",index_col = 0)
In [4]:
training_data.head()
Out[4]:
     1   2  3   4    5   6   7   8  class
0
50  21  77  0  28    0  27  48  22      2
55   0  92  0   0   26  36  92  56      4
53   0  82  0  52   -5  29  30   2      1
37   0  76  0  28   18  40  48   8      1
37   0  79  0  34  -26  43  46   2      1

Note: this dataset can be obtained from the UCI Machine Learning Repository.
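To make the layout of the loaded frame concrete, here is a small sketch that rebuilds the five rows from Out[4] by hand, so it runs without the data files. Because the files are read with `index_col = 0`, the first column becomes the row index, which leaves eight feature columns plus `class`. The variable names `sample`, `features` and `labels` are illustrative only.

```python
import pandas as pd

# The five rows shown in Out[4], typed in by hand for illustration.
sample = pd.DataFrame(
    [
        [21, 77, 0, 28, 0, 27, 48, 22, 2],
        [0, 92, 0, 0, 26, 36, 92, 56, 4],
        [0, 82, 0, 52, -5, 29, 30, 2, 1],
        [0, 76, 0, 28, 18, 40, 48, 8, 1],
        [0, 79, 0, 34, -26, 43, 46, 2, 1],
    ],
    index=pd.Index([50, 55, 53, 37, 37], name="0"),
    columns=["1", "2", "3", "4", "5", "6", "7", "8", "class"],
)

features = sample.drop("class", axis=1)  # the numeric feature columns
labels = sample["class"]                 # the target column

print(features.shape)                                # (5, 8)
print(labels.value_counts().sort_index().to_dict())  # {1: 3, 2: 1, 4: 1}
```

The same two lines (`drop("class", axis=1)` and `["class"]`) should apply unchanged to `training_data` and `testing_data`.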

Now we can train our two classifiers. While these algorithms have different architectures, both operate on the principle of looking at the similarity between training data points and new input vectors. Which one will triumph in this case?
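To illustrate the similarity idea, here is a minimal local sketch of both approaches in plain numpy. It is not the SliceMatrix-IO implementation, whose internals are not described here: it assumes Euclidean distance with a majority vote for KNN, and a Gaussian kernel averaged per class for the PNN. The toy data is made up.

```python
import numpy as np


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x (Euclidean)."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]


def pnn_predict(train_X, train_y, x, sigma=1.0):
    """Pick the class with the largest average Gaussian kernel response to x."""
    sq_dists = np.sum((train_X - x) ** 2, axis=1)
    kernel = np.exp(-sq_dists / (2.0 * sigma ** 2))
    classes = np.unique(train_y)
    scores = [kernel[train_y == c].mean() for c in classes]
    return classes[int(np.argmax(scores))]


# Toy data (made up for illustration): two well-separated clusters.
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
train_y = np.array([1, 1, 1, 2, 2, 2])

query = np.array([4.8, 5.1])
print(knn_predict(train_X, train_y, query, k=3))        # 2
print(pnn_predict(train_X, train_y, query, sigma=1.0))  # 2
```

In this sketch KNN only looks at the closest `k` points, while the PNN weighs every training point by its distance, so the width `sigma` has to be chosen relative to the scale of the features.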

In [5]:
knn = sm.KNNClassifier(K = 10, dataset = training_data, class_column = "class")
In [6]:
pnn = sm.PNNClassifier(sigma = 0.12,  dataset = training_data, class_column = "class")

After these two calls we have two fully trained models living in the cloud that we can use to predict the class of new data points. To test how good the models are, we will now use the testing dataset to predict some test classes. Then we can compare the predictions against the ground truth and see how the models perform against each other.

SliceMatrix-IO was designed for both online and batch processing. The code below uses a single loop to make the predictions in chunks of 1000 rows, but since this is a validation task (i.e. off-line / historical) we could easily make this parallel using Python's multiprocessing (mp) package. That would result in big speed gains, but unfortunately the way I use mp doesn't seem to play nice with the notebook...

In [7]:
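testing_predictions = {'knn':[], 'pnn':[]}
chunk = 1000   # number of test rows sent per prediction call
cindex = 0
# feature rows only; the class column is held back as ground truth
test_features = testing_data.drop("class", axis = 1).values
while cindex < testing_data.shape[0]:
    eindex = cindex + chunk
    # progress report; this form prints the same text on Python 2 and 3
    print("{0} {1}".format(cindex, eindex))
    c_features = test_features[cindex:eindex].tolist()
    knn_preds = knn.predict(c_features)
    pnn_preds = pnn.predict(c_features)
    # keep predictions in the same order as the test rows
    testing_predictions['knn'].extend(knn_preds)
    testing_predictions['pnn'].extend(pnn_preds)
    cindex += chunk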
0 1000
1000 2000
2000 3000
3000 4000
4000 5000
5000 6000
6000 7000
7000 8000
8000 9000
9000 10000
10000 11000
11000 12000
12000 13000
13000 14000
14000 15000
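For reference, the chunked loop above could be parallelized roughly as follows. This is a sketch under assumptions: it uses the thread pool from the multiprocessing package (`multiprocessing.pool.ThreadPool`) rather than separate processes, on the assumption that each prediction call is a network request and that the client's `predict` method can safely be called from several threads at once. `chunked`, `predict_parallel` and `fake_predict` are hypothetical names, and `fake_predict` merely stands in for `knn.predict` or `pnn.predict` so the example runs offline.

```python
from multiprocessing.pool import ThreadPool


def chunked(rows, size):
    """Split a list of rows into consecutive chunks of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def predict_parallel(predict_fn, rows, chunk=1000, workers=4):
    """Call predict_fn on each chunk concurrently; return predictions in row order."""
    pool = ThreadPool(workers)
    try:
        # map() preserves chunk order, so predictions line up with the input rows
        results = pool.map(predict_fn, chunked(rows, chunk))
    finally:
        pool.close()
        pool.join()
    return [pred for chunk_preds in results for pred in chunk_preds]


# Stand-in for knn.predict / pnn.predict (hypothetical): echoes the first feature.
def fake_predict(rows):
    return [row[0] for row in rows]


rows = [[i, i + 1] for i in range(10)]
print(predict_parallel(fake_predict, rows, chunk=3, workers=2))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With the real models, the call would presumably look like `predict_parallel(knn.predict, testing_data.drop("class", axis = 1).values.tolist())`.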

Finally, let's compare the models' predictions to the ground truth:

In [8]:
# fraction (0 to 1) of test rows where the prediction matches the true class
knn_correct = 1. * np.sum(np.equal(testing_predictions['knn'], testing_data['class'])) / len(testing_predictions['knn'])
print("knn % correct =  {0}".format(knn_correct))

pnn_correct = 1. * np.sum(np.equal(testing_predictions['pnn'], testing_data['class'])) / len(testing_predictions['pnn'])
print("pnn % correct =  {0}".format(pnn_correct))
knn % correct =  0.99924137931
pnn % correct =  0.868827586207
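A single accuracy figure hides which classes a model gets wrong. One way to break the comparison down by class is a confusion table built with pandas. The helper name `per_class_accuracy` is illustrative, and the labels in the example are made up to keep the sketch self-contained; they are not results from the shuttle data.

```python
import pandas as pd


def per_class_accuracy(y_true, y_pred):
    """Return the confusion table and the fraction of correct predictions per true class."""
    # Rebuild both as fresh Series so they share a simple 0..n-1 index
    y_true = pd.Series(list(y_true), name="actual")
    y_pred = pd.Series(list(y_pred), name="predicted")
    confusion = pd.crosstab(y_true, y_pred)
    accuracy = (y_true == y_pred).groupby(y_true).mean()
    return confusion, accuracy


# Made-up labels purely for illustration.
actual = [1, 1, 1, 4, 4, 5]
predicted = [1, 1, 4, 4, 4, 1]

confusion, accuracy = per_class_accuracy(actual, predicted)
print(confusion)
print(accuracy.to_dict())
```

With the notebook's variables, this could be called as `per_class_accuracy(testing_data['class'], testing_predictions['knn'])` and again with `'pnn'`.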

Note: to run these examples you'll need an api key. Don't have a SliceMatrix-IO api key yet? Get your api key here
