SliceMatrix-IO

KNN vs PNN Classification: Breast Cancer Image Dataset

In addition to powerful manifold learning and network graphing algorithms, the SliceMatrix-IO platform contains several classification algorithms. Classification is one of the foundational tasks of machine learning: given an input data vector, a classifier attempts to predict the correct class label. Today we will look at two supervised classifiers: the K-Nearest Neighbor (KNN) classifier and the Probabilistic Neural Network (PNN) classifier.

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Attributes:

  1. Sample code number: id number
  2. Clump Thickness: 1 - 10
  3. Uniformity of Cell Size: 1 - 10
  4. Uniformity of Cell Shape: 1 - 10
  5. Marginal Adhesion: 1 - 10
  6. Single Epithelial Cell Size: 1 - 10
  7. Bare Nuclei: 1 - 10
  8. Bland Chromatin: 1 - 10
  9. Normal Nucleoli: 1 - 10
  10. Mitoses: 1 - 10
  11. Class: (2 for benign, 4 for malignant)

The goal in this example is to use the input features to train a machine learning model that predicts whether each sample is benign or malignant. We will then validate the predictive power of the model using out-of-sample data.

To do this, we begin by importing the SliceMatrix-IO Python client.

If you haven't installed the client yet, the easiest way is with pip:

pip install slicematrixIO

Next, let's import slicematrixIO and create our client, which will do the heavy lifting. Make sure to replace the API key below with your own key.

Don't have a key yet? Get your api key here

In [1]:
from slicematrixIO import SliceMatrix

api_key = "insert your api key here"
sm = SliceMatrix(api_key)

Let's also import some useful libraries:

In [2]:
import pandas as pd
import numpy as np
np.random.seed(98765)  # fixed seed for reproducibility

Let's load the full dataset...

In [3]:
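# The file has no header row; the first column (sample id) becomes the index
training_data = pd.read_csv("notebook_files/breast-cancer-wisconsin.data", index_col = 0, header = None)
# Drop rows where Bare Nuclei (column label 6) is missing, marked '?' in the file
training_data = training_data[training_data[6] != '?']
# Renumber the remaining rows 0..n-1
training_data.index = np.arange(0, training_data.shape[0], 1)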
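
An alternative way to handle the '?' placeholders, sketched here on a few made-up rows rather than the real file, is to let pandas treat them as missing values at read time through the na_values parameter of read_csv. The affected column is then parsed as numeric instead of text. The toy rows and the names toy_csv, toy, and clean are assumptions for illustration only.

```python
import io
import pandas as pd

# Made-up rows in the same layout as breast-cancer-wisconsin.data
# (id, nine features scored 1-10, class); they are NOT taken from the real file.
toy_csv = io.StringIO(
    u"101,5,1,1,1,2,1,3,1,1,2\n"
    u"102,8,7,5,10,7,?,5,5,4,4\n"
    u"103,3,1,1,1,2,2,3,1,1,2\n"
)

# na_values="?" turns the placeholder into NaN, so column 6 is parsed as numeric
toy = pd.read_csv(toy_csv, index_col=0, header=None, na_values="?")

# Drop incomplete rows, renumber from 0, and cast everything back to integers
clean = toy.dropna().reset_index(drop=True).astype(int)
```

Whether numeric or string feature values matter to the SliceMatrix-IO client is not stated in this notebook, so treat this as an optional tidy-up rather than a required step.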

...then shuffle and split the data into training and testing sets...

In [4]:
shuffled = training_data.reindex(np.random.permutation(training_data.index))
shuffled.index = np.arange(0, training_data.shape[0], 1)
cols = shuffled.columns.values.tolist()
cols[-1] = "class"
shuffled.columns = cols
In [5]:
half = training_data.shape[0] // 2
# Label-based slices include both end points, so row `half` appears in both sets
data = shuffled.loc[0:half, :]
out  = shuffled.loc[half:, :]
In [6]:
data.head()
Out[6]:
   1  2  3  4  5  6  7  8  9  class
0  5  3  4  3  4  5  4  7  1      2
1  1  1  1  1  2  1  3  1  1      2
2  5  2  2  2  2  1  1  1  2      2
3  3  2  2  1  4  3  2  1  1      2
4  1  1  1  1  2  1  2  1  1      2
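
One detail of the split in cell In [5] is worth spelling out: label-based slicing on a DataFrame includes both end points, so the boundary row falls into both halves. A minimal sketch on a toy frame (the frame and the names toy, half, train, test, and train_inclusive are illustrative assumptions) shows how position-based .iloc slicing gives strictly disjoint sets:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the shuffled frame: 7 rows, one feature column and a class column
toy = pd.DataFrame({"feature": np.arange(7), "class": [2, 4, 2, 2, 4, 2, 4]})

half = toy.shape[0] // 2  # integer division, valid on Python 2 and 3

# .iloc slices by position and excludes the end point, so the halves cannot overlap
train = toy.iloc[:half, :]
test = toy.iloc[half:, :]

# By contrast, label-based slicing includes the end label
train_inclusive = toy.loc[0:half, :]
```

Here train holds rows 0 to 2 and test holds rows 3 to 6, while train_inclusive also picks up row 3. Switching the notebook to a disjoint split would change the accuracy figures reported further down by a small amount.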

Note: This dataset can be found here. Original paper here.

In this example we'll use SliceMatrix-IO to train a K-Nearest Neighbor classifier using the training data:

In [7]:
knn = sm.KNNClassifier(dataset = data, class_column = "class")
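
For intuition about what a K-Nearest Neighbor classifier does, here is a minimal NumPy sketch of the general technique. It is not the SliceMatrix-IO implementation: the function name knn_predict, the choice of Euclidean distance, the default k=3, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def knn_predict(train_X, train_y, query_X, k=3):
    """Classify each query row by majority vote among its k nearest training rows."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    query_X = np.asarray(query_X, dtype=float)
    preds = []
    for q in query_X:
        # Euclidean distance from the query to every training row
        dists = np.sqrt(((train_X - q) ** 2).sum(axis=1))
        # Labels of the k closest training rows
        nearest = train_y[np.argsort(dists)[:k]]
        # Majority vote; ties go to the smallest label
        labels, counts = np.unique(nearest, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Toy data: two features, classes coded 2 (benign) and 4 (malignant) as in the dataset
toy_X = [[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [9, 9]]
toy_y = [2, 2, 2, 4, 4, 4]
toy_preds = knn_predict(toy_X, toy_y, [[1.5, 1.5], [8.5, 8.5]], k=3)
```

The first query sits among the class 2 points and the second among the class 4 points, so toy_preds comes out as one label of each kind.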

Now we can make predictions using the out-of-sample (testing / validation) data:

In [8]:
validation_preds = knn.predict(out.drop("class",axis =1).values.tolist())

Finally we can calculate the percentage of out of sample predictions which were correct...

In [9]:
pct_correct = 1. * np.sum(np.equal(validation_preds, out['class'])) / len(validation_preds)
print(pct_correct)
0.976608187135
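
A single accuracy number does not show which kind of mistake the classifier makes. As a hedged sketch, a confusion matrix built with pandas crosstab separates malignant cases labeled benign from benign cases labeled malignant. The labels below are made up for illustration and are not the notebook's predictions; the names actual, predicted, confusion, missed_malignant, and sensitivity are likewise assumptions.

```python
import pandas as pd

# Made-up labels for illustration only: 2 = benign, 4 = malignant
actual = pd.Series([2, 2, 4, 4, 2, 4, 2, 2], name="actual")
predicted = pd.Series([2, 2, 4, 2, 2, 4, 4, 2], name="predicted")

# Rows are true classes, columns are predicted classes
confusion = pd.crosstab(actual, predicted)

# Malignant cases the classifier labeled benign (false negatives)
missed_malignant = confusion.loc[4, 2]

# Share of malignant cases that were caught (sensitivity / recall)
sensitivity = confusion.loc[4, 4] / float(confusion.loc[4].sum())
```

With the notebook's own objects, the same call would take out['class'] and validation_preds in place of the toy series.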

We can also compare performance against another classifier, the Probabilistic Neural Network:

In [10]:
pnn = sm.PNNClassifier(dataset = data, class_column = "class")
In [11]:
validation_preds2 = pnn.predict(out.drop("class",axis =1).values.tolist())
In [12]:
pct_correct2 = 1. * np.sum(np.equal(validation_preds2, out['class'])) / len(validation_preds2)
print(pct_correct2)
0.964912280702
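
For intuition about the Probabilistic Neural Network, the sketch below follows the textbook formulation: a Gaussian kernel is centered on every training row, the kernel responses are averaged per class, and the class with the largest average wins. It is not the SliceMatrix-IO implementation; the function name pnn_predict, the smoothing width sigma=1.0, the implicit assumption of equal class priors, and the toy data are all illustrative.

```python
import numpy as np

def pnn_predict(train_X, train_y, query_X, sigma=1.0):
    """Pick the class with the highest average Gaussian kernel response."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    query_X = np.asarray(query_X, dtype=float)
    classes = np.unique(train_y)
    preds = []
    for q in query_X:
        # Pattern layer: one Gaussian kernel per training row
        sq_dists = ((train_X - q) ** 2).sum(axis=1)
        kernels = np.exp(-sq_dists / (2.0 * sigma ** 2))
        # Summation layer: average kernel response within each class
        scores = [kernels[train_y == c].mean() for c in classes]
        # Decision layer: class with the largest score
        preds.append(classes[int(np.argmax(scores))])
    return np.array(preds)

# Toy data: two features, classes coded 2 (benign) and 4 (malignant) as in the dataset
pnn_X = [[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [9, 9]]
pnn_y = [2, 2, 2, 4, 4, 4]
pnn_preds = pnn_predict(pnn_X, pnn_y, [[1.5, 1.5], [8.5, 8.5]], sigma=1.0)
```

Unlike KNN, every training row contributes to every prediction, with its influence decaying smoothly with distance; sigma controls how quickly that influence falls off.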

Don't have a SliceMatrix-IO api key yet? Get your api key here