slicematrixIO package

Submodules

slicematrixIO.bayesian_filters module

class slicematrixIO.bayesian_filters.KalmanOLS(dataset=None, name=None, pipeline=None, init_alpha=None, init_beta=None, trans_cov=None, obs_cov=None, init_cov=None, optimizations=[], client=None)

Train / Reload a Kalman Filter model for online estimation of the parameters of Ordinary Least Squares (KalmanOLS)

Parameters:

dataset: pandas.DataFrame

Input DataFrame. shape = (nrows, 2) where the first column is Y and the second is X in OLS model

init_alpha : float, optional

Initial value for alpha in OLS model (ignored if optimizations are enabled)

init_beta : float, optional

Initial value for beta in OLS model (ignored if optimizations are enabled)

trans_cov : array-like, optional

Transition covariance, shape = (2, 2)

init_cov : array-like, optional

Initial covariance, shape = (2, 2)

optimizations : list, optional

List of optimizations. Can include multiple optimizations. Default includes all:

  • ‘transition_covariance’
  • ‘observation_covariance’
  • ‘initial_state_mean’
  • ‘initial_state_covariance’

name : string, optional

The desired name of the model. If None then a random name will be generated

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

pipeline : BasePipeline, optional

Pipeline to use. Defaults to None. If None then a pipeline will be created for use in creating the model

Returns:

model : KalmanOLS

Trained Kalman Filter model

Examples

Create a KalmanOLS model for a given dataset

>>> sm = SliceMatrix(api_key)
>>> kf = sm.KalmanOLS(dataset = dataframe)

Get the current internal state of the model (i.e. current alpha and beta and covariance)

>>> kf.getState()

Update the model with new information and get the updated state

>>> kf.update(X = 128.17, Y = 45.85)

Methods

getState()

Get the current internal state of the Kalman Filter OLS model

Returns:

state : dict

Dictionary with the current state of the model, i.e. the means (Beta and Alpha, respectively) and the covariance

getTrainingData()

Get the historical state of the model over time

Returns:

history : dict

Historical state of both mean and covariance of the model over time.

update(X, Y)

Step the model through a new learning iteration with new datapoints for input (X) and output (Y)

This will permanently change the state of the model as it adjusts to new information.

In a distributed setting, updates to the same KalmanOLS model are not guaranteed to be atomic

Parameters:

X : float

The newly observed value for the input of the OLS model (X)

Y : float

The newly observed value for the output of the OLS model (Y)

Returns:

state : dict

Dictionary with the current state of the model, i.e. the means (Beta and Alpha, respectively) and the covariance
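
As a local illustration of what one update(X, Y) step could look like, here is a minimal sketch of a two-state Kalman filter for y = alpha + beta * x. It is not the SliceMatrix-IO implementation: the function name kalman_ols_step, the [alpha, beta] state ordering, the random-walk transition model, and the scalar obs_var are all assumptions made for this example.

```python
def kalman_ols_step(mean, cov, x, y, trans_cov, obs_var):
    """Advance a 2-state Kalman filter for y = alpha + beta * x by one observation.

    mean is [alpha, beta]; cov and trans_cov are symmetric 2x2 nested lists;
    obs_var is the scalar observation noise variance (all hypothetical names).
    """
    # Predict: the coefficients follow a random walk, so the mean is unchanged
    # and the covariance grows by the transition covariance.
    p = [[cov[i][j] + trans_cov[i][j] for j in range(2)] for i in range(2)]

    # Observation row h = [1, x], so the predicted output is alpha + beta * x.
    h = [1.0, x]
    residual = y - (h[0] * mean[0] + h[1] * mean[1])

    # ph = P h; s is the innovation variance h' P h + obs_var.
    ph = [p[i][0] * h[0] + p[i][1] * h[1] for i in range(2)]
    s = h[0] * ph[0] + h[1] * ph[1] + obs_var

    # Kalman gain, then the corrected mean and covariance.
    gain = [ph[i] / s for i in range(2)]
    new_mean = [mean[i] + gain[i] * residual for i in range(2)]
    new_cov = [[p[i][j] - gain[i] * ph[j] for j in range(2)] for i in range(2)]
    return new_mean, new_cov


# Feed noiseless points from y = 2 + 3x; the estimate moves toward [2, 3].
mean, cov = [0.0, 0.0], [[1000.0, 0.0], [0.0, 1000.0]]
no_drift = [[0.0, 0.0], [0.0, 0.0]]
for x in range(10):
    mean, cov = kalman_ols_step(mean, cov, float(x), 2.0 + 3.0 * x, no_drift, 1e-3)
```

Each call returns a new mean and covariance and the previous state is discarded, which mirrors the note above that an update permanently changes the model state.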

class slicematrixIO.bayesian_filters.KalmanOLSPipeline(name, init_alpha=None, init_beta=None, trans_cov=None, obs_cov=None, init_cov=None, optimizations=[], client=None)

Bases: slicematrixIO.core.BasePipeline

Create a Pipeline for training KalmanOLS models from input datasets

Parameters:

name : string

The desired name of the Pipeline

init_alpha : float, optional

Initial value for alpha in OLS model (ignored if optimizations are enabled)

init_beta : float, optional

Initial value for beta in OLS model (ignored if optimizations are enabled)

trans_cov : array-like, optional

Transition covariance, shape = (2, 2)

init_cov : array-like, optional

Initial covariance, shape = (2, 2)

optimizations : list, optional

List of optimizations. Can include multiple optimizations. Default includes all:

  • ‘transition_covariance’
  • ‘observation_covariance’
  • ‘initial_state_mean’
  • ‘initial_state_covariance’

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

response : dict

success or failure response to Pipeline creation request

Examples

Create a KalmanOLSPipeline for processing multiple datasets

>>> io = ConnectIO(api_key)
>>> pipe = KalmanOLSPipeline(client = io)
>>> for dataframe in dataframes:
...     current_model = pipe.run(dataset = dataframe, name = slicematrixIO.utils.rando_name())

Methods

run(dataset, model)

Run the Pipeline and create a new KalmanOLS model

Parameters:

dataset : pandas.DataFrame

The dataset to pass into the Pipeline which will train a KalmanOLS model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.

Returns:

response : dict

success or failure response to model creation request

slicematrixIO.classifiers module

Classifier models are examples of supervised machine learning techniques which aim to predict the class label of a given input datapoint

class slicematrixIO.classifiers.KNNClassifier(dataset=None, class_column=None, name=None, pipeline=None, K=5, kernel='euclidean', algo='auto', weights='uniform', kernel_params={}, client=None)

Train / Reload a KNNClassifier model

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels

name : string

The desired name of the model.

class_column : string

The name of the column in the input dataset which describes the class labels

K : integer, optional

The desired K in the Nearest Neighbor classifier model

kernel : string [ ‘euclidean’ | ‘minkowski’ | ‘hammond’ | ‘etc...’], optional

The desired kernel for defining distance in our classifier. Default is ‘euclidean’

algo : string [‘auto’ | ‘ball’ | ‘kd_tree’ | ‘brute’], optional

The algorithm to use in determining Nearest Neighbors. Default is ‘auto’

weights : string [‘uniform’ | ‘weighted’], optional

Whether voting is uniform (i.e. independent of distance) or weighted by distance (i.e. closer neighbors have higher-weighted votes)

kernel_params : dict, optional

Any parameters specific to the chosen kernel

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : KNNClassifier

KNNClassifier model object

Examples

Create a KNNClassifier model for a given dataset

>>> sm = SliceMatrix(api_key)
>>> knn = sm.KNNClassifier(dataset = dataframe, K = 5)

Predict the class of some new data

>>> knn.predict([...])
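
To make the K and weights parameters concrete, the following is a minimal local sketch of K Nearest Neighbors voting with Euclidean distance. It runs without the service and is not the SliceMatrix-IO implementation; the helper name knn_vote and the inverse-distance weighting used for 'weighted' are assumptions for illustration.

```python
import math
from collections import defaultdict


def knn_vote(train_points, train_labels, query, k=5, weights="uniform"):
    """Return the majority label among the k nearest training points."""
    # Sort training points by Euclidean distance to the query.
    nearest = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )[:k]

    votes = defaultdict(float)
    for distance, label in nearest:
        if weights == "uniform":
            votes[label] += 1.0  # every neighbor counts equally
        else:
            # Assumed weighting: inverse distance, so closer neighbors count more.
            votes[label] += 1.0 / (distance + 1e-12)
    return max(votes, key=votes.get)


points = [(0.0, 0.0), (3.0, 0.0), (3.1, 0.0)]
labels = ["a", "b", "b"]
uniform_label = knn_vote(points, labels, (1.0, 0.0), k=3, weights="uniform")
weighted_label = knn_vote(points, labels, (1.0, 0.0), k=3, weights="weighted")
```

In this toy dataset the two settings disagree: the two 'b' points outvote the single 'a' point under uniform voting, while the much closer 'a' point wins under distance weighting.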

Methods

predict(point)

Predict the class of new input datapoints

Parameters:

point : list

A list of new datapoints. Shape = (n_points, n_features)

Returns:

prediction : list

A list of new predictions for each input datapoint. Shape = (n_points, 1)

score()

Get the training prediction R^2

Returns:

r2 : float

The R^2 of the training predictions

training_data()

Get the input data used to train the model

Returns:

data : list

The training data

training_preds()

Get the training predictions

Returns:

prediction : list

A list of the training predictions

class slicematrixIO.classifiers.KNNClassifierPipeline(name, K=5, kernel='euclidean', algo='auto', weights='uniform', kernel_params={}, client=None)

Bases: slicematrixIO.core.BasePipeline

Create a Pipeline for training KNNClassifier models from input datasets

Parameters:

name : string

The desired name of the Pipeline.

K : integer, optional

The desired K in the Nearest Neighbor classifier model

kernel : string [ ‘euclidean’ | ‘minkowski’ | ‘hammond’ | ‘etc...’], optional

The desired kernel for defining distance in our classifier. Default is ‘euclidean’

algo : string [‘auto’ | ‘ball’ | ‘kd_tree’ | ‘brute’], optional

The algorithm to use in determining Nearest Neighbors. Default is ‘auto’

weights : string [‘uniform’ | ‘weighted’], optional

Whether voting is uniform (i.e. independent of distance) or weighted by distance (i.e. closer neighbors have higher-weighted votes)

kernel_params : dict, optional

Any parameters specific to the chosen kernel

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

response : dict

success or failure response to Pipeline creation request

Examples

Create a Pipeline for training multiple KNNClassifier models

>>> io = ConnectIO(api_key)
>>> pipe = KNNClassifierPipeline(K = 7, client = io)
>>> for dataframe in dataframes:
...     current_model = pipe.run(dataset = dataframe, name = slicematrixIO.utils.rando_name())

Methods

run(dataset, model, class_column)

Run the Pipeline and create a new KNNClassifier model

Parameters:

dataset : pandas.DataFrame

The dataset to pass into the Pipeline which will train a KNNClassifier model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.

Returns:

response : dict

success or failure response to model creation request

class slicematrixIO.classifiers.PNNClassifier(dataset, class_column, name=None, pipeline=None, sigma=0.1, client=None)

Train / Reload a PNNClassifier model

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels

name : string

The desired name of the model.

class_column : string

The name of the column in the input dataset which describes the class labels

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : PNNClassifier

PNNClassifier model object

Examples

Create a PNNClassifier model for a given dataset

>>> sm = SliceMatrix(api_key)
>>> pnn = sm.PNNClassifier(dataset = dataframe, sigma = 0.12)

Predict the class of some new data

>>> pnn.predict([...])
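
The role of sigma can be illustrated with a small local sketch of a Probabilistic Neural Network: each class is scored by the average Gaussian kernel between the query and that class's training points, and the highest-scoring class wins. This is a textbook-style sketch, not the SliceMatrix-IO implementation; the helper name pnn_predict and the exact kernel form are assumptions.

```python
import math


def pnn_predict(train_points, train_labels, query, sigma=0.1):
    """Return the class with the highest average Gaussian kernel response."""
    sums, counts = {}, {}
    for point, label in zip(train_points, train_labels):
        # Gaussian kernel: smaller sigma makes each training point more local.
        response = math.exp(-math.dist(point, query) ** 2 / (2.0 * sigma ** 2))
        sums[label] = sums.get(label, 0.0) + response
        counts[label] = counts.get(label, 0) + 1
    # Average per class so that larger classes do not win by size alone.
    return max(sums, key=lambda label: sums[label] / counts[label])


points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0)]
labels = ["a", "a", "b", "b"]
near_a = pnn_predict(points, labels, (0.2, 0.0), sigma=0.1)
near_b = pnn_predict(points, labels, (0.9, 0.0), sigma=0.1)
```

A larger sigma smooths the class densities and lets distant training points influence the prediction; a smaller sigma makes the decision depend almost entirely on the closest points.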

Methods

predict(point)

Predict the class of new input datapoints

Parameters:

point : list

A list of new datapoints. Shape = (n_points, n_features)

Returns:

prediction : list

A list of new predictions for each input datapoint. Shape = (n_points, 1)

score()

Get the training prediction R^2

Returns:

r2 : float

The R^2 of the training predictions

training_data()

Get the input data used to train the model

Returns:

data : list

The training data

training_preds()

Get the training predictions

Returns:

prediction : list

A list of the training predictions

class slicematrixIO.classifiers.PNNClassifierPipeline(name, sigma=0.1, client=None)

Bases: slicematrixIO.core.BasePipeline

Create a Pipeline for training PNNClassifier models from input datasets

Parameters:

name : string

The desired name of the Pipeline.

sigma : float in (0., 1.), optional

The desired smoothing parameter for the PNN model. Default is 0.1

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

response : dict

success or failure response to Pipeline creation request

Examples

Create a Pipeline for training multiple PNNClassifier models

>>> io = ConnectIO(api_key)
>>> pipe = PNNClassifierPipeline(sigma = 0.05, client = io)
>>> for dataframe in dataframes:
...     current_model = pipe.run(dataset = dataframe, name = slicematrixIO.utils.rando_name())

Methods

run(dataset, model, class_column)

Run the Pipeline and create a new PNNClassifier model

Parameters:

dataset : pandas.DataFrame

The dataset to pass into the Pipeline which will train a PNNClassifier model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.

Returns:

response : dict

success or failure response to model creation request

slicematrixIO.client module

High Level Python Client for the SliceMatrix-IO Machine Learning PaaS

class slicematrixIO.client.SliceMatrix(api_key, region='us-east-1')

Main business object for slicematrixIO-python

Builds upon low level api (ConnectIO) to create high level objects for each model type.

The models are meant to be created by the client, as opposed to instantiated directly.

Parameters:

api_key: string

A Valid SliceMatrix-IO API Key

region : string [‘us-east-1’, ‘us-west-1’, ‘eu-central-1’, ‘ap-southeast-1’]

Data center of choice. API Key must be valid for that specific data center. Latency will be lowest if client is closest to data center.

‘us-east-1’: US East Coast Data Center

‘us-west-1’: US West Coast Data Center

‘eu-central-1’: Continental Europe Data Center

‘ap-southeast-1’: South-East Asian Data Center

Examples

Create a Kernel Density Estimator model that lives in the cloud

>>> sm = SliceMatrix(api_key)
>>> kde = sm.KernelDensityEstimator(dataset=df)

Score a new data point

>>> kde.score(10325632)

Simulate 1000 new data points

>>> kde.simulate(1000) 

Manifold Learning:

>>> iso = sm.Isomap(dataset=prices)

Get statistics / factors related to internal graph structure of each node

>>> iso.rankNodes("pagerank")

Find low dimensional embedding of input data

>>> iso.embedding() 

See www.slicematrix.com/use-cases for more in-depth examples

Attributes

client (ConnectIO) Low level SliceMatrix-IO Python client

Methods

BasicA2D(dataset=None, retrain=True, name=None, pipeline=None)

Create a BasicA2D model (basic automatic anomaly detection)

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

retrain : boolean, optional

Whether to automatically retrain the model upon a remote call to the update method. The BasicA2D is a window detector which can be retrained in an online fashion where new data is used to update the model’s understanding of the world and influence future anomaly scoring.

Returns:

model : slicematrixIO.distributions.BasicA2D

CorrelationFilteredGraph(dataset=None, K=3, name=None, pipeline=None)

Create a Correlation Filtered Graph

CFGs are similar to MSTs in that both graphs begin with a distance matrix, but whereas MSTs are limited to constructing a tree, CFGs draw links between a node and its closest K neighbors based on correlation distance. CFGs are like KNN networks, but optimized for using correlation distance.

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features

K : integer greater than 1, optional

The number of nearest neighbors to use for constructing the CFG

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : slicematrixIO.graphs.CorrelationFilteredGraph
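
The idea of linking each node to its K closest neighbors by correlation distance can be sketched locally as follows. This is an illustration only: the helper names are hypothetical, and the distance d = sqrt(2 * (1 - rho)) is a common convention assumed here, since the exact definition used by the service is not stated in this reference.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_filtered_edges(series, k=3):
    """Link every column to its k nearest columns by correlation distance."""
    names = list(series)
    edges = set()
    for u in names:
        ranked = sorted(
            # Assumed distance: sqrt(2 * (1 - rho)); clamp tiny negatives to 0.
            (math.sqrt(max(0.0, 2.0 * (1.0 - pearson(series[u], series[v])))), v)
            for v in names
            if v != u
        )
        for _, v in ranked[:k]:
            edges.add(tuple(sorted((u, v))))  # store as an undirected edge
    return edges


series = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.5],
    "c": [4.0, 3.0, 2.0, 1.0],
    "d": [1.0, 3.0, 2.0, 4.0],
}
edges = correlation_filtered_edges(series, k=1)
```

Unlike a Minimum Spanning Tree, the result is not forced to be a tree: it may contain cycles or be disconnected, depending on K and the data.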

DistanceMatrix(dataset=None, K=5, kernel='euclidean', kernel_params={}, geodesic=False, name=None, pipeline=None)

Create a Distance Matrix model

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

K : integer greater than 1, optional, ignored if geodesic == False

The number of neighbors to use in building the geodesic distance matrix. Geodesic distance is constructed by computing the K Nearest Neighbors graph for the input dataset, then constructing all pairwise distances using the geodesic distance, i.e. the number of edges in a shortest path between two points on the graph.

kernel : string, optional

The distance kernel / metric to use in constructing the distance matrix. Default is euclidean.

kernel_params : dict, optional

Any extra parameters specific to the chosen kernel

geodesic : boolean, optional

Whether to create the geodesic distance matrix or the brute force pairwise distance matrix. Default is False

Returns:

model : slicematrixIO.matrices.DistanceMatrix
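
The geodesic option described above (K Nearest Neighbors graph, then the number of edges on a shortest path) can be sketched locally with a breadth-first search. This is a minimal sketch rather than the service's implementation; the helper names knn_graph and hop_distances, the Euclidean metric, and the choice to treat the neighbor graph as undirected are assumptions.

```python
import math
from collections import deque


def knn_graph(points, k):
    """Build an undirected K Nearest Neighbors graph as an adjacency dict."""
    adjacency = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        for _, j in nearest:
            adjacency[i].add(j)
            adjacency[j].add(i)  # assumption: links are symmetric
    return adjacency


def hop_distances(adjacency, source):
    """Geodesic distance from source: edges on a shortest path (BFS)."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in distances:
                distances[v] = distances[u] + 1
                queue.append(v)
    return distances


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
graph = knn_graph(points, k=1)
from_first = hop_distances(graph, 0)
```

Nodes that cannot be reached from the source are simply absent from the returned dict, which is one reason a K that is too small can leave parts of a geodesic distance matrix undefined.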

IsolationForest(dataset=None, rate=0.1, n_trees=100, name=None, pipeline=None)

Create an Isolation Forest model for automatic anomaly detection

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

rate : float in (0., 0.5), optional

The desired rate of anomaly detection in training data. Default is 0.1 i.e. 10%

n_trees : integer greater than 1, optional

The number of trees to use in construction of the Isolation Forest model. Default is 100 trees

Returns:

model : slicematrixIO.distributions.IsolationForest

Isomap(dataset=None, D=2, K=3, name=None, pipeline=None)

Create an Isomap model for non-linear dimensionality reduction

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features

D : int, optional

The desired embedding dimension. Defaults to 2-D

K : integer greater than 1, optional

The number of neighbors to use in building the geodesic distance matrix. Geodesic distance is constructed by computing the K Nearest Neighbors graph for the input dataset, then constructing all pairwise distances using the geodesic distance, i.e. the number of edges in a shortest path between two points on the graph.

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : slicematrixIO.manifolds.Isomap

KNNClassifier(dataset=None, class_column=None, name=None, pipeline=None, K=5, kernel='euclidean', algo='auto', weights='uniform', kernel_params={})

Create K Nearest Neighbors Classifier model

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels

class_column : string

The name of the column in the input dataset which describes the class labels

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

K : integer, optional

The desired K in the Nearest Neighbor classifier model

kernel : string [ ‘euclidean’ | ‘minkowski’ | ‘hammond’ | ‘etc...’], optional

The desired kernel for defining distance in our classifier. Default is ‘euclidean’

algo : string [‘auto’ | ‘ball’ | ‘kd_tree’ | ‘brute’], optional

The algorithm to use in determining Nearest Neighbors. Default is ‘auto’

weights : string [‘uniform’ | ‘weighted’], optional

Whether voting is uniform (i.e. independent of distance) or weighted by distance (i.e. closer neighbors have higher-weighted votes)

kernel_params : dict, optional

Any parameters specific to the chosen kernel

Returns:

model : slicematrixIO.classifiers.KNNClassifier

KNNRegressor(X=None, Y=None, K=5, kernel='euclidean', algo='auto', weights='uniform', kernel_params={}, name=None, pipeline=None)

Create a K Nearest Neighbors Regressor model

Multi-output regression finds a function from the input space X to a lower-dimensional output space Y

Parameters:

X : pandas.DataFrame

Input DataFrame. shape = (n_rows, input_features) where each row is a data point and the columns are numeric features

Y : pandas.DataFrame

Output DataFrame. shape = (n_rows, output_features) where output_features < input_features and each row is a data point and the columns are numeric features

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

K : integer, optional

The desired K in the Nearest Neighbor regressor model

kernel : string [ ‘euclidean’ | ‘minkowski’ | ‘hammond’ | ‘etc...’], optional

The desired kernel for defining distance in our regressor. Default is ‘euclidean’

algo : string [‘auto’ | ‘ball’ | ‘kd_tree’ | ‘brute’], optional

The algorithm to use in determining Nearest Neighbors. Default is ‘auto’

weights : string [‘uniform’ | ‘weighted’], optional

Whether neighbors contribute uniformly (i.e. independent of distance) or weighted by distance (i.e. closer neighbors have higher-weighted predictions)

kernel_params : dict, optional

Any parameters specific to the chosen kernel

Returns:

model : slicematrixIO.regressors.KNNRegressor
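
For the regressor, the weights parameter controls how the outputs of the K nearest neighbors are averaged. The sketch below shows this for a single scalar output; it is a local illustration, not the service's implementation, and the helper name knn_regress and the inverse-distance weights are assumptions.

```python
import math


def knn_regress(train_x, train_y, query, k=5, weights="uniform"):
    """Predict a scalar output as an average over the k nearest neighbors."""
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    if weights == "uniform":
        w = [1.0] * len(nearest)
    else:
        # Assumed weighting: inverse distance, so closer neighbors count more.
        w = [1.0 / (distance + 1e-12) for distance, _ in nearest]
    return sum(wi * y for wi, (_, y) in zip(w, nearest)) / sum(w)


train_x = [(0.0,), (2.0,)]
train_y = [0.0, 10.0]
uniform_pred = knn_regress(train_x, train_y, (0.5,), k=2, weights="uniform")
weighted_pred = knn_regress(train_x, train_y, (0.5,), k=2, weights="weighted")
```

With uniform weights the prediction is the plain mean of the neighbor outputs; with distance weighting it is pulled toward the output of the closer neighbor.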

KalmanOLS(dataset=None, init_alpha=None, init_beta=None, trans_cov=None, obs_cov=None, init_cov=None, optimizations=[], name=None, pipeline=None)

Create slicematrixIO.bayesian_filters.KalmanOLS object with current client

The KalmanOLS model

Parameters:

dataset: pandas.DataFrame

Input DataFrame. shape = (nrows, 2) where the first column is Y and the second is X in OLS model

init_alpha : float, optional

Initial value for alpha in OLS model (ignored if optimizations are enabled)

init_beta : float, optional

Initial value for beta in OLS model (ignored if optimizations are enabled)

trans_cov : array-like, optional

Transition covariance, shape = (2, 2)

init_cov : array-like, optional

Initial covariance, shape = (2, 2)

optimizations : list, optional

List of optimizations. Can include multiple optimizations. Default includes all:

  • ‘transition_covariance’
  • ‘observation_covariance’
  • ‘initial_state_mean’
  • ‘initial_state_covariance’

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : BasePipeline, optional

Pipeline to use. Defaults to None. If None then a pipeline will be created for use in creating the model

Returns:

model : slicematrixIO.bayesian_filters.KalmanOLS

KernelDensityEstimator(dataset=None, bandwith='scott', kernel_params={}, name=None, pipeline=None)

Train a Kernel Density Estimator model

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

bandwidth : str [‘scott’ | ‘silverman’], optional

The method for bandwidth selection in the KDE model

kernel_params : dict, optional

Any parameters specific to the chosen kernel

Returns:

model : slicematrixIO.distributions.KernelDensityEstimator
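
The two bandwidth options correspond to well-known rules of thumb. As a hedged sketch, the factors below follow the conventions used by common Gaussian KDE implementations (Scott: n ** (-1 / (d + 4)); Silverman: (n * (d + 2) / 4) ** (-1 / (d + 4))); whether the service uses exactly these formulas is an assumption, and the helper names are hypothetical.

```python
import math
import statistics


def bandwidth_factor(n, d, method="scott"):
    """Rule-of-thumb bandwidth factor for n samples in d dimensions."""
    if method == "scott":
        return n ** (-1.0 / (d + 4))
    if method == "silverman":
        return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))
    raise ValueError("method must be 'scott' or 'silverman'")


def kde_1d(samples, x, method="scott"):
    """Evaluate a 1-D Gaussian kernel density estimate at x."""
    # Bandwidth = rule-of-thumb factor times the sample standard deviation.
    h = bandwidth_factor(len(samples), 1, method) * statistics.stdev(samples)
    norm = len(samples) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples) / norm


samples = [0.0, 0.1, -0.1, 0.2, -0.2]
density_near = kde_1d(samples, 0.0)
density_far = kde_1d(samples, 5.0)
```

In one dimension the Silverman factor is slightly larger than the Scott factor for the same n, so it produces a somewhat smoother estimate.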

KernelPCA(dataset=None, D=2, kernel='linear', alpha=1.0, invert=False, kernel_params={}, name=None, pipeline=None)

Create a Kernel Principal Components Analysis model for non-linear dimensionality reduction. Applies the kernel trick to PCA

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features

D : int, optional

The desired embedding dimension. Defaults to 2-D

kernel : string, optional

The kernel to use in the Kernel PCA model. Default is ‘linear’.

alpha : float, optional

Parameter of ridge regression which learns the inverse transform. Ignored if invert == False

invert : boolean, optional

Whether to learn the inverse transform (from low dimension space back to high dimension space)

kernel_params : dict, optional

Any extra parameters specific to the chosen kernel

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : slicematrixIO.manifolds.KernelPCA

KernelRidgeRegressor(X=None, Y=None, kernel='linear', alpha=1.0, kernel_params={}, name=None, pipeline=None)

Create a Kernel Ridge Regressor model

Parameters:

X : pandas.DataFrame

Input DataFrame. shape = (n_rows, input_features) where each row is a data point and the columns are numeric features

Y : pandas.DataFrame

Output DataFrame. shape = (n_rows, output_features) where output_features < input_features and each row is a data point and the columns are numeric features

alpha : float, optional

Kernel Ridge Regressor model alpha value. Default 1.0

kernel : string [‘linear’, ‘rbf’, ‘poly’]

Kernel to use in regression. Linear is default. For nonlinear datasets, consider rbf or poly

‘linear’ : linear kernel

‘rbf’ : radial basis function kernel

‘poly’ : polynomial kernel

kernel_params : dict

Kernel specific parameters

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : slicematrixIO.regressors.KernelRidgeRegressor
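
For reference, the three kernel names listed above are usually defined as in the sketch below. The parameter names gamma, degree, and coef0 follow a common convention and are assumptions here; this reference does not state which keys kernel_params accepts.

```python
import math


def linear_kernel(x, y):
    """Plain inner product <x, y>."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def poly_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel: (gamma * <x, y> + coef0) ** degree."""
    return (gamma * linear_kernel(x, y) + coef0) ** degree


# Lookup table keyed by the kernel names used in this reference.
KERNELS = {"linear": linear_kernel, "rbf": rbf_kernel, "poly": poly_kernel}

value = KERNELS["poly"]((1.0, 2.0), (3.0, 4.0), degree=2)
```

The linear kernel reduces Kernel Ridge Regression to ordinary ridge regression, while 'rbf' and 'poly' let the model fit non-linear relationships.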

LaplacianEigenmapper(dataset=None, D=2, affinity='knn', K=5, gamma=1.0, name=None, pipeline=None)

Create a Laplacian Eigenmapper model (a.k.a. spectral embedder) for non-linear dimensionality reduction

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features

D : int, optional

The desired embedding dimension. Defaults to 2-D

affinity : string [“knn” | “rbf”], optional

How should we construct the affinity matrix?

“knn” : use k nearest neighbors graph

“rbf” : use radial basis function kernel

K : integer greater than 1, optional

The K to use if affinity is “knn”.

gamma : float, optional

Kernel coefficient for affinity “rbf”

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : slicematrixIO.manifolds.LaplacianEigenmapper

LocalLinearEmbedder(dataset=None, D=2, K=3, method='standard', name=None, pipeline=None)

Create a Local Linear Embedder model for non-linear dimensionality reduction

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features

D : int, optional

The desired embedding dimension. Defaults to 2-D

K : integer greater than 1, optional

The number of neighbors to use in building the embedding. Default is 3

method : string [‘standard’ | ‘hessian’ | ‘modified’ | ‘ltsa’]

Which LLE algorithm should we use?

‘standard’ : standard LLE method

‘hessian’: hessian eigenmap LLE method, requires that K > D * (1 + (D + 1) / 2)

‘modified’ : modified LLE method

‘ltsa’: local tangent space alignment LLE method

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : slicematrixIO.manifolds.LocalLinearEmbedder
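
The constraint on K for the ‘hessian’ method, K > D * (1 + (D + 1) / 2), is easy to get wrong, so a small local check may help before submitting a model. The helper names below are hypothetical and are not part of slicematrixIO.

```python
import math


def hessian_k_is_valid(K, D):
    """True if K satisfies the hessian LLE requirement K > D * (1 + (D + 1) / 2)."""
    return K > D * (1 + (D + 1) / 2)


def min_neighbors_for_hessian(D):
    """Smallest integer K that satisfies the hessian LLE requirement."""
    return math.floor(D * (1 + (D + 1) / 2)) + 1


default_ok = hessian_k_is_valid(3, 2)   # the documented defaults K=3, D=2
smallest_k = min_neighbors_for_hessian(2)
```

Note that the documented defaults (K=3, D=2) do not satisfy the requirement, so K has to be raised explicitly when choosing method = 'hessian'.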

MatrixAgglomerator(label_dataset=None, alpha=0.1, matrix=None, matrix_name=None, matrix_type=None, name=None, pipeline=None)

Create a Matrix Agglomerator model. Essential for supervised manifold learning, this model takes a previously created matrix model as input and applies class label information to the similarity matrix. In a nutshell, this model pulls data points of the same class closer together, increasing the separability of the dataset.

Parameters:

label_dataset : pandas.DataFrame

The class label information. shape = (n_rows, 1). n_rows should be same dimension as input matrix

alpha : float in (0., 1.)

The agglomeration factor, i.e. how much the class labels affect the input distances. An alpha of 0 will have no effect, while 1.0 will pull data points of the same class completely together. The higher the value of alpha, the more information will be transferred to the distance matrix from the class labels. Higher alphas increase the in-sample performance but also increase the chance of over-fitting

matrix : object

Matrix model object; i.e. a class from slicematrixIO.matrices

matrix_name : string, optional

The name of the existing matrix model. Optional if matrix is not None

matrix_type :

The type of the matrix model. Optional if matrix is not None. Required if using matrix_name

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : slicematrixIO.matrix_models.MatrixAgglomerator
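
One simple way to picture the effect of alpha is to shrink the distance between every pair of same-class points by a factor of (1 - alpha). The sketch below does exactly that; it is only an assumed formula chosen to match the behaviour described above (alpha = 0 has no effect, alpha = 1.0 pulls same-class points completely together), not the formula used by the service, and the helper name agglomerate is hypothetical.

```python
def agglomerate(distances, labels, alpha=0.1):
    """Shrink same-class distances by (1 - alpha); leave other pairs unchanged."""
    n = len(labels)
    return [
        [
            distances[i][j] * (1.0 - alpha)
            if labels[i] == labels[j]
            else distances[i][j]
            for j in range(n)
        ]
        for i in range(n)
    ]


distances = [
    [0.0, 2.0, 4.0],
    [2.0, 0.0, 3.0],
    [4.0, 3.0, 0.0],
]
labels = ["a", "a", "b"]
unchanged = agglomerate(distances, labels, alpha=0.0)
halved = agglomerate(distances, labels, alpha=0.5)
collapsed = agglomerate(distances, labels, alpha=1.0)
```

Because only same-class pairs are modified, class separation improves in-sample, which is also why high alpha values risk over-fitting to the training labels.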

MatrixKernelPCA(D=2, matrix=None, matrix_name=None, matrix_type=None, name=None, pipeline=None)

Decompose the input matrix and embed the input into a lower dimension space

Parameters:

D : int, optional

The desired embedding dimension. Defaults to 2-D

matrix : object

Matrix model object; i.e. a class from slicematrixIO.matrices

matrix_name : string, optional

The name of the existing matrix model. Optional if matrix is not None

matrix_type :

The type of the matrix model. Optional if matrix is not None. Required if using matrix_name

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : slicematrixIO.matrix_models.MatrixKernelPCA

MatrixMinimumSpanningTree(matrix=None, matrix_name=None, matrix_type=None, name=None, pipeline=None)

Create a Minimum Spanning Tree model from a Distance Matrix model

This is an example of a Matrix Model, which creates a machine learning model using another already trained model as its input. You can think of this as model chaining.

In this case, this function takes a previously created Distance Matrix model and uses it to construct a network graph model called a Minimum Spanning Tree.

Parameters:

matrix : object

Matrix model object; i.e. a class from slicematrixIO.matrices

matrix_name : string, optional

The name of the existing matrix model. Optional if matrix is not None

matrix_type :

The type of the matrix model. Optional if matrix is not None. Required if using matrix_name

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : slicematrixIO.matrix_models.MatrixMinimumSpanningTree

MinimumSpanningTree(dataset=None, corr_method='pearson', name=None, pipeline=None)

Create a Minimum Spanning Tree graph model

MST models transform the input dataset into a distance matrix then construct a graph with the shortest possible total distance which visits all nodes without cycling (i.e. it creates a tree)

In particular, this model constructs the graph using the correlation matrix. For more flexible options in creating a MST graph, use slicematrixIO.matrix_models.MatrixMinimumSpanningTree in combination with a distance matrix

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features

corr_method : string [“pearson” | “spearman” | “kendall” ]

Which method should we use for computing the correlation matrix?

“pearson” : use the Pearson correlation coefficient

“spearman” : use Spearman’s rho

“kendall” : use Kendall’s tau

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : slicematrixIO.graphs.MinimumSpanningTree

NeighborNetworkGraph(dataset=None, K=3, kernel='euclidean', name=None, pipeline=None)

Create a K Nearest Neighbor Graph for the given dataset

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features

K : integer greater than 1, optional

The number of nearest neighbors to use for constructing the K Nearest Neighbor graph

kernel : string, optional

The distance kernel / metric to use in constructing the distance matrix. Default is euclidean.

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : slicematrixIO.graphs.NeighborNetworkGraph
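To make the role of K and the default euclidean kernel concrete, here is a minimal sketch of a K Nearest Neighbor graph in plain Python. It is an illustration only: knn_graph is a hypothetical helper, the points are made up, and the service may break ties or symmetrize edges differently.

```python
import math


def knn_graph(points, k=3):
    """Return (source, target, distance) edges to each node's k nearest neighbors."""
    edges = []
    for name, coords in points.items():
        # Euclidean distance from this node to every other node, nearest first
        dists = sorted(
            (math.dist(coords, other), other_name)
            for other_name, other in points.items()
            if other_name != name
        )
        for dist, other_name in dists[:k]:
            edges.append((name, other_name, dist))
    return edges


points = {
    "a": (0.0, 0.0),
    "b": (1.0, 0.0),
    "c": (0.0, 2.0),
    "d": (5.0, 5.0),
}
edges = knn_graph(points, k=2)
for source, target, dist in edges:
    print(source, target, round(dist, 3))
```

Note that the relation is not symmetric: "d" links to "c" and "b", but neither of them lists "d" among its own two nearest neighbors.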

PNNClassifier(dataset=None, class_column=None, name=None, pipeline=None, sigma=0.1)

Create a Probabilistic Neural Network Classifier model

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels

class_column : string

The name of the column in the input dataset which describes the class labels

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

sigma : float in (0., 1.), optional

The desired smoothing parameter for the PNN model. Default is 0.1

Returns:

model : slicematrixIO.classifiers.PNNClassifier
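The effect of sigma is easiest to see in a small Parzen-window classifier, which is the idea behind a Probabilistic Neural Network. The sketch below is an assumption-laden illustration in plain Python, not the service's implementation: pnn_predict is a hypothetical name, and the Gaussian kernel with variance sigma squared is the textbook choice rather than something this documentation confirms.

```python
import math
from collections import defaultdict


def pnn_predict(train, labels, point, sigma=0.1):
    """Classify `point` by the largest average Gaussian activation per class."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for features, label in zip(train, labels):
        sq_dist = sum((a - b) ** 2 for a, b in zip(features, point))
        # Smaller sigma makes each training point influence a smaller region
        sums[label] += math.exp(-sq_dist / (2.0 * sigma ** 2))
        counts[label] += 1
    scores = {label: sums[label] / counts[label] for label in sums}
    return max(scores, key=scores.get), scores


train = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
labels = ["low", "low", "high", "high"]

predicted, scores = pnn_predict(train, labels, (0.15, 0.15), sigma=0.1)
print(predicted)
```

Features on very different scales would make a single sigma hard to choose, which is one reason to normalize the numeric columns before training.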

RFRegressor(X=None, Y=None, n_trees=8, name=None, pipeline=None)

Create a Random Forest Regressor model

A Random Forest Regressor finds a function which maps the input space (X) to the lower dimension output space (Y) using decision trees

Parameters:

X : pandas.DataFrame

Input DataFrame. shape = (n_rows, input_features) where each row is a data point and the columns are numeric features

Y : pandas.DataFrame

Output DataFrame. shape = (n_rows, output_features) where output_features < input_features and each row is a data point and the columns are numeric features

n_trees : integer greater than 1, optional

The number of trees to use in construction of the Random Forest model. Default is 8 trees

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : slicematrixIO.regressors.RFRegressor

slicematrixIO.connect module

Low level SliceMatrix-IO API client

class slicematrixIO.connect.ConnectIO(api_key, region='us-east-1')

Low Level Connection to SliceMatrix-IO

Implements basic interface

Parameters:

api_key : string

Valid SliceMatrix-IO API Key

region : string [‘us-east-1’, ‘us-west-1’, ‘eu-central-1’, ‘ap-southeast-1’]

Data center of choice. API Key must be valid for that specific data center. Latency will be lowest if client is closest to data center.

‘us-east-1’: US East Coast Data Center

‘us-west-1’: US West Coast Data Center

‘eu-central-1’: Continental Europe Data Center

‘ap-southeast-1’: South-East Asian Data Center

Examples

>>> from slicematrixIO.connect import ConnectIO
>>> io = ConnectIO(api_key)
>>> io.create_pipeline(...)
>>> io.run_pipeline(...)
>>> io.call_model(...)

Attributes

uploader (object) Convenience class for uploading data to SliceMatrix-IO
region (string)

Methods

call_model(model, type, method, extra_params={}, memory='large')

Remotely call a method in a machine learning model

Parameters:

model : string

The name of the model

type : string

The type of the model

method: string

The name of the model method to call remotely. Acceptable inputs vary by Pipeline type. See Pipeline docs for more information

extra_params : dict

Any extra parameters to pass as key / values to the Pipeline

memory: string [ ‘large’]

The size of the container (always set to large for beta)

Returns:

model_output: dict

create_pipeline(name, type, params={})

Create a new Analytical Pipeline for distributed computation

Parameters:

name : string

The desired name of the new Pipeline

type : string [ ‘raw_isomap’ | ‘raw_mst’ | ‘raw_lle’ | ‘raw_cfg’ | ‘raw_kde’ | ‘raw_knn_net’ | ‘raw_knn_classifier’ | ‘raw_knn_regressor’ | ‘raw_kpca’ | ‘raw_krr’ | ‘raw_rfr’ | ‘raw_laplacian’ | ‘raw_pnn’ | ‘matrix_mst’ | ‘matrix_kpca’ | ‘matrix_agg’ | ‘kalman_ols’ | ‘basic_a2d’ | ‘isolation_forest’ | ‘dist_matrix’]

The type of the Pipeline

params: dict

Any type specific parameters to the Pipeline in the key/val dictionary

Returns:

response: dict

Notes

The basic structure of computation in SliceMatrix-IO starts with the Pipeline.

Pipelines can be thought of as analytical assembly lines, running code which transforms a dataset from raw input data into a meaningful machine learning model. Each pipeline can be reused to process multiple datasets. Pipelines can also be run in parallel.
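Because the Pipeline type is passed as a plain string, a typo is only discovered after a round trip to the service. A small client-side check can catch it earlier. The sketch below is a hypothetical helper (pipeline_request is not part of slicematrixIO); the set of type strings is copied from the create_pipeline parameter list above.

```python
# Pipeline type strings copied from the create_pipeline documentation
PIPELINE_TYPES = frozenset([
    "raw_isomap", "raw_mst", "raw_lle", "raw_cfg", "raw_kde",
    "raw_knn_net", "raw_knn_classifier", "raw_knn_regressor",
    "raw_kpca", "raw_krr", "raw_rfr", "raw_laplacian", "raw_pnn",
    "matrix_mst", "matrix_kpca", "matrix_agg",
    "kalman_ols", "basic_a2d", "isolation_forest", "dist_matrix",
])


def pipeline_request(name, type, params=None):
    """Hypothetical helper: validate arguments before calling create_pipeline."""
    if type not in PIPELINE_TYPES:
        raise ValueError("unknown Pipeline type: %r" % (type,))
    # Copy params so a dict shared between calls is never mutated
    return {"name": name, "type": type, "params": dict(params or {})}


request = pipeline_request("my_mst_pipeline", "raw_mst")
print(request)
```

The returned keys match the documented signature, so the dictionary could be passed on as io.create_pipeline(**request).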

list_files()

Get a list of the files previously uploaded

Returns:

file_list : list

put_df(name, dataframe)

Upload the DataFrame with desired name and get response (success | failure)

Parameters:

name : string

The desired name of the DataFrame

dataframe: pandas.DataFrame

The DataFrame for uploading to the SliceMatrix-IO backend

Returns:

response : dict

run_pipeline(name, model, type=None, dataset=None, matrix_name=None, matrix_type=None, X=None, Y=None, extra_params={}, memory='large')

Run a Pipeline with the given dataset

Parameters:

name : string

The name of the target Pipeline

model : string

The desired name of the model

type : string [ ‘raw_isomap’ | ‘raw_mst’ | ‘raw_lle’ | ‘raw_cfg’ | ‘raw_kde’ | ‘raw_knn_net’ | ‘raw_knn_classifier’ | ‘raw_knn_regressor’ | ‘raw_kpca’ | ‘raw_krr’ | ‘raw_rfr’ | ‘raw_laplacian’ | ‘raw_pnn’ | ‘matrix_mst’ | ‘matrix_kpca’ | ‘matrix_agg’ | ‘kalman_ols’ | ‘basic_a2d’ | ‘isolation_forest’ | ‘dist_matrix’]

The type of the Pipeline

dataset : string

The name of the dataset to run through the Pipeline

matrix_name : string

The name of the matrix model to run through the Pipeline (for Matrix Models)

matrix_type : string [ ‘dist_matrix’ | ‘matrix_agg’ ]

The type of matrix

X : string

The name of the X input (for multi-output regression models)

Y : string

The name of the Y input (for multi-output regression models)

extra_params : dict

Any extra parameters to pass as key / values to the Pipeline

memory: string [ ‘large’]

The size of the container (always set to large for beta)

Returns:

response : dict

Notes

This is a very flexible function for running any Pipeline in the SliceMatrix-IO platform.

Most Pipelines will take a single dataset name as input (such as raw_isomap and raw_knn_classifier), whereas others will have more complex inputs. Matrix Models will take matrix_name and matrix_type parameters and regression models will require the names of input (X) and output (Y) training sets.
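The three input styles described in these notes can be summarized as a small argument builder. This is a sketch under assumptions: build_run_kwargs is a hypothetical helper, and treating the three styles as mutually exclusive is an inference from the notes rather than documented behavior.

```python
# Matrix types listed for the matrix_type parameter of run_pipeline
MATRIX_TYPES = ("dist_matrix", "matrix_agg")


def build_run_kwargs(dataset=None, matrix_name=None, matrix_type=None, X=None, Y=None):
    """Return the input keyword arguments for exactly one documented style."""
    given = {
        "dataset": dataset is not None,
        "matrix": matrix_name is not None or matrix_type is not None,
        "regression": X is not None or Y is not None,
    }
    if sum(given.values()) != 1:
        raise ValueError("pass exactly one of: dataset, matrix_name/matrix_type, X/Y")
    if given["dataset"]:
        return {"dataset": dataset}
    if given["matrix"]:
        if matrix_name is None or matrix_type not in MATRIX_TYPES:
            raise ValueError("Matrix Models need matrix_name and a valid matrix_type")
        return {"matrix_name": matrix_name, "matrix_type": matrix_type}
    if X is None or Y is None:
        raise ValueError("regression models need both X and Y dataset names")
    return {"X": X, "Y": Y}


print(build_run_kwargs(dataset="prices"))
print(build_run_kwargs(matrix_name="my_distances", matrix_type="dist_matrix"))
print(build_run_kwargs(X="train_inputs", Y="train_outputs"))
```

All values are names of previously uploaded datasets or existing models, not the data itself, which matches the string types documented for run_pipeline.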

class slicematrixIO.connect.Uploader(api_key, region, api)

Object to handle uploads to SliceMatrix-IO backend

Parameters:

api_key : string

Valid SliceMatrix-IO API Key

region : string [‘us-east-1’, ‘us-west-1’, ‘eu-central-1’, ‘ap-southeast-1’]

Data center of choice. API Key must be valid for that specific data center. Latency will be lowest if client is closest to data center.

‘us-east-1’: US East Coast Data Center

‘us-west-1’: US West Coast Data Center

‘eu-central-1’: Continental Europe Data Center

‘ap-southeast-1’: South-East Asian Data Center

api : string

API ID

Examples

>>> uploader = Uploader(api_key)
>>> uploader.put_df("my_dataframe", df)   

Methods

get_upload_url(file_name)

list_files()

Get a list of the files previously uploaded

Returns:

file_list : list

put_df(name, df)

Upload the DataFrame with desired name and get response (success | failure)

Parameters:

name : string

The desired name of the DataFrame

df: pandas.DataFrame

The DataFrame for uploading to the SliceMatrix-IO backend

Returns:

response : dict

slicematrixIO.core module

Core classes

class slicematrixIO.core.BasePipeline(name, type, client=None, params={})

The base class for every Pipeline

Parameters:

name : string

The desired name of the Pipeline

type : string

The type of the Pipeline

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

response : dict

success or failure response to Pipeline creation request

Methods

run(model, type=None, dataset=None, matrix_name=None, matrix_type=None, X=None, Y=None, extra_params={})

Run the Pipeline and create a new model

Parameters:

dataset : pandas.DataFrame

The dataset to pass into the Pipeline which will train a model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.

slicematrixIO.distributions module

class slicematrixIO.distributions.BasicA2D(dataset=None, name=None, pipeline=None, retrain=True, client=None)

Methods

getState()

score(value)

update(value)

class slicematrixIO.distributions.BasicA2DPipeline(name, retrain=True, client=None)

Bases: slicematrixIO.core.BasePipeline

Methods

run(dataset, model)

class slicematrixIO.distributions.IsolationForest(dataset=None, name=None, pipeline=None, rate=0.1, n_trees=100, client=None)

Methods

score(points)

training_scores()

class slicematrixIO.distributions.IsolationForestPipeline(name, rate=0.1, n_trees=100, client=None)

Bases: slicematrixIO.core.BasePipeline

Methods

run(dataset, model)

class slicematrixIO.distributions.KernelDensityEstimator(dataset=None, name=None, pipeline=None, bandwidth='scott', client=None)

Methods

hypercube(lower_bounds, upper_bounds)

simulate(N=1)

class slicematrixIO.distributions.KernelDensityEstimatorPipeline(name, bandwidth='scott', client=None)

Bases: slicematrixIO.core.BasePipeline

Methods

run(dataset, model)

slicematrixIO.graphs module

Classes for creating network graph models

class slicematrixIO.graphs.CorrelationFilteredGraph(dataset=None, name=None, pipeline=None, K=3, client=None)

Methods

edges()

Get a list of all the edges in the graph model

Returns:

edges : list

list of all edge / link tuples. Source is edge[0] Target is edge[1]
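Since edges are returned as tuples with the source at edge[0] and the target at edge[1], they are easy to regroup into an adjacency mapping for local analysis. The sketch below uses made-up node names and treats links as undirected, which is an assumption about the graph models rather than something stated here; adjacency_from_edges is a hypothetical helper.

```python
from collections import defaultdict


def adjacency_from_edges(edges):
    """Build an undirected adjacency mapping from (source, target, ...) tuples."""
    adjacency = defaultdict(set)
    for edge in edges:
        source, target = edge[0], edge[1]
        adjacency[source].add(target)
        adjacency[target].add(source)
    # Sorted lists give a stable, printable result
    return {node: sorted(neighbors) for node, neighbors in adjacency.items()}


edges = [("AAPL", "MSFT"), ("MSFT", "GOOG"), ("GOOG", "AMZN")]
adjacency = adjacency_from_edges(edges)
print(adjacency["MSFT"])
```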

neighborhood(node)

Get the nearest neighbors of the given node

Parameters:

node : string

The name of the target node whose neighbors (the nodes it shares an edge with) we want to find

Returns:

neighbors : dict

Dictionary of nearest neighbors with distances to target node

nodes()

Get the names of the data points / nodes that make up the training dataset

Returns:

nodes : list

Data point names / indices

Rank the links by weight, if applicable

Returns:

links : dict

dictionary of links with associated weight, if applicable

rankNodes(statistic='closeness_centrality')

Rank the model’s nodes by the given network graph statistic / factor

Parameters:

statistic : string [‘degree_centrality’ | ‘eigen_centrality’ | ‘closeness_centrality’ | ‘betweenness_centrality’ | ‘is_connected’ | ‘curr_flow_centrality’ | ‘pagerank’ | ‘hits’ | ‘communicability’ | ‘clustering’ | ‘square_clustering’ | ‘greedy_colors’ | ‘eccentricity’ | ‘clique_numbers’ | ‘number_of_cliques’ | ‘estrada_index’ | ‘assortivity’ | ‘transitivity’ | ‘avg_clustering’ | ‘maximal_matching’ | ‘max_weight_matching’ | ‘dispersion’]

The desired graph statistic

Returns:

stats : array-like

Depending on the statistic this will be an array or a single float value
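For intuition about the default statistic, closeness centrality scores a node by how few steps it needs to reach every other node. The sketch below computes it on unweighted hop counts with a breadth-first search. It is an illustration only: closeness_centrality here is a local hypothetical function, and the service may use link weights or a different normalization.

```python
from collections import deque


def closeness_centrality(adjacency):
    """Closeness = (reachable nodes - 1) / sum of hop distances, per node."""
    scores = {}
    for start in adjacency:
        # Breadth-first search gives shortest hop counts from `start`
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total = sum(dist.values())
        scores[start] = (len(dist) - 1) / total if total else 0.0
    return scores


# A simple chain: a - b - c - d
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = closeness_centrality(adjacency)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The two interior nodes of the chain rank highest because they are at most two hops from everything else.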

class slicematrixIO.graphs.CorrelationFilteredGraphPipeline(name, K=3, client=None)

Bases: slicematrixIO.core.BasePipeline

Create a Pipeline for training CorrelationFilteredGraph models.

CFGs are similar to MSTs in that both graphs begin with a distance matrix, but whereas MSTs are limited to constructing a tree, CFGs draw links between a node and its closest K neighbors based on correlation distance. CFGs are like KNN networks, but optimized for using correlation distance.

Parameters:

name : string

The desired name of the Pipeline.

K : integer greater than 1, optional

The number of nearest neighbors to use for constructing the CFG

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

response : dict

success or failure response to Pipeline creation request

Examples

Create a Pipeline for training multiple CorrelationFilteredGraph models

>>> io = ConnectIO(api_key)
>>> pipe = CorrelationFilteredGraphPipeline(name = slicematrixIO.utils.rando_name(), client = io)
>>> for dataframe in dataframes:
...     current_model = pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())

Methods

run(dataset, model)

class slicematrixIO.graphs.MinimumSpanningTree(dataset=None, name=None, pipeline=None, corr_method='pearson', client=None)

Methods

edges()

Get a list of all the edges in the graph model

Returns:

edges : list

list of all edge / link tuples. Source is edge[0] Target is edge[1]

neighborhood(node)

Get the nearest neighbors of the given node

Parameters:

node : string

The name of the target node whose neighbors (the nodes it shares an edge with) we want to find

Returns:

neighbors : dict

Dictionary of nearest neighbors with distances to target node

nodes()

Get the names of the data points / nodes that make up the training dataset

Returns:

nodes : list

Data point names / indices

Rank the links by weight, if applicable

Returns:

links : dict

dictionary of links with associated weight, if applicable

rankNodes(statistic='closeness_centrality')

Rank the model’s nodes by the given network graph statistic / factor

Parameters:

statistic : string [‘degree_centrality’ | ‘eigen_centrality’ | ‘closeness_centrality’ | ‘betweenness_centrality’ | ‘is_connected’ | ‘curr_flow_centrality’ | ‘pagerank’ | ‘hits’ | ‘communicability’ | ‘clustering’ | ‘square_clustering’ | ‘greedy_colors’ | ‘eccentricity’ | ‘clique_numbers’ | ‘number_of_cliques’ | ‘estrada_index’ | ‘assortivity’ | ‘transitivity’ | ‘avg_clustering’ | ‘maximal_matching’ | ‘max_weight_matching’ | ‘dispersion’]

The desired graph statistic

Returns:

stats : array-like

Depending on the statistic this will be an array or a single float value

class slicematrixIO.graphs.MinimumSpanningTreePipeline(name, corr_method='pearson', client=None)

Bases: slicematrixIO.core.BasePipeline

Create a Pipeline for training MinimumSpanningTree models.

Parameters:

name : string

The desired name of the Pipeline.

corr_method : string [“pearson” | “spearman” | “kendall” ]

Which method should we use for computing the correlation matrix?

“pearson” : use the Pearson correlation coefficient

“spearman” : use Spearman’s rho

“kendall” : use Kendall’s tau

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

response : dict

success or failure response to Pipeline creation request

Examples

Create a Pipeline for training multiple MinimumSpanningTree models

>>> io = ConnectIO(api_key)
>>> pipe = MinimumSpanningTreePipeline(name = slicematrixIO.utils.rando_name(), client = io)
>>> for dataframe in dataframes:
...     current_model = pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())

Methods

run(dataset, model)

class slicematrixIO.graphs.NeighborNetworkGraph(dataset=None, name=None, pipeline=None, K=3, kernel='euclidean', client=None)

Methods

edges()

Get a list of all the edges in the graph model

Returns:

edges : list

list of all edge / link tuples. Source is edge[0] Target is edge[1]

neighborhood(node)

Get the nearest neighbors of the given node

Parameters:

node : string

The name of the target node whose neighbors (the nodes it shares an edge with) we want to find

Returns:

neighbors : dict

Dictionary of nearest neighbors with distances to target node

nodes()

Get the names of the data points / nodes that make up the training dataset

Returns:

nodes : list

Data point names / indices

Rank the links by weight, if applicable

Returns:

links : dict

dictionary of links with associated weight, if applicable

rankNodes(statistic='closeness_centrality')

Rank the model’s nodes by the given network graph statistic / factor

Parameters:

statistic : string [‘degree_centrality’ | ‘eigen_centrality’ | ‘closeness_centrality’ | ‘betweenness_centrality’ | ‘is_connected’ | ‘curr_flow_centrality’ | ‘pagerank’ | ‘hits’ | ‘communicability’ | ‘clustering’ | ‘square_clustering’ | ‘greedy_colors’ | ‘eccentricity’ | ‘clique_numbers’ | ‘number_of_cliques’ | ‘estrada_index’ | ‘assortivity’ | ‘transitivity’ | ‘avg_clustering’ | ‘maximal_matching’ | ‘max_weight_matching’ | ‘dispersion’]

The desired graph statistic

Returns:

stats : array-like

Depending on the statistic this will be an array or a single float value

class slicematrixIO.graphs.NeighborNetworkGraphPipeline(name, K=3, kernel='euclidean', client=None)

Bases: slicematrixIO.core.BasePipeline

Create a Pipeline for training NeighborNetworkGraph models.

Parameters:

name : string

The desired name of the Pipeline.

K : integer greater than 1, optional

The number of nearest neighbors to use for constructing the K Nearest Neighbor graph

kernel : string, optional

The distance kernel / metric to use in constructing the distance matrix. Default is euclidean.

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

response : dict

success or failure response to Pipeline creation request

Examples

Create a Pipeline for training multiple NeighborNetworkGraph models

>>> io = ConnectIO(api_key)
>>> pipe = NeighborNetworkGraphPipeline(name = slicematrixIO.utils.rando_name(), K = 5, client = io)
>>> for dataframe in dataframes:
...     current_model = pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())

Methods

run(dataset, model)

slicematrixIO.manifolds module

Manifold Learning Pipelines and Models

class slicematrixIO.manifolds.Isomap(dataset, name=None, pipeline=None, D=2, K=3, client=None)

Train / Reload an Isomap model

Parameters:

name : string, optional

The desired name of the model. If None then a random name will be generated. If dataset == None, then the name will be used to lazy load the model from the SliceMatrix-IO cloud.

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features

D : int, optional

The desired embedding dimension. Defaults to 2-D

K : integer greater than 1, optional, ignored if geodesic == False

The number of neighbors to use in building the geodesic distance matrix. Geodesic distance is constructed by computing the K Nearest Neighbors graph for the input dataset, then constructing all pairwise distances using the geodesic distance, i.e. the number of edges in a shortest path between two points on the graph.

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : Isomap

Isomap model object

Examples

Create a model for a given dataset

>>> sm = SliceMatrix(api_key)
>>> iso = sm.Isomap(dataset = dataframe, D = 3, K = 10)

Get the embedding

>>> iso.embedding()
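The description of K defines geodesic distance as the number of edges in a shortest path on the K Nearest Neighbor graph. As a minimal sketch of that definition, the snippet below computes all pairwise hop counts with the Floyd-Warshall algorithm. The node names and edges are made up, geodesic_hops is a hypothetical helper, and treating the graph as undirected is an assumption.

```python
import math


def geodesic_hops(nodes, edges):
    """All-pairs hop counts on an undirected graph via Floyd-Warshall."""
    dist = {a: {b: (0 if a == b else math.inf) for b in nodes} for a in nodes}
    for source, target in edges:
        dist[source][target] = dist[target][source] = 1
    for mid in nodes:
        for a in nodes:
            for b in nodes:
                # Going through `mid` is kept only if it shortens the path
                if dist[a][mid] + dist[mid][b] < dist[a][b]:
                    dist[a][b] = dist[a][mid] + dist[mid][b]
    return dist


nodes = ["p0", "p1", "p2", "p3", "p4"]
edges = [("p0", "p1"), ("p1", "p2"), ("p2", "p3")]  # p4 has no links
dist = geodesic_hops(nodes, edges)
print(dist["p0"]["p3"], dist["p0"]["p4"])
```

A node that is unreachable keeps an infinite distance, which is why too small a K can leave the neighbor graph disconnected and the embedding poorly defined.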

Methods

edges()

Get a list of all the edges in the KNN graph used to create the Isomap model

Returns:

edges : list

list of all edge / link tuples. Source is edge[0] Target is edge[1]

embedding(nodes=True)

Get the D dimensional embedding of the training data

I.e.

  1. Take input data in high dimensions
  2. Transform via Isomap to D dimensions

Parameters:

nodes : boolean, optional

Whether to return with node names. Default == True

Returns:

embedding : pandas.DataFrame

D dimensional embedding. shape = (n_rows, D)

neighborhood(node)

Get the nearest neighbors of the given node

Parameters:

node : string

The name of the target node we want to find the nearest neighbors for

Returns:

neighbors : dict

Dictionary of nearest neighbors with distances to target node

nodes()

Get the names of the data points / nodes that make up the training dataset

Returns:

nodes : list

Data point names / indices

Rank the links by geodesic distance

Returns:

links : dict

dictionary of links with associated geodesic distances

rankNodes(statistic='closeness_centrality')

Rank the model’s nodes by the given network graph statistic / factor

Parameters:

statistic : string [‘degree_centrality’ | ‘eigen_centrality’ | ‘closeness_centrality’ | ‘betweenness_centrality’ | ‘is_connected’ | ‘curr_flow_centrality’ | ‘pagerank’ | ‘hits’ | ‘communicability’ | ‘clustering’ | ‘square_clustering’ | ‘greedy_colors’ | ‘eccentricity’ | ‘clique_numbers’ | ‘number_of_cliques’ | ‘estrada_index’ | ‘assortivity’ | ‘transitivity’ | ‘avg_clustering’ | ‘maximal_matching’ | ‘max_weight_matching’ | ‘dispersion’]

The desired graph statistic

Returns:

stats : array-like

Depending on the statistic this will be an array or a single float value

recon_error()

Get the reconstruction error of the model.

Reconstruction error of the embedding

Returns:

recon_error : float

Reconstruction error for the model

search(point)

class slicematrixIO.manifolds.IsomapPipeline(name, D=2, K=3, client=None)

Bases: slicematrixIO.core.BasePipeline

Create a Pipeline for training Isomap models

Parameters:

name : string

The desired name of the Pipeline.

D : int, optional

The desired embedding dimension. Defaults to 2-D

K : integer greater than 1, optional, ignored if geodesic == False

The number of neighbors to use in building the geodesic distance matrix. Geodesic distance is constructed by computing the K Nearest Neighbors graph for the input dataset, then constructing all pairwise distances using the geodesic distance, i.e. the number of edges in a shortest path between two points on the graph.

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

response : dict

success or failure response to Pipeline creation request

Examples

Create an Isomap Pipeline for processing multiple datasets

>>> io = ConnectIO(api_key)
>>> iso_pipe = IsomapPipeline(name = slicematrixIO.utils.rando_name(), D = 3, K = 4, client = io)
>>> for dataframe in dataframes:
...     current_model = iso_pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())

Methods

run(dataset, model)

Run the Pipeline and create a new Isomap model

Parameters:

dataset : pandas.DataFrame

The dataset to pass into the Pipeline which will train an Isomap model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.

Returns:

response : dict

success or failure response to model creation request

class slicematrixIO.manifolds.KernelPCA(dataset=None, name=None, pipeline=None, D=2, kernel='linear', alpha=1.0, invert=False, kernel_params={}, client=None)

Kernel Principal Component Analysis model

For non-linear dimensionality reduction, simulation, classification, and regression.

Applies the kernel trick to PCA.
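As a rough sketch of what the kernel trick involves, the snippet below builds an RBF kernel matrix and double-centers it, which are the two steps that precede the eigendecomposition in Kernel PCA (the eigendecomposition itself is omitted to keep the example dependency-free). The function names and the gamma argument are assumptions for illustration; this documentation does not list the keys accepted in kernel_params.

```python
import math


def rbf_kernel_matrix(rows, gamma=1.0):
    """K[i][j] = exp(-gamma * squared Euclidean distance between rows i and j)."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y))) for y in rows]
        for x in rows
    ]


def center_kernel(K):
    """Double-center K so the implicit feature vectors have zero mean."""
    n = len(K)
    # K is symmetric, so row means and column means coincide
    row_means = [sum(row) / n for row in K]
    total_mean = sum(row_means) / n
    return [
        [K[i][j] - row_means[i] - row_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
K = rbf_kernel_matrix(rows, gamma=0.5)
K_centered = center_kernel(K)
for row in K_centered:
    print([round(value, 4) for value in row])
```

The leading D eigenvectors of the centered matrix, scaled by the square roots of their eigenvalues, would give the D dimensional embedding of the training rows.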

Parameters:

dataset : pandas.DataFrame, optional

The dataset to use in training the KernelPCA model. If None, then lazy loading is in effect and a name parameter should be given which matches an already created model. shape = (n_rows, n_features)

name : string, optional

The desired name of the model. If None then a random name will be generated. If dataset == None, then the name will be used to lazy load the model from the SliceMatrix-IO cloud.

D : int, optional

The desired embedding dimension. Defaults to 2-D

kernel : string, optional

The kernel function to use for the KernelPCA model. Default is linear.

alpha : float, optional

Parameter of ridge regression which learns the inverse transform. Ignored if invert == False

invert : boolean, optional

Whether to learn the inverse transform (from low dimension space back to high dimension space)

kernel_params : dict, optional

Any extra parameters specific to the chosen kernel

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : KernelPCA

KPCA model object

Examples

Create a KernelPCA model for a given dataset

>>> sm = SliceMatrix(api_key)
>>> kpca = sm.KernelPCA(dataset = dataframe, D = 5, kernel = "rbf")

Get the embedding

>>> kpca.embedding()

Learn the inverse transform

>>> kpca = sm.KernelPCA(dataset = dataframe, invert = True)
>>> kpca.inverse_embedding()

Methods

embedding(nodes=True)

Get the D dimensional embedding of the training data

I.e.

  1. Take input data in high dimensions
  2. Transform via KPCA to D dimensions

Parameters:

nodes : boolean, optional

Whether to return with node names. Default == True

Returns:

embedding : pandas.DataFrame

D dimensional embedding. shape = (n_rows, D)

feature_names()

Get the names of the features, if applicable

Returns:

meta : dict

Model feature names

inverse_embedding(nodes=True)

Get the inverse embedding of the training data in original dimensions

I.e.

  1. Take input data in high dimensions
  2. Transform via KPCA to D dimensions
  3. Transform back to high dimensions using model

Parameters:

nodes : boolean, optional

Whether to return with node names. Default == True

Returns:

inverse_embedding : pandas.DataFrame

Original dimension inverse embedding. shape = (n_rows, n_features)

meta()

Get the model metadata such as D, kernel name, etc...

Returns:

meta : dict

Model metadata

nodes()

Get the names of the data points / nodes that make up the training dataset

Returns:

nodes : list

Data point names / indices

class slicematrixIO.manifolds.KernelPCAPipeline(name, D=2, kernel='linear', alpha=1.0, invert=False, kernel_params={}, client=None)

Bases: slicematrixIO.core.BasePipeline

Pipeline for creating Kernel Principal Component Analysis models

For non-linear dimensionality reduction, simulation, classification, and regression.

Applies the kernel trick to PCA.

Parameters:

name : string

The desired name of the Pipeline.

D : int, optional

The desired embedding dimension. Defaults to 2-D

kernel : string, optional

The kernel function to use for the KernelPCA model. Default is linear.

alpha : float, optional

Parameter of ridge regression which learns the inverse transform. Ignored if invert == False

invert : boolean, optional

Whether to learn the inverse transform (from low dimension space back to high dimension space)

kernel_params : dict, optional

Any extra parameters specific to the chosen kernel

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

response : dict

success or failure response to Pipeline creation request

Examples

Create a KernelPCA Pipeline for processing multiple datasets

>>> io = ConnectIO(api_key)
>>> kpca_pipe = KernelPCAPipeline(name = slicematrixIO.utils.rando_name(), D = 5, kernel = "rbf", client = io)
>>> for dataframe in dataframes:
...     current_kpca_model = kpca_pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())

Methods

run(dataset, model)

Run the Pipeline and create a new KernelPCA model

Parameters:

dataset : pandas.DataFrame

The dataset to pass into the Pipeline which will train a KernelPCA model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.

Returns:

response : dict

success or failure response to model creation request

class slicematrixIO.manifolds.LaplacianEigenmapper(dataset=None, name=None, pipeline=None, D=2, affinity='knn', K=5, gamma=1.0, client=None)

Train / Reload a Laplacian Eigenmapper model

Parameters:

dataset : pandas.DataFrame

Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features

D : int, optional

The desired embedding dimension. Defaults to 2-D

affinity : string [“knn” | “rbf”], optional

How should we construct the affinity matrix?

“knn” : use k nearest neighbors graph

“rbf” : use radial basis function kernel

K : integer greater than 1, optional

The K to use if affinity is “knn”.

gamma : float, optional

Kernel coefficient for affinity “rbf”

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

model : LaplacianEigenmapper

Examples

Create a model for a given dataset

>>> sm = SliceMatrix(api_key)
>>> spectral = sm.LaplacianEigenmapper(dataset = dataframe, D = 3)

Get the embedding

>>> spectral.embedding()

Methods

affinity_matrix()

Get the affinity matrix used to perform the embedding

Returns:

affinity_matrix : matrix-like

Model affinity matrix shape = (n_rows, n_rows)
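Laplacian Eigenmaps derives its embedding from the graph Laplacian of this affinity matrix. As a minimal sketch, the snippet below forms the unnormalized Laplacian L = D - W from a small symmetric affinity matrix W. Whether the service uses the unnormalized or a normalized Laplacian is not stated here, so treat that choice, and the helper name graph_laplacian, as assumptions.

```python
def graph_laplacian(affinity):
    """Unnormalized graph Laplacian L = D - W for a symmetric affinity matrix W."""
    n = len(affinity)
    # The degree of a node is the total affinity on its row
    degrees = [sum(row) for row in affinity]
    return [
        [(degrees[i] if i == j else 0.0) - affinity[i][j] for j in range(n)]
        for i in range(n)
    ]


# Affinity of a three node chain: 0 - 1 - 2
W = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
L = graph_laplacian(W)
for row in L:
    print(row)
```

Every row of L sums to zero, so the constant vector is always an eigenvector with eigenvalue 0; the embedding coordinates come from the eigenvectors with the next smallest eigenvalues.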

embedding(nodes=True)

Get the D dimensional embedding of the training data

I.e.

  1. Take input data in high dimensions
  2. Transform via Laplacian Eigenmapper to D dimensions

Parameters:

nodes : boolean, optional

Whether to return with node names. Default == True

Returns:

embedding : pandas.DataFrame

D dimensional embedding. shape = (n_rows, D)

feature_names()

Get the names of the features, if applicable

Returns:

meta : dict

Model feature names

meta()

Get the model metadata such as D, affinity, etc...

Returns:

meta : dict

Model metadata

nodes()

Get the names of the data points / nodes that make up the training dataset

Returns:

nodes : list

Data point names / indices

class slicematrixIO.manifolds.LaplacianEigenmapperPipeline(name, D=2, affinity='knn', K=5, gamma=1.0, client=None)

Bases: slicematrixIO.core.BasePipeline

Create a Laplacian Eigenmapper Pipeline for creating LaplacianEigenmapper models from input training datasets

Parameters:

name : string

The desired name of the Pipeline.

D : int, optional

The desired embedding dimension. Defaults to 2-D

affinity : string [“knn” | “rbf”], optional

How should we construct the affinity matrix?

“knn” : use k nearest neighbors graph

“rbf” : use radial basis function kernel

K : integer greater than 1, optional

The K to use if affinity is “knn”.

gamma : float, optional

Kernel coefficient for affinity “rbf”

Returns:

response : dict

success or failure response to Pipeline creation request

Examples

Create a Pipeline for processing multiple datasets into LaplacianEigenmapper models

>>> io = ConnectIO(api_key)
>>> spectral_pipe = LaplacianEigenmapperPipeline(name = slicematrixIO.utils.rando_name(), D = 5, client = io)
>>> for dataframe in dataframes:
...     current_model = spectral_pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())

Methods

run(dataset, model)

Run the Pipeline and create a new LaplacianEigenmapper model

Parameters:

dataset : pandas.DataFrame

The dataset to pass into the Pipeline which will train a LaplacianEigenmapper model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.

Returns:

response : dict

success or failure response to model creation request

class slicematrixIO.manifolds.LocalLinearEmbedder(dataset=None, name=None, pipeline=None, D=2, K=3, method='standard', client=None)

Train / Reload a Local Linear Embedder (LLE) model

Parameters:

name : string

The desired name of the model.

D : int, optional

The desired embedding dimension. Defaults to 2-D

K : integer greater than 1, optional

The number of neighbors to use in building the embedding. Default is 3

method : string [‘standard’ | ‘hessian’ | ‘modified’ | ‘ltsa’]

Which LLE algorithm should we use?

‘standard’ : standard LLE method

‘hessian’: hessian eigenmap LLE method, requires that K > D * (1 + (D + 1) / 2)

‘modified’ : modified LLE method

‘ltsa’: local tangent space alignment LLE method
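As a quick check on the ‘hessian’ constraint, the sketch below computes the smallest K that satisfies K > D * (1 + (D + 1) / 2) for a given embedding dimension. The helper name min_hessian_k is hypothetical and is not part of the SDK.

```python
import math


def min_hessian_k(D):
    """Smallest integer K with K > D * (1 + (D + 1) / 2)."""
    bound = D * (1 + (D + 1) / 2)
    return math.floor(bound) + 1


# smallest valid K for a few embedding dimensions
smallest_k = {D: min_hessian_k(D) for D in (1, 2, 3)}
print(smallest_k)
```

Note that the default K of 3 only satisfies the constraint for D = 1, so a larger K would be needed when combining method = ‘hessian’ with the default D = 2.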

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

Returns:

model : LocalLinearEmbedder

LLE model object

Examples

Create a LLE model for a given dataset

>>> sm = SliceMatrix(api_key)
>>> lle = sm.LocalLinearEmbedder(dataset = dataframe, D = 2)

Methods

embedding(nodes=True)

Get the D dimensional embedding of the training data

I.e.

  1. Take input data in high dimensions
  2. Transform via LLE to D dimensions

Parameters:

nodes : boolean, optional

Whether to return with node names. Default == True

Returns:

embedding : pandas.DataFrame

D dimensional embedding. shape = (n_rows, D)

feature_names()

Get the names of the features, if applicable

Returns:

meta : dict

Model feature names

meta()

Get the model metadata such as D, method, etc...

Returns:

meta : dict

Model metadata

nodes()

Get the names of the data points / nodes that make up the training dataset

Returns:

nodes : list

Data point names / indices

recon_error()

Get the reconstruction error of the LLE embedding.

Returns:

recon_error : float

Reconstruction error for the model

class slicematrixIO.manifolds.LocalLinearEmbedderPipeline(name, D=2, K=3, method='standard', client=None)

Bases: slicematrixIO.core.BasePipeline

Create a Pipeline for training Local Linear Embedder models

Parameters:

name : string

The desired name of the Pipeline.

D : int, optional

The desired embedding dimension. Defaults to 2-D

K : integer greater than 1, optional

The number of neighbors to use in building the embedding. Default is 3

method : string [‘standard’ | ‘hessian’ | ‘modified’ | ‘ltsa’]

Which LLE algorithm should we use?

‘standard’ : standard LLE method

‘hessian’: hessian eigenmap LLE method, requires that K > D * (1 + (D + 1) / 2)

‘modified’ : modified LLE method

‘ltsa’: local tangent space alignment LLE method

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

response : dict

success or failure response to Pipeline creation request

Examples

Create a LocalLinearEmbedder Pipeline for processing multiple datasets

>>> io = ConnectIO(api_key)
>>> lle_pipe = LocalLinearEmbedderPipeline(D = 2, client = io)
>>> for dataframe in dataframes:
...     current_model = lle_pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())

Methods

run(dataset, model)

Run the Pipeline and create a new LocalLinearEmbedder model

Parameters:

dataset : pandas.DataFrame

The dataset to pass into the Pipeline which will train a LocalLinearEmbedder model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.

Returns:

response : dict

success or failure response to model creation request

slicematrixIO.matrices module

Distance / Similarity Matrix Models

Generalization of the correlation matrix for different metrics / kernels / similarity measures

class slicematrixIO.matrices.DistanceMatrix(dataset=None, name=None, pipeline=None, K=5, kernel='euclidean', geodesic=False, kernel_params={}, client=None)

Train / Reload a DistanceMatrix model

Parameters:

name : string, optional

The desired name of the model. If None a random name will be generated

K : integer greater than 1, optional, ignored if geodesic == False

The number of neighbors to use in building the geodesic distance matrix. Geodesic distance is constructed by computing the K Nearest Neighbors graph for the input dataset, then constructing all pairwise distances using the geodesic distance, i.e. the number of edges in a shortest path between two points on the graph.
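To make the geodesic construction concrete, here is a minimal offline sketch: build a K Nearest Neighbors graph, then count edges along shortest paths with a breadth-first search. It is not the service's implementation. The function names are hypothetical, and treating the graph as undirected is an assumption.

```python
from collections import deque
import math


def knn_graph(points, K):
    """Undirected K nearest neighbors graph as {index: set(neighbor indices)}."""
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        # rank every other point by Euclidean distance to p
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:K]:
            graph[i].add(j)
            graph[j].add(i)  # symmetrize (an assumption about the service)
    return graph


def geodesic_hops(graph, source):
    """Number of edges on a shortest path from source to every reachable node."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


# five points on a line with slowly growing gaps (no distance ties)
points = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,)]
graph = knn_graph(points, K=2)
hops_from_first = geodesic_hops(graph, 0)
print(hops_from_first)
```

Repeating the search from every node would fill in the full pairwise geodesic distance matrix.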

kernel : string, optional

The distance kernel / metric to use in constructing the distance matrix. Default is euclidean.

kernel_params : dict, optional

Any extra parameters specific to the chosen kernel

geodesic : boolean, optional

Whether to create the geodesic distance matrix or the brute force pairwise distance matrix. Default is False

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

pipeline : string, optional

An extant DistanceMatrixPipeline to use for model creation. If None then one will be created

Methods

getKeys()

Get the names of the datapoints in the model’s training dataset

Returns:

keys : list

The names of the datapoints in the model’s training dataset

rankDist(target, page=0)

Get the closest datapoints to the given target

Parameters:

page : integer, optional

The current page. Responses come in chunks of 100. To iterate through the full list increase the page number.

Returns:

distances : pandas.DataFrame

DataFrame with list of datapoints sorted by distance from target point
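Because results arrive 100 at a time, walking the full ranking means calling rankDist repeatedly with an increasing page number. The generator below is one possible sketch. The stopping rule (an empty or short page marks the end) is an assumption about the service, and FakeDistanceMatrix is an offline stand-in used only so the example runs without an API key.

```python
def iter_rank_dist(model, target, page_size=100):
    """Yield successive rankDist pages for target until the results run out."""
    page = 0
    while True:
        chunk = model.rankDist(target, page=page)
        if len(chunk) == 0:
            break  # assumed: an empty page means nothing is left
        yield chunk
        if len(chunk) < page_size:
            break  # assumed: a short page is the last page
        page += 1


class FakeDistanceMatrix:
    """Offline stand-in that mimics only the paging behaviour."""

    def __init__(self, n_points):
        self.rows = list(range(n_points))

    def rankDist(self, target, page=0):
        return self.rows[page * 100:(page + 1) * 100]


pages = list(iter_rank_dist(FakeDistanceMatrix(250), target="any_key"))
page_sizes = [len(p) for p in pages]
print(page_sizes)
```

With a real DistanceMatrix model each page would be a pandas.DataFrame, which also supports len(), so the same loop should apply.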

class slicematrixIO.matrices.DistanceMatrixPipeline(name, kernel='euclidean', geodesic=False, K=5, kernel_params={}, client=None)

Bases: slicematrixIO.core.BasePipeline

Create a Pipeline to train DistanceMatrix models from input datasets

Parameters:

name : string, optional

The desired name of the Pipeline.

K : integer greater than 1, optional, ignored if geodesic == False

The number of neighbors to use in building the geodesic distance matrix. Geodesic distance is constructed by computing the K Nearest Neighbors graph for the input dataset, then constructing all pairwise distances using the geodesic distance, i.e. the number of edges in a shortest path between two points on the graph.

kernel : string, optional

The distance kernel / metric to use in constructing the distance matrix. Default is euclidean.

kernel_params : dict, optional

Any extra parameters specific to the chosen kernel

geodesic : boolean, optional

Whether to create the geodesic distance matrix or the brute force pairwise distance matrix. Default is False

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

response : dict

success or failure response to Pipeline creation request

Examples

Create a Pipeline for processing multiple datasets into DistanceMatrix models

>>> io = ConnectIO(api_key)
>>> matrix_pipe = DistanceMatrixPipeline(kernel = "correlation", client = io)
>>> for dataframe in dataframes:
...     current_model = matrix_pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())

Methods

run(dataset, model)

Run the Pipeline and create a new DistanceMatrix model

Parameters:

dataset : pandas.DataFrame

The dataset to pass into the Pipeline which will train a DistanceMatrix model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.

slicematrixIO.matrix_models module

class slicematrixIO.matrix_models.MatrixAgglomerator(label_dataset=None, matrix=None, matrix_name=None, matrix_type=None, name=None, pipeline=None, alpha=0.1, client=None)

Methods

getKeys()

rankDist(target, page=0)

class slicematrixIO.matrix_models.MatrixAgglomeratorPipeline(name, alpha=0.1, client=None)

Bases: slicematrixIO.core.BasePipeline

Methods

run(label_dataset, model, matrix=None, matrix_name=None, matrix_type=None)

class slicematrixIO.matrix_models.MatrixKernelPCA(matrix=None, matrix_name=None, matrix_type=None, name=None, pipeline=None, D=2, client=None)

Methods

embedding()

meta()

nodes()

class slicematrixIO.matrix_models.MatrixKernelPCAPipeline(name, D=2, client=None)

Bases: slicematrixIO.core.BasePipeline

Methods

run(model, matrix=None, matrix_name=None, matrix_type=None)

class slicematrixIO.matrix_models.MatrixMinimumSpanningTree(matrix=None, matrix_name=None, matrix_type=None, name=None, pipeline=None, client=None)

Methods

edges()

neighborhood(node)

nodes()

rankNodes(statistic='closeness_centrality')

class slicematrixIO.matrix_models.MatrixMinimumSpanningTreePipeline(name, client=None)

Bases: slicematrixIO.core.BasePipeline

Methods

run(model, matrix=None, matrix_name=None, matrix_type=None)

slicematrixIO.notebook module

Module containing all Jupyter Notebook related classes / functions

All of this is meant to be run inside a Jupyter Notebook. The resulting graphs can be shared as a notebook or as HTML.

class slicematrixIO.notebook.GraphEngine(sm)

Class for setting up and drawing graphs / visualizations of SliceMatrix-IO models directly in the Jupyter Notebook

Parameters:

sm : slicematrixIO.client.SliceMatrix

An extant client

Examples

Create the GraphEngine

>>> sm  = SliceMatrix(api_key)
>>> viz = GraphEngine(sm)

Initialize the notebook data

>>> viz.init_data()

Initialize the graph stylesheet

>>> viz.init_style()

Then visualize a model

>>> iso = sm.Isomap(dataset=prices)
>>> viz.drawNetworkGraph(iso, width = 1000, height = 600, color_map = "Heat")

You can then save and export the notebook for sharing your graph. HTML exports will render directly in the browser.

For another example check out https://slicematrix.github.io/manifold_learning_js.html

Methods

drawNetworkGraph(network_model, color_map='RdBuGn', graph_style='light', graph_layout='force', width=1000, height=600, charge=-100, color_axis='closeness_centrality', label_color='#000', label_shadow_color='#fffff0', min_node_size=5)

Embed a D3 network graph into a Jupyter Notebook to visualize a SliceMatrix-IO network graph model. Graphs embedded in notebooks can be shared.

Parameters:

network_model : graph-object

color_map : string

The desired color map for the node colors. Nodes are colored relative to their color_axis (the graph node statistic) selection.

Mappings go from min value to median value to max value

  • ‘RdBuGn’ : Red to Blue to Green
  • ‘RdGrGn’ : Red to Gray to Green
  • ‘PuBuXr’ : A purple to blue x-ray effect where nodes near the median appear to disappear on a dark background
  • ‘Viridis’ : The Viridis color map, good for dark or light backgrounds
  • ‘Heat’ : A Red/Orange colormap with darker hues at the extremes
  • ‘Winter’ : A Blue/Green colormap
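One way to picture a min to median to max mapping is a three-stop color ramp. The sketch below interpolates linearly between three RGB stops. It only illustrates the idea: the function name, the linear interpolation, and the RGB values chosen for RD_BU_GN are all assumptions, not the colors or code used by GraphEngine.

```python
def three_stop_color(value, v_min, v_median, v_max, stops):
    """Interpolate between three RGB stops anchored at min, median and max."""
    low, mid, high = stops
    if value <= v_median:
        span = v_median - v_min
        t = 0.0 if span == 0 else (value - v_min) / span
        start, end = low, mid
    else:
        span = v_max - v_median
        t = 0.0 if span == 0 else (value - v_median) / span
        start, end = mid, high
    t = min(max(t, 0.0), 1.0)  # clamp values outside [v_min, v_max]
    return tuple(round(a + t * (b - a)) for a, b in zip(start, end))


# hypothetical red -> blue -> green stops
RD_BU_GN = ((255, 0, 0), (0, 0, 255), (0, 128, 0))

colors = [three_stop_color(v, 0.0, 5.0, 10.0, RD_BU_GN) for v in (0.0, 5.0, 10.0)]
print(colors)
```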

graph_style : string [‘light’ | ‘dark’]

The overall styling of the graph. Light background vs dark background...

graph_layout : string [‘force’ | ‘embedding’]

The layout algorithm for the network graph.

  • ‘force’ : network layout (node positioning) will be determined by a force directed simulation
  • ‘embedding’ : node positioning will be static and determined by the positions returned in model.embedding(). For models without a .embedding() function, enabling this option may cause the graph to fail to display properly

width : integer, greater than 0

The desired width of the network graph

height : integer, greater than 0

The desired height of the network graph

charge : integer, less than 0

For graph_layout == ‘force’, ignored otherwise. The charge associated with each node for use in the force directed simulation layout. The more negative charge, the more the nodes tend to repel one another

color_axis : string

The name of the graph statistic to use for coloring the graph nodes. Should be valid statistic name for call to model.rankNodes()

label_color : string

The color of the node labels. Defaults to black. Accepts valid html color (e.g. #fff or rgba(255,255,255,0.8))

label_shadow_color : string

The color of the node label shadow. Defaults to “#fffff0”

min_node_size : integer, greater than 0

The minimum size to make the graph nodes. Defaults to 5

Returns:

html : IPython.display.javascript

A html + javascript network graph chart embedded in the Jupyter Notebook

init_data()

Initialize the window’s graph data

Returns:

js : IPython.display.javascript

A javascript code block

init_style()

Initialize the notebook’s graph style

Returns:

js : IPython.display.javascript

A javascript code block

slicematrixIO.regressors module

Regressors are machine learning models which learn a function between an input (X) and an output (Y).

In particular, SliceMatrix-IO offers a number of what are known as “multi-output” regression models.

This is a special type of regression which can have an output with a dimension greater than 1, useful for:

  • Prediction
  • Out of Sample Manifold Learning
  • As a step within a classification workflow

class slicematrixIO.regressors.KNNRegressor(X=None, Y=None, name=None, pipeline=None, K=5, kernel='euclidean', algo='auto', weights='uniform', kernel_params={}, client=None)

Train / Reload a KNNRegressor model for multi-output regression

Parameters:

X : pandas.DataFrame

Input DataFrame. shape = (n_rows, input_features) where each row is a data point and the columns are numeric features

Y : pandas.DataFrame

Output DataFrame. shape = (n_rows, output_features) where output_features < input_features and each row is a data point and the columns are numeric features

K : integer, optional

The desired K (number of neighbors) in the Nearest Neighbor regressor model

kernel : string [ ‘euclidean’ | ‘minkowski’ | ‘hammond’ | ‘etc...’], optional

The desired kernel for defining distance in the regressor. Default is ‘euclidean’

algo : string [‘auto’ | ‘ball’ | ‘kd_tree’ | ‘brute’], optional

The algorithm to use in determining Nearest Neighbors. Default is ‘auto’

weights : string [‘uniform’ | ‘weighted’], optional

Whether neighbors contribute uniformly (i.e. independent of distance) or weighted by distance (i.e. closer neighbors carry more weight in the prediction)
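The difference between the two weighting schemes can be seen in a small offline sketch of multi-output K Nearest Neighbors regression. This is not the service's code: knn_predict is a hypothetical helper, and the use of inverse-distance weights for ‘weighted’ is an assumption.

```python
import math


def knn_predict(X, Y, point, K=5, weights="uniform"):
    """Combine the Y rows belonging to the K rows of X nearest to point."""
    ranked = sorted(range(len(X)), key=lambda i: math.dist(X[i], point))[:K]
    if weights == "uniform":
        w = [1.0] * len(ranked)
    else:
        # inverse-distance weighting (an assumption about 'weighted')
        dists = [math.dist(X[i], point) for i in ranked]
        if dists[0] == 0.0:
            return list(Y[ranked[0]])  # exact match: return its output row
        w = [1.0 / d for d in dists]
    total = sum(w)
    n_out = len(Y[0])
    return [sum(wi * Y[i][k] for wi, i in zip(w, ranked)) / total
            for k in range(n_out)]


# three training points with one input feature and two output features
X = [(0.0,), (1.0,), (3.0,)]
Y = [(0.0, 0.0), (10.0, 1.0), (30.0, 3.0)]

uniform = knn_predict(X, Y, (2.5,), K=2, weights="uniform")
weighted = knn_predict(X, Y, (2.5,), K=2, weights="weighted")
print(uniform, weighted)
```

In this toy case the weighted prediction is pulled toward the output of the closer neighbor at 3.0, while the uniform prediction is the plain average of the two neighbors.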

kernel_params : dict, optional

Any parameters specific to the chosen kernel

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

model : KNNRegressor

Trained K Nearest Neighbors Regressor model

Examples

Train a K Nearest Neighbors Regressor model

>>> sm = SliceMatrix(api_key)
>>> knn = sm.KNNRegressor(X = X, Y = Y, K = 3)

Make a prediction

>>> knn.predict([...])

Methods

predict(point)

Make a prediction using the given input features.

Also used for out of sample manifold learning.

I.e.

  1. Perform manifold learning embedding of input data (high dimension, H) to low dimension (D, D < H), however
    • Many manifold learning algorithms don’t have straightforward out of sample generalizations...
  2. Learn the “interpolation” function between high dim space and low dim space with a multi-output regression
    • Regress high dim (H) data points against the embedding (D) data points to learn the manifold embedding
  3. When presented with a new data point (an H dimension vector, or tensor, or whatever term is fashionable), “embed” it using the multi-output regression to output a D dimension vector

Parameters:

point : list

List of points to use as inputs to a prediction
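The out of sample workflow described for predict might be wired together as in the sketch below, which uses only calls documented on this page (LocalLinearEmbedder, embedding, KNNRegressor). The helper name fit_out_of_sample_embedder is hypothetical, and the sketch assumes that each row of high_dim is one data point and that the rows returned by embedding() line up with those rows.

```python
def fit_out_of_sample_embedder(sm, high_dim, D=2, K=5):
    """Train an LLE model, then a KNNRegressor that maps points into its embedding.

    sm       : an extant SliceMatrix client
    high_dim : pandas.DataFrame, one data point per row (H numeric columns)
    """
    # step 1: embed the training data from H dimensions down to D
    lle = sm.LocalLinearEmbedder(dataset=high_dim, D=D)
    low_dim = lle.embedding()

    # step 2: regress the high dimensional points against their embedding
    # (assumes the rows of low_dim line up with the rows of high_dim)
    regressor = sm.KNNRegressor(X=high_dim, Y=low_dim, K=K)
    return regressor
```

A new H dimensional point would then be embedded with regressor.predict(point), which corresponds to step 3.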

score()

Get the R^2 of the training dataset / predictions

Returns:

r2 : float

The R^2 of the training dataset

class slicematrixIO.regressors.KNNRegressorPipeline(name, K=5, kernel='euclidean', algo='auto', weights='uniform', kernel_params={}, client=None)

Bases: slicematrixIO.core.BasePipeline

K Nearest Neighbors Regression.

Create a Pipeline for training KNNRegressor models.

Parameters:

name : string

The desired name of the Pipeline.

K : integer, optional

The desired K (number of neighbors) in the Nearest Neighbor regressor model

kernel : string [ ‘euclidean’ | ‘minkowski’ | ‘hammond’ | ‘etc...’], optional

The desired kernel for defining distance in the regressor. Default is ‘euclidean’

algo : string [‘auto’ | ‘ball’ | ‘kd_tree’ | ‘brute’], optional

The algorithm to use in determining Nearest Neighbors. Default is ‘auto’

weights : string [‘uniform’ | ‘weighted’], optional

Whether neighbors contribute uniformly (i.e. independent of distance) or weighted by distance (i.e. closer neighbors carry more weight in the prediction)

kernel_params : dict, optional

Any parameters specific to the chosen kernel

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

response : dict

success or failure response to Pipeline creation request

Examples

Create a Pipeline for training multiple KNNRegressor models

>>> io = ConnectIO(api_key)
>>> pipe = KNNRegressorPipeline(K = 5, client = io)
>>> for X, Y in datasets:
...     current_model = pipe.run(X = X, Y = Y, model = slicematrixIO.utils.rando_name())

Methods

run(X, Y, model)

Run the Pipeline and create a new KNNRegressor model

Parameters:

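X : pandas.DataFrame

The input DataFrame to pass into the Pipeline. shape = (n_rows, input_features)

Y : pandas.DataFrame

The output DataFrame to pass into the Pipeline. shape = (n_rows, output_features)

The Pipeline will train a KNNRegressor model on X and Y using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.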

Returns:

response : dict

success or failure response to model creation request

class slicematrixIO.regressors.KernelRidgeRegressor(X=None, Y=None, name=None, pipeline=None, kernel='linear', alpha=1.0, kernel_params={}, client=None)

Train / Reload a KernelRidgeRegressor model for multi-output regression

Parameters:

X : pandas.DataFrame

Input DataFrame. shape = (n_rows, input_features) where each row is a data point and the columns are numeric features

Y : pandas.DataFrame

Output DataFrame. shape = (n_rows, output_features) where output_features < input_features and each row is a data point and the columns are numeric features

alpha : float, optional

Kernel Ridge Regressor model alpha value. Default 1.0

kernel : string [‘linear’ | ‘rbf’ | ‘poly’], optional

Kernel to use in regression. Default is ‘linear’. For nonlinear datasets, consider ‘rbf’ or ‘poly’

kernel_params : dict, optional

Any parameters specific to the chosen kernel

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

model : KernelRidgeRegressor

Trained Kernel Ridge Regressor model

Examples

Train a Kernel Ridge Regressor model

>>> sm = SliceMatrix(api_key)
>>> krr = sm.KernelRidgeRegressor(X = X, Y = Y, kernel = "rbf")

Make a prediction

>>> krr.predict([...])

Methods

predict(point)

Make a prediction using the given input features.

Also used for out of sample manifold learning.

I.e.

  1. Perform manifold learning embedding of input data (high dimension, H) to low dimension (D, D < H), however
    • Many manifold learning algorithms don’t have straightforward out of sample generalizations...
  2. Learn the “interpolation” function between high dim space and low dim space with a multi-output regression
    • Regress high dim (H) data points against the embedding (D) data points to learn the manifold embedding
  3. When presented with a new data point (an H dimension vector, or tensor, or whatever term is fashionable), “embed” it using the multi-output regression to output a D dimension vector

Parameters:

point : list

List of points to use as inputs to a prediction

score()

Get the R^2 of the training dataset / predictions

Returns:

r2 : float

The R^2 of the training dataset

class slicematrixIO.regressors.KernelRidgeRegressorPipeline(name, kernel='linear', alpha=1.0, kernel_params={}, client=None)

Bases: slicematrixIO.core.BasePipeline

Kernel Ridge Regression.

Create a Pipeline for training KernelRidgeRegressor models.

Parameters:

name : string

The desired name of the Pipeline.

alpha : float, optional

Kernel Ridge Regressor model alpha value. Default 1.0

kernel : string [‘linear’, ‘rbf’, ‘poly’]

Kernel to use in regression. Linear is default. For nonlinear datasets, consider rbf or poly

‘linear’ : linear kernel

‘rbf’ : radial basis function kernel

‘poly’ : polynomial kernel
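For orientation, the three kernels can be written out as plain functions in their textbook forms. This is only a sketch: the keyword names gamma, degree and coef0 follow common convention and are assumptions, not confirmed keys of kernel_params.

```python
import math


def linear_kernel(x, y):
    """Dot product of x and y."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """exp(-gamma * squared Euclidean distance between x and y)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def poly_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <x, y> + coef0) raised to the given degree."""
    return (gamma * linear_kernel(x, y) + coef0) ** degree


x, y = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(x, y), rbf_kernel(x, y, gamma=0.5), poly_kernel(x, y, degree=2))
```

The linear kernel grows with the dot product, while the rbf kernel decays with distance, which is why rbf or poly are the usual suggestions for nonlinear datasets.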

kernel_params : dict

Kernel specific parameters

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

response : dict

success or failure response to Pipeline creation request

Examples

Create a Pipeline for training multiple KernelRidgeRegressor models

>>> io = ConnectIO(api_key)
>>> pipe = KernelRidgeRegressorPipeline(kernel = "rbf", client = io)
>>> for X, Y in datasets:
...     current_model = pipe.run(X = X, Y = Y, model = slicematrixIO.utils.rando_name())

Methods

run(X, Y, model)

Run the Pipeline and create a new KernelRidgeRegressor model

Parameters:

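X : pandas.DataFrame

The input DataFrame to pass into the Pipeline. shape = (n_rows, input_features)

Y : pandas.DataFrame

The output DataFrame to pass into the Pipeline. shape = (n_rows, output_features)

The Pipeline will train a KernelRidgeRegressor model on X and Y using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.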

Returns:

response : dict

success or failure response to model creation request

class slicematrixIO.regressors.RFRegressor(X=None, Y=None, name=None, pipeline=None, n_trees=8, client=None)

Train / Reload a RFRegressor model for multi-output regression

A Random Forest Regressor finds a function which maps the input space (X) to the lower dimension output space (Y) using decision trees

Parameters:

X : pandas.DataFrame

Input DataFrame. shape = (n_rows, input_features) where each row is a data point and the columns are numeric features

Y : pandas.DataFrame

Output DataFrame. shape = (n_rows, output_features) where output_features < input_features and each row is a data point and the columns are numeric features

n_trees : integer greater than 1, optional

The number of trees to use in construction of the Random Forest model. Default is 8 trees

name : string, optional

The desired name of the model. If None then a random name will be generated

pipeline : string, optional

An extant Pipeline to use for model creation. If None then one will be created

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

model : RFRegressor

Trained Random Forest Regressor model

Examples

Train a Random Forest Regressor model

>>> sm = SliceMatrix(api_key)
>>> rfr = sm.RFRegressor(X = X, Y = Y, n_trees = 50)

Make a prediction

>>> rfr.predict([...])

Methods

predict(point)

Make a prediction using the given input features.

Also used for out of sample manifold learning.

I.e.

  1. Perform manifold learning embedding of input data (high dimension, H) to low dimension (D, D < H), however
    • Many manifold learning algorithms don’t have straightforward out of sample generalizations...
  2. Learn the “interpolation” function between high dim space and low dim space with a multi-output regression
    • Regress high dim (H) data points against the embedding (D) data points to learn the manifold embedding
  3. When presented with a new data point (an H dimension vector, or tensor, or whatever term is fashionable), “embed” it using the multi-output regression to output a D dimension vector

Parameters:

point : list

List of points to use as inputs to a prediction

score()

Get the R^2 of the training dataset / predictions

Returns:

r2 : float

The R^2 of the training dataset

class slicematrixIO.regressors.RFRegressorPipeline(name, n_trees=8, client=None)

Bases: slicematrixIO.core.BasePipeline

Random Forest Regression.

Create a Pipeline for training RFRegressor models.

Parameters:

name : string

The desired name of the Pipeline.

n_trees : integer, greater than 0

The number of trees to use in the regression forest

client : slicematrixIO.connect.ConnectIO

Low level client for dispatching requests to SliceMatrix-IO

Returns:

response : dict

success or failure response to Pipeline creation request

Examples

Create a Pipeline for training multiple RFRegressor models

>>> io = ConnectIO(api_key)
>>> pipe = RFRegressorPipeline(n_trees = 100, client = io)
>>> for X, Y in datasets:
...     current_model = pipe.run(X = X, Y = Y, model = slicematrixIO.utils.rando_name())

Methods

run(X, Y, model)

Run the Pipeline and create a new RFRegressor model

Parameters:

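X : pandas.DataFrame

The input DataFrame to pass into the Pipeline. shape = (n_rows, input_features)

Y : pandas.DataFrame

The output DataFrame to pass into the Pipeline. shape = (n_rows, output_features)

The Pipeline will train a RFRegressor model on X and Y using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.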

Returns:

response : dict

success or failure response to model creation request

slicematrixIO.utils module

Useful utility functions

slicematrixIO.utils.r_squared(Y_hat, Y)

Get the coefficient of determination, or r-squared, for a given prediction versus its ground truths

Parameters:

Y_hat : pandas.DataFrame

The predicted values DataFrame

Y : pandas.DataFrame

The actual values DataFrame

Returns:

r_2 : float

The r-squared value
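For reference, the coefficient of determination is conventionally defined as 1 - SS_res / SS_tot. The sketch below applies that definition to two flat sequences of numbers. It is not the library's implementation: how r_squared aggregates across the columns of a multi-output DataFrame is not stated here, and the name r_squared_sketch is hypothetical.

```python
def r_squared_sketch(y_hat, y):
    """1 - SS_res / SS_tot for two equal-length sequences of numbers."""
    mean_y = sum(y) / len(y)
    ss_res = sum((actual - pred) ** 2 for actual, pred in zip(y, y_hat))
    ss_tot = sum((actual - mean_y) ** 2 for actual in y)
    return 1.0 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared_sketch(y, y)                     # predictions equal the truth
baseline = r_squared_sketch([2.5] * 4, y)            # always predict the mean
close = r_squared_sketch([1.1, 1.9, 3.2, 3.8], y)    # small errors
print(perfect, baseline, close)
```

A perfect prediction scores 1, and predicting the mean of Y for every point scores 0.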

slicematrixIO.utils.rando_name(type='short')

Generate a random name string

A longer name decreases the chance of a name collision that overwrites an existing model

Parameters:

type : string [“short” | “long”]

Whether to create a long or short name

Returns:

name : string

Random name

Module contents

slicematrixIO-python is the Python SDK for the SliceMatrix-IO Machine Learning Platform as a Service