slicematrixIO package¶
Submodules¶
slicematrixIO.bayesian_filters module¶
-
class
slicematrixIO.bayesian_filters.
KalmanOLS
(dataset=None, name=None, pipeline=None, init_alpha=None, init_beta=None, trans_cov=None, obs_cov=None, init_cov=None, optimizations=[], client=None)¶ Train / Reload a Kalman Filter model for online estimation of the parameters of Ordinary Least Squares (KalmanOLS)
Parameters: dataset: pandas.DataFrame
Input DataFrame. shape = (nrows, 2) where the first column is Y and the second is X in OLS model
init_alpha : float, optional
Initial value for alpha in OLS model (ignored if optimizations are enabled)
init_beta : float, optional
Initial value for beta in OLS model (ignored if optimizations are enabled)
trans_cov : array-like, optional
Transition covariance, shape = (2, 2)
init_cov : array-like, optional
Initial covariance, shape = (2, 2)
optimizations : list, optional
List of optimizations. Can include multiple optimizations. Default includes all:
- ‘transition_covariance’
- ‘observation_covariance’
- ‘initial_state_mean’
- ‘initial_state_covariance’
name : string, optional
The desired name of the model. If None then a random name will be generated
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
pipeline : BasePipeline, optional
Pipeline to use. Defaults to None. If None then a pipeline will be created for use in creating the model
Returns: model : KalmanOLS
Trained Kalman Filter model
Examples
Create a KalmanOLS model for a given dataset
>>> sm = SliceMatrix(api_key)
>>> kf = sm.KalmanOLS(dataset = dataframe)
Get the current internal state of the model (i.e. current alpha and beta and covariance)
>>> kf.getState()
Update the model with new information, and get the updated state
>>> kf.update(X = 128.17, Y = 45.85)
Methods
-
getState
()¶ Get the current internal state of the Kalman Filter OLS model
Returns: state : dict
Dictionary with the current state of the model, i.e. the means (Beta and Alpha, respectively) and the covariance
-
getTrainingData
()¶ Get the historical state of the model over time
Returns: history : dict
Historical state of both mean and covariance of the model over time.
-
update
(X, Y)¶ Step the model through a new learning iteration with new datapoints for input (X) and output (Y)
This will permanently change the state of the model as it adjusts to new information.
In a distributed setting, updates to the same KalmanOLS model are not guaranteed to be atomic.
Parameters: X : float
The newly observed value for the input of the OLS model (X)
Y : float
The newly observed value for the output of the OLS model (Y)
Returns: state : dict
Dictionary with the current state of the model, i.e. the means (Beta and Alpha, respectively) and the covariance
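The online update can be illustrated locally. The following is a minimal sketch, not the SliceMatrix-IO implementation: it assumes a random-walk state [beta, alpha], a scalar observation y = beta * x + alpha plus noise, and a textbook Kalman predict/update step. The function name `kalman_ols_step` and all of its arguments are hypothetical.

```python
def kalman_ols_step(mean, cov, x, y, trans_cov, obs_var):
    """One Kalman predict/update step for y = beta * x + alpha.

    mean is [beta, alpha]; cov and trans_cov are 2x2 nested lists;
    obs_var is the scalar observation noise variance (an assumed input).
    """
    # Predict: the state follows a random walk, so only the covariance grows.
    p = [[cov[i][j] + trans_cov[i][j] for j in range(2)] for i in range(2)]
    h = [x, 1.0]  # observation row, so that y ~ h . [beta, alpha]
    # Innovation (prediction error) and its variance.
    resid = y - (h[0] * mean[0] + h[1] * mean[1])
    ph = [p[i][0] * h[0] + p[i][1] * h[1] for i in range(2)]  # P h^T
    hp = [h[0] * p[0][j] + h[1] * p[1][j] for j in range(2)]  # h P
    s = h[0] * ph[0] + h[1] * ph[1] + obs_var
    gain = [ph[i] / s for i in range(2)]
    # Update the mean and covariance with the Kalman gain.
    new_mean = [mean[i] + gain[i] * resid for i in range(2)]
    new_cov = [[p[i][j] - gain[i] * hp[j] for j in range(2)] for i in range(2)]
    return new_mean, new_cov


# Feed noise-free points from y = 2x + 1, one at a time.
mean, cov = [0.0, 0.0], [[1000.0, 0.0], [0.0, 1000.0]]
no_drift = [[0.0, 0.0], [0.0, 0.0]]
for x in range(1, 51):
    mean, cov = kalman_ols_step(mean, cov, float(x), 2.0 * x + 1.0, no_drift, 1.0)
beta, alpha = mean
```

With a zero transition covariance the filter reduces to recursive least squares, so `beta` and `alpha` should end up close to 2 and 1. A non-zero `trans_cov` lets the estimates drift over time, which is the usual reason for preferring a Kalman filter to a one-off OLS fit.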
-
class
slicematrixIO.bayesian_filters.
KalmanOLSPipeline
(name, init_alpha=None, init_beta=None, trans_cov=None, obs_cov=None, init_cov=None, optimizations=[], client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Create a Pipeline for training
KalmanOLS models from input datasets
Parameters: name : string
The desired name of the Pipeline
init_alpha : float, optional
Initial value for alpha in OLS model (ignored if optimizations are enabled)
init_beta : float, optional
Initial value for beta in OLS model (ignored if optimizations are enabled)
trans_cov : array-like, optional
Transition covariance, shape = (2, 2)
init_cov : array-like, optional
Initial covariance, shape = (2, 2)
optimizations : list, optional
List of optimizations. Can include multiple optimizations. Default includes all:
- ‘transition_covariance’
- ‘observation_covariance’
- ‘initial_state_mean’
- ‘initial_state_covariance’
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: response : dict
success or failure response to Pipeline creation request
Examples
Create a KalmanOLSPipeline for processing multiple datasets
>>> io = ConnectIO(api_key)
>>> pipe = KalmanOLSPipeline(client = io)
>>> for dataframe in dataframes:
...     current_model = pipe.run(dataset = dataframe, name = slicematrixIO.utils.rando_name())
Methods
-
run
(dataset, model)¶ Run the Pipeline and create a new
KalmanOLS model
Parameters: dataset : pandas.DataFrame
The dataset to pass into the Pipeline which will train a KalmanOLS model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.
Returns: response : dict
success or failure response to model creation request
slicematrixIO.classifiers module¶
Classifier models are examples of supervised machine learning techniques which aim to predict the class label of a given input datapoint
-
class
slicematrixIO.classifiers.
KNNClassifier
(dataset=None, class_column=None, name=None, pipeline=None, K=5, kernel='euclidean', algo='auto', weights='uniform', kernel_params={}, client=None)¶ Train / Reload a
KNNClassifier model
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels
name : string
The desired name of the model.
class_column : string
The name of the column in the input dataset which describes the class labels
K : integer, optional
The desired K in the Nearest Neighbor classifier model
kernel : string [ ‘euclidean’ | ‘minkowski’ | ‘hammond’ | ‘etc...’], optional
The desired kernel for defining distance in our classifier. Default is ‘euclidean’
algo : string [‘auto’ | ‘ball’ | ‘kd_tree’ | ‘brute’], optional
The algorithm to use in determining Nearest Neighbors. Default is ‘auto’
weights : string [‘uniform’ | ‘weighted’], optional
Should voting be uniform (i.e. independent of distance) or weighted by distance (i.e. closer neighbors have more heavily weighted votes)
kernel_params : dict, optional
Any parameters specific to the chosen kernel
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns: model :
KNNClassifier
KNNClassifier model object
Examples
Create a KNNClassifier model for a given dataset
>>> sm = SliceMatrix(api_key)
>>> knn = sm.KNNClassifier(dataset = dataframe, K = 5)
Predict the class of some new data
>>> knn.predict([...])
Methods
-
predict
(point)¶ Predict the class of new input datapoints
Parameters: point : list
A list of new datapoints. Shape = (n_points, n_features)
Returns: prediction : list
A list of new predictions for each input datapoint. Shape = (n_points, 1)
-
score
()¶ Get the training prediction R^2
Returns: r2 : float
The R^2 of the training predictions
-
training_data
()¶ Get the input data used to train the model
Returns: data : list
The training data
-
training_preds
()¶ Get the training predictions
Returns: prediction : list
A list of the training predictions
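To make the K and weights parameters concrete, here is a small local sketch of nearest-neighbor voting. It is an illustration under stated assumptions, not the service's implementation: the function `knn_predict` is hypothetical, distance is Euclidean, and "weighted" is taken to mean inverse-distance weighting.

```python
import math
from collections import defaultdict


def knn_predict(train_points, train_labels, query, k=5, weights="uniform"):
    """Classify one query point by voting among its k nearest neighbors."""
    # Sort training points by Euclidean distance to the query, keep the k closest.
    neighbors = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )[:k]
    votes = defaultdict(float)
    for distance, label in neighbors:
        if weights == "uniform":
            votes[label] += 1.0  # every neighbor counts equally
        else:
            votes[label] += 1.0 / (distance + 1e-12)  # closer neighbors count more
    return max(votes, key=votes.get)


points = [(0.0, 0.0), (3.0, 0.0), (3.1, 0.0)]
labels = ["a", "b", "b"]
uniform_vote = knn_predict(points, labels, (1.0, 0.0), k=3, weights="uniform")
weighted_vote = knn_predict(points, labels, (1.0, 0.0), k=3, weights="weighted")
```

In this toy dataset the two schemes disagree: uniform voting picks the majority class "b", while inverse-distance weighting favors the single nearby "a" point.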
-
-
class
slicematrixIO.classifiers.
KNNClassifierPipeline
(name, K=5, kernel='euclidean', algo='auto', weights='uniform', kernel_params={}, client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Create a Pipeline for training
KNNClassifier models from input datasets
Parameters: name : string
The desired name of the Pipeline.
K : integer, optional
The desired K in the Nearest Neighbor classifier model
kernel : string [ ‘euclidean’ | ‘minkowski’ | ‘hammond’ | ‘etc...’], optional
The desired kernel for defining distance in our classifier. Default is ‘euclidean’
algo : string [‘auto’ | ‘ball’ | ‘kd_tree’ | ‘brute’], optional
The algorithm to use in determining Nearest Neighbors. Default is ‘auto’
weights : string [‘uniform’ | ‘weighted’], optional
Should voting be uniform (i.e. independent of distance) or weighted by distance (i.e. closer neighbors have more heavily weighted votes)
kernel_params : dict, optional
Any parameters specific to the chosen kernel
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: response : dict
success or failure response to Pipeline creation request
Examples
Create a Pipeline for training multiple KNNClassifier models
>>> io = ConnectIO(api_key)
>>> pipe = KNNClassifierPipeline(K = 7, client = io)
>>> for dataframe in dataframes:
...     current_model = pipe.run(dataset = dataframe, name = slicematrixIO.utils.rando_name())
Methods
-
run
(dataset, model, class_column)¶ Run the Pipeline and create a new
KNNClassifier model
Parameters: dataset : pandas.DataFrame
The dataset to pass into the Pipeline which will train a KNNClassifier model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.
Returns: response : dict
success or failure response to model creation request
-
-
class
slicematrixIO.classifiers.
PNNClassifier
(dataset, class_column, name=None, pipeline=None, sigma=0.1, client=None)¶ Train / Reload a
PNNClassifier model
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels
name : string
The desired name of the model.
class_column : string
The name of the column in the input dataset which describes the class labels
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns: model :
PNNClassifier
PNNClassifier model object
Examples
Create a PNNClassifier model for a given dataset
>>> sm = SliceMatrix(api_key)
>>> pnn = sm.PNNClassifier(dataset = dataframe, sigma = 0.12)
Predict the class of some new data
>>> pnn.predict([...])
Methods
-
predict
(point)¶ Predict the class of new input datapoints
Parameters: point : list
A list of new datapoints. Shape = (n_points, n_features)
Returns: prediction : list
A list of new predictions for each input datapoint. Shape = (n_points, 1)
-
score
()¶ Get the training prediction R^2
Returns: r2 : float
The R^2 of the training predictions
-
training_data
()¶ Get the input data used to train the model
Returns: data : list
The training data
-
training_preds
()¶ Get the training predictions
Returns: prediction : list
A list of the training predictions
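The role of the smoothing parameter sigma can be sketched locally. This is a minimal illustration of a probabilistic neural network viewed as a Gaussian Parzen-window classifier. Whether SliceMatrix-IO uses exactly this kernel and normalization is an assumption, and `pnn_predict` is a hypothetical helper.

```python
import math
from collections import defaultdict


def pnn_predict(train_points, train_labels, query, sigma=0.1):
    """Pick the class whose average Gaussian kernel response is largest."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for point, label in zip(train_points, train_labels):
        sq_dist = sum((a - b) ** 2 for a, b in zip(point, query))
        # Each training point contributes a Gaussian bump of width sigma.
        totals[label] += math.exp(-sq_dist / (2.0 * sigma ** 2))
        counts[label] += 1
    # Average per class so that large classes do not win by size alone.
    scores = {label: totals[label] / counts[label] for label in totals}
    return max(scores, key=scores.get)


train = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
classes = ["a", "a", "b", "b"]
near_a = pnn_predict(train, classes, (0.2, 0.1), sigma=0.5)
near_b = pnn_predict(train, classes, (0.9, 1.0), sigma=0.5)
```

A smaller sigma makes each training point's influence more local, while a larger sigma smooths the class boundaries.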
-
-
class
slicematrixIO.classifiers.
PNNClassifierPipeline
(name, sigma=0.1, client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Create a Pipeline for training
PNNClassifier models from input datasets
Parameters: name : string
The desired name of the Pipeline.
sigma : float in (0., 1.), optional
The desired smoothing parameter for the PNN model. Default is 0.1
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: response : dict
success or failure response to Pipeline creation request
Examples
Create a Pipeline for training multiple PNNClassifier models
>>> io = ConnectIO(api_key)
>>> pipe = PNNClassifierPipeline(sigma = 0.05, client = io)
>>> for dataframe in dataframes:
...     current_model = pipe.run(dataset = dataframe, name = slicematrixIO.utils.rando_name())
Methods
-
run
(dataset, model, class_column)¶ Run the Pipeline and create a new
PNNClassifier model
Parameters: dataset : pandas.DataFrame
The dataset to pass into the Pipeline which will train a PNNClassifier model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.
Returns: response : dict
success or failure response to model creation request
-
slicematrixIO.client module¶
High Level Python Client for the SliceMatrix-IO Machine Learning PaaS
-
class
slicematrixIO.client.
SliceMatrix
(api_key, region='us-east-1')¶ Main business object for slicematrixIO-python
Builds upon low level api (ConnectIO) to create high level objects for each model type.
The models are meant to be created by the client, as opposed to instantiated directly.
Parameters: api_key: string
A Valid SliceMatrix-IO API Key
region : string [‘us-east-1’, ‘us-west-1’, ‘eu-central-1’, ‘ap-southeast-1’]
Data center of choice. API Key must be valid for that specific data center. Latency will be lowest if client is closest to data center.
‘us-east-1’: US East Coast Data Center
‘us-west-1’: US West Coast Data Center
‘eu-central-1’: Continental Europe Data Center
‘ap-southeast-1’: South-East Asian Data Center
Examples
Create a Kernel Density Estimator model that lives in the cloud
>>> kde = sm.KernelDensityEstimator(dataset=df)
Score a new data point
>>> kde.score(10325632)
Simulate 1000 new data points
>>> kde.simulate(1000)
Manifold Learning:
>>> iso = sm.Isomap(dataset=prices)
Get statistics / factors related to internal graph structure of each node
>>> iso.rankNodes("pagerank")
Find low dimensional embedding of input data
>>> iso.embedding()
See www.slicematrix.com/use-cases for more in-depth examples
Attributes
client (ConnectIO) Low level SliceMatrix-IO Python client
Methods
-
BasicA2D
(dataset=None, retrain=True, name=None, pipeline=None)¶ Create a BasicA2D model (basic automatic anomaly detection)
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
retrain : boolean, optional
Whether to automatically retrain the model upon a remote call to the update method. The BasicA2D is a window detector which can be retrained in an online fashion where new data is used to update the model’s understanding of the world and influence future anomaly scoring.
Returns:
-
CorrelationFilteredGraph
(dataset=None, K=3, name=None, pipeline=None)¶ Create a Correlation Filtered Graph
CFGs are similar to MSTs in that both graphs begin with a distance matrix, but whereas MSTs are limited to constructing a tree, CFGs draw links between a node and its K closest neighbors based on correlation distance. CFGs are like KNN networks, but optimized for correlation distance.
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features
K : integer greater than 1, optional
The number of nearest neighbors to use for constructing the CFG
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns:
-
DistanceMatrix
(dataset=None, K=5, kernel='euclidean', kernel_params={}, geodesic=False, name=None, pipeline=None)¶ Create a Distance Matrix model
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
K : integer greater than 1, optional, ignored if geodesic == False
The number of neighbors to use in building the geodesic distance matrix. Geodesic distance is constructed by computing the K Nearest Neighbors graph for the input dataset, then constructing all pairwise distances using the geodesic distance, i.e. the number of edges in a shortest path between two points on the graph.
kernel : string, optional
The distance kernel / metric to use in constructing the distance matrix. Default is euclidean.
kernel_params : dict, optional
Any extra parameters specific to the chosen kernel
geodesic : boolean, optional
Whether to create the geodesic distance matrix or the brute force pairwise distance matrix. Default is False
Returns:
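The geodesic option described for K can be sketched locally as follows. This is an illustrative approximation only: it assumes Euclidean distances, an undirected (symmetrized) K Nearest Neighbors graph, and breadth-first search to count edges. The function `geodesic_hops` is hypothetical and is not part of slicematrixIO.

```python
import math
from collections import deque


def geodesic_hops(points, k):
    """Pairwise geodesic distance as edge counts on a symmetrized KNN graph."""
    n = len(points)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        # The k nearest neighbors of point i (ties broken by index).
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in ranked[:k]:
            graph[i].add(j)
            graph[j].add(i)  # treat edges as undirected
    hops = [[math.inf] * n for _ in range(n)]
    for source in range(n):
        # Breadth-first search gives shortest paths measured in edges.
        hops[source][source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if hops[source][neighbor] == math.inf:
                    hops[source][neighbor] = hops[source][node] + 1
                    queue.append(neighbor)
    return hops


line = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
hops = geodesic_hops(line, k=1)
```

For five evenly spaced points on a line with K = 1, the graph is a chain, so the two end points are four edges apart. Pairs that are not connected in the KNN graph keep an infinite distance in this sketch.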
-
IsolationForest
(dataset=None, rate=0.1, n_trees=100, name=None, pipeline=None)¶ Create an Isolation Forest model for automatic anomaly detection
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
rate : float in (0., 0.5), optional
The desired rate of anomaly detection in training data. Default is 0.1 i.e. 10%
n_trees : integer greater than 1, optional
The number of trees to use in construction of the Isolation Forest model. Default is 100 trees
Returns:
-
Isomap
(dataset=None, D=2, K=3, name=None, pipeline=None)¶ Create an Isomap model for non-linear dimensionality reduction
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features
D : int, optional
The desired embedding dimension. Defaults to 2-D
K : integer greater than 1, optional
The number of neighbors to use in building the geodesic distance matrix. Geodesic distance is constructed by computing the K Nearest Neighbors graph for the input dataset, then constructing all pairwise distances using the geodesic distance, i.e. the number of edges in a shortest path between two points on the graph.
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns: model :
slicematrixIO.manifolds.Isomap
-
KNNClassifier
(dataset=None, class_column=None, name=None, pipeline=None, K=5, kernel='euclidean', algo='auto', weights='uniform', kernel_params={})¶ Create K Nearest Neighbors Classifier model
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels
class_column : string
The name of the column in the input dataset which describes the class labels
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
K : integer, optional
The desired K in the Nearest Neighbor classifier model
kernel : string [ ‘euclidean’ | ‘minkowski’ | ‘hammond’ | ‘etc...’], optional
The desired kernel for defining distance in our classifier. Default is ‘euclidean’
algo : string [‘auto’ | ‘ball’ | ‘kd_tree’ | ‘brute’], optional
The algorithm to use in determining Nearest Neighbors. Default is ‘auto’
weights : string [‘uniform’ | ‘weighted’], optional
Should voting be uniform (i.e. independent of distance) or weighted by distance (i.e. closer neighbors have more heavily weighted votes)
kernel_params : dict, optional
Any parameters specific to the chosen kernel
Returns:
-
KNNRegressor
(X=None, Y=None, K=5, kernel='euclidean', algo='auto', weights='uniform', kernel_params={}, name=None, pipeline=None)¶ Create a K Nearest Neighbors Regressor model
Multi-output regression finds a function from the input space (X) to a lower dimension output space (Y)
Parameters: X : pandas.DataFrame
Input DataFrame. shape = (n_rows, input_features) where each row is a data point and the columns are numeric features
Y : pandas.DataFrame
Output DataFrame. shape = (n_rows, output_features) where output_features < input_features and each row is a data point and the columns are numeric features
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
K : integer, optional
The desired K in the Nearest Neighbor regressor model
kernel : string [ ‘euclidean’ | ‘minkowski’ | ‘hammond’ | ‘etc...’], optional
The desired kernel for defining distance in our regressor. Default is ‘euclidean’
algo : string [‘auto’ | ‘ball’ | ‘kd_tree’ | ‘brute’], optional
The algorithm to use in determining Nearest Neighbors. Default is ‘auto’
weights : string [‘uniform’ | ‘weighted’], optional
Should voting be uniform (i.e. independent of distance) or weighted by distance (i.e. closer neighbors have more heavily weighted predictions)
kernel_params : dict, optional
Any parameters specific to the chosen kernel
Returns:
-
KalmanOLS
(dataset=None, init_alpha=None, init_beta=None, trans_cov=None, obs_cov=None, init_cov=None, optimizations=[], name=None, pipeline=None)¶ Create slicematrixIO.bayesian_filters.KalmanOLS object with current client
The KalmanOLS model
Parameters: dataset: pandas.DataFrame
Input DataFrame. shape = (nrows, 2) where the first column is Y and the second is X in OLS model
init_alpha : float, optional
Initial value for alpha in OLS model (ignored if optimizations are enabled)
init_beta : float, optional
Initial value for beta in OLS model (ignored if optimizations are enabled)
trans_cov : array-like, optional
Transition covariance, shape = (2, 2)
init_cov : array-like, optional
Initial covariance, shape = (2, 2)
optimizations : list, optional
List of optimizations. Can include multiple optimizations. Default includes all:
- ‘transition_covariance’
- ‘observation_covariance’
- ‘initial_state_mean’
- ‘initial_state_covariance’
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : BasePipeline, optional
Pipeline to use. Defaults to None. If None then a pipeline will be created for use in creating the model
Returns:
-
KernelDensityEstimator
(dataset=None, bandwith='scott', kernel_params={}, name=None, pipeline=None)¶ Train a Kernel Density Estimator model
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
bandwidth : str [‘scott’ | ‘silverman’], optional
The method for bandwidth selection in the KDE model
kernel_params : dict, optional
Any parameters specific to the chosen kernel
Returns:
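The two bandwidth selection rules can be illustrated with a short local sketch. The formulas follow the conventions used by SciPy's `gaussian_kde` (a factor multiplied by the sample standard deviation). Whether SliceMatrix-IO applies the same conventions is an assumption, and both helper functions are hypothetical.

```python
import math
import statistics


def bandwidth_factor(n, d=1, method="scott"):
    """Rule-of-thumb bandwidth factor for n samples in d dimensions."""
    if method == "scott":
        return n ** (-1.0 / (d + 4))
    if method == "silverman":
        return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))
    raise ValueError("method must be 'scott' or 'silverman'")


def gaussian_kde_1d(samples, x, method="scott"):
    """Evaluate a 1-D Gaussian kernel density estimate at x."""
    # Bandwidth = rule-of-thumb factor times the sample standard deviation.
    h = bandwidth_factor(len(samples), 1, method) * statistics.stdev(samples)
    norm = len(samples) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples) / norm


samples = [1.0, 2.0, 3.0, 4.0, 5.0]
density_center = gaussian_kde_1d(samples, 3.0)
density_tail = gaussian_kde_1d(samples, 10.0)
```

For one-dimensional data the Silverman factor is slightly larger than the Scott factor, so it produces a somewhat smoother estimate from the same samples.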
-
KernelPCA
(dataset=None, D=2, kernel='linear', alpha=1.0, invert=False, kernel_params={}, name=None, pipeline=None)¶ Create a Kernel Principal Components Analysis model for non-linear dimensionality reduction. Applies the kernel trick to PCA
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features
D : int, optional
The desired embedding dimension. Defaults to 2-D
kernel : string, optional
The kernel to use. Default is linear.
alpha : float, optional
Parameter of ridge regression which learns the inverse transform. Ignored if invert == False
invert : boolean, optional
Whether to learn the inverse transform (from low dimension space back to high dimension space)
kernel_params : dict, optional
Any extra parameters specific to the chosen kernel
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns:
-
KernelRidgeRegressor
(X=None, Y=None, kernel='linear', alpha=1.0, kernel_params={}, name=None, pipeline=None)¶ Create a Kernel Ridge Regressor model
Parameters: X : pandas.DataFrame
Input DataFrame. shape = (n_rows, input_features) where each row is a data point and the columns are numeric features
Y : pandas.DataFrame
Output DataFrame. shape = (n_rows, output_features) where output_features < input_features and each row is a data point and the columns are numeric features
alpha : float, optional
Kernel Ridge Regressor model alpha value. Default 1.0
kernel : string [‘linear’, ‘rbf’, ‘poly’]
Kernel to use in regression. Linear is default. For nonlinear datasets, consider rbf or poly
‘linear’ : linear kernel
‘rbf’ : radial basis function kernel
‘poly’ : polynomial kernel
kernel_params : dict
Kernel specific parameters
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns:
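For reference, the three kernel choices can be written out as plain functions. This is a sketch using common textbook definitions. The parameter names `gamma`, `degree` and `coef0` follow scikit-learn's conventions and are assumptions here, since the keys accepted by kernel_params are not listed in this reference.

```python
import math


def linear_kernel(x, y):
    """Plain dot product."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function: decays with squared Euclidean distance."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def poly_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel built on top of the dot product."""
    return (gamma * linear_kernel(x, y) + coef0) ** degree


u, v = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(u, v)
k_rbf_same = rbf_kernel(u, u)
k_poly = poly_kernel(u, v, degree=2)
```

The linear kernel can only express linear relationships between X and Y, which is why rbf or poly is suggested for nonlinear datasets.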
-
LaplacianEigenmapper
(dataset=None, D=2, affinity='knn', K=5, gamma=1.0, name=None, pipeline=None)¶ Create a Laplacian Eigenmapper model, aka spectral embedder, for non-linear dimensionality reduction
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features
D : int, optional
The desired embedding dimension. Defaults to 2-D
affinity : string [“knn” | “rbf”], optional
How should we construct the affinity matrix?
“knn” : use k nearest neighbors graph
“rbf” : use radial basis function kernel
K : integer greater than 1, optional
The K to use if affinity is “knn”.
gamma : float, optional
Kernel coefficient for affinity “rbf”
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns:
-
LocalLinearEmbedder
(dataset=None, D=2, K=3, method='standard', name=None, pipeline=None)¶ Create a Local Linear Embedder model for non-linear dimensionality reduction
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features
D : int, optional
The desired embedding dimension. Defaults to 2-D
K : integer greater than 1, optional
The number of neighbors to use in building the embedding. Default is 3
method : string [‘standard’ | ‘hessian’ | ‘modified’ | ‘ltsa’]
Which LLE algorithm should we use?
‘standard’ : standard LLE method
‘hessian’: hessian eigenmap LLE method, requires that K > D * (1 + (D + 1) / 2)
‘modified’ : modified LLE method
‘ltsa’: local tangent space alignment LLE method
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns:
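The constraint on K for the ‘hessian’ method can be checked before a model is requested. The helper below is a hypothetical convenience, not part of slicematrixIO, and it reads the documented inequality as K > D * (1 + (D + 1) / 2), with the closing parenthesis at the end.

```python
def hessian_k_is_valid(k, d):
    """Check the documented constraint K > D * (1 + (D + 1) / 2)."""
    return k > d * (1 + (d + 1) / 2)


ok_for_2d = hessian_k_is_valid(6, 2)       # 6 > 5.0
too_small_for_2d = hessian_k_is_valid(5, 2)  # 5 > 5.0 fails
```

For the default D = 2 the threshold is 5, so the default K = 3 would not satisfy the requirement when method is ‘hessian’.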
-
MatrixAgglomerator
(label_dataset=None, alpha=0.1, matrix=None, matrix_name=None, matrix_type=None, name=None, pipeline=None)¶ Create a Matrix Agglomerator model. Essential for supervised manifold learning, this model takes a previously created matrix model as input and applies class label information to the similarity matrix. In a nutshell, this model pulls data points of the same class closer together, increasing the separability of the dataset.
Parameters: label_dataset : pandas.DataFrame
The class label information. shape = (n_rows, 1). n_rows should be same dimension as input matrix
alpha : float in (0., 1.)
The agglomeration factor, i.e. how much the class labels affect input distances. An alpha of 0 will have no effect, while 1.0 will pull data points of the same class completely together. The higher the value of alpha, the more information will be transferred to the distance matrix from the class labels. Higher alphas increase the in-sample performance but also increase the chance of over-fitting
matrix : object
Matrix model object; i.e. a class from
slicematrixIO.matrices
matrix_name : string, optional
The name of the existing matrix model. Optional if matrix is not None
matrix_type :
The type of the matrix model. Optional if matrix is not None. Required if using matrix_name
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns:
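One simple way to picture the effect of alpha is to shrink every same-class distance by a factor of (1 - alpha). This matches the two end points described for alpha (0 has no effect, 1.0 collapses same-class points together), but the exact formula used by SliceMatrix-IO is not documented here, so the sketch below is an assumption and `agglomerate` is a hypothetical function.

```python
def agglomerate(distances, labels, alpha=0.1):
    """Shrink same-class distances by a factor of (1 - alpha)."""
    n = len(labels)
    return [
        [
            # Only pairs that share a class label are pulled together.
            distances[i][j] * (1.0 - alpha) if labels[i] == labels[j] else distances[i][j]
            for j in range(n)
        ]
        for i in range(n)
    ]


dist = [[0.0, 2.0, 4.0], [2.0, 0.0, 3.0], [4.0, 3.0, 0.0]]
classes = ["a", "a", "b"]
half = agglomerate(dist, classes, alpha=0.5)
```

Distances between points of different classes are left untouched, so the separation between classes grows relative to the spread within each class.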
-
MatrixKernelPCA
(D=2, matrix=None, matrix_name=None, matrix_type=None, name=None, pipeline=None)¶ Decompose the input matrix and embed the input into a lower dimension space
Parameters: D : int, optional
The desired embedding dimension. Defaults to 2-D
matrix : object
Matrix model object; i.e. a class from slicematrixIO.matrices
matrix_name : string, optional
The name of the existing matrix model. Optional if matrix is not None
matrix_type :
The type of the matrix model. Optional if matrix is not None. Required if using matrix_name
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns:
-
MatrixMinimumSpanningTree
(matrix=None, matrix_name=None, matrix_type=None, name=None, pipeline=None)¶ Create a Minimum Spanning Tree model from a Distance Matrix model
This is an example of a Matrix Model, which creates a machine learning model using another already trained model as its input. You can think of this as model chaining.
In this case, this function takes a previously created Distance Matrix model and uses it to construct a network graph model called a Minimum Spanning Tree.
Parameters: matrix : object
Matrix model object; i.e. a class from slicematrixIO.matrices
matrix_name : string, optional
The name of the existing matrix model. Optional if matrix is not None
matrix_type :
The type of the matrix model. Optional if matrix is not None. Required if using matrix_name
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns: model :
slicematrixIO.matrix_models.MatrixMinimumSpanningTree
-
MinimumSpanningTree
(dataset=None, corr_method='pearson', name=None, pipeline=None)¶ Create a Minimum Spanning Tree graph model
MST models transform the input dataset into a distance matrix then construct a graph with the shortest possible total distance which visits all nodes without cycling (i.e. it creates a tree)
In particular, this model constructs the graph using the correlation matrix. For more flexible options in creating a MST graph, use slicematrixIO.matrix_models.MatrixMinimumSpanningTree in combination with a distance matrix
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features
corr_method : string [“pearson” | “spearman” | “kendall” ]
Which method should we use for computing the correlation matrix?
“pearson” : use the Pearson correlation coefficient
“spearman” : use Spearman’s rho
“kendall” : use Kendall’s tau
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns:
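The construction described above (correlation matrix, then distance matrix, then tree) can be sketched locally. The example below is illustrative only. It assumes the common correlation distance d = sqrt(2 * (1 - rho)) and uses Kruskal's algorithm. Neither choice is confirmed by this reference, and the function names are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length series."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def correlation_mst(series):
    """Kruskal's algorithm on distances d = sqrt(2 * (1 - rho))."""
    n = len(series)
    edges = sorted(
        (math.sqrt(max(0.0, 2.0 * (1.0 - pearson(series[i], series[j])))), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # adding this edge creates no cycle
            parent[root_i] = root_j
            tree.append((i, j, dist))
    return tree


# Each inner list is one series (for example, one column of a DataFrame).
series = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10.5], [5, 3, 4, 1, 2]]
tree = correlation_mst(series)
```

The resulting tree always has one edge fewer than the number of series, and the most strongly correlated pair is linked first.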
-
NeighborNetworkGraph
(dataset=None, K=3, kernel='euclidean', name=None, pipeline=None)¶ Create a K Nearest Neighbor Graph for the given dataset
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features
K : integer greater than 1, optional
The number of nearest neighbors to use for constructing the K Nearest Neighbor graph
kernel : string, optional
The distance kernel / metric to use in constructing the distance matrix. Default is euclidean.
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns:
-
PNNClassifier
(dataset=None, class_column=None, name=None, pipeline=None, sigma=0.1)¶ Create a Probabilistic Neural Network Classifier model
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features + 1) where each row is a data point and the columns are numeric features and a column with the class labels
class_column : string
The name of the column in the input dataset which describes the class labels
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
sigma : float in (0., 1.), optional
The desired smoothing parameter for the PNN model. Default is 0.1
Returns:
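The effect of sigma can be seen in a small local sketch of a Probabilistic Neural Network, which scores each class by the average Gaussian kernel between the query point and that class's training points. The Gaussian form is the textbook PNN kernel and is assumed here; the source only says that sigma is a smoothing parameter. All names are hypothetical.

```python
import math
from collections import defaultdict

def pnn_predict(train, labels, x, sigma=0.1):
    """Classify x with a Parzen-window (PNN-style) density estimate per class."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for vec, label in zip(train, labels):
        sq_dist = sum((a - b) ** 2 for a, b in zip(vec, x))
        sums[label] += math.exp(-sq_dist / (2.0 * sigma ** 2))
        counts[label] += 1
    # average kernel response per class; the largest one wins
    scores = {label: sums[label] / counts[label] for label in sums}
    return max(scores, key=scores.get), scores

train = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
labels = ["low", "low", "high", "high"]
predicted, class_scores = pnn_predict(train, labels, (0.05, 0.1), sigma=0.1)
```

A smaller sigma makes each training point's influence more local; a larger sigma smooths the class densities.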
-
RFRegressor
(X=None, Y=None, n_trees=8, name=None, pipeline=None)¶ Create a Random Forest Regressor model
A Random Forest Regressor finds a function which maps the input space (X) to the lower dimension output space (Y) using decision trees
Parameters: X : pandas.DataFrame
Input DataFrame. shape = (n_rows, input_features) where each row is a data point and the columns are numeric features
Y : pandas.DataFrame
Output DataFrame. shape = (n_rows, output_features) where output_features < input_features and each row is a data point and the columns are numeric features
n_trees : integer greater than 1, optional
The number of trees to use in construction of the Random Forest model. Default is 8 trees
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns:
-
slicematrixIO.connect module¶
low level SliceMatrix-IO API client
-
class
slicematrixIO.connect.
ConnectIO
(api_key, region='us-east-1')¶ Low Level Connection to SliceMatrix-IO
Implements basic interface
Parameters: api_key : string
Valid SliceMatrix-IO API Key
region : string [‘us-east-1’, ‘us-west-1’, ‘eu-central-1’, ‘ap-southeast-1’]
Data center of choice. API Key must be valid for that specific data center. Latency will be lowest if client is closest to data center.
‘us-east-1’: US East Coast Data Center
‘us-west-1’: US West Coast Data Center
‘eu-central-1’: Continental Europe Data Center
‘ap-southeast-1’: South-East Asian Data Center
Examples
>>> from slicematrixIO.connect import ConnectIO
>>> io = ConnectIO(api_key)
>>> io.create_pipeline(...)
>>> io.run_pipeline(...)
>>> io.call_model(...)
Attributes
uploader (object): Convenience class for uploading data to SliceMatrix-IO
region (string)
Methods
-
call_model
(model, type, method, extra_params={}, memory='large')¶ Remotely call a method in a machine learning model
Parameters: model : string
The name of the model
type : string
The type of the model
method: string
The name of the model method to call remotely. Acceptable inputs vary by Pipeline type. See Pipeline docs for more information
extra_params : dict
Any extra parameters to pass as key / values to the Pipeline
memory: string [ ‘large’]
The size of the container (always set to large for beta)
Returns: model_output: dict
-
create_pipeline
(name, type, params={})¶ Create a new Analytical Pipeline for distributed computation
Parameters: name : string
The desired name of the new Pipeline
type : string [ ‘raw_isomap’ | ‘raw_mst’ | ‘raw_lle’ | ‘raw_cfg’ | ‘raw_kde’ | ‘raw_knn_net’ | ‘raw_knn_classifier’ | ‘raw_knn_regressor’ | ‘raw_kpca’ | ‘raw_krr’ | ‘raw_rfr’ | ‘raw_laplacian’ | ‘raw_pnn’ | ‘matrix_mst’ | ‘matrix_kpca’ | ‘matrix_agg’ | ‘kalman_ols’ | ‘basic_a2d’ | ‘isolation_forest’ | ‘dist_matrix’]
The type of the Pipeline
params: dict
Any type specific parameters to the Pipeline in the key/val dictionary
Returns: response: dict
Notes
The basic structure of computation in SliceMatrix-IO starts with the Pipeline.
Pipelines can be thought of as analytical assembly lines, running code which transforms a dataset from raw input data into a meaningful machine learning model. Each pipeline can be reused to process multiple datasets. Pipelines can also be run in parallel.
-
list_files
()¶ Get a list of the files previously uploaded
Returns: file_list : list
-
put_df
(name, dataframe)¶ Upload the DataFrame with desired name and get response (success | failure)
Parameters: name : string
The desired name of the DataFrame
dataframe: pandas.DataFrame
The DataFrame for uploading to the SliceMatrix-IO backend
Returns: response : dict
-
run_pipeline
(name, model, type=None, dataset=None, matrix_name=None, matrix_type=None, X=None, Y=None, extra_params={}, memory='large')¶ Run a Pipeline with the given dataset
Parameters: name : string
The name of the target Pipeline
model : string
The desired name of the model
type : string [ ‘raw_isomap’ | ‘raw_mst’ | ‘raw_lle’ | ‘raw_cfg’ | ‘raw_kde’ |
‘raw_knn_net’ | ‘raw_knn_classifier’ | ‘raw_knn_regressor’ | ‘raw_kpca’ | ‘raw_krr’ | ‘raw_rfr’ | ‘raw_laplacian’ | ‘raw_pnn’ | ‘matrix_mst’ | ‘matrix_kpca’ | ‘matrix_agg’ | ‘kalman_ols’ | ‘basic_a2d’ | ‘isolation_forest’ | ‘dist_matrix’]
The type of the Pipeline
dataset : string
The name of the dataset to run through the Pipeline
matrix_name : string
The name of the matrix model to run through the Pipeline (for Matrix Models)
matrix_type : string [ ‘dist_matrix’ | ‘matrix_agg’ ]
The type of matrix
X : string
The name of the X input (for multi-output regression models)
Y : string
The name of the Y input (for multi-output regression models)
extra_params : dict
Any extra parameters to pass as key / values to the Pipeline
memory: string [ ‘large’]
The size of the container (always set to large for beta)
Returns: response : dict
Notes
This is a very flexible function for running any Pipeline in the SliceMatrix-IO platform.
Most Pipelines will take a single dataset name as input (such as raw_isomap and raw_knn_classifier), whereas others will have more complex inputs. Matrix Models will take matrix_name and matrix_type parameters and regression models will require the names of input (X) and output (Y) training sets.
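The three input families described in the Notes can be checked before dispatching a request. The helper below is a hypothetical sketch (it is not part of slicematrixIO) that assembles keyword arguments matching the run_pipeline signature and rejects mixed inputs. The pipeline, model, and dataset names are placeholders, and treating 'raw_rfr' as an X / Y regression type is an assumption.

```python
def build_run_kwargs(name, model, type, **inputs):
    """Return keyword arguments for run_pipeline after a basic consistency check.

    Exactly one input family must be supplied:
      dataset                      (most Pipelines)
      matrix_name + matrix_type    (Matrix Models)
      X + Y                        (regression models)
    """
    families = {
        "dataset": {"dataset"},
        "matrix": {"matrix_name", "matrix_type"},
        "regression": {"X", "Y"},
    }
    supplied = {key for key, value in inputs.items() if value is not None}
    matches = [fam for fam, keys in families.items() if supplied == keys]
    if len(matches) != 1:
        raise ValueError("unexpected combination of inputs: %s" % sorted(supplied))
    kwargs = {"name": name, "model": model, "type": type}
    kwargs.update({key: inputs[key] for key in supplied})
    return kwargs

isomap_kwargs = build_run_kwargs("iso_pipe", "iso_model", "raw_isomap", dataset="prices")
rfr_kwargs = build_run_kwargs("rfr_pipe", "rfr_model", "raw_rfr", X="features", Y="targets")
```

With a live client, the result could then be passed along as io.run_pipeline(**isomap_kwargs).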
-
-
class
slicematrixIO.connect.
Uploader
(api_key, region, api)¶ Object to handle uploads to SliceMatrix-IO backend
Parameters: api_key : string
Valid SliceMatrix-IO API Key
region : string [‘us-east-1’, ‘us-west-1’, ‘eu-central-1’, ‘ap-southeast-1’]
Data center of choice. API Key must be valid for that specific data center. Latency will be lowest if client is closest to data center.
‘us-east-1’: US East Coast Data Center
‘us-west-1’: US West Coast Data Center
‘eu-central-1’: Continental Europe Data Center
‘ap-southeast-1’: South-East Asian Data Center
api : string
API ID
Examples
>>> uploader = Uploader(api_key)
>>> uploader.put_df("my_dataframe", df)
Methods
-
get_upload_url
(file_name)¶
-
list_files
()¶ Get a list of the files previously uploaded
Returns: file_list : list
-
put_df
(name, df)¶ Upload the DataFrame with desired name and get response (success | failure)
Parameters: name : string
The desired name of the DataFrame
df: pandas.DataFrame
The DataFrame for uploading to the SliceMatrix-IO backend
Returns: response : dict
-
slicematrixIO.core module¶
Core classes
-
class
slicematrixIO.core.
BasePipeline
(name, type, client=None, params={})¶ The base class for every Pipeline
Parameters: name : string
The desired name of the Pipeline
type : string
The type of the Pipeline
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: response : dict
success or failure response to Pipeline creation request
Methods
-
run
(model, type=None, dataset=None, matrix_name=None, matrix_type=None, X=None, Y=None, extra_params={})¶ Run the Pipeline and create a new model
Parameters: dataset : pandas.DataFrame
The dataset to pass into the Pipeline which will train a model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.
-
slicematrixIO.distributions module¶
-
class
slicematrixIO.distributions.
BasicA2D
(dataset=None, name=None, pipeline=None, retrain=True, client=None)¶ Methods
-
getState
()¶
-
score
(value)¶
-
update
(value)¶
-
-
class
slicematrixIO.distributions.
BasicA2DPipeline
(name, retrain=True, client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Methods
-
run
(dataset, model)¶
-
-
class
slicematrixIO.distributions.
IsolationForest
(dataset=None, name=None, pipeline=None, rate=0.1, n_trees=100, client=None)¶ Methods
-
score
(points)¶
-
training_scores
()¶
-
-
class
slicematrixIO.distributions.
IsolationForestPipeline
(name, rate=0.1, n_trees=100, client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Methods
-
run
(dataset, model)¶
-
-
class
slicematrixIO.distributions.
KernelDensityEstimator
(dataset=None, name=None, pipeline=None, bandwidth='scott', client=None)¶ Methods
-
hypercube
(lower_bounds, upper_bounds)¶
-
simulate
(N=1)¶
-
-
class
slicematrixIO.distributions.
KernelDensityEstimatorPipeline
(name, bandwidth='scott', client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Methods
-
run
(dataset, model)¶
-
slicematrixIO.graphs module¶
Classes for creating network graph models
-
class
slicematrixIO.graphs.
CorrelationFilteredGraph
(dataset=None, name=None, pipeline=None, K=3, client=None)¶ Methods
-
edges
()¶ Get a list of all the edges in the graph model
Returns: edges : list
list of all edge / link tuples. Source is edge[0] Target is edge[1]
-
neighborhood
(node)¶ Get the nearest neighbors of the given node
Parameters: node : string
The name of the target node whose neighbors (nodes sharing an edge with it) we want to find
Returns: neighbors : dict
Dictionary of nearest neighbors with distances to target node
-
nodes
()¶ Get the names of the data points / nodes that make up the training dataset
Returns: nodes : list
Data point names / indices
-
rankLinks
()¶ Rank the links by weight, if applicable
Returns: links : dict
dictionary of links with associated weight, if applicable
-
rankNodes
(statistic='closeness_centrality')¶ Rank the model’s nodes by the given network graph statistic / factor
Parameters: statistic : string [‘degree_centrality’ | ‘eigen_centrality’ | ‘closeness_centrality’ | ‘betweenness_centrality’ | ‘is_connected’ |
‘curr_flow_centrality’ | ‘pagerank’ | ‘hits’ | ‘communicability’ | ‘clustering’ | ‘square_clustering’ | ‘greedy_colors’ | ‘eccentricity’ | ‘clique_numbers’ | ‘number_of_cliques’ | ‘estrada_index’ | ‘assortivity’ | ‘transitivity’ | ‘avg_clustering’ | ‘maximal_matching’ | ‘max_weight_matching’ | ‘dispersion’]
The desired graph statistic
Returns: stats : array-like
Depending on the statistic this will be an array or a single float value
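To show what a per-node ranking looks like, the sketch below computes one of the listed statistics, 'degree_centrality', from an edge list shaped like the output of edges(). It is a local illustration with a hypothetical function name; it uses the usual definition (degree divided by n - 1) and makes no claim about how the service computes or formats its result.

```python
def degree_centrality(edges):
    """Fraction of the other nodes each node is linked to, highest first."""
    neighbors = {}
    for source, target in edges:
        neighbors.setdefault(source, set()).add(target)
        neighbors.setdefault(target, set()).add(source)
    n = len(neighbors)
    scores = {node: len(adj) / max(n - 1, 1) for node, adj in neighbors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# a simple chain a - b - c - d: the two inner nodes rank highest
ranking = degree_centrality([("a", "b"), ("b", "c"), ("c", "d")])
```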
-
-
class
slicematrixIO.graphs.
CorrelationFilteredGraphPipeline
(name, K=3, client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Create a Pipeline for training CorrelationFilteredGraph models.
CFGs are similar to MSTs in that both graphs begin with a distance matrix, but whereas MSTs are limited to constructing a tree, CFGs draw links between a node and its closest K neighbors based on correlation distance. CFGs are like KNN networks, but optimized for using correlation distance.
Parameters: name : string
The desired name of the Pipeline.
K : integer greater than 1, optional
The number of nearest neighbors to use for constructing the CFG
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: response : dict
success or failure response to Pipeline creation request
Examples
Create a Pipeline for training multiple CorrelationFilteredGraph models
>>> io = ConnectIO(api_key)
>>> pipe = CorrelationFilteredGraphPipeline(client = io)
>>> for dataframe in dataframes:
...     current_model = pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())
Methods
-
run
(dataset, model)¶
-
-
class
slicematrixIO.graphs.
MinimumSpanningTree
(dataset=None, name=None, pipeline=None, corr_method='pearson', client=None)¶ Methods
-
edges
()¶ Get a list of all the edges in the graph model
Returns: edges : list
list of all edge / link tuples. Source is edge[0] Target is edge[1]
-
neighborhood
(node)¶ Get the nearest neighbors of the given node
Parameters: node : string
The name of the target node whose neighbors (nodes sharing an edge with it) we want to find
Returns: neighbors : dict
Dictionary of nearest neighbors with distances to target node
-
nodes
()¶ Get the names of the data points / nodes that make up the training dataset
Returns: nodes : list
Data point names / indices
-
rankLinks
()¶ Rank the links by weight, if applicable
Returns: links : dict
dictionary of links with associated weight, if applicable
-
rankNodes
(statistic='closeness_centrality')¶ Rank the model’s nodes by the given network graph statistic / factor
Parameters: statistic : string [‘degree_centrality’ | ‘eigen_centrality’ | ‘closeness_centrality’ | ‘betweenness_centrality’ | ‘is_connected’ |
‘curr_flow_centrality’ | ‘pagerank’ | ‘hits’ | ‘communicability’ | ‘clustering’ | ‘square_clustering’ | ‘greedy_colors’ | ‘eccentricity’ | ‘clique_numbers’ | ‘number_of_cliques’ | ‘estrada_index’ | ‘assortivity’ | ‘transitivity’ | ‘avg_clustering’ | ‘maximal_matching’ | ‘max_weight_matching’ | ‘dispersion’]
The desired graph statistic
Returns: stats : array-like
Depending on the statistic this will be an array or a single float value
-
-
class
slicematrixIO.graphs.
MinimumSpanningTreePipeline
(name, corr_method='pearson', client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Create a Pipeline for training MinimumSpanningTree models.
Parameters: name : string
The desired name of the Pipeline.
corr_method : string [“pearson” | “spearman” | “kendall” ]
Which method should we use for computing the correlation matrix?
“pearson” : use the Pearson correlation coefficient
“spearman” : use Spearman’s rho
“kendall” : use Kendall’s tau
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: response : dict
success or failure response to Pipeline creation request
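The corr_method options differ in what they measure: Pearson captures linear association, while Spearman's rho is the Pearson correlation of the ranks and so captures any monotone relationship. The pure-Python sketch below illustrates that difference locally; it is not the service's implementation, the function names are hypothetical, and Kendall's tau is omitted for brevity.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]   # monotone but non-linear in x
rho_pearson = pearson(x, y)
rho_spearman = spearman(x, y)
```

For this monotone but curved relationship, Spearman's rho reaches 1 while the Pearson coefficient stays slightly below it.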
Examples
Create a Pipeline for training multiple MinimumSpanningTree models
>>> io = ConnectIO(api_key)
>>> pipe = MinimumSpanningTreePipeline(client = io)
>>> for dataframe in dataframes:
...     current_model = pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())
Methods
-
run
(dataset, model)¶
-
-
class
slicematrixIO.graphs.
NeighborNetworkGraph
(dataset=None, name=None, pipeline=None, K=3, kernel='euclidean', client=None)¶ Methods
-
edges
()¶ Get a list of all the edges in the graph model
Returns: edges : list
list of all edge / link tuples. Source is edge[0] Target is edge[1]
-
neighborhood
(node)¶ Get the nearest neighbors of the given node
Parameters: node : string
The name of the target node whose neighbors (nodes sharing an edge with it) we want to find
Returns: neighbors : dict
Dictionary of nearest neighbors with distances to target node
-
nodes
()¶ Get the names of the data points / nodes that make up the training dataset
Returns: nodes : list
Data point names / indices
-
rankLinks
()¶ Rank the links by weight, if applicable
Returns: links : dict
dictionary of links with associated weight, if applicable
-
rankNodes
(statistic='closeness_centrality')¶ Rank the model’s nodes by the given network graph statistic / factor
Parameters: statistic : string [‘degree_centrality’ | ‘eigen_centrality’ | ‘closeness_centrality’ | ‘betweenness_centrality’ | ‘is_connected’ |
‘curr_flow_centrality’ | ‘pagerank’ | ‘hits’ | ‘communicability’ | ‘clustering’ | ‘square_clustering’ | ‘greedy_colors’ | ‘eccentricity’ | ‘clique_numbers’ | ‘number_of_cliques’ | ‘estrada_index’ | ‘assortivity’ | ‘transitivity’ | ‘avg_clustering’ | ‘maximal_matching’ | ‘max_weight_matching’ | ‘dispersion’]
The desired graph statistic
Returns: stats : array-like
Depending on the statistic this will be an array or a single float value
-
-
class
slicematrixIO.graphs.
NeighborNetworkGraphPipeline
(name, K=3, kernel='euclidean', client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Create a Pipeline for training NeighborNetworkGraph models.
Parameters: name : string
The desired name of the Pipeline.
K : integer greater than 1, optional
The number of nearest neighbors to use for constructing the K Nearest Neighbor graph
kernel : string, optional
The distance kernel / metric to use in constructing the distance matrix. Default is euclidean.
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: response : dict
success or failure response to Pipeline creation request
Examples
Create a Pipeline for training multiple NeighborNetworkGraph models
>>> io = ConnectIO(api_key)
>>> pipe = NeighborNetworkGraphPipeline(K = 5, client = io)
>>> for dataframe in dataframes:
...     current_model = pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())
Methods
-
run
(dataset, model)¶
-
slicematrixIO.manifolds module¶
Manifold Learning Pipelines and Models
-
class
slicematrixIO.manifolds.
Isomap
(dataset, name=None, pipeline=None, D=2, K=3, client=None)¶ Train / Reload an Isomap model
Parameters: name : string, optional
The desired name of the model. If None then a random name will be generated. If dataset == None, then the name will be used to lazy load the model from the SliceMatrix-IO cloud.
dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features
D : int, optional
The desired embedding dimension. Defaults to 2-D
K : integer greater than 1, optional
The number of neighbors to use in building the geodesic distance matrix. Geodesic distance is constructed by computing the K Nearest Neighbors graph for the input dataset, then constructing all pairwise distances using the geodesic distance, i.e. the number of edges in a shortest path between two points on the graph.
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns: model :
Isomap
Isomap model object
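The geodesic distance described for K, i.e. the number of edges on a shortest path through the neighbor graph, can be sketched locally with a breadth-first search. This is only an illustration of the definition; the function name and the toy edge list are hypothetical.

```python
from collections import deque

def geodesic_hops(edges, source):
    """Number of edges on a shortest path from source to every reachable node."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops:
                # first visit is via a shortest path in an unweighted graph
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops

knn_graph = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
hops_from_a = geodesic_hops(knn_graph, "a")
```

Repeating the search from every node gives the full pairwise geodesic distance matrix that the embedding step works from.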
Examples
Create a model for a given dataset
>>> sm = SliceMatrix(api_key)
>>> iso = sm.Isomap(dataset = dataframe, D = 3, K = 10)
Get the embedding
>>> iso.embedding()
Methods
-
edges
()¶ Get a list of all the edges in the KNN graph used to create the Isomap model
Returns: edges : list
list of all edge / link tuples. Source is edge[0] Target is edge[1]
-
embedding
(nodes=True)¶ Get the D dimensional embedding of the training data
I.e.
- Take input data in high dimensions
- Transform via Isomap to D dimensions
Parameters: nodes : boolean, optional
Whether to return with node names. Default == True
Returns: embedding : pandas.DataFrame
D dimensional embedding. shape = (n_rows, D)
-
neighborhood
(node)¶ Get the nearest neighbors of the given node
Parameters: node : string
The name of the target node we want to find the nearest neighbors for
Returns: neighbors : dict
Dictionary of nearest neighbors with distances to target node
-
nodes
()¶ Get the names of the data points / nodes that make up the training dataset
Returns: nodes : list
Data point names / indices
-
rankLinks
()¶ Rank the links by geodesic distance
Returns: links : dict
dictionary of links with associated geodesic distances
-
rankNodes
(statistic='closeness_centrality')¶ Rank the model’s nodes by the given network graph statistic / factor
Parameters: statistic : string [‘degree_centrality’ | ‘eigen_centrality’ | ‘closeness_centrality’ | ‘betweenness_centrality’ | ‘is_connected’ |
‘curr_flow_centrality’ | ‘pagerank’ | ‘hits’ | ‘communicability’ | ‘clustering’ | ‘square_clustering’ | ‘greedy_colors’ | ‘eccentricity’ | ‘clique_numbers’ | ‘number_of_cliques’ | ‘estrada_index’ | ‘assortivity’ | ‘transitivity’ | ‘avg_clustering’ | ‘maximal_matching’ | ‘max_weight_matching’ | ‘dispersion’]
The desired graph statistic
Returns: stats : array-like
Depending on the statistic this will be an array or a single float value
-
recon_error
()¶ Get the reconstruction error of the model.
Reconstruction error of the embedding
Returns: recon_error : float
Reconstruction error for the model
-
search
(point)¶
-
-
class
slicematrixIO.manifolds.
IsomapPipeline
(name, D=2, K=3, client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Create a Pipeline for training Isomap models
Parameters: name : string
The desired name of the Pipeline.
D : int, optional
The desired embedding dimension. Defaults to 2-D
K : integer greater than 1, optional
The number of neighbors to use in building the geodesic distance matrix. Geodesic distance is constructed by computing the K Nearest Neighbors graph for the input dataset, then constructing all pairwise distances using the geodesic distance, i.e. the number of edges in a shortest path between two points on the graph.
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: response : dict
success or failure response to Pipeline creation request
Examples
Create an Isomap Pipeline for processing multiple datasets
>>> io = ConnectIO(api_key)
>>> iso_pipe = IsomapPipeline(D = 3, K = 4, client = io)
>>> for dataframe in dataframes:
...     current_model = iso_pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())
Methods
-
run
(dataset, model)¶ Run the Pipeline and create a new Isomap model
Parameters: dataset : pandas.DataFrame
The dataset to pass into the Pipeline which will train an Isomap model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.
Returns: response : dict
success or failure response to model creation request
-
-
class
slicematrixIO.manifolds.
KernelPCA
(dataset=None, name=None, pipeline=None, D=2, kernel='linear', alpha=1.0, invert=False, kernel_params={}, client=None)¶ Kernel Principal Component Analysis model
For non-linear dimensionality reduction, simulation, classification, and regression.
Applies the kernel trick to PCA.
Parameters: dataset : pandas.DataFrame, optional
The dataset to use in training the KernelPCA model. If None, then lazy loading is in effect and a name parameter should be given which matches an already created model. shape = (n_rows, n_features)
name : string, optional
The desired name of the model. If None then a random name will be generated. If dataset == None, then the name will be used to lazy load the model from the SliceMatrix-IO cloud.
D : int, optional
The desired embedding dimension. Defaults to 2-D
kernel : string, optional
The kernel to use in constructing the kernel matrix. Default is linear.
alpha : float, optional
Parameter of ridge regression which learns the inverse transform. Ignored if invert == False
invert : boolean, optional
Whether to learn the inverse transform (from low dimension space back to high dimension space)
kernel_params : dict, optional
Any extra parameters specific to the chosen kernel
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns: model :
KernelPCA
KPCA model object
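"Applies the kernel trick to PCA" means the method works on a matrix of pairwise kernel values rather than on the raw features. The sketch below builds an RBF kernel matrix and double-centers it, which are the standard first steps of Kernel PCA; the embedding would then come from the top D eigenvectors of the centered matrix (not shown). This is a local illustration under those textbook assumptions, with hypothetical function names and an illustrative gamma value.

```python
import math

def rbf_kernel_matrix(rows, gamma=1.0):
    """K[i][j] = exp(-gamma * ||x_i - x_j||^2)."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj))) for xj in rows]
        for xi in rows
    ]

def center_kernel(K):
    """Double-center K so the implicit feature vectors have zero mean."""
    n = len(K)
    row_means = [sum(row) / n for row in K]   # equal to column means, K is symmetric
    total_mean = sum(row_means) / n
    return [
        [K[i][j] - row_means[i] - row_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]

rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
K_centered = center_kernel(rbf_kernel_matrix(rows, gamma=0.5))
```

After centering, every row and column of the matrix sums to zero, which mirrors mean-centering the data in ordinary PCA.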
Examples
Create a KernelPCA model for a given dataset
>>> sm = SliceMatrix(api_key)
>>> kpca = sm.KernelPCA(dataset = dataframe, D = 5, kernel = "rbf")
Get the embedding
>>> kpca.embedding()
Learn the inverse transform
>>> kpca = sm.KernelPCA(dataset = dataframe, invert = True)
>>> kpca.inverse_embedding()
Methods
-
embedding
(nodes=True)¶ Get the D dimensional embedding of the training data
I.e.
- Take input data in high dimensions
- Transform via KPCA to D dimensions
Parameters: nodes : boolean, optional
Whether to return with node names. Default == True
Returns: embedding : pandas.DataFrame
D dimensional embedding. shape = (n_rows, D)
-
feature_names
()¶ Get the names of the features, if applicable
Returns: meta : dict
Model feature names
-
inverse_embedding
(nodes=True)¶ Get the inverse embedding of the training data in original dimensions
I.e.
- Take input data in high dimensions
- Transform via KPCA to D dimensions
- Transform back to high dimensions using model
Parameters: nodes : boolean, optional
Whether to return with node names. Default == True
Returns: inverse_embedding : pandas.DataFrame
Original dimension inverse embedding. shape = (n_rows, n_features)
-
meta
()¶ Get the model metadata such as D, kernel name, etc...
Returns: meta : dict
Model metadata
-
nodes
()¶ Get the names of the data points / nodes that make up the training dataset
Returns: nodes : list
Data point names / indices
-
-
class
slicematrixIO.manifolds.
KernelPCAPipeline
(name, D=2, kernel='linear', alpha=1.0, invert=False, kernel_params={}, client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Pipeline for creating Kernel Principal Component Analysis models
For non-linear dimensionality reduction, simulation, classification, and regression.
Applies the kernel trick to PCA.
Parameters: name : string
The desired name of the Pipeline.
D : int, optional
The desired embedding dimension. Defaults to 2-D
kernel : string, optional
The kernel to use in constructing the kernel matrix. Default is linear.
alpha : float, optional
Parameter of ridge regression which learns the inverse transform. Ignored if invert == False
invert : boolean, optional
Whether to learn the inverse transform (from low dimension space back to high dimension space)
kernel_params : dict, optional
Any extra parameters specific to the chosen kernel
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: response : dict
success or failure response to Pipeline creation request
Examples
Create a KernelPCA Pipeline for processing multiple datasets
>>> io = ConnectIO(api_key)
>>> kpca_pipe = KernelPCAPipeline(D = 5, kernel = "rbf", client = io)
>>> for dataframe in dataframes:
...     current_kpca_model = kpca_pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())
Methods
-
run
(dataset, model)¶ Run the Pipeline and create a new KernelPCA model
Parameters: dataset : pandas.DataFrame
The dataset to pass into the Pipeline which will train a KernelPCA model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.
Returns: response : dict
success or failure response to model creation request
-
-
class
slicematrixIO.manifolds.
LaplacianEigenmapper
(dataset=None, name=None, pipeline=None, D=2, affinity='knn', K=5, gamma=1.0, client=None)¶ Train / Reload a Laplacian Eigenmapper model
Parameters: dataset : pandas.DataFrame
Input DataFrame. shape = (n_rows, n_features) where each row is a data point and the columns are numeric features
D : int, optional
The desired embedding dimension. Defaults to 2-D
affinity : string [“knn” | “rbf”], optional
How should we construct the affinity matrix?
“knn” : use k nearest neighbors graph
“rbf” : use radial basis function kernel
K : integer greater than 1, optional
The K to use if affinity is “knn”.
gamma : float, optional
Kernel coefficient for affinity “rbf”
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: model :
LaplacianEigenmapper
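Whichever affinity option is chosen, a Laplacian Eigenmap is derived from the graph Laplacian of the affinity matrix. The sketch below computes the unnormalized Laplacian L = D - W for a small symmetric affinity matrix; the embedding would then use eigenvectors of L with the smallest non-zero eigenvalues (not shown). Whether the service uses the normalized or unnormalized form is not stated in the source, so treat this as an illustration with a hypothetical function name.

```python
def graph_laplacian(affinity):
    """Unnormalized Laplacian L = D - W of a symmetric affinity matrix W."""
    n = len(affinity)
    degrees = [sum(row) for row in affinity]   # diagonal of the degree matrix D
    return [
        [(degrees[i] if i == j else 0.0) - affinity[i][j] for j in range(n)]
        for i in range(n)
    ]

# affinity of a three-node chain: 0 - 1 - 2
W = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
L = graph_laplacian(W)
```

Each row of L sums to zero, which is why the constant vector is always an eigenvector with eigenvalue 0 and is skipped when building the embedding.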
Examples
Create a model for a given dataset
>>> sm = SliceMatrix(api_key)
>>> spectral = sm.LaplacianEigenmapper(dataset = dataframe, D = 3)
Get the embedding
>>> spectral.embedding()
Methods
-
affinity_matrix
()¶ Get the affinity matrix used to perform the embedding
Returns: affinity_matrix : matrix-like
Model affinity matrix shape = (n_rows, n_rows)
-
embedding
(nodes=True)¶ Get the D dimensional embedding of the training data
I.e.
- Take input data in high dimensions
- Transform via Laplacian Eigenmapper to D dimensions
Parameters: nodes : boolean, optional
Whether to return with node names. Default == True
Returns: embedding : pandas.DataFrame
D dimensional embedding. shape = (n_rows, D)
-
feature_names
()¶ Get the names of the features, if applicable
Returns: meta : dict
Model feature names
-
meta
()¶ Get the model metadata such as D, affinity, etc...
Returns: meta : dict
Model metadata
-
nodes
()¶ Get the names of the data points / nodes that make up the training dataset
Returns: nodes : list
Data point names / indices
-
-
class
slicematrixIO.manifolds.
LaplacianEigenmapperPipeline
(name, D=2, affinity='knn', K=5, gamma=1.0, client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Create a Laplacian Eigenmapper Pipeline for creating LaplacianEigenmapper models from input training datasets
Parameters: name : string
The desired name of the Pipeline.
D : int, optional
The desired embedding dimension. Defaults to 2-D
affinity : string [“knn” | “rbf”], optional
How should we construct the affinity matrix?
“knn” : use k nearest neighbors graph
“rbf” : use radial basis function kernel
K : integer greater than 1, optional
The K to use if affinity is “knn”.
gamma : float, optional
Kernel coefficient for affinity “rbf”
Returns: response : dict
success or failure response to Pipeline creation request
Examples
Create a Pipeline for processing multiple datasets into LaplacianEigenmapper models
>>> io = ConnectIO(api_key)
>>> spectral_pipe = LaplacianEigenmapperPipeline(D = 5, client = io)
>>> for dataframe in dataframes:
...     current_model = spectral_pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())
Methods
-
run
(dataset, model)¶ Run the Pipeline and create a new LaplacianEigenmapper model
Parameters: dataset : pandas.DataFrame
The dataset to pass into the Pipeline which will train a LaplacianEigenmapper model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.
Returns: response : dict
success or failure response to model creation request
-
-
class
slicematrixIO.manifolds.
LocalLinearEmbedder
(dataset=None, name=None, pipeline=None, D=2, K=3, method='standard', client=None)¶ Train / Reload a Local Linear Embedder model
Parameters: name : string
The desired name of the model.
D : int, optional
The desired embedding dimension. Defaults to 2-D
K : integer greater than 1, optional
The number of neighbors to use in building the embedding. Default is 3
method : string [‘standard’ | ‘hessian’ | ‘modified’ | ‘ltsa’]
Which LLE algorithm should we use?
‘standard’ : standard LLE method
‘hessian’: hessian eigenmap LLE method, requires that K > D * (1 + (D + 1) / 2)
‘modified’ : modified LLE method
‘ltsa’: local tangent space alignment LLE method
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
Returns: model :
LocalLinearEmbedder
LLE model object
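The constraint on K for the 'hessian' method is easy to check before creating a model. The helper names below are hypothetical; the inequality itself is the one stated in the method description.

```python
import math

def hessian_k_ok(D, K):
    """True if K satisfies the 'hessian' requirement K > D * (1 + (D + 1) / 2)."""
    return K > D * (1 + (D + 1) / 2)

def min_hessian_k(D):
    """Smallest integer K that satisfies the requirement for embedding dimension D."""
    return math.floor(D * (1 + (D + 1) / 2)) + 1

default_ok = hessian_k_ok(D=2, K=3)    # the documented defaults D=2, K=3
smallest_k = min_hessian_k(2)
```

With the default D = 2 the bound is 5, so the default K = 3 does not satisfy the 'hessian' requirement and K must be raised to at least 6.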
Examples
Create an LLE model for a given dataset
>>> sm = SliceMatrix(api_key)
>>> lle = sm.LocalLinearEmbedder(dataset = dataframe, D = 2)
Methods
-
embedding
(nodes=True)¶ Get the D dimensional embedding of the training data
I.e.
- Take input data in high dimensions
- Transform via LLE to D dimensions
Parameters: nodes : boolean, optional
Whether to return with node names. Default == True
Returns: embedding : pandas.DataFrame
D dimensional embedding. shape = (n_rows, D)
-
feature_names
()¶ Get the names of the features, if applicable
Returns: meta : dict
Model feature names
-
meta
()¶ Get the model metadata such as D, method, etc...
Returns: meta : dict
Model metadata
-
nodes
()¶ Get the names of the data points / nodes that make up the training dataset
Returns: nodes : list
Data point names / indices
-
recon_error
()¶ Get the reconstruction error of the LLE model.
Reconstruction error of the embedding
Returns: recon_error : float
Reconstruction error for the model
-
-
class
slicematrixIO.manifolds.
LocalLinearEmbedderPipeline
(name, D=2, K=3, method='standard', client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Create a Pipeline for training Local Linear Embedder models
Parameters: name : string
The desired name of the Pipeline.
D : int, optional
The desired embedding dimension. Defaults to 2-D
K : integer greater than 1, optional
The number of neighbors to use in building the embedding. Default is 3
method : string [‘standard’ | ‘hessian’ | ‘modified’ | ‘ltsa’]
Which LLE algorithm should we use?
‘standard’ : standard LLE method
‘hessian’: Hessian eigenmap LLE method, requires that K > D * (1 + (D + 1) / 2)
‘modified’ : modified LLE method
‘ltsa’: local tangent space alignment LLE method
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: response : dict
success or failure response to Pipeline creation request
Examples
Create a
LocalLinearEmbedder
Pipeline for processing multiple datasets
>>> io = ConnectIO(api_key)
>>> lle_pipe = LocalLinearEmbedderPipeline(name = slicematrixIO.utils.rando_name(), D = 2, client = io)
>>> for dataframe in dataframes:
...     current_model = lle_pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())
Methods
-
run
(dataset, model)¶ Run the Pipeline and create a new LocalLinearEmbedder model
Parameters: dataset : pandas.DataFrame
The dataset to pass into the Pipeline which will train a LocalLinearEmbedder model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.
Returns: response : dict
success or failure response to model creation request
-
slicematrixIO.matrices module¶
Distance / Similarity Matrix Models
Generalization of the correlation matrix for different metrics / kernels / similarity measures
-
class
slicematrixIO.matrices.
DistanceMatrix
(dataset=None, name=None, pipeline=None, K=5, kernel='euclidean', geodesic=False, kernel_params={}, client=None)¶ Train / Reload a
DistanceMatrix
model
Parameters: name : string, optional
The desired name of the model. If None a random name will be generated
K : integer greater than 1, optional, ignored if geodesic == False
The number of neighbors to use in building the geodesic distance matrix. Geodesic distance is constructed by computing the K Nearest Neighbors graph for the input dataset, then constructing all pairwise distances using the geodesic distance, i.e. the number of edges in a shortest path between two points on the graph.
kernel : string, optional
The distance kernel / metric to use in constructing the distance matrix. Default is euclidean.
kernel_params : dict, optional
Any extra parameters specific to the chosen kernel
geodesic : boolean, optional
Whether to create the geodesic distance matrix or the brute force pairwise distance matrix. Default is False
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
pipeline : string, optional
An extant
DistanceMatrixPipeline
to use for model creation. If None then one will be created
Methods
-
getKeys
()¶ Get the names of the datapoints in the model’s training dataset
Returns: keys : list
The names of the datapoints in the model’s training dataset
-
rankDist
(target, page=0)¶ Get the closest datapoints to the given target
Parameters: page : integer, optional
The current page. Responses come in chunks of 100. To iterate through the full list increase the page number.
Returns: distances : pandas.DataFrame
DataFrame with list of datapoints sorted by distance from target point
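Because rankDist responses come in chunks of 100, retrieving a full ranking means requesting successive pages. The sketch below shows one way to do that with a stand-in fetch function. The helper name is hypothetical, and the stopping rule (an empty page or a page shorter than 100 rows marks the end) is an assumption, not documented service behavior.

```python
def fetch_all_pages(fetch_page, page_size=100, max_pages=1000):
    """Collect pages from fetch_page(page) until the results run out."""
    pages = []
    for page in range(max_pages):
        chunk = fetch_page(page)
        if len(chunk) == 0:
            break  # assumed: an empty page means there is nothing left
        pages.append(chunk)
        if len(chunk) < page_size:
            break  # assumed: a short page is the last page
    return pages


# Stand-in for a model call: 250 ranked items served 100 at a time.
fake_ranking = list(range(250))


def fake_fetch(page):
    return fake_ranking[page * 100:(page + 1) * 100]


pages = fetch_all_pages(fake_fetch)
```

With a trained model, the fetch function could be something like `lambda p: model.rankDist(target, page = p)`, and the resulting DataFrames could then be joined with `pandas.concat(pages)`.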
-
-
class
slicematrixIO.matrices.
DistanceMatrixPipeline
(name, kernel='euclidean', geodesic=False, K=5, kernel_params={}, client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Create a Pipeline to train
DistanceMatrix
models from input datasets
Parameters: name : string
The desired name of the Pipeline.
K : integer greater than 1, optional, ignored if geodesic == False
The number of neighbors to use in building the geodesic distance matrix. Geodesic distance is constructed by computing the K Nearest Neighbors graph for the input dataset, then constructing all pairwise distances using the geodesic distance, i.e. the number of edges in a shortest path between two points on the graph.
kernel : string, optional
The distance kernel / metric to use in constructing the distance matrix. Default is euclidean.
kernel_params : dict, optional
Any extra parameters specific to the chosen kernel
geodesic : boolean, optional
Whether to create the geodesic distance matrix or the brute force pairwise distance matrix. Default is False
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: response : dict
success or failure response to Pipeline creation request
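The geodesic construction described for K above (build the K Nearest Neighbors graph, then count the edges on a shortest path between each pair of points) can be illustrated locally. This is a minimal pure-Python sketch under stated assumptions: Euclidean distance is used to find neighbors, the graph is treated as undirected, and unreachable pairs get an infinite distance. The service's own implementation may differ on any of these points.

```python
import math
from collections import deque


def knn_graph(points, K):
    """Undirected K Nearest Neighbors graph as an adjacency dict."""
    n = len(points)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in ranked[:K]:
            adj[i].add(j)
            adj[j].add(i)  # assumption: edges are symmetric
    return adj


def hop_distances(adj, source):
    """Breadth-first search: edge count of a shortest path from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def geodesic_matrix(points, K):
    """All pairwise hop distances on the K Nearest Neighbors graph."""
    adj = knn_graph(points, K)
    n = len(points)
    matrix = []
    for i in range(n):
        d = hop_distances(adj, i)
        matrix.append([d.get(j, math.inf) for j in range(n)])
    return matrix


line = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
print(geodesic_matrix(line, K=1)[0])
```

For four evenly spaced points on a line, the first row counts one extra edge per step along the chain.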
Examples
Create a Pipeline for processing multiple datasets into
DistanceMatrix
models
>>> io = ConnectIO(api_key)
>>> matrix_pipe = DistanceMatrixPipeline(name = slicematrixIO.utils.rando_name(), kernel = "correlation", client = io)
>>> for dataframe in dataframes:
...     current_model = matrix_pipe.run(dataset = dataframe, model = slicematrixIO.utils.rando_name())
Methods
-
run
(dataset, model)¶ Run the Pipeline and create a new
DistanceMatrix
model
Parameters: dataset : pandas.DataFrame
The dataset to pass into the Pipeline which will train a
DistanceMatrix
model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.
-
slicematrixIO.matrix_models module¶
-
class
slicematrixIO.matrix_models.
MatrixAgglomerator
(label_dataset=None, matrix=None, matrix_name=None, matrix_type=None, name=None, pipeline=None, alpha=0.1, client=None)¶
Methods
-
getKeys
()¶
-
rankDist
(target, page=0)¶
-
-
class
slicematrixIO.matrix_models.
MatrixAgglomeratorPipeline
(name, alpha=0.1, client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Methods
-
run
(label_dataset, model, matrix=None, matrix_name=None, matrix_type=None)¶
-
-
class
slicematrixIO.matrix_models.
MatrixKernelPCA
(matrix=None, matrix_name=None, matrix_type=None, name=None, pipeline=None, D=2, client=None)¶
Methods
-
embedding
()¶
-
meta
()¶
-
nodes
()¶
-
-
class
slicematrixIO.matrix_models.
MatrixKernelPCAPipeline
(name, D=2, client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Methods
-
run
(model, matrix=None, matrix_name=None, matrix_type=None)¶
-
-
class
slicematrixIO.matrix_models.
MatrixMinimumSpanningTree
(matrix=None, matrix_name=None, matrix_type=None, name=None, pipeline=None, client=None)¶
Methods
-
edges
()¶
-
neighborhood
(node)¶
-
nodes
()¶
-
rankLinks
()¶
-
rankNodes
(statistic='closeness_centrality')¶
-
-
class
slicematrixIO.matrix_models.
MatrixMinimumSpanningTreePipeline
(name, client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Methods
-
run
(model, matrix=None, matrix_name=None, matrix_type=None)¶
-
slicematrixIO.notebook module¶
Module containing all Jupyter Notebook related classes / functions
All of this is meant to be run inside a Jupyter Notebook. The resulting graphs can be shared as a notebook or as HTML.
-
class
slicematrixIO.notebook.
GraphEngine
(sm)¶ Class for setting up and drawing graphs / visualizations of SliceMatrix-IO models directly in the Jupyter Notebook
Parameters: sm :
slicematrixIO.client.SliceMatrix
An extant client
Examples
Create the GraphEngine
>>> sm = SliceMatrix(api_key)
>>> viz = GraphEngine(sm)
Initialize the notebook data
>>> viz.init_data()
Initialize the graph stylesheet
>>> viz.init_style()
Then visualize a model
>>> iso = sm.Isomap(dataset=prices)
>>> viz.drawNetworkGraph(iso, width = 1000, height = 600, color_map = "Heat")
You can then save and export the notebook for sharing your graph. HTML exports will render directly in the browser.
For another example check out https://slicematrix.github.io/manifold_learning_js.html
Methods
-
drawNetworkGraph
(network_model, color_map='RdBuGn', graph_style='light', graph_layout='force', width=1000, height=600, charge=-100, color_axis='closeness_centrality', label_color='#000', label_shadow_color='#fffff0', min_node_size=5)¶ Embed a D3 network graph into a Jupyter Notebook to visualize a SliceMatrix-IO network graph model. Graphs embedded in notebooks can be shared.
Parameters: network_model : graph-object
The network graph model. Can be any model with the methods:
- .nodes()
- .edges()
- .rankNodes()
Current list of graphable objects includes:
color_map : string
The desired color map for the node colors. Nodes are colored relative to their color_axis (the graph node statistic) selection.
Mappings go from min value to median value to max value
- ‘RdBuGn’ : Red to Blue to Green
- ‘RdGrGn’ : Red to Gray to Green
- ‘PuBuXr’ : A purple to blue x-ray effect where nodes near the median appear to disappear on a dark background
- ‘Viridis’ : The Viridis color map, good for dark or light backgrounds
- ‘Heat’ : A Red/Orange colormap with darker hues at the extremes
- ‘Winter’ : A Blue/Green colormap
graph_style : string [‘light’ | ‘dark’]
The overall styling of the graph. Light background vs dark background...
graph_layout : string [‘force’ | ‘embedding’]
The layout algorithm for the network graph.
- ‘force’ : network layout (node positioning) will be determined by a force directed simulation
- ‘embedding’ : node positioning will be static and determined by the positions returned in model.embedding(). For models without a .embedding() function, enabling this option may cause the graph to fail to display properly
width : integer, greater than 0
The desired width of the network graph
height : integer, greater than 0
The desired height of the network graph
charge : integer, less than 0
Used only when graph_layout == ‘force’; ignored otherwise. The charge associated with each node for use in the force directed simulation layout. The more negative the charge, the more strongly the nodes repel one another
color_axis : string
The name of the graph statistic to use for coloring the graph nodes. Should be a valid statistic name for a call to model.rankNodes()
label_color : string
The color of the node labels. Defaults to black. Accepts any valid HTML color (e.g. #fff or rgba(255,255,255,0.8))
label_shadow_color : string
The color of the node label shadow. Defaults to “#fffff0”
min_node_size : integer, greater than 0
The minimum size to make the graph nodes. Defaults to 5
Returns: html : IPython.display.javascript
An HTML + JavaScript network graph chart embedded in the Jupyter Notebook
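The color_map description says mappings run from the min value through the median value to the max value. One plausible reading is a piecewise-linear scale in which the median of the color_axis statistic lands on the middle color. The sketch below illustrates that reading only. The function names and the RGB stops are hypothetical, and the actual rendering happens in the embedded D3 code, which may behave differently.

```python
import statistics


def normalize_to_median(values):
    """Map values to [0, 1] so that min -> 0.0, median -> 0.5, max -> 1.0."""
    lo, mid, hi = min(values), statistics.median(values), max(values)
    out = []
    for v in values:
        if v <= mid:
            span = mid - lo
            # Degenerate case: if min == median, place the value mid-scale.
            out.append(0.5 if span == 0 else 0.5 * (v - lo) / span)
        else:
            # v > mid implies hi > mid, so the span is never zero here.
            out.append(0.5 + 0.5 * (v - mid) / (hi - mid))
    return out


def three_stop_color(t, stops):
    """Blend linearly through three RGB stops placed at t = 0, 0.5 and 1."""
    first, middle, last = stops
    if t <= 0.5:
        start, end, u = first, middle, t / 0.5
    else:
        start, end, u = middle, last, (t - 0.5) / 0.5
    return tuple(round(s + (e - s) * u) for s, e in zip(start, end))


# Hypothetical RGB stops for a red -> blue -> green map.
RD_BU_GN = ((255, 0, 0), (0, 0, 255), (0, 128, 0))

scores = [0, 1, 2, 3, 10]
colors = [three_stop_color(t, RD_BU_GN) for t in normalize_to_median(scores)]
```

Scaling around the median rather than the midpoint of the range keeps a single outlier (the 10 above) from compressing every other node into one end of the color map.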
-
init_data
()¶ Initialize the window’s graph data
Returns: js : IPython.display.javascript
A javascript code block
-
init_style
()¶ Initialize the notebook’s graph style
Returns: js : IPython.display.javascript
A javascript code block
-
slicematrixIO.regressors module¶
Regressors are machine learning models which learn a function between an input (X) and an output (Y).
In particular, SliceMatrix-IO offers a number of what are known as “multi-output” regression models.
This is a special type of regression which can have an output with a dimension greater than 1, useful for:
- Prediction
- Out of Sample Manifold Learning
- As a step within a classification workflow
-
class
slicematrixIO.regressors.
KNNRegressor
(X=None, Y=None, name=None, pipeline=None, K=5, kernel='euclidean', algo='auto', weights='uniform', kernel_params={}, client=None)¶ Train / Reload a
KNNRegressor
model for multi-output regression
Parameters: X : pandas.DataFrame
Input DataFrame. shape = (n_rows, input_features) where each row is a data point and the columns are numeric features
Y : pandas.DataFrame
Output DataFrame. shape = (n_rows, output_features) where output_features < input_features and each row is a data point and the columns are numeric features
K : integer, optional
The desired K in the Nearest Neighbor regressor model
kernel : string [ ‘euclidean’ | ‘minkowski’ | ‘hammond’ | ‘etc...’], optional
The desired kernel for defining distance in our regressor. Default is ‘euclidean’
algo : string [‘auto’ | ‘ball’ | ‘kd_tree’ | ‘brute’], optional
The algorithm to use in determining Nearest Neighbors. Default is ‘auto’
weights : string [‘uniform’ | ‘weighted’], optional
Should predictions weight neighbors uniformly (i.e. independent of distance) or by distance (i.e. closer neighbors contribute more to the prediction). Default is ‘uniform’
kernel_params : dict, optional
Any parameters specific to the chosen kernel
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: model :
KNNRegressor
Trained K Nearest Neighbors Regressor model
Examples
Train a K Nearest Neighbors Regressor model
>>> sm = SliceMatrix(api_key)
>>> knn = sm.KNNRegressor(X = X, Y = Y, K = 3)
Make a prediction
>>> knn.predict([...])
Methods
-
predict
(point)¶ Make a prediction using the given input features.
Also used for out of sample manifold learning.
I.e.
- Manifold learning embeds input data from a high dimension (H) into a low dimension (D, D < H); however,
- Many manifold learning algorithms don’t have straightforward out of sample generalizations...
- So learn the “interpolation” function between the high dim space and the low dim space with a multi-output regression
- Regress the high dim (H) data points against their embedding (D) data points to learn the manifold embedding
- When presented with a new data point (an H dimension vector, or tensor, or whatever term is fashionable), “embed” it using the multi-output regression to output a D dimension vector
Parameters: point : list
List of points to use as inputs to a prediction
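The out of sample workflow above can be sketched locally with a tiny multi-output K Nearest Neighbors regression. This is an illustration, not the service's implementation: the function name is hypothetical, Euclidean distance is assumed, and ‘weighted’ is interpreted here as inverse-distance weighting.

```python
import math


def knn_predict(X_train, Y_train, point, K=5, weights="uniform"):
    """Predict an output vector as the (weighted) mean of the K nearest rows."""
    # Rank the training rows by Euclidean distance to the query point.
    ranked = sorted(
        ((math.dist(x, point), y) for x, y in zip(X_train, Y_train)),
        key=lambda pair: pair[0],
    )[:K]
    if weights == "uniform":
        w = [1.0 for _ in ranked]
    elif any(d == 0.0 for d, _ in ranked):
        # An exact match takes all of the weight (avoids division by zero).
        w = [1.0 if d == 0.0 else 0.0 for d, _ in ranked]
    else:
        # Assumption: "weighted" means inverse-distance weights.
        w = [1.0 / d for d, _ in ranked]
    total = sum(w)
    n_out = len(ranked[0][1])
    return [
        sum(wi * y[j] for wi, (_, y) in zip(w, ranked)) / total
        for j in range(n_out)
    ]


# Toy data: H = 3 input features and a D = 2 embedding for each row.
X_high = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
Y_embed = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]

new_point = (0.25, 0.0, 0.0)
uniform = knn_predict(X_high, Y_embed, new_point, K=2)
weighted = knn_predict(X_high, Y_embed, new_point, K=2, weights="weighted")
```

Here the new H dimension point is “embedded” into D dimensions without refitting the manifold model. The uniform prediction is the plain average of the two nearest embeddings, while the weighted prediction is pulled toward the closer neighbor.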
-
score
()¶ Get the R^2 of the training dataset / predictions
Returns: r2 : float
The R^2 of the training dataset
-
-
class
slicematrixIO.regressors.
KNNRegressorPipeline
(name, K=5, kernel='euclidean', algo='auto', weights='uniform', kernel_params={}, client=None)¶ Bases:
slicematrixIO.core.BasePipeline
K Nearest Neighbors Regression.
Create a Pipeline for training
KNNRegressor
models.
Parameters: name : string
The desired name of the Pipeline.
K : integer, optional
The desired K in the Nearest Neighbor regressor model
kernel : string [ ‘euclidean’ | ‘minkowski’ | ‘hammond’ | ‘etc...’], optional
The desired kernel for defining distance in our regressor. Default is ‘euclidean’
algo : string [‘auto’ | ‘ball’ | ‘kd_tree’ | ‘brute’], optional
The algorithm to use in determining Nearest Neighbors. Default is ‘auto’
weights : string [‘uniform’ | ‘weighted’], optional
Should predictions weight neighbors uniformly (i.e. independent of distance) or by distance (i.e. closer neighbors contribute more to the prediction). Default is ‘uniform’
kernel_params : dict, optional
Any parameters specific to the chosen kernel
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: response : dict
success or failure response to Pipeline creation request
Examples
Create a Pipeline for training multiple
KNNRegressor
models
>>> io = ConnectIO(api_key)
>>> pipe = KNNRegressorPipeline(name = slicematrixIO.utils.rando_name(), K = 5, client = io)
>>> for X, Y in datasets:
...     current_model = pipe.run(X = X, Y = Y, model = slicematrixIO.utils.rando_name())
Methods
-
run
(X, Y, model)¶ Run the Pipeline and create a new
KNNRegressor
model
Parameters: X : pandas.DataFrame
Input DataFrame, shape = (n_rows, input_features)
Y : pandas.DataFrame
Output DataFrame, shape = (n_rows, output_features). X and Y are passed into the Pipeline, which will train a
KNNRegressor
model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.
Returns: response : dict
success or failure response to model creation request
-
-
class
slicematrixIO.regressors.
KernelRidgeRegressor
(X=None, Y=None, name=None, pipeline=None, kernel='linear', alpha=1.0, kernel_params={}, client=None)¶ Train / Reload a
KernelRidgeRegressor
model for multi-output regression
Parameters: X : pandas.DataFrame
Input DataFrame. shape = (n_rows, input_features) where each row is a data point and the columns are numeric features
Y : pandas.DataFrame
Output DataFrame. shape = (n_rows, output_features) where output_features < input_features and each row is a data point and the columns are numeric features
alpha : float, optional
Kernel Ridge Regressor model alpha value. Default 1.0
kernel : string [‘linear’ | ‘rbf’ | ‘poly’], optional
Kernel to use in regression. Linear is default. For nonlinear datasets, consider rbf or poly
‘linear’ : linear kernel
‘rbf’ : radial basis function kernel
‘poly’ : polynomial kernel
kernel_params : dict, optional
Kernel specific parameters
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: model :
KernelRidgeRegressor
Trained Kernel Ridge Regressor model
Examples
Train a Kernel Ridge Regressor model
>>> sm = SliceMatrix(api_key)
>>> krr = sm.KernelRidgeRegressor(X = X, Y = Y, kernel = "rbf")
Make a prediction
>>> krr.predict([...])
Methods
-
predict
(point)¶ Make a prediction using the given input features.
Also used for out of sample manifold learning.
I.e.
- Manifold learning embeds input data from a high dimension (H) into a low dimension (D, D < H); however,
- Many manifold learning algorithms don’t have straightforward out of sample generalizations...
- So learn the “interpolation” function between the high dim space and the low dim space with a multi-output regression
- Regress the high dim (H) data points against their embedding (D) data points to learn the manifold embedding
- When presented with a new data point (an H dimension vector, or tensor, or whatever term is fashionable), “embed” it using the multi-output regression to output a D dimension vector
Parameters: point : list
List of points to use as inputs to a prediction
-
score
()¶ Get the R^2 of the training dataset / predictions
Returns: r2 : float
The R^2 of the training dataset
-
-
class
slicematrixIO.regressors.
KernelRidgeRegressorPipeline
(name, kernel='linear', alpha=1.0, kernel_params={}, client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Kernel Ridge Regression.
Create a Pipeline for training
KernelRidgeRegressor
models.
Parameters: name : string
The desired name of the Pipeline.
alpha : float, optional
Kernel Ridge Regressor model alpha value. Default 1.0
kernel : string [‘linear’, ‘rbf’, ‘poly’]
Kernel to use in regression. Linear is default. For nonlinear datasets, consider rbf or poly
‘linear’ : linear kernel
‘rbf’ : radial basis function kernel
‘poly’ : polynomial kernel
kernel_params : dict
Kernel specific parameters
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: response : dict
success or failure response to Pipeline creation request
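For reference, the three kernel choices and the role of alpha can be written out in a few lines. The sketch below uses the standard textbook definitions. The parameter names gamma, degree and coef0 follow a common convention and are assumptions here; the keys actually accepted in kernel_params are not listed in this documentation.

```python
import math


def linear_kernel(x, z):
    """Plain dot product."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def poly_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel: (gamma * <x, z> + coef0) ** degree."""
    return (gamma * linear_kernel(x, z) + coef0) ** degree


def ridge_system(X, kernel, alpha=1.0, **kernel_params):
    """Gram matrix with alpha added to the diagonal (K + alpha * I).

    Kernel ridge regression solves (K + alpha * I) c = Y for the dual
    coefficients c, so a larger alpha means stronger regularization.
    """
    return [
        [kernel(a, b, **kernel_params) + (alpha if i == j else 0.0)
         for j, b in enumerate(X)]
        for i, a in enumerate(X)
    ]


X_demo = [(1.0, 0.0), (0.0, 1.0)]
system = ridge_system(X_demo, rbf_kernel, alpha=1.0, gamma=0.5)
```

The linear kernel reduces to ordinary ridge regression, while rbf and poly let the same linear solve fit nonlinear relationships between X and Y.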
Examples
Create a Pipeline for training multiple
KernelRidgeRegressor
models
>>> io = ConnectIO(api_key)
>>> pipe = KernelRidgeRegressorPipeline(name = slicematrixIO.utils.rando_name(), kernel = "rbf", client = io)
>>> for X, Y in datasets:
...     current_model = pipe.run(X = X, Y = Y, model = slicematrixIO.utils.rando_name())
Methods
-
run
(X, Y, model)¶ Run the Pipeline and create a new
KernelRidgeRegressor
model
Parameters: X : pandas.DataFrame
Input DataFrame, shape = (n_rows, input_features)
Y : pandas.DataFrame
Output DataFrame, shape = (n_rows, output_features). X and Y are passed into the Pipeline, which will train a
KernelRidgeRegressor
model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.
Returns: response : dict
success or failure response to model creation request
-
-
class
slicematrixIO.regressors.
RFRegressor
(X=None, Y=None, name=None, pipeline=None, n_trees=8, client=None)¶ Train / Reload a
RFRegressor
model for multi-output regression
A Random Forest Regressor finds a function which maps the input space (X) to the lower dimension output space (Y) using decision trees
Parameters: X : pandas.DataFrame
Input DataFrame. shape = (n_rows, input_features) where each row is a data point and the columns are numeric features
Y : pandas.DataFrame
Output DataFrame. shape = (n_rows, output_features) where output_features < input_features and each row is a data point and the columns are numeric features
n_trees : integer greater than 1, optional
The number of trees to use in construction of the Random Forest model. Default is 8 trees
name : string, optional
The desired name of the model. If None then a random name will be generated
pipeline : string, optional
An extant Pipeline to use for model creation. If None then one will be created
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: model :
RFRegressor
Trained Random Forest Regressor model
Examples
Train a Random Forest Regressor model
>>> sm = SliceMatrix(api_key)
>>> rfr = sm.RFRegressor(X = X, Y = Y, n_trees = 50)
Make a prediction
>>> rfr.predict([...])
Methods
-
predict
(point)¶ Make a prediction using the given input features.
Also used for out of sample manifold learning.
I.e.
- Manifold learning embeds input data from a high dimension (H) into a low dimension (D, D < H); however,
- Many manifold learning algorithms don’t have straightforward out of sample generalizations...
- So learn the “interpolation” function between the high dim space and the low dim space with a multi-output regression
- Regress the high dim (H) data points against their embedding (D) data points to learn the manifold embedding
- When presented with a new data point (an H dimension vector, or tensor, or whatever term is fashionable), “embed” it using the multi-output regression to output a D dimension vector
Parameters: point : list
List of points to use as inputs to a prediction
-
score
()¶ Get the R^2 of the training dataset / predictions
Returns: r2 : float
The R^2 of the training dataset
-
-
class
slicematrixIO.regressors.
RFRegressorPipeline
(name, n_trees=8, client=None)¶ Bases:
slicematrixIO.core.BasePipeline
Random Forest Regression.
Create a Pipeline for training
RFRegressor
models.
Parameters: name : string
The desired name of the Pipeline.
n_trees : integer, greater than 0
The number of trees to use in the regression forest
client :
slicematrixIO.connect.ConnectIO
Low level client for dispatching requests to SliceMatrix-IO
Returns: response : dict
success or failure response to Pipeline creation request
Examples
Create a Pipeline for training multiple
RFRegressor
models
>>> io = ConnectIO(api_key)
>>> pipe = RFRegressorPipeline(name = slicematrixIO.utils.rando_name(), n_trees = 100, client = io)
>>> for X, Y in datasets:
...     current_model = pipe.run(X = X, Y = Y, model = slicematrixIO.utils.rando_name())
Methods
-
run
(X, Y, model)¶ Run the Pipeline and create a new
RFRegressor
model
Parameters: X : pandas.DataFrame
Input DataFrame, shape = (n_rows, input_features)
Y : pandas.DataFrame
Output DataFrame, shape = (n_rows, output_features). X and Y are passed into the Pipeline, which will train a
RFRegressor
model using the parameters defined upon Pipeline creation. Pipelines are reusable sets of instructions to train a machine learning model.
Returns: response : dict
success or failure response to model creation request
-
slicematrixIO.utils module¶
Useful utility functions
-
slicematrixIO.utils.
r_squared
(Y_hat, Y)¶ Get the coefficient of determination, or r-squared, for a given prediction versus its ground truths
Parameters: Y_hat : pandas.DataFrame
The predicted values DataFrame
Y : pandas.DataFrame
The actual values DataFrame
Returns: r_2 : float
The r-squared value
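As a reminder of what the coefficient of determination measures, here is a minimal single-column version of the standard formula, R^2 = 1 - SS_res / SS_tot. The function name is hypothetical, and how the library's r_squared aggregates across multiple DataFrame columns is not specified here, so this should be read as the textbook definition rather than the library's exact computation.

```python
def r_squared_1d(y_hat, y):
    """Coefficient of determination for one column of predictions."""
    mean_y = sum(y) / len(y)
    # Residual sum of squares: error left over after prediction.
    ss_res = sum((actual - pred) ** 2 for actual, pred in zip(y, y_hat))
    # Total sum of squares: spread of the actual values around their mean.
    ss_tot = sum((actual - mean_y) ** 2 for actual in y)
    return 1.0 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared_1d(actual, actual)
baseline = r_squared_1d([2.5, 2.5, 2.5, 2.5], actual)
close = r_squared_1d([1.1, 1.9, 3.2, 3.9], actual)
```

A perfect prediction scores 1, always predicting the mean scores 0, and predictions worse than the mean give a negative value. The formula is undefined when the actual values are all identical, because SS_tot is then zero.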
-
slicematrixIO.utils.
rando_name
(type='short')¶ Generate a random name string
A longer name decreases the chance of an overwrite collision
Parameters: type : string [“short” | “long”]
Whether to create a long or short name
Returns: name : string
Random name
Module contents¶
slicematrixIO-python is the Python SDK for the SliceMatrix-IO Machine Learning Platform as a Service