In trading, as in life, it is often extremely valuable to determine whether the current environment is anomalous in some way. If things are behaving "normally," we know our strategies can trade a certain way. For example, in a normal trading environment we might employ a volatility-shorting strategy. On the other hand, if we identify that we are in an abnormally exciting market, it might behoove us to do the exact opposite and seek out opportunities for momentum-based trading, since in that kind of market shorting volatility could be very dangerous.
SliceMatrix-IO offers a number of different options for detecting anomalies in both univariate and multivariate datasets. Today we will explore an anomaly detection algorithm called an Isolation Forest, which can be used on either kind of dataset. It has one parameter, rate, which controls the target rate of anomaly detection: a rate of 0.2 will train the algorithm to flag roughly 1 out of every 5 datapoints as an anomaly on average. The rate must be greater than 0 and less than 0.5.
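As a quick back-of-the-envelope check (plain arithmetic, no API calls), the rate translates directly into an expected number of flagged days over a sample. Here we assume roughly 1300 trading days, about the size of the 2012-2017 dataset we build below:
n_days = 1300  # roughly the number of daily datapoints in our sample (an assumption)
for rate in [0.1, 0.2, 0.33]:
    # expected number of days the detector should flag at this rate
    print(rate, int(n_days * rate))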
Since the Isolation Forest can handle multivariate data, it is ideal for detecting anomalies when you have multiple input features. In our case, the input features will be the daily trading volume for a list of ETF symbols. We will define this microcosm as our "market," although in practice we could potentially make the universe much, much bigger.
symbols = ['SPY', 'IWM', 'DIA', 'IEF', 'TLT', 'GLD', 'SLV', 'USO', 'XIV']
The goal of this algo is to determine when the trading volume for our list of symbols as a whole is in an anomalous state. This could mean, for example, that we are detecting a spike in trading volume. To do this, we begin by importing the SliceMatrix-IO Python client.
If you haven't installed the client yet, the easiest way is with pip:
pip install slicematrixIO
Now we can begin by creating the SliceMatrix-IO client. Make sure to substitute your own api key into the code.
Don't have a key yet? Get your api key here
from slicematrixIO import SliceMatrix
api_key = "insert your api key here"
sm = SliceMatrix(api_key)
Next, let's import some useful Python modules such as pandas, NumPy, and pyplot:
%matplotlib inline
import pandas as pd
# pandas.io.data was deprecated in favor of the pandas_datareader package
from pandas_datareader import data as web
import datetime as dt
import numpy as np
import matplotlib.pyplot as plt
Grab trading volume data from Yahoo for our list of symbols using pandas-datareader:
start = dt.datetime(2012, 1, 1)
end = dt.datetime(2017, 3, 6)
volume = []
closes = []
for symbol in symbols:
    print(symbol)
    vdata = web.DataReader(symbol, 'yahoo', start, end)
    cdata = vdata[['Close']]
    closes.append(cdata)
    vdata = vdata[['Volume']]
    volume.append(vdata)
volume = pd.concat(volume, axis = 1).dropna()
volume.columns = symbols
closes = pd.concat(closes, axis = 1).dropna()
closes.columns = symbols
volume.head()
volume.plot(figsize=(12, 6))
plt.show()
The time series shows significant spikes in trading volume across our ETF universe. Some notable events include the October 2014 Treasury Flash Crash, August 2015's spike in volatility, as well as Donald Trump's election in late 2016. Note the relative quiet at the start of 2017...
While these events are obvious to the naked eye (well after the fact), what would be useful is the ability to automatically classify events based purely on the trading volume data, i.e. without the involvement of a human being. This is the goal of machine learning, and luckily this is exactly the kind of use case our algo, the Isolation Forest, was built to handle.
We'll start by creating 3 anomaly detectors with increasing values of the rate parameter. Remember, this controls how many anomalies each detector will pick up.
# isolation forest multivariate anomaly detector
iso_forest1 = sm.IsolationForest(dataset = volume, rate = 0.1)  # expect a signal on ~1 in 10 days
iso_forest2 = sm.IsolationForest(dataset = volume, rate = 0.2)  # expect a signal on ~1 in 5 days
iso_forest3 = sm.IsolationForest(dataset = volume, rate = 0.33) # expect a signal on ~1 in 3 days
These three models are now training and ready to be used in the cloud.
Now let's get the anomaly scores (i.e. whether or not each trading day was anomalous) from each model. The Isolation Forest returns a score of 1 for normal days and -1 for anomalous trading activity.
scores1 = iso_forest1.training_scores()
scores2 = iso_forest2.training_scores()
scores3 = iso_forest3.training_scores()
scores1 = pd.DataFrame(scores1, columns = ["scores"])
scores2 = pd.DataFrame(scores2, columns = ["scores"])
scores3 = pd.DataFrame(scores3, columns = ["scores"])
scores1.plot(ylim = (-2.0, 2.0))
print(scores1.shape, volume.shape)
Let's make a function to visualize the 3 detectors' performance:
def draw_anomaly_plot(scores, volume, title, lw = 2):
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.set_title(title)
    # plot total market volume (summed across all ETFs) in black
    ax.plot(scores.index.values, volume.sum(axis = 1).values, color='black')
    ax.axhline(0, color='black', lw=2)
    # shade each anomalous day (score < 0) in red
    for i in range(0, scores.shape[0]):
        if scores.iloc[i, 0] < 0:
            plt.axvline(x=i, color='red', alpha=0.25, lw = lw)
    plt.show()
Now we can compare the three models and visualize how the rate parameter affects the probability of detecting an anomaly:
draw_anomaly_plot(scores1, volume, 'Market Volume Anomaly Detection (10%)')
draw_anomaly_plot(scores2, volume, 'Market Volume Anomaly Detection (20%)')
draw_anomaly_plot(scores3, volume, 'Market Volume Anomaly Detection (33%)')
We can see how the number of anomalies detected increases as we ratchet up the rate parameter.
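As a quick sanity check, we can count the flagged days in the score DataFrames we already built and confirm the counts track each model's rate parameter:
# count anomalous days (score of -1) for each detector
for rate, scores in [(0.1, scores1), (0.2, scores2), (0.33, scores3)]:
    n_anomalies = (scores["scores"] < 0).sum()
    print(rate, n_anomalies, round(float(n_anomalies) / len(scores), 3))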
In the models we created above, we used the entire dataset to train our Isolation Forests. In practice, we don't have access to information from the future (at least not with current technology), so we should introduce some realism into our model by splitting the dataset into two chunks: one for training the model (in-sample) and one for validating the model's performance (out-of-sample).
# split the dataset: first 1200 days in-sample, the remainder out-of-sample
volume_training = volume.iloc[0:1200, :]
volume_testing = volume.iloc[1200:, :]
iso_forest_live_model = sm.IsolationForest(dataset = volume_training, rate = 0.2)
This model is trained using only the volume_training dataframe. Now we can use this model to score the out of sample trading volume:
out_of_sample_scores = iso_forest_live_model.score(volume_testing.values.tolist())
out_of_sample_scores = pd.DataFrame(out_of_sample_scores, columns = ['scores'])
out_of_sample_scores.tail()
draw_anomaly_plot(out_of_sample_scores, volume_testing, 'Market Volume Anomaly Detection (Out of Sample)', lw = 7)
One of the strengths of using SliceMatrix-IO is that your models persist in the cloud after you train them, meaning you can load your models from any device anywhere on the planet with an internet connection.
For example, suppose we have one process which trains the models (e.g. using the code above) and another process which runs during live trading which does the anomaly scoring. Each model has an attribute called name which describes the unique id for that model:
print(iso_forest_live_model.name)
We can easily load this model in another process using the lazy load feature:
# in another process
iso_forest_live_model = sm.IsolationForest(name = "db51b214a67c")
# when we get a new data point we want to score...
iso_forest_live_model.score([[66650800, 30445200, 2580800, 2469300, 9460700, 10536000, 8681600, 13807500, 8518800]])
This way you can use SliceMatrix-IO to easily and quickly create real-time machine learning models for trading anywhere on the globe.
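To make that concrete, here is a minimal sketch of what a daily live-scoring process might look like. Note that get_latest_volumes() is a hypothetical placeholder for however you source fresh volume data in production, and we assume score() returns a list of scores just as in the out-of-sample example above:
import time

def get_latest_volumes():
    # hypothetical placeholder: return the latest daily volumes for our 9
    # symbols, in the same column order used to train the model
    raise NotImplementedError

live_model = sm.IsolationForest(name = "db51b214a67c")  # lazy load the trained model

while True:
    latest = get_latest_volumes()          # e.g. [66650800, 30445200, ...]
    score = live_model.score([latest])[0]  # 1 = normal, -1 = anomalous
    if score < 0:
        print("anomalous volume regime detected")
    time.sleep(60 * 60 * 24)               # re-score once per day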
Don't have a SliceMatrix-IO api key yet? Get your api key here