SliceMatrix-IO

Detecting Stock Market Anomalies Part 1:

In trading as in life, it is often extremely valuable to determine whether or not the current environment is anomalous in some way. If things are acting "normal", we know our strategies can trade a certain way. For example, if we are in a normal trading environment we might employ a volatility shorting strategy. On the other hand, if we identify that we are in an abnormally exciting market, it might behoove us to employ a strategy which does the exact opposite: seeking out opportunities for momentum based trading, for example. In that kind of market, shorting volatility could be very dangerous.

SliceMatrix-IO offers a number of different options for detecting anomalies in both univariate and multivariate datasets. Today we will explore an anomaly detection algorithm called the Isolation Forest. This algorithm can be used on either univariate or multivariate datasets. It has one parameter, rate, which controls the target rate of anomaly detection, i.e. a rate equal to 0.2 will train the algorithm to detect anomalies in 1 out of 5 data points on average. The rate must be greater than 0 and less than 0.5.
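To build some intuition before calling the API, here is a minimal local sketch of the same algorithm using scikit-learn's IsolationForest. This is not the SliceMatrix-IO API; scikit-learn's contamination parameter plays the role of rate here, and the toy data is made up for illustration:

# A local illustration only, using scikit-learn rather than SliceMatrix-IO.
# contamination is analogous to the rate parameter described above.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
normal_days = rng.normal(0.0, 1.0, size=(190, 2))  # typical volume behavior
spike_days  = rng.normal(6.0, 1.0, size=(10, 2))   # a handful of volume spikes
X = np.vstack([normal_days, spike_days])

clf = IsolationForest(contamination=0.05, random_state=0)
labels = clf.fit_predict(X)  # returns 1 for normal points, -1 for anomalies
print("flagged %d of %d points" % ((labels == -1).sum(), len(X)))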

Since the Isolation Forest can handle multivariate data, it is ideal for detecting anomalies when you have multiple input features. In our case, our input features will be the daily trading volume for a list of ETF symbols. We will define this microcosm as our "market", although in practice we could potentially make the universe much, much bigger.

In [1]:
symbols = ['SPY', 'IWM', 'DIA', 'IEF', 'TLT', 'GLD', 'SLV', 'USO', 'XIV']

The goal of this algo is to determine when the trading volume for our list of symbols as a whole is in an anomalous state. This could mean, for example, that we are detecting a spike in trading volume. To do this, we begin by importing the SliceMatrix-IO Python client.

If you haven't installed the client yet, the easiest way is with pip:

pip install slicematrixIO

Now we can begin by creating the SliceMatrix-IO client. Make sure to substitute your own api key into the code.

Don't have a key yet? Get your api key here

In [2]:
from slicematrixIO import SliceMatrix

api_key = "insert your api key here"
sm = SliceMatrix(api_key)

Next, let's import some useful Python modules such as Pandas, NumPy, and Matplotlib's pyplot.

In [3]:
%matplotlib inline
import pandas as pd
# pandas.io.data is deprecated; pandas_datareader is its replacement
from pandas_datareader import data as web
import datetime as dt
import numpy as np
import matplotlib.pyplot as plt

Grab trading volume data from Yahoo for our list of symbols using pandas-datareader.

In [4]:
start = dt.datetime(2012, 1, 1)
end = dt.datetime(2017, 3, 6)

volume = []
closes = []
for symbol in symbols:
    print(symbol)
    vdata = web.DataReader(symbol, 'yahoo', start, end)
    cdata = vdata[['Close']]
    closes.append(cdata)
    vdata = vdata[['Volume']]
    volume.append(vdata)
    
volume = pd.concat(volume, axis = 1).dropna()
volume.columns = symbols
closes = pd.concat(closes, axis = 1).dropna()
closes.columns = symbols
SPY
IWM
DIA
IEF
TLT
GLD
SLV
USO
XIV
In [5]:
volume.head()
Out[5]:
SPY IWM DIA IEF TLT GLD SLV USO XIV
Date
2012-01-03 193697900 60504700 7175100 1297700 9076900 13385800 28140300 12369900 5366800
2012-01-04 127186500 34648500 7625200 1789000 8417100 11549700 18062600 13812800 6686900
2012-01-05 173895000 57274600 8678900 1311300 6465800 11621600 13858900 11799600 4373600
2012-01-06 148050000 45499800 7488600 998200 7348500 9790500 20679500 9760600 5765800
2012-01-09 99530200 52042400 5881800 379900 5582400 8771900 11638200 7509300 3306600
In [6]:
volume.plot(figsize=(12, 6))
plt.show()

The time series of volume has significant spikes in trading volume across our ETF universe. Some notable events include the October 2014 Treasury Flash Crash, August 2015's spike in volatility, as well as Donald Trump's election in late 2016. Note the relative quiet at the start of 2017...

While these events are obvious to the naked eye (well after the fact), what would be useful is the ability to automatically classify events based purely on the trading volume data, i.e. without the use of a human being. This is the goal of machine learning, and luckily this is exactly the kind of use case our algo, the Isolation Forest, was built to handle.

We'll start by creating 3 anomaly detectors with increasing values of the rate parameter. Remember, this controls how many anomalies each detector will pick up.

In [7]:
# isolation forest multivariate anomaly detector
iso_forest1 = sm.IsolationForest(dataset = volume, rate = 0.1)  # want signal every 1 / 10 days on average
iso_forest2 = sm.IsolationForest(dataset = volume, rate = 0.2)  # want signal every 1 / 5 days on average
iso_forest3 = sm.IsolationForest(dataset = volume, rate = 0.33) # want signal every 1 / 3 days on average

These three models are now trained and ready to be used in the cloud.

Now let's get the anomaly scores (i.e. whether or not each trading day was anomalous) from each model. The Isolation Forest returns a score of 1 for normal days and -1 for anomalous trading activity.

In [8]:
scores1 = iso_forest1.training_scores()
scores2 = iso_forest2.training_scores()
scores3 = iso_forest3.training_scores()
scores1 = pd.DataFrame(scores1, columns = ["scores"])
scores2 = pd.DataFrame(scores2, columns = ["scores"])
scores3 = pd.DataFrame(scores3, columns = ["scores"])
In [9]:
scores1.plot(ylim = (-2.0, 2.0))
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7af9970>
In [10]:
print(scores1.shape, volume.shape)
(1301, 1) (1301, 9)
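
Since the scores come back in the same row order as the training data (note the matching shapes above), we can optionally re-attach the trading dates to see exactly which days were flagged:

# work on a copy so the integer index used by the plotting function
# below is preserved
dated_scores = scores1.copy()
dated_scores.index = volume.index
print(dated_scores[dated_scores["scores"] < 0].head())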

Let's write a function to visualize the 3 detectors' performance.

In [11]:
import matplotlib.collections as collections
In [12]:
def draw_anomaly_plot(scores, volume, title, lw = 2):
    fig, ax = plt.subplots(figsize=(12, 6))
    ax.set_title(title)

    # plot total market volume summed across all symbols
    ax.plot(scores.index.values, volume.sum(axis = 1), color='black')
    ax.axhline(0, color='black', lw=2)

    # shade each day the detector flagged as anomalous (score < 0)
    for i in range(0, scores.shape[0]):
        if scores.iloc[i, 0] < 0:
            plt.axvline(x=i, color='red', alpha=0.25, lw = lw)

    plt.show()

Now we can compare the three models and visualize how the rate parameter affects the probability of detecting an anomaly.

In [13]:
draw_anomaly_plot(scores1, volume, 'Market Volume Anomaly Detection (10%)')
In [14]:
draw_anomaly_plot(scores2, volume, 'Market Volume Anomaly Detection (20%)')
In [15]:
draw_anomaly_plot(scores3, volume, 'Market Volume Anomaly Detection (33%)')

We can see how the number of anomalies detected increases as we ratchet up the rate parameter.
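As a quick sanity check, we can also compare each detector's realized anomaly fraction to its target rate, reusing the scores computed above:

# fraction of days flagged anomalous; should land near each target rate
for rate, scores in [(0.1, scores1), (0.2, scores2), (0.33, scores3)]:
    realized = (scores["scores"] < 0).mean()
    print("target rate: %.2f, realized: %.3f" % (rate, realized))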

In the models we created above, we used the entire dataset to train our Isolation Forests. In practice, we don't have access to information in the future (at least not with current technology), so we should introduce some reality into our model by splitting the dataset into two chunks: one for training the model (in-sample) and one for validating the model's performance (out-of-sample).

In [16]:
# split the dataset into in-sample (training) and out-of-sample (testing) chunks
volume_training = volume.iloc[:1201, :]
volume_testing  = volume.iloc[1201:, :]
In [17]:
iso_forest_live_model = sm.IsolationForest(dataset = volume_training, rate = 0.2)

This model is trained using only the volume_training dataframe. Now we can use this model to score the out of sample trading volume:

In [18]:
out_of_sample_scores  = iso_forest_live_model.score(volume_testing.values.tolist())
In [19]:
out_of_sample_scores  = pd.DataFrame(out_of_sample_scores, columns = ['scores'])
out_of_sample_scores.tail()
Out[19]:
scores
95 1
96 1
97 1
98 1
99 1
In [20]:
draw_anomaly_plot(out_of_sample_scores, volume_testing, 'Market Volume Anomaly Detection (Out of Sample)', lw = 7)

One of the strengths of using SliceMatrix-IO is that your models persist in the cloud after you train them, meaning you can load your models from any device anywhere on the planet with an internet connection.

For example, suppose we have one process which trains the models (e.g. using the code above) and another process, running during live trading, which does the anomaly scoring. Each model has an attribute called name which contains the unique id for that model:

In [21]:
print(iso_forest_live_model.name)
db51b214a67c

We can easily load this model in another process using the lazy load feature:

In [23]:
# in another process
iso_forest_live_model = sm.IsolationForest(name = "db51b214a67c")

# when we get a new data point we want to score...
iso_forest_live_model.score([[66650800, 30445200, 2580800, 2469300, 9460700, 10536000, 8681600, 13807500, 8518800]])
Out[23]:
[1]

This way you can use SliceMatrix-IO to easily and quickly create real-time machine learning models for trading anywhere on the globe.
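
For example, a live scoring process might look something like the sketch below. Note that get_latest_volume_row() is a hypothetical placeholder for whatever data feed you use; only the lazy load and score() calls are the SliceMatrix-IO API shown above:

import time

def get_latest_volume_row():
    # Hypothetical placeholder: fetch the latest daily volume for each of
    # the nine symbols from your data vendor, in the same column order
    # used to train the model.
    raise NotImplementedError

# lazy load the persisted model by its unique name
live_model = sm.IsolationForest(name = "db51b214a67c")

while True:
    label = live_model.score([get_latest_volume_row()])[0]  # 1 = normal, -1 = anomaly
    if label == -1:
        print("volume anomaly detected, consider switching strategies")
    time.sleep(60 * 60 * 24)  # wait for the next trading day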

Don't have a SliceMatrix-IO api key yet? Get your api key here
