Simulating Correlated Random Walks for the S&P 500

Previously I showed how to simulate correlated random walks using copulas on my blog, MKTSTK.com.

Back then I was thinking mostly about the application to pairs trading, because one limitation of that implementation was that it could only simulate two random variables at a time. If you wanted to cover a large universe like the S&P 500, you had to do everything pairwise, and you still weren't capturing any higher-dimensional relationships within the market.

The solution I found is another, more intuitive model: the Kernel Density Estimator (KDE). This method allows for the simulation of an arbitrary number of random variables, such as 500 at once, which makes it much more flexible for simulating trading strategies and modeling market risk.
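To make the mechanics concrete before diving in, here is a minimal local sketch of the idea using scikit-learn's KernelDensity. This is a guess at the general approach (a Gaussian kernel fit to the joint distribution of daily returns), not SliceMatrix-IO's actual implementation, and the bandwidth value is purely illustrative:

# a minimal sketch of KDE-based simulation -- an illustrative stand-in,
# NOT SliceMatrix-IO's internal implementation
import numpy as np
from sklearn.neighbors import KernelDensity

def simulate_returns(returns, n_days, bandwidth=0.005):
    # fit a Gaussian KDE to the joint distribution of daily returns
    # (returns: array of shape [n_observations, n_assets])
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(returns)
    # draw n_days fresh return vectors from the estimated joint density
    return kde.sample(n_days)

Because the kernel is fit jointly across all assets, each sampled vector preserves the cross-asset dependence structure with no pairwise copula bookkeeping required.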

Outline of notebook:

  1. Get S&P 500 data
  2. Train a KDE on the data
  3. Simulate a year's worth of data
  4. Compare correlation matrices
    • heatmaps
    • correlation filtered graphs

Est time to run: < 5 mins

To begin, let's fire up a client in our data center. Remember to substitute your API key in the code below.

Don't have a key yet? Get your API key here

In [1]:
from slicematrixIO import SliceMatrix

api_key = "YOUR_API_KEY"  # substitute your own API key here
sm = SliceMatrix(api_key)

Next let's import some useful packages...

In [2]:
import pandas as pd
from pandas_datareader import data as web
import datetime as dt
import numpy as np

I want to look in particular at the last year's worth of price data. I'm going to download the list of symbols for the S&P 500 from Quandl. It's cool that we can do this directly from Amazon's S3 using pandas:

In [3]:
start = dt.datetime(2016, 3, 15)
end = dt.datetime(2017, 3, 15)

data = pd.read_csv("https://s3.amazonaws.com/static.quandl.com/tickers/SP500.csv")

Then we can download the daily price data from Yahoo. Since there are so many symbols (and some will fail because the symbol list is a few months stale), we'll print out successes and failures...

In [4]:
# download daily price data from Yahoo for each ticker,
# keeping track of which symbols succeed
volume = []
closes = []
good_tickers = []
for ticker in data['ticker'].values.tolist():
    print ticker,
    try:
        vdata = web.DataReader(ticker, 'yahoo', start, end)
        closes.append(vdata[['Close']])
        volume.append(vdata[['Volume']])
        good_tickers.append(ticker)
    except Exception:
        # stale / delisted symbols fail; mark them with an "x"
        print "x",
MMM ABT ABBV ACN ACE x ATVI ADBE ADT AAP AES AET AFL AMG A GAS x APD ARG AKAM AA AGN ALXN ALLE ADS ALL GOOGL GOOG ALTR x MO AMZN AEE AAL AEP AXP AIG AMT AMP ABC AME AMGN APH APC ADI AON APA AIV AAPL AMAT ADM AIZ T ADSK ADP AN AZO AVGO AVB AVY BHI BLL BAC BK BCR BXLT BAX BBT BDX BBBY BRK-B BBY BIIB BLK HRB BA BWA BXP BSX BMY BRCM x BF-B CHRW CA CVC COG CAM x CPB COF CAH HSIC KMX CCL CAT CBG CBS CELG CNP CTL CERN CF SCHW CHK CVX CMG CB CI XEC CINF CTAS CSCO C CTXS CLX CME CMS COH KO CCE CTSH CL CPGX x CMCSA CMCSK x CMA CSC CAG COP CNX ED STZ GLW COST CCI CSX CMI CVS DHI DHR DRI DVA DE DLPH DAL XRAY DVN DO DFS DISCA DISCK DG DLTR D DOV DOW DPS DTE DD DUK DNB ETFC EMN ETN EBAY ECL EIX EW EA EMC EMR ENDP ESV ETR EOG EQT EFX EQIX EQR ESS EL ES EXC EXPE EXPD ESRX XOM FFIV FB FAST FDX FIS FITB FSLR FE FISV FLIR FLS FLR FMC FTI F FOSL BEN FCX FTR GME GPS GRMN GD GE GGP GIS GM GPC GNW GILD GS GT GWW HAL HBI HOG HAR HRS HIG HAS HCA HCP HP HES HPQ HD HON HRL HST HCBK x HUM HBAN ITW IR INTC ICE IBM IP IPG IFF INTU ISRG IVZ IRM JEC JBHT JNJ JCI JPM JNPR KSU K KEY GMCR x KMB KIM KMI KLAC KSS KHC KR LB LLL LH LRCX LM LEG LEN LVLT LUK LLY LNC LLTC LMT L LOW LYB MTB MAC M MNK MRO MPC MAR MMC MLM MAS MA MAT MKC MCD MHFI x MCK MJN WRK MDT MRK MET KORS MCHP MU MSFT MHK TAP MDLZ MON MNST MCO MS MOS MSI MUR MYL NDAQ NOV NAVI NTAP NFLX NWL NFX NEM NWSA NWS NEE NLSN NKE NI NBL JWN NSC NTRS NOC NRG NUE NVDA ORLY OXY OMC OKE ORCL OI PCAR PH PDCO PAYX PYPL PNR PBCT POM x PEP PKI PRGO PFE PCG PM PSX PNW PXD PBI PCL x PNC RL PPG PPL PX PCP x PCLN PFG PG PGR PLD PRU PEG PSA PHM PVH QRVO PWR QCOM DGX RRC RTN O RHT REGN RF RSG RAI RHI ROK COL ROP ROST RCL R CRM SNDK SCG SLB SNI STX SEE SRE SHW SIAL x SIG SPG SWKS SLG SJM SNA SO LUV SWN SE STJ SWK SPLS SBUX HOT STT SRCL SYK STI SYMC SYY TROW TGT TEL TE x TGNA THC TDC TSO TXN TXT HSY TRV TMO TIF TWX TWC x TJX TMK TSS TSCO RIG TRIP FOXA FOX TSN TYC USB UA UNP UAL UNH UPS URI UTX UHS UNM URBN VFC VLO VAR VTR VRSN VRSK VZ VRTX VIAB V VNO VMC WMT WBA DIS WM WAT ANTM WFC HCN WDC WU WY WHR WFM WMB WEC WYN WYNN XEL XRX XLNX XL XYL YHOO YUM ZBH ZION ZTS

Now we can combine the individual stocks' closing prices into one dataframe, then take the log differences so that we are simulating price changes:

In [5]:
closes = pd.concat(closes, axis = 1)
closes.columns = good_tickers

diffs = np.log(closes).diff().dropna(axis = 0, how = "all").dropna(axis = 1, how = "any")
diffs.head()
Out[5]:
MMM ABT ABBV ACN ATVI ADBE AAP AES AET AFL ... XEL XRX XLNX XL XYL YHOO YUM ZBH ZION ZTS
Date
2016-03-16 0.002951 0.006993 0.012329 0.004845 0.009499 0.015315 0.002393 0.037572 0.004332 0.014676 ... 0.010317 0.002837 0.012459 -0.004748 0.012766 0.022299 0.005354 -0.000479 -0.001209 -0.006406
2016-03-17 0.005266 0.002734 -0.017738 0.008975 -0.005811 0.020098 -0.006437 0.012217 -0.025101 0.014780 ... 0.014075 0.010334 -0.012884 0.016107 0.009789 0.007908 0.001778 -0.006447 0.008432 -0.018459
2016-03-18 0.009603 0.012334 0.022344 -0.007304 -0.012346 0.037740 -0.002980 0.007775 0.018463 0.002049 ... -0.010659 0.005592 0.009100 0.012865 0.011723 0.025631 -0.006111 0.008842 0.007965 0.028297
2016-03-21 -0.002968 -0.000736 -0.009949 0.003612 -0.004981 -0.009897 0.007213 -0.006908 -0.010713 -0.003786 ... -0.000244 -0.009337 0.000421 0.001630 -0.001268 0.008494 0.014201 -0.001245 0.007115 0.018429
2016-03-22 0.000425 0.006602 0.026255 -0.003056 0.015485 0.000648 -0.002652 0.000000 0.009914 0.000948 ... -0.006844 0.003745 0.001683 0.002170 -0.000507 -0.001693 -0.005681 0.000287 -0.001577 0.026788

5 rows × 479 columns

SliceMatrix-IO will use the daily log price change data to train a Kernel Density Estimator:

In [6]:
kde = sm.KernelDensityEstimator(dataset = diffs)

From here it's easy to simulate a year's worth of trading data (there are approximately 250 trading days in a year):

In [7]:
sim_data = kde.simulate(250)
In [8]:
sim_data = pd.DataFrame(sim_data, index = diffs.columns).T
In [9]:
sim_data.head()
Out[9]:
MMM ABT ABBV ACN ATVI ADBE AAP AES AET AFL ... XEL XRX XLNX XL XYL YHOO YUM ZBH ZION ZTS
0 0.006159 0.007662 0.009268 0.012846 0.029493 -0.002194 0.008525 -0.000653 -0.000636 0.018822 ... 0.007127 -0.030757 0.001632 -0.011515 0.014036 0.014358 -0.013102 0.002392 0.002477 0.009678
1 -0.002018 0.014994 -0.010243 0.024916 -0.009707 0.012144 0.004570 0.025307 0.019325 -0.000195 ... 0.012077 0.007151 0.003383 -0.026380 0.005627 0.028981 0.012425 0.002630 -0.025116 -0.004304
2 0.000193 -0.012505 -0.007895 0.006504 -0.000856 0.009518 -0.007657 -0.019319 -0.010091 -0.016492 ... 0.007651 -0.019429 -0.014360 0.002130 -0.012713 -0.027744 0.012755 0.023383 -0.021362 -0.008379
3 0.012992 0.016534 -0.002165 0.026241 -0.016912 -0.037794 0.043180 -0.027637 0.003202 0.019111 ... -0.003606 -0.015081 -0.010234 0.001233 0.016993 -0.021139 -0.060853 -0.012675 0.030996 -0.017149
4 0.028327 0.021744 0.034191 0.026641 0.063421 0.033316 -0.003978 -0.016046 0.028354 0.013191 ... -0.012433 0.016448 0.055355 0.038266 0.053153 0.013385 0.048467 0.014084 0.060291 0.024805

5 rows × 479 columns
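As a quick aside, if you want simulated price paths rather than log changes, you can exponentiate the cumulative sum of the simulated series. A minimal sketch, assuming each path starts from the last observed close (the variable names here are my own):

# reconstruct price paths from simulated log changes
# (assumes each path starts at the last observed closing price)
last_close = closes[sim_data.columns].iloc[-1]
sim_prices = np.exp(sim_data.cumsum()) * last_close  # one price path per symbol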

Traditionally, the humble heatmap has been the visualization of choice for traders and quants who want to visualize correlation matrices. While the heatmap is good for rendering small to moderately sized matrices, once you get to hundreds of names it becomes hard to tell what's going on. For example:

In [10]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
In [11]:
f, ax = plt.subplots(figsize=(12, 9))

sns.heatmap(diffs.corr(), vmax=.8, square=True)
plt.show()
[Output: heatmap of the correlation matrix for all 479 symbols]

This produces a static image that is hard to get looking right in a confined space. At this scale you lose any feel for the overall structure and for which symbols are related to which.

Luckily, there are better tools at our disposal for visualizing larger matrices: network graph models. One model that is particularly useful for analyzing time series data like the stock market is the Correlation Filtered Graph. It is driven by the correlation matrix of the input data: the correlations are transformed into distances, and a nearest neighbor graph is constructed from that distance matrix.
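SliceMatrix-IO builds this model server-side, but a rough local sketch of one common recipe can make the construction concrete: map correlations to the metric distance d = sqrt(2(1 - rho)) and connect each node to its k nearest neighbors. This is an assumption about the general idea, not SliceMatrix's exact algorithm; networkx and the helper function below are my own illustration:

# a rough sketch of a correlation filtered graph -- an assumption about the
# general construction, NOT SliceMatrix-IO's exact algorithm
import numpy as np
import networkx as nx

def correlation_filtered_graph(returns, k=3):
    corr = returns.corr()
    # transform correlation into a metric distance: d = sqrt(2 * (1 - rho))
    dist = np.sqrt(2.0 * (1.0 - corr))
    graph = nx.Graph()
    for ticker in corr.columns:
        # connect each symbol to its k nearest neighbors in distance space
        nearest = dist[ticker].drop(ticker).nsmallest(k)
        for neighbor in nearest.index:
            graph.add_edge(ticker, neighbor, weight=corr.loc[ticker, neighbor])
    return graph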

The resulting graph can be plotted. From there we can verify that the simulated data is indeed representative of the original (real / observed) dataset.

Let's create two graphs, one using the real data and one using the simulated data:

In [12]:
cfg_real = sm.CorrelationFilteredGraph(dataset = diffs.T)
cfg_sim  = sm.CorrelationFilteredGraph(dataset = sim_data.T)

slicematrixIO-python has a module for rendering network graphs inside Jupyter notebooks like this one. To begin, let's do some initial setup for graphing the network models.

In [13]:
from slicematrixIO.notebook import GraphEngine
viz = GraphEngine(sm)
initializing window.graph_data
In [14]:
viz.init_style()
Out[14]:
In [15]:
viz.init_data()
Out[15]:

Now we can render the graph models using a D3 force directed network graph visualization:

In [17]:
viz.drawNetworkGraph(cfg_real, width = 950, height = 950, min_node_size = 6, charge = -50, color_map = "Winter", graph_layout = "force", label_color = "#000", graph_style = "white")
Out[17]: [interactive force-directed network graph of the real data]
In [18]:
viz.drawNetworkGraph(cfg_sim, width = 950, height = 950, min_node_size = 6, charge = -50, color_map = "Winter", graph_layout = "force", label_color = "#000", graph_style = "white")
Out[18]: [interactive force-directed network graph of the simulated data]

Visually we can confirm the same basic shape and structure in both the global graph and the local clusters. We can also drill down into each graph and verify that the same basic correlation structure is preserved:

In [20]:
cfg_real.neighborhood("AAPL")
Out[20]:
{u'AVGO': {u'weight': 0.48320809630633565},
 u'GOOG': {u'weight': 0.4263563496717201},
 u'GOOGL': {u'weight': 0.42263098020664097}}
In [21]:
cfg_sim.neighborhood("AAPL")
Out[21]:
{u'AVGO': {u'weight': 0.47731219367886335},
 u'GOOG': {u'weight': 0.4390895656012743},
 u'GOOGL': {u'weight': 0.43969453781368845}}
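Beyond eyeballing the graphs, we can also compare the two correlation matrices numerically. A quick sketch (this check is my own addition, not part of the SliceMatrix API):

# compare the real and simulated correlation matrices numerically
real_corr = diffs.corr().values
sim_corr = sim_data.corr().values
print "Frobenius norm of difference:", np.linalg.norm(real_corr - sim_corr)
print "mean absolute difference:", np.abs(real_corr - sim_corr).mean()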
