📈Time Series Regression-based Trading Strategy

In this approach, we'll use the past returns of SPY's component stocks to predict SPY's future return with some degree of accuracy. A simpler task is to predict only one day into the future. This is plausible because, just as in the trend-following strategy, we update our position on a daily basis. That means we can use all the data up to the current time step and predict only the return for the next time step. Our decision to buy, sell, or do nothing is updated every single day, so we only need to worry about predicting one day ahead.
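To make the one-day-ahead setup concrete, here is a minimal sketch of how today's features can be paired with tomorrow's target using pandas `shift(-1)`. The numbers, dates, and the `SPY_next_day` column name are made up for illustration and are not taken from the dataset used in this article.

```python
import pandas as pd

# Hypothetical toy log returns for five consecutive days (made-up numbers)
toy = pd.DataFrame(
    {
        "AAPL": [0.010, -0.020, 0.005, 0.000, 0.015],
        "SPY": [0.004, -0.010, 0.002, 0.001, 0.007],
    },
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

# Features: today's returns. Target: SPY's return on the following day.
features = toy[["AAPL"]]
target = toy["SPY"].shift(-1).rename("SPY_next_day")

aligned = features.join(target)

# The last day has no "next day", so its target is NaN and the row is dropped
aligned = aligned.dropna()
print(aligned)
```

Each remaining row now holds information available at the close of one day alongside the value we want to predict for the next day.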

To feed our model, we'll choose a handful of S&P 500 companies with the largest market caps. (The choice is arbitrary, as we're just exploring.)

A more intuitive approach would be to feed the model all 500-odd companies and let it do the work. However, with that many input features the model is likely to overfit.
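The overfitting risk can be illustrated with a small synthetic experiment. This is only a sketch under stated assumptions: the features and target are pure random noise (not market data), and the sample sizes and feature counts are arbitrary choices. Because there is nothing real to learn, any in-sample fit is spurious.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features = 300, 1000, 200

# Pure-noise features and a pure-noise target: there is nothing real to learn
X = rng.normal(size=(n_train + n_test, n_features))
y = rng.normal(size=n_train + n_test)
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_and_score(k):
    """Least-squares fit with intercept on the first k features."""
    A_train = np.column_stack([np.ones(n_train), X_train[:, :k]])
    A_test = np.column_stack([np.ones(n_test), X_test[:, :k]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return r2(y_train, A_train @ coef), r2(y_test, A_test @ coef)


few_train, few_test = fit_and_score(5)
many_train, many_test = fit_and_score(200)
print(f"  5 features: train R^2 = {few_train:.3f}, test R^2 = {few_test:.3f}")
print(f"200 features: train R^2 = {many_train:.3f}, test R^2 = {many_test:.3f}")
```

With more features the training score goes up even though the data is noise, while the test score deteriorates. That gap is the pattern we want to avoid by keeping the feature set small.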

We will also use returns in lieu of stock prices, because ML models are not good at extrapolation and stock prices generally trend upward over time. If we were to train our model on prices in the range of $100 to $200, it might learn what to do for that range. But if the price rises to $300 in the test set, the model has never seen such values and doesn't know how to respond. Returns, on the other hand, are more or less stationary, which makes them a better candidate than prices for input features.
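The extrapolation problem can be sketched with a synthetic series. The price path below is an assumption made purely for illustration (a small constant drift plus a deterministic oscillation), not real market data, and `fraction_inside` is a hypothetical helper. It measures how much of the test data falls inside the range of values seen during training.

```python
import numpy as np

# Synthetic, deterministic log returns: small positive drift plus an oscillation
t = np.arange(1000)
log_ret = 0.001 + 0.01 * np.sin(t)
prices = 100.0 * np.exp(np.cumsum(log_ret))

# Log returns recovered from prices, as in np.log(prices).diff()
returns = np.diff(np.log(prices))

# First 700 days for training, last 300 for testing
train_p, test_p = prices[:700], prices[700:]
train_r, test_r = returns[:699], returns[699:]


def fraction_inside(test_values, train_values):
    """Share of test values that lie within the training min/max range."""
    inside = (test_values >= train_values.min()) & (test_values <= train_values.max())
    return inside.mean()


price_cov = fraction_inside(test_p, train_p)
ret_cov = fraction_inside(test_r, train_r)
print(f"test prices inside training range:  {price_cov:.0%}")
print(f"test returns inside training range: {ret_cov:.0%}")
```

In this toy setup most test prices lie above anything seen in training, while nearly all test returns fall inside the training range.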

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('sp500_closefull.csv', index_col=0, parse_dates=True)
df.head()

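# Drop missing values instead of forward- or backward-filling them
df.dropna(axis=0, how='all', inplace=True)  # drop rows where every value is missing
df.dropna(axis=1, how='any', inplace=True)  # drop columns with any missing value
df.isna().sum().sum()  # count of remaining missing values (0 after the drops above)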

# Create a new dataframe that contains only the daily log returns
df_returns = np.log(df).diff()

As displayed above, our data contains the daily close prices of various S&P 500 stocks, one row per day. We've preprocessed the dataframe by dropping any rows in which all the values are missing and then removing any columns with missing values.

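# Shift SPY back one day so each row's target is the next day's SPY return
df_returns['SPY'] = df_returns['SPY'].shift(-1)
df_returns['SPY'].tail()

# Hold out the last Ntest rows for testing.
# The first row is skipped (NaN from diff) and the last row is skipped (NaN target from shift)
Ntest = 1000
train = df_returns.iloc[1:-Ntest]
test = df_returns.iloc[-Ntest:-1]
# x_cols = df.columns.drop('SPY')  # alternative: use every other ticker as a feature

# Arbitrarily chosen stocks with high market cap
x_cols = ['AAPL', 'MSFT', 'AMZN', 'JNJ', 'V', 'PG', 'JPM']
x_cols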

# Split features and target into training and test sets
Xtrain = train[x_cols]
Ytrain = train['SPY']
Xtest = test[x_cols]
Ytest = test['SPY']

Xtrain.head()

# Fit a linear regression on the training set

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(Xtrain, Ytrain)

# Score the model (R^2) on the training and test sets to see how close it is to the target
model.score(Xtrain, Ytrain), model.score(Xtest, Ytest)

Output: (0.0082717541782342, -0.011369618185062102)

As you can see, these scores are quite low. A perfect score would be one, and always predicting the average would score zero; we can think of the latter as the naive prediction. On the training set we are only slightly above the naive prediction, and on the test set we get a negative value.

We are actually doing worse than the naive prediction. Nonetheless, this won't stop us from seeing how the model performs as a trading strategy.
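For reference, the score reported by a scikit-learn regressor is the coefficient of determination, R² = 1 − SS_res / SS_tot. The sketch below computes it by hand on a few made-up returns (the numbers and the helper name `r_squared` are illustrative only) to show where the values one, zero, and negative come from.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of our predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    return 1.0 - ss_res / ss_tot


y = np.array([0.01, -0.02, 0.005, 0.015])  # made-up returns

perfect = r_squared(y, y)                        # predictions equal the targets
naive = r_squared(y, np.full_like(y, y.mean()))  # always predict the average
worse = r_squared(y, -y)                         # predictions with the wrong sign

print(perfect, naive, worse)
```

A negative score simply means the squared error of the predictions is larger than the squared error of always predicting the mean of the evaluated data.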

We don't actually care about the value of the prediction; we simply want to know whether it is positive (we buy) or negative (we sell, which in the code below means holding no position).
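As a minimal sketch of that rule, assuming a long-or-flat strategy (position 1 when the prediction is positive, 0 otherwise) and using made-up numbers, the strategy's daily log return is the position multiplied by the realized next-day return:

```python
import numpy as np

# Made-up predicted and realized next-day log returns for four days
predicted = np.array([0.002, -0.001, 0.003, -0.004])
realized = np.array([0.010, -0.020, -0.005, 0.015])

# Long (1) when the prediction is positive, flat (0) otherwise
position = (predicted > 0).astype(int)

# We only earn the realized return on days we hold a position
algo_log_return = position * realized

total_algo = algo_log_return.sum()   # strategy: 0.010 - 0.005
total_buy_hold = realized.sum()      # always long

print(position, total_algo, total_buy_hold)
```

Log returns add up over time, which is why summing the daily values gives the total log return of each approach.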

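# Direction: fraction of days on which the predicted sign matches the actual sign
Ptrain = model.predict(Xtrain)
Ptest = model.predict(Xtest)

np.mean(np.sign(Ptrain) == np.sign(Ytrain)), np.mean(np.sign(Ptest) == np.sign(Ytest))

# Check which signs the model actually predicts
set(np.sign(Ptrain)), set(np.sign(Ptest))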

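# Boolean masks over the full date index for the training and test rows
train_idx = df.index <= train.index[-1]
test_idx = df.index > train.index[-1]

# Exclude the first row (no return) and the last row (no next-day target)
train_idx[0] = False
test_idx[-1] = False

# Position is 1 (long) when the predicted return is positive, otherwise 0 (flat)
df_returns['Position'] = 0 # create new column
df_returns.loc[train_idx,'Position'] = (Ptrain > 0).astype(int)
df_returns.loc[test_idx,'Position'] = (Ptest > 0).astype(int)

# Daily strategy return: the next-day SPY log return, earned only while long
df_returns['AlgoReturn'] = df_returns['Position'] * df_returns['SPY']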

# Total algo log return train
df_returns.iloc[1:-Ntest]['AlgoReturn'].sum()

# Total algo log return test
df_returns.iloc[-Ntest:-1]['AlgoReturn'].sum()


# Total return buy-and-hold train
Ytrain.sum()


# Total return buy-and-hold test
Ytest.sum()


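# Alternative: treat direction as a classification problem with logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=10)
Ctrain = (Ytrain > 0)  # class label: True when the next-day return is positive
Ctest = (Ytest > 0)
model.fit(Xtrain, Ctrain)
# For a classifier, score() returns the accuracy
model.score(Xtrain, Ctrain), model.score(Xtest, Ctest)

Ptrain = model.predict(Xtrain)
Ptest = model.predict(Xtest)
# Check which classes the model actually predicts
set(Ptrain), set(Ptest)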

In the direction check for the linear regression, we call the model's predict function, which gives us Ptrain and Ptest. We then measure their accuracy using the sign function, which maps a positive argument to plus one and a negative argument to minus one. Applying it to both the predictions and the targets and comparing the results gives an array of booleans, whose mean is our classification accuracy.
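The sign-based accuracy can be traced by hand on a tiny example. The values below are made up for illustration; note that `np.sign` returns 0 for an exactly zero input, so a day with a zero return can never count as a match for a nonzero prediction.

```python
import numpy as np

# Made-up predicted and realized returns for five days
predicted = np.array([0.002, -0.001, 0.003, -0.004, 0.001])
realized = np.array([0.010, -0.020, -0.005, 0.015, 0.000])

pred_sign = np.sign(predicted)   # [ 1., -1.,  1., -1.,  1.]
real_sign = np.sign(realized)    # [ 1., -1., -1.,  1.,  0.]

# True where the predicted direction matches the realized direction
hits = pred_sign == real_sign    # [True, True, False, False, False]

# The mean of a boolean array is the fraction of True values
accuracy = hits.mean()
print(accuracy)
```

Here two of the five days match, so the directional accuracy is 0.4.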

