Question: Does my logistic regression lose accuracy because of regime shift or because of insufficiently varied training data?

In [1]:
import os
os.chdir('../src')
In [3]:
import numpy as np
import pandas as pd
from sklearn import svm, model_selection, linear_model, preprocessing, metrics
import matplotlib.pyplot as plt
from matplotlib import colors
import seaborn as sns
In [4]:
df = pd.read_hdf('../data/features/technical.h5')
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-4-332ade095a81> in <module>
----> 1 df = pd.read_hdf('../data/features/technical.h5')

~/.conda/envs/crypto/lib/python3.7/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, mode, **kwargs)
    364         if not exists:
    365             raise compat.FileNotFoundError(
--> 366                 'File {path} does not exist'.format(path=path_or_buf))
    367 
    368         store = HDFStore(path_or_buf, mode=mode, **kwargs)

FileNotFoundError: File ../data/features/technical.h5 does not exist
In [17]:
# standardize every feature except the target and SQZMI, then append SQZMI unscaled
scaler = preprocessing.StandardScaler()
x = df.drop(['target', 'SQZMI'], axis=1)
x = scaler.fit_transform(x)
x = np.hstack([x, np.expand_dims(df.SQZMI, axis=1)])

# ^ the hstack puts SQZMI in the last column, so keynames must follow the same order
keynames = df.keys().drop(['target', 'SQZMI']).tolist()
keynames.append('SQZMI')
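An alternative that avoids re-ordering keynames by hand (just a sketch, assuming the same df as above) is scikit-learn's ColumnTransformer; with remainder='passthrough' it emits the scaled columns first and appends the untouched SQZMI column last, the same layout as the hstack above.
In [ ]:
from sklearn.compose import ColumnTransformer

# scale every feature except SQZMI; pass SQZMI through unchanged
scale_cols = [c for c in df.columns if c not in ('target', 'SQZMI')]
ct = ColumnTransformer(
    [('scale', preprocessing.StandardScaler(), scale_cols)],
    remainder='passthrough'  # SQZMI is appended after the scaled columns
)
x_alt = ct.fit_transform(df.drop('target', axis=1))
keynames_alt = scale_cols + ['SQZMI']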
In [36]:
def score_logistic(n):
    # fit on a fixed 5000-row window, then score on the n rows immediately before it
    model = linear_model.LogisticRegression(solver='liblinear')
    model.fit(x[40000:45000], df.target[40000:45000])
    print(n, model.score(x[40000-n:40000], df.target[40000-n:40000]))
In [ ]:
for n in [40000, 20000, 10000, 5000, 2500, 2000, 1500, 1250, 500, 700, 250, 125, 70, 50, 25, 10, 5, 1]:
    score_logistic(n)
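The printed numbers are hard to eyeball; a small variant (a sketch reusing x and df.target from above) fits the model once and plots accuracy against the look-back length n instead of refitting inside the loop.
In [ ]:
# fit once on the fixed 40000:45000 training slice, then score on the n rows just before it
model = linear_model.LogisticRegression(solver='liblinear')
model.fit(x[40000:45000], df.target[40000:45000])

ns = [1, 5, 10, 25, 50, 70, 125, 250, 500, 700, 1250, 1500, 2000, 2500, 5000, 10000, 20000, 40000]
scores = [model.score(x[40000 - n:40000], df.target[40000 - n:40000]) for n in ns]

plt.plot(ns, scores, marker='o')
plt.xscale('log')
plt.xlabel('n (rows scored immediately before the training slice)')
plt.ylabel('accuracy')
plt.show()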
In [39]:
linear_model.LogisticRegression(solver='liblinear')
Out[39]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)

This is insane. Accuracy is extremely sensitive to the evaluation window, which suggests the optimal training set is itself highly window-dependent.
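One way to start separating the two explanations (a sketch, assuming x and df.target are aligned in time and the 45000:50000 slice is a hypothetical held-out period): train one model on a contiguous recent window and another on the same number of rows sampled uniformly from all earlier history, then score both on the same held-out period. If the varied sample holds up noticeably better, a lack of varied training data is the more likely culprit; if both degrade as the test period moves further away in time, that points to regime shift.
In [ ]:
rng = np.random.RandomState(0)

# hypothetical held-out period just after the training slice
x_test = x[45000:50000]
y_test = df.target.iloc[45000:50000]

# model A: fit on a contiguous recent window
model_recent = linear_model.LogisticRegression(solver='liblinear')
model_recent.fit(x[40000:45000], df.target.iloc[40000:45000])

# model B: fit on the same number of rows sampled uniformly from all earlier history
idx = rng.choice(45000, size=5000, replace=False)
model_varied = linear_model.LogisticRegression(solver='liblinear')
model_varied.fit(x[idx], df.target.iloc[idx])

print('recent window :', model_recent.score(x_test, y_test))
print('varied sample :', model_varied.score(x_test, y_test))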