import os
os.chdir('../src')  # run from src/ so the relative ../data paths below resolve
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
Theoretically, optimizing how accurately we predict the log-return should be equivalent to optimizing the expected log-return, and thus equivalent to optimizing the returns. So, return-wise, it might be a better idea to use regression.
featureset = 'indicators1'
df = pd.read_hdf('../data/features/technical.h5', key=featureset)
# Attach the regression predictions for this feature set and drop rows without them.
df['preds'] = pd.read_hdf('../data/predictions/regression_{}.h5'.format(featureset), key='predictions')
df = df.dropna()
# Feature importances from the same regression run.
importances = pd.read_hdf('../data/predictions/regression_{}.h5'.format(featureset), key='importances')
# Accuracy of the prediction's sign against the sign of the target.
metrics.accuracy_score(df.target > 0, df.preds > 0)
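As a rough check of the claim at the top that better log-return predictions should translate into returns, here is the average log-return of simply trading the sign of each prediction; this assumes `target` is the forward log-return over the prediction horizon, so treat it as a sketch rather than a backtest.
# Mean log-return per step when going long/short on the sign of the prediction
# (assumes df.target is the forward log-return over the prediction horizon).
(np.sign(df.preds) * df.target).mean()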
df.preds.std()
# Count, per column, the rolling windows with zero standard deviation (i.e. constant stretches).
s = (df.rolling(100).std().dropna() == 0).sum()
s[s != 0]
importances.sort_values()
# Compare a slice of the target to the predictions scaled back up by the target's rolling std.
df.target[30000:30500].plot()
(df.preds * df.target.rolling(25).std())[30000:30500].plot()
Nice predictor we got here. At least it's not overfitting. The little accuracy we did get was due to dividing the target by the rolling standard deviation.
df.preds.rolling(100).std().plot()
prices = pd.read_json('../data/raw/BTC_ETH.json')
prices.close.plot()
# Log price difference to the next candle; only its square is used below, so the sign convention does not matter.
prices['log_diff'] = np.log(prices.close) - np.log(prices.close.shift(-1))
ax = plt.gca()
ax.set(xlim=(1, 60000), ylim=(0, 0.00001))
# Squared log-returns per unit of traded volume, smoothed over 100 periods.
(np.square(prices.log_diff) / prices.volume).rolling(100).mean().plot()
My hypothesis is that total volume in bitcoin and volatility are highly correlated, and that the graph above shows that Poloniex had a smaller share of the total volume at the beginning and at the end, which is why the quotient doesn't perfectly match up.
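As a rough check of that hypothesis with the data already loaded (Poloniex volume only, so it cannot see the exchange's changing market share), the plain correlation between squared log-returns and traded volume can be computed like this:
# Correlation between squared log-returns (a volatility proxy) and traded volume on this pair.
np.square(prices.log_diff).corr(prices.volume)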
prices.volume.rolling(500).mean().plot()
Through which lens can we look at the data such that it has the nicest statistical properties possible? This question is especially crucial for such notoriously noisy and unpredictable data. The goal here is to do regression, but if you look at the regression notebook you will see that the price norms are all over the place and the predictions sit very close to the mean. This changes if you divide the target by the rolling standard deviation: the accuracy on the sign of the prediction goes from 51.4% to 53.2%. Doing the same to the features makes no difference.
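A minimal sketch of that target rescaling; the 25-period window is an assumption taken from the plotting cell earlier, not necessarily the exact window used in the regression notebook.
# Rescale the regression target by its rolling standard deviation so its local scale is roughly constant.
scaled_target = df.target / df.target.rolling(25).std()
scaled_target.dropna().std()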
What we basically want to minimize is the local and global standard deviation of log_diff^2. How does normalizing by a rolling standard deviation (instead of the global one) affect this?
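A small helper (not part of the original runs) to report both dispersion measures used in the cells below, so the two variants can be compared with the same code:
# Global and average local (window of 100) standard deviation of a squared-return series.
def dispersion(sq):
    sq = pd.Series(sq)
    print('global std:', sq.std())
    print('local std:', sq.rolling(100).std().mean())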
# Baseline: squared log-returns, normalized only by the global standard deviation.
p_squared = np.square(prices.log_diff[2000:] / prices.log_diff[2000:].std())
p_squared.rolling(500).mean().plot() # note the high rolling frame period
print('global std:', pd.Series(p_squared).std())
print('local std:', pd.Series(p_squared).rolling(100).std().mean())
# Variant: normalize by a rolling(25) standard deviation before squaring.
s = prices.log_diff[2000:]
s = s / s.rolling(25).std()
s_squared = pd.Series(np.square(s))
s_squared.rolling(500).mean().plot()
print('global std:', pd.Series(s_squared).std())
print('local std:', pd.Series(s_squared).rolling(100).std().mean())
Basically we have removed the large norm deviations. But it definitely feels very forced, because we are straight up fabricating a new target. What if we used candles that depended on the trade volume instead of on time? This seems very intuitive, because price action is very dense in the hype zone (see the volume graph above) and does not compare to the other historical values. Maybe on the days where "nothing happens" one actually cannot predict anything, and on the extreme days one would profit from a more granular approach. First we need granular data for a big chunk of ethereum history and a good estimate of the total volume in this asset class.
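A minimal sketch of such volume-based candles, built from the Poloniex frame already loaded above rather than from raw trades, so it is only an approximation; the bucket size (one bar per average period's worth of volume) is an arbitrary assumption.
# Cut the series into buckets of roughly equal traded volume and aggregate each bucket into one candle.
# Only 'close' and 'volume' are referenced in this notebook, so the OHLC fields are approximated from the closes.
def volume_bars(frame, volume_per_bar):
    bucket = (frame.volume.cumsum() // volume_per_bar).astype(int)
    return frame.groupby(bucket).agg(
        open=('close', 'first'),
        high=('close', 'max'),
        low=('close', 'min'),
        close=('close', 'last'),
        volume=('volume', 'sum'),
    )

vbars = volume_bars(prices, volume_per_bar=prices.volume.mean())
vbars.close.plot()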
# Quick check of combine_first: values from `a` take precedence, gaps in the index are filled from `b`.
a = df[:10]
a.target
b = df[5:20]
a.combine_first(b)