In [1]:
import os
os.chdir('../src')
In [9]:
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt

Regression

Theoretically, minimizing the error of the predicted log-returns should be equivalent to optimizing the expected log-return, and thus to optimizing the returns themselves. Returns-wise it might therefore be a better idea to use regression.
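As a quick sanity check (on synthetic stand-in data, not the notebook's dataset): the sign of a regression forecast can be scored exactly like a classifier, which is what the accuracy computation below does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: true log-returns plus a noisy forecast of them.
target = rng.normal(0.0, 0.01, size=1000)
preds = target + rng.normal(0.0, 0.02, size=1000)

# The sign of the regression output doubles as a direction classifier,
# so directional accuracy (as with metrics.accuracy_score) still applies.
acc = np.mean((target > 0) == (preds > 0))
print(f"directional accuracy: {acc:.3f}")
```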

In [6]:
featureset = 'indicators1'

df = pd.read_hdf('../data/features/technical.h5', key=featureset)
df['preds'] = pd.read_hdf('../data/predictions/regression_{}.h5'.format(featureset), key='predictions')
df = df.dropna()

importances = pd.read_hdf('../data/predictions/regression_{}.h5'.format(featureset), key='importances')
metrics.accuracy_score(df.target > 0, df.preds > 0)
Out[6]:
0.5321079718516962
In [7]:
df.preds.std()
Out[7]:
0.17205063579462265
In [70]:
s = (df.rolling(100).std().dropna() == 0).sum()
s[s != 0]
Out[70]:
MFI_0.5           7
BBWIDTH_2std    511
dtype: int64
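The windows counted above are stretches where the indicator sits at a literal constant for the whole rolling window. A minimal sketch of how that count comes about, on hypothetical data with one feature that goes flat:

```python
import pandas as pd
import numpy as np

# Toy frame: "flat" is constant from row 100 on, mimicking a degenerate
# indicator window (hypothetical data, for illustration only).
df = pd.DataFrame({
    "ok":   np.arange(300, dtype=float),
    "flat": np.r_[np.arange(100, dtype=float), np.full(200, 5.0)],
})

# Count windows with zero rolling standard deviation, as in the cell above.
s = (df.rolling(100).std().dropna() == 0).sum()
print(s[s != 0])
```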
In [8]:
importances.sort_values()
Out[8]:
CCI_0.5                -0.166892
STOCH_k                -0.126528
BBANDS_up_0.5std       -0.107407
CCI_2                  -0.101263
STOCHRSI_k_2           -0.081370
BBANDS_down_2std       -0.056023
MINUS_DI_2             -0.053487
APO_2                  -0.053139
STOCHRSI_d_0.5         -0.050865
PPO_0.5                -0.048513
PPO_2                  -0.045451
MFI_0.5                -0.031851
CMO_2                  -0.031641
RSI_2                  -0.031641
AROONOSC_0.5           -0.031298
BBWIDTH_2std           -0.027144
BBWIDTH_0.5std         -0.027144
DX_2                   -0.025787
HT_TREND_/             -0.023411
PLUS_DM_2              -0.023240
AROONOSC_2             -0.021201
NATR_2                 -0.019813
HT_SINE_sine           -0.018770
HT_TRENDMODE           -0.012869
ROCR_2                 -0.011956
STOCH_d                -0.011360
ADOSC_2                -0.010265
MFI_2                  -0.009727
ADX_0.5                -0.009357
ADOSC_0.5              -0.006132
                          ...   
HT_DCPERIOD            -0.003009
AROON_down             -0.002281
ROCR_0.5               -0.002216
HT_DCPHASE              0.000171
NATR_0.5                0.002568
DX_0.5                  0.003522
APO_0.5                 0.004646
MOM_0.5                 0.004864
AROON_up                0.008799
SAR_Signal              0.008941
HT_PHASOR_inphase       0.008971
HT_PHASOR_quadrature    0.010715
BBANDS_down_0.5std      0.011832
ADX_2                   0.014461
BOP                     0.019358
HT_SINE_leadsine        0.023015
MINUS_DM_0.5            0.033589
ULTOSC_0.5              0.037310
PLUS_DM_0.5             0.038539
PLUS_DI_2               0.040160
MINUS_DM_2              0.044125
MINUS_DI_0.5            0.047213
WILLR_2                 0.050130
MOM_2                   0.065506
RSI_0.5                 0.083370
CMO_0.5                 0.083370
BBANDS_up_2std          0.120628
EMA/                    0.136828
STOCHRSI_d_2            0.153133
WILLR_0.5               0.213211
Length: 63, dtype: float64
In [10]:
df.target[30000:30500].plot()
(df.preds * df.target.rolling(25).std())[30000:30500].plot()
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ffa0136be48>

Nice predictor we got here. At least it isn't overfitting. What little accuracy we do get comes from dividing the target by the rolling standard deviation.

In [60]:
df.preds.rolling(100).std().plot()
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd6c0ad7f60>
In [12]:
prices = pd.read_json('../data/raw/BTC_ETH.json')
prices.close.plot()
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff9f97d3518>
In [13]:
prices['log_diff'] = np.log(prices.close) - np.log(prices.close.shift(-1))
ax = plt.gca()
ax.set(xlim=(1, 60000), ylim=(0, 0.00001))
(np.square(prices.log_diff) / prices.volume).rolling(100).mean().plot()
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff9eb5a89e8>

My hypothesis is that the total Bitcoin trading volume and the volatility are highly correlated, and that the graph above shows that Poloniex had a smaller share of the overall volume at the beginning and at the end, which is why the quotient doesn't match up perfectly.
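One rough way to probe this hypothesis is a rolling correlation between squared log-returns and volume. A sketch on synthetic data (an illustrative stand-in for the Poloniex series, with both quantities driven by a common activity level):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic regime: volatility and volume share a slowly varying driver.
n = 5000
activity = pd.Series(rng.lognormal(0, 1, n)).rolling(50, min_periods=1).mean()
log_diff = rng.normal(0, 1, n) * np.sqrt(activity)
volume = activity * rng.lognormal(0, 0.2, n)

# If the hypothesis holds, the rolling correlation between squared
# returns and volume should be clearly positive most of the time.
corr = np.square(log_diff).rolling(500).corr(volume)
print(f"median rolling corr: {corr.median():.2f}")
```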

In [14]:
prices.volume.rolling(500).mean().plot()
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff9f9743dd8>

Proper Normalization

Through which lens can we look at the data such that it has the nicest possible statistical properties? This question is especially crucial for data as notoriously noisy and unpredictable as this. The goal here is to do regression, but as the regression notebook shows, the norms of the prices are all over the place and the predictions end up very close to the mean. This changes once you divide the target by the rolling standard deviation: the accuracy on the sign of the prediction goes from 51.4% to 53.2%. Doing the same to the features makes no difference.

What we essentially want to minimize are the local and the global standard deviation of the squared log-diff. How does a rolling average affect them?
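The two diagnostics computed in the next cells can be wrapped in a small helper. The sketch below applies it to a synthetic heteroskedastic series (a stand-in for the price data, with slowly drifting volatility), before and after rolling-std normalization:

```python
import numpy as np
import pandas as pd

def std_diagnostics(log_diff, window=100):
    """Global std of the squared series and mean local (rolling) std."""
    sq = pd.Series(np.square(log_diff)).dropna()
    return sq.std(), sq.rolling(window).std().mean()

rng = np.random.default_rng(2)

# Toy returns whose volatility drifts slowly over time.
vol = np.exp(np.cumsum(rng.normal(0, 0.02, 20000)))
r = pd.Series(rng.normal(0, 1, 20000) * vol)

# Compare the diagnostics before and after rolling-std normalization.
raw = std_diagnostics(r / r.std())
normed = std_diagnostics(r / r.rolling(25).std())
print("raw   :", raw)
print("normed:", normed)
```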

In [15]:
p_squared = np.square(prices.log_diff[2000:] / prices.log_diff[2000:].std())
p_squared.rolling(500).mean().plot() # note the high rolling frame period
print('global std:', pd.Series(p_squared).std())
print('local std:', pd.Series(p_squared).rolling(100).std().mean())
global std: 4.878865602063849
local std: 2.345057301790629
In [172]:
s = prices.log_diff[2000:]
s = s / s.rolling(25).std()
s_squared = pd.Series(np.square(s))
s_squared.rolling(500).mean().plot()

print('global std:', pd.Series(s_squared).std())
print('local std:', pd.Series(s_squared).rolling(100).std().mean())
global std: 1.9862657251083975
local std: 1.9334807337311726

Basically we have removed the large deviations in norm. But it definitely feels very forced, because we are straight up fabricating a new target. What if we used candles that depend on the trade volume rather than on time? This seems intuitive, because price action is very dense in the hype zone (see the volume graph above) and does not compare to the other historical values. Maybe on the days where "nothing happens" one actually cannot predict anything, while on the extreme days one would profit from a more granular approach. First we need granular data for a big chunk of Ethereum's history and a good estimate of the total volume in this asset class.
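A sketch of what such volume-based candles could look like, on a hypothetical trade-level frame (the `price`/`volume` columns and `bar_size` are assumptions for illustration; real data would come from the exchange):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical trade-level data: a price path and per-trade volumes.
trades = pd.DataFrame({
    "price": 100 * np.exp(np.cumsum(rng.normal(0, 1e-3, 10000))),
    "volume": rng.exponential(1.0, 10000),
})

# Assign each trade to a volume bar: a new bar starts whenever the
# cumulative traded volume crosses another multiple of bar_size.
bar_size = 50.0
bar_id = (trades.volume.cumsum() // bar_size).astype(int)

bars = trades.groupby(bar_id).agg(
    open=("price", "first"),
    high=("price", "max"),
    low=("price", "min"),
    close=("price", "last"),
    volume=("volume", "sum"),
)
print(bars.head())
```

Each bar then carries roughly the same amount of traded volume, so busy periods get many bars and quiet periods few.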

In [16]:
a = df[:10]
b = df[5:20]
a.combine_first(b)  # a's rows take priority; b fills the rest of the index
Out[16]:
target BBANDS_up_0.5std BBANDS_down_0.5std BBWIDTH_0.5std HT_TREND_/ EMA/ SAR_Signal ADX_0.5 APO_0.5 AROON_up ... PPO_2 ROCR_2 RSI_2 STOCHRSI_k_2 STOCHRSI_d_2 ULTOSC_2 WILLR_2 ADOSC_2 NATR_2 preds
11608 0.000082 False False 0.000298 0.996189 0.992443 False 39.423713 0.000082 57.142857 ... -1.839475 0.969277 42.735499 90.652541 8.918859e+01 51.732165 -49.515553 559.835671 1.129223 0.205075
11609 0.000253 False False 0.000247 0.993160 0.989861 False 35.570254 0.000072 50.000000 ... -1.841288 0.964591 41.875333 72.470998 8.770785e+01 49.233225 -52.727746 478.176436 1.111532 -0.008953
11610 -0.000216 False True 0.000317 0.983005 0.981020 False 36.833520 0.000091 42.857143 ... -1.846187 0.956199 39.337832 24.199228 6.244092e+01 49.749049 -65.251989 370.087703 1.144292 0.012586
11611 0.000128 False False 0.000269 0.991778 0.990331 False 38.725967 0.000092 35.714286 ... -1.857074 0.968222 42.425523 77.490274 5.805350e+01 52.705077 -53.793103 474.228244 1.160683 0.418740
11612 0.000005 False False 0.000239 0.987256 0.986142 False 39.423254 0.000097 28.571429 ... -1.892152 0.975477 41.138705 53.003238 5.156425e+01 52.206604 -60.583554 478.317377 1.154992 -0.019543
11613 0.000000 False True 0.000000 0.988048 0.986837 False 40.490430 0.000036 21.428571 ... -1.934235 0.991048 41.088220 56.689237 6.239425e+01 54.507229 -60.848806 456.623862 1.134895 0.042632
11614 0.000009 False True 0.000000 0.989797 0.987675 False 38.845330 -0.000024 14.285714 ... -1.962864 0.994577 41.088220 56.689237 5.546057e+01 53.774255 -60.848806 396.263735 1.122845 -0.143098
11615 0.000152 False True 0.000000 0.991856 0.988124 False 36.141586 -0.000070 7.142857 ... -1.984150 0.987064 40.991167 0.000000 3.779282e+01 52.977731 -61.324668 316.701620 1.122097 -0.212954
11616 0.000089 False True 0.000000 0.988360 0.983150 False 37.166287 -0.000054 0.000000 ... -1.984154 0.988502 39.357175 0.000000 1.889641e+01 50.578602 -69.389920 262.732493 1.132360 -0.048544
11617 -0.000173 False True 0.000206 0.986819 0.980864 False 33.955588 -0.000103 7.142857 ... -1.974413 0.980655 38.428462 0.000000 2.178998e-13 44.942795 -74.105040 118.522048 1.161958 -0.485548
11618 -0.000059 True False 0.000000 0.995327 0.988613 False 30.370919 -0.000101 0.000000 ... -1.920548 0.991986 41.226938 100.000000 3.333333e+01 43.804099 -63.757477 -30.183656 1.182065 -0.195873
11619 0.000328 True False 0.000000 0.998728 0.991567 False 26.208830 -0.000107 42.857143 ... -1.889636 1.001100 42.156135 100.000000 6.666667e+01 41.140276 -58.668197 -220.188837 1.189461 -0.210026
11620 -0.000048 False True 0.000235 0.986095 0.979694 False 27.711831 -0.000114 100.000000 ... -1.862386 0.991246 38.631389 5.443801 6.848127e+01 40.978423 -71.197649 -532.048626 1.230400 0.130800
11621 0.000026 False False 0.000256 0.988405 0.982799 False 29.000118 -0.000120 92.857143 ... -1.868551 0.997493 39.400183 26.067744 4.383718e+01 38.817281 -61.097259 -654.549819 1.216241 0.203289
11622 -0.000348 False False 0.000274 0.987788 0.982905 False 27.849208 -0.000109 85.714286 ... -1.870025 0.995199 39.124908 14.001532 1.517103e+01 36.672683 -63.395225 -811.427975 1.213820 -0.040316
11623 -0.000959 True False 0.000322 1.002155 0.997201 False 23.980953 -0.000072 78.571429 ... -1.803760 0.992656 44.508383 100.000000 4.668976e+01 40.140030 -32.625995 -751.981624 1.213965 0.273276
11624 -0.000040 True False 0.001012 1.040545 1.033668 False 28.317644 0.000040 71.428571 ... -1.624706 1.042150 55.703449 100.000000 7.133384e+01 49.079352 -2.271437 46.616263 1.266276 0.218943
11625 0.000110 True False 0.001204 1.040683 1.032934 False 33.133007 0.000135 64.285714 ... -1.400331 1.036437 56.086654 100.000000 1.000000e+02 47.344993 -14.556041 510.836537 1.315250 -0.147615
11626 0.000004 False False 0.001124 1.034565 1.026608 False 37.260460 0.000275 57.142857 ... -1.207757 1.039305 54.736251 92.038535 9.734618e+01 47.212337 -19.893256 549.492163 1.321382 -0.067679
11627 -0.000183 False False 0.000751 1.032810 1.024718 False 38.067732 0.000406 50.000000 ... -1.062342 1.024170 54.692799 87.961459 9.333333e+01 49.409780 -20.063076 699.123280 1.310935 0.204615

20 rows × 65 columns