import os
os.chdir('../src')  # run from src/ so the relative ../data paths below resolve
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
Theoretically, optimizing how accurately we predict the log-return should be equivalent to optimizing the expected log-return, and thus equivalent to optimizing the returns. So, return-wise, it might be a better idea to use regression.
featureset = 'indicators1'
df = pd.read_hdf('../data/features/technical.h5', key=featureset)
# Attach the regression predictions for this feature set and drop rows without them.
df['preds'] = pd.read_hdf('../data/predictions/regression_{}.h5'.format(featureset), key='predictions')
df = df.dropna()
# Feature importances from the same regression run.
importances = pd.read_hdf('../data/predictions/regression_{}.h5'.format(featureset), key='importances')
# Accuracy of the prediction's sign against the sign of the target.
metrics.accuracy_score(df.target > 0, df.preds > 0)
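As a rough check of the claim at the top that better log-return predictions should translate into returns, here is the average log-return of simply trading the sign of each prediction; this assumes `target` is the forward log-return over the prediction horizon, so treat it as a sketch rather than a backtest.
# Mean log-return per step when going long/short on the sign of the prediction
# (assumes df.target is the forward log-return over the prediction horizon).
(np.sign(df.preds) * df.target).mean()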
df.preds.std()
# Count, per column, the rolling windows with zero standard deviation (i.e. constant stretches).
s = (df.rolling(100).std().dropna() == 0).sum()
s[s != 0]
importances.sort_values()
# Compare a slice of the target to the predictions scaled back up by the target's rolling std.
df.target[30000:30500].plot()
(df.preds * df.target.rolling(25).std())[30000:30500].plot()
Nice predictor we got here. At least it's not overfitting. The little accuracy we did get was due to dividing the target by the rolling standard deviation.
df.preds.rolling(100).std().plot()
prices = pd.read_json('../data/raw/BTC_ETH.json')
prices.close.plot()
# Log price difference to the next candle; only its square is used below, so the sign convention does not matter.
prices['log_diff'] = np.log(prices.close) - np.log(prices.close.shift(-1))
ax = plt.gca()
ax.set(xlim=(1, 60000), ylim=(0, 0.00001))
# Squared log-returns per unit of traded volume, smoothed over 100 periods.
(np.square(prices.log_diff) / prices.volume).rolling(100).mean().plot()
My hypothesis is that total volume in bitcoin and volatility are highly correlated, and that the graph above shows that Poloniex had a smaller share of the total volume at the beginning and at the end, which is why the quotient doesn't perfectly match up.
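As a rough check of that hypothesis with the data already loaded (Poloniex volume only, so it cannot see the exchange's changing market share), the plain correlation between squared log-returns and traded volume can be computed like this:
# Correlation between squared log-returns (a volatility proxy) and traded volume on this pair.
np.square(prices.log_diff).corr(prices.volume)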
prices.volume.rolling(500).mean().plot()
Through which lens can we look at the data such that it has the nicest statistical properties possible? This question is especially crucial for such notoriously noisy and unpredictable data. The goal here is to do regression, but if you look at the regression notebook you will see that the price norms are all over the place and the predictions sit very close to the mean. This changes if you divide the target by the rolling standard deviation: the accuracy on the sign of the prediction goes from 51.4% to 53.2%. Doing the same to the features makes no difference.
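A minimal sketch of that target rescaling; the 25-period window is an assumption taken from the plotting cell earlier, not necessarily the exact window used in the regression notebook.
# Rescale the regression target by its rolling standard deviation so its local scale is roughly constant.
scaled_target = df.target / df.target.rolling(25).std()
scaled_target.dropna().std()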
What we basically want to minimize is the local and global standard deviation of log_diff^2. How does normalizing by a rolling standard deviation (instead of the global one) affect this?
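A small helper (not part of the original runs) to report both dispersion measures used in the cells below, so the two variants can be compared with the same code:
# Global and average local (window of 100) standard deviation of a squared-return series.
def dispersion(sq):
    sq = pd.Series(sq)
    print('global std:', sq.std())
    print('local std:', sq.rolling(100).std().mean())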
# Baseline: squared log-returns, normalized only by the global standard deviation.
p_squared = np.square(prices.log_diff[2000:] / prices.log_diff[2000:].std())
p_squared.rolling(500).mean().plot() # note the high rolling frame period
print('global std:', pd.Series(p_squared).std())
print('local std:', pd.Series(p_squared).rolling(100).std().mean())
# Variant: normalize by a rolling(25) standard deviation before squaring.
s = prices.log_diff[2000:]
s = s / s.rolling(25).std()
s_squared = pd.Series(np.square(s))
s_squared.rolling(500).mean().plot()
print('global std:', pd.Series(s_squared).std())
print('local std:', pd.Series(s_squared).rolling(100).std().mean())
Basically we have removed the large norm deviations. But it definitely feels very forced, because we are straight up fabricating a new target. What if we used candles that depended on the trade volume instead of on time? This seems very intuitive, because price action is very dense in the hype zone (see the volume graph above) and does not compare to the other historical values. Maybe on the days where "nothing happens" one actually cannot predict anything, and on the extreme days one would profit from a more granular approach. First we need granular data for a big chunk of ethereum history and a good estimate of the total volume in this asset class.
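A minimal sketch of such volume-based candles, built from the Poloniex frame already loaded above rather than from raw trades, so it is only an approximation; the bucket size (one bar per average period's worth of volume) is an arbitrary assumption.
# Cut the series into buckets of roughly equal traded volume and aggregate each bucket into one candle.
# Only 'close' and 'volume' are referenced in this notebook, so the OHLC fields are approximated from the closes.
def volume_bars(frame, volume_per_bar):
    bucket = (frame.volume.cumsum() // volume_per_bar).astype(int)
    return frame.groupby(bucket).agg(
        open=('close', 'first'),
        high=('close', 'max'),
        low=('close', 'min'),
        close=('close', 'last'),
        volume=('volume', 'sum'),
    )

vbars = volume_bars(prices, volume_per_bar=prices.volume.mean())
vbars.close.plot()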
# Quick check of combine_first: values from `a` take precedence, gaps in the index are filled from `b`.
a = df[:10]
a.target
b = df[5:20]
a.combine_first(b)