In [1]:
import os
os.chdir('../src/')
In [2]:
from importlib import reload
import pandas as pd
import numpy as np
import lightgbm as lgb
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot
from sklearn import decomposition, model_selection, svm, linear_model
import math
from finta import TA
import statsmodels
In [3]:
df = pd.read_json('../data/raw/BTC_ETH.json')
df.tail()
Out[3]:
close date high low open quoteVolume volume weightedAverage
59209 0.032398 2018-12-23 17:30:00 0.032440 0.032264 0.032279 167.521008 5.415495 0.032327
59210 0.032335 2018-12-23 18:00:00 0.032388 0.032264 0.032384 59.336265 1.919648 0.032352
59211 0.032439 2018-12-23 18:30:00 0.032456 0.032341 0.032373 188.175282 6.101123 0.032423
59212 0.032444 2018-12-23 19:00:00 0.032473 0.032398 0.032403 157.947252 5.122461 0.032431
59213 0.032380 2018-12-23 19:30:00 0.032444 0.032356 0.032444 60.197846 1.950349 0.032399
In [4]:
df.close.plot(title='ethereum price in bitcoin from Aug 2015 to Dec 2018')
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a8593c390>
In [5]:
pd.Series(np.log(df.close)).plot(title='log ethereum price in bitcoin from Aug 2015 to Dec 2018')
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a8c3aef60>

Log prices actually look better behaved. A constant growth rate should appear as a straight line, which in the log-price plot it actually does, as opposed to an exploding exponential curve in the raw-price plot.
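A quick sketch with synthetic data (not the Poloniex series) of why constant growth is linear in log space:

```python
import numpy as np

# Synthetic price with a constant 1% growth rate per step.
t = np.arange(100)
price = 100 * 1.01 ** t

# In log space the series is exactly linear:
# log(price) = log(100) + t * log(1.01).
log_price = np.log(price)
slopes = np.diff(log_price)

# The slope is the same everywhere, so the log plot is a straight line.
print(np.allclose(slopes, np.log(1.01)))  # True
```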

In [21]:
df['log_diff'] = pd.Series(np.log(df.close)).diff()
df.log_diff.plot(title='log-differenced price')
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a85f94c50>

The first few price differences have an abnormally large magnitude, so let's remove them.

In [22]:
df = df[2000:]
In [7]:
df.isnull().sum()
Out[7]:
close              0
date               0
high               0
low                0
open               0
quoteVolume        0
volume             0
weightedAverage    0
dtype: int64

Hurray! We have no nulls!

In [8]:
print(len(df))
df.head()
59214
Out[8]:
close date high low open quoteVolume volume weightedAverage
0 0.005501 2015-08-08 05:00:00 50.00000 0.005501 50.000000 4333.469000 33.932021 0.007830
1 0.006000 2015-08-08 05:30:00 0.01280 0.005500 0.005501 3642.226628 21.749825 0.005972
2 0.005875 2015-08-08 06:00:00 0.00719 0.005000 0.006900 14607.375725 84.009002 0.005751
3 0.006500 2015-08-08 06:30:00 0.00659 0.005223 0.005425 12520.280800 74.269166 0.005932
4 0.006890 2015-08-08 07:00:00 0.00690 0.005873 0.006340 6799.871349 44.791872 0.006587

What do all these features mean?

I am guessing that volume is given in bitcoin and quoteVolume is the corresponding amount in ethereum.

In [23]:
pd.concat([df.volume / df.weightedAverage, df.quoteVolume], axis=1).head()
Out[23]:
0 quoteVolume
2000 14640.678555 14640.673648
2001 28759.370117 28759.357018
2002 3481.788564 3481.783586
2003 2889.778120 2889.771728
2004 1637.671817 1637.668127

It seems they compute quoteVolume from volume and weightedAverage (otherwise the values wouldn't match), but what is the weighted average? A Reddit user suggests that it is the price weighted by trade volume, which sounds like a very reasonable measure to have.
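A toy sketch of that interpretation, with hypothetical trades (the prices and sizes below are made up, not real Poloniex data): if weightedAverage is the volume-weighted average price, then volume / weightedAverage recovers quoteVolume, matching the check above.

```python
import numpy as np

# Hypothetical trades within one candle: price in BTC, size in ETH.
prices = np.array([0.0324, 0.0325, 0.0323])
sizes = np.array([10.0, 5.0, 20.0])

quote_volume = sizes.sum()        # total ETH traded
volume = (prices * sizes).sum()   # total BTC traded
vwap = volume / quote_volume      # volume-weighted average price

# Dividing the BTC volume by the VWAP recovers the ETH volume.
print(np.isclose(volume / vwap, quote_volume))  # True
```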

Autocorrelation

The following plots show that there is some genuine autocorrelation (though barely). More importantly, the variance of the series is strongly autocorrelated; perhaps the variance will be less autocorrelated once we look at the residuals of a fit. For fun I also plotted the autocorrelation of the skew and of higher-order moments.

In [10]:
df['logarithmic_returns'] = pd.Series(np.log(df.close)).diff()
In [33]:
ax = plt.gca(xlim=(1, 1000), ylim=(-0.1, 0.1))
autocorrelation_plot(df.log_diff, ax=ax)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a7fc682e8>

Variance autocorrelation:

In [30]:
ax = plt.gca(xlim=(1, 1000), ylim=(-0.25, 0.25))
autocorrelation_plot(np.square(df.log_diff), ax=ax)
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a859d07b8>

Skew autocorrelation:

In [31]:
ax = plt.gca(xlim=(1, 1000), ylim=(-0.2, 0.2))
autocorrelation_plot(np.power(df.log_diff, np.full(len(df), 3)), ax=ax)
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a840f3240>
In [35]:
ax = plt.gca(xlim=(1, 1000), ylim=(0, 0.2))
autocorrelation_plot(np.power(df.log_diff, np.full(len(df), 4)), ax=ax)
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a7fa7e710>
In [37]:
ax = plt.gca(xlim=(1, 1000), ylim=(-0.2, 0.2))
autocorrelation_plot(np.power(df.log_diff, np.full(len(df), 5)), ax=ax)
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a7ffa5400>

There is a huge spike at lag 500; how much time is that? Apparently about 10 days.

In [39]:
500 * 30 / 60 / 24
Out[39]:
10.416666666666666

The 10-hour-difference correlation (a diff of 20 steps on 30-minute candles) is much higher.

In [19]:
ax = plt.gca(xlim=(1, 1000), ylim=(-0.1, 0.1))
autocorrelation_plot(pd.Series(np.log(df.close)).diff(20)[2000:], ax=ax)
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8a85c573c8>

This illustrates that volatility clustering is a very real thing.

Convolution stuff

Let a convolution $p * f$ of the time series $p$ by $f = [a_0, a_1, \dots, a_{n-1}]$ be defined by $(p * f)(t) = \sum_{i=0}^{n-1}{p_{t-i}f_{i}}$.

Convolutions are used surprisingly often in technical indicators. Examples are:

  • lag: $[0, 1, 0, \dots, 0]$
  • difference: $[1, -1, 0, \dots, 0]$
  • moving average: $ma(n) = \frac{1}{n}[1, \dots, 1]$, where the 1 is repeated $n$ times
  • exponential moving average: $ema(n) = [(1-\alpha), (1-\alpha)\alpha, \dots, (1-\alpha)\alpha^{n-1}]$
  • moving average convergence divergence: $macd(m,n,d) = ma(d) * (ema(m) - ema(n))$ (thanks to the convolution properties we can express it like this; note that here $*$ is the convolution operator)
  • rsi: left to the reader
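The definition and the first few kernels above can be checked directly. This is a minimal sketch implementing $(p * f)(t) = \sum_i p_{t-i} f_i$ on a toy series (the `conv` helper is mine, not from any library):

```python
import numpy as np

def conv(p, f):
    """(p * f)(t) = sum_i p[t-i] * f[i], defined for t >= len(f) - 1."""
    n = len(f)
    return np.array([sum(p[t - i] * f[i] for i in range(n))
                     for t in range(n - 1, len(p))])

p = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

# lag: f = [0, 1] picks out p[t-1]
print(conv(p, [0, 1]))       # [ 1.  2.  4.  8.]
# difference: f = [1, -1] gives p[t] - p[t-1]
print(conv(p, [1, -1]))      # [ 1.  2.  4.  8.]
# moving average of length 2: f = [1/2, 1/2]
print(conv(p, [0.5, 0.5]))   # [ 1.5  3.   6.  12.]
```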

The cool thing about doing this stuff on log prices is that the difference convolution corresponds to taking $p(t)/p(t-1)$ in the original time series, which is "normalized" in the sense that it is invariant to scaling the whole series. It should not matter whether we are looking at the series with the price given in dollars or in bitcoin. This is a clear motivation for using convolutional neural networks, and maybe even for initializing their weights with technical indicators.
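The scale invariance is easy to verify numerically. A small sketch with made-up prices and a hypothetical constant BTC/USD rate of 6500 (both illustrative, not from the data):

```python
import numpy as np

# Log differences are invariant to rescaling the whole series:
# log(c * p[t]) - log(c * p[t-1]) = log(p[t]) - log(p[t-1]).
p = np.array([0.005, 0.006, 0.0055, 0.007])
in_btc = np.diff(np.log(p))
in_usd = np.diff(np.log(p * 6500.0))  # same series, quoted in USD

print(np.allclose(in_btc, in_usd))  # True
```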