import os
os.chdir('../src/')  # work from the project's src directory
from importlib import reload
import pandas as pd
import numpy as np
import lightgbm as lgb
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot
from sklearn import decomposition, model_selection, svm, linear_model
import math
from finta import TA
import statsmodels
df = pd.read_json('../data/raw/BTC_ETH.json')
df.tail()
df.close.plot(title='ethereum price in bitcoin from Aug 2015 to Dec 2018')
pd.Series(np.log(df.close)).plot(title='log ethereum price in bitcoin from Aug 2015 to Dec 2018')
Log prices actually look better behaved. A constant growth rate should show up as a straight line, and in the log-price plot it actually does, whereas in raw prices the same growth looks like an exponential blow-up that is hard to eyeball.
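A quick synthetic sketch of that point (the 1% growth rate is made up purely for illustration):
# A series growing at a constant 1% per step: an exponential curve in raw
# prices, but exactly a straight line in log prices.
t = np.arange(500)
p = 1.01 ** t  # hypothetical constant-growth price series
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
ax1.plot(t, p)
ax1.set_title('constant growth, raw price')
ax2.plot(t, np.log(p))
ax2.set_title('constant growth, log price')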
df['log_diff'] = pd.Series(np.log(df.close)).diff()
df.log_diff.plot(title='log-differenced price')
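As a quick formal check of what the plot suggests, an augmented Dickey-Fuller test (from the statsmodels package imported above) should strongly reject a unit root in the differenced series:
# ADF test on log returns: a very negative statistic / tiny p-value
# rejects a unit root, i.e. the differenced series looks stationary.
from statsmodels.tsa.stattools import adfuller
adf_stat, pvalue = adfuller(df.log_diff.dropna())[:2]
print(adf_stat, pvalue)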
The first few price differences have an abnormally large magnitude, so let's remove them.
df = df.iloc[2000:]  # drop the noisy early stretch of the series
df.isnull().sum()
Hurray! We have no nulls!
print(len(df))
df.head()
I am guessing that volume is given in bitcoin and quoteVolume is the corresponding amount in ethereum.
pd.concat([df.volume / df.weightedAverage, df.quoteVolume], axis=1).head()
It seems they compute quoteVolume from the weightedAverage (otherwise the values wouldn't match), but what is the weighted average? A Reddit user suggests that it is the price weighted by trade volume (a VWAP), which sounds like a very reasonable measure to have.
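A quick numerical version of that check, under the assumption that the three columns relate as guessed:
# If quoteVolume == volume / weightedAverage, the relative error is ~0.
rel_err = (df.volume / df.weightedAverage - df.quoteVolume).abs() / df.quoteVolume.replace(0, np.nan)
print(rel_err.max())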
The following plots show that there is some genuine autocorrelation (though barely), but more importantly, the variance of the series is strongly autocorrelated; maybe the variance will be less autocorrelated once we look at the residuals of a fit. For fun I also plotted the skew autocorrelation and higher-order moments.
df['logarithmic_returns'] = df.log_diff  # log returns are exactly the log-differenced prices from above
fig, ax = plt.subplots()
ax.set(xlim=(1, 1000), ylim=(-0.1, 0.1))
autocorrelation_plot(df.log_diff, ax=ax)
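To put a number on "barely": a Ljung-Box test from statsmodels (imported above); tiny p-values mean the autocorrelation, however small, is statistically significant.
# Ljung-Box test: the null hypothesis is no autocorrelation up to each lag.
from statsmodels.stats.diagnostic import acorr_ljungbox
print(acorr_ljungbox(df.log_diff.dropna(), lags=[10, 100, 500]))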
Variance autocorrelation:
fig, ax = plt.subplots()
ax.set(xlim=(1, 1000), ylim=(-0.25, 0.25))
autocorrelation_plot(np.square(df.log_diff), ax=ax)
Skew autocorrelation:
fig, ax = plt.subplots()
ax.set(xlim=(1, 1000), ylim=(-0.2, 0.2))
autocorrelation_plot(df.log_diff ** 3, ax=ax)
Fourth moment autocorrelation:
fig, ax = plt.subplots()
ax.set(xlim=(1, 1000), ylim=(0, 0.2))
autocorrelation_plot(df.log_diff ** 4, ax=ax)
Fifth moment autocorrelation:
fig, ax = plt.subplots()
ax.set(xlim=(1, 1000), ylim=(-0.2, 0.2))
autocorrelation_plot(df.log_diff ** 5, ax=ax)
There is a huge spike at a lag of around 500. How much time is that? Apparently about 10 days:
500 * 30 / 60 / 24  # 500 lags of 30-minute candles, converted to days
The autocorrelation of 10-hour returns (a 20-candle difference at 30 minutes per candle) is much higher.
fig, ax = plt.subplots()
ax.set(xlim=(1, 1000), ylim=(-0.1, 0.1))
autocorrelation_plot(pd.Series(np.log(df.close)).diff(20).dropna(), ax=ax)
This illustrates that volatility clustering is a very real thing.
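Another way to see it (a sketch; the one-day window of 48 half-hour candles is an arbitrary choice): the rolling volatility itself is heavily autocorrelated.
# Rolling 1-day volatility of log returns; volatility clustering shows up
# as strong autocorrelation of this series.
rolling_vol = df.log_diff.rolling(48).std().dropna()
fig, ax = plt.subplots()
ax.set(xlim=(1, 1000), ylim=(-1, 1))
autocorrelation_plot(rolling_vol, ax=ax)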
Let the convolution $p * f$ of a time series $p$ with a kernel $f = [f_0, f_1, \dots, f_{n-1}]$ be defined by $(p * f)(t) = \sum_{i=0}^{n-1} p_{t-i} f_i$.
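As a sanity check, this is exactly what numpy computes: np.convolve in its default 'full' mode evaluates the same sum, with $p$ treated as zero outside its support.
# Verify the definition against np.convolve at a single index t.
p = np.random.randn(100)
f = np.array([0.5, 0.3, 0.2])
t = 10
manual = sum(p[t - i] * f[i] for i in range(len(f)))  # (p * f)(t)
assert np.isclose(manual, np.convolve(p, f)[t])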
Convolutions are used surprisingly often in technical indicators: a simple moving average is a convolution with a uniform kernel, a weighted moving average with a linearly decaying one, and momentum with the kernel $[1, 0, \dots, 0, -1]$.
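A small check of the moving-average case against pandas' rolling mean (the 'valid' convolution output lines up with the rolling output once the first n-1 NaNs are dropped):
# A 5-period SMA as a convolution with the uniform kernel [1/5] * 5.
n = 5
kernel = np.ones(n) / n
sma_conv = np.convolve(df.close.values, kernel, mode='valid')
sma_roll = df.close.rolling(n).mean().dropna().values
assert np.allclose(sma_conv, sma_roll)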
The cool thing about doing this on log prices is that the differencing kernel $[1, -1]$ turns into $\log(p_t / p_{t-1})$, the log of the price ratio in the original series, which is "normalized" in the sense that it is invariant to scaling the whole series. It should not matter whether we are looking at the series with the price given in dollars or in bitcoin. This is a clear motivation for using convolutional neural networks, and maybe even for initializing their weights with technical indicators.
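A final check of that scale-invariance claim, re-denominating the series by a hypothetical constant factor of 1000:
# log(c * p) = log(c) + log(p), so the constant drops out of the difference
# and log returns are unchanged by re-denomination.
log_ret = np.log(df.close).diff().dropna()
log_ret_scaled = np.log(df.close * 1000).diff().dropna()
assert np.allclose(log_ret, log_ret_scaled)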