Time series

Trends and stationarity #

A basic model for a single time series is to express the observed series \(y_t\) in the form \(y_t = \mu_t + \epsilon_t\) , where \(\mu_t\) is the deterministic trend and \(\epsilon_t\) is stochastic (i.e. there is a joint probability distribution describing \(\epsilon_1, \epsilon_2, \ldots\) ). In order to make \(\mu_t\) identifiable, we require \(E\epsilon_t = 0\) for each \(t\) .

If we observe only a single series, \(\mu_t\) and \(\epsilon_t\) cannot be distinguished without further conditions. Technical assumptions allow them to be distinguished, but we will not pursue that further here. See Dette, Preuss, Sen (2017) for extensive discussion of this issue. For the remainder of this section, we take \(\mu_t \equiv 0\) .

The series \(y_t\) is stationary if \((y_t,\ldots,y_{t+d})\) is equal in distribution to \((y_s,\ldots,y_{s+d})\) for any \(s, t, d\) . The weaker condition of covariance stationarity holds if \(\textrm{Cov}(y_t,\ldots,y_{t+d})\) is equal to \(\textrm{Cov}(y_s,\ldots,y_{s+d})\) for any \(s, t, d\) .

Autocorrelation #

One of the basic characteristics of a stationary time series is the autocorrelation function, \(r_d = \textrm{Cor}(y_t, y_{t+d})\) , which does not depend on \(t\) due to the series being stationary.

The conventional way to define correlation is the product-moment form, \(E[X\cdot Y]\) , where \(X\) and \(Y\) are standardized to have expected value zero and unit variance. The standard estimator of a product moment correlation is the sample Pearson correlation coefficient. Estimators of this type do not perform very well for heavy-tailed distributions. There are many “robust” correlation coefficients that perform better in these settings. We only discuss one of them here, known as the tau-correlation (or “Kendall’s tau correlation”).

Given data \((x_i, y_i)\) for \(i=1, \ldots, n\) , the tau-correlation is defined in terms of the concordances \(I_{ij} = I((x_i - x_j)(y_i - y_j) > 0)\) . This is the indicator that the ordering between \(x_i\) and \(x_j\) is the same (in terms of direction) as the ordering between \(y_i\) and \(y_j\) . For the present discussion, we ignore the possibility of ties. Let \(c = \sum_{i<j} I_{ij}\) denote the number of concordant pairs, and let \(m = n(n-1)/2\) denote the number of pairs. Then the tau correlation is \(2c/m - 1\) . Note that the tau correlation ranges from -1 to 1, and is equal to 1 when all pairs are concordant, is equal to -1 when all pairs are discordant, and is equal to zero when exactly half of the pairs are concordant.

The tau-autocorrelation at lag \(d\) is the tau-correlation between \(x_1, \ldots, x_{n-d}\) and \(x_{1+d}, \ldots, x_n\) .
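As a concrete illustration, here is a minimal sketch of the tau-autocorrelation using numpy and scipy's `kendalltau`; the function name `tau_autocorr` and the simulated example are illustrative (scipy's implementation also handles ties, which we ignored above).

```python
import numpy as np
from scipy.stats import kendalltau

def tau_autocorr(x, d):
    """Kendall tau correlation between the series and its lag-d translate."""
    x = np.asarray(x)
    tau, _ = kendalltau(x[:-d], x[d:])
    return tau

# Example: an AR(1)-type series with heavy-tailed innovations
rng = np.random.default_rng(0)
e = rng.standard_t(df=2, size=1000)
x = np.zeros(1000)
for t in range(1, 1000):
    x[t] = 0.7 * x[t - 1] + e[t]

print([round(tau_autocorr(x, d), 3) for d in (1, 2, 5)])
```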

Singular Spectrum Analysis #

Factorization methods, or dimension reduction methods, aim to identify low-dimensional structure in complex high-dimensional data. The most familiar factorization method is Principal Component Analysis (PCA), which finds low-dimensional structure in a collection of vectors.

Time series are not directly amenable to analysis with PCA. There are several ways to adapt PCA to be suitable for use with time series. One such method is called Singular Spectrum Analysis (SSA).

We start with a time series \(x_1, \ldots, x_T\) , which without loss of generality has been centered so that \(\bar{x} = 0\) .

The essential idea of SSA is to explain the variation that arises from translating the observed series \(x\) in the time-domain. A “simple” time series is one whose translates can be represented by linear combinations of just a few components. These would typically be time series with strong periodicity. A complex time series requires many components to represent its translates.

There are several different ways to develop SSA. One approach starts with the following matrix \(Q\) :

\(Q \equiv \left( \begin{array}{cccc} x_1 & x_2 & \cdots & x_L\\ x_2 & x_3 & \cdots & x_{L+1}\\ x_3 & x_4 & \cdots & x_{L+2}\\ \vdots & \vdots & & \vdots\\ x_{T-L+1} & x_{T-L+2} & \cdots & x_T\\ \end{array} \right)\)

The parameter \(L\) is a tuning parameter called the “window width” that corresponds to the maximum lag at which periodicity and correlation structure can be identified through SSA.

Each column and each row of \(Q\) contains a subseries of the overall time series, with consecutive rows and consecutive columns shifted by one position. When \(L \ll T\) , the columns of \(Q\) contain essentially the entire series, while the rows of \(Q\) contain rather short segments from the series.

This matrix \(Q\) is a Hankel matrix, meaning that the “ascending diagonals” are constant: \( Q_{i,1} = Q_{i-1,2} = Q_{i-2,3}. \)

SSA next takes the singular value decomposition (SVD) \(Q = USV^\prime\) , where \(U\) has the same shape as \(Q\) and has orthonormal columns, \(S = \textrm{diag}(s_1, \ldots, s_L)\) is \(L\times L\) and diagonal with non-negative descending diagonal entries \(s_1 \ge s_2 \ge \cdots \ge s_L\) , and \(V\) is an \(L\times L\) orthogonal matrix.
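As an illustration of these first two steps, here is a minimal sketch that builds the trajectory matrix \(Q\) for a chosen window width and takes its SVD with numpy; the helper name `trajectory_matrix` and the example parameters are assumptions made for the sketch.

```python
import numpy as np

def trajectory_matrix(x, L):
    """Build the Hankel trajectory matrix Q with window width L."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    K = T - L + 1
    # Row i holds x[i], x[i+1], ..., x[i+L-1]
    return np.column_stack([x[j:j + K] for j in range(L)])

T = 500
t = np.arange(T)
x = np.sin(2 * np.pi * t / 25) + 0.3 * np.random.default_rng(1).standard_normal(T)
x = x - x.mean()

Q = trajectory_matrix(x, L=60)
U, s, Vt = np.linalg.svd(Q, full_matrices=False)  # Q = U @ diag(s) @ Vt
print(s[:5] / s[0])  # a sine wave yields two dominant singular values
```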

The SVD expresses the matrix \(Q\) as a sum of rank-1 matrices \(s_ku_kv_k^\prime\) , where \(u_k\) and \(v_k\) are the k’th columns of \(U\) and \(V\) respectively. A rank-1 matrix has the property that every row is a scalar multiple of every other row, and every column is a scalar multiple of every other column.

One way to view SSA is that it represents time translation of the observed series \(\{x_j\}_{j=1}^T\) as multiplication of fixed vectors \(u_k\) by various constants. Note that the columns of \(Q\) are time translates of the original series. Each column of \(s_ku_kv_k^\prime\) is a series of scalar multiples of \(u_k\) .

A very elementary type of time series is one in which time translation corresponds to scalar multiplication of a small number of “basis series”. For example, if our time series is sinusoidal \(x_k = \sin(k\cdot m)\) for a “step size” \(m\) , then using the angle sum identity \(\sin(t + s) = \sin(t)\cos(s) + \cos(t)\sin(s)\) , we see that all translations of \(x\) can be written as linear combinations of \(u_1(k) = \sin(k\cdot m) \equiv x_k\) (itself) and \(u_2(k) = \cos(k\cdot m)\) . These also happen to be orthogonal vectors as long as \(T\) is divisible by the period. In words, it only takes two basis vectors to represent all time translates of a sine wave, so a sine wave is very elementary in this sense.

In general, it will take \(L\) basis vectors \(u_1, \ldots, u_L\) to represent \(L\) time translates of a time series. However some of these basis vectors will explain more variation than others. The dimensionality of the SSA decomposition is represented through the “spectrum”, which is the sequence of singular values \(s_1, s_2, \ldots, s_L\) . Graphing the spectrum is analogous to the “scree plot” in PCA, except that it is more conventional in SSA to graph the spectrum in log/log terms, so that \(\log(s_k)\) is plotted against \(\log(k)\) . If the singular values approximately follow a “power law” \(s_k \propto k^{-p}\) for \(p \ge 0\) , then the spectrum plot will be linear on the log/log scale. A power law decays slowly, so a power law spectrum implies that the overall time series is far from simple, and it requires substantial contributions from many components to explain all the variation that arises when translating the series. Ideally, the spectrum will not be exactly a power law, but rather a few dominant components will pop out above the power law trend. These are the components that are most likely to have a clear-cut interpretation.

In addition to inspecting the spectrum, the actual components \(u_k\) can be plotted to assess the time domain behavior that each component represents. Typically only the dominant few such components can be interpreted. An alternative and more elaborate approach is to project the rank-1 matrix \(u_kv_k^\prime\) to have a Hankel structure, which can be accomplished by averaging the values along each ascending diagonal to obtain \(\tilde{u}(i) = \textrm{Avg}(u_k(i)\cdot v_k(1), u_k(i-1)\cdot v_k(2), \ldots)\) . The coefficients \(v_k\) can be inspected to understand how the amplitude of each component varies over time. If the component represents a periodic effect, then the pattern of values in \(v_k\) will be approximately periodic.
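The diagonal averaging step can be sketched as follows; the helper name `hankelize` is illustrative, and the commented usage line assumes the `Q`, `U`, `s`, `Vt` computed in the previous sketch.

```python
import numpy as np

def hankelize(M):
    """Average the anti-diagonals of M (K x L) to obtain a series of length K + L - 1."""
    K, L = M.shape
    out = np.zeros(K + L - 1)
    counts = np.zeros(K + L - 1)
    for i in range(K):
        out[i:i + L] += M[i]     # row i contributes to anti-diagonals i, ..., i + L - 1
        counts[i:i + L] += 1
    return out / counts

# Reconstructed series for the k'th component (continuing the previous sketch):
#   component_k = hankelize(s[k] * np.outer(U[:, k], Vt[k]))
```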

To obtain another perspective on SSA, we can take the Gram matrix \(Q^\prime Q\) . This matrix is approximately Toeplitz, meaning that all of its diagonals are constant. It is only approximately Toeplitz due to edge effects which we ignore here. The matrix \(Q^\prime Q\) is approximately equal to the “autocovariance matrix” of the time series \(x\) . That is, \((Q^\prime Q)_{ij} \propto \textrm{Cov}(B_{|i-j|}x, x)\) , where \(B_\ell\) is the backshift operator that shifts a time series by \(\ell\) positions. The spectrum of \(Q^\prime Q\) is the square of the spectrum of \(Q\) , and the eigenvectors of \(Q^\prime Q\) are the right singular vectors of \(Q\) . This establishes a connection between SSA and the eigen-structure of the autocovariance matrix, which is a central quantity in time series analysis.

Time/Frequency Representations #

A basic frequency representation of a time series starts with a collection of basis vectors \(\psi_{fj}\) , where \(f\) denotes frequency, and \(\psi_{f1}, \psi_{f2}, \ldots\) are a collection of basis vectors at frequency \(f\) . The most common form of frequency representation is the Fourier representation, for which

\(\psi_{f1}(i) = \sin(2\pi fi/T)\)

and

\(\psi_{f2}(i) = \cos(2\pi fi/T)\) .

The basis functions \(\psi_{fj}\) are mutually orthogonal when \(f\) ranges over the Fourier frequencies (integer numbers of cycles over the observation period).

Given a time series \(x\) we can represent it as a linear combination of these basis functions:

\(x_i = \sum_f\sum_j \beta_{fj}\psi_{fj}(i)\)

This is usually taken to be an approximation in the L2 sense, which means that the \(\beta_{fj}\) can in principle be obtained using least squares regression.

The fitted values \(\hat{Y}^{(f)} = \sum_j \hat{\beta}_{fj}\psi_{fj}\) represent the component of the time series at frequency \(f\) , and \(\|\hat{Y}^{(f)}\|^2\) represents the “energy” in the time series at frequency \(f\) .

The Fourier spectrum of a time series is the sequence of energies as defined above. Inspecting the Fourier spectrum can give insight about what frequencies are most prominent in a particular time series.
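As a sketch, the Fourier spectrum can be computed with the FFT; at the Fourier frequencies the sine/cosine basis is orthogonal, so the least squares coefficients described above coincide (up to scaling) with the discrete Fourier transform coefficients. The function name `fourier_spectrum` and the example series are illustrative.

```python
import numpy as np

def fourier_spectrum(x):
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    T = len(x)
    coef = np.fft.rfft(x)              # coefficients at frequencies 0, 1, ..., T//2 cycles
    energy = (np.abs(coef) ** 2) / T   # "energy" at each frequency
    freqs = np.fft.rfftfreq(T)         # in cycles per time step
    return freqs, energy

t = np.arange(400)
x = np.sin(2 * np.pi * t / 20) + 0.5 * np.random.default_rng(2).standard_normal(400)
freqs, energy = fourier_spectrum(x)
print(freqs[np.argmax(energy[1:]) + 1])  # should be close to 1/20 = 0.05
```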

A limitation of the traditional Fourier spectrum is that we may have a long time series that has different spectral characteristics in different segments (i.e. is non-stationary). To obtain an understanding of the spectrum of a non-stationary time series, it is helpful to calculate the Fourier spectrum locally. This gives rise to the time frequency representation, or TFR.

There are various ways to construct a TFR; the basic idea is to introduce a windowing function \(w_t(\cdot)\) that is centered at time point \(t\) . Then, we estimate the representation

\(\hat{Y}^{(f,t)} = \sum_j \hat{\beta}_{fjt}\psi_{fj}\)

using weighted regression, with the \(w_t(\cdot)\) as weights. We can now define “local energies” \(\|\hat{Y}^{(f,t)}\|_t^2\) , where the norm \(\|\cdot\|_t^2\) is also calculated in a weighted manner.

The TFR has a two-dimensional structure since it is indexed by frequency \(f\) and by time \(t\) . It can be visualized, or used as an input for additional formal analysis. In situations where a large collection of time series is observed, we can calculate the TFR for each, then use them as features for a joint analysis of all of the time series.
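Here is a minimal sketch of a TFR for a non-stationary series, using a short-time Fourier transform (via scipy.signal.spectrogram) as one realization of the windowed construction above; the example series and the window length are assumptions.

```python
import numpy as np
from scipy.signal import spectrogram

rng = np.random.default_rng(3)
t = np.arange(2000)
# A non-stationary series: the dominant period changes halfway through
x = np.where(t < 1000, np.sin(2 * np.pi * t / 20), np.sin(2 * np.pi * t / 50))
x = x + 0.3 * rng.standard_normal(t.size)

freqs, times, Sxx = spectrogram(x, nperseg=200)  # Sxx[f, t] holds the local energies
# Dominant frequency near the start (~0.05) and near the end (~0.02)
print(freqs[Sxx[:, 0].argmax()], freqs[Sxx[:, -1].argmax()])
```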

L-moments #

Many statistical summaries take the form of either moments or quantiles. In general, a moment of the random variable \(X\) has the form \(Ef(X)\) , where \(f(\cdot)\) is a function. A quantile \(q\) satisfies the relationship \(P(X \le q) = p\) , where \(p\) is a given probability point.

L-moments are linear functions of quantiles that describe the shape of a distribution. They can be estimated much more stably than classical moments, especially when the data follow a heavy-tailed distribution. They are related to quantiles, but defined explicitly in terms of order statistics of samples of data.

To define L-moments, let \(X_{i:m}\) denote the i’th order statistic from a sample of size \(m\) from a distribution. These are the increasing order statistics, so that \(X_{i:m} \le X_{i+1:m}\) for all \(1 \le i < m\) . The L-moments are defined in terms of expectations of these values. Here are the first four L-moments:

\(\begin{array}{lll} \lambda_1 &\equiv& EX_{1:1}\\ \lambda_2 &\equiv& (EX_{2:2} - EX_{1:2})/2\\ \lambda_3 &\equiv& (EX_{3:3} - 2EX_{2:3} + EX_{1:3})/3\\ \lambda_4 &\equiv& (EX_{4:4} - 3EX_{3:4} + 3EX_{2:4} - EX_{1:4})/4\\ \end{array}\)

These four L-moments are measures of location, dispersion, skew, and kurtosis, respectively, just like the first four classical moments. Higher-order L-moments can also be defined but are not considered further here. Sample estimates of these statistics can be constructed by plugging in the sample analogue of each \(EX_{i:m}\) , using a sample mean to estimate the population mean. However this would seem to involve a lot of computation, since, for example, estimating \(EX_{i:m}\) involves considering all subsets of size \(m\) from the dataset. It turns out that more efficient computational approaches exist that bypass this combinatorial calculation, so that L-moments are actually very fast to calculate even with large data sets. However, the expressions above remain the most intuitive path to understanding the meaning of L-moments.
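The following sketch computes the first four sample L-moments using the standard probability-weighted moment shortcut alluded to above (the usual Hosking formulas); the function name `l_moments` and the simulated example are illustrative.

```python
import numpy as np
from scipy.special import comb

def l_moments(x):
    x = np.sort(np.asarray(x, dtype=float))   # ascending order statistics
    n = len(x)
    i = np.arange(1, n + 1)
    # Probability-weighted moments: b_r = mean of C(i-1, r) / C(n-1, r) * x_(i)
    b = [np.sum(comb(i - 1, r) / comb(n - 1, r) * x) / n for r in range(4)]
    l1 = b[0]
    l2 = 2 * b[1] - b[0]
    l3 = 6 * b[2] - 6 * b[1] + b[0]
    l4 = 20 * b[3] - 30 * b[2] + 12 * b[1] - b[0]
    return l1, l2, l3, l4

rng = np.random.default_rng(4)
print(l_moments(rng.standard_t(df=3, size=10000)))  # location ~0, l3 (skew) ~0
```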

Long memory #

A time series has long memory or long range dependence if its autocorrelations are not summable, i.e. \(\sum_{d=0}^\infty r_d\) is not a finite number. This definition is applicable to time series of infinite length, so cannot be directly checked with data. A typical example of a time series that has long range dependence would be one in which \(r_d \sim 1/d\) , while a time series with \(r_d \sim 1/d^p\) for \(p>1\) would have short range dependence.

Long range dependence (LRD) is an interesting property of a time series to consider, not a problematic issue that needs to be fixed. It is simply a property that some time series exhibit that is important to explore when characterizing the behavior of the series. In many cases, LRD is reduced by differencing the time series, i.e. by defining a new time series \(y_t = x_t - x_{t-1}\) . By reducing or removing the LRD, we can then use more conventional methods to characterize the short range dependence structure of the differenced series. This does not mean that the LRD was an artifact that we removed, it simply means that we describe the time series from two points of view – the long range dependence of the original series, and the short range dependence of the differenced series.

A property of a time series with long memory is that the sample means of blocks of consecutive data values behave completely differently from what we see under short-range dependence or independence. Specifically, suppose we consider the sample mean of \(d\) consecutive values, \(\bar{x}_d = (x_t + \cdots + x_{t+d-1})/d\) . Let \(v_d = \textrm{Var}(\bar{x}_d)\) denote the variance of \(\bar{x}_d\) . In many cases we have a relationship of the form \(v_d \sim c\cdot d^\beta\) , with \(\beta = -1\) corresponding to what we see under short-range dependence or independence. When \(\beta > -1\) we have long range dependence.

The Hurst index captures the form of dependence exhibited by a particular time series. It satisfies \(H = 1 + \beta/2\) , so that \(v_d \sim c\cdot d^{2(H-1)}\) . A Hurst index of 1/2 reflects short range dependence or independence, and Hurst indices greater than 1/2 indicate long range dependence.

The Hurst index is quite difficult to estimate and even harder to consider from the point of view of statistical uncertainty. A number of estimators exist; the most basic is based on the sample analogue of the relationship between the variance of the block sample means and the block size. We can create blocks of various sizes, calculate the sample means for non-overlapping blocks, then calculate the sample variances of these sample means. This gives us estimates \(\hat{v}_d\) for values of \(d\) that are sufficiently small relative to \(n\) . A log/log plot of \(\hat{v}_d\) against \(d\) should be approximately linear with a slope of \(\beta\) .
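Here is a minimal sketch of this aggregated-variance approach, fitting the log/log slope by least squares; the function name `hurst_aggvar`, the block sizes, and the white-noise example are assumptions made for illustration.

```python
import numpy as np

def hurst_aggvar(x, block_sizes):
    """Estimate H = 1 + beta/2 from variances of non-overlapping block means."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    vs = []
    for d in block_sizes:
        nblocks = n // d
        means = x[:nblocks * d].reshape(nblocks, d).mean(axis=1)
        vs.append(means.var(ddof=1))
    beta = np.polyfit(np.log(block_sizes), np.log(vs), 1)[0]  # log/log slope
    return 1 + beta / 2

rng = np.random.default_rng(5)
x = rng.standard_normal(100_000)          # independent data
block_sizes = np.array([10, 20, 50, 100, 200, 500])
print(hurst_aggvar(x, block_sizes))       # should be close to 1/2
```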

Another approach to estimating the Hurst index is based on a fractional Brownian motion (FBM) model. The increments of FBM (fractional Gaussian noise) have autocorrelation at lag \(d\) of the form \(r_d = (|d+1|^{2H} - 2|d|^{2H} + |d-1|^{2H})/2\) , where \(H\) is the Hurst parameter. Given this parametric form, maximum likelihood can be used to estimate \(H\) . Specifically, if \(z\) is our data, and we set \(\Sigma_{ij}(H) = r_{|i-j|}(H)\) , then \(-\log(|\Sigma(H)|)/2 - z^\prime \Sigma(H)^{-1} z/2\) is (up to constants) a log-likelihood function that can be maximized over \(H\) .
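A sketch of this likelihood approach, using a simple grid search over \(H\) rather than a formal optimizer; the function names and the white-noise example data (for which the estimate should be near 1/2) are illustrative.

```python
import numpy as np

def fgn_corr(H, n):
    """Correlation matrix of fractional Gaussian noise with Hurst parameter H."""
    d = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)
    return 0.5 * (np.abs(d + 1) ** (2 * H) - 2 * d ** (2 * H) + np.abs(d - 1) ** (2 * H))

def loglike(H, z):
    S = fgn_corr(H, len(z))
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * logdet - 0.5 * z @ np.linalg.solve(S, z)

rng = np.random.default_rng(6)
z = rng.standard_normal(300)                      # white noise corresponds to H = 1/2
grid = np.linspace(0.05, 0.95, 19)
print(grid[np.argmax([loglike(H, z) for H in grid])])
```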

Heavy tails #

The right tail of a probability distribution is described through its rate of decay, \(P(X > t)\) , taken as a function of \(t\) . Some distributions like the Gaussian distribution have light tails because \(P(X > t) \sim \exp(-t^2)\) , i.e. the tails decay at an exponential rate. A heavy-tailed distribution has tails that decay at a polynomial rate, \(P(X > t) \sim 1/t^\alpha\) .

A parametric distribution with a polynomial right tail is the Pareto distribution. However it is not realistic to use the Pareto distribution to model most datasets, because while the tail behavior may capture the truth, the behavior away from the tail may be wrong. Thus, it is desirable to estimate the tail parameter \(\alpha\) nonparametrically. A common way to do this is using the “Hill estimator”. The motivation for the Hill estimator begins with the observation that for a Pareto distribution with unit scale (support on \([1, \infty)\) ), the identity \(E\log(X) = 1/\alpha\) holds. Thus we can use the reciprocal of the average of the logged data to estimate the tail parameter, if we have a sample from a Pareto distribution.

If we have data sampled from a distribution with polynomial tails that is not necessarily Pareto, then we may wish to use only the upper order statistics to estimate the tail parameter. If the descending order statistics are \(x_{(1)} \ge x_{(2)} \ge \cdots \) , then we can estimate \(1/\alpha\) using \(H_k \equiv (\log x_{(1)} + \cdots + \log x_{(k)})/k - \log x_{(k+1)}\) , so that \(1/H_k\) estimates the tail parameter \(\alpha\) .

Since this estimator only uses the order statistics from the tail of the distribution, it is unaffected by the shape of the distribution away from the tail. In practice, a “Hill plot” – a plot of \(H_k\) against \(k\) – is used for estimation. If the plot is stable for a range of moderate values of \(k\) , then this stable value can be used to estimate the tail parameter.
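A minimal sketch of the Hill estimator applied to simulated Pareto data; recall that \(H_k\) estimates \(1/\alpha\) , so the tail index estimate is \(1/H_k\) . The function name `hill` and the choice of \(k\) values are illustrative.

```python
import numpy as np

def hill(x, k):
    xs = np.sort(np.asarray(x, dtype=float))[::-1]   # descending order statistics
    return np.mean(np.log(xs[:k])) - np.log(xs[k])   # H_k, an estimate of 1/alpha

rng = np.random.default_rng(7)
x = rng.pareto(3.0, size=50_000) + 1                 # classical Pareto, tail index alpha = 3
print([round(1 / hill(x, k), 2) for k in (100, 500, 2000)])  # should hover near 3
```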

Models for short range dependence #

Classical time series analysis is largely concerned with models for the short-range dependence structure. There are many useful approaches; we mention only one of them here – the autoregressive approach – because it is easy to explain in the language of regression. An autoregressive mean structure model has the form \(E[x_t | x_{t-1}, x_{t-2}, \ldots] = f(x_{t-1}, x_{t-2}, \ldots)\) . A model of this form can be estimated using any form of linear or nonlinear regression to fit the specified mean structure.
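As a sketch, a linear autoregressive mean structure can be fit with ordinary least squares on lagged values; the function name `fit_ar` and the simulated AR(1) example are assumptions (a library such as statsmodels provides more complete tooling for this).

```python
import numpy as np

def fit_ar(x, p):
    """Fit E[x_t | x_{t-1}, ..., x_{t-p}] = b0 + b1*x_{t-1} + ... + bp*x_{t-p} by OLS."""
    x = np.asarray(x, dtype=float)
    # Design matrix of lagged values: row t holds x[t-1], ..., x[t-p]
    X = np.column_stack([x[p - j - 1:len(x) - j - 1] for j in range(p)])
    X = np.column_stack([np.ones(len(X)), X])         # intercept
    y = x[p:]
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs                                      # [intercept, phi_1, ..., phi_p]

rng = np.random.default_rng(8)
x = np.zeros(5000)
for t in range(1, 5000):
    x[t] = 0.6 * x[t - 1] + rng.standard_normal()
print(fit_ar(x, p=2))   # phi_1 should be near 0.6, phi_2 near 0
```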