% LaTeX template, xeCJK usepackage and Unicode text need XeLaTeX to compile.
% MiKTeX package can be downloaded at miktex.org, WinEdt can be downloaded at http://www.winedt.com/.
% WinEdt 6 and higher provide XeLaTeX and Unicode support.
% LaTeX template version 1.0.2, last revised on 2014-11-04.
\documentclass[10pt]{article}
% *************************** installed packages *************************
% AMS packages
\usepackage{amsfonts} % TeX fonts from the American Mathematical Society.
\usepackage{amsmath} % AMS mathematical facilities for LaTeX.
\usepackage{amssymb} % AMS symbols
\usepackage{amsthm} % Provide proclamations environment.
% graphics packages
\usepackage{graphics} % Standard LaTeX graphics.
\usepackage{tikz} % TikZ and PGF package for graphics
\usetikzlibrary{matrix} % matrix library of TikZ package
\usetikzlibrary{trees} % trees library of TikZ package
% support for foreign languages, esp. Chinese.
\usepackage{xeCJK} % Support for CJK (Chinese, Japanese, Korean) documents in XeLaTeX.
\setCJKmainfont{SimSun} % MUST appear when xeCJK is loaded.
%\setCJKmainfont{DFKai-SB} % Set the CJK roman-family font for body text, affecting \rmfamily and \textrm; here DFKai-SB (standard Kai script).
%\setCJKmainfont{SimSun} % Set the CJK roman-family font for body text, affecting \rmfamily and \textrm; here SimSun.
%\setCJKmonofont{MingLiU} % Set the CJK monospace-family font for body text, affecting \ttfamily and \texttt; here MingLiU.
%\renewcommand\abstractname{摘要} % Rename the abstract heading: abstract -> 摘要.
%\renewcommand\appendixname{附录} % Rename the appendix heading: appendix -> 附录.
%\renewcommand\bibname{参考文献} % Rename the bibliography heading: bibliography -> 参考文献.
%\renewcommand\contentsname{目录} % Rename the table-of-contents heading: contents -> 目录.
%\renewcommand\refname{参考文献} % Rename the references heading: references -> 参考文献.
% miscellaneous packages
\usepackage[toc, page]{appendix} % Extra control of appendices.
\usepackage{clrscode} % Typesets pseudocode as in Introduction to Algorithms.
%\usepackage{courier} % Typesets program code.
\usepackage{epsfig}
\usepackage{eurosym} % Metafont and macros for Euro sign.
\usepackage{float} % Improved interface for floating objects.
\usepackage{fontspec} % Advanced font selection in XeLaTeX and LuaLaTeX.
\usepackage{indentfirst}
\usepackage{xcolor} % Driver-independent color extensions for LaTeX and pdfLaTeX.
% must-be-the-last packages
\usepackage[pagebackref]{hyperref} % Extensive support for hypertext in LaTeX; MUST be on the last \usepackage line in the preamble. [pagebackref] for page referencing; [backref] for section referencing.
% ********************** end of installed packages ***********************
% ************************** fullpage.sty ********************************
% This is FULLPAGE.STY by H.Partl, Version 2 as of 15 Dec 1988.
% Document Style Option to fill the paper just like Plain TeX.
\typeout{Style Option FULLPAGE Version 2 as of 15 Dec 1988}
\topmargin 0pt \advance \topmargin by -\headheight \advance
\topmargin by -\headsep
\textheight 8.9in
\oddsidemargin 0pt \evensidemargin \oddsidemargin \marginparwidth
0.5in
\textwidth 6.5in
% For users of A4 paper: The above values are suited for American 8.5x11in
% paper. If your output driver performs a conversion for A4 paper, keep
% those values. If your output driver conforms to the TeX standard (1in/1in),
% then you should add the following commands to center the text on A4 paper:
% \advance\hoffset by -3mm % A4 is narrower.
% \advance\voffset by 8mm % A4 is taller.
% ************************ end of fullpage.sty ***************************
% ************** Proclamations (theorem-like structures) *****************
% [section] option provides numbering within a section.
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}[section]
\newtheorem{definition}{Definition}[section]
\newtheorem{prop}{Proposition}[section]
\newtheorem{corollary}{Corollary}[section]
\newtheorem{remark}{Remark}[section]
\newtheorem{example}{Example}
% ************************************************************************
% ************* Solutions use a modified proof environment ***************
\newenvironment{solution}
{\begin{proof}[Solution]}
{\end{proof}}
% ************************************************************************
% ************* Frequently used commands as shorthand ********************
\newcommand{\norm}{|\!|}
% ************************************************************************
\begin{document}
\title{ \Huge{\it Time Series Models: A Guide for Practitioners} }
\author{Yan Zeng}
\date{Version 1.0.2, last revised on 2015-02-17.}
\maketitle
\begin{abstract}
A user's guide on time series models, based on Ruppert \cite{Ruppert11}.
\end{abstract}
\tableofcontents
\newpage
\section{Test of Stationarity}\label{sect_stationarity}
\noindent $\bullet$ Visual inspection via the time series plot (stationarity implies the {\it mean-reversion} phenomenon).
\medskip
\noindent $\bullet$ Plot of autocorrelation functions (ACF).
Most statistical software will plot a sample ACF with {\it test bounds}. These bounds are used to test the null hypothesis that an autocorrelation coefficient is 0; the null hypothesis is rejected if the sample autocorrelation falls outside the bounds. The usual level of the test is 0.05.
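As an illustration of how these bounds work, here is a minimal pure-Python sketch (the function and variable names are mine, not from any package); for a series of length $n$ the bounds are approximately $\pm 1.96/\sqrt{n}$:

```python
import math
import random

def sample_acf(x, max_lag):
    """Sample autocorrelations rho_hat(1), ..., rho_hat(max_lag) of a series x."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    def c(k):
        return sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k))
    return [c(k) / c0 for k in range(1, max_lag + 1)]

random.seed(1)
white_noise = [random.gauss(0, 1) for _ in range(400)]
bound = 1.96 / math.sqrt(len(white_noise))  # approximate 0.05-level test bounds
# Lags whose sample autocorrelation falls outside the bounds are flagged:
flagged = [k + 1 for k, r in enumerate(sample_acf(white_noise, 10)) if abs(r) > bound]
```

For genuine white noise, roughly 5\% of lags are flagged by chance alone, which is one reason a simultaneous test such as the Ljung-Box test below is also useful.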
\medskip
\noindent $\bullet$ Hypothesis testing via a simultaneous test (Ljung-Box test).
A {\it simultaneous test} is one that tests whether a group of null hypotheses are all true versus the alternative that at least one of them is false. The null hypothesis of the Ljung-Box test is $H_0$:
\[
\rho(1) = \rho(2) = \cdots = \rho(K) = 0 \;\; \mbox{for some $K$, say $K=5$ or $10$.}
\]
The Ljung-Box test is implemented by \texttt{R}'s \texttt{Box.test} function. $K$ is called \texttt{lag} in the call to \texttt{Box.test} and \texttt{df} in the output.
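The statistic itself is simple to compute. Below is a rough pure-Python sketch (helper names are mine; \texttt{Box.test} additionally reports a $p$-value from the $\chi^2_K$ distribution):

```python
import random

def ljung_box_q(x, K):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1}^K rho_hat(k)^2 / (n-k).
    Under H0 (first K autocorrelations all zero), Q is approximately
    chi-square distributed with K degrees of freedom."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x)
    def r(k):
        return sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / c0
    return n * (n + 2) * sum(r(k) ** 2 / (n - k) for k in range(1, K + 1))

random.seed(7)
noise = [random.gauss(0, 1) for _ in range(200)]            # H0 plausible
trend = [0.1 * t + random.gauss(0, 1) for t in range(200)]  # strongly autocorrelated
q_noise = ljung_box_q(noise, 10)
q_trend = ljung_box_q(trend, 10)  # far above the chi-square(10) 0.05 critical value 18.31
```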
\medskip
\noindent $\bullet$ Hypothesis testing via the Durbin-Watson test.
The null hypothesis of the Durbin-Watson test is that the first $p$ autocorrelation coefficients are all 0, where $p$ can be selected by the user. In the \texttt{R} function \texttt{durbin.watson} in the \texttt{car} package, $p$ is called \texttt{max.lag} and has a default value of 1. The $p$-value is computed by \texttt{durbin.watson} using bootstrapping; with the default of $B=1000$ resamples, the calculated $p$-value can show substantial random variation, and using a larger number of resamples computes it more accurately. The \texttt{lmtest} package of \texttt{R} has another function, \texttt{dwtest}, that computes the Durbin-Watson test, but only with $p=1$; \texttt{dwtest} uses either a normal approximation (the default) or an exact algorithm to calculate the $p$-value.
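For $p=1$ the Durbin-Watson statistic is $\sum_{t=2}^n (e_t-e_{t-1})^2 / \sum_{t=1}^n e_t^2 \approx 2(1-\widehat\rho(1))$ for residuals $e_t$. A minimal sketch (function name mine):

```python
def durbin_watson(e):
    """Durbin-Watson statistic for residuals e (the p = 1 case):
    DW = sum_{t=2}^n (e_t - e_{t-1})^2 / sum_{t=1}^n e_t^2 ~ 2*(1 - rho_hat(1)).
    Near 2: no lag-1 autocorrelation; near 0: positive; near 4: negative."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(v ** 2 for v in e)
    return num / den

dw_pos = durbin_watson([1.0] * 50 + [-1.0] * 50)  # strongly positively correlated -> near 0
dw_neg = durbin_watson([1.0, -1.0] * 50)          # strongly negatively correlated -> near 4
```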
\section{AR(1) model}
{\bf Properties of an AR(1) process}. An AR(1) process $Y_t - \mu = \phi(Y_{t-1} - \mu) + \epsilon_t$ ($t=1,2,\cdots$) is weakly stationary if and only if $|\phi|<1$ and $Y_0$ satisfies
\begin{eqnarray}\label{init_distr_stationary}
E[Y_0] = \mu, \; \mbox{Var}(Y_0) = \frac{\sigma^2_{\epsilon}}{1-\phi^2}.
\end{eqnarray}
In this case, we have for any $t$ and $h$,
\begin{eqnarray*}
&& E(Y_t) = \mu\\
&& \gamma(0) = \mbox{Var}(Y_t) = \frac{\sigma^2_{\epsilon}}{1-\phi^2} \\
&& \gamma(h) = \mbox{Cov}(Y_t, Y_{t+h}) = \frac{ \sigma^2_{\epsilon} \phi^{|h|} }{1-\phi^2} \\
&& \rho(h) = \mbox{Corr}(Y_t, Y_{t+h}) = \phi^{|h|}.
\end{eqnarray*}
Moreover, the process has the {\it infinite moving average} [MA($\infty$)] representation:
\[
Y_t = \mu + \epsilon_t + \phi \epsilon_{t-1} + \phi^2 \epsilon_{t-2} + \cdots = \mu + \sum_{h=0}^{\infty} \phi^h \epsilon_{t-h}.
\]
If $Y_0$ does not satisfy condition (\ref{init_distr_stationary}) but we still have $|\phi|<1$, then $(Y_t)_{t=1}^{\infty}$ is not stationary but converges to the stationary distribution as $t\to\infty$. If $|\phi|\ge 1$, then $(Y_t)_{t=1}^{\infty}$ is either explosive or a random walk and cannot be stationary.\footnote{It is assumed that the observations start at $Y_1$, so that $Y_0$ is not available.}
The ACF $\rho(h)$ of an AR(1) process depends on only one parameter ($\phi$) and has only a limited range of shapes (the magnitude of its ACF decays geometrically to zero). If the sample ACF of the data does not behave in one of these ways, then an AR(1) model is unsuitable. The remedy is to use more AR parameters or to switch to another class of models, such as the moving average (MA) or autoregressive moving average (ARMA) models.
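A quick simulation check of the geometric decay $\rho(h)=\phi^{|h|}$, as a pure-Python sketch (all names are mine; in \texttt{R} one would typically use \texttt{arima.sim} and \texttt{acf}):

```python
import random

def simulate_ar1(mu, phi, sigma_eps, n, seed=0):
    """Simulate an AR(1) path started from its stationary distribution."""
    rng = random.Random(seed)
    sd0 = sigma_eps / (1 - phi ** 2) ** 0.5  # stationary standard deviation
    y = [mu + rng.gauss(0, sd0)]
    for _ in range(n - 1):
        y.append(mu + phi * (y[-1] - mu) + rng.gauss(0, sigma_eps))
    return y

def lag1_corr(x):
    """Sample lag-1 autocorrelation."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x)
    return sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1)) / c0

y = simulate_ar1(mu=5.0, phi=0.8, sigma_eps=1.0, n=5000, seed=42)
r1 = lag1_corr(y)  # should be close to phi = 0.8
```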
\medskip
{\bf Residuals and Model Checking}. To verify that an AR(1) process is a suitable model, we can check on its residuals; any autocorrelation in the residuals is evidence against the assumption of an AR(1) process. Tests include ACF plot and the Ljung-Box test of the residuals (see Section \ref{sect_stationarity}). Any residual ACF value outside the test bounds is significantly different from 0 at the 0.05 level. The danger here is that some sample ACF values will be significant merely by chance, and to guard against this danger, one can use the Ljung-Box test that {\it simultaneously} tests that all autocorrelations up to a specified lag are zero. \footnote{When the Ljung-Box test is applied to residuals, a correction is needed to account for the use of $\hat\phi$ in place of the unknown $\phi$. Some software makes this correction automatically. In \texttt{R} the correction is not automatic but is done by setting the \texttt{fitdf} parameter in \texttt{Box.test} to the number of parameters that were estimated, so for an AR(1) model \texttt{fitdf} should be 1.}
\medskip
{\bf Estimation of AR(1) processes}. Assuming the noise terms are i.i.d. normal and $(Y_t)_{t=1}^{\infty}$ is Markov, we can use the {\it maximum likelihood estimator} (MLE) or the {\it conditional least-squares estimator} to estimate $\mu$, $\phi$, and $\sigma_{\epsilon}$. The default method for the \texttt{arima} function in \texttt{R} is to use the conditional least-squares estimates as starting values for maximum likelihood; the MLE is returned. See Ruppert \cite[page~218]{Ruppert11} for details.
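Conditional least squares for AR(1) amounts to regressing $Y_t$ on $Y_{t-1}$. A sketch under that interpretation (names mine):

```python
import random

def cls_ar1(y):
    """Conditional least-squares estimates for an AR(1) model:
    regress Y_t on Y_{t-1} and return (mu_hat, phi_hat)."""
    x, z = y[:-1], y[1:]
    mx, mz = sum(x) / len(x), sum(z) / len(z)
    phi = sum((a - mx) * (b - mz) for a, b in zip(x, z)) / sum((a - mx) ** 2 for a in x)
    intercept = mz - phi * mx   # fitted model: Y_t = intercept + phi * Y_{t-1}
    mu = intercept / (1 - phi)  # since intercept = mu * (1 - phi)
    return mu, phi

# Simulate an AR(1) series with known parameters and recover them:
rng = random.Random(3)
mu_true, phi_true = 2.0, 0.6
y = [mu_true]
for _ in range(4000):
    y.append(mu_true + phi_true * (y[-1] - mu_true) + rng.gauss(0, 1))
mu_hat, phi_hat = cls_ar1(y)
```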
\section{AR(\texorpdfstring{$p$}{TEXT}) model}
An AR($p$) process has the form
\[
Y_t - \mu = \phi_1(Y_{t-1}-\mu) + \cdots + \phi_p(Y_{t-p}-\mu) + \epsilon_t,
\]
where the $\epsilon_t$'s are a weak white noise process WN($0$, $\sigma_{\epsilon}^2$). It can be shown that $\{1-(\phi_1+\cdots+\phi_p)\}>0$ for a stationary process. The ACFs of AR($p$) processes with $p>1$, although given by complicated formulas, can be computed and plotted using \texttt{R}'s \texttt{ARMAacf} function.
Most of the concepts for AR(1) models generalize easily to AR($p$) models. The conditional least squares or maximum likelihood estimators can be calculated using software such as \texttt{R}'s \texttt{arima} function.
Residual autocorrelation can be detected by examining the sample ACF of the residuals and using the Ljung-Box test. Many statistical software packages have functions to automate the search for the AR model that optimizes AIC or other criteria, e.g. the \texttt{auto.arima} function in \texttt{R}'s \texttt{forecast} package.
One problem with AR models is that they often need a rather large value of $p$ to fit a data set. When AR models are not parsimonious, the remedy is to use an MA or ARMA model.
\section{MA Processes}
The idea behind AR processes is to feed past data back into the current value of the process. The effect is to have at least some correlation at {\it all} lags. Sometimes data show correlation only at short lags; AR processes do not behave this way, so they do not provide a parsimonious fit to such data. A useful alternative to an AR model is a moving average (MA) model. E.g., the MA(1) process is
\[
Y_t - \mu = \epsilon_t + \theta \epsilon_{t-1},
\]
where $\epsilon_t$'s are WN(0, $\sigma^2_{\epsilon}$). One can show that
\begin{eqnarray*}
&& E(Y_t) = \mu, \\
&& \mbox{Var}(Y_t) = \sigma^2_{\epsilon}(1+\theta^2), \\
&& \gamma(1) = \theta \sigma^2_{\epsilon}, \\
&& \gamma(h) = 0 \;\mbox{if $|h|>1$},\\
&& \rho(1) = \frac{\theta}{1+\theta^2}, \\
&& \rho(h) = 0 \; \mbox{if $|h|>1$}.
\end{eqnarray*}
The MA($q$) process is
\[
Y_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \cdots + \theta_q \epsilon_{t-q}.
\]
One can show that $\gamma(h) = 0$ and $\rho(h) = 0$ if $|h|>q$. $\gamma(h)$ and $\rho(h)$ when $|h|\le q$ can be computed in \texttt{R} by the function \texttt{ARMAacf}.
MA($q$) models can be fit easily using, for example, the \texttt{arima} function in \texttt{R}. The Ljung-Box test is applied to the residuals to check that the fitted MA($q$) model is adequate.
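A quick numerical check of the MA(1) moment formulas above, as a pure-Python sketch (names mine):

```python
import random

def simulate_ma1(mu, theta, sigma_eps, n, seed=0):
    """Y_t = mu + eps_t + theta * eps_{t-1}, with eps_t iid N(0, sigma_eps^2)."""
    rng = random.Random(seed)
    eps = [rng.gauss(0, sigma_eps) for _ in range(n + 1)]
    return [mu + eps[t] + theta * eps[t - 1] for t in range(1, n + 1)]

def acf_at(x, k):
    """Sample lag-k autocorrelation."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x)
    return sum((x[t] - m) * (x[t + k] - m) for t in range(n - k)) / c0

theta = 0.5
y = simulate_ma1(0.0, theta, 1.0, 20000, seed=11)
rho1_theory = theta / (1 + theta ** 2)  # = 0.4
r1 = acf_at(y, 1)                       # near 0.4
r2 = acf_at(y, 2)                       # near 0: ACF cuts off after lag 1
```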
\section{ARMA Model}
The ARMA($p$, $q$) model is
\[
(Y_t - \mu) = \phi_1 (Y_{t-1}-\mu) + \cdots + \phi_p(Y_{t-p} - \mu) + \epsilon_t + \theta_1 \epsilon_{t-1} + \cdots + \theta_q \epsilon_{t-q}.
\]
Using the backwards operator $B$, this can be written as
\[
(1-\phi_1B-\cdots-\phi_pB^p)(Y_t-\mu)=(1+\theta_1B+\cdots+\theta_qB^q)\epsilon_t.
\]
The ARMA(1,1) model has
\begin{eqnarray}\label{ADF_arma(1,1)}
\rho(1) = \frac{(1+\phi\theta)(\phi+\theta)}{1+\theta^2+2\phi\theta}, \; \rho(k) = \phi\rho(k-1), \; k\ge 2.
\end{eqnarray}
So, after one lag the ACF of an ARMA(1,1) process decays in the same way as the ACF of an AR(1) process with the same $\phi$.
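A small sketch evaluating the ARMA(1,1) ACF from the lag-1 formula and the AR(1)-style recursion above (function name mine):

```python
def arma11_acf(phi, theta, max_lag):
    """ACF of an ARMA(1,1) process: closed-form rho(1), then the
    AR(1)-style recursion rho(k) = phi * rho(k-1) for k >= 2."""
    rho = [1.0, (1 + phi * theta) * (phi + theta) / (1 + theta ** 2 + 2 * phi * theta)]
    for _ in range(2, max_lag + 1):
        rho.append(phi * rho[-1])
    return rho  # rho[k] is the lag-k autocorrelation

rho = arma11_acf(phi=0.5, theta=0.3, max_lag=5)
```

The same values should come from \texttt{R}'s \texttt{ARMAacf(ar=0.5, ma=0.3, lag.max=5)}.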
The parameters of ARMA models can be estimated by maximum likelihood or conditional least squares. The estimation methods for AR($p$) models are very similar to those for AR(1) models. For MA and ARMA models, because the noise terms $\epsilon_1$, $\cdots$, $\epsilon_n$ are unobserved, there are complications that are best left to advanced time series texts.
\section{ARIMA Model}
Often the first or perhaps second differences of nonstationary time series are stationary. A time series $Y_t$ is said to be an ARIMA($p$, $d$, $q$) process if $\Delta^d Y_t$ is ARMA($p$, $q$). We will say that a process is I($d$) if it is stationary after being differenced $d$ times; such a process is said to be ``integrated of order $d$.''
For example, an AR(1) process is stationary and varies randomly about its mean; the integral of this process behaves much like a random walk, having no fixed level to which it reverts; and the second integral has {\it momentum}: once the process starts moving upward or downward, it tends to continue in that direction. If data show momentum like this, the momentum is an indication that $d=2$.
The \texttt{arima} function in \texttt{R} makes the assumption that a differenced process has mean zero. The \texttt{R} function \texttt{auto.arima} allows a differenced process to have a nonzero mean, which is called the \texttt{drift} in the output.
\section{Unit Root Tests}
For an ARMA($p$, $q$) process $\{Y_t\}$ satisfying
\[
(Y_t - \mu) = \phi_1(Y_{t-1}-\mu) + \cdots + \phi_p(Y_{t-p}-\mu)+\epsilon_t+\theta_1\epsilon_{t-1}+\cdots+\theta_q\epsilon_{t-q}
\]
to be stationary, all roots of the polynomial
\[
1 - \phi_1 x - \cdots - \phi_p x^p
\]
must have absolute values greater than one. If there is a {\it unit root}, then the ARMA process is nonstationary and behaves much like a random walk. The {\it explosive case} is when a root has an absolute value less than 1. \texttt{R}'s \texttt{auto.arima} function automatically selects an ARIMA model, and the \texttt{polyroot} function can be used to check whether the roots of $1 - \hat\phi_1 x - \cdots - \hat\phi_p x^p$ have absolute values greater than 1.
\medskip
{\bf Unit root tests}. Unit root tests are used to decide whether an \underline{\it AR model} has a root with absolute value equal to 1. Popular tests include
\smallskip
$\bullet$ The augmented Dickey-Fuller test (\texttt{adf.test} in \texttt{R}'s \texttt{tseries} package). The null hypothesis is that there is a unit root and the usual alternative is that the process is stationary but one can instead use the alternative that the process is explosive.
\smallskip
$\bullet$ The Phillips-Perron test (\texttt{pp.test} in \texttt{R}'s \texttt{tseries} package). It is similar to the ADF test.
\smallskip
$\bullet$ The KPSS test (\texttt{kpss.test} in \texttt{R}'s \texttt{tseries} package). The null hypothesis is stationarity and the alternative is a unit root, just opposite of the hypotheses for the Dickey-Fuller and Phillips-Perron tests.
\medskip
{\bf How Do Unit Root Tests Work?} The Dickey-Fuller test is based on the AR(1) model
\begin{eqnarray}\label{DF_test}
Y_t = \phi Y_{t-1} + \epsilon_t.
\end{eqnarray}
The null hypothesis ($H_0$) is $\phi=1$ and the alternative ($H_1$) is stationarity, that is, $\phi < 1$. Model (\ref{DF_test}) is equivalent to
\begin{eqnarray}\label{ADF_test}
\Delta Y_t = \pi Y_{t-1} + \epsilon_t
\end{eqnarray}
where $\pi = \phi -1$. Stated in terms of $\pi$, $H_0$ is $\pi = 0$ and $H_1$ is $\pi < 0$. The Dickey-Fuller test regresses $\Delta Y_t$ on $Y_{t-1}$ and tests $H_0$.
The augmented Dickey-Fuller test expands model (\ref{DF_test}) by adding a time trend and lagged values of $\Delta Y_t$:
\[
\Delta Y_t = \beta_0 + \beta_1 t + \pi Y_{t-1} + \sum_{j=1}^p \gamma_j \Delta Y_{t-j} + \epsilon_t.
\]
The hypotheses are still $H_0: \pi = 0$ and $H_1: \pi < 0$. There are several methods for selecting $p$. The \texttt{adf.test} function has a default value of $p$ equal to \texttt{trunc((length(x)-1)\^{}(1/3))}, where \texttt{x} is the input series.
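To see why the regression form works, the following pure-Python sketch runs the no-constant Dickey-Fuller regression of $\Delta Y_t$ on $Y_{t-1}$ for a stationary AR(1) series and for a random walk (names mine; the full test also requires the Dickey-Fuller critical values, omitted here):

```python
import random

def df_pi_hat(y):
    """No-constant Dickey-Fuller regression of Delta Y_t on Y_{t-1}:
    pi_hat = sum y_{t-1} * (y_t - y_{t-1}) / sum y_{t-1}^2.
    H0 (unit root): pi = 0; H1 (stationary): pi < 0."""
    num = sum(y[t - 1] * (y[t] - y[t - 1]) for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den

rng = random.Random(9)
stationary, walk = [0.0], [0.0]
for _ in range(2000):
    stationary.append(0.5 * stationary[-1] + rng.gauss(0, 1))  # AR(1), phi = 0.5
    walk.append(walk[-1] + rng.gauss(0, 1))                    # random walk, phi = 1
pi_stat = df_pi_hat(stationary)  # near phi - 1 = -0.5
pi_walk = df_pi_hat(walk)        # near 0
```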
\medskip
{\bf Automatic selection of an ARMA model}. The \texttt{R} function \texttt{auto.arima} can select all three parameters, $p$, $d$, and $q$, for an ARIMA model. The differencing parameter $d$ is selected using the KPSS test. If the null hypothesis of stationarity is accepted when the KPSS test is applied to the original time series, then $d = 0$. Otherwise, the series is differenced until the KPSS test accepts the null hypothesis. After that, $p$ and $q$ are selected using either AIC or BIC. $\mbox{AIC}_c$ in \texttt{auto.arima}'s output is the value of the corrected AIC criterion, in which $p$ denotes the number of estimated parameters and $n$ the number of observations:
\[
\mbox{AIC}_c = \mbox{AIC} + \frac{2p(p+1)}{n-p-1}.
\]
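The correction as a small function (names mine); it matters for small samples and vanishes as $n$ grows:

```python
def aic_corrected(aic, n_params, n_obs):
    """Corrected AIC: AICc = AIC + 2p(p+1)/(n - p - 1), where p is the
    number of estimated parameters and n the number of observations."""
    p, n = n_params, n_obs
    return aic + 2 * p * (p + 1) / (n - p - 1)

small_sample = aic_corrected(100.0, 3, 30)    # 100 + 24/26, a visible penalty
large_sample = aic_corrected(100.0, 3, 3000)  # correction nearly zero
```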
\section{Forecasting}
Most time series software packages offer functions to automate forecasting. \texttt{R}'s \texttt{predict} function forecasts using an ``object'' returned by the \texttt{arima} fitting function.
To quantify the uncertainty of the predictions, one first computes the variance of the forecast error. A $(1-\alpha)100\%$ prediction interval is then the forecast itself plus or minus the forecast error's standard deviation times $z_{\alpha/2}$ (the normal upper quantile). The use of $z_{\alpha/2}$ assumes that $\epsilon_1, \epsilon_2, \cdots$ is Gaussian white noise.
\medskip
{\bf Forecasting using an AR(1) process $Y_{n+1}=\mu+\phi(Y_n-\mu)+\epsilon_{n+1}$}. The general formula for the $k$-step-ahead forecast is
\[
\widehat Y_{n+k} = \widehat \mu + \widehat \phi^k(Y_n - \widehat \mu).
\]
If $|\widehat \phi|<1$, as is true for a stationary series, then as $k$ increases, the forecasts will converge exponentially fast to $\widehat \mu$. The $k$-step-ahead forecasting error is
\begin{eqnarray*}
Y_{n+k} - \widehat Y_{n+k} &\approx& \phi (Y_{n+k-1}-\widehat Y_{n+k-1}) + \epsilon_{n+k}
= \cdots
= \phi^{k-1}\epsilon_{n+1} + \phi^{k-2}\epsilon_{n+2} + \cdots + \phi\epsilon_{n+k-1} + \epsilon_{n+k},
\end{eqnarray*}
which implies $\mbox{Var}(Y_{n+k} - \widehat Y_{n+k}) = \left(\frac{1-\phi^{2k}}{1-\phi^2}\right) \sigma^2_{\epsilon} \to \frac{\sigma^2_{\epsilon}}{1-\phi^2}$ as $k\to\infty$. So the variance of the forecast error does not diverge as $k\to\infty$, but rather the variance converges to $\gamma(0)$. This is an example of the general principle that for any stationary ARMA process, the variance of the forecast error converges to the marginal variance.
For a random walk where $\phi=1$, the variance of the $k$-step-ahead forecast error is $k\sigma^2_{\epsilon}$ and diverges to $\infty$ as $k\to\infty$.
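A sketch contrasting the two cases (names mine): the stationary AR(1) forecast error variance increases to the marginal variance $\gamma(0)$, while the random-walk variance grows linearly:

```python
def ar1_forecast_error_var(phi, sigma_eps, k):
    """Variance of the k-step-ahead AR(1) forecast error:
    (1 - phi^(2k)) / (1 - phi^2) * sigma_eps^2."""
    return (1 - phi ** (2 * k)) * sigma_eps ** 2 / (1 - phi ** 2)

phi, sigma_eps = 0.8, 1.0
ar1_vars = [ar1_forecast_error_var(phi, sigma_eps, k) for k in range(1, 21)]
marginal_var = sigma_eps ** 2 / (1 - phi ** 2)        # gamma(0), the limit
rw_vars = [k * sigma_eps ** 2 for k in range(1, 21)]  # random walk: diverges linearly
```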
\medskip
{\bf Forecasting using an MA(1) process $Y_t - \mu = \epsilon_t + \theta\epsilon_{t-1}$}. The general formula for the $k$-step-ahead forecast is
\[
\widehat Y_{n+1} = \widehat \mu + \widehat \theta \widehat \epsilon_n, \; \widehat Y_{n+k} = \widehat \mu \;\; \mbox{for all $k\ge 2$.}
\]
\medskip
{\bf Forecasting using an ARMA(1,1) process $Y_t - \mu = \phi(Y_{t-1}-\mu) + \epsilon_t + \theta\epsilon_{t-1}$}. The one-step-ahead forecast is
\[
\widehat Y_{n+1} = \widehat \mu + \widehat \phi (Y_n - \widehat \mu) + \widehat \theta\widehat \epsilon_n
\]
and the $k$-step-ahead forecast is
\[
\widehat Y_{n+k} = \widehat \mu + \widehat \phi (\widehat Y_{n+k-1} - \widehat \mu), \; k\ge 2.
\]
\medskip
For stationary ARMA processes, the variance of the $k$-step-ahead forecast error converges to a finite value as $k\to\infty$, but for a nonstationary ARIMA process this variance diverges to $\infty$.
\section{Partial Autocorrelation Coefficients}
The partial autocorrelation function (PACF) can be useful for identifying the order of an \underline{\it AR process}. The $k$th partial autocorrelation, denoted by $\phi_{k,k}$, of a stationary process $Y_t$ is the correlation between $Y_t$ and $Y_{t+k}$, conditional on $Y_{t+1}$, $\cdots$, $Y_{t+k-1}$. Let $\widehat \phi_{k,k}$ denote the estimate of $\phi_{k,k}$; it can be calculated by fitting the regression model
\[
Y_t = \phi_{0,k} + \phi_{1,k}Y_{t-1} + \cdots + \phi_{k,k}Y_{t-k} + \epsilon_{k,t}.
\]
If $Y_t$ is an AR($p$) process, then $\phi_{k,k}=0$ for $k>p$. Therefore, a sign that a time series can be fit by an AR($p$) model is that the sample PACF will be nonzero up to lag $p$ and nearly zero at larger lags.
If the sample PACF decays slowly to zero, rather than dropping abruptly to zero, then it is an indication that this time series should not be fit by a pure AR process. An MA or ARMA process would be preferable.
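Rather than running $k$ separate regressions, the PACF can also be computed from the ACF by the Durbin-Levinson recursion. A sketch (names mine), checked on the AR(1) case, where the PACF cuts off after lag 1:

```python
def pacf_from_acf(rho):
    """Durbin-Levinson recursion: partial autocorrelations phi_{k,k}
    computed from autocorrelations rho = [rho(1), ..., rho(K)]."""
    pacf = [rho[0]]
    prev = [rho[0]]  # coefficients phi_{k-1, 1..k-1} from the previous step
    for k in range(2, len(rho) + 1):
        num = rho[k - 1] - sum(prev[j] * rho[k - 2 - j] for j in range(k - 1))
        den = 1 - sum(prev[j] * rho[j] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf

phi = 0.7
ar1_pacf = pacf_from_acf([phi ** h for h in range(1, 6)])  # AR(1): cuts off after lag 1
ma1_pacf = pacf_from_acf([0.4, 0.0, 0.0, 0.0])             # MA(1): decays, never cuts off
```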
When computing resources were expensive, the standard practice was to identify a tentative ARMA model using the sample ACF and PACF, fit this model, and then check the ACF and PACF of the residuals. If the residual ACF and PACF revealed some lack of fit, the model could be enlarged. Now many data analysts prefer to start with a relatively large set of models and compare them with selection criteria such as AIC and BIC. This can be done automatically by \texttt{auto.arima} in \texttt{R}.
\section{Cointegration}
{\bf Concepts}. Two time series, $Y_{1,t}$ and $Y_{2,t}$, are {\it cointegrated} if each is $I(1)$ but there exists a $\lambda$ such that $Y_{1,t} - \lambda Y_{2,t}$ is stationary. More generally, a $d$-dimensional multivariate time series is {\it cointegrated of order $r$} if the component series are $I(1)$ but $r$ independent linear combinations of the components are $I(0)$ for some $r$, $0 < r \le d$.
Besides applications in finance, the notion of cointegration also helps with the understanding of regression analysis. The result of a regression could be spurious when the residuals are integrated, which should make one cautious about regression with nonstationary time series. However, if $Y_t$ is regressed on $X_t$ and the two series are cointegrated, then the residuals will be $I(0)$, so the least-squares estimator will be consistent.
The Phillips-Ouliaris cointegration test regresses one integrated series on the others and applies the Phillips-Perron unit root test to the residuals. The null hypothesis is that the residuals are unit root nonstationary, which implies that the series are {\it not} cointegrated. Therefore, a small $p$-value implies that the series {\it are} cointegrated and therefore suitable for regression analysis. The residuals will still be correlated, so they should be modeled as such. In practice, though stationary, the residuals can have a large amount of autocorrelation and may have long-term memory: they take a long time to revert to their mean of zero, so devising a profitable trading strategy can be problematic.
The Phillips-Ouliaris test is run using the \texttt{R} function \texttt{po.test} in the \texttt{tseries} package.
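The following pure-Python sketch illustrates the definition (all names mine; $\lambda$ is assumed known here, whereas in practice it is estimated by the cointegrating regression): $Y_2$ is a random walk, $Y_1 = \lambda Y_2$ plus stationary noise, and the spread $Y_1 - \lambda Y_2$ is stationary while each series alone wanders:

```python
import random

rng = random.Random(5)
lam = 2.0  # cointegrating coefficient, assumed known for this illustration
y2 = [0.0]
for _ in range(3000):
    y2.append(y2[-1] + rng.gauss(0, 1))       # Y2: random walk, I(1)
y1 = [lam * v + rng.gauss(0, 1) for v in y2]  # Y1 = lam * Y2 + stationary noise, I(1)

spread = [a - lam * b for a, b in zip(y1, y2)]  # Y1 - lam * Y2: I(0)

def sample_var(x):
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)
```

The spread's variance stays near 1, while the level of the random walk wanders far from its starting point; \texttt{po.test} formalizes this kind of comparison on the regression residuals.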
\medskip
{\bf Vector error correction model (VECM)}.
\[
\Delta {\bf Y}_t = \boldsymbol{\alpha} \boldsymbol{\beta}^T {\bf Y}_{t-1} + \boldsymbol{\epsilon}_t,
\]
so that $\boldsymbol{\beta}$ is the cointegration vector and $\boldsymbol{\alpha}$, which specifies the speed of mean-reversion, is called the {\it loading matrix} or {\it adjustment matrix}.
\medskip
{\bf Pair trading}. Pairs trading requires the trader to find cointegrated pairs of assets, to select from these the pairs that can be traded profitably after accounting for transaction costs, and finally to design the trading strategy which includes the buy and sell signals.
One of the risks involved is model risk: the error-correction model may be incorrect. Even if the model is correct, one must use estimates based on past data and the parameters might change, perhaps rapidly. Another risk is that one can go bankrupt before a stationary process reverts to its mean.
\section{Long-Memory Processes}
Economic and financial time series often show long memory (ACFs decay at a polynomial rate) and volatility clustering, while stationary ARIMA processes have only short memories (ACFs decay to zero exponentially fast) and constant conditional variance. We can use fractional ARIMA processes to model long memory and GARCH processes to model volatility clustering.
\medskip
We extend the differencing operator $\Delta^d$ to real number $d>-1$ as
\[
\Delta^d Y_t = (1-B)^d Y_t = \sum_{k=0}^{\infty} \left(\begin{matrix}d \\ k\end{matrix}\right) (-1)^k Y_{t-k},
\]
where $\left(\begin{matrix}d \\ k\end{matrix}\right)=\frac{d(d-1)\cdots(d-k+1)}{k!}$ is the binomial coefficient $\frac{d!}{k!(d-k)!}$ extended to all real $d$ via the Gamma function $\Gamma(t)=\int_0^{\infty} x^{t-1}e^{-x}dx$, which satisfies $\Gamma(t)=(t-1)\Gamma(t-1)$ and is undefined at $t = 0$, $-1$, $-2$, $\cdots$. $Y_t$ is called a {\it fractional ARIMA($p$, $d$, $q$) process} (FARIMA($p$, $d$, $q$) process) if $\Delta^d Y_t$ is an ARMA($p$, $q$) process.
A more parsimonious model can sometimes be used if the differencing is fractional. Negative autocorrelation at lag 1 and little additional autocorrelation often indicate overdifferencing.
The function \texttt{fracdiff} in \texttt{R}'s \texttt{fracdiff} package will fit a FARIMA($p$, $d$, $q$) process. The values of $p$, $d$, and $q$ must be input; I am not aware of any \texttt{R} function that will choose $p$, $d$, and $q$ automatically in the way this can be done for an ARIMA process using \texttt{auto.arima}. Fractional differencing can be done with the \texttt{diffseries} function in \texttt{R}'s \texttt{fracdiff} package.
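The coefficients of $(1-B)^d$ can be generated by a simple recursion, which is how fractional differencing is typically implemented in practice; a pure-Python sketch (names mine):

```python
def frac_diff_coeffs(d, n_terms):
    """Coefficients c_k of (1-B)^d = sum_k c_k B^k, via the recursion
    c_0 = 1, c_k = c_{k-1} * (k - 1 - d) / k  (i.e. c_k = (-1)^k * binom(d, k))."""
    c = [1.0]
    for k in range(1, n_terms):
        c.append(c[-1] * (k - 1 - d) / k)
    return c

def frac_diff(y, d, n_terms=100):
    """Apply the (truncated) fractional difference to a series."""
    c = frac_diff_coeffs(d, n_terms)
    return [sum(c[k] * y[t - k] for k in range(min(t + 1, n_terms)))
            for t in range(len(y))]

c_half = frac_diff_coeffs(0.5, 4)  # d = 0.5: [1.0, -0.5, -0.125, -0.0625]
```

For integer $d$ the recursion reproduces ordinary differencing: with $d=1$ the coefficients are $1, -1, 0, 0, \cdots$.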
\section{AR(1)/ARCH(1) Model}
{\bf Review of the AR(1) model}. The AR(1) process $Y_t - \mu = \phi (Y_{t-1} - \mu) + \epsilon_t$ has constant mean and variance:
\[
E[Y_t] = \mu, \; \mbox{Var}(Y_t) = \frac{\sigma^2_{\epsilon}}{1-\phi^2},
\]
nonconstant conditional mean:
\[
E[Y_t|{\cal F}_{t-1}] = \mu + \phi(Y_{t-1}-\mu), \;
\]
and constant conditional variance:
\[
\mbox{Var}[Y_t|{\cal F}_{t-1}] = E[(Y_t - E[Y_t|{\cal F}_{t-1}])^2|{\cal F}_{t-1}] = \sigma^2_{\epsilon}.
\]
The ACF function of $Y_t$ is
\[
\rho(h) = \mbox{Corr}(Y_t, Y_{t+h}) = \phi^{|h|}.
\]
\medskip
{\bf The ARCH(1) model}. The process $a_t$ is called an {\it ARCH(1) process} if
\[
a_t = \sqrt{\omega + \alpha_1 a^2_{t-1}}\, \epsilon_t
\]
where $\epsilon_t$ is an i.i.d. WN($0$, $1$), $\omega > 0$, and $\alpha_1 \in [0, 1)$. $a_t$ has constant mean and variance:
\[
E[a_t] = 0, \; \mbox{Var}(a_t) = \frac{\omega}{1-\alpha_1},
\]
constant conditional mean:
\[
E[a_t|{\cal F}_{t-1}]=0,
\]
and nonconstant conditional variance:
\begin{eqnarray}\label{cond_var_arch(1)}
\sigma_t^2 := \mbox{Var}[a_t|{\cal F}_{t-1}] = \omega + \alpha_1 a^2_{t-1}
\end{eqnarray}
The ACF of $a_t$ and the ACF of $a^2_t$ are, respectively,
\begin{eqnarray}\label{ADF_ARCH}
\rho(h) = 0 \; \mbox{if $h\ne 0$}, \; \rho_{a^2}(h) = \alpha_1^{|h|}.
\end{eqnarray}
Equation (\ref{cond_var_arch(1)}) is crucial to understanding how GARCH processes work. If $a_{t-1}$ has an unusually large absolute value, then $\sigma_t$ is larger than usual, and so $a_t$ is also expected to have an unusually large magnitude. This volatility propagates: a large deviation in $a_t$ makes $\sigma^2_{t+1}$ large, so that $a_{t+1}$ tends to be large, and so on. Similarly, if $a^2_{t-1}$ is unusually small, then $\sigma^2_t$ is small, and $a_t^2$ is also expected to be small, and so forth. Because of this behavior, unusual volatility in $a_t$ tends to persist, though not forever: the conditional variance tends to revert to the unconditional variance, provided that $\alpha_1<1$ so that the process is stationary with a finite variance.
Note that the ARCH(1) process provides an example of uncorrelated but dependent random variables. Also note that since $a_t$ has zero mean and constant variance and $\rho(h)=0$ for $h\ne 0$, $(a_t)$ is weak white noise. The first step in checking whether a weak white noise has ARCH effects is to examine its ACF and the ACF of its square, according to equation (\ref{ADF_ARCH}).
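A simulation sketch of these two facts (names mine): the sample ACF of $a_t$ is near zero at lag 1, while the sample ACF of $a_t^2$ is near $\alpha_1$:

```python
import random

def simulate_arch1(omega, alpha1, n, seed=0):
    """a_t = sqrt(omega + alpha1 * a_{t-1}^2) * eps_t, with eps_t iid N(0,1)."""
    rng = random.Random(seed)
    a = [0.0]
    for _ in range(n):
        sigma_t = (omega + alpha1 * a[-1] ** 2) ** 0.5
        a.append(sigma_t * rng.gauss(0, 1))
    return a[1:]

def lag1_corr(x):
    """Sample lag-1 autocorrelation."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x)
    return sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1)) / c0

a = simulate_arch1(omega=1.0, alpha1=0.3, n=20000, seed=2)
r1_a = lag1_corr(a)                    # near 0: a_t is weak white noise
r1_a2 = lag1_corr([v * v for v in a])  # near alpha1 = 0.3: a_t^2 is autocorrelated
```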
\medskip
{\bf The AR(1)/ARCH(1) model}. An AR(1) process has a nonconstant conditional mean but a constant conditional variance, while an ARCH(1) process is just the opposite. If both the conditional mean and variance of the data depend on the past, then we can combine the two models. E.g. an ARCH(1) process can be used as the noise term of an AR(1) process.
Let $a_t$ be an ARCH(1) process so that $a_t = \sqrt{\omega + \alpha_1 a^2_{t-1}}\epsilon_t$ where $\epsilon_t$ is i.i.d. N(0,1), and suppose that
\[
u_t - \mu = \phi(u_{t-1}-\mu) + a_t.
\]
The process $u_t$ is an AR(1) process, except that the noise term $(a_t)$ is not i.i.d. white noise but rather an ARCH(1) process which is only weak white noise. We note
\[
E[u_t] = \mu, \; \mbox{Var}(u_t)=\frac{\mbox{Var}(a_t)}{1-\phi^2} = \frac{\omega}{1-\alpha_1} \cdot \frac{1}{1-\phi^2},
\]
and
\[
E[u_t|{\cal F}_{t-1}] = \mu + \phi(u_{t-1}-\mu), \; \mbox{Var}[u_t|{\cal F}_{t-1}] = \mbox{Var}[a_t|{\cal F}_{t-1}] = \omega + \alpha_1 a^2_{t-1}.
\]
\medskip
{\bf ARCH($p$) models}. $a_t$ is an {\it ARCH($p$) process} if $a_t = \sigma_t \epsilon_t$,
where
\[
\sigma_t = \sqrt{\omega + \sum_{i=1}^p \alpha_i a^2_{t-i}}
\]
is the conditional standard deviation of $a_t$ given the past values $a_{t-1}$, $a_{t-2}$, $\cdots$ of this process. Like an ARCH(1) process, an ARCH($p$) process is uncorrelated with a zero mean (both conditional and unconditional) and a constant unconditional variance, but its conditional variance is nonconstant.
\section{ARIMA(\texorpdfstring{$p_A$}{TEXT}, \texorpdfstring{$d$}{TEXT}, \texorpdfstring{$q_A$}{TEXT})/GARCH(\texorpdfstring{$p_G$}{TEXT}, \texorpdfstring{$q_G$}{TEXT}) Models}
A deficiency of ARCH($p$) models is that the conditional standard deviation process has high-frequency oscillations with high volatility coming in short bursts. GARCH models permit a wider range of behavior, in particular, more persistent volatility. The GARCH($p$,$q$) model is
\[
a_t = \sigma_t \epsilon_t
\]
where
\[
\sigma_t = \sqrt{\omega + \sum_{i=1}^p \alpha_ia^2_{t-i} + \sum_{i=1}^q\beta_i\sigma^2_{t-i}}.
\]
Because past values of the $\sigma_t$ process are fed back into the present value, the conditional standard deviation can exhibit more persistent periods of high or low volatility than seen in an ARCH process. The process $a_t$ is uncorrelated with constant mean and variance, and $a_t^2$ has an ACF like that of an ARMA process.
\medskip
{\bf Residuals for ARIMA($p_A$, $d$, $q_A$)/GARCH($p_G$, $q_G$) models}. When one fits an ARIMA($p_A$, $d$, $q_A$)/GARCH($p_G$, $q_G$) model to a time series $Y_t$, there are two types of residuals. The {\it ordinary residual}, denoted $\widehat a_t$, is the difference between $Y_t$ and its conditional expectation and it estimates $a_t$. A {\it standardized residual}, denoted $\widehat \epsilon_t$, is an ordinary residual divided by its conditional standard deviation $\widehat \sigma_t$ and it estimates $\epsilon_t$. The standardized residuals should be used for model checking. If the model fits well, then neither $\widehat \epsilon_t$ nor $\widehat \epsilon_t^2$ should exhibit serial correlation.
\medskip
{\bf GARCH processes have heavy tails}. Stock returns have ``heavy-tailed'' or ``outlier-prone'' probability distributions. One reason for outliers may be that the conditional variance is not constant, with the outliers occurring when the variance is large. GARCH processes exhibit heavy tails even if $\{\epsilon_t\}$ is Gaussian. Therefore, when we use GARCH models, we can model both the conditional heteroskedasticity and the heavy-tailed distributions of financial market data. Nonetheless, many financial time series have tails that are heavier than implied by a GARCH process with Gaussian $\{\epsilon_t\}$. To handle such data, one can assume $\{\epsilon_t\}$ is an i.i.d. white noise process with a heavy-tailed distribution.
\medskip
{\bf Fitting ARMA/GARCH models}. An ARMA/GARCH model can be fitted using \texttt{R}'s \texttt{garchFit} function in the \texttt{fGarch} package.
\medskip
{\bf GARCH models as ARMA models}. If $a_t$ is a GARCH process, then $a_t^2$ is an ARMA process but with weak white noise, not i.i.d. white noise. More precisely, assume
\[
a_t = \sigma_t \epsilon_t
\]
and
\[
\sigma_t^2 = \omega + \sum_{i=1}^p \alpha_i a_{t-i}^2 + \sum_{i=1}^q \beta_i \sigma_{t-i}^2.
\]
Define $\eta_t = a_t^2 - \sigma_t^2$. Then $\{\eta_t\}$ is an uncorrelated process with zero mean and constant variance.\footnote{How to check the variance is a constant?} Define $\mu = \omega / [1-\sum_{i=1}^p\alpha_i - \sum_{i=1}^q\beta_i]$; then, with the convention that $\alpha_i = 0$ for $i > p$ and $\beta_i = 0$ for $i > q$, we can show that
\[
a_t^2 - \mu = \sum_{i=1}^{\max(p,q)} (\alpha_i+\beta_i)(a_{t-i}^2 - \mu) - \sum_{i=1}^q \beta_i\eta_{t-i} + \eta_t,
\]
so that $a_t^2$ is an ARMA($\max(p,q)$, $q$) process with mean $\mu$; in particular, it is ARMA($p$, $q$) when $p \ge q$. Consequently, the ACF of $a_t^2$ is the same as the ACF of an ARMA process.
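The ARMA representation can be verified by substituting $\sigma_{t-i}^2 = a_{t-i}^2 - \eta_{t-i}$ into the GARCH recursion (a sketch; sums run over the appropriate ranges, with coefficients beyond their orders set to zero):
\[
a_t^2 = \eta_t + \sigma_t^2
= \eta_t + \omega + \sum_i \alpha_i a_{t-i}^2 + \sum_i \beta_i \bigl(a_{t-i}^2 - \eta_{t-i}\bigr)
= \omega + \sum_i (\alpha_i+\beta_i) a_{t-i}^2 - \sum_i \beta_i \eta_{t-i} + \eta_t.
\]
Subtracting $\mu$ from both sides and using $\omega = \mu\bigl[1 - \sum_i(\alpha_i+\beta_i)\bigr]$ then yields the centered ARMA form.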
\medskip
{\bf GARCH(1,1) processes}. If $a_t$ is GARCH(1,1), then $a_t^2$ is ARMA(1,1). So the ACF of $a_t^2$ can be obtained from formula (\ref{ADF_arma(1,1)}):
\[
\rho_{a^2}(1) = \frac{\alpha_1(1-\alpha_1\beta_1-\beta_1^2)}{1-2\alpha_1\beta_1-\beta_1^2}, \;
\rho_{a^2}(k)=(\alpha_1+\beta_1)^{k-1} \rho_{a^2}(1), \; k\ge 2.
\]
There are infinitely many values of $(\alpha_1,\beta_1)$ with the same value of $\rho_{a^2}(1)$ and a higher value of $\alpha_1+\beta_1$ means a slower decay of $\rho_{a^2}$ after the first lag. The capability of the GARCH(1,1) model to fit the lag-1 autocorrelation and the subsequent rate of decay separately is important in practice. It appears to be the main reason that the GARCH(1,1) model fits so many financial time series.
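As a numerical illustration, the two formulas can be evaluated directly; the values of $\alpha_1$ and $\beta_1$ below are arbitrary:

```python
# ACF of a_t^2 for a GARCH(1,1) process, using the formulas in the text.
# alpha1, beta1 are illustrative values.
alpha1, beta1 = 0.10, 0.80

rho1 = alpha1 * (1 - alpha1 * beta1 - beta1**2) / (1 - 2 * alpha1 * beta1 - beta1**2)

def rho(k):
    """ACF of a_t^2 at lag k >= 1: geometric decay at rate alpha1 + beta1 after lag 1."""
    return (alpha1 + beta1) ** (k - 1) * rho1

print(round(rho(1), 3), round(rho(2), 3))  # 0.14 0.126
```

Raising $\beta_1$ while lowering $\alpha_1$ to hold $\rho_{a^2}(1)$ fixed slows the geometric decay, which is exactly the two-parameter flexibility described above.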
\medskip
{\bf APARCH models}. Standard GARCH models cannot model the leverage effect because they model $\sigma_t$ as a function of past values of $a_t^2$--whether the past values of $a_t$ are positive or negative is not taken into account, since the square function $x^2$ is symmetric in $x$. The APARCH (asymmetric power ARCH) models replace the square function with a flexible class of nonnegative functions that include asymmetric functions.
The {\it APARCH($p$,$q$) model} for the conditional standard deviation is
\[
\sigma_t^{\delta} = \omega + \sum_{i=1}^p \alpha_i (|a_{t-i}| - \gamma_i a_{t-i})^{\delta} + \sum_{j=1}^q \beta_j \sigma_{t-j}^{\delta},
\]
where $\delta>0$ and $-1<\gamma_i<1$, $i=1,\cdots, p$.
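The asymmetry enters through the term $(|a| - \gamma a)^{\delta}$: for $\gamma > 0$, a negative shock raises $\sigma_t^{\delta}$ more than a positive shock of the same magnitude, which is the leverage effect. A quick check with illustrative values of $\gamma$ and $\delta$:

```python
# APARCH news-impact term g(a) = (|a| - gamma*a)**delta.
# gamma and delta are illustrative values.
gamma, delta = 0.5, 1.5

def g(a):
    """Asymmetric weight given to a shock a; for gamma > 0,
    negative shocks receive the larger weight."""
    return (abs(a) - gamma * a) ** delta

print(g(-1.0) > g(1.0))  # True: a -1 shock contributes more than a +1 shock
```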
\medskip
{\bf Forecasting ARMA/GARCH processes}. Forecasting an ARMA/GARCH process is in one way the same as forecasting an ARMA process: the point forecasts coincide, because a GARCH process is weak white noise. What differs is the behavior of the prediction intervals: they widen during periods of high conditional volatility and narrow during calm periods, whereas ARMA prediction intervals have the same width at a given horizon regardless of when the forecast is made.
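For a pure GARCH(1,1) process, for example, the one-step-ahead conditional variance follows directly from the recursion, and a Gaussian 95\% interval has half-width $1.96\,\sigma_{t+1|t}$, so the interval tracks current volatility. A sketch with illustrative numbers:

```python
# One-step-ahead conditional variance forecast for GARCH(1,1):
# sigma2_{t+1|t} = omega + alpha1*a_t^2 + beta1*sigma2_t.
# All numeric values are illustrative.
omega, alpha1, beta1 = 0.05, 0.15, 0.80
a_t, sigma2_t = 1.2, 0.9            # last shock and last conditional variance
sigma2_next = omega + alpha1 * a_t**2 + beta1 * sigma2_t
half_width = 1.96 * sigma2_next**0.5   # Gaussian 95% one-step interval half-width
print(round(sigma2_next, 3))  # 0.986
```

A large recent shock $a_t$ or high current $\sigma_t^2$ inflates `sigma2_next` and hence the interval, which is the behavior described above.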
\section{Regression with ARMA/GARCH Errors}
When using time series regression, one often observes autocorrelated residuals. Residual correlation causes standard errors and confidence intervals (which incorrectly assume uncorrelated noise) to be incorrect. In particular, the coverage probability of confidence intervals can be much lower than the nominal value.
For this reason, linear regression with ARMA disturbances was introduced, assuming the disturbances are stationary. The model is
\begin{eqnarray}\label{regress_arma/garch_a}
Y_t = \beta_0 + \beta_1 X_{t,1} + \cdots + \beta_p X_{t,p} + \epsilon_t,
\end{eqnarray}
where
\begin{eqnarray}\label{regress_arma/garch_b}
(1-\phi_1B-\cdots-\phi_pB^p)(\epsilon_t-\mu) = (1+\theta_1B+\cdots+\theta_qB^q)u_t,
\end{eqnarray}
and $\{u_t\}$ is i.i.d. white noise. To accommodate volatility clustering, instead of being i.i.d. white noise, we assume $\{u_t\}$ is a GARCH process so that
\begin{eqnarray}\label{regress_arma/garch_c}
u_t = \sigma_t v_t
\end{eqnarray}
where
\begin{eqnarray}\label{regress_arma/garch_d}
\sigma_t = \sqrt{\omega + \sum_{i=1}^p \alpha_i u_{t-i}^2 + \sum_{j=1}^q \beta_j \sigma^2_{t-j}},
\end{eqnarray}
and $\{v_t\}$ is i.i.d. white noise. The model given by equations (\ref{regress_arma/garch_a})-(\ref{regress_arma/garch_d}) is a {\it linear regression model with ARMA/GARCH disturbances}. To fit this model, a three-step estimation method is followed:
\begin{enumerate}
\item estimate the parameters in (\ref{regress_arma/garch_a}) by ordinary least-squares;
\item fit model (\ref{regress_arma/garch_b})--(\ref{regress_arma/garch_d}) to the ordinary least-squares residuals $\widehat\epsilon_t$;
\item reestimate the parameters in (\ref{regress_arma/garch_a}) by weighted least-squares, with weights equal to the reciprocals of the conditional variances from step 2.
\end{enumerate}
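Step 3 amounts to dividing each observation by its estimated conditional standard deviation and running ordinary least-squares on the rescaled data. A minimal NumPy sketch of that step, with simulated data and known conditional standard deviations standing in for the output of step 2:

```python
import numpy as np

# Weighted least-squares with weights 1/sigma_t^2, implemented by rescaling:
# dividing row t of (X, y) by sigma_t and running OLS is equivalent to WLS.
# All data below are simulated for illustration.
rng = np.random.default_rng(1)
n = 500
x = rng.standard_normal(n)
sigma = 0.5 + np.abs(x)                  # stand-in for fitted conditional sd's from step 2
y = 2.0 + 3.0 * x + sigma * rng.standard_normal(n)

X = np.column_stack([np.ones(n), x])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)          # step 1: OLS

w = 1.0 / sigma                                           # step 3: rescale rows by 1/sigma_t
beta_wls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)

print(beta_ols.round(2), beta_wls.round(2))  # both near (2, 3); WLS uses the variances efficiently
```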
\medskip
Using a linear regression model with ARMA/GARCH disturbances requires stationary disturbances. However, many financial time series are nonstationary or, at least, have very high and long-term autocorrelation. When one nonstationary series is regressed upon another, the residuals are frequently nonstationary. In the extreme case where the residuals are an integrated process, the least-squares estimator is inconsistent.
The problem of correlated noise can be detected by looking at the sample ACF of the residuals. The {\it Durbin-Watson test} can be used to test the null hypothesis of no residual autocorrelation. More precisely, the null hypothesis of the Durbin-Watson test is that the first $p$ autocorrelation coefficients are all 0, where $p$ can be selected by the user. In the \texttt{R} function \texttt{durbin.watson} in the \texttt{car} package, $p$ is called \texttt{max.lag} and has a default value of 1. The $p$-value is computed by \texttt{durbin.watson} using bootstrapping.\footnote{The calculated $p$-value could have substantial random variation due to bootstrapping with the default of $B=1000$ resamples. Using a larger number of resamples will compute the $p$-value with more accuracy.} The \texttt{lmtest} package of \texttt{R} has another function, \texttt{dwtest}, that computes the Durbin-Watson test, but only with $p=1$. \texttt{dwtest} uses either a normal approximation (default) or an exact algorithm to calculate the $p$-value.
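The statistic underlying the test is $d = \sum_{t=2}^n (\widehat\epsilon_t - \widehat\epsilon_{t-1})^2 \big/ \sum_{t=1}^n \widehat\epsilon_t^2$, which is near 2 for uncorrelated residuals, near 0 under strong positive first-order autocorrelation, and near 4 under strong negative autocorrelation. A sketch of the statistic itself (the $p$-value computation is separate, as described above):

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic: ~2 if residuals are uncorrelated,
    -> 0 for strong positive and -> 4 for strong negative
    first-order autocorrelation."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0: strong negative autocorrelation
```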
When nonstationary residuals are detected, differencing may help, especially when the residuals are $I(1)$.
\begin{thebibliography}{99}
\bibitem{Ruppert11} David Ruppert. {\it Statistics and data analysis for financial engineering}. New York, Springer, 2011.
%\bibitem{Tsay10} Ruey S. Tsay. {\it Analysis of financial time series}, 3rd edition. Wiley, 2010.
\end{thebibliography}
\end{document}
% Version 1.0.2, 2015-02-17: added materials on residual correlation and spurious regressions [Ruppert11] 13.2.4.
% Version 1.0.1, 2015-02-14: added materials on cointegration and GARCH based on [Ruppert11] 15, 18.
% Version 1.0, 2015-02-10: first draft.