Welcome to the third article in our series. The model we cover today is ARMA. ARMA stands for autoregressive (AR) moving average (MA) and, as the name suggests, it consists of two polynomials (and a constant), making it somewhat more sophisticated than the historical models we tested in our previous articles. ARMA models are not used exclusively by the finance community; they are also popular among data scientists, mathematicians and engineers. What makes them so attractive is Wold’s decomposition theorem, which implies that most weakly stationary stochastic processes can be approximated well by an ARMA model, making ARMA a reliable forecasting tool for such processes (in the first section of this article we check whether financial market volatility behaves like a weakly stationary stochastic process). We therefore conduct several backtests to find out whether we can answer the following questions:

Does volatility qualify as an appropriate input into ARMA models?

Can ARMA deliver better results for a volatility trading strategy than the simplistic historical models used in previous articles?

What is a more appropriate parameter estimation method: Maximum Likelihood or Maximum Sharpe Ratio?

Is a higher order of ARMA models associated with overspecification?

ARMA MODELS

Stationarity

To begin with, we want to determine whether the ARMA model is suited for predicting the volatility of the S&P 500. ARMA assumes a stationary series, so if the data possesses a trend, we need to extend the model to an ARIMA model, which differences the series in order to remove the trend.

Figure 2: S&P 500 volatility stationarity check

As the figure above suggests, the S&P 500 volatility does not exhibit a trend, which implies that we can continue with ARMA models. However, as looking at a time series and saying “I see no trend” is arguably not a rigorous approach, we also conduct an ADF (Augmented Dickey-Fuller) test. The ADF test checks a time series for stationarity; its null hypothesis is that the series has a unit root, i.e. that it is not stationary. The test returned a p-value that rounds to 0.00, which allows us to reject the null hypothesis. This is strong statistical evidence that the volatility of the S&P 500 is a stationary time series, implying that we can work with ARMA models.

ARMA Formula

The formula below illustrates how an ARMA model can be used to derive volatility. While it might look awfully complicated at first glance, the following explanation might make it a bit more intuitive: ARMA assigns coefficients to the last p values of realized variance (sigma squared) and to the last q differences between the previously estimated and the actual observations (i.e. the estimation errors, or epsilon). Add to that the constant drift c, and you have your variance estimate!

The sum signs with p and q show how many previous values you take into account - i.e. in an ARMA (1,1) model, you would only consider the most recent variance observation and the most recent estimation error, whereas in an ARMA (2,2) model you consider the two most recent variance values and the two most recent estimation errors.
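Written out from the verbal description above, the ARMA (p,q) variance forecast takes roughly the following form. The coefficient names (phi for the AR part, theta for the MA part) and the sign convention of the estimation error (actual minus estimated) are our assumptions for this sketch; only c, p, q, sigma squared and epsilon are named in the text.

```latex
\hat{\sigma}^2_{t} = c
  + \sum_{i=1}^{p} \varphi_i \, \sigma^2_{t-i}
  + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j},
\qquad
\varepsilon_{t} = \sigma^2_{t} - \hat{\sigma}^2_{t}
```

For an ARMA (1,1) model both sums collapse to a single term, so the forecast is simply the constant plus one weighted variance observation plus one weighted estimation error.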

An extremely simplistic example now - let us assume that the AR coefficient is 0.9, the MA coefficient is 0.5 and the constant drift is 1.

Table 1: Simplified ARMA example

If we know today's realized volatility and our ARMA (1,1) forecast from yesterday, we also know today's forecasting error, and that is all we need to make the forecast for tomorrow. With a realized value of 13 and a forecasting error of 2, the model expects the volatility on Wednesday to be 1 + 13 * 0.9 + 2 * 0.5 = 13.7.
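The arithmetic of this example can be reproduced with a tiny helper. The function name `arma_forecast` and its argument names are hypothetical and only serve this illustration; the inputs are the numbers from the example (constant 1, AR coefficient 0.9, MA coefficient 0.5, realized value 13, forecasting error 2).

```python
def arma_forecast(c, ar_coefs, ma_coefs, past_values, past_errors):
    """One-step ARMA forecast.

    past_values and past_errors are ordered from most recent to oldest,
    matching the order of ar_coefs and ma_coefs.
    """
    ar_part = sum(phi * value for phi, value in zip(ar_coefs, past_values))
    ma_part = sum(theta * error for theta, error in zip(ma_coefs, past_errors))
    return c + ar_part + ma_part


# ARMA (1,1) with the numbers from the simplified example.
forecast = arma_forecast(
    c=1.0,
    ar_coefs=[0.9],
    ma_coefs=[0.5],
    past_values=[13.0],  # today's realized value
    past_errors=[2.0],   # today's forecasting error
)
print(round(forecast, 2))  # 13.7
```

An ARMA (2,2) forecast uses the same function with two coefficients and two past values in each list.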

ARMA Order

The next step is to determine the order of the model, i.e. whether we should use ARMA (1,1), ARMA (2,2) or models of an even higher order. For that, we can plot the PACF (partial autocorrelation function) and the ACF (autocorrelation function), which help us to determine the order of the AR component (PACF) and of the MA component (ACF).

Figure 3: PACF and ACF plots for determining the order of our ARMA models

As the figures demonstrate, both ACF and PACF peak at lag 2 (excluding lag 0, obviously), which means that an ARMA (2,2) model could be the most appropriate one. This should be taken with a grain of salt (particularly when looking at the ACF plot), but it seems to be a reasonable starting point. In addition to that, we will also use all lower-order ARMA models for the backtest, as their smaller number of parameters means that there is a lower risk of overspecification.
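For readers who want to see what sits behind such plots, here is a minimal sketch of how the sample ACF and the PACF can be computed by hand. It uses the standard Durbin-Levinson recursion; the helper names `sample_acf` and `pacf_from_acf` are hypothetical, and the input is a simulated AR(1) series rather than the S&P 500 volatility data, so the numbers will not match Figure 3.

```python
import random


def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [value - mean for value in series]
    variance_sum = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - lag] for t in range(lag, n)) / variance_sum
        for lag in range(max_lag + 1)
    ]


def pacf_from_acf(rho):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    pacf = [1.0]
    phi_prev = []  # coefficients of the AR(k-1) fit
    for k in range(1, len(rho)):
        numerator = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        denominator = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = numerator / denominator
        phi_prev = [
            phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)
        ] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# Simulated AR(1) series with coefficient 0.8 (illustrative data only).
random.seed(1)
x = [0.0]
for _ in range(1999):
    x.append(0.8 * x[-1] + random.gauss(0.0, 1.0))

acf_vals = sample_acf(x, 5)
pacf_vals = pacf_from_acf(acf_vals)
for lag in range(1, 6):
    print(lag, round(acf_vals[lag], 2), round(pacf_vals[lag], 2))
```

For this simulated AR(1) process the ACF decays gradually while the PACF drops to roughly zero after lag 1, which is the kind of cut-off one looks for when reading the AR order from a PACF plot.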

Hence, we will backtest ARMA (1,1), ARMA (2,1), ARMA (1,2) and ARMA (2,2) models and compare the results with the SMA model, which will act as a benchmark. The SMA model was established as the best model from the historical volatility model “family” in our previous article, so it will be interesting to see how ARMA performs compared to it.

Fitting the model

The final step is to fit the parameters of the model. For the optimization process, we have chosen two different approaches:

Maximum Likelihood: optimizing for the best fit using Statsmodels’ ARMA Package

Maximum Sharpe Ratio: optimizing for the highest Sharpe Ratio using SciPy’s Evolutionary Algorithm

The reason we have decided to apply two optimization approaches is that our previous research has shown that optimizing for the Sharpe Ratio generally yields better results than the more widely applied ML optimization. It is obvious that the ML optimized model will have a lower Sharpe Ratio than the SR optimized model in the in-sample period, but the classical historical models have shown us that this is also the case in the out-of-sample period. Therefore, we are really interested to see what happens in the out-of-sample period when we apply these two approaches to ARMA models.
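To make the Sharpe Ratio route more concrete, here is a hedged sketch of how such an optimization could be set up with SciPy's `differential_evolution`. Everything about the data is an assumption: `rv`, `vix` and `vxx_ret` are synthetic stand-ins, the parameter bounds are arbitrary, and the trading rule is a simplified version of the one described in the backtesting methodology. It is not the code behind the results in this article.

```python
import numpy as np
from scipy.optimize import differential_evolution

# Synthetic stand-ins for realized volatility, the VIX and VXX returns.
rng = np.random.default_rng(7)
n = 500
rv = np.empty(n)
rv[0] = 15.0
for t in range(1, n):
    rv[t] = 3.0 + 0.8 * rv[t - 1] + rng.normal(scale=1.5)
vix = rv + rng.normal(loc=2.0, scale=1.0, size=n)
vxx_ret = np.zeros(n)
vxx_ret[1:] = 0.01 * np.diff(rv) + rng.normal(scale=0.01, size=n - 1)


def strategy_sharpe(params):
    """Annualized Sharpe Ratio of the long/short rule for ARMA (1,1) parameters."""
    c, phi, theta = params
    forecast = np.empty(n)
    forecast[0] = rv[0]
    for t in range(1, n):
        error = rv[t - 1] - forecast[t - 1]
        forecast[t] = c + phi * rv[t - 1] + theta * error
    # Forecast for day t is compared with the VIX known on day t-1.
    signal = np.where(forecast[1:] > vix[:-1], 1.0, -1.0)
    returns = 0.25 * signal * vxx_ret[1:]
    spread = returns.std()
    if spread == 0.0:
        return 0.0
    return float(returns.mean() / spread * np.sqrt(252))


# Assumed search ranges for (c, phi, theta).
bounds = [(0.0, 10.0), (0.0, 1.2), (-0.9, 0.9)]
result = differential_evolution(
    lambda p: -strategy_sharpe(p),  # minimize the negative Sharpe Ratio
    bounds,
    seed=1,
    maxiter=30,
    polish=False,  # objective is piecewise constant, so gradient polishing adds little
)
best_c, best_phi, best_theta = result.x
best_sharpe = -result.fun
print(f"c={best_c:.2f}, phi={best_phi:.2f}, theta={best_theta:.2f}, Sharpe={best_sharpe:.2f}")
```

An evolutionary optimizer is a natural fit here because the Sharpe Ratio only changes when a signal flips, which leaves gradient-based methods with nothing to follow.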

BACKTESTING METHODOLOGY

Once our ARMA models have generated their forecasts, we can apply them to our trading models. As a starting point, we once again use findings from our previous article: we forecast realized rather than implied volatility, as forecasting realized volatility proved better suited for trading the VXX. Moreover, if the forecast of our model suggests a higher level of volatility in the future than the current level of the VIX, we allocate 25% of our portfolio towards long volatility by buying the VXX; otherwise we short it. In order to make the forecast comparable to the VIX, the result has to be annualized and multiplied by 100. The backtesting period is split into in- and out-of-sample periods, where the in-sample period is 01.01.2008 - 31.12.2015 and the out-of-sample period is 01.01.2016 - 09.06.2020. We do not take the bid-ask spread, trading costs or shorting fees into account. Models will be compared based on their out-of-sample Sharpe Ratios.
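The conversion and the trading rule can be sketched as follows. Several details are assumptions on our side: that the model output is a daily variance of returns in decimal terms, that a year has 252 trading days, and that the short position has the same 25% size as the long one. The function names are hypothetical.

```python
import math

TRADING_DAYS = 252  # assumed number of trading days per year
ALLOCATION = 0.25   # 25% of the portfolio, as described above


def to_vix_points(daily_variance):
    """Annualize a daily variance forecast and express it in VIX-style points."""
    return math.sqrt(daily_variance * TRADING_DAYS) * 100.0


def target_weight(daily_variance_forecast, vix_level):
    """Long VXX if the forecast exceeds the VIX, otherwise short (assumed same size)."""
    forecast_points = to_vix_points(daily_variance_forecast)
    return ALLOCATION if forecast_points > vix_level else -ALLOCATION


# A daily variance of 0.0001 corresponds to a daily volatility of 1%.
print(round(to_vix_points(0.0001), 2))
print(target_weight(0.0001, vix_level=14.0))  # forecast above the VIX
print(target_weight(0.0001, vix_level=20.0))  # forecast below the VIX
```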

RESULTS

Table 2: Sharpe Ratios of our backtests

The table above shows the Sharpe Ratios of each ARMA model we backtested. In the following section, we will separately discuss the results of the ML and SR optimized models.

Maximum Likelihood Optimized Models

Figure 4: Performance of ML optimized forecasts, In-sample [01.2008-12.2015] Out-of-sample [01.2016-06.2020]

As demonstrated in the figure above, except for ARMA (1,2), the ML optimized models deliver attractive returns, with out-of-sample Sharpe Ratios close to, or even higher than, 1. What is particularly interesting is that the out-of-sample Sharpe Ratios of ALL models are higher than the in-sample Sharpe Ratios. Another interesting finding is that a higher model order does not lead to overspecification, which contrasts with the results we found in our previous article about the discrete volatility forecasting model.

Maximum Sharpe Ratio Optimized Models

Figure 5: Performance of SR optimized forecasts, In-sample [01.2008-12.2015] Out-of-sample [01.2016-06.2020]

As Figure 5 demonstrates, SR optimization unsurprisingly leads to higher returns during the in-sample period. While the Sharpe Ratios are lower in the out-of-sample period than in the in-sample period, the results can still be considered very attractive. What is particularly impressive is that, once again, a higher order does not lead to overspecification of the model. This is best demonstrated by the outstandingly high out-of-sample Sharpe Ratio of the ARMA (2,2) model, which also yields better results than all ML optimized models.

Benchmark Comparison

Figure 1: Best performing ARMA model vs. best performing HISVOL model, In-sample [01.2008-12.2015] Out-of-sample [01.2016-06.2020]

Before we move on to the interpretation of the results, we want to plot the results of the best performing ARMA model against the results of the best model from the “historical volatility family”, which is the SMA model. What is particularly interesting to see here is that the outperformance of the ARMA model mainly stems from its outstanding performance in the out-of-sample period. While some of this performance can be traced back to the model suggesting a long volatility position around the meltdown of the inverse volatility ETN XIV in February 2018, the ARMA model still yields superior results even if this event is excluded. Therefore, it seems reasonable to assume that ARMA models are indeed able to deliver better results than historical volatility models.

INTERPRETATION

What is left is to examine why ARMA models are able to deliver better results than simple historical models. The most obvious reason for the outperformance is that the model is autoregressive, which allows it to adjust the weights of historic observations in such a way that they fit subsequent observations. While this carries the risk of overspecification, our results indicate that overfitting is not an issue with the low-order ARMA models used in this article.

Another reason why our low-order ARMA models show such great performance is the fact that they take only very recent observations into account. The analysis conducted in our previous articles about historical models demonstrated that the models that generate the best performance are those that consider at most the last 3 observations as input values. In contrast, models that try to take the mean-reverting property of the VIX into account, by either considering more observations or adding a long-term moving average to the model, deliver mostly poor performance. While ARMA models do take the mean-reverting property into account through the constant drift, it has to be noted that the constant drift value is usually very small for Sharpe Ratio optimized ARMAs, and therefore contributes only marginally to the forecasted volatility value. It is worth noting that the ML optimized models usually have a much higher constant drift value than the SR optimized models, as they aim to fit the model to the data.

One final point we briefly have to touch on is why optimizing for the maximum Sharpe Ratio yields better, or at least just as good, results as optimizing for the best fit. The reason is the same one we already found when analyzing the forecasts of the classic historical models: as we use a purely binary indicator, it is only relevant whether the prediction was directionally correct. Therefore, it does not matter how close the predicted value is to the real value, as long as the forecast correctly suggested whether the algorithm should bet on an increase or a decrease in volatility.
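A toy example with made-up numbers illustrates the point. Both forecasts below are hypothetical, as is the helper `signal`: one forecast is close to the realized value but lands on the wrong side of the VIX, the other is far off but on the right side.

```python
def signal(forecast, vix_level):
    """Binary indicator: +1 (long volatility) if the forecast exceeds the VIX, else -1."""
    return 1 if forecast > vix_level else -1


# Made-up numbers for illustration only.
vix_level = 15.0
realized = 18.0  # volatility ended up above the VIX, so long was the right side
forecasts = {"close_but_wrong_side": 14.9, "rough_but_right_side": 30.0}

correct_side = 1 if realized > vix_level else -1
results = {}
for name, forecast in forecasts.items():
    abs_error = abs(forecast - realized)
    directionally_correct = signal(forecast, vix_level) == correct_side
    results[name] = (abs_error, directionally_correct)
    print(name, round(abs_error, 1), directionally_correct)
```

The forecast with the smaller error would win under a best-fit criterion, yet only the other one puts the strategy on the profitable side of the trade.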

CONCLUSION

As the out-of-sample performance demonstrates, ARMA models are indeed suited to predict volatility and can be applied as an indicator for a VXX trading model. A higher level of specification does not automatically lead to overfitting, which is notable given that this was a frequent problem we encountered when working with historical models. In addition, as autoregressive models assign individual weights to historic observations, they appear to be better suited for forecasting volatility than classic historical models. What is more, this article also demonstrated that optimizing the model for the Maximum Sharpe Ratio not only yields outstanding results in the in-sample period, but also keeps up impressive results in the out-of-sample period. To sum up, the results of our backtests allow us to conclude that ARMA models are not only very well suited for forecasting volatility for a VXX trading algorithm, but are also able to deliver better returns than classical historical models.

That is it for article #3. In our next article, we will take an in-depth look at GARCH models and determine how they can be used for trading the VXX. Thank you for reading. Please share your thoughts and opinions in the comment section, and don’t forget to follow us so you do not miss our future publications!

