Fool's Gold: Why Most Back-Tested Fund Performance Histories Are Bunk

by: Morningstar

By Samuel Lee

A version of the following article first appeared in the September 2013 issue of Morningstar ETFInvestor.

As they are often used, back-tests are merely a legal way of fabricating a statistically bogus history of outperformance and implicitly taking credit for it. I don't think I'm being too cynical. Most back-tested strategies I've seen are problematic. The worst claims are from newsletters and trading-software providers, who can say almost anything without legal repercussions under the aegis of the First Amendment. I've seen claims of "low-risk" 30%-plus monthly returns (or 2,230% annualized), though most typically keep their back-tests in the range of 20%–50% annualized in a sad attempt to maintain a semblance of believability. It appears index providers and exchange-traded fund sponsors have also produced some unbelievable back-tests.
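The annualized figure quoted in parentheses follows from compounding the monthly return over 12 months. A minimal sketch of that arithmetic, assuming a constant monthly return (the function name `annualize_monthly` is illustrative and not from any provider's materials):

```python
def annualize_monthly(monthly_return: float) -> float:
    """Compound a constant monthly return over 12 months; return the annual gain."""
    return (1.0 + monthly_return) ** 12 - 1.0


annual = annualize_monthly(0.30)
print(f"{annual:.0%}")  # prints 2230%
```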

Last year, Vanguard published a study on ETFs tracking back-tested indexes. The authors, Joel M. Dickson, Sachin Padmawar, and Sarah Hammer (1), looked at a sample of equity indexes with at least five years of back-tested history and five years of live performance. In the five years before the index live date, the indexes averaged 12.25% excess returns above the U.S. equity market; in the five years after the live date, they averaged negative 0.26%. In other words, most ETF index back-tests are garbage.

The Vanguard study's results aren't so surprising (and not just because it's from Vanguard, a zealous advocate of low-cost cap-weighted indexes). Many back-tested indexes in its sample were cooked up at the behest of ETF sponsors who wanted a fund to cover "hot" market segments. Few providers are willing to launch an ETF invested in a loathed asset class. While Vanguard's study didn't check whether strategy indexes on average beat the market after their live dates, I think the study's main points would remain unchanged. We have a massive back-test in the actively managed equity mutual fund industry. Most studies have found that an equity fund's historical outperformance doesn't persist in a statistically detectable fashion (beyond a short-lived momentum effect). If there's one verity that can be both relied on and reliably ignored, it's that past performance does not guarantee future results.

Back-tests are so often useless because there are many more false than true relationships in a typical data set. It's easy to tweak an experiment to discover a "statistically significant" relationship that is in fact nonexistent. This data-snooping bias is one of the reasons many--if not most--published research findings are false. (2)
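The data-snooping problem can be illustrated with a small simulation. The sketch below assumes 200 hypothetical strategies whose true excess return is exactly zero, each with 120 months of normally distributed returns. These numbers, the 4% monthly volatility, and the function names are assumptions made for illustration. Even though none of the strategies has any edge, a search across all of them will typically turn up a handful that clear the conventional |t| > 2 hurdle.

```python
import math
import random
import statistics


def t_stat(returns):
    """t-statistic for the hypothesis that the mean return is zero."""
    n = len(returns)
    return statistics.mean(returns) / (statistics.stdev(returns) / math.sqrt(n))


def snoop(n_strategies=200, n_months=120, monthly_vol=0.04, seed=0):
    """Simulate strategies with zero true edge and return their t-statistics."""
    rng = random.Random(seed)
    return [
        t_stat([rng.gauss(0.0, monthly_vol) for _ in range(n_months)])
        for _ in range(n_strategies)
    ]


t_stats = snoop()
false_positives = sum(1 for t in t_stats if abs(t) > 2.0)
best = max(t_stats)
print(f"{false_positives} of {len(t_stats)} pure-noise strategies have |t| > 2")
print(f"best t-statistic found by searching: {best:.2f}")
```

Reporting only the best of the 200 would look like a discovery, which is the sense in which a back-test can be "statistically significant" and still meaningless.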

Nothing logically precludes back-tests from being a useful way to uncover truth. Back-tests are an application of induction, a method of reasoning that, crudely stated, derives universal principles from specific observations. (I think it's safe to say induction works. It's valid to argue the sun will almost certainly rise tomorrow because it's done so every day for some 4.5 billion years.) Then again, you can lump two of anything together if you go to a high-enough abstraction--a lump of coal and Richard Simmons' hair can both be categorized as belonging to the class "things that can be burned for fuel," but using the latter to power your home is impractical and probably a bad idea.

Reasonable investors derisive of back-tests acknowledge that in theory back-tests can illuminate something true about the world but believe in practice back-tests can rarely be relied on to do so. The Vanguard study, by finding that past performance does not reliably predict future returns of asset classes, joins a chorus of studies by many independent researchers demonstrating the same. A skeptic might conclude back-tests are to induction what Richard Simmons' hair is to the category of things that can be burned for fuel.

The problem with that argument is too many successful investors are or were back-testers. Benjamin Graham, father of value investing and Warren Buffett's mentor, devised trading rules based on studies of what would have worked in the past--back-tests, in other words. Early in his career, Buffett applied Graham's back-tested rules to identify "cigar butts," statistically cheap stocks that had assets that could be sold for more than they could be bought for. Another successful Graham acolyte, Walter Schloss, achieved 20% gross annualized returns over several decades by "selecting securities by certain simple statistical methods…learned while working for Ben Graham." (3) Ray Dalio, founder of Bridgewater Associates and arguably the most successful macro investor alive, uses back-tested strategies to run Pure Alpha, an unusual fund that uses only fundamental quant models. Mathematician James Simons' Medallion Fund, a quantitative, fast-trading strategy, has earned 35% annualized returns after fees since 1989.

In a real sense, any investor who observes a historical pattern has engaged in back-testing.

But what separates a good back-test from a bad one? Or, more generally, what separates a valid induction from an invalid one in the financial markets?

Most investors make bad inductions. I think it's in large part because amateur investors calibrate their hurdles for accepting a proposition based on what's worked for them in their everyday lives, leaving them susceptible to finding patterns in noise. For example, many investors see a three-year record of outperformance as evidence a manager is probably skilled. More demanding investors want five or 10 or more years of performance data before coming to a conclusion. They're all wrong (though the more-demanding ones are less wrong). Historical returns by themselves are rarely enough to reliably identify a skilled manager or a valid back-tested strategy. Markets are so random that blindly sifting by performance-related metrics for "winners" will give you a group dominated by lucky investors/strategies.
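A rough calculation shows why a few years of returns say so little. If annual excess returns are treated as independent draws, the t-statistic of a track record is approximately the ratio of average excess return to its volatility, multiplied by the square root of the number of years. The sketch below assumes a hypothetical manager with a true edge of 2% a year and 6% annual tracking error. Both figures and the function names are assumptions for illustration, not data from any study.

```python
import math


def t_stat_of_record(annual_alpha, tracking_error, years):
    """Approximate t-statistic of an average excess return,
    assuming independent annual excess returns."""
    return (annual_alpha / tracking_error) * math.sqrt(years)


def years_needed(annual_alpha, tracking_error, target_t=2.0):
    """Years of data required for the record to reach the target t-statistic."""
    return (target_t * tracking_error / annual_alpha) ** 2


for years in (3, 5, 10):
    print(years, round(t_stat_of_record(0.02, 0.06, years), 2))
print(round(years_needed(0.02, 0.06)))
```

Under these assumed inputs, a three-year record has a t-statistic of about 0.58, a 10-year record about 1.05, and roughly 36 years would be needed to reach 2. A genuinely skilled manager is therefore hard to tell apart from a lucky one on returns alone.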

The most successful investors operate under a model, or an ensemble of them. They do not determine an asset's attractiveness in a vacuum. They have strong opinions on how humans behave, how institutions operate, how an asset's value is derived, the processes governing asset prices, and so forth. Their beliefs are reasonable and at least to some extent touch truth--otherwise they wouldn't work. (Some claim just having an investing discipline and sticking to it ensures success. I couldn't disagree more. If you believe in flat-out untrue things and stay the course, you will suffer.) With valid models, successful investors can extract information beyond what's encapsulated in the numbers.

The typical investor often doesn't have a well-articulated, reality-based model. He focuses on recent returns and too readily accepts propositions based on inadequate or faulty evidence. I spill much ink on philosophy and process in my newsletter as a corrective.

Don't Trust, and Verify
To my knowledge, there is only one truly comprehensive study looking at whether back-tested equity strategies end up being truly predictive. Two respected finance researchers, R. David McLean and Jeffrey Pontiff, slaved away (or had their graduate students slave away) on a monumental working paper titled "Does Academic Research Destroy Stock Return Predictability?" (4) They independently replicated and tested 82 equity characteristics published in academic studies purported to predict excess returns. Like the Vanguard study, they looked at the excess returns the characteristic strategies produced in back-tests ("in sample") and live ("out-of-sample").

If the characteristics were only predictive in-sample, there are two possible explanations: The market is efficiently arbitraging away the anomaly, or the observed pattern was the product of data-snooping. To distinguish the two effects, McLean and Pontiff cleverly split the out-of-sample data into pre- and post-publication periods. Because it can take years before a working paper is published, there's a period in which the characteristic is still out of sample but known only to a small group of academics. If the characteristic's predictive power decays completely in the working paper phase, then we can point to data-snooping as the culprit. If its power decays after publication, then it's likely the market is at work arbitraging away the anomaly.

Interestingly, they could replicate only 72 out of 82 results. Of those, they found that the average out-of-sample decay due to statistical bias was 10% and the average post-publication decay was an additional 25%, for an approximately 35% decay from back-test to live performance. We can't take these results at face value. Their sample might over-represent the most cited and memorable studies, introducing survivorship bias.
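The decomposition behind those percentages can be sketched as simple arithmetic. The three return figures below are hypothetical, chosen only so the output matches the averages quoted above, and the function name is illustrative. Each piece of decay is expressed as a fraction of the in-sample return.

```python
def decay_breakdown(in_sample, pre_publication, post_publication):
    """Split the drop from back-tested to live returns into two parts,
    each expressed as a fraction of the in-sample return."""
    statistical_bias = (in_sample - pre_publication) / in_sample
    publication_effect = (pre_publication - post_publication) / in_sample
    return statistical_bias, publication_effect, statistical_bias + publication_effect


# Hypothetical average monthly returns, in percent: back-test period,
# out-of-sample but pre-publication, and post-publication.
bias, publication, total = decay_breakdown(1.00, 0.90, 0.65)
print(f"statistical bias: {bias:.0%}, publication: {publication:.0%}, total: {total:.0%}")
```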

By the standards of social science, replicating 72 of 82 results with only about 35% decay is suspiciously impressive. At least two big replication attempts of promising biomedical studies found the majority couldn't be replicated--I'd have expected finance studies to do even worse because of the relative ease of back-testing many different models on financial time series data. There's also the fact that out-of-sample data are hard to come by. It can be many years before a model is rejected by fresh data, lowering the cost in reputation for academics who data-snoop.

Despite possible issues, the study suggests back-tested equity strategies that have passed the academic publishing gauntlet are of higher quality than the ones produced by less rigorous, more conflicted parties (like, say, index and ETF purveyors). Though I'm still skeptical of much of the academic literature, I do believe academics have been able to identify market regularities in advance. For example, the "big three" factors--size, value, and momentum--were all discovered and established by the mid-1990s. All three factors went on to earn excess returns in the subsequent two decades, though size has since been cast into doubt.

Taking some cues from academia and one from a practitioner, I've identified some characteristics that distinguish good back-tests from bad ones, roughly from most to least important.

  1. Strong economic intuition. Can you make a strong, evidence-based story beforehand that would justify the proposed effect?
  2. An intellectually honest source. Are the parties behind the back-test credible? Do they have any motivation to data-snoop or lie?
  3. Simple and transparent methodology. Complex models often underperform simple, robust ones in out-of-sample tests.
  4. Sample size. Academics usually expect at least several decades of data in a sample, at least when considering back-tested equity strategies. The highest-quality back-tests are conducted using big, high-quality data sets.
  5. Effect size and statistical significance. Many analysts look for high returns and high statistical significance in order to determine whether they should accept the validity of a proposition. While statistical and economic significance are necessary, by themselves they are often weak predictors of a study's validity. Anyone can produce statistically significant results by data-snooping or even outright fabrication.
  6. Transaction costs. As quant-fund manager Ted Aronson gently pointed out to me, a strategy's costs are just as important as its gross returns. There are plenty of back-tested strategies that "work" in illiquid markets that wouldn't survive after all frictional costs are taken into account.
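The transaction-cost point lends itself to a quick worked example. The sketch below assumes a simple linear cost model: net excess return equals gross excess return minus the dollars traded per year (buys plus sells, as a multiple of portfolio value) times an all-in cost per dollar traded. The model, the input figures, and the function name are illustrative assumptions, not a description of how any particular manager estimates costs.

```python
def net_excess_return(gross_excess, dollars_traded_per_year, cost_per_dollar_traded):
    """Gross excess return less a linear estimate of annual trading costs."""
    return gross_excess - dollars_traded_per_year * cost_per_dollar_traded


# Hypothetical strategy: 4% gross excess return, trading 4x portfolio value a year.
liquid = net_excess_return(0.04, 4.0, 0.001)    # assumed 0.1% cost per dollar traded
illiquid = net_excess_return(0.04, 4.0, 0.015)  # assumed 1.5% cost per dollar traded
print(f"liquid market: {liquid:.1%}, illiquid market: {illiquid:.1%}")
```

With these assumed inputs, the same 4% gross edge leaves 3.6% in a liquid market and turns into a 2.0% loss in an illiquid one.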

And even then you're not done. You want several high-quality studies from skeptical, independent researchers that broadly find similar results before you conclude something is likely "true."

These are high hurdles, yes, but necessary if you want decent odds of striking nuggets of truth rather than fool's gold.

(1) Dickson, Joel M., Padmawar, Sachin, & Hammer, Sarah. "Joined at the Hip: ETF and Index Development." Vanguard research, 2012.

(2) Ioannidis, John P. A. "Why Most Published Research Findings Are False." PLoS Medicine, 2005.

(3) Buffett, Warren E. "The Superinvestors of Graham and Doddsville." Hermes, 1984.

(4) McLean, R. David, & Pontiff, Jeffrey. "Does Academic Research Destroy Stock Return Predictability?" Working paper, 2013.

Disclosure: Morningstar, Inc. licenses its indexes to institutions for a variety of reasons, including the creation of investment products and the benchmarking of existing products. When licensing indexes for the creation or benchmarking of investment products, Morningstar receives fees that are mainly based on fund assets under management. As of Sept. 30, 2012, AlphaPro Management, BlackRock Asset Management, First Asset, First Trust, Invesco, Merrill Lynch, Northern Trust, Nuveen, and Van Eck license one or more Morningstar indexes for this purpose. These investment products are not sponsored, issued, marketed, or sold by Morningstar. Morningstar does not make any representation regarding the advisability of investing in any investment product based on or benchmarked against a Morningstar index.