Backtesting 101, Part 3: Be A 'Master Of The Universe'

by: Ryan Telford


In the first two articles of this series, we discussed proper backtesting methods and the interpretation of results.

It is also important to understand your data source, that is, what data you are using.

Within that data, for meaningful results, an investor needs to set his or her “Universe,” to set limits on what is being tested.


In previous installments of this series, we have looked at how a proper backtest should be constructed (Part 1), and how the data from said backtest should be interpreted (Part 2).

In addition to the concepts we discussed previously, there are several other items to consider in a backtest. While we have been discussing aspects that are rather "micro," I would like to zoom out and take a 10,000-foot view of backtesting.

In this article, we will take a look at two components of backtesting that can make a significant difference to backtest results, namely the data source, and the testing universe. My original intent was to limit this series to three articles. However, in writing this last piece, I realized there is much more to be discussed that should provide prospective backtesters with value, so I decided to extend this series to a fourth article. In the final article, we will wrap up with some other important considerations and an overview of what we have been discussing in this series.

The Data Source

Backtesting relies on historical financial data, which consists of raw data that companies are required to file each quarter. This is data found on such documents as company balance sheets, income statements and cash flow statements. While there are generally accepted accounting principles (or GAAP) on how business results are to be reported, they are just that, "general."

In backtesting, you are screening for very specific data.

You have data providers, whose business is to compile all of this financial information and provide it to institutional investors and backtesting service providers (such as Portfolio123, my backtest service provider of choice). These data providers have the task of taking all of this raw financial data, and standardizing it into line items that can easily be compared across different firms.

There are a few companies in this industry, Thomson Reuters, Standard & Poor's, and Compustat to name a few, each of which standardizes this data slightly differently.

And then you have the backtest service providers, which then apply their own ratios and calculations to provide the end user with intuitive ratios and information to use in their backtests.

Depending on which data source is being used, there can be differences in results for a seemingly similar backtest.

Biases, biases and more biases

When using statistics to draw conclusions about a set of data from the past (which is essentially what we are trying to achieve with backtesting), it is critical that the data from the past is actually included in the data set.

This may seem like an obvious consideration. But let's take a closer look.

Survivorship Bias

The world of business is very dynamic. Firms come and firms go. They go bankrupt, they get acquired, or they become delisted from stock exchanges for not submitting required quarterly documentation, or because they fall below a certain stock price, or for other infractions with the SEC. Suffice it to say that firms, and their stock, can disappear from a stock exchange.

What does this mean for backtesting? Let's illustrate with a practical example. Say you're screening for stocks using a value screen. This particular screen is looking at traditional value stocks, Buffett's "cigar butts" for example. They may have "one last puff" left, or they may not and are on the verge of bankruptcy.

Scenario #1:

You screen for low P/E stocks back to 1999. The data source you are using does not include stocks that have gone bankrupt in this period. After running the backtest, you find that the strategy has returned just over 7% per year.

[Table: Firm 1 through Firm 8; values not preserved]

Scenario #2:

You screen for the same low P/E stocks, but the data source does include the bankrupt stocks. With this data available, the backtest calculations take into account the defunct firms that pass the low P/E criteria. There is now a larger set of stocks available, including stocks that would have returned 0% (or less) at some point. The same backtest as above, but with the defunct firms added:

[Table: Firm 1 through Firm 10; values not preserved]

Which scenario is more accurate? I hope you chose Scenario #2.

As we emphasized at the beginning of this section, we want to ensure that the data set reflects what actually existed at the time. Back in 1999 (the beginning of our backtest), these firms were not yet defunct. They did exist, they were part of the low P/E screen, and, more importantly, they later collapsed.

Whenever data, or lack of data, skews a conclusion in a certain direction, it is called a "bias." When defunct firms are not included in the data, this is called "survivorship bias."

There is also a flip side to this. As noted earlier, other firms that become delisted from an exchange are those that are acquired by other firms. In most cases, these firms are acquired for a premium above their stock price at the time. So instead of biasing data artificially upwards with their exclusion, missing out on these firms can actually bias the data downwards.

What's the takeaway? Whenever you are relying on backtested data, ensure that the data source has done its best to eliminate or minimize survivorship bias.
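To make the effect concrete, here is a minimal Python sketch of the two scenarios. The return figures are invented for illustration (they are not the figures from the backtests above), and `average_return` is a hypothetical helper, not a function from any backtesting package.

```python
from statistics import mean

# Hypothetical total returns (in %) for firms that passed a low P/E screen.
# All numbers are made up for illustration only.
survivors = {
    "Firm 1": 12.0, "Firm 2": 5.0, "Firm 3": 9.0, "Firm 4": 3.0,
    "Firm 5": 8.0, "Firm 6": 10.0, "Firm 7": 4.0, "Firm 8": 7.0,
}
# Firms that also passed the screen but later went bankrupt (assumed total loss).
defunct = {"Firm 9": -100.0, "Firm 10": -100.0}


def average_return(returns):
    """Equal-weighted average of the values in a {firm: return} mapping."""
    return mean(returns.values())


# Scenario 1: the data source only contains firms that still exist.
biased = average_return(survivors)
# Scenario 2: the data source also contains the defunct firms.
unbiased = average_return({**survivors, **defunct})

print(f"Survivors only:    {biased:.2f}%")
print(f"Including defunct: {unbiased:.2f}%")
```

With these made-up inputs, leaving out just two failed firms is enough to turn a losing average into a winning one. Real data will be less extreme, but the direction of the skew is the same.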

Look Ahead Bias

With backtesting, we are going back in time to see how investors behaved. At that time, investors had a particular set of information to work with to inform their decisions and actions. This may seem obvious once again, but we need to make sure that we are backtesting the same data that the investors had available at the time.

Good quality databases apply accurate dates to their data for this purpose.

Let's take a look at another example. If we are testing how a low P/E strategy has performed, it is critical that a firm classified as low P/E really was low P/E based on the information investors had at the time. For a P/E of 3, for example, a firm may have traded at $3 per share with earnings per share, or EPS, of $1 (recall that P/E = price per share / earnings per share).

Let's say it's earnings season, and a firm has an earnings surprise (i.e. it has a quarter that beats expectations), and EPS is much higher than in the previous quarter. In real time, this could increase the share price as EPS improves, changing the P/E ratio. If the price and EPS data are not aligned by date, the backtest may pair a price with earnings that investors had not yet seen, and the data may portray an inaccurate picture.
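One way to picture properly dated data is a "point-in-time" lookup: for any test date, use only the EPS that had already been made public. The sketch below is an illustration under my own assumptions. The dates, EPS figures and the helper names `eps_known_on` and `point_in_time_pe` are hypothetical and do not come from any data provider.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical filings, sorted by date: (date the figure became public, EPS).
filings = [
    (date(2014, 2, 15), 1.00),
    (date(2014, 5, 15), 1.50),  # earnings surprise announced on this date
]


def eps_known_on(filings, as_of):
    """Return the most recent EPS that was public on `as_of`, or None."""
    dates = [filed for filed, _ in filings]
    idx = bisect_right(dates, as_of)  # number of filings dated on or before as_of
    return filings[idx - 1][1] if idx else None


def point_in_time_pe(price, filings, as_of):
    """P/E using only the EPS investors could have seen on `as_of`."""
    eps = eps_known_on(filings, as_of)
    # No filing yet, or zero EPS: P/E is undefined, so return None.
    return price / eps if eps else None


# On May 1 the surprise is not yet public: a $3 stock with $1 EPS has a P/E of 3.
pe_before = point_in_time_pe(3.00, filings, date(2014, 5, 1))

# Look-ahead error: pairing the May 1 price with EPS released on May 15.
pe_lookahead = 3.00 / filings[-1][1]

print(pe_before, pe_lookahead)
```

In this toy case the look-ahead version makes the stock look cheaper than any investor could have known on May 1, which is exactly the kind of distortion that good dating is meant to prevent.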

Survivorship and look ahead bias are two examples where data sets can skew the results of a backtest, if not minimized or eliminated. Data sources can vary in this regard, so backtest results based on different data sources can potentially provide different results.

Generally, Compustat is considered the "gold standard" when it comes to data sources. Thomson Reuters is another data source.

How else can you use this information? Some investors or funds actually create their own databases for their backtesting. One example is that of Joel Greenblatt. He reportedly had a team of analysts who created their own database of financial data to do their own backtesting on quantitative investing methods.

A Potential Example of Differing Data Sets

On a related note, Greenblatt is the creator of the Magic Formula quantitative investing method. When he wrote about this strategy, he had the performance backtested to see how the strategy performed over time. His results are shown below:

(Source: Greenblatt results from "The Little Book that Still Beats the Market" and Author Graph)

Over the period, Greenblatt's returns are compounded annual growth rates of 19.7% and 23.8% for the largest 1,000 and largest 3,500 stocks respectively.
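For readers less familiar with the term, a compounded annual growth rate (CAGR) is the single yearly rate that would turn the starting value into the ending value over the period. A minimal Python sketch follows. The function names `cagr` and `grow` are my own, and the dollar amount and 10-year horizon in the usage lines are arbitrary assumptions, not Greenblatt's test period.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a decimal (0.10 means 10% per year)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


def grow(start_value, rate, years):
    """Value of `start_value` after compounding at `rate` for `years` years."""
    return start_value * (1.0 + rate) ** years


# Usage: what would $10,000 become over 10 years at the two reported rates?
for rate in (0.197, 0.238):
    print(f"{rate:.1%} per year -> ${grow(10_000, rate, 10):,.0f}")
```

Because the rate compounds, a few percentage points of difference in CAGR produce a large gap in ending value over long periods, which is why small discrepancies between data sources matter.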

While these are very impressive returns, other investors have been unable to obtain the same backtest results that Greenblatt reports. In their book "Quantitative Value," Carlisle & Gray go to some length to get the same results as Greenblatt, but are unable to. While they show that the strategy has outperformed over time, the returns they find are less than what Greenblatt has claimed.

Through my own independent backtesting, I too do not find the same results as Greenblatt (keeping in mind that I backtested from 1999 to 2015).

Is a different data source the cause of this difference in results? Did Greenblatt's data source exclude defunct firms, skewing results upwards? Was data properly dated? Perhaps there is another reason.

The point of all of this is that as a quantitative investor, it is important to question where performance claims come from and how they were obtained. The quality of the data source can have a large bearing on the accuracy of the results.

Be a "Master of the Universe"

We have just discussed data sources, the pool from which you pull all of your data for a backtest. Before you test a thesis, you need to set bounds for the information you are testing. It's not impossible to test the entire data source, but you can get a more relevant result with a more selective set of data.

This smaller sample is called the testing "universe." The parameters that define your testing universe can have a significant impact on your results.

Let's run through some typical universe criteria, starting with broad strokes and developing more detail as we go.


Market

Stocks are listed on stock exchanges, such as the NYSE in the US or the TSX in Canada. With backtesting, you test one market at a time. Assuming you want to see how US stocks have done, you would limit your universe to stocks listed on the NYSE. You could also look at all stocks traded on the NASDAQ, for example. Each backtesting package will offer various market options.

Market Cap

The next common step is to limit your universe by stock size, usually by market cap. Conventional wisdom tells us that larger companies are more stable and provide decent returns, while stocks of smaller companies can have higher, but more volatile returns. There is of course the middle ground, which should be the best of both worlds. But like any conventional wisdom, there are exceptions to the rules. For this reason, it is helpful to look at a strategy's performance at different groupings of market cap.

For example, you may want to see how "Large Stocks" did with a particular strategy. For my large cap strategies, my cutoff is the largest 20% of companies in the market. Or you may want to check micro-caps, which could be the smallest 25% of stocks in a market. Or say you want to check the middle ground, so perhaps the largest 50% of stocks.

There are no hard and fast rules as to these cutoffs. It should be noted that depending on the market you are testing, these cutoffs may need to be adjusted. The Canadian market, for example, is significantly smaller than the US, so a cutoff at 20% may not yield enough stocks to provide a meaningful result. More on total test size later.

You may notice we have been talking percentages here, and not dollar values, e.g. all stocks with a market cap larger than $500 million. As we are backtesting and going back in time, a $500 million company today may be considered medium size; however, 15 years ago, that was considered a large company. By classifying cut-offs by percentage, we try to size the universe in relative terms.
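The difference between a fixed dollar cutoff and a percentage cutoff can be sketched in a few lines of Python. The tickers and market caps below are invented, and `largest_by_percent` and `above_dollar_cutoff` are hypothetical helpers of my own, not functions from any screening package.

```python
def largest_by_percent(market_caps, top_fraction):
    """Tickers in the largest `top_fraction` of firms, ranked by market cap."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    ranked = sorted(market_caps, key=market_caps.get, reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))  # rounds down, keeps at least 1
    return ranked[:keep]


def above_dollar_cutoff(market_caps, cutoff):
    """Tickers whose market cap exceeds a fixed dollar cutoff."""
    return [ticker for ticker, cap in market_caps.items() if cap > cutoff]


# Hypothetical market caps in $ millions for the same five firms, 15 years apart.
caps_then = {"A": 900, "B": 450, "C": 200, "D": 90, "E": 40}
caps_now = {"A": 2700, "B": 1350, "C": 600, "D": 270, "E": 120}

# A fixed $500 million cutoff selects a different slice of the market each year.
fixed_then = above_dollar_cutoff(caps_then, 500)
fixed_now = above_dollar_cutoff(caps_now, 500)

# A "largest 20%" cutoff selects the same relative slice in both years.
relative_then = largest_by_percent(caps_then, 0.20)
relative_now = largest_by_percent(caps_now, 0.20)

print(fixed_then, fixed_now)
print(relative_then, relative_now)
```

In this made-up market every firm tripled in size, so the fixed cutoff goes from picking one firm to picking three, while the percentage cutoff keeps picking only the largest one.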


ADRs

While many investors may not feel comfortable investing in foreign markets, e.g. China, Brazil, Russia, etc., there is a way to invest in firms from these countries in the US market. ADRs, or American Depositary Receipts, are stocks of non-US firms that trade on the US market. You can potentially screen these out of your testing universe, or you can keep them. Some investors may not feel comfortable investing in non-US firms; however, before coming to your own conclusion, the benefits should be weighed.

In many of the strategies I follow, I include ADRs when dealing with US securities. I have found that they generally improve performance over time. For your own strategies, be sure to check if there is a benefit from including them. Or if you'd prefer, exclude them.

Isolating by Industry

Another option is to screen out particular industries from your universe. There are some investing theses that are not well-suited to particular industries. As an example, strategies that rely on enterprise value (low EV/EBIT, or the Magic Formula for example) are not particularly suited to the financial or utility industries. As Joel Greenblatt said himself of his Magic Formula:

"Financial stocks don't work with EBIT (earnings before interest and taxes), since looking at a bank, for example, "before interest" is a bit meaningless." (Source)

In other words, a low EV/EBIT value for a bank or a gas utility may not be an indicator that it is a value stock.

The point here is that an industry (or industries) is being excluded for a logical reason. You may come across some strategies and backtest results that exclude a very specific industry. If this is the case, be sure to understand why the industry is being excluded. It could be because the criteria being used are not compatible with the financial reporting of the industry.

If a reason is not stated, then the exclusion of an industry may have been an effort to improve backtest results due to the industry volatility, etc. But be careful here; exclusion of industries seemingly at random may be a sign of data mining. Recall from our definition of backtesting that the investment thesis must have a sound basis to it.

The backtester may have found that if they happened to remove an industry, then performance improves. If this is the case, ensure that the strategy shows similar performance when tested over several other periods of time and over rolling periods (as we discussed in Part 1).


Liquidity

Many investing books and super investors discount small and micro-cap stocks because they are too illiquid for any real investment, i.e. any real investment in small, low volume stocks would send their prices soaring. For most hedge funds or mutual funds managing hundreds of millions or billions of dollars, this is true. However, for the individual retail investor taking a small position of less than $5,000 per stock, this may not be as significant an issue. Depending on the amounts, an investor should verify the trading volume of a stock to ensure it is within the limits of the investment. Screening software allows an investor to check this.

This said, there are some stocks that are so small and have such a small trading volume that they really are not practically investable (say those stocks trading for less than $0.05). It is for this reason that these types of stocks are excluded when an investor screens for stocks. Be aware that this needs to be spelled out in a backtest, i.e. a liquidity cutoff is not set by default. These tiny stocks may show significant returns over the backtest period, but if an investor cannot make a meaningful investment, then the data can be misleading.

The takeaway? When looking at backtest results, be sure to clarify what the cutoff is for illiquid stocks.
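As a rough illustration of what such a cutoff might look like, the Python sketch below checks a planned position against a stock's average daily dollar volume. The function `is_investable` and its default limit of 5% of daily dollar volume are my own assumptions for the example, not a rule from any screening package. The $0.05 minimum price simply mirrors the figure mentioned above.

```python
def is_investable(position_size, avg_daily_volume, price,
                  max_share_of_volume=0.05, min_price=0.05):
    """Rough liquidity check for a single stock.

    position_size       -- dollars you intend to invest
    avg_daily_volume    -- average number of shares traded per day
    price               -- current share price in dollars
    max_share_of_volume -- largest fraction of a day's dollar volume you are
                           willing to represent (assumed value, adjust to taste)
    min_price           -- exclude stocks trading below this price
    """
    if price < min_price:
        return False
    avg_dollar_volume = avg_daily_volume * price
    return position_size <= max_share_of_volume * avg_dollar_volume


# A $5,000 position in a $10 stock trading 200,000 shares a day passes.
print(is_investable(5_000, 200_000, 10.00))
# The same position in a $2 stock trading 10,000 shares a day does not.
print(is_investable(5_000, 10_000, 2.00))
```

The exact thresholds are a judgment call. The point is that whatever limits were used in a backtest should be stated, so that you can tell whether the reported returns were actually attainable.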

This issue of illiquidity is more of a concern at lower market cap cut-offs of course. For strategies looking at large cap stocks (say the largest 20% in the universe), or medium and large cap stocks (the largest 50%), illiquidity should not be an issue. It is in the small and micro-cap stock universes where this should be considered.

The above is not intended to discount small and micro-cap stocks. On the contrary, when done prudently, small stocks can represent a very fertile hunting ground for the individual investor. Obscure and not widely followed, these stocks are often more mispriced (or priced more inefficiently) than large cap stocks. Once an earnings beat comes around, the upside can be quite dramatic. Piotroski devised his F-Score to capitalize on this phenomenon. More on this strategy in a future article.

OTC stocks, or "over the counter" stocks, are very small and obscure, and may be illiquid. Some screens will exclude this class of stock entirely. Be sure to check if this is the case in the results you are reviewing.

Performance Enhancers

While a strategy may screen for a specific criterion, e.g. low EV/EBIT, it may also include other criteria to screen out potentially risky stocks, particularly in value screens. Value investors do their best to avoid the dreaded "value trap," or those stocks that are cheap for a reason. There are several quantitative ways to do this. As discussed in the low EV/EBIT strategy article, stocks with high bankruptcy risk or signs of potential earnings manipulation are screened out.

Such criteria include:

  • Altman Z-score (bankruptcy risk).
  • Beneish M-score (potential earnings manipulator).
  • Excessively high short interest ("short money is smart money").
  • Excessively low insider ownership (management may not have a vested interest in the success of the company).

This subject deserves a discussion on its own. Suffice it to say that when comparing results, be sure to understand if there are considerations in addition to the main criteria.
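Of the filters listed, the Altman Z-score is the easiest to show in a few lines. Here is a minimal Python sketch of the original 1968 formula for publicly traded manufacturing firms, with its commonly cited zone boundaries of 1.81 and 2.99. A given screening package may use a different variant, so treat this as an illustration only. The function names and the sample inputs are my own.

```python
def altman_z(working_capital, retained_earnings, ebit,
             market_value_equity, sales, total_assets, total_liabilities):
    """Original Altman Z-score (1968) for public manufacturing firms."""
    x1 = working_capital / total_assets          # liquidity
    x2 = retained_earnings / total_assets        # cumulative profitability
    x3 = ebit / total_assets                     # operating profitability
    x4 = market_value_equity / total_liabilities # market-based leverage
    x5 = sales / total_assets                    # asset turnover
    return 1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5


def z_zone(z):
    """Classify a Z-score using the commonly cited cutoffs."""
    if z > 2.99:
        return "safe"
    if z < 1.81:
        return "distress"
    return "grey"


# Hypothetical firm (all figures in $ millions, invented for illustration).
z = altman_z(working_capital=20, retained_earnings=30, ebit=10,
             market_value_equity=80, sales=100,
             total_assets=100, total_liabilities=40)
print(round(z, 2), z_zone(z))
```

A value screen could then drop any stock in the "distress" zone before ranking the remainder on its main criterion.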

To Conclude

In the 3rd installment of our Backtesting 101 series, we have looked at how data sources are prepared to provide accurate inputs to backtests, and at how they are not all created equal. We have also looked at the creation of a universe, or your testing sandbox, if you will, and the various factors that go into defining it.

In Part 4, we will look at some additional aspects of backtesting. Until then, happy investing!

Follow me on Seeking Alpha for more articles on quantitative value investing!

If you wish to get started in backtesting your own strategies, check out They offer a free 15-day trial period. I use P123 extensively in my investment research.

Disclosure: I/we have no positions in any stocks mentioned, and no plans to initiate any positions within the next 72 hours.

I wrote this article myself, and it expresses my own opinions. I am not receiving compensation for it (other than from Seeking Alpha). I have no business relationship with any company whose stock is mentioned in this article.

Additional disclosure: I am a user of and have included affiliate links in this article.