Tick data is the lifeblood of the capital markets. Unlike order book data, which can be stuffed, stale, and away from the inside market in the majority of cases, tick data represents actionable quotes and transpired trades that can be regarded as the "principal components" of capital market data. Within tick data, one can measure volume, quoting frequency, spreads, VWAPs, moving averages, volatility, and so forth. This article therefore emphasizes the capture and analysis of tick data as opposed to order book information, which can be loosely defined as orthogonal in certain respects.
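To make a couple of these measures concrete, here is a minimal Python sketch of VWAP and average quoted spread. The `Trade` and `Quote` records and their field names are assumptions made for illustration only; they are not the CTS/CQS message layouts.

```python
from dataclasses import dataclass


@dataclass
class Trade:
    price: float
    size: int


@dataclass
class Quote:
    bid: float
    ask: float


def vwap(trades):
    """Volume-weighted average price: sum(price * size) / sum(size)."""
    volume = sum(t.size for t in trades)
    if volume == 0:
        raise ValueError("no traded volume")
    return sum(t.price * t.size for t in trades) / volume


def average_spread(quotes):
    """Mean quoted spread (ask minus bid) over a list of quotes."""
    spreads = [q.ask - q.bid for q in quotes]
    return sum(spreads) / len(spreads)
```

Both functions expect a list (not a one-shot iterator), since `vwap` walks the trades twice.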
There was once a time when even the attempt to capture and record tick data, specifically the CTS/CQS "tape" from the U.S. equity markets, was a sophisticated process involving a team of individuals. Even more highly regarded was the replay/analysis/backtesting of the tick data. This was often conducted only in the realm of investment banks or hedge funds.
I want to briefly describe how I record, store, analyze, and backtest the "tape" easily and efficiently each day as part of my model construction and trading strategy deployment.
On average, the CTS/CQS produces about 30GB of information per day, plus or minus a few GB depending on precisely which fields are stored. I attempt to store everything (condition codes and such), so my files tend to be a little larger. I receive the tick data through multicast UDP, and I immediately fire an event that strips each message off the network buffer and throws it onto a separate queue in memory. This is so as not to lose data during periods of intense volume (the open, the close, Fed announcements, and so forth). Once a tick is in my in-memory queue, I write it to disk, represented as either a trade or a quote. I use a common class to represent both trades and quotes, as the two share many useful characteristics.
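The receive-then-queue-then-write pattern can be sketched as follows. This is a Python illustration, not the C#/.NET production code, and it rests on several assumptions: the comma-separated message layout, the `Tick` field names, and the `source` iterable (standing in for a UDP socket that has joined the multicast group) are all hypothetical.

```python
import queue
import threading
from dataclasses import astuple, dataclass

_STOP = object()  # sentinel that tells the writer the feed has ended


@dataclass
class Tick:
    """One class for both trades and quotes (field names are hypothetical)."""
    kind: str        # "T" = trade, "Q" = quote
    symbol: str
    time_ns: int
    price: float
    size: int
    conditions: str = ""


def parse(datagram: bytes) -> Tick:
    """Decode one message in an assumed comma-separated ASCII layout."""
    kind, symbol, time_ns, price, size, conditions = datagram.decode("ascii").split(",")
    return Tick(kind, symbol, int(time_ns), float(price), int(size), conditions)


def receive(source, buffer: queue.Queue) -> None:
    """Move raw datagrams into memory immediately; no parsing or I/O here.

    `source` is any iterable of bytes. A live feed would instead loop on
    recv() of a UDP socket that has joined the multicast group.
    """
    for datagram in source:
        buffer.put(datagram)
    buffer.put(_STOP)


def write(buffer: queue.Queue, out) -> int:
    """Drain the queue, parse each tick, and write it as one text line."""
    count = 0
    while True:
        item = buffer.get()
        if item is _STOP:
            return count
        out.write(",".join(map(str, astuple(parse(item)))) + "\n")
        count += 1


def record(source, out) -> int:
    """Run the receiver on its own thread while this thread writes."""
    buffer = queue.Queue()  # unbounded, so bursts are absorbed in memory
    receiver = threading.Thread(target=receive, args=(source, buffer))
    receiver.start()
    written = write(buffer, out)
    receiver.join()
    return written
```

The point of the design is that the receiving side does nothing but enqueue, so slow parsing or disk writes never hold up the network buffer during a burst.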
I begin recording at 09:00 each day (for the possibility of algorithmic "pre-market" analysis) and stop at 16:20. The roughly 20-30GB files are then compressed into .gz format using standard software such as 7-Zip. The original files are then discarded, and the compressed files are transferred to my Microsoft Azure Cloud Storage account. I can invariably compress the files to about 10% of their original size, or roughly 2.5GB to 3.5GB.
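The compress-and-discard step could look something like the Python sketch below, which uses the standard `gzip` module in place of a desktop tool such as 7-Zip. The function name and the `.gz` naming convention are assumptions for illustration, and the upload to cloud storage is deliberately left out.

```python
import gzip
import os
import shutil


def compress_and_discard(path: str, level: int = 9) -> float:
    """Gzip `path` to `path + ".gz"`, delete the original, return the size ratio."""
    gz_path = path + ".gz"
    # Stream in 1 MB chunks so a multi-GB file never has to fit in memory.
    with open(path, "rb") as src, gzip.open(gz_path, "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)
    original = os.path.getsize(path)
    compressed = os.path.getsize(gz_path)
    os.remove(path)  # only reached once the compressed copy is fully written
    return compressed / original if original else 0.0
```

The returned ratio makes it easy to log how close a given day comes to the roughly 10% figure.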
I then download recent updates on a periodic (weekly) basis and distribute them across all my backtesting/analysis servers. I replay the tick data using the C#/.NET built-in decompressing reader. Keep in mind that as each tick is decompressed, it is placed on a queue and an event is fired that processes the tick throughout my backtesting system and strategies. As a result, I usually have 6 cores busy on a dual Xeon 8-core server at any given point. Backtesting a single day requires only a few minutes (depending, of course, on the complexity of the strategy), and then the entire set of trades and messages over the backtesting period is serialized and stored as a "Model" object.
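A rough Python sketch of the replay loop follows; it is not the C#/.NET implementation. It assumes one text line per tick inside the `.gz` file, and the `handlers` callables are hypothetical stand-ins for strategies subscribed to the tick event. One thread decompresses while another dispatches, with a bounded queue between them so memory use stays capped.

```python
import gzip
import queue
import threading

_END = object()  # sentinel marking the end of the file


def _decompress(path: str, buffer: queue.Queue) -> None:
    """Stream-decompress the file and enqueue one tick line at a time."""
    try:
        with gzip.open(path, "rt", encoding="ascii") as f:
            for line in f:
                buffer.put(line.rstrip("\n"))
    finally:
        buffer.put(_END)  # always unblock the consumer, even on error


def replay(path: str, handlers, max_pending: int = 100_000) -> int:
    """Feed every tick in `path` to each handler; return the tick count."""
    buffer = queue.Queue(maxsize=max_pending)  # bounded: reader waits if full
    reader = threading.Thread(target=_decompress, args=(path, buffer), daemon=True)
    reader.start()
    count = 0
    while True:
        record = buffer.get()
        if record is _END:
            break
        for handle in handlers:  # the "event": every subscriber sees the tick
            handle(record)
        count += 1
    reader.join()
    return count
```

Because the file is decompressed as a stream, the full uncompressed day never needs to exist on disk or in memory at once.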
I have created a WPF viewer for the model that displays the market data and various transformations (differencing, moving averages, volume, cumulative volume, quote frequency, and so forth). I use the Visiblox package to greatly facilitate this, and I include annotations showing where I've placed my trades so I have a visual sense of the strategy. Additionally, because I have the full Model characteristics, I can compute various performance measures against the backtest (Sharpe ratio, annualized return, and so forth).
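For reference, the two performance measures named above can be computed from a series of daily returns along these lines. This Python sketch assumes 252 trading days per year, a risk-free rate of zero unless one is supplied, and the sample standard deviation; other conventions exist, and the function names are illustrative only.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed number of trading days per year


def annualized_return(daily_returns):
    """Compound the daily returns, then scale the growth to one year."""
    growth = math.prod(1.0 + r for r in daily_returns)
    return growth ** (TRADING_DAYS / len(daily_returns)) - 1.0


def sharpe_ratio(daily_returns, daily_risk_free=0.0):
    """Annualized Sharpe ratio: mean excess return over its sample std dev."""
    excess = [r - daily_risk_free for r in daily_returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(TRADING_DAYS)
```

Both functions take a list of simple daily returns (0.01 for a 1% gain), and `sharpe_ratio` needs at least two observations with some variation between them.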
Now, the entire process I described is necessary because I am using machines with only 12GB of memory. Each day's worth of compressed CTS/CQS data is approximately 3GB. If I had access to a 64GB or 128GB machine, the backtesting procedure would be far quicker, as I could load an entire month or two worth of data into memory and never have to access secondary storage (be it an HDD or SSD).
My current project is to move the entire backtesting apparatus onto the Microsoft Azure platform, so that I can fully avail myself of the "utility computing" model and backtest day and night with effectively unlimited resources. The decrease in trading volumes has actually made backtesting with home-grown software easier. That is another reason why I develop fully on the Microsoft stack: everything just "works" together, without headaches over which version of Linux I'm using and so forth. But that's just a personal aside.
The gold standard, in the final analysis, for these sorts of systems is of course KDB+, which is incredibly fast and powerful. It is an in-memory database with an exceptionally brilliant design and comes with its own extremely concise language (q). But, since I've been a freelancer, I've had to develop my own techniques for managing large amounts of tick data.
I hope this article is useful to other financial technologists who regularly record and analyze capital market data.
Copyright © 2013, Srikant Krishna
Srikant Krishna is a financial technologist and quantitative trader. He has a background in biophysics, software development, and the capital markets. He grew up in Holmdel, New Jersey; New York City; and Boston, and currently resides in Stamford, CT.
You can follow him on Twitter @SrikantKrishna and on LinkedIn at http://www.linkedin.com/in/srikantkrishna/, or e-mail him at firstname.lastname@example.org.