Scarcely a day passes without a headline about "Big Data" and how analyzing it is going to lead to huge efficiencies, targeted marketing and large profits. Some recent headlines that gained traction: Forbes - "How Target Knew a Teenager was Pregnant Before Her Father Did"; WSJ - "Sears is planning to use analytics to drive new sales"; Businessweek - "New analytics technology is predicting behavior-and building businesses".
Recently, the market intelligence firm IDC put forth a forecast stating that the market for data analytics is poised to grow from $3.2 billion now to $16.9 billion per year by 2015 at a CAGR of 40%. There are a large number of existing and new firms about whom there is glowing coverage because of this "Big Data" hypothesis. Existing firms like IBM (NYSE:IBM), Teradata (NYSE:TDC), SAS and GE (NYSE:GE), newcomers like Opera Solutions and Mu Sigma, cloud firms like Salesforce.com (NYSE:CRM) and EMC (NYSE:EMC) come to mind. Also, some of the huge valuations of internet-based companies that collect data and offer data related services such as Amazon (NASDAQ:AMZN), Google (NASDAQ:GOOG) and Facebook (NASDAQ:FB), are based on the idea that the collected data can be monetized.
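As a quick sanity check on the forecast arithmetic, the growth rate implied by the two quoted endpoints can be computed directly. This is only a sketch: the text says "now" without naming a base year, so the horizons tried below are assumptions, and `implied_cagr` is a helper name introduced here for illustration.

```python
def implied_cagr(start_value, end_value, years):
    """Compound annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted from the forecast, in billions of dollars.
START, END = 3.2, 16.9

# The horizon is an assumption: the text does not state the base year.
for years in (3, 4, 5):
    print(f"{years} years: {implied_cagr(START, END, years):.1%}")
```

Under these assumptions, the quoted 40% rate matches the endpoints only if the growth is spread over roughly five years; a shorter horizon would imply a considerably steeper rate.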
The basic hypothesis goes like this: 'It's becoming ever cheaper to collect and store data due to ever increasing networking, new database structures and rapidly falling data storage costs. Companies are collecting mountains of data. Now they want to extract value from all this data they have, but don't know how to do it. Companies that help them extract the insights will be well rewarded.'
There are a number of sub-hypotheses:
A) This data collection is a recent phenomenon
B) Value is not already being extracted from whatever data is being collected
C) Companies will need outside help to extract insights
D) Outsiders can help companies extract insights without having deep industry knowledge
E) The insights gathered from ever larger data sets have more value and are more accurate than insights gathered from smaller data sets
F) Unstructured and cross functional data have huge value waiting to be extracted
So where does the truth lie? Are there huge profits to be made out of data analytics (or data mining as it used to be known), both for companies collecting data and their new data analytics hired guns?
To arrive at a verdict on the main hypothesis, I'll examine each sub hypothesis in turn.
A) This data collection is a recent phenomenon - Any industry that sees an immediate payoff from analyzing data has been doing it for at least the last 20 years, if not longer. The financial industry is perhaps the prime example of data analytics, ranging from credit scores at the retail level to high frequency and quant trading. The airline industry has long been crunching passenger data to optimize everything from ticket prices to flight schedules. The manufacturing and utility industries use SCADA systems to collect thousands of data points, typically at intervals of a few minutes, from thousands of sensors located in plants and other facilities, and use this data to optimize operations. Consumer Packaged Goods (CPG) companies like P&G (NYSE:PG) owe their entire success to analyzing vast amounts of sales and marketing data. Pharmaceutical companies collect and analyze data from the drug discovery stage through to large post approval population studies. It goes without saying that the insurance industry has been crunching actuarial data for a couple of hundred years now.
It's clear that collecting and analyzing large quantities of data is not really a new phenomenon.
B) That value is not being extracted by the companies that do currently collect data - As I have pointed out, airlines, manufacturing, insurance, banking - industries that collect data USE the data. Usually, when an IT system is put in place, a return on investment analysis is done. There is considerable effort and cost, in terms of hardware, storage (dropping fast!), software licenses (rising, except for products like Hadoop) and personnel costs to collect and store data (and thus the need for the cloud). Remember someone has to write a budget justification for all this data collection and storage each year.
Data collection and analytics are quite mature in a number of industries. The low-hanging fruit, in the form of industries that benefit most from data analytics, has already been picked. For there to be untapped value, one would then have to look for industries which a) collect data and b) don't know the value of the data they collect and c) don't know how to analyze the data they are collecting. It's a pretty steep hurdle.
The real reason for the dissatisfaction that one often encounters when talking to executives at many companies is that they have been promised 'Moneyball' type results from data analytics, but have not gotten them.
C) That companies will need outside help to extract insights - This is where the data analytics companies step in, offering to analyze the data and unlock the hidden value. Usually executives fall for this because they believe some huge insights will be delivered that will enable them to win the metaphorical World Series.
There's a saying: "Give a man a fish and he'll eat today; teach him to fish and he'll eat for a lifetime". Various data analytic tools have been popular for the better part of two decades.
Despite the brouhaha, most data analytics work does not require a very high level of knowledge. The backbone of the industry is a recent, poorly paid STEM graduate sitting in Bangalore with a few months of "data analytics" training. Those Ph.D.s in math from MIT? Merely used to reel in the sales. They are usually far too few and too expensive to actually work on the nitty gritty of client data. It's the law firm associate model again. The employee churn at data analytics firms is about 30-40% a year. There are no pools of deep institutional knowledge at data analytics firms.
The "Big Data" analytics vendors walk a fine line: they claim that it's becoming possible to analyze all this big data, but that only they can do it, not the average analyst or manager (who actually has the industry knowledge to boot). But the analytics software packages in the market have been steadily increasing in both processing ability and user friendliness (Minitab, SAS, SPSS, Systat, to name a few). Today, any reasonably competent person is able to perform tasks that a few years ago would have taken an entire team of analysts. That's also the reason why vendors can mint "Analytics Experts" from STEM graduates with a few weeks of training.
Given the low barriers to doing the analysis in house, I don't see all these companies like IBM and Opera selling data analytic services for the long-term, especially at $100-200 per hour.
D) That outsiders can help companies extract insights without having industry knowledge - This is perhaps the linchpin of the big data hype. Broadly, there are two types of analytics. One type is purely statistical analytics, which requires little or no industry knowledge. It uses various statistical techniques to identify trends, events and relationships. Finding those Target customers who look for maternity clothes and/or are pregnant falls under this category.
The second type is model-based analytics. In this type of analytics, an industry expert develops a model for a particular process and the model is fed with data to calibrate the model. Various parameters are added and removed to fine tune the model. Finally, the calibrated model is used for predictive purposes. A model that uses engine SCADA data to predict jet engine performance under various conditions would be an example of this type of analytics. The most valuable type of analytics is this second type, because it directly provides information on a particular process, allowing it to be optimized and improved.
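The calibrate-then-predict loop described above can be sketched with the simplest possible model, a straight line fitted by ordinary least squares. Everything here is hypothetical: the readings are made up for illustration, and a real engine model would be built by a domain expert and would be far richer than one slope and one intercept.

```python
def calibrate_linear(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Use the calibrated model for a condition not yet observed."""
    return intercept + slope * x


# Hypothetical sensor readings (operating load vs. measured output).
load = [1.0, 2.0, 3.0, 4.0]
output = [3.1, 4.9, 7.2, 8.8]

intercept, slope = calibrate_linear(load, output)
print(f"calibrated model: output = {intercept:.2f} + {slope:.2f} * load")
print(f"predicted output at load 5.0: {predict(intercept, slope, 5.0):.2f}")
```

The statistics in this step are routine; the part that needs industry knowledge is deciding which variables belong in the model and what functional form the process actually follows.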
There are a staggering number of industries with a staggering number of processes in each industry. It's not possible for an IBM or EMC to have experts in every industry. Consulting firms by nature have people who are not experts in one industry but can be moved from project to project and industry to industry. Thus, data analytics firms, except for those that specialize in a single industry (e.g. Nielsen in marketing, or the credit bureaus), usually offer the first type of data analytics, based on statistical techniques. This is also the easiest expertise to develop in house, for the reasons I've mentioned above. I'm deeply skeptical that outside companies can step in and extract any really deep insights of the second kind without deep subject matter expertise.
E) The insights gathered from ever larger data sets have more value and are more accurate than insights gathered from smaller data sets - This is perhaps the most common misconception. Most statistics deals with samples of data, because it's usually not possible to capture the whole population. A sample of data from a larger data set gives you an estimate for various properties of the whole data set. As the sample size or the number of samples gets larger, the properties of the larger data set are estimated with a confidence interval that gets smaller. This is a fancy way of saying that, for a given level of confidence (95%, 99.5% or whatever), the range of plausible values gets smaller and smaller.
Let's take the example of the number of people who walk into a store and buy a certain product. By studying 10 customers you can get an estimate of the probability of the purchase and predict sales; by studying 100 customers you get a better estimate. By analyzing 1000 customers you have a very good estimate. You could study a million customers, but the results are unlikely to be vastly different from the results of analyzing 1000 customers and probably not worth the expense. Once you have a statistically sufficient amount of data, more data is of steeply diminishing value. A very big data set is not really that much more valuable than a sufficient and representative data set, since both of them will be analyzed using sampling techniques, yielding very similar results.
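The diminishing returns in the store example can be made concrete with the standard margin of error for an estimated proportion. This is a rough sketch using the normal approximation, which is crude at very small sample sizes such as 10; the purchase probability of 0.5 is an assumption chosen because it gives the widest interval.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n customers."""
    return z * math.sqrt(p * (1.0 - p) / n)


# Assumed purchase probability of 0.5 (the worst case for interval width).
for n in (10, 100, 1_000, 1_000_000):
    print(f"n={n:>9,}: +/- {margin_of_error(0.5, n):.4f}")
```

Because the margin shrinks with the square root of the sample size, each hundredfold increase in customers studied narrows the interval by only a factor of ten. Going from 1000 customers to a million takes the estimate from about plus or minus 3 percentage points to about plus or minus 0.1, which rarely changes a business decision.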
Of course, very large data sets are sometimes useful when an event occurs very rarely, e.g. a terrorist attack or a turbine blade failure. In such cases, the event that you are looking for happens so rarely that sampling techniques may not capture it. Another type of large data that is useful is data that is widely dispersed, such as weather data and other earth sciences data. But such examples are very specialized and not common. In fact, for most common business applications, the data gathered is cyclic and repetitive. The whole hype over "Big Data" contradicts the basic purpose of statistical techniques, i.e. sampling and estimation to avoid the task of analyzing the whole data set.
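The rare-event exception is easy to quantify. The sketch below computes the chance that a random sample contains at least one occurrence of an event, assuming independent observations; the one-in-a-million event rate is a hypothetical figure chosen purely for illustration.

```python
def prob_sample_contains_event(event_rate, sample_size):
    """Probability that at least one of sample_size independent records shows the event."""
    return 1.0 - (1.0 - event_rate) ** sample_size


# Hypothetical event rate: one occurrence per million records.
RATE = 1e-6

for n in (1_000, 100_000, 1_000_000, 10_000_000):
    print(f"sample of {n:>10,}: {prob_sample_contains_event(RATE, n):.3f}")
```

Under these assumptions a sample of 1000 records almost never contains the event, and even a million records catch it less than two times in three, which is why rare-event problems are the special case where sheer volume helps.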
F) Unstructured and cross functional data have huge value waiting to be extracted - I've touched on this before. Statistical techniques work best on what is known as coherent data sets. Despite the beliefs of company executives, different pools of data in a company are usually different for a functional reason.
There are some obvious cases where this hypothesis is true and "Big Data" promoters and media love this particular storyline. If a person applies for unemployment insurance, you can usually predict he's going to default on his mortgage and is not in the market for a new car.
In the real world, however, it's very hard for Kroger (NYSE:KR) to connect data on the number of workers who showed up late in a given year to the sale of strawberries in summer. Or for Ford (NYSE:F) to use data on how long an engine head gasket lasts to predict what shape of front grille will be preferred this year. Usually when consultants come in and apply statistical techniques without industry knowledge, some correlations will be noticed and bizarre recommendations will be made - "I noticed x number of pilots ate bananas and with a t-value of 3.2 fog rolled in from San Francisco Bay; hence if we remove bananas from the crew lounge, there won't be any more fog". Correlations are easy to find, but causation is not!
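How easily such correlations turn up can be shown with a small simulation on pure noise. This is a sketch under stated assumptions: 50 unrelated variables with 30 observations each, and a cutoff of about 0.361 for the correlation coefficient, which is roughly the two-sided 5% significance threshold at that sample size. The function names and parameters are introduced here for illustration only.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_pairs(n_vars=50, n_obs=30, r_threshold=0.361, seed=0):
    """Count pairs of independent random columns whose |r| exceeds the cutoff."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)]
    hits = 0
    pairs = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            pairs += 1
            if abs(pearson_r(columns[i], columns[j])) > r_threshold:
                hits += 1
    return hits, pairs


hits, pairs = count_spurious_pairs()
print(f"{hits} of {pairs} unrelated pairs look 'significant'")
```

With 1225 possible pairs and a 5% false-positive rate, around 60 "significant" relationships are expected even though every column is random noise. The more unrelated data pools get thrown together, the more banana-and-fog findings appear.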
The notion that all cross functional data within organizations, along with voicemail, email, chats and web stats can be linked for extra value is way overblown. Additionally, the average manager or executive is deluged by reports, insights and metrics that are generated and delivered every day to his inbox. More insights, especially based on distantly related or unrelated data? I suspect the answer is a heartfelt no.
The big picture - So what does this mean in terms of the Big Data thesis? My personal opinion is that the potential market size and value proposition are both over-hyped. I'm reminded of the hype over biotechnology in the late eighties and early nineties. For example, biotechnology was supposed to have brought us juicy, red, delicious and stackably cubical tomatoes that would yield square slices that fit a slice of bread. I'm still waiting for my square tomato slices.
To believe the IDC forecasts, data analytics revenue in 2015 will be equal to the current revenue of Indian giant Tata Consultancy Services ($10 billion) plus a couple of its nearest competitors (about $3-4B each), or to the combined current revenue of a whole handful of Indian companies such as Cognizant (NASDAQ:CTSH), Infosys (NYSE:INFY), Wipro (NYSE:WIT) etc. I believe that these optimistic projections make the classic mistake of extrapolating along straight lines.
In fact, when you look at the "success" stories of companies such as IBM, the ballyhooed results are almost comical ("Reduced Report Generation Time by 63%!"). Another common theme, found on all the data analytics companies' websites, is that their focus areas seem to be internet, banking, healthcare, industrial, life sciences, CPG, insurance - the very areas which have traditionally crunched large amounts of data, long before it became "sexy". So where are the big inroads into new markets? One example of such an inroad is law. Law firms' revenues have been destroyed by revolutionary data analytics technology, which has replaced law associates with software, leading savvy clients to pay very much less for the same work.
It might be an inconvenient truth to say so, but most mature industries have already realized the big optimizations - i.e. it is not possible to wring huge savings out of a car manufacturing plant or a power plant. With intensive data analytics, some incremental savings can be realized, but it's evolutionary, not revolutionary. Where data analytics has been revolutionary, the results have not been as expected, as with law firms.
Some people will point to the high demand for statistical skills in the employment market as proof of the value that these skills bring to companies. The relatively high pay for statistical skills in the U.S. is more a reflection of the American aversion to such careers than an indication of exceptional training, skills, talent or value they bring. Crack open the visa gates a bit and you'll have a zillion statisticians from India, Russia, China and the Philippines, and the pay will come crashing down.
Last but not least, when everyone deploys the same solutions, no one has a competitive advantage and the data analytics/"Big Data" story becomes merely another cost of staying in business, just like insurance and legal costs. When every team has a Billy Beane, no one is any better off.
I do believe that data analytics firms will deliver useful tools and services based on statistical methods, even without deep industry specific knowledge, that offer some value to their customers. But no one has a patent on the most commonly used statistical methods which are decades old or even hundreds of years old. Therefore, these services will quickly become a commodity and prices will race to the bottom. After all, there is really no barrier to entry. Buy a statistical package or two, hire a few programmers in Bangalore, a couple of ex-cheerleaders for sales and you are in business. Any existing software services company should be doing this, if they are not already doing this.
I would advise investors on SA to be very, very wary of buying companies that are big on the data analytics hype. Many of the players (SAS, Autonomy, emerging companies like Opera, Mu Sigma) are private but may go public at some point. Other players are a big part of the valuation of bigger companies such as IBM and EMC. Don't buy the hype.
Disclosure: I have no positions in any stocks mentioned, and no plans to initiate any positions within the next 72 hours.