Please Note: Blog posts are not selected, edited or screened by Seeking Alpha editors.

Social Media, Big Data And Stock Prediction: Knowing Where To Look

The challenges of understanding Big Data emerge even when relatively small amounts of data must be analyzed. By applying topical filters, however, firms can begin to make the problem more manageable and sharpen resulting trading signals.

Everyone these days is grappling with "Big Data" - even trying to define what the term describes is a challenge. While it is true that the total amount of information being generated by social media conversations, blog posts and tweets is "big," and growing at a breakneck pace, it is important to add a contextual dimension to the definition of "Big Data": Here, "Big" means any amount of information that a single person cannot interpret, analyze, and use to make an informed decision. When you define it this way, it is easy to see that the threshold for "Big" is actually very low. So one needs a filtering variable that pares down the amount of data to analyze and serves as a topical filter, reducing the noise in the information stream.

When analyzing the finance vertical and related conversations, there are many filtering variables that can be applied, and the choice of variables is key to identifying "where to look" for quality information. Examples of filtering variables can be company and product names, "cash-tags" (the "$" in front of a ticker) in tweets, keywords related to a company's input costs, etc. Once these variables are chosen, the amount of information that must be processed can be drastically reduced.
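
As a rough illustration of such filtering variables, the sketch below keeps only posts that mention a watched cash-tag or an input-cost keyword. The sample posts, the ticker set, the keyword set, and the helper names `extract_cashtags` and `matches_filter` are all hypothetical; a real strategy would choose its own variables.

```python
import re

# A cash-tag is a "$" followed by a short run of letters (assumed here to be
# 1-5 letters), not preceded by another letter or digit.
CASHTAG_RE = re.compile(r"(?<![A-Za-z0-9])\$([A-Za-z]{1,5})\b")


def extract_cashtags(text):
    """Return the set of upper-cased tickers written as cash-tags in text."""
    return {m.upper() for m in CASHTAG_RE.findall(text)}


def matches_filter(text, tickers, keywords):
    """Keep a post if it carries a watched cash-tag or a watched keyword."""
    if extract_cashtags(text) & tickers:
        return True
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


# Hypothetical sample data, for illustration only.
posts = [
    "Loading up on $AAPL before earnings",
    "Great weather today, heading to the beach",
    "Jet fuel prices spiking again, airlines will feel it",
    "That cost me $5 at the store",
]
tickers = {"AAPL"}       # hypothetical watch list
keywords = {"jet fuel"}  # hypothetical input-cost keyword

filtered = [p for p in posts if matches_filter(p, tickers, keywords)]
```

Here only the first and third posts survive the filter; the dollar amount in the last post is not treated as a cash-tag because no letters follow the "$".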

The key to defining the variables is to start by knowing what questions you want answered. Human intervention is necessary in order to put context around what information to extract; otherwise, the problem becomes too unwieldy. When it comes to stock prediction, there is a wealth of information to extract, but questions must be asked first. Whether you are a day trader or a quant hedge fund, what drives your strategy, and thus your potential returns, is unique, and therefore the filtering methods need to be customized.

Once the filtering process is complete, analytical approaches such as sentiment analysis, pattern matching, non-linear regressions and the like can produce potentially more relevant results. The analogy is to find accuracy, then strive for precision. Otherwise, you may wind up with spurious results and coincidental correlations.

But even after you have narrowed down your filtered data stream, it is still important not to use analytical tools blindly. One example of this is to use customized natural language processing (NLP) engines. It is relatively straightforward to apply off-the-shelf NLP engines to unstructured text data, and doing so can yield some accurate sentiment readings. But in order to achieve much higher accuracy rates, one must apply human training. A machine is only as good as its inputs, especially when dealing with very narrow contexts and "trading-specific" language.

The last step - and certainly the most important and the hardest - is to "train" your machine, giving it enough context to recognize domain-specific language, financial acronyms, and some of the harder and more nuanced language, such as sarcasm and negation. This can only be done with human input, which makes it the last component in ensuring that the signals being generated are as focused and relevant as possible to answering the original questions that were posed.
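
To make the role of human input concrete, here is a minimal sketch of a lexicon-based scorer in which a person supplies the trading vocabulary and a simple negation rule. It is a toy, not a trained NLP engine: the word lists, the three-token negation window, and the function name `score` are assumptions chosen for illustration, and sarcasm is not handled at all.

```python
import re

# Hypothetical, hand-curated trading lexicon: this is the human-supplied part.
DOMAIN_LEXICON = {
    "bullish": 1, "long": 1, "beat": 1, "upgrade": 1,
    "bearish": -1, "short": -1, "miss": -1, "downgrade": -1,
}
# Hypothetical list of negation words.
NEGATORS = {"not", "no", "never", "isn't", "don't", "won't"}


def score(text, window=3):
    """Sum lexicon values, flipping the sign of a sentiment word that
    appears within `window` tokens after a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0
    negate_left = 0  # how many more tokens the last negator still covers
    for tok in tokens:
        if tok in NEGATORS:
            negate_left = window
            continue
        value = DOMAIN_LEXICON.get(tok, 0)
        if value and negate_left > 0:
            value = -value   # negated sentiment word
            negate_left = 0  # a negator is consumed by one sentiment word
        elif negate_left > 0:
            negate_left -= 1
        total += value
    return total
```

Under these assumptions, "I am bullish on this name" scores positive, while "I am not bullish on this name" scores negative, and "not going to short this" scores positive because the negator flips the bearish word "short". A general-purpose engine without the trading lexicon would have no reason to treat "long" or "short" as sentiment at all.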

The amount of data being generated by social media is truly staggering, and it is easy to get overwhelmed with the daunting task of making sense of it all. But stepping back and answering the right questions in order to guide one's analysis can go a long way not only toward reducing the storage and analysis problem, but also toward being able to extract some useful trading signals from the noise.