Twitter, Sentiment Analysis, And An October Biotech Recap

by: Bill Koski

Twitter remains a valuable tool for biotechnology investors to disseminate information and share ideas.

In this article, I use social media data to analyze over 100,000 tweets about publicly traded biotechnology companies in October. I also report on the development of textual analysis tools in Python.

I publish trending ideas in the investment community and propose using sentiment analysis to build predictive models focused on key catalysts.

I was introduced to biotech investing by a mentor who had spent years investing in and writing about biotech. His colleague, a bit of a guru in the industry, got me started on Twitter, citing it as one of the most important professional tools in his career in equity research. I have since come to realize that public forums, like Twitter and Seeking Alpha, are valuable tools for investors.

Love Twitter or hate it, to read this article, we must agree on two central postulates as they relate to information on social media:

  1. Twitter allows for the rapid spread of ideas, information, and news
  2. Twitter allows users to post their opinions, serving as a crowdsource for investor sentiment

As it relates to the investor, these postulates lead me to conjecture that, firstly, the information on Twitter can be used to identify macro-trends in an industry. More importantly, I hypothesize that analysis of individual and aggregate sentiment can be used to predict outcomes relating to important stock market catalysts.

Twitter use has previously been studied extensively in the context of predicting stock market returns. In one of the most highly cited papers in the field of sentiment analysis, "Twitter mood predicts the stock market", Bollen et al. (2011) conclude that public mood can be used to improve the accuracy of DJIA predictions. Since the publication of that landmark paper, Twitter sentiment has been examined in a number of other academic publications. These studies have been a mixed bag. Some have been able to correlate aspects of Twitter sentiment with trading volume and market returns, while others have failed to demonstrate any statistically significant relationship. One of the principal challenges facing sentiment analysis is the amount of noise present on Twitter. Between cat memes, Russian bots, and @realDonaldTrump, winnowing out useful information can make data mining feel like an insurmountable challenge.

However, biotech has managed to build a close community on the internet that brings together scientists, investors, and physicians. As an investor, I am interested in biotechnology investing because it allows me to efficiently allocate capital towards developing new technologies to improve healthcare outcomes. The idea of using sentiment is particularly attractive to the biotechnology investor because the biotech market is so heavily catalyst driven. Clinical trial outcomes, regulatory decisions, and partnership agreements can all drive triple-digit percentage moves in a stock. By narrowing in on a smaller objective than whole stock market returns over an entire network, I hypothesize that individual and targeted group sentiment can be used to predict catalysts and/or stock market outcomes.

Consider, for example, how analysis of publicly available information on Twitter could be used by the enterprising investor:

  • Analyzing the publicly stated opinions of hematologists on Twitter regarding a clinical trial of a new lymphoma drug to predict the outcome of an FDA advisory committee vote.
  • Incorporating trading algorithms to preempt market reactions to social media pundits who have hundreds of thousands of followers and the potential to move markets with a single tweet.
  • Using sentiment analysis to identify red flags in clinical trial results based on popular opinion. That is to say, does popular opinion or do highly insightful individuals have a meaningful ability to predict future outcomes based on past results?

My purpose in writing this article is not to present a polished body of research, but to introduce an ongoing project, demonstrate a proof-of-concept for how this information can be used, and generate ideas and feedback from the Seeking Alpha community.

See bottom of page for more info on the methods used.

Results: Industry Analysis

To conduct an exploratory analysis regarding the information available on Twitter, I built a script in Python to download all of the Tweets in October that mention publicly traded biotech companies.

In the month of October, there were over 100,000 tweets about biotech companies.
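
As a rough illustration of the counting step (a minimal sketch, not the script used for this article), the snippet below tallies how many tweets mention each tracked ticker. It assumes tickers appear as cashtags such as $AMRN; the sample tweets, the `tracked` set, and the `count_mentions` helper are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample tweets; the real corpus is not reproduced here.
tweets = [
    "$AMRN REDUCE-IT data looks strong",
    "Watching $BIIB and $AMRN today",
    "$amrn up again, $AMRN to the moon",
    "No ticker in this one",
]

# Illustrative subset of the tracked tickers (assumption).
tracked = {"AMRN", "BIIB", "TLRY"}

# Assumes tickers are written as cashtags of one to five letters.
CASHTAG = re.compile(r"\$([A-Za-z]{1,5})\b")


def count_mentions(tweets, tracked):
    """Count how many tweets mention each tracked ticker at least once."""
    counts = Counter()
    for text in tweets:
        # Using a set counts a ticker once per tweet, even if repeated.
        tickers = {match.upper() for match in CASHTAG.findall(text)}
        counts.update(tickers & tracked)
    return counts


mentions = count_mentions(tweets, tracked)
print(mentions.most_common(10))
```

Counting each ticker once per tweet is a design choice made here for simplicity; counting every occurrence, or discounting retweets, would be equally defensible.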

Top 10 companies by number of Twitter mentions in October

Not so fishy fish oil: Amarin (AMRN) clocks in at first place with a whopping 5,700 tweets in October. This comes following the company's announcement at the end of September of positive results from the REDUCE-IT study, which showed that patients treated with their fish-oil formulation achieved a 25% relative risk reduction in cardiovascular outcomes. Following the news, the stock rocketed up over 300%. It comes as no surprise that Amarin has remained a hot talking point into October.

The clock ticks on BAN2401: Biogen (BIIB) presented additional data on BAN2401, its Alzheimer's drug partnered with Eisai. The drug had already sparked a hot debate when the company declared the trial positive at 18 months after previously announcing that it had failed to meet prespecified endpoints at 12 months.

Weed stocks get "high"er: Tilray (TLRY) and Aurora Cannabis (ACBFF) continue to bask in the limelight as retail investors pile into and out of good ol' legal Canadian dro. Weed stocks had a coming-to-God moment in October as investors realized that eventually, you must come down.

Results: Top Terms

To begin analyzing what people are saying about those companies, I implemented several text learning tools in Python.

Term frequency-inverse document frequency (tfidf) is a valuable concept in text learning. It measures how frequently a term appears in an individual document (here, a tweet) relative to how frequently it appears across the whole corpus. It is a useful metric because it can be used to identify what is distinctive about an individual tweet. A word that is used in every single tweet does not provide as much information as a word that is present only in tweets that praise a trial's results. Tfidf measures this "uniqueness" of a word. Paired with a text learning algorithm, tfidf can be a powerful tool.
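
To make the idea concrete, here is a minimal pure-Python sketch of the textbook formula, tf × log(N / df). It is an illustration under simplifying assumptions (whitespace tokenization, no stop-word removal), not the code behind the results in this article. The sample documents and the `tfidf` helper are hypothetical, and sklearn's `TfidfVectorizer` applies a smoothed variant of the formula by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical mini-corpus; each string stands in for one tweet.
docs = [
    "trial results positive great data",
    "trial results negative",
    "trial update today",
]


def tfidf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


scores = tfidf(docs)
for doc, doc_scores in zip(docs, scores):
    top_term = max(doc_scores, key=doc_scores.get)
    print(f"{doc!r}: top term = {top_term}")
```

In this toy corpus, "trial" appears in every document and therefore scores zero, while a word that appears in only one document scores highest within it.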

Here I present the top terms identified in my corpus of October biotech tweets.

Hot topics: Top terms mentioned in tweets about publicly traded biotechs

The first thing you should notice is that there is a lot of garbage. A number of these terms ("EPS", "BUY", "ANASTS") are primarily derived from junk Twitter accounts pumping sell-side analyst ratings. This is further supported by the presence of "AMZN" and "AAPL" in the top 50.

However, there is at least one interesting piece of information to glean from this exercise. Andybiotech, a well-regarded biotech investor who is widely followed on Twitter, was mentioned more commonly than ESMO18 (a major medical conference taking place in October), the Food and Drug Administration, and Novartis. To me, this quantifies how widely the views of influential users can be distributed on Twitter and provides a rationale for future study.

Results: Industry-wide Sentiment Analysis

To analyze industry sentiment, I performed textual analysis on all Tweets relating to publicly traded biotechs, using a lexicon of thousands of words previously categorized as conveying "Positive" or "Negative" sentiment to calculate a sentiment index based on the relative frequency of positive and negative vocabulary.
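
The sketch below shows one way a lexicon-based score of this kind could be computed. It is a simplified illustration, not the exact index used here: the two tiny word sets are placeholders for the real lexicon, and the choice to normalize by total token count (and the `sentiment_scores` helper itself) is an assumption.

```python
import re

# Tiny placeholder lexicon (assumption); the real one has thousands of words.
POSITIVE = {"positive", "strong", "great", "beat"}
NEGATIVE = {"negative", "failed", "weak", "miss"}


def sentiment_scores(tweets):
    """Return the share of all tokens that are positive, negative, and net."""
    pos = neg = total = 0
    for text in tweets:
        tokens = re.findall(r"[a-z']+", text.lower())
        total += len(tokens)
        pos += sum(token in POSITIVE for token in tokens)
        neg += sum(token in NEGATIVE for token in tokens)
    if total == 0:
        # Avoid dividing by zero on an empty corpus.
        return {"positive": 0.0, "negative": 0.0, "net": 0.0}
    return {
        "positive": pos / total,
        "negative": neg / total,
        "net": (pos - neg) / total,
    }


# Hypothetical sample tweets for illustration only.
sample = ["Great data, strong results", "Trial failed"]
result = sentiment_scores(sample)
print(result)
```

A simple word-count approach like this ignores negation ("not positive") and sarcasm, which is one reason the output should be read as a rough index rather than a precise measurement.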

Biotech sentiment: October 2018

Biotech sentiment appeared negative on Twitter, and IBB shares slid close to 20% during October. Without a baseline, these data do not allow us to conclude whether this represents a novel shift in sentiment. They do, however, provide an important benchmark for assessing the sentiment of individual firms. To evaluate how attitudes change on Twitter over time, I plan on publishing a monthly index.

Applying sentiment analysis to the individual firms I previously identified as the most tweeted about yields some interesting results. Here I subtract the whole-corpus sentiment scores (shown in the previous figure) from sentiment scores for individual firms to produce an adjusted score.
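
The adjustment itself is a simple subtraction, sketched below. The tickers and numbers are made-up placeholders chosen only to show the arithmetic; they are not results from this analysis, and `adjusted_scores` is a hypothetical helper.

```python
def adjusted_scores(firm_scores, corpus_score):
    """Subtract the whole-corpus score from each firm's raw score."""
    return {
        ticker: score - corpus_score
        for ticker, score in firm_scores.items()
    }


# Placeholder inputs (assumptions, not data from the article).
corpus_score = 0.02
firm_scores = {"AAA": 0.05, "BBB": 0.02, "CCC": -0.01}

adjusted = adjusted_scores(firm_scores, corpus_score)
for ticker, value in adjusted.items():
    print(f"{ticker}: {value:+.3f}")
```

A firm with an adjusted score above zero is being discussed more positively than the biotech corpus as a whole, and one below zero more negatively.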

Panel A shows the October stock returns for the top 10 most-mentioned firms. The IBB returned -17% over the same period. AMRN (+17.4%), JNJ (+1.2%), and LLY (+0.7%) were the only firms to generate positive returns. These three firms also had the highest degree of positive sentiment out of the top 10, ranked in the same order as their October returns (Panel B, left). It is tempting to look at Panel C and think there might be a detectable correlation between positive sentiment and market return. I plan on running this analysis again in a larger sample once I have developed more sophisticated preprocessing to parse out spam and improve handling of retweets. I further plan on running additional experiments to tease out the direction of causality. If there is a significant relationship, does social media react positively because of a shift in intrinsic value, or in response to positive excess returns?

Key Takeaways and Future Work

This article should serve as an introduction to this project. As a scientist by training, I believe in the power of collecting good data and using exploratory analysis to develop and test hypotheses. There are lots of questions to ask here, and I look forward to continuing to share my work with you. I hope that you have learned something new from this article, and I would love to hear about and incorporate your ideas in future work as well.

A brief description of methods:

I started with all firms listed under NAICS codes 325412 (Pharmaceutical Preparation Manufacturing) and 325414 (Biological Product Manufacturing), then kept only firms with a total market capitalization of >$400 million as of YE17. This resulted in a total of 211 firms.
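
The screen can be sketched in a few lines. This is an illustration only: the firm records, field names, and the `select_universe` helper are hypothetical, and the actual data source for NAICS codes and market capitalizations is not shown.

```python
# Screening criteria taken from the description above.
NAICS_CODES = {"325412", "325414"}
MIN_MARKET_CAP = 400_000_000  # USD, as of YE17

# Hypothetical firm records (placeholder tickers and values).
firms = [
    {"ticker": "AAA", "naics": "325412", "market_cap": 1_200_000_000},
    {"ticker": "BBB", "naics": "325414", "market_cap": 150_000_000},
    {"ticker": "CCC", "naics": "541714", "market_cap": 900_000_000},
    {"ticker": "DDD", "naics": "325414", "market_cap": 2_000_000_000},
]


def select_universe(firms):
    """Keep tickers with a qualifying NAICS code and a large enough market cap."""
    return [
        firm["ticker"]
        for firm in firms
        if firm["naics"] in NAICS_CODES and firm["market_cap"] > MIN_MARKET_CAP
    ]


universe = select_universe(firms)
print(universe)
```

Here BBB is dropped for being too small and CCC for having a different industry code.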

I built a script in Python to download Twitter data using the Twitter API. I downloaded all tweets referencing the tickers for the 211 specified firms. The Twitter API limits this type of data to tweets published within the past 7 days, so the script was run weekly. From the raw Twitter metadata I parse out the tweet text, tweet date, number of retweets, and number of favorites, and analyze the data in Python.
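
As a sketch of the parsing step, the snippet below pulls those four fields out of a single tweet object. It assumes the JSON layout of the Twitter v1.1 API (`created_at`, `text` or `full_text`, `retweet_count`, `favorite_count`); the sample payload and the `parse_tweet` helper are hypothetical, and the download step itself is omitted.

```python
import json
from datetime import datetime

# Hypothetical raw tweet in the assumed v1.1 JSON layout.
raw = """{
    "created_at": "Wed Oct 10 20:19:24 +0000 2018",
    "text": "$AMRN still the talk of biotech Twitter",
    "retweet_count": 3,
    "favorite_count": 7,
    "lang": "en"
}"""

# Timestamp format used in the assumed payload.
DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_tweet(status):
    """Extract the text, date, retweet count, and favorite count."""
    return {
        # Extended-mode responses carry the text under "full_text".
        "text": status.get("full_text") or status.get("text", ""),
        "date": datetime.strptime(status["created_at"], DATE_FORMAT),
        "retweets": status.get("retweet_count", 0),
        "favorites": status.get("favorite_count", 0),
    }


parsed = parse_tweet(json.loads(raw))
print(parsed["date"].date(), parsed["retweets"], parsed["favorites"])
```

Keeping only these fields makes the weekly downloads small and easy to append to one another before analysis.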

I perform textual analysis in Python using the sklearn package. Positive and negative sentiment analysis is based on this opinion lexicon.

Disclosure: I/we have no positions in any stocks mentioned, and no plans to initiate any positions within the next 72 hours. I wrote this article myself, and it expresses my own opinions. I am not receiving compensation for it (other than from Seeking Alpha). I have no business relationship with any company whose stock is mentioned in this article.