Welcome to the natural language processing (NLP) edition of my "under-the-hood" series on AI (artificial intelligence) technology. 2016 is shaping up to be the year of NLP. Most major tech companies are announcing initiatives, introducing products (e.g., Amazon (NASDAQ:AMZN) with its Echo device and platform), or conducting experiments in public (e.g., Microsoft's (NASDAQ:MSFT) Tay debacle).
On Seeking Alpha (and I am sure in many other places as well), the topic of natural language processing seems to be confusing many people because they are not clear on what the new technologies can do and what they should expect from them in the immediate future. Even though I have touched on the subject before - e.g., in my explanation of IBM's (NYSE:IBM) Watson - I feel I have not explained the specific topic of natural language processing comprehensively. There is a specific reason the topic is getting so much attention, and it has to do with the convergence of various subfields of computer science and statistics.
Let us begin with the headliner of the past two weeks. A reader has asked me the following about Tay's Twitter (NYSE:TWTR) disaster:
I've been asking myself whether all we can say about the situation is to make jokes about it, or whether it can be compared and contrasted to recent efforts by other entrants in the AI race, with some investment implications.
I would welcome an article from you on the incident, since your focus on AI may give you better perspective than the rest of us.
Because Microsoft has released the underlying code as an open-source framework, we can get an in-depth look! So let us first discuss what happened on Twitter.
First, Tay is not a substantial advancement in artificial intelligence. It does not illustrate a new development in natural language processing that is proprietary to Microsoft, so there is no investment implication from that end, although it certainly got Microsoft a lot of publicity. What it does illustrate is the problem with two-way communication with deep learning models. To explain this, we have to go back a little and get an overview of natural language processing.
Pattern Matching And The Birth Of NLP
The earliest examples of conversational software date back to the 1960s with the famous chat bot ELIZA. Its design has influenced chat bots for many decades. The idea behind chat bots and online assistants in the early years of the web was simple pattern matching. Input from users is scanned for specific words or word combinations. Responses simply slot these words into pre-programmed templates, e.g., "You are thinking x" - "What makes you believe you are thinking x?" This is "safe" in the sense that you could easily prohibit certain words and fall back to default phrases when the input did not make sense. It is straightforward to see how you could extend this with some information retrieval, so a bot could react to keywords by suggesting helpful information.
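The template-reflection idea is simple enough to sketch in a few lines of code. This is an illustrative toy, not Weizenbaum's original rule set; the patterns and responses are my own invented examples:

```python
import re

# ELIZA-style pattern matching (toy sketch): match a pattern, then reflect
# the captured words back inside a pre-programmed response template.
RULES = [
    (re.compile(r"i am thinking (.*)", re.I),
     "What makes you believe you are thinking {0}?"),
    (re.compile(r"i feel (.*)", re.I),
     "Why do you feel {0}?"),
]
DEFAULT = "Please tell me more."  # safe fallback when nothing matches

def respond(user_input):
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            return template.format(match.group(1))
    return DEFAULT

print(respond("I am thinking about the market"))
# What makes you believe you are thinking about the market?
```

Note there is no "understanding" anywhere: the bot never knows what "the market" is, it only moves the words around.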
Statistical Machine Learning
Early NLP was driven by linguists such as Noam Chomsky. He advocated an approach that attempted to capture the rules of language, to truly understand its structures. These rules were then supposed to allow computers to generate and understand natural language. However, it turned out that this approach does not generate good results. Human language in all its variants is simply too complex to capture everything in a finite set of hand-crafted rules. This is where NLP took a path that ultimately explains what happened with Tay: statistical machine learning took over. Statistical models simply are not concerned with meaning as a meta-concept. They are great at automatically capturing structure between things, but they assign no meaning to those things (e.g., words). Let me illustrate this by explaining Hidden Markov Models (HMMs), an extremely popular approach for generative language models over the past two decades.
A Hidden Markov Model is called "hidden" because we assume there are some hidden (also called "latent") variables that we cannot observe that cause something that we can observe. For instance, let's say I cannot observe the weather because my room has no windows, but I can hear what's going on outside. The sound of children playing would be an observation that would give me information about a latent state that I cannot observe, e.g., children playing equals no rain.
In natural language processing, we observe some sequence of words and try to infer the likelihood of, for instance, the next word or character in speech recognition, e.g., the posterior probability of the next character given the previous characters. We don't know what exactly drives the process of generating words; we can only observe them and try to predict what will come next from past observations.
The great thing about an HMM is that it's relatively easy to get an intuition about how to do this. We train our HMM by showing it a lot of text, and it essentially just has to count the occurrences of words. This explains intuitively why large data volumes ("big data") are so important for NLP. If we want to be good at estimating the probability of some word sequence, we need to have seen as many word sequences as possible.
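The counting intuition can be sketched with a toy bigram model: estimate the probability of the next word given the current word by counting pairs. This is not a full HMM (which adds hidden states over the observations), but the "train by counting" idea is the same; the corpus here is an invented example:

```python
from collections import defaultdict

# Toy bigram model: P(next | current) estimated purely by counting pairs.
corpus = "the bank approved the loan and the bank closed".split()

counts = defaultdict(lambda: defaultdict(int))
for current, nxt in zip(corpus, corpus[1:]):
    counts[current][nxt] += 1

def next_word_prob(current, nxt):
    """Relative frequency of `nxt` among all words seen after `current`."""
    total = sum(counts[current].values())
    return counts[current][nxt] / total if total else 0.0

print(next_word_prob("the", "bank"))  # 2 of the 3 words after "the" are "bank"
```

The more text we feed in, the better these frequency estimates approximate the true probabilities, which is exactly why data volume matters so much.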
Why do we want to estimate these sequences? To extract information about the structure of language. If we can predict whether "bank" is meant as a verb (bank on), a financial institution or the bank of a river, we can label each word in a sentence, e.g., with its grammatical role. Then we can go on to extract entities, events or important keywords in a query to Siri.
Techniques like these have been used over the past two decades in every aspect of NLP, e.g., text-to-speech, speech-to-text, statistical machine translation, question answering systems, Watson, Siri - everywhere really. It is also relatively easy to see why these models often produce useless answers. First, the more complex a sentence, the longer it takes to evaluate possible sequences because the number of combinations increases drastically. Second, if the model has not seen a specific use of a word, it will have to fall back on sequences it has seen for another use of that word. Words are just meaningless tokens; they only exist as probabilities.
Let me capture that structure for you ...
The issue with classic statistical models is that they are very powerful at capturing a certain structure, but only the structure we tell them to through the parameters of our model. For example, computational power has for a long time restricted such models to only consider the immediate surrounding context of a given word, e.g., the preceding two words. Of course, that is not enough to infer anything useful about context from two sentences earlier.
This also tells us a lot about why NLP is so strongly driven by cloud computing. NLP is highly computation-intensive with regard to both CPU power (determining answers by evaluating millions of possible sequences) and memory (holding all parameters of the learned model). Hence, the only economical way to perform meaningful NLP tasks is for devices to connect to cloud services. This also explains why IBM Watson was already so powerful a few years ago: early versions of Watson ran on IBM supercomputers, as cloud computing was not yet as established.
The other reason NLP is exploding as a field is the advent of deep neural networks. I explained this really briefly in my article on DeepMind beating Go in the context of image recognition. Since this article is already getting fairly long, I will try to keep this explanation brief. If there is interest, I might do another article just on deep learning itself. In short, the most popular neural networks in NLP are recurrent neural networks (RNNs). A great blog post on them can be found here.
The fundamental thing to understand about RNNs is that they can automatically learn sequential structures such as word sequences (and generate new ones). Further, the most common type of RNN also has a memory: It uses previous input to make decisions instead of just considering the current sequence. It might be somewhat intuitive to see how this can be used for concepts like context in language. The key takeaway is that neural networks are largely black-box models, that is, they are really good at automatically detecting structure, but understanding why exactly that is so from a trained model is quite hard. This insight enables us to finally understand Tay.
The parameters of a neural network model are the weights and activation functions of each individual neuron, which determine whether a neuron fires on a given input. This works great for many tasks, but the final trained model is represented by just a matrix of numbers.
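To make the "just a matrix of numbers" point concrete, here is a minimal recurrent forward pass. This is a generic sketch of how an RNN step works, not Tay's actual architecture; the vocabulary, dimensions and random weights are all illustrative assumptions:

```python
import numpy as np

# A minimal RNN step: the trained "model" is nothing but weight matrices.
# Inspecting these numbers tells us little about why a given input
# produces a given output - the associations are not human-readable.
rng = np.random.default_rng(0)
vocab = ["hello", "tay", "repeat"]          # toy vocabulary
W_xh = rng.standard_normal((4, len(vocab)))  # input-to-hidden weights
W_hh = rng.standard_normal((4, 4))           # hidden-to-hidden (the "memory")
W_hy = rng.standard_normal((len(vocab), 4))  # hidden-to-output weights

def step(h, token_index):
    x = np.zeros(len(vocab))
    x[token_index] = 1.0                     # one-hot encode the input token
    h = np.tanh(W_xh @ x + W_hh @ h)         # new hidden state carries context
    scores = W_hy @ h                        # scores over possible next tokens
    return h, scores

h = np.zeros(4)
for word in ["hello", "tay"]:
    h, scores = step(h, vocab.index(word))
print(scores.shape)  # just an array of numbers, no labeled associations
```

Every input token, whether harmless or offensive, is just an index that flows through the same matrices; there is no annotated category we could inspect or censor inside the model.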
If we looked at it, we would not be able to say anything about why the network outputs a specific sentence for some input. We could not determine which internal associations were made. To Tay, which was presumably trained with some form of neural network, tokens like 'Hitler' or 'race' are not encoded in some annotated category structure that we could strictly control. They are simply input tokens that cause some neurons to fire, depending on how the network was trained. Tay's training mechanism exposed the ability to train the model to the public through its 'repeat after me' feature, allowing users to generate arbitrary associations.
Tay is, of course, more than just a simple neural network, but this is the core of how training such models works. We cannot censor associations within the model if we don't know what those associations are. We either have to censor the training (e.g., filtering the input) or the output.
The Future: Compositional Semantics
If you have followed the article this far, you should now have some understanding of what NLP can do. Statistical models (neural networks are a class of them) rule everything. The more data and computing power in the cloud, the more precise the machine translation, speech recognition, question answering and so forth. Both data and computing power are now sufficiently available to have caused an NLP explosion. So what is next? A look at research from the past year gives us an idea of the shape of things to come. As explained above, most of the work prior to deep learning was focused (for practical reasons) on relatively short sequences of words. This enabled you to ask Siri questions, but limited conversational coherence.
Current research is trying to build a new generation of models by considering compositional semantics, that is, capturing the combined semantics of longer or multiple sequences (see recent papers on this topic). By incorporating concepts of memory and attention, talking to machines will indeed feel more natural in the coming years. This is also what underpins Microsoft's new chat bot initiative.
I hope this article has cleared up some questions about NLP. The convergence of big data processing, cloud computing and deep learning has finally delivered a clear path towards natural conversation with machines. What does this mean for the market? First, I would like to say that I don't see a clear leader in NLP. Alphabet (NASDAQ:GOOG) (NASDAQ:GOOGL), Microsoft, IBM and Facebook (NASDAQ:FB) are all strongly equipped to exploit these advances, with perhaps slightly different bleeding-edge expertise at each company. Amazon's Echo is a neat device, but I would still point out that it is more one-way querying than conversation, and AWS is actually behind in offering machine learning services.
The point is that no company has a decisive technological advantage that would decide the race in NLP (this is different for other subfields of AI). This implies that soft factors will be decisive, e.g. finding the right product for the current and future market demand and selling it. Alphabet and Microsoft have just announced initiatives of selling language technology through cloud services. IBM is already doing this through the Watson Cloud. If there is enough interest, I might do a deeper analysis of Microsoft's new services in another article. Other than that, it currently looks like this will be more of an oligopoly. I think contributors stating Microsoft has taken the lead are overestimating Microsoft's marketing here. We will have to wait and see how these initiatives play out.
Disclosure: I/we have no positions in any stocks mentioned, and no plans to initiate any positions within the next 72 hours.
I wrote this article myself, and it expresses my own opinions. I am not receiving compensation for it (other than from Seeking Alpha). I have no business relationship with any company whose stock is mentioned in this article.