Date: 2019-08-12

The next step for Tesla's partial autonomy software will come when Tesla uses the full capacity of its new Full Self-Driving Computer, possibly in Q4. Future progress in computer vision, behaviour prediction and imitation learning will also be interrelated.

Tesla’s advantage is that it has over 500,000 vehicles on the road that can collect data to train neural networks to predict road user behaviour.

A self-driving car has to continually answer the following question: what will the cars, pedestrians and cyclists around me do over the next five seconds? This problem is called behaviour prediction (or simply prediction). Two prominent self-driving car engineers have recently expressed their belief in the importance of behaviour prediction.

The importance of behaviour prediction

One is Anthony Levandowski, a former top engineer at Waymo (GOOG, GOOGL). Levandowski is now the CEO of the self-driving car startup Pronto. In a blog post in December, Levandowski expressed his view that the “predictive shortcomings” of self-driving car prototypes are what limit them from achieving “true level 4 or 5” autonomy. Levandowski writes:

...the reason why nobody has achieved this level of functionality is because today’s software is not good enough to predict the future. It’s still nowhere close to matching the instincts of human drivers, which is the single most important factor in road safety.

In an interview with TechCrunch, Levandowski reiterated this idea:

If you were to analyze all the disengages that people have done and try to break it down like what’s the actual real reason behind it… all of them were a software failure, and they’re mainly a software failure on mature companies at trying to understand what the vehicle’s gonna [be] doing - or the pedestrians around you - having a misunderstanding or miscommunication with them. And so that's where the value’s gonna be. …right now, the problem is not better lasers. The problem is… prediction.

A similar view is held by Chris Urmson, who led Waymo from 2013 to 2016 and who now runs the self-driving car startup Aurora. This is what Urmson said in a recent interview with MIT researcher Lex Fridman:

...if I could wave a magic wand, what part of the system would I make work today to accelerate it as quickly as possible? … It’s really that perception forecasting ability. So, if tomorrow you could give me a perfect model of what's happened, what is happening, and what will happen for the next five seconds around a vehicle on the roadway, that would accelerate things pretty dramatically.

Tesla’s advantage

Companies like Waymo and Tesla (TSLA) are trying to solve behaviour prediction with deep learning, an approach in which deep (i.e., multi-layered) neural networks are trained on large data sets. Neural networks achieve better accuracy with more data, so the more data a company can collect, the better its vehicles will be at behaviour prediction. Andrej Karpathy, Tesla’s Director of AI, explained how this process works at Tesla Autonomy Day:

With some applications of deep learning for self-driving cars, such as object detection, companies may suffer from a bottleneck created by the need to pay humans to manually label images or frames of video. (Unless they use self-supervised learning for object detection, but that’s a topic for another article.) With object detection, the data that is inputted into the neural network might be a frame of video that contains a pedestrian. The desired output might be a box drawn around the pedestrian appended with the label “pedestrian.” To train the neural network to do this, it needs to be fed thousands or millions of examples of video frames of pedestrians with boxes drawn around them and the label “pedestrian” appended to those boxes. Humans have to create those examples by looking at frames of video, drawing boxes, and appending labels. This makes neural network training a labour-intensive and therefore costly process. The input-output pairs used for training have to be made by hand.

With behaviour prediction, the input data might be what another car did over the last five seconds. The desired output data would be a prediction about what that car will do over the next five seconds. If you have a ten-second recording of what a car did, then you have an input-output pair that can be used for training. You don’t need a human to manually label anything.
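To make the self-labelling idea concrete, here is a minimal sketch of how a recording could be cut into training pairs. The track format (a list of x, y positions sampled at 10 Hz) and the function name are my own assumptions for illustration, not a description of Tesla's data format.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position in metres; hypothetical format


def make_training_pairs(
    track: List[Point], hz: int = 10, horizon_s: float = 5.0
) -> List[Tuple[List[Point], List[Point]]]:
    """Slice one recorded track into (past, future) training pairs.

    The first `horizon_s` seconds of each window are the input and the
    following `horizon_s` seconds are the target, so the recording itself
    supplies the ground truth and no human labelling is involved.
    """
    n = int(hz * horizon_s)  # samples per five-second window
    pairs = []
    # Every start index that leaves room for a full past window and a full
    # future window yields one pair.
    for start in range(0, len(track) - 2 * n + 1):
        past = track[start : start + n]
        future = track[start + n : start + 2 * n]
        pairs.append((past, future))
    return pairs


# A ten-second recording at 10 Hz of a car moving at 15 m/s along x.
recording = [(15.0 * t / 10, 0.0) for t in range(100)]
pairs = make_training_pairs(recording)
print(len(pairs))  # a ten-second recording yields exactly one pair
```

A longer recording yields more pairs because the window can slide: under the same assumptions, a twelve-second track produces 21 overlapping pairs.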

It may not even be necessary for a company to upload video. Instead, a vehicle can simply save a recording of an abstracted representation of what is going on around it. That abstracted representation might look something like the boxes, lines, and text labels overlaid on this video from Tesla hacker verygreen, as well as the green shading that represents drivable roadway.

Without video, it might look something like this:

Tesla’s advantage in behaviour prediction is that it has over 500,000 Hardware 2 and 3 vehicles on the road. Hardware 2 and Hardware 3 vehicles have eight cameras covering 360 degrees, a forward-facing radar, a computer for running neural networks, and the ability to save data while driving and later upload it via customers’ Wi-Fi when parked at home. Particularly if Tesla uploads only the abstracted representations from Hardware 2 and 3 cars, and not raw video, then the fleet is a massive source of training data for behaviour prediction. With abstracted representations, there is no need for humans to do any labelling. Since neural networks improve with more data, this is an advantage for Tesla in behaviour prediction. Since behaviour prediction is so important for autonomous driving, this is an advantage for Tesla in autonomous driving technology.

Data isn't uploaded indiscriminately from the fleet, either. For instance, it's particularly valuable to upload an example where a Tesla’s behaviour prediction neural network made a wrong prediction. Training on corrected mistakes is a much quicker way to improve than training on random data. The value of data shouldn’t be measured simply based on the absolute number of examples. Sheer quantity isn’t the goal.
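One way to picture selective uploading is a trigger that compares what the network predicted with what actually happened and flags only the large misses. The sketch below is my own illustration: the error metric (average displacement between predicted and observed positions) and the 2-metre threshold are assumptions, not anything Tesla has disclosed.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position in metres; hypothetical format


def average_displacement_error(predicted: List[Point], actual: List[Point]) -> float:
    """Mean Euclidean distance between predicted and observed positions."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.hypot(p[0] - a[0], p[1] - a[1]) for p, a in zip(predicted, actual))
    return total / len(predicted)


def should_upload(predicted: List[Point], actual: List[Point], threshold_m: float = 2.0) -> bool:
    """Flag a clip for upload only when the prediction missed by a wide margin."""
    return average_displacement_error(predicted, actual) > threshold_m


# The network predicted the other car would stay in its lane (y = 0)...
predicted = [(10.0, 0.0), (20.0, 0.0), (30.0, 0.0), (40.0, 0.0), (50.0, 0.0)]
# ...but it actually drifted sideways, as in a cut-in.
observed = [(10.0, 0.0), (20.0, 1.0), (30.0, 2.5), (40.0, 3.5), (50.0, 3.5)]

print(should_upload(predicted, observed))   # the miss exceeds the threshold
print(should_upload(predicted, predicted))  # a perfect prediction is not uploaded
```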

Take the so-called “long tail”: rare occurrences that might only happen once in a million miles. A company whose fleet drives a billion miles a month (as Tesla's will once it reaches about 1 million vehicles) could collect up to 1,000 examples per month of once-in-a-million-mile occurrences. These data sets of rare occurrences would be relatively small yet extremely valuable since dealing with the long tail is critical to autonomous driving.
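The arithmetic behind that figure is simple enough to write down. In the sketch below, the 1,000 miles per vehicle per month is not a number I have sourced separately; it is just what a billion fleet miles from about 1 million vehicles implies.

```python
def expected_rare_examples(fleet_miles_per_month: float, miles_per_event: float) -> float:
    """Upper bound on monthly examples of an event that occurs once per `miles_per_event`."""
    return fleet_miles_per_month / miles_per_event


vehicles = 1_000_000
miles_per_vehicle_per_month = 1_000  # implied by the figures above, not separately sourced
fleet_miles = vehicles * miles_per_vehicle_per_month  # one billion miles per month

print(expected_rare_examples(fleet_miles, 1_000_000))  # 1000.0
```

This is an upper bound because it assumes every occurrence is detected and saved.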

The business impact

Nobody knows for sure when full autonomy will be achieved. It could be next year or it could be more than a decade from now. When full autonomy is achieved, firms such as ARK Invest, UBS (UBS), McKinsey, and Intel (INTC) believe it will create a robotaxi industry that will eventually grow to trillions in global revenue. That's why it’s worth diving deep into topics like deep learning, neural networks, and behaviour prediction. If Hardware 3 Teslas achieve full autonomy and can be operated as robotaxis, Tesla’s stock price may increase by more than 20x over the long term, according to a financial model created by ARK Invest.

Cruise, a General Motors (GM) subsidiary, is valued at $19 billion today with no revenue. In August 2018, Morgan Stanley's (MS) price target for Alphabet valued its subsidiary Waymo at $45 billion, more than Tesla's market cap today. The same month Evercore ISI (EVR) gave Waymo a discounted present value of $65 billion. Waymo itself has reportedly been seeking outside investors at a valuation “at least several times” that of Cruise. If behaviour prediction is truly the hardest and most important problem for autonomous driving, and if Tesla truly has a major advantage over Waymo and Cruise in behaviour prediction, then analysts and investors should value Tesla's opportunity in robotaxis and robotrucks more than they value Waymo or Cruise.

ARK Invest's robotaxi valuation forecast from its Big Ideas 2019 report.

Even if full autonomy never happens, there is an opportunity for Tesla in partial autonomy. Navigate on Autopilot, Enhanced Summon, and future partial autonomy features can differentiate Teslas from all other vehicles. Better behaviour prediction will make these features better. Until other car manufacturers (or possibly a Tier 1 supplier) copy Tesla’s approach of using a huge fleet of mass-produced cars for deep learning (what Tesla calls “fleet learning”), no company’s partial autonomy features will compete. Commentators often think of Tesla just as an electric car maker and argue that incumbents will eventually be able to make better electric cars. But Tesla’s partial autonomy software may prove to be a bigger differentiator from incumbents than its electric powertrain.

The company cultures of incumbents are not software-oriented. For example, while Tesla has been doing over-the-air software updates since 2012, incumbents are only now starting to follow suit. Even if incumbents want to emulate Tesla with regard to partial autonomy software, they will likely lag years behind. This is a potential source of durable competitive advantage for Tesla over the long term.

What comes next

According to CEO Elon Musk, Tesla’s current neural networks and other autonomy-related software are only using 5-10% of the computational power of Tesla’s new, custom-designed Full Self-Driving Computer. (Cars with the FSD Computer are referred to as Hardware 3 cars.) Since more computationally intensive neural networks tend to do better on their assigned tasks, it would be logical for Tesla to use the FSD Computer’s full capacity. On the Q3 2018 earnings call, Karpathy said that bigger neural networks were coming, enabled by the FSD Computer. More recently, Musk tweeted that the functionality of cars with the FSD Computer would begin to diverge from that of cars without it in Q4 of this year. Keeping in mind Musk’s confessed and well-documented punctuality problem, we can look to Q4 for the next significant step for Tesla’s autonomy software, but not wait with bated breath.

The milestone to watch for is when Tesla starts using all or almost all of the FSD Computer’s power. Karpathy’s public comments suggest to me that his team has been developing new neural networks for some time. The new networks won’t just be bigger, but also architecturally improved (i.e., improved in terms of the types of artificial neurons and their interconnections). The ideal outcome for Tesla would be a sudden, major step up in performance.

How vision, prediction, and imitation interrelate

Performance of different subsystems may also be interrelated. If a computer vision neural network fails to detect a vehicle, a behaviour prediction neural network that is downstream of object detection will necessarily fail to predict whether the vehicle will cut into the Tesla’s lane. Similarly, the abstracted representations that may be uploaded to train behaviour prediction are only as good as the computer vision networks that generate them. In both training and inference (i.e., the real time application of a neural network on the road), improvements in computer vision may help behaviour prediction.
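The dependency is easy to see in a toy pipeline. In the sketch below, a constant-velocity extrapolation stands in for the behaviour prediction network, and the `Detection` fields, lane width, and object names are all hypothetical. The point is only that a road user missing from the vision output cannot receive a forecast at all.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Detection:
    """Hypothetical output of the vision stage for one road user."""
    object_id: str
    lateral_offset_m: float   # signed distance from the centre of our lane
    lateral_speed_mps: float  # signed sideways speed


def predict_cut_ins(
    detections: List[Detection], horizon_s: float = 5.0, lane_half_width_m: float = 1.8
) -> Dict[str, bool]:
    """Stand-in predictor: extrapolate sideways motion at constant velocity."""
    forecasts = {}
    for d in detections:
        future_offset = d.lateral_offset_m + d.lateral_speed_mps * horizon_s
        forecasts[d.object_id] = abs(future_offset) < lane_half_width_m
    return forecasts


car_a = Detection("car_a", lateral_offset_m=3.5, lateral_speed_mps=-0.5)  # drifting toward us
car_b = Detection("car_b", lateral_offset_m=-3.5, lateral_speed_mps=0.0)  # holding its lane

full_scene = predict_cut_ins([car_a, car_b])
missed_scene = predict_cut_ins([car_b])  # vision failed to detect car_a

print(full_scene)    # car_a is forecast to cut in
print(missed_scene)  # no forecast exists for car_a at all
```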

The same idea applies to imitation learning. Tesla has been using imitation learning for path prediction:

With imitation learning, the neural network takes some input data: possibly raw video but more likely in my opinion an abstracted representation generated by computer vision networks. The desired output data from the neural network is the action for the car to take, which is translated by the control software into steering, braking, and accelerating commands. With hundreds of thousands of Tesla drivers at the wheel, Tesla can collect abundant output data simply by recording what actions human drivers took. Coupled with the abstracted representation of the car’s surroundings, this constitutes the input-output pair that is used for training. In imitation learning, this input-output pair is referred to as a state-action pair: it includes the state of the world or the environment and the action taken by an agent, such as a human driver.

Imitation learning is similar to behaviour prediction in that the input-output pairs can be generated automatically without any need for humans to manually label the data. With enough state-action pairs, a neural network will learn which states prompt which actions from human drivers. With sufficient training, the neural network will be able to generate those actions on its own. It will thereby learn to drive.
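As a rough illustration of learning from state-action pairs, the sketch below fits a policy to a handful of recorded human actions. To keep it short I have swapped the neural network for a one-variable least-squares fit, and the state (lateral offset from lane centre) and action (steering command) are hypothetical simplifications of my own.

```python
from typing import List, Tuple


def fit_linear_policy(pairs: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Fit action = w * state + b to (state, action) pairs by least squares."""
    n = len(pairs)
    mean_s = sum(s for s, _ in pairs) / n
    mean_a = sum(a for _, a in pairs) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in pairs)
    var = sum((s - mean_s) ** 2 for s, _ in pairs)
    if var == 0:
        raise ValueError("states must not all be identical")
    w = cov / var
    b = mean_a - w * mean_s
    return w, b


def act(state: float, w: float, b: float) -> float:
    """Generate an action for a state the policy may never have seen."""
    return w * state + b


# Recorded demonstrations: (lateral offset in metres, human steering command).
# The human steers back toward the lane centre in proportion to the offset.
demonstrations = [(-1.0, 0.2), (-0.5, 0.1), (0.0, 0.0), (0.5, -0.1), (1.0, -0.2)]

w, b = fit_linear_policy(demonstrations)
print(act(0.25, w, b))  # steering command for an offset not in the demonstrations
```

No human labelled anything here: the "label" for each state is simply what the driver did.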

If an abstracted representation is used for imitation learning, then reducing computer vision errors will also reduce imitation errors in both training and inference. Moreover, improving behaviour prediction could improve imitation. The input data for imitation need not be limited to what’s generated by the computer vision networks. A behaviour prediction network’s forecasts can be used as additional input data. Waymo’s imitation network, ChauffeurNet, does this. This allows the imitation network to correlate human actions not just to what the computer vision networks see now, but also to what the behaviour prediction networks forecast will happen over the next five seconds.
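Here is a minimal sketch of what adding forecasts to the input could look like. The feature layout is my own invention for illustration and is not ChauffeurNet's actual input format. It shows two scenes that look identical to the vision networks but become distinguishable, and therefore learnable, once a forecast is appended.

```python
from typing import List, NamedTuple


class StateActionPair(NamedTuple):
    state: List[float]
    action: str


def make_pair(vision: List[float], forecast: List[float], human_action: str) -> StateActionPair:
    """Build a training pair whose state combines vision and prediction outputs."""
    return StateActionPair(state=[*vision, *forecast], action=human_action)


# Hypothetical vision features: pedestrian 12 m ahead, 3 m to the side.
vision = [12.0, 3.0]

# Hypothetical forecast feature: probability the pedestrian enters our path
# within the next five seconds.
pair_walking_fast = make_pair(vision, [0.9], human_action="brake")
pair_standing = make_pair(vision, [0.1], human_action="continue")

# The vision part of the state is identical; only the forecast separates them.
print(pair_walking_fast.state[:2] == pair_standing.state[:2])
print(pair_walking_fast.state != pair_standing.state)
```

Without the forecast, the imitation network would see the same state paired with two different human actions and could not learn which one applies.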

To learn all the correct correlations between environment states and drivers’ actions, the imitation network needs to work with all the same information humans are working with. Humans drive not just based on what we see, but also our ability to anticipate what other humans on the road will do. The computer vision networks attempt to recreate what humans see outside the car that causes them to take certain driving actions. The behaviour prediction networks attempt to recreate the predictive process going on inside the driver’s brain that also causes them to take certain driving actions, such as preemptively stopping for a fast-walking jaywalker before they enter the car's path. (In theory, all this predictive information is latent in what the computer vision networks see, but by the same token, the abstracted representations generated by the vision networks are latent in the raw pixels from the video cameras. Future robots may well generate actions directly from pixels, but for now machine learning engineers tend to prefer to decompose problems into discrete parts like vision, prediction and imitation.)

So, an improvement in prediction may translate into an improvement in imitation if prediction is used as an input for imitation. Prediction and imitation, in turn, will likely both be improved by an improvement in vision, which is an input to both. Vision improvements flow downstream to prediction and imitation, and prediction improvements flow downstream to imitation.

Conclusion

When the topic of Tesla’s fleet-wide data collection effort comes up, one of the most common retorts I hear is that Tesla has no advantage because of the cost of hand-labelling images under the standard supervised learning paradigm for computer vision. This retort overlooks behaviour prediction, in which hand-labelling can be avoided entirely. It also overlooks imitation learning, in which hand-labelling is also avoided.

Even with traditional supervised learning for computer vision, Tesla’s fleet can be used to surface examples of rare objects, rare lighting and weather conditions, and other rare edge cases. For example, a deep neural network that has been trained to recognize horses can run on the car and trigger the cameras to save a snapshot anytime the network thinks it sees a horse. This is a way to get examples of horses, which are a relatively rare class of object.
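A trigger like this can be sketched in a few lines. The per-frame class scores, the class names, and the 0.5 cutoff below are assumptions for illustration; I don't know how Tesla's triggers are actually configured.

```python
from typing import Dict, List


def should_save_snapshot(
    class_scores: Dict[str, float], target_class: str = "horse", min_score: float = 0.5
) -> bool:
    """Save the frame if the on-car network thinks it sees the target class."""
    return class_scores.get(target_class, 0.0) >= min_score


# Hypothetical classifier outputs for three consecutive frames.
frames: List[Dict[str, float]] = [
    {"car": 0.97, "horse": 0.02},
    {"horse": 0.81, "car": 0.10},
    {"pedestrian": 0.88},
]

saved = [i for i, scores in enumerate(frames) if should_save_snapshot(scores)]
print(saved)  # [1]
```

A lower cutoff would surface more candidate frames, including false positives that human labellers could then correct.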

A promising future area of research and development for computer vision is self-supervised learning, in which the training signal (i.e., the source of ground truth or evaluation of the neural network’s inferences) comes not from human labellers, but from the data itself. Tesla has been experimenting with self-supervised learning for depth perception:

I hope we can retire these well-worn arguments:

It’s too expensive for Tesla to label much data. (Not all data needs to be labelled, and the rarity of labelled image data also matters, not just the sheer quantity. Deep learning-based upload triggers can be used to surface rare examples.)

It’s too expensive for Tesla to upload much data over cellular networks. (Data is uploaded via Tesla drivers' Wi-Fi.)

If Tesla were uploading much data over customers’ Wi-Fi, we would know. (For raw video this may be true, but for abstracted representations this is more dubious. As a point of comparison, Mobileye says it can generate camera-based HD maps from about 16 kilobytes per mile. HD maps don't include road users or driver input like braking, so Tesla's abstracted representations would need to be bigger. But even if they were 1,000x bigger, 16 megabytes per mile, Tesla drivers who have observed their cars uploading on the order of 1 gigabyte per month could be contributing up to 60 miles of driving data per month to Tesla's training set. Across the whole fleet, that would be 32 million miles per month. These numbers are just for illustrative purposes; the point is that abstracted representations are much more compact than video.)

More data just gets you steeply diminishing returns, so it’s practically useless anyway. (This is not true when you go from no examples or very few examples of a rare class to many examples. Also, even with diminishing returns, simply piling on more data has been remarkably effective in instances like OpenAI’s language generation network GPT-2.)
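For readers who want to check the upload-size estimate in the third argument above, the arithmetic can be laid out as follows. All inputs are the illustrative figures already quoted. The fleet size of 500,000 is the lower bound of "over 500,000", and the only thing I have added is the choice between decimal and binary gigabytes.

```python
def miles_per_car_per_month(upload_mb: float, mb_per_mile: float) -> float:
    """Miles of abstracted driving data that fit in one car's monthly upload."""
    return upload_mb / mb_per_mile


HD_MAP_KB_PER_MILE = 16  # Mobileye's figure as quoted above
SIZE_MULTIPLIER = 1_000  # illustrative: representations 1,000x bigger than HD maps
FLEET_SIZE = 500_000     # lower bound of "over 500,000" vehicles

mb_per_mile = HD_MAP_KB_PER_MILE * SIZE_MULTIPLIER / 1_000  # 16 MB per mile

for label, upload_mb in (("decimal GB", 1_000), ("binary GB", 1_024)):
    miles = miles_per_car_per_month(upload_mb, mb_per_mile)
    print(label, miles, miles * FLEET_SIZE)
# decimal GB 62.5 31250000.0
# binary GB 64.0 32000000.0
```

Either way the result is roughly 60 miles per car and a little over 30 million miles across the fleet each month, consistent with the rounded figures above.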

Some investors and analysts may find these discussions too in the weeds or even esoteric. Yet Tesla’s future valuation depends quite directly on these somewhat obscure technical issues. Firms like ARK Invest, UBS, McKinsey, and Intel agree that this business opportunity for self-driving cars will eventually be measured in the trillions, although they disagree on the timing of the technology. Others like Morgan Stanley and Evercore ISI see the market leader eventually growing into a triple-digit billion-dollar valuation. My advice to sell-side analyst firms is to hire machine learning analysts (or assign existing ones) to look at Tesla’s potential data advantage relative to Waymo. If what I’ve postulated is correct, sell-side firms may be mispricing Tesla relative to their own valuations of Waymo. That’s my contrarian thesis.

Disclosure: I am/we are long TSLA. I wrote this article myself, and it expresses my own opinions. I am not receiving compensation for it (other than from Seeking Alpha). I have no business relationship with any company whose stock is mentioned in this article.