Seeking Alpha

Tesla: Self-Supervised Learning, Dojo, And Full Self-Driving

About: Tesla, Inc. (TSLA)
by: Trent Eady
Tech, carmakers, long-term horizon
Summary

In 2020, self-supervised learning could advance the state of the art in computer vision.

Tesla is working on a dedicated computer, Dojo, for training neural networks using self-supervised learning.

Through a technique called active learning, Tesla can automatically curate only the most useful video clips for self-supervised learning from its fleet of roughly 750,000 camera-equipped, Internet-connected cars.

Tesla's large fleet also provides other advantages in computer vision and in behavior prediction and decision-making.

The trillion-dollar question: are robotaxis possible? If so, and if Tesla can deploy robotaxis at scale, a quadruple-digit stock price is possible.

From my perspective, the most important unknown variable that will affect Tesla’s (TSLA) long-term valuation is the company’s ability (or lack thereof) to launch a fully autonomous taxi (or robotaxi) service. This is why I keep close track of Tesla’s latest software updates and why I try to educate myself about the field of machine learning, so that I can hopefully better understand what’s happening under the hood.

Here’s the latest news. Tesla recently rolled out what it calls a “Full Self-Driving Visualization Preview.” The software update displays object detections on the car’s screen, presenting stop signs, traffic lights (with changing colors), lane lines, turn arrows, and even garbage cans. The visualization looks kind of like a video game with minimalist graphics. This vlog shows the new software in action:

Tesla is evidently continuing to evolve its computer vision capabilities, with the near-term goal of releasing new Autopilot features for city streets. In this article, I’ll explore what I believe Tesla is doing with computer vision and why I believe it has a competitive advantage in this area.

Weakly supervised learning for computer vision

In my previous article on Tesla, I discussed how human behavioral cues provide automatic labels for camera data. This automatically labeled data can be used to train artificial neural networks on computer vision tasks necessary for autonomous driving. The technical name for this approach is weakly supervised learning. The main example I explored was labeling areas where humans drive as “free space” (i.e., empty space) and everywhere else as not free space (i.e., occupied space). Another related example (which arguably extends beyond vision) is predicting the curvature of the road based on the human driver’s steering angle:

Beyond these examples, there may be all kinds of relationships between human behavior and what’s in the environment. For instance, a generally good (but slightly imperfect) predictor of whether a traffic light is red or green is whether human drivers stop or go. The advantage of weakly supervised learning for Tesla is the ability to collect roughly 1,000x more automatically labeled training data than competitors like Cruise (GM) and Waymo (GOOG, GOOGL) that have roughly 1/1,000th as many vehicles on the road. Based on research by Baidu (BIDU), with 1,000x more data, Tesla could potentially beat its competitors on neural network performance by 10x or more for tasks where automatic labels can be obtained.
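To make the idea concrete, here is a minimal sketch of how logged driving behavior could be turned into weak labels for a traffic-light classifier. This is my own illustration, not Tesla's pipeline; the event fields, speed thresholds, and labeling rule are all assumptions.

```python
# Minimal sketch of weak-label generation from driver behavior.
# Hypothetical field names and thresholds; not Tesla's actual pipeline.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class IntersectionEvent:
    image_id: str          # frame captured as the car approached a traffic light
    speed_before: float    # m/s a few seconds before the intersection
    speed_at_light: float  # m/s at the stop line

def weak_traffic_light_labels(events: List[IntersectionEvent]) -> List[Tuple[str, str]]:
    """Label frames 'red' if the driver came to a stop, 'green' if they kept going.

    These are noisy ("weak") labels: drivers sometimes stop on green or run reds,
    so a model trained on them needs enough data to average out the noise.
    """
    labels = []
    for e in events:
        if e.speed_before > 2.0 and e.speed_at_light < 0.5:
            labels.append((e.image_id, "red"))     # driver stopped: probably red
        elif e.speed_at_light > 2.0:
            labels.append((e.image_id, "green"))   # driver kept moving: probably green
        # ambiguous cases (slow rolling, yielding) are simply skipped
    return labels

# Example usage
events = [
    IntersectionEvent("frame_001.jpg", speed_before=12.0, speed_at_light=0.0),
    IntersectionEvent("frame_002.jpg", speed_before=11.5, speed_at_light=10.8),
]
print(weak_traffic_light_labels(events))
# [('frame_001.jpg', 'red'), ('frame_002.jpg', 'green')]
```

The point is that the labels come for free from logged driving data. They are imperfect, but at fleet scale the noise tends to wash out.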

Weakly supervised learning stands in contrast to fully supervised learning, which is the most common form of deep learning used in computer vision. In fully supervised learning, images or videos are laboriously hand-labeled by human annotators. Fully supervised learning can only be scaled up so far before the labor cost becomes prohibitive. However, it can be used in combination with weakly supervised learning and other approaches, so, personally, I expect it will continue to play an important role for the foreseeable future.

Self-supervised learning for computer vision

Self-supervised learning is another approach that attempts to avoid the labor cost of manually labeling data. As the name might suggest, in self-supervised learning, the data supervises itself. That is, the training signal that tells a neural network which outputs are correct and which are incorrect comes from the data itself. Let me give a concrete example.

A company like Tesla can collect a vast amount of image data from the cameras on its cars. Self-supervised learning can attempt to learn the internal structure of those images (i.e., the recurring patterns within them) by training on a task that is a proxy for what we really want the neural network to do. The technical term for this is a proxy task (also called a pretext task).

For example, the proxy task might be to take an image that has had random patches removed and fill in the missing pixels. During training, the network sees images with patches masked out, and the original, unaltered images provide the ground truth for the missing pixels. At test time, the network is shown a new set of images it has never seen before with random patches missing, and its accuracy can be judged by comparing the generated pixels against the real ones.
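Here is a hedged sketch of what such an inpainting proxy task can look like in code: mask a few random patches, ask a small network to reconstruct the image, and compute the loss only on the masked pixels. The tiny network, patch size, and image size are illustrative choices of mine, not anything Tesla or LeCun has described.

```python
# Minimal sketch of a "fill in the missing patches" proxy task (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_random_patches(images, patch=16, n_patches=4):
    """Zero out a few random square patches; return masked images and the mask."""
    masked = images.clone()
    mask = torch.zeros_like(images)
    _, _, h, w = images.shape
    for b in range(images.shape[0]):
        for _ in range(n_patches):
            y = torch.randint(0, h - patch, (1,)).item()
            x = torch.randint(0, w - patch, (1,)).item()
            masked[b, :, y:y+patch, x:x+patch] = 0.0
            mask[b, :, y:y+patch, x:x+patch] = 1.0
    return masked, mask

# A deliberately tiny encoder-decoder; a real system would be far larger.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.rand(8, 3, 64, 64)          # stand-in for camera frames
masked, mask = mask_random_patches(images)
reconstruction = model(masked)

# The training signal comes from the images themselves: only masked pixels count.
loss = F.mse_loss(reconstruction * mask, images * mask)
loss.backward()
optimizer.step()
print(f"inpainting loss: {loss.item():.4f}")
```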

Here’s a short clip of this idea being discussed by deep learning pioneer and Turing Award winner Yann LeCun, who currently serves as Facebook’s (FB) Chief AI Scientist:

LeCun’s full talk is a wonderful introduction to self-supervised learning.

This “fill in the blanks” idea can be used for missing patches in frames of video. So, a company like Tesla could also use sequential frames from video clips rather than just still images.

As I understand it, in the process of training on a proxy task such as this, a neural network learns to internally represent aspects of the physical world, including objects like cars, people, and bicycles and surfaces like roadway, sidewalk, and grass.
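Continuing the hedged illustration, the same idea extends naturally to video: hide one frame of a short clip and train a network to reconstruct it from the surrounding frames. The clip length and toy architecture below are, again, my own assumptions for illustration.

```python
# Minimal sketch of "fill in the blanks" on video: mask the middle frame of a
# short clip and reconstruct it from the surrounding frames (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

CLIP_LEN = 8                                 # frames per clip (assumed)
clip = torch.rand(4, CLIP_LEN, 3, 64, 64)    # (batch, time, channels, H, W)

masked = clip.clone()
target_idx = CLIP_LEN // 2                   # hide the middle frame
masked[:, target_idx] = 0.0

# Treat time as extra channels and reconstruct the hidden frame.
model = nn.Sequential(
    nn.Conv2d(CLIP_LEN * 3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

b, t, c, h, w = masked.shape
prediction = model(masked.reshape(b, t * c, h, w))

# The hidden frame itself is the training signal; no human labels are involved.
loss = F.mse_loss(prediction, clip[:, target_idx])
loss.backward()
optimizer.step()
print(f"frame reconstruction loss: {loss.item():.4f}")
```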

If you’re curious, this video explains in depth how computer vision neural networks develop internal representations of objects and object features in the course of training:

Training a neural network on a proxy task (or multiple proxy tasks) is referred to as pre-training. The same neural network can then also be trained on manually labeled images or video in what's known as fine-tuning. Annotators draw three-dimensional boxes around objects like cars and color-code each pixel of surfaces like roadway. The neural network learns these explicit labels faster and better because it already has internal representations of these visual phenomena. During fully supervised learning, the network improves those existing representations and associates them with explicit labels. That's how self-supervised pre-training boosts the performance of fully supervised learning.
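Here is a minimal sketch of how pre-training and fine-tuning fit together in code: take convolutional layers assumed to have been pre-trained on a proxy task, attach a new classification head for explicit labels, and train on a small hand-labeled batch. The layer sizes, class set, and learning rates are illustrative assumptions.

```python
# Minimal sketch of fine-tuning a pre-trained encoder on a few hand-labeled images.
import torch
import torch.nn as nn

# Pretend these convolutional layers were already pre-trained on a proxy task.
pretrained_encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# New classification head for explicit labels (e.g., car / pedestrian / cyclist).
head = nn.Linear(32, 3)
model = nn.Sequential(pretrained_encoder, head)

# Fine-tune the whole network, typically with a smaller learning rate for the
# pre-trained layers so the learned representations are refined, not overwritten.
optimizer = torch.optim.Adam([
    {"params": pretrained_encoder.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])
criterion = nn.CrossEntropyLoss()

images = torch.rand(16, 3, 64, 64)     # a small hand-labeled batch
labels = torch.randint(0, 3, (16,))    # annotator-provided class labels

logits = model(images)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
print(f"fine-tuning loss: {loss.item():.4f}")
```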

In a recent paper published by DeepMind, researchers found that a neural network with self-supervised pre-training, given only half as many hand-labeled training images, did better at image recognition than the same network without pre-training trained on the full dataset. So, self-supervised pre-training can more than double a neural network's data efficiency.

In another version of the experiment, the researchers gave a pre-trained neural network just 1% of the typical hand-labeled training dataset. It beat the (otherwise identical) non-pre-trained network with 5% of the training dataset. That’s more than a quintupling of data efficiency.

Deep learning practitioners see self-supervised learning as an attractive area of research because, if the nut can be cracked, it will yield improvements to computer vision (and other deep learning tasks) that scale with data and compute rather than with human labor. Billions of hours of video exist on YouTube. Google has compiled an open dataset of 350,000 hours of curated clips from YouTube videos for use in deep learning research. The data is there.

Tesla has roughly 750,000 cars, each with eight surround cameras, that likely drive about an hour per day on average. That's about 20 million hours of 360-degree driving video per month across the fleet, or roughly 170 million hours of raw video per month once each of the eight camera feeds is counted separately. This is far more video than would ever be economically feasible to label by hand. But self-supervised learning, given the right proxy tasks, could likely extract good representations from an automatically curated fraction of that ocean of video. Those representations could then multiply the data efficiency of fully supervised learning many times.
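The arithmetic behind that estimate is simple; the one-hour-per-day figure is my assumption rather than a Tesla disclosure.

```python
# Back-of-the-envelope fleet video estimate (assumptions, not Tesla figures).
cars = 750_000
hours_per_car_per_day = 1      # rough average driving time per car
days_per_month = 30
cameras_per_car = 8

fleet_hours = cars * hours_per_car_per_day * days_per_month
camera_hours = fleet_hours * cameras_per_car

print(f"~{fleet_hours / 1e6:.0f} million hours of 360-degree driving per month")
print(f"~{camera_hours / 1e6:.0f} million hours of raw camera video per month")
# Roughly 22 million and 180 million hours, in the same ballpark as the
# ~20 million and ~170 million figures above.
```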

Close-up of a side camera on a Tesla Model 3. Photo by Steve Jurvetson.

How Tesla is using self-supervised learning for computer vision

At Tesla Autonomy Day in April, CEO Elon Musk signalled that self-supervised learning is a priority for the company. (Note: in this context, “unsupervised learning” is being used as a synonym for self-supervised learning.) Musk said:

The car is an inference-optimized computer. We do have a major program at Tesla — which we don’t have enough time to talk about today — called Dojo. That’s a super powerful training computer. The goal of Dojo will be to be able to take in vast amounts of data — at a video level — and do unsupervised massive training of vast amounts of video with the Dojo computer. But that’s for another day.

In a more recent talk, Tesla’s Senior Director of AI, Andrej Karpathy, said that the goal of the Dojo training computer is to achieve an order of magnitude increase in performance at a lower cost. It’s not clear how far along Dojo is in development or when it will be deployed.

One specific area where we know Tesla is exploring self-supervised learning for computer vision is perceiving depth:

With its fleet of roughly 750,000 camera-equipped, Internet-connected cars, Tesla can use active learning to select which video clips to save and upload over Wi-Fi. Active learning attempts to make learning as efficient as possible through various methods of selecting the most instructive training examples. For instance, Nvidia (NVDA) developed a method to automatically select video frames for training out of long driving videos: when different neural networks disagree about what's in a frame, that frame is selected. Nvidia then compared this automatic method to paying humans to review the footage and manually select frames. It found that neural network performance improved 3-4x more when trained on the automatically selected examples than on the manually selected ones.
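Here is a hedged sketch of the disagreement idea: run several independently trained models over candidate frames and keep the frames where their predictions diverge most. The ensemble size, scoring rule, and selection count are placeholders, not Nvidia's or Tesla's actual implementation.

```python
# Minimal sketch of disagreement-based active learning (illustrative only).
import torch
import torch.nn as nn

def make_classifier():
    # Tiny stand-in for a perception network; real ones are vastly larger.
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(16, 5), nn.Softmax(dim=1),
    )

# An ensemble of models trained separately (here just randomly initialized).
ensemble = [make_classifier() for _ in range(4)]

frames = torch.rand(100, 3, 64, 64)   # candidate video frames from the fleet

with torch.no_grad():
    # Stack each model's class probabilities: (n_models, n_frames, n_classes).
    probs = torch.stack([model(frames) for model in ensemble])

# Disagreement score: variance across models, averaged over classes.
disagreement = probs.var(dim=0).mean(dim=1)

# Keep only the frames the ensemble disagrees about most; those are the ones
# worth uploading, labeling, and training on.
top_k = 10
selected = torch.topk(disagreement, top_k).indices
print("frames selected for training:", selected.tolist())
```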

So, I strongly suspect Tesla will use active learning to automatically curate video clips from its fleet and that it will use those clips to automatically train neural networks via self-supervised learning, accelerated by Dojo. Yann LeCun, the aforementioned deep learning pioneer, predicts that researchers are on the cusp of a breakthrough in self-supervised learning on video. He expects that, in 2020, deep learning practitioners will finally be able to do self-supervised learning on video in earnest. When that happens, I believe Tesla will probably achieve the same sort of result on video-based tasks as DeepMind achieved with image recognition: a doubling, quintupling, or more of data efficiency.

It’s also worth noting that active learning can be applied to any form of data collection that Tesla does for machine learning. When training examples are manually labeled, active learning makes that labor more efficient. When bandwidth, data storage, or compute become constraints, active learning allows Tesla to get more neural network performance within those constraints. With about 750,000 cars on the road, Tesla presumably encounters orders of magnitude more of the rare, highly instructive examples than its competitors. With computer vision, it’s not only in self-supervised learning and weakly supervised learning that Tesla has an advantage in scale of data, but also in good ol’ fashioned fully supervised learning, thanks to active learning. Active learning also applies outside of computer vision.

What about lidar?

Tesla famously doesn’t use lidar and has no plans to do so. Encouragingly, Mobileye (INTC) recently released an impressive demo video showing one of its autonomous test cars navigating Jerusalem traffic with only eight cameras as its sensors. The test car also used only a third of the computing power that is in Tesla’s latest vehicles:

My opinion is that if a competitor such as Waymo shows that robotaxis are possible with lidar and Tesla is unable to develop robotaxis without it, the pivot to lidar won’t be an overwhelming challenge for Tesla. Tesla will be very late to the game, but it will also retain its large-scale fleet data advantage in almost every major area of autonomous driving other than lidar perception, including camera-based computer vision. If Waymo is the first to launch a true robotaxi business at scale, I believe Tesla can be a fast follower. In this scenario, it may be strategically wise for Tesla to acquire an autonomous vehicle startup that has been working on lidar perception for a long time.

With the income from robotaxis, Tesla can make things right for customers who purchased Full Self-Driving by offering to retrofit their cars, to buy back their cars and retrofit them itself, or to give them a lump sum of cash. This should be possible because a) robotaxis are expected to be extremely lucrative and b) retrofitted cars can be deployed as robotaxis.

The same idea applies to compute hardware. If Tesla doesn’t have enough compute in production cars to run large enough neural networks for robotaxis, it can pivot to a Waymo-like approach with expensive, heavy-duty hardware on its robotaxis.

Self-supervised learning for behavior prediction

The three main components of Tesla’s autonomous driving system are computer vision, behavior prediction, and planning (which is sometimes also called decision-making). Computer vision is what a car sees. Behavior prediction means anticipating the actions and trajectories of pedestrians, cyclists, vehicles, animals, and other moving objects on the road. Planning is how a car decides what actions to take and how it determines its trajectory through space and time.

Tesla can train neural networks for behavior prediction in much the same way that LeCun predicts self-supervised learning for computer vision will soon work. Rather than predicting future video frames, a behavior prediction network only needs to predict the trajectory of an abstract representation, such as a three-dimensional bounding box around a vehicle. The training is self-supervised because the computer vision system tells the prediction system whether the bounding box actually took the trajectory it predicted. No human annotation is needed; a hedged sketch of the idea follows the clip below. Karpathy explained this concept on Autonomy Day:
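Here is that sketch: feed a network the past positions of a tracked bounding box and train it to predict the positions the vision system observes a moment later. The horizons and toy architecture are assumptions of mine for illustration, not a disclosed Tesla design.

```python
# Minimal sketch of self-supervised trajectory prediction (illustrative only).
import torch
import torch.nn as nn

PAST_STEPS, FUTURE_STEPS = 10, 5   # assumed horizons, in frames

# Predict future (x, y) box centers from past (x, y) box centers.
model = nn.Sequential(
    nn.Flatten(),                        # (batch, PAST_STEPS * 2)
    nn.Linear(PAST_STEPS * 2, 64), nn.ReLU(),
    nn.Linear(64, FUTURE_STEPS * 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-ins for tracks produced by the vision system: past and future segments
# of the same observed trajectory. The future segment is the label itself,
# which is what makes this self-supervised; no human annotation is involved.
past_track = torch.rand(32, PAST_STEPS, 2)
future_track = torch.rand(32, FUTURE_STEPS, 2)

predicted = model(past_track).reshape(32, FUTURE_STEPS, 2)
loss = nn.functional.mse_loss(predicted, future_track)
loss.backward()
optimizer.step()
print(f"trajectory prediction loss: {loss.item():.4f}")
```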

Imitation learning and reinforcement learning for planning

The same abstract representations like 3D bounding boxes that are used in prediction can be used in planning. Neural networks can learn planning in essentially two ways. They can learn by copying human behavior, in what’s known as imitation learning, or they can learn by trial and error, in what’s known as reinforcement learning. Karpathy discussed imitation learning at length on Autonomy Day:

Imitation learning and reinforcement learning can be combined to yield better performance than either technique on its own. Neural networks and hand-coded software can also be combined to give the system a better shot at handling novel situations for which it lacks training data. When the neural network is not confident, the system can fall back on a hand-coded planner.
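To illustrate that combination, here is a hedged sketch of a planner that uses a learned, imitation-trained policy when it is confident and falls back on a simple hand-coded rule otherwise. The confidence measure, threshold, and fallback rule are placeholders of my own, not anything Tesla has described.

```python
# Minimal sketch of a learned planner with a hand-coded fallback (illustrative only).
import torch
import torch.nn as nn

N_ACTIONS = 3   # e.g., 0 = keep lane, 1 = nudge left, 2 = nudge right

# Imagine this policy was trained by imitation learning on human driving,
# mapping abstract scene features (box positions, speeds, etc.) to actions.
policy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, N_ACTIONS))

def hand_coded_planner(features: torch.Tensor) -> int:
    """Very conservative fallback: just keep the current lane."""
    return 0

def plan(features: torch.Tensor, confidence_threshold: float = 0.8) -> int:
    with torch.no_grad():
        probs = torch.softmax(policy(features), dim=-1)
        confidence, action = probs.max(dim=-1)
    if confidence.item() >= confidence_threshold:
        return int(action.item())          # trust the learned policy
    return hand_coded_planner(features)    # fall back on hand-coded logic

scene_features = torch.rand(16)            # stand-in for the abstract scene representation
print("chosen action:", plan(scene_features))
```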

I’ll have more to say about imitation learning and reinforcement learning in a future article.

The trillion-dollar question

In summary, Tesla’s large-scale fleet data, combined with active learning, gives it advantages in five distinct areas:

  1. Fully supervised learning for computer vision (i.e., training on hand-labeled images and videos).

  2. Weakly supervised learning for computer vision (i.e., using driver-generated labels on images and videos).

  3. Self-supervised learning for computer vision (i.e., using parts of videos to predict other parts of videos).

  4. Self-supervised learning for prediction (i.e., using the past behavior of abstract representations like bounding boxes to predict their future behavior).

  5. Imitation learning and reinforcement learning for planning (i.e., using human behavior and real-world experience to train neural networks to make driving decisions using abstract representations as input).

Per the Baidu study I mentioned above, in any area where Tesla is able to collect 1,000x as much training data as its competitors (i.e., with items 2, 4, and 5 on the list above), its neural network performance could end up being 10x better. Anywhere active learning is applied at large scale (i.e., all five items on the list, including 1 and 3), performance may also be several-fold higher. For example, with rare wildlife such as bears or moose or uncommon vehicles like tractors, Tesla may be able to collect 1,000x more examples than competitors whose fleets are a tiny fraction of the size.

Will this end up being enough to solve robotaxis? That’s the trillion-dollar question. ARK Invest’s financial model computes that 5 million robotaxis would earn Tesla a $1.4 trillion market cap and a $6,100 share price. That’s out of a combined market cap of $4 trillion for robotaxi companies worldwide:

In China alone, McKinsey sees robotaxis and sales of fully autonomous vehicles generating $2 trillion in annual revenue at the point when two-thirds of passenger miles are fully autonomous.

The question is: are robotaxis possible and, if so, how soon?

Perhaps the most encouraging news is that Waymo has, at long last, deployed driverless rides for some of its early access testers:

What remains to be seen is whether driverless rides can be scaled up safely and what statistical data Waymo has to show that driverless rides are safer than a human taking the wheel. I hope that Waymo will be able to make driverless rides the norm rather than the exception. I also hope it will release some rigorous safety data to prove to the world it’s making a prudent decision.

Cruise inadvertently provided some of that data to the public when an internal report was leaked to the press. The report included an internal forecast, made in mid-2019, that Cruise’s autonomous vehicles would be at 5-11% of the human safety level by the end of 2019. To me, this is encouraging because it suggests that, if Cruise’s forecast turned out to be correct, “only” a roughly 9x to 20x improvement is now needed to reach human-level safety. That feels a lot more encouraging than if the number were, say, 1,000x. As shown by the studies from Baidu, DeepMind, and Nvidia, a 10x improvement isn’t unheard-of in machine learning.

The development I’ll be watching most closely is Tesla’s release (or continued delays) of “feature-complete” Full Self-Driving, which is essentially a version of Autopilot that can operate on city streets and in suburbs. It would not surprise me at all if the initial version of feature-complete Full Self-Driving came with the same sort of flaws as the first releases of Navigate on Autopilot and Smart Summon. However, I predict that Tesla will have a decently polished version out within 1-3 years of its initial release. I’m not confident in predicting when or if robotaxis will arrive, but I am confident in predicting Tesla is at least not far off from having futuristic driver assistance technology that will likely be the envy of traditional automakers. This, as much as a head-start on electrification, could help Tesla carve out a Toyota-sized (TM) spot in the global automotive market.

Disclosure: I am/we are long TSLA. I wrote this article myself, and it expresses my own opinions. I am not receiving compensation for it (other than from Seeking Alpha). I have no business relationship with any company whose stock is mentioned in this article.