Date: 2019-11-12

Human input is a source of signal. Human behavior can tell the machine it's doing something wrong and, in some cases, can tell the machine how to do it right.

Engineers are also trying to use the labeling workforce as efficiently as possible. This means accumulating examples of computer vision errors from billions of miles of driving.

Andrej Karpathy, Tesla's (TSLA) Senior Director of AI, recently gave a talk where he unveiled the tongue-in-cheek "Operation Vacation". The idea is to automate Tesla's machine learning pipeline to the point where all the engineers can go on vacation and Tesla's partially autonomous driving features will continue to improve automatically. The fleet will continue to upload data automatically, a workforce of hand-labelers will continue to label data where necessary, neural networks will be automatically trained on the new data, and then the improved networks will be deployed to the fleet. The cycle repeats.

The idea of the engineers going on vacation is tongue-in-cheek because there is always work to be done; innovation never stops. Moreover, even if the whole process could be fully automated, it would be wise to have people overseeing the machinery to see that it's working as intended. But what's real is the concept of Autopilot, Summon, and other features improving largely automatically. Software development is traditionally a function of engineer labor. The goal of "Operation Vacation" is for Tesla's engineers to be like factory engineers. They put a lot of work in for a long time until the factory is running. Once it's running, it requires a large workforce of non-engineers to keep it running. The engineers are still needed to check up on the machines and fix problems as they arise or to make improvements. But, by and large, the factory is running independently of the engineers' labor. By contrast, traditional software development is more like engineers artisanally crafting goods by themselves.

A slide from Andrej Karpathy's talk showing Tesla's machine learning pipeline.

The idea for Tesla is to build an approach to autonomy that scales with data since Tesla has the largest fleet of sensor-equipped, computer-equipped, Internet-connected vehicles with which to collect data. The most significant bottleneck in this approach is the cost to pay human labelers. On computer vision tasks where Tesla needs to continually add new, hand-labeled camera data to its training datasets in order to keep improving, the approach doesn't scale with the data that's available to collect but with the subset of data that Tesla can afford to label. It might be possible for Tesla to collect a trillion photos of traffic lights, but not to pay people to label them as red, green, or yellow.

On tasks that require hand-labeling, Tesla's approach doesn't allow it to increase the sheer quantity of labeled data, but rather to speed up learning by attaining a higher quality of data. The most valuable examples to learn from are the ones where a neural network makes an error. Maybe it misclassifies an object, detects an object that isn't there, or fails to detect an object that is there. Tesla can source a higher quantity of these sorts of valuable examples than its competitors through the automatic processes that "Operation Vacation" represents.

Imagine a Tesla driving down a highway on Navigate on Autopilot. The car autonomously initiates a lane change, failing to notice a pickup truck in the adjacent lane. The driver turns the wheel, disengaging Autopilot and canceling the lane change. This human intervention can trigger a snapshot that might include a short video clip from the Tesla's eight cameras, radar data, GPS data, and so on. The video clip can be reviewed by Tesla's human labelers, who see the pickup truck is not detected by Tesla's neural network. The labelers draw a 3D box around the truck, label it "light truck", and send the labeled video clip along to Tesla HQ for inclusion in the training dataset.

Now imagine a Tesla driving down a highway under full human control. The minivan ahead of the Tesla slams on its brakes and, in turn, the Tesla driver slams on the brakes as well. Running silently on Tesla's computer is the Autopilot software. It is quietly watching the human driver's trajectory, at each moment estimating the probability that Autopilot would take the same trajectory. In this case, the Autopilot software fails to detect the minivan. From Autopilot's point of view, the human driver is randomly slamming on the brakes in the middle of an empty highway. Therefore, it assigns a low probability to the driver's trajectory; it is "surprised" by the driver's actions. This "surprise" or "disagreement" between the human-driven trajectory and the machine-generated trajectory can trigger a snapshot and an upload to Tesla's data labelers. (This is my interpretation of the approach Tesla calls "shadow mode".)
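To make the "surprise" idea concrete, here is a minimal sketch of how such a trigger could work. It is only an illustration of my interpretation, not Tesla's implementation: I assume the software predicts a deceleration at each time step with a Gaussian uncertainty, and the function names, the noise level `sigma`, and the `threshold` are all hypothetical.

```python
import math

def gaussian_log_likelihood(observed: float, predicted: float, sigma: float) -> float:
    """Log-density of the observed value under a normal distribution centred on the prediction."""
    z = (observed - predicted) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def should_trigger_snapshot(human_decel, predicted_decel, sigma=1.0, threshold=-6.0):
    """Return True when the human's actions are, on average, very unlikely under the model.

    sigma and threshold are made-up values chosen only for this illustration.
    """
    scores = [
        gaussian_log_likelihood(h, p, sigma)
        for h, p in zip(human_decel, predicted_decel)
    ]
    return sum(scores) / len(scores) < threshold

# Hypothetical decelerations in m/s^2, one value per time step.
human_hard_brake = [0.0, 6.0, 8.0, 8.0]        # the driver slams on the brakes
model_sees_empty_road = [0.0, 0.0, 0.0, 0.0]   # the model missed the minivan
human_cruising = [0.0, 0.2, 0.1, 0.0]          # ordinary driving

surprised = should_trigger_snapshot(human_hard_brake, model_sees_empty_road)
not_surprised = should_trigger_snapshot(human_cruising, model_sees_empty_road)
```

In this toy setup the hard-braking clip is flagged and the ordinary cruising clip is not. A real system would presumably score full trajectories rather than a single deceleration value.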

A fiendish thing about self-driving is that the lower your error rate gets, the more examples of errors you need to get that next halving of the error rate. As the quantity of errors your system generates decreases, the quantity of errors you need to sustain your rate of improvement increases. The more you need errors, the harder they are to find. Suppose Waymo (GOOG, GOOGL) has gotten serious computer vision errors (i.e. errors that would cause vehicle misbehavior noticeable by a safety driver) down to one error per 10,000 miles. (This isn't a real figure, just a made-up one for illustrative purposes.) Since Waymo is driving approximately 1 million miles per month, only 100 serious computer vision errors will occur per month. Labeling error examples is now no problem, but getting them is hard. By contrast, if Waymo were driving 1 billion miles per month, 100,000 errors would occur per month. In this way, the number of error examples that can be collected scales with the number of miles driven.
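The arithmetic can be written out as a short sketch. The one-error-per-10,000-miles rate is the same made-up figure as above, and the target of 10,000 error examples is a further hypothetical number I am adding purely to show how collection time scales.

```python
def errors_per_month(miles_per_month: float, miles_per_error: float) -> float:
    """Expected number of error examples generated by a fleet each month."""
    return miles_per_month / miles_per_error

def months_to_collect(target_examples: float, miles_per_month: float,
                      miles_per_error: float) -> float:
    """Months needed to gather a target number of error examples."""
    return target_examples / errors_per_month(miles_per_month, miles_per_error)

MILES_PER_ERROR = 10_000   # made-up error rate from the text
TARGET_EXAMPLES = 10_000   # hypothetical dataset size, not from the text

small_fleet = errors_per_month(1_000_000, MILES_PER_ERROR)       # 1 million miles per month
large_fleet = errors_per_month(1_000_000_000, MILES_PER_ERROR)   # 1 billion miles per month

months_small = months_to_collect(TARGET_EXAMPLES, 1_000_000, MILES_PER_ERROR)
months_large = months_to_collect(TARGET_EXAMPLES, 1_000_000_000, MILES_PER_ERROR)
```

Under these assumptions the smaller fleet would need 100 months to gather the examples, while the larger fleet would gather them in about three days.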

Over the 12 months since its release in November 2018, Navigate on Autopilot (the version of Autopilot that can perform autonomous lane changes) has driven 1 billion miles. That's an average of 83 million miles per month. Going forward, that monthly figure will be far higher, since the fleet of Navigate on Autopilot-capable cars has roughly doubled over the last 12 months and will probably increase by at least 50% over the next 12 months (assuming Tesla's current production rate doesn't decrease). The total fleet with Autopilot Hardware 2 and above currently stands at around 650,000. Assuming the average miles driven per car is the same as the American average, the total miles driven (both fully manual and Autopilot) is about 725 million per month. Both modes of driving provide fodder for "Operation Vacation", as explained above. On Autopilot, human interventions flag machine errors or situations the human driver thought too difficult for Autopilot. In manual mode, human-Autopilot "disagreements" flag machine errors as well.
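For readers who want to check the mileage figures, here is the back-of-the-envelope calculation. The annual mileage per car is my assumption: I use roughly 13,400 miles per year as a stand-in for the American average because it is consistent with the 725 million figure above; swap in your own number if you prefer.

```python
NOA_MILES_FIRST_YEAR = 1_000_000_000        # Navigate on Autopilot miles, from the text
FLEET_SIZE = 650_000                        # Hardware 2+ cars, from the text
ASSUMED_MILES_PER_CAR_PER_YEAR = 13_400     # assumption, not stated in the text

def per_month(annual_total: float) -> float:
    """Convert an annual total into a monthly average."""
    return annual_total / 12

noa_miles_per_month = per_month(NOA_MILES_FIRST_YEAR)
fleet_miles_per_month = per_month(FLEET_SIZE * ASSUMED_MILES_PER_CAR_PER_YEAR)

print(f"Navigate on Autopilot: {noa_miles_per_month / 1e6:.0f} million miles per month")
print(f"Whole fleet, all modes: {fleet_miles_per_month / 1e6:.0f} million miles per month")
```

The first line comes out at about 83 million and the second at roughly 726 million, in line with the rounded figures in the text.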

So far, I've only discussed elements of "Operation Vacation" where human labelers are in the loop. What about removing not just the engineers, but the labelers too? What about handing the whole process over to the machines?

A surprising finding in machine learning is that if you use poorly labeled data, sometimes you can achieve the same results as with well-labeled data as long as you use something like a thousand times as much. Facebook (FB) found that Instagram hashtags, which often tenuously relate to what an image actually depicts, can be used to train a neural network to accurately classify objects. The trick is that Facebook used 1 billion images with hashtags to achieve roughly the same accuracy as with 1 million hand-labeled images. Facebook has also found that if you combine both kinds of images, you achieve greater accuracy than either kind of image on its own.

Tesla has an abundant source of poor labels in the form of its 650,000 human drivers. In what's known as a weakly supervised approach (in contrast to the fully supervised approach in which data is carefully hand-labeled), the sort of actions I described above as ways to flag data for hand-labeling could instead be treated as low-quality labels. When a human driver proceeds into space that Autopilot detects as obstructed, that space can automatically be labeled as unobstructed. Conversely, when a human driver stops ahead of space that Autopilot detects as all clear, that space can be labeled as containing an obstacle. This is a messy way to go about it, but it's a way to supplement scarce, expensive labels with abundant, free labels. (You can read a research paper about this approach.)
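A minimal sketch of that labeling rule follows. The `DrivingEvent` type, its field names, and the label strings are hypothetical; the point is only to show how a disagreement between driver and network can be turned into a free, low-quality label.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DrivingEvent:
    """One hypothetical moment of driving, reduced to two booleans."""
    perceived_obstructed: bool  # what the vision network believed about the space ahead
    driver_proceeded: bool      # whether the human actually drove into that space

def weak_label(event: DrivingEvent) -> Optional[str]:
    """Return a low-quality label when human and network disagree, otherwise None."""
    if event.perceived_obstructed and event.driver_proceeded:
        # The human drove through space the network thought was blocked.
        return "unobstructed"
    if not event.perceived_obstructed and not event.driver_proceeded:
        # The human stopped short of space the network thought was clear.
        # This is noisy: the driver may have stopped for some other reason.
        return "obstacle"
    return None  # no disagreement, so nothing new to learn from this event

events = [
    DrivingEvent(perceived_obstructed=True, driver_proceeded=True),
    DrivingEvent(perceived_obstructed=False, driver_proceeded=False),
    DrivingEvent(perceived_obstructed=False, driver_proceeded=True),
]
labels = [weak_label(e) for e in events]
```

The second rule is where the messiness comes from: a driver can stop for a red light or a phone call, so any one label may be wrong, and the approach leans on sheer volume to wash that noise out.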

Tesla's Autopilot job postings have long requested candidates who can "[d]evise methods to use enormous quantities of lightly labeled data in addition to a diverse set of richly labeled data." At Tesla's Autonomy Day presentation, Karpathy described an approach similar to what I've been describing. With video clips labeled by human steering, Tesla trained a neural network to perceive and extrapolate the curve and slope of the roadway ahead.

Weakly supervised learning is an approach that truly scales with data. The limit is no longer human labor (neither engineers' nor labelers'), but rubber on the road, packets routed through the Internet, and the churning of GPUs at Tesla HQ. The machine learning process truly is just an elaborate machine that can operate on its own, as fast as its mechanical parts can move. The slow part is the development time the engineers have to spend setting up the process and getting it to work.

Prediction is a much easier job for "Operation Vacation" than computer vision. When it comes to predicting the trajectories of cars and pedestrians, there is an abundant, free source of high-quality labels: the future. Make an observation about a car's current trajectory and over the next five seconds that car will, in effect, label that observation with its future trajectory. If you make a prediction about where a pedestrian will walk, that pedestrian will promptly show you how accurate your prediction is. Errors can thereby be automatically detected and error examples can be automatically labeled with the correct future trajectory. This is a dream scenario. Labor isn't a constraint, money isn't a constraint; only cars and computers are a constraint.
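Here is a small sketch of how the future can act as the labeler. The error metric (average distance between predicted and actual positions), the one-meter tolerance, and the function names are my own illustrative choices, not anything Tesla has described.

```python
import math

def average_displacement_error(predicted, actual):
    """Mean distance in meters between predicted and actual positions, step by step."""
    distances = [math.dist(p, a) for p, a in zip(predicted, actual)]
    return sum(distances) / len(distances)

def make_training_example(observation, predicted, actual, tolerance_m=1.0):
    """Return an automatically labeled example when the prediction was wrong, else None.

    tolerance_m is a hypothetical threshold chosen for illustration.
    """
    error = average_displacement_error(predicted, actual)
    if error <= tolerance_m:
        return None  # prediction was good enough; nothing to upload
    # The observed future trajectory becomes the label for the earlier observation.
    return {"observation": observation, "label": actual, "error_m": error}

# Hypothetical (x, y) positions in meters over the next three time steps.
predicted_straight = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
actual_swerve = [(0.0, 1.0), (1.0, 2.0), (3.0, 3.0)]

flagged = make_training_example("clip_0001", predicted_straight, actual_swerve)
ignored = make_training_example("clip_0002", predicted_straight, predicted_straight)
```

No human touches either clip: the swerving car supplies its own correct label, and the accurately predicted clip is simply dropped.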

Planning is also a dream. It's the same principle with a different focus. The planner is trying to determine what actions the car should take. The observation is the full driving scene, seen by the eight surround cameras and the forward radar. In manual mode, the neural network predicts what the human Tesla driver will do, automatically flagging an error if it predicts wrong. (Again, this is my interpretation of "shadow mode".) The driver labels the observed driving scene with their actions. On Autopilot, it's the same except an error is flagged whenever the human intervenes. This approach, known as imitation learning, was recently used by DeepMind to train neural networks that can play StarCraft better than over 70% of human players. Imitation learning has also been explored by Waymo and it's an approach favored by the self-driving car startup Aurora. (Waymo and Aurora, however, lack the scale of training data provided by Tesla's 650,000 human drivers.)
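In its simplest form, imitation learning is just supervised learning on pairs of observations and human actions, often called behavior cloning. The toy sketch below fits a one-parameter braking policy to hypothetical human data and then flags a disagreement. A real planner would be a neural network over camera and radar inputs; the single "closing speed" feature, the data, and the tolerance are all assumptions made for illustration.

```python
def fit_linear_policy(observations, human_actions):
    """Least-squares fit of action = w * observation, with no intercept."""
    numerator = sum(o * a for o, a in zip(observations, human_actions))
    denominator = sum(o * o for o in observations)
    return numerator / denominator

def predict_action(weight, observation):
    """The cloned policy's guess at what the human would do."""
    return weight * observation

def disagrees(weight, observation, human_action, tolerance=1.0):
    """Flag an example when the policy's action differs too much from the human's."""
    return abs(predict_action(weight, observation) - human_action) > tolerance

# Hypothetical training data: closing speed on the car ahead (m/s)
# and the deceleration the human driver applied (m/s^2).
closing_speeds = [1.0, 2.0, 3.0, 4.0]
human_braking = [0.5, 1.0, 1.5, 2.0]

weight = fit_linear_policy(closing_speeds, human_braking)

# A new situation where the human brakes much harder than the policy expects.
needs_review = disagrees(weight, observation=4.0, human_action=6.0)
routine = disagrees(weight, observation=2.0, human_action=1.1)
```

The human's actions serve as the labels during training, and afterwards a large gap between the policy's action and the human's action marks the example as worth uploading.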

So, I posit that "Operation Vacation" has four main pillars:

Automatic flagging of computer vision errors that are later uploaded and hand-labeled (fully supervised learning). Errors are flagged when a human driver disengages Autopilot or when, in manual mode, the Autopilot planner "disagrees" with (or is "surprised" by) a human's trajectory.

Automatic labeling of camera data using low-quality labels from human driving behavior (weakly supervised learning for computer vision). Training examples are uploaded when the computer vision neural network and the human driver "disagree", resulting in Autopilot generating a different trajectory than the one the human driver took.

Automatic labeling for prediction. Future events label past events and show when a prediction is in error.

Automatic labeling for planning. Human driving behavior provides the labels. Error is presumed when a human intervenes while Autopilot is active or when there is a "disagreement" between the human and the Autopilot planner.

Prediction and planning depend on computer vision to track object trajectories and to observe the driving scene. A computer vision error may cause prediction or planning to fail. So, computer vision has to be solved for everything else to work optimally.

Conversely, a prediction or planning error may trigger the upload of a video clip that doesn't generate any computer vision errors. Incorrectly flagged clips increase the workload for humans doing manual review. Reducing prediction and planning errors can, therefore, save labelers time and let them focus on computer vision errors. By improving the accuracy of automatic flagging, progress on prediction and planning can help speed along progress on computer vision.

By developing an approach that scales as much as possible with data and as little as possible with labor, Tesla's engineers are out on the frontier of large-scale machine learning for autonomous vehicles. Automatic flagging of errors makes labelers' labor more efficient. Leveraging imitation learning reduces the work engineers need to put into a planner; rather than laboriously hand-coding every driving behavior, behaviors can be learned from data. With 725 million miles of driving to draw on every month, there is no precedent in the field of autonomous vehicles to indicate how effective Tesla's approach will be.

Some skeptics argue that solving computer vision is impossible. Maybe so. We won't know for sure unless and until it happens.

Developing a Level 2 system that fails and needs human intervention every 100 miles is a lot easier than developing a Level 4 or 5 system that fails every 1 million miles. Even if Tesla falls far short of the goal of full autonomy, it's almost a foregone conclusion that Tesla will be able to develop a Level 2 system for driving on city streets. The main question lingering over this prospect is whether, as machine errors become more and more infrequent, drivers will remain vigilant and intervene when necessary or whether they will be lulled into a false sense of security. Tesla may need to implement driver monitoring to ensure that drivers are paying attention. The combination of a driver-facing camera and the existing steering wheel torque sensor would presumably be more effective than the torque sensor on its own.

The hyper-bullish scenario for Tesla is that it develops full autonomy and deploys robotaxis. In that scenario, Tesla's market cap could roughly double or triple, if not decuple. The mildly bullish scenario is that Tesla releases a Level 2 system for urban driving and it's so good that Tesla sells more cars and more units of the "Full Self-Driving" add-on. Sales growth and automotive gross margin are two key metrics investors are watching; Tesla's urban Level 2 system could be an unexpected contributor to both.

