Tesla’s cars collect so much camera and other sensor data as they drive around, even when Autopilot isn’t turned on, that the Autopilot team can examine what traditional human driving looks like in various driving scenarios and mimic it, said the person familiar with the system. It uses this information as an additional factor to plan how a car will drive in specific situations—for example, how to steer a curve on a road or avoid an object.
Efrati cites unnamed engineers who believe strongly in the imitation learning approach:
...Tesla’s engineers believe that by putting enough data from good human driving through a neural network, that network can learn how to directly predict the correct steering, braking and acceleration in most situations. “You don’t need anything else” to teach the system how to drive autonomously, said a person who has been involved with the team. They envision a future in which humans won’t need to write code to tell the car what to do when it encounters a particular scenario; it will know what to do on its own.
The optimism of these engineers may have been bolstered when, a few months later, DeepMind (GOOG, GOOGL) used imitation learning to achieve human-level ability on the video game StarCraft II, which for years has been considered an important challenge for AI research. This result was described as a surprise by Alex Irpan, a machine learning engineer at Google Brain:
...we expect long-horizon tasks to be harder for imitation learning. A StarCraft game is long enough that I didn’t expect imitation learning to work at all. And yet, imitation learning was good enough to reach the level of a Gold player.
The worry is that if the AI agent makes a mistake or otherwise goes off-script, it will find itself in a state where it has no examples of human behaviour to imitate. Irpan surmises that DeepMind’s agent was able to overcome this worry by (at least in large part) having a massive amount of data:
If you have a very large dataset, from a wide variety of experts of varying skill levels (like, say, a corpus of StarCraft games from anyone who’s ever played the game), then it’s possible that your data already has enough variation to let your agent learn how to recover from several of the incorrect decisions it could make.
In April 2019, Tesla’s Director of AI, Andrej Karpathy, confirmed on stage at Tesla Autonomy Day that Tesla is using imitation learning. In fact, he revealed that imitation learning is already used to some extent in the production version of Autopilot.
Karpathy also expressed his belief that certain tasks, such as deciding when to make a lane change, that are currently handled by hand-coded heuristics will most likely be better handled by imitation learning at some point in the future.
The ideal outcome would be for Tesla to achieve the same success with driving that DeepMind achieved with StarCraft. This is a more likely outcome for Tesla than other companies because of Tesla’s ability to collect and curate a very large dataset of human driving demonstrations. Tesla has approximately 600,000 vehicles with “full self-driving hardware”. These vehicles drive something in the ballpark of 20 million miles per day. A competitor like Waymo with roughly 0.1% as many vehicles can’t create a dataset of the same size.
How to get the right data
A complication for Tesla is that randomly or indiscriminately uploading data won’t be that helpful. Examples of the most common behaviours, like driving straight ahead on a highway, will swamp uncommon manoeuvres like U-turns. There needs to be a way to capture the full variety of driving behaviours, including the rarest examples.
The most straightforward way to do this is manually. That is, Tesla engineers design upload triggers to pull the data they want. For example, they might decide they need more examples of unprotected left turns. So, they design a trigger: when a car’s vision neural network detects a traffic light and when the steering wheel is turned left, the car saves a recording of what happens (starting a few seconds before the trigger was activated). The recording is later uploaded via Wi-Fi.
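To make the idea concrete, here’s a toy sketch of such a trigger in Python. The telemetry field names, thresholds, and frame rate are my own inventions for illustration; Tesla’s actual schema is not public.

```python
from collections import deque

# Hypothetical sketch of a manually designed upload trigger.
# Field names (traffic_light_detected, steering_angle) are invented;
# they stand in for the vision network's output and the CAN bus signals.

BUFFER_SECONDS = 10   # keep a rolling window of recent frames
FRAME_RATE = 36       # frames per second (assumed)

frame_buffer = deque(maxlen=BUFFER_SECONDS * FRAME_RATE)
uploads = []          # stand-in for clips queued until the car is on Wi-Fi

def unprotected_left_turn_trigger(frame):
    """Fire when the vision network sees a traffic light and the driver
    is steering left - a crude proxy for an unprotected left turn."""
    return frame["traffic_light_detected"] and frame["steering_angle"] < -15.0

def on_new_frame(frame):
    """Called for every frame; the rolling buffer means the saved clip
    starts a few seconds before the trigger activated."""
    frame_buffer.append(frame)
    if unprotected_left_turn_trigger(frame):
        uploads.append(list(frame_buffer))
```

The rolling buffer is the key detail: because the car is always recording into a fixed-length window, the saved clip can include the seconds leading up to the triggering moment.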
But what about the behaviours engineers can’t anticipate or can’t write a trigger for? What about when a neural network makes a mistake? In these cases, Autopilot interventions are the ideal trigger. CEO Elon Musk says Tesla’s approach is to “view all input as error”.
If an Autopilot intervention is seen as an error, then what the human driver does after they intervene could also be seen as a demonstration of the correct behaviour. Interventions could thereby provide a constant stream of useful training data.
Learning from interventions is an approach used by Aurora, a self-driving car startup co-founded by Chris Urmson (formerly of Waymo), Drew Bagnell (formerly of Uber ATG (UBER)), and Sterling Anderson (formerly of Tesla’s Autopilot division). In a June 2019 talk, an Aurora engineer, Asta Li, described how safety driver interventions provide training examples for imitation learning. The relevant part of the talk begins at 24:12.
Learning from interventions requires neural networks to be actively controlling the car, such as when Autopilot is turned on. However, it may also be possible to curate training examples when the car is being manually driven. Tesla refers to this as “shadow mode”. Amir Efrati describes shadow mode in the same article I quoted from above:
There’s also disagreement about the value of what has been dubbed Tesla’s “shadow mode” approach to researching new Autopilot software. That’s a reference to Tesla’s ability to run experimental software on any of its cars on the road without actually affecting what the cars do. Tesla’s engineers can compare what the car would do with the experimental software to what it is actually doing. Shadow mode thus allows the Autopilot team to compare how humans react to certain situations with what Autopilot would have done in the same situation at that moment.
A neural network trained using imitation learning could run passively on a car’s computer, outputting what it thinks is the optimal action for the car to take. If a human driver takes a different action, that could trigger an upload. When the neural network and the driver “disagree”, it is treated as an error on the neural network’s part. The driver’s behaviour is treated as a demonstration of the correct behaviour.
Another idea is to pull data whenever the neural network is unsure of itself. There are techniques that attempt to quantify a neural network’s uncertainty. Perhaps when a network’s uncertainty exceeds a certain threshold, that could trigger an upload.
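Both ideas, the disagreement trigger and the uncertainty trigger, reduce to simple threshold checks on the passively running network’s output. Here’s a hypothetical sketch; the thresholds and the choice of entropy as the uncertainty measure are my own assumptions, not Tesla’s actual implementation.

```python
import math

# Hypothetical shadow-mode triggers. The network runs passively while a
# human drives; nothing here affects what the car actually does.

DISAGREEMENT_THRESHOLD = 10.0  # degrees of steering angle (assumed)
UNCERTAINTY_THRESHOLD = 0.8    # entropy threshold in nats (assumed)

def disagreement_trigger(predicted_steering, human_steering):
    """Fire when the network's proposed action differs from what the
    human actually did - treated as an error on the network's part."""
    return abs(predicted_steering - human_steering) > DISAGREEMENT_THRESHOLD

def uncertainty_trigger(action_probs):
    """Fire when the network's own uncertainty - here, the entropy of
    its distribution over discrete actions - exceeds a threshold."""
    entropy = -sum(p * math.log(p) for p in action_probs if p > 0)
    return entropy > UNCERTAINTY_THRESHOLD
```

A confident network concentrates its probability mass on one action (low entropy); a network spreading probability evenly across actions is unsure of itself, and that moment may be exactly the data worth uploading.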
So, broadly speaking, there are three ways for Tesla to pull useful data from the fleet:
Manually designed triggers
Autopilot interventions
Neural networks running passively (i.e. shadow mode)
Since Tesla is pulling from approximately 20 million miles of driving per day, it can rapidly build up very large datasets of different driving behaviours and collect new demonstrations to correct neural network errors. For a competitor like Waymo, which might take two years to drive 20 million miles, this approach at this scale simply isn’t possible.
What if a team at DeepMind tried to use imitation learning to solve StarCraft with data from just a few hundred players? It’s easy to imagine them getting nowhere. That’s not a knock against DeepMind, who are arguably the best in the world at what they do. It’s just that the best neural networks that exist today have a vast, unforgiving appetite for data. We don’t know how to train a neural network to play StarCraft in an hour, even though a human can figure it out in that time. Neural networks demand millions or billions of examples before they learn all the different statistical relationships between all the different variables they’re dealing with.
Even if a network can learn to drive decently well with examples pulled from a few million miles of driving, getting to human-level safety requires driving down error rates to below 0.1%. Given the diminishing returns of more data, it may turn out this is only possible with a truly massive dataset. This is particularly the case if what causes errors are rare states that a neural network finds itself in that it hasn’t been trained to handle. To get 10,000 examples of a once-in-a-million-mile state requires 10 billion miles of driving to draw from. That’s inevitable for Tesla, impossible for Waymo. Without producing any new cars, Tesla would drive 10 billion miles over the next 17 months. For Waymo, it would take 800 years.
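The back-of-envelope arithmetic behind those figures:

```python
# Arithmetic for the rare-state claim above (figures from the text).
examples_needed = 10_000          # target examples of the rare state
miles_per_example = 1_000_000     # a once-in-a-million-mile state
miles_required = examples_needed * miles_per_example
print(miles_required)             # 10,000,000,000 miles

tesla_miles_per_day = 20_000_000  # ballpark fleet mileage from above
days = miles_required / tesla_miles_per_day
print(days / 30)                  # ~16.7 months, i.e. about 17 months
```

Run the same calculation with a fleet 0.1% the size and the timescale stretches from months into centuries.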
This is how Tesla could solve autonomy: imitation learning for driving tasks that are impossible to hand code and impossible to learn from datasets 0.1% the size, combined with deep learning for computer vision and behaviour prediction, which also benefit from training on rare edge cases.
The possibility of fast progress
Elon Musk believes Tesla will solve full autonomy by the end of 2020. Many commentators treat this timeline as ridiculous on its face. I’m often more surprised when Musk meets his timelines than when he misses them, but I don’t see this timeline as ridiculous on its face. I suspect the mental model of many of the commentators who find it ridiculous is that developing autonomy is a manual process. In this understanding, developing autonomy involves software engineers plugging away at their code, moving at human speed. Software development can only happen so fast, and adding more developers often doesn't make it go faster. If Tesla were a year away from full autonomy, the thinking goes, we’d know it, because its vehicles would be almost fully autonomous today. Humans only write code so fast, so the code would need to be mostly done today for it to be finished by the end of next year.
My current mental model is that developing autonomy is largely an automated process with several manual bottlenecks. Software engineers still have to finish software. Human annotators have to label video data. AI experts like Karpathy have to design neural network architectures, curate datasets, and figure out the best practices for training networks (e.g. when to stop). But once the bottlenecks are cleared, imitation learning happens at the speed of sensors, routers, and GPUs. DeepMind spent about three years in research and development on StarCraft and then used imitation learning to train its agent to human-level performance in three days. In a metaphorical sense, the training process “wrote” the “code” to play StarCraft in three days. The years of human work went into setting up the training process, particularly the neural network architecture and the dataset.
By analogy, rocket engineers typically spend years designing and building a new rocket. For years, nothing leaves the ground. When the time comes for the first test launch, everything comes down to the first few seconds after T-minus-0. If you measured progress by proximity to the ground, it would look like nothing is happening for years and then like everything is happening all of a sudden. What is really going on is that progress in research and engineering is hidden behind the scenes.
Right now, it appears Tesla is doing a lot of work to set up its various training processes. For instance, Tesla has yet to make use of the power of its new computing hardware. The neural networks designed for the new hardware still seem to be in development or testing.
The version of Tesla’s full autonomy software that it showed off in April was allegedly the result of only three months of neural network training and software development on off-highway driving. That software took investors and analysts on generally well-reviewed test rides and was also used to produce a demo video.
A year before that, Tesla was still working on highway lane keeping; a March 2018 update fixed a problem where Autopilot would “ping pong” back and forth between the two edges of a lane. That update was “a result of a fairly extensive rewrite”, according to Karpathy. My impression is that Karpathy’s team and others in the Autopilot division have been working on the foundations of Autopilot, especially computer vision, for quite some time, and that the public will only see the results of this work when an over-the-air update to the fleet enables new Level 2 autonomy (i.e. advanced driver assistance) features for city driving. Whether city driving features are good enough to deploy is a binary question, not an incremental one. We’ll only see these features once Tesla decides they’re ready; we don’t see the incremental progress that is observable to people inside the company.
If I were to venture into the realm of prediction, I would guess Tesla’s next steps are something like this:
1. Deploy a new computer vision neural network that uses the new hardware’s 21x increase in video processing capability. We know Karpathy’s team is working on such a network, but we don’t know when it will be deployed.
2. Notice computer vision errors in the wild from Autopilot interventions, the network’s self-rated uncertainty, and any other means available. Collect data from these error cases and retrain the network to reduce errors.
3. When computer vision accuracy reaches a level that Karpathy and others are satisfied with, it’s time to train or retrain neural networks that predict the behaviour of vehicles, pedestrians, and cyclists. To get accurate training data on road user behaviour, Tesla’s vehicles first have to be able to see road users accurately. The computer vision network labels the video data automatically, which is great because it removes the economic limit of hand labelling. However, it also means any errors in computer vision will flow through to behaviour prediction, hence the need for vision accuracy. (Technical detail: some or all prediction tasks may actually be bundled up together with vision tasks in Tesla’s giant neural network, which Karpathy says handles around 100 different sub-tasks. The same is true for imitation learning tasks. However, the important conceptual point is that accurate computer vision is needed for accurate automatic labelling of training data for behaviour prediction. Accurate vision and prediction are also needed for accurate automatic labelling of training data for imitation learning.)
4. Notice behaviour prediction errors. Collect data. Retrain.
5. When the above steps are complete, feed the computer vision output and the behaviour prediction output into neural networks designed for imitation learning. Given what the vision network sees and what the prediction networks predict, and given the driving behaviour of approximately 600,000 Tesla drivers (as observed via the car’s steering angle, speed, and other data), the imitation learning networks will attempt to learn the relationship between the situation as it is perceived and the correct driving behaviour. As DeepMind demonstrated with StarCraft, complex behaviours can be learned this way.
6. Deploy imitation learning piecemeal, either enabling new driving tasks to be automated (such as right turns at intersections) or supplanting a hand-coded heuristic that was previously used to automate a task.
7. Notice errors. Collect data. Retrain.
8. Keep deploying imitation learning until all driving tasks are automated.
9. Retrain until all automated tasks are performed better by the software than by the average human.
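Conceptually, the imitation learning step above is just supervised learning: given the perceived situation, predict the human’s control outputs. Here’s a toy sketch with synthetic data, where ordinary least squares stands in for a deep network and three made-up features stand in for the vision and prediction outputs.

```python
import numpy as np

# Toy behavioural-cloning sketch. The real inputs would be
# high-dimensional vision and prediction outputs; here we fake a
# dataset where "steering" is some unknown function of three features
# (hypothetical stand-ins like road curvature or lead-car distance).
rng = np.random.default_rng(0)

n_examples = 10_000
features = rng.normal(size=(n_examples, 3))

# Pretend human drivers steer according to this hidden rule, plus noise:
true_weights = np.array([0.8, -0.3, 0.1])
human_steering = features @ true_weights + rng.normal(scale=0.01, size=n_examples)

# "Training" = fitting the demonstrations. A deep network would use
# stochastic gradient descent, but least squares shows the same idea
# at toy scale: recover the mapping from perception to behaviour.
learned_weights, *_ = np.linalg.lstsq(features, human_steering, rcond=None)
print(learned_weights)  # close to [0.8, -0.3, 0.1]
```

The point of the toy is the shape of the problem, not the model: with enough demonstrations, the learner recovers the rule the demonstrators were following, without anyone writing that rule down.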
My current hunch is that the time between step 1, in which the new computer vision network is released, and step 8, in which all driving tasks are automated by imitation learning or by a hand-coded heuristic, could be relatively short. Data collection, automatic labelling, and training are automated processes that will happen quickly. Manual data labelling, which is needed for computer vision, happens slowly, as does work by Karpathy’s team on designing neural network architectures, curating training datasets, and improving the training process itself (e.g. by balancing out the number of examples of rare and common objects in each batch of data to avoid biasing the network too much toward the common ones). At the point when the computer vision network is ready to deploy, the manual data labelling needed to train the network to a deployable state will already have been done. So will the R&D work of Karpathy’s team, insofar as it pertains to getting the new vision network production-ready. So, when the vision network is deployed, a lot of the slow manual work will be behind us.
With behaviour prediction and imitation learning, it’s less clear how much of the manual R&D work will be done by the time the vision network deploys. Tesla has been working on prediction and imitation for quite some time. The production version of Autopilot has been able to predict cut-ins for nine months. Imitation learning has been used to help Autopilot determine the car’s path for almost a year. Remember that production deployment is the end of the process, not the beginning. So, internally, work must have started much earlier.
One thing that is clear is that behaviour prediction and imitation learning won't be slowed down by manual labelling. Both use automatic labelling.
I don’t know the future, but I can see a scenario in which Karpathy’s team deploys their new vision network in, say, Q1 2020 and then in Q4 2020 Teslas can drive like humans. The progress doesn't have to happen at human speed; it can happen at neural network speed. Whether this will happen is somewhat unknowable because the most decisive factor is whether the approach Tesla is currently pursuing fundamentally works or whether more progress in research or engineering is needed. Whether the approach works is an empirical question that can only be conclusively answered by running the experiment and trying the approach.
Complicating things more, if it works, it works, but if it doesn’t, the reason why it doesn't work might not be clear. Is success just a few neural architecture tweaks away? Or does it require fundamental research progress in AI? The only truly conclusive result is success. Failure today is inconclusive because it could always turn into success tomorrow with a bit more R&D.
The central idea here is: if Tesla’s approach is right, and if most of the manual work has already been done, then the steps to implementing it can be executed quickly, some of them at computer speed. Training the imitation learning networks, for instance, might take only a matter of days. There is no reason to apply our intuitive idea of how long it would take if everything were hand-coded by humans because it’s not. The “code” is “written” by neural networks, so to speak, much faster than humans could possibly write it, and much better too. We should therefore not dismiss the possibility that Tesla will go from a somewhat janky Autopilot experience today to human-level driving a year from now. I’m not predicting that it will happen, but the intuition that progress in this proposed scenario is too fast to be realistic is not well-grounded in evidence. We have counterexamples like DeepMind’s work with StarCraft where progress is that fast.
From an investment perspective, this creates an unusual and possibly unique (at least for a large-cap company) situation where the valuation logic for Tesla depends on a somewhat unknowable scientific/engineering factor that could rapidly change, causing the company’s rational valuation to jump 10x or more.
The explanation for such a large valuation is that fully autonomous robotaxis would not only devour Uber, Lyft (LYFT), and conventional taxis but would also compete against personal car ownership. According to AAA, a new vehicle costs an average of $0.62 per mile to own and operate. A new medium sedan costs $0.58 per mile on average. The cheapest category of vehicle, small sedans, costs $0.47 per mile. A financial model created by the analyst firm ARK Invest predicts that the cost of producing and operating a robotaxi will be $0.26 per mile. (You can download a copy of the model and see whether you find its assumptions reasonable. You can also modify the assumptions and compute your own results.) Tesla’s own estimate is $0.18 per mile. Using ARK Invest’s cost figure, a robotaxi company could charge $0.45 per mile to slightly undercut the cost of small sedans and still generate $0.19 per mile in gross profit. What's more, for that lower price, robotaxis would be offering an immense convenience: you don't have to drive! The everyday unpaid labour of driving is a hidden cost that should not be forgotten. Consumers would surely be willing to pay some price to automate it.
At a price of $0.45 per mile, ARK Invest's model computes that a robotaxi driving 127,000 miles per year (about 1.8x the 70,000 miles per year driven by a conventional taxi in New York City or Denver, Colorado) would generate $26,200 in gross profit per year. A fleet of 1 million such robotaxis (the number that Musk hopes to eventually deploy) would generate an annual gross profit of $26.2 billion.
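A naive version of that arithmetic, in cents to avoid rounding issues, looks like this. Note that ARK's full model lands somewhat above the two-line calculation ($26,200 versus roughly $24,000 per car); the gap presumably reflects the model's finer-grained assumptions, which this sketch does not attempt to reproduce.

```python
# Naive per-mile robotaxi arithmetic, in cents (figures from the text).
price_cents = 45      # price charged per mile
cost_cents = 26       # ARK Invest's estimated cost per mile
miles_per_year = 127_000

margin_cents = price_cents - cost_cents          # 19 cents per mile
annual_gross_profit = margin_cents * miles_per_year / 100
print(annual_gross_profit)                       # $24,130 per robotaxi per year

fleet_size = 1_000_000
print(annual_gross_profit * fleet_size / 1e9)    # roughly $24 billion per year
```

Even the naive figure makes the point: per-mile margins that look thin multiply into very large numbers at fleet scale and taxi-level utilization.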
There would be ample room to grow from there. Assuming demand for passenger miles stays constant, each robotaxi would replace eight or nine conventional vehicles, which travel 13,500 miles per year on average. Since there are around 1 billion conventional passenger vehicles in the world, ultimately there could be global demand for 110 to 125 million robotaxis.
There are a few assumptions in ARK Invest's model that I will now adjust. The model assumes the cost to manufacture a robotaxi is $50,000. That's well above the sticker price of the cheapest Tesla Model 3, which sells for $39,000. I'll assume the $39,000 variant has a gross margin of 5% and, therefore, costs $37,000 to produce. This feels like a more reasonable figure to plug in when we're contemplating a Tesla robotaxi scenario.
To be more conservative, I will adjust down ARK Invest's mileage assumptions to put them on par with Uber and conventional taxis. I'll assume a Model 3 robotaxi drives 70,000 miles per year, like a conventional taxi. I'll also assume a robotaxi is only carrying a passenger 64% of the time, like an Uber in Los Angeles. Under these assumptions, ARK Invest's model computes that each robotaxi would generate $4,200 in gross profit per year. For a fleet of 1 million robotaxis, that would be $4.2 billion in gross profit.
Of course, gross profit is sensitive to pricing. $0.45 per mile is an aggressively low price; as mentioned above, the average cost of personal car ownership is $0.62 per mile for new vehicles. At a price of $0.60 per mile, the same Model 3 robotaxi would generate $10,800 in annual gross profit according to ARK Invest's model. Try plugging in your own assumptions if you're curious.
However you slice it, robotaxis will inevitably be cost-competitive with personal vehicle ownership. At a first approximation, that's because a robotaxi will be shared among multiple people who each only have to pay for a fraction of the car. However, vehicle lifespan is a key element. A typical gasoline car, which might survive for 200,000 miles, would quickly wear out under a robotaxi model. Electric vehicles offer an advantage. Current Teslas may last up to 500,000 miles. An individual car owner can't realize much economic value from that; at 13,500 miles per year, it would take them 37 years to drive 500,000 miles. What's more, Tesla is designing future vehicles to last 1 million miles. In essence, robotaxis unlock the latent physical capital of electric vehicles: the unused surplus of miles they can be driven before wearing out. Robotaxis and electric vehicles are synergistic in a way that might not be initially obvious.
A proportionately small amount of investment into extending vehicle lifespans (compared to the overall R&D and manufacturing costs of vehicles) could potentially translate into increased economic value by spreading the costs of expensive physical assets, i.e. cars, over a much larger number of uses. To use an analogy, it's as if by making a house slightly larger, it could comfortably accommodate a second family. An investment in batteries and electric motors that adds another 200,000 miles to vehicle lifespan is the economic equivalent of turning one car into two. (That is, under the robotaxi model.)
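The lifespan argument can be put in per-mile terms using the $37,000 production cost from above. This is a deliberate simplification that ignores maintenance, financing, and salvage value; it isolates just the hardware amortization.

```python
# Vehicle-cost amortization per mile at different lifespans, using the
# $37,000 Model 3 production cost from the text. Ignores maintenance,
# financing, and salvage value for simplicity.
vehicle_cost = 37_000

cents_per_mile = {
    lifespan: 100 * vehicle_cost / lifespan
    for lifespan in (200_000, 500_000, 1_000_000)
}

for lifespan, cents in cents_per_mile.items():
    print(f"{lifespan:>9,} miles -> {cents:.1f} cents/mile")
```

Going from a 200,000-mile lifespan to a 1,000,000-mile lifespan cuts the hardware cost per mile by a factor of five, which is the sense in which extending lifespan is the economic equivalent of building extra cars.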
Extending vehicle lifespans has never been as much of a priority before because there is less of an immediate economic incentive for companies to invest in it. Companies want to sell more cars, not fewer. Consumers would not likely clamour for cars that last for 30 years rather than 15 even if they conceived of it as a live option. After 15 years, there's been enough evolution of cars' safety standards, technology (e.g. infotainment), fuel efficiency, and aesthetics that older models start to feel antiquated. If car manufacturers start optimizing for utilization, as the manufacturers of airplanes and semi trucks do, who knows how much more ground can be gained?
So, that more or less explains why firms like McKinsey, UBS (UBS), Morgan Stanley (MS), Evercore (EVR), RBC (RY), Jefferies (JEF), and Intel (INTC) see robotaxis as a long-term opportunity potentially measured in the hundreds of billions or even trillions of dollars. If robotaxis can actually be deployed, they will automate the labour of driving and progressively gobble up the entire auto sector.
However, the underlying uncertainty is the feasibility of the technology: in particular, the near-to-medium-term potential for deep learning to match human competence in vision, behaviour prediction, and driving behaviour. While I can’t resolve that uncertainty, I will make two claims. First, if deep learning can master these problems in the near-to-medium term, Tesla will be the one to prove it. Second, if Tesla solves full autonomy, there is a realistic possibility that, from the outside, progress will appear blazingly fast, catching many people by surprise. We should think about technical progress on this problem as a combination of subproblems solved at human coding speed or human R&D speed and subproblems solved at data uploading speed or neural network training speed. A neural network that takes years to develop might take only a few days to train. If we expect progress to be steady, smooth, and incremental, then, from the outside, we might miss this process. We don't want to miss it!
Disclosure: I am/we are long TSLA. I wrote this article myself, and it expresses my own opinions. I am not receiving compensation for it (other than from Seeking Alpha). I have no business relationship with any company whose stock is mentioned in this article.