Since reinforcement learning and imitation learning are brittle (i.e. they have poor generalization ability), the main unanswered question is whether the rate of fundamentally new situations on the road is low enough for these techniques to surpass human driving.

In some respects, path planning and driving policy are actually easier than the games machine learning has already mastered. For instance, the time horizon is much shorter.

Tesla's fleet, and only Tesla's fleet, is large enough to do reinforcement learning on a comparable scale to what we've seen with video games. Other companies' fleets don't come close.

Reinforcement learning has met with astonishing success when applied to complex video games (sometimes in combination with imitation learning). Some of the behaviors learned are surprisingly sophisticated and human-like.

The near-term feasibility of self-driving cars depends on the limits of current machine learning approaches. This article is about using reinforcement learning to solve path planning and driving policy.

Introduction: Are self-driving cars science fiction?

I like to break down the problem of developing a self-driving car into three parts:

Computer vision: Using cameras (sometimes supplemented with other sensors) to detect objects, identify drivable roadway, read traffic signs, and so on.

Behavior prediction: Predicting what other road users (i.e. vehicles, pedestrians, and cyclists) are going to do a few seconds in advance.

Path planning and driving policy: Making good driving decisions and executing driving tasks well, given the input from the computer vision and behavior prediction modules.

The near-term tractability of these three sub-problems will determine Tesla’s (TSLA) future. If self-driving taxis are science fiction, as some analysts believe, then Tesla’s main business model will be selling cars, just like it has done for the past decade. If these sub-problems are within reach of being solved, then Tesla will become a radically different company and pretty much every existing valuation model will be out the window.

Ark Invest is one of the few analyst firms that has a self-driving taxi model for Tesla. For 2023, Ark’s model projects a $6,000 share price and $1.4 trillion market cap for Tesla, based on a fleet of 5 million self-driving taxis generating $52 billion in annual cash flow. Other firms have made estimates in a similar ballpark. UBS (UBS), for instance, forecasts $2 trillion in revenue for the self-driving taxi supply chain in 2030.

If the technology can’t be commercialized, then self-driving taxis won’t generate any revenue. But if it can, then I don’t see any economic reason why these sorts of estimates would be way off base. Whether the technical challenges of developing a self-driving car are solvable in the near term is therefore a multi-trillion-dollar question. That’s the question I’m trying to answer for myself.

In this article, I’ll explore the third part of the self-driving car problem: Path planning and driving policy. In particular, I’ll explore whether path planning and driving policy can be solved with reinforcement learning. (If you’re interested in computer vision or behavior prediction, I touched on those topics in a previous article. If you’re specifically interested in the LIDAR debate, I recommend this video.)

Reinforcement learning, explained

Reinforcement learning is an intuitive concept because it’s essentially just trial and error. A robot (like a self-driving car) or a virtual agent (like a self-driving car in a simulation) takes an action. It then attempts to determine how good or bad the outcome of that action is. This is what’s known as the reward. The reward, in turn, influences the probability that the agent will take that action again.

The actions that an agent takes initially can be selected randomly. Alternatively, the agent can learn those initial actions via imitation learning (also known as learning from demonstration). In reinforcement learning, an agent selects an action because in the past it led to a good outcome (or high reward). In imitation learning, an agent selects an action because it previously observed humans taking that action in a similar situation. Imitation learning can be used to “bootstrap” reinforcement learning by providing a non-random set of actions to try at first, learned from watching humans.
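To make the trial-and-error loop and the bootstrapping idea concrete, here is a minimal Python sketch. It is a toy and not anything DeepMind or Tesla has described: the three-action set, the `bootstrap_from_demos` and `reinforce` helpers, the learning rate, and the reward function are all assumptions made for illustration. The update is a standard gradient-bandit style preference update, chosen only because it is short.

```python
import math
import random

ACTIONS = ["left", "straight", "right"]  # hypothetical, simplified action set


def softmax(prefs):
    """Turn preference scores into action probabilities."""
    top = max(prefs.values())
    exps = {a: math.exp(p - top) for a, p in prefs.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def bootstrap_from_demos(demos, strength=1.0):
    """Imitation step: prefer actions in proportion to how often humans took them."""
    prefs = {a: 0.0 for a in ACTIONS}
    for action in demos:
        prefs[action] += strength / len(demos)
    return prefs


def reinforce(prefs, reward_fn, steps=2000, lr=0.1, seed=0):
    """Trial and error: try an action, observe the reward, shift the preferences."""
    rng = random.Random(seed)
    prefs = dict(prefs)
    baseline = 0.0  # running average reward, used as the comparison point
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        action = rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]
        reward = reward_fn(action)
        baseline += (reward - baseline) / t
        for a in ACTIONS:
            chosen = 1.0 if a == action else 0.0
            # Better-than-average outcomes make the chosen action more likely.
            prefs[a] += lr * (reward - baseline) * (chosen - probs[a])
    return prefs


# Start from human demonstrations instead of uniform random actions.
start = bootstrap_from_demos(["straight"] * 8 + ["left"] * 2)
learned = reinforce(start, lambda a: 1.0 if a == "straight" else 0.0)
print(softmax(start))
print(softmax(learned))
```

In this toy, the demonstrations only tilt the starting probabilities; the reward signal does the rest. A real agent would condition its actions on the state of the world, which this sketch leaves out entirely.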

This bootstrapped approach is what led DeepMind, a subsidiary of Alphabet (GOOG, GOOGL), to a dramatic victory over two professional StarCraft players.

DeepMind’s agent, AlphaStar, learned from a database of millions of StarCraft games played by humans. This was necessary because the researchers found that starting with random actions led to the agent getting stuck on a suboptimal strategy. The basic premise of StarCraft is that the player uses workers to mine resources, and then uses resources to construct buildings, which in turn use resources to produce military units. Military units are then used to attack and ideally destroy the opponent’s base. DeepMind’s non-bootstrapped agent didn’t grasp this premise. Instead, it sent its workers over to the enemy base to attack.

DeepMind’s victory is a significant achievement because AI researchers have been trying to solve StarCraft for about a decade. StarCraft is arguably much more challenging than Go, which DeepMind previously solved with AlphaGo. A Go player chooses between a few hundred possible moves per turn. DeepMind estimates that, in StarCraft, there are 100 septillion possible actions at any moment in time. Moreover, unlike Go, in StarCraft a player can’t see what actions their opponent is taking without sending a unit across the game map to scout.

Another recent success of reinforcement learning is OpenAI Five, which has defeated multiple professional Dota teams. Unlike AlphaStar, OpenAI Five didn’t make any use of imitation learning. It learned remarkably advanced behaviors from reinforcement learning alone.

Reinforcement learning for self-driving cars

Reinforcement learning has two major drawbacks that make it difficult to apply with real-world robots, as opposed to virtual agents in a video game:

Brittleness: An agent has a very limited ability to generalize beyond the situations it has trained on. Anything too new or different will stump it.

Sample inefficiency: An agent typically needs a massive amount of experience to learn anything.

Brittleness and sample inefficiency are interrelated. An agent’s inability to generalize much beyond its past experience is at least partly why it takes so much experience to learn how to do something.

OpenAI Five played Dota continuously for 45,000 years (sped up and in parallel on many computers) in order to reach the level of skill where it could defeat professional teams. Given an average game duration of thirty-five minutes, that’s 675 million games of Dota.

DeepMind hasn’t said exactly how many years of experience AlphaStar accrued, but I believe it trained 300 versions of its agent for 200 years each, so that comes out to 60,000 years of continuous play. If a typical StarCraft game is 15 minutes long, then that translates to 2.1 billion games.
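The conversion from years of play to number of games is simple arithmetic, sketched below in Python. The 365.25-day year is my assumption, and the 300 × 200 years figure for AlphaStar is the estimate given above rather than a number confirmed by DeepMind.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumed year length: 525,960 minutes


def games_played(years_of_play, minutes_per_game):
    """Number of games that fit into a given span of continuous play."""
    return years_of_play * MINUTES_PER_YEAR / minutes_per_game


dota_games = games_played(45_000, 35)          # OpenAI Five
starcraft_games = games_played(300 * 200, 15)  # AlphaStar, estimated

print(f"{dota_games / 1e6:.0f} million Dota games")            # 676 million
print(f"{starcraft_games / 1e9:.1f} billion StarCraft games")  # 2.1 billion
```

Both results land within rounding of the figures quoted above.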

Accruing experience at this kind of scale is difficult in the real world. To get 10,000 years of continuous driving experience, a fleet of 1,000 cars driving for twelve hours per day would need to operate for 20 years.

This is where Tesla’s fleet comes in handy. Tesla has approximately 450,000 cars with the newest generation of its sensor suite, which Tesla says is designed for full self-driving. Assuming that Tesla drivers spend about one hour driving per day, this means Tesla’s fleet does roughly 1,500 years of continuous driving per month, or about 18,000 years of continuous driving per calendar year. For Tesla, unlike companies with fleets of just a few hundred vehicles, it's possible to do real-world reinforcement learning at scale.
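The fleet comparison can be checked with a few lines of Python. The function name is mine, and the inputs (1,000 cars at twelve hours per day, 450,000 cars at one hour per day) are the assumptions stated in the text, not measured data.

```python
def continuous_years_per_calendar_year(cars, hours_per_car_per_day):
    """Years of continuous (24 hours a day) driving a fleet logs in one calendar year."""
    # Each car contributes hours_per_car_per_day / 24 of a continuous year per year.
    return cars * hours_per_car_per_day / 24


small_fleet = continuous_years_per_calendar_year(1_000, 12)
tesla_fleet = continuous_years_per_calendar_year(450_000, 1)

print(10_000 / small_fleet)  # 20.0 calendar years to reach 10,000 years of driving
print(tesla_fleet / 12)      # 1562.5 years of continuous driving per month
print(tesla_fleet)           # 18750.0 years of continuous driving per calendar year
```

The exact results are slightly higher than the rounded figures of 1,500 per month and 18,000 per year used in the text.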

For now, most of that driving is fully manual, with Autopilot off. However, that manual driving can be used for imitation learning. As with AlphaStar, Tesla can use imitation learning to bootstrap reinforcement learning. As more and more driving functions become automated via imitation learning, reinforcement learning can be increasingly used.

In an interview in April, Elon Musk may have alluded to Tesla’s planned use of reinforcement learning:

...let's say we're trying to figure out what is the optimal spline for traversing an intersection. Then the ones where there are no interventions are the right ones. … And really like the way to look at this as view all input as error.

One straightforward way to define the reward for autonomous driving is distance travelled or time elapsed between human interventions. The notion of treating input as error suggests that Musk was referring to the trial and error approach of reinforcement learning.
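As a rough illustration of that reward definition, the Python sketch below scores a trip by the average distance driven between human interventions. The function names and the choice of averaging are hypothetical; nothing here reflects how Tesla actually defines or computes reward.

```python
def miles_between_interventions(trip_miles, intervention_marks):
    """Split a trip into stretches that each end at an intervention or at trip end."""
    marks = sorted(m for m in intervention_marks if 0 <= m <= trip_miles)
    segments, start = [], 0.0
    for mark in marks:
        segments.append(mark - start)
        start = mark
    segments.append(trip_miles - start)
    return segments


def trip_reward(trip_miles, intervention_marks):
    """Hypothetical reward: mean miles driven per uninterrupted stretch."""
    segments = miles_between_interventions(trip_miles, intervention_marks)
    return sum(segments) / len(segments)


print(trip_reward(30, [5, 20]))  # 10.0, the human took over twice
print(trip_reward(30, []))       # 30.0, no interventions
```

Fewer interventions over the same distance produce a higher score, which is the sense in which all driver input counts as error.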

In a recent blog post, Alex Kendall, CTO of the self-driving car startup Wayve, wrote the following about applying reinforcement learning to real-world robots:

Autonomous driving is the ideal application to work on. Here’s why; the action space is relatively simple. Unlike difficult strategy games like DOTA, driving does not require long term memory or strategy. At a basic level, the decision is either left, right, straight or stop. The counter point to this is that the input state space is very hard, but computer vision is making remarkable progress here.

A similar view is held by Jeff Schneider, a robotics professor at Carnegie Mellon and a former engineering lead at Uber ATG (UBER), Uber’s self-driving car division. In a talk at Carnegie Mellon, Schneider first emphasized the need for greater sample efficiency:

This is especially important for self-driving cars because our problem is the long tail, which means you’re just not going to get very many examples of the things you need your system to learn. It’s critical if reinforcement learning is going to have a chance to really solve this.

Schneider then echoed some of Kendall’s points:

But not everything is bleak. The problem is not quite as hard as some of the reinforcement learning problems that we’re used to. So, with self-driving cars you have dense rewards, you have a constant reward signal, you have very modest time horizons — you really only have to look ahead a few seconds to drive well — and it’s the rare events that you’re trying to go after. It’s not doing a maze. It’s not playing Montezuma’s Revenge and trying to find the magic incantation and sequence of hallways to go through that hits the sparse reward at the end. It’s not that at all. It’s very dense rewards, short time horizons. We just have to do it efficiently because it’s hard to get data for it.

AlphaStar’s reward was “sparse” because reward was defined simply as winning a game, something that might only occur every 15 minutes of play time. By contrast, if the reward for a self-driving car is defined as time or distance between human interventions, then the reward is determined on a moment-by-moment basis.

AlphaStar has to plan its actions over the course of a game that might last upwards of 20 minutes. A self-driving car only needs to plan long enough to complete the task at hand, such as completing a turn at an intersection. So, a self-driving car’s time horizon might be just a few seconds.

In those respects (although not in all others), the games that have been solved with imitation learning and reinforcement learning are more difficult than autonomous driving.

Shadow mode

Musk has long discussed a concept called “shadow mode,” wherein self-driving software runs passively on the car’s computer, without actually controlling it. Stuart Bowers, a VP of Engineering at Tesla, elaborated on the concept at Tesla’s recent Autonomy Day event:

When we initially have some algorithms we want to try out, we can put them on the fleet, and we can see what they would have done in a real world scenario…

Bowers showed an example of how this works, which you can see at 2:55:49 in the Autonomy Day video.

Shadow mode is one potential way to intelligently filter what data gets uploaded from the fleet. If Tesla’s driving agent is running passively on the car’s computer, and if its decision disagrees with the human driver’s decision, that disagreement could be used to trigger a data upload. (That is, data from the few seconds before and after the disagreement is stored until the car connects to Wi-Fi.)
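A disagreement trigger of this kind could be sketched as follows. This is a guess at the general shape of the idea, not Tesla's implementation: the `Frame` fields, the steering-angle comparison, the 10-degree threshold, and the three-frame window on either side are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    t: float            # seconds since the start of the trip
    human_steer: float  # steering angle the driver actually applied, in degrees
    agent_steer: float  # steering angle the passive agent would have applied


def clips_to_upload(frames, threshold_deg=10.0, before=3, after=3):
    """Return the frames surrounding each human/agent disagreement."""
    clips = []
    i = 0
    while i < len(frames):
        frame = frames[i]
        if abs(frame.human_steer - frame.agent_steer) > threshold_deg:
            start = max(0, i - before)
            end = min(len(frames), i + after + 1)
            clips.append(frames[start:end])
            i = end  # jump past this clip so one event is not uploaded twice
        else:
            i += 1
    return clips


# Ten frames at half-second intervals with one disagreement at index 5.
trip = [Frame(t=i * 0.5, human_steer=0.0, agent_steer=0.0) for i in range(10)]
trip[5].agent_steer = 25.0
for clip in clips_to_upload(trip):
    print([f.t for f in clip])  # [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
```

Only the clip around the disagreement is kept; the rest of the trip, where the agent and the driver agreed, is never queued for upload.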

Uploads don’t have to include raw video data (although they might). A car can just upload what’s known as a replay log: an abstracted representation of the scene output by the computer vision module.
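Tesla has not published a replay log format, so the following Python sketch is purely hypothetical. Every field name (`ego_speed_mps`, `lane_offset_m`, the `TrackedObject` type, and so on) is invented to show how a scene might be reduced to a few numbers per object instead of raw video.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TrackedObject:
    kind: str         # e.g. "car", "pedestrian", "cyclist"
    x_m: float        # position ahead of the ego vehicle, in meters
    y_m: float        # lateral position, in meters
    speed_mps: float  # estimated speed, in meters per second


@dataclass
class ReplayFrame:
    t: float              # seconds since the start of the clip
    ego_speed_mps: float  # the car's own speed
    lane_offset_m: float  # distance from the lane center
    objects: List[TrackedObject] = field(default_factory=list)


frame = ReplayFrame(
    t=0.0,
    ego_speed_mps=12.5,
    lane_offset_m=-0.2,
    objects=[TrackedObject("cyclist", x_m=18.0, y_m=1.5, speed_mps=4.0)],
)

# One frame serializes to a short JSON string rather than megabytes of video.
encoded = json.dumps(asdict(frame))
print(encoded)
```

A clip would be a list of such frames, which is far cheaper to store and upload than the camera footage it summarizes.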

Shadow mode could be used to identify failures of imitation learning, since disagreements between the imitation agent and human drivers are ostensibly failures. Under a reinforcement learning approach, the reward in shadow mode could be defined as time or distance between disagreements. This is analogous to defining reward based on driver disengagements when the agent is controlling the vehicle.

When disagreements between the agent and human drivers are very few, that's evidence that a new feature is ready to deploy. Once a feature begins actually controlling vehicles, driver disengagements (or perhaps some other reward) can be used to further train the agent.

The unanswered question

Companies like Waymo and Uber ATG, which have small fleets numbering in the hundreds, have two main challenges in getting reinforcement learning off the ground. First and foremost, there's the sample inefficiency problem: Their fleets are too small to obtain the massive scale of experience that has been required to solve games like StarCraft and Dota. Simulation is not a solution to this problem because simulations lack important empirical information about real-world driving, such as how human drivers behave. Real-world experience is necessary.

Second, these companies lack access to vast numbers of human driving demonstrations that could be used to bootstrap reinforcement learning via imitation learning. Starting from random actions is too dangerous to do in the real world. It may be possible to bootstrap with driving behaviors that are programmed by hand. However, this seems to me like a far more difficult route than imitation learning. Humans are capable of doing much more than we are capable of rigorously explaining or decomposing into code. For instance, we can walk and talk effortlessly, but explaining how is a hard scientific problem, let alone engineering a machine that can do the same. Imitation learning circumvents this difficulty by simply watching human drivers and copying their behaviors.

Let’s suppose that by sampling from tens of thousands of years of continuous human driving, Tesla’s agent learns to drive like a human via imitation learning. Then, Tesla’s agent improves on its performance via reinforcement learning by making use of tens of thousands of years of partially autonomous driving under human supervision. Are path planning and driving policy therefore solved? Not necessarily.

Reinforcement learning still suffers from brittleness: The inability to generalize much beyond past experience. The same is true for imitation learning. Rare, long-tail events that continually recur can, in theory, be solved with a large fleet. On the other hand, new situations that are unlike anything the agent has been trained on are an intractable problem. They can’t be solved using any existing machine learning technique. If new situations are too frequent, then self-driving cars are impossible with the techniques we have today.

This is an open, empirical question. If new situations arise every thousand miles, then the problem can't be solved until machine learning researchers make some new breakthroughs. If new situations arise every million miles, then the problem may well be solvable using only current methods.

Conclusion

StarCraft and Dota would be unsolvable with existing machine learning techniques if researchers were restricted to using as little data and experience as can be gleaned from a few hundred self-driving car prototypes. Even at an average speed as low as five miles per hour, Waymo’s approximately 15 million miles of autonomous driving would only amount to about 350 years of continuous driving. That’s a far cry from the tens of thousands of years of play that AlphaStar and OpenAI Five trained on. If the path planning and driving policy component of autonomous driving seems unsolvable, maybe that’s only because so far companies have been working with one arm tied behind their back.
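The Waymo comparison works out as follows. The five miles per hour average speed is the deliberately low assumption from the text, and the 365.25-day year is mine.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumed year length: 8,766 hours


def continuous_driving_years(miles, average_mph):
    """Convert total miles at an average speed into years of nonstop driving."""
    return miles / average_mph / HOURS_PER_YEAR


waymo_years = continuous_driving_years(15_000_000, 5)
print(f"{waymo_years:.0f} years")                              # 342 years
print(f"{45_000 / waymo_years:.0f}x less than OpenAI Five")    # 131x
print(f"{60_000 / waymo_years:.0f}x less than AlphaStar")      # 175x
```

A higher and more realistic average speed would shrink Waymo's total further, so the gap of roughly two orders of magnitude is a conservative estimate.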

As Tesla continues to develop its self-driving software stack, including its computer vision and behavior prediction modules, it will gain an increasing ability to filter, automatically label, collect, and train on data from its hundreds of thousands of cars. This process, and not what Waymo or Uber ATG are doing, will test the limits of imitation learning and reinforcement learning for self-driving cars.

There's no guarantee that fully autonomous driving is a solvable problem using current machine learning techniques. For me, the most troubling uncertainty is the frequency of new situations that will expose the brittleness of these techniques. I can’t think of any way to ascertain the frequency of new situations except to simply try applying imitation learning and reinforcement learning at scale. We’ll only know the rate of new situations is low enough if this succeeds in producing a self-driving car that can empirically drive well over billions of miles.

