Date: 2024-06-24
Daniel Chetroni

Energy cost is a bottleneck issue

While generative AI offers seemingly unlimited potential, the systems are costly – at least at the current stage. And a good understanding of their cost is key to widespread adoption. A crucial bottleneck involves energy consumption, which leads to the thesis of this article.

In this article, I will argue that Nvidia Corporation’s (NASDAQ:NVDA) next-generation platform, Blackwell, finally addresses this key issue. Blackwell boasts significant improvements in cost and energy efficiency compared to its predecessor, Hopper. As detailed later, these improvements lay the groundwork for much wider deployment of NVDA chips, especially when combined with its maturing software ecosystem (see my earlier article entitled Nvidia: Let's Talk About Software).

I assume readers are familiar with the various issues facing AI’s expansion, ranging from technical to legal and ethical. Among these, a less often discussed issue is energy cost. The amount of energy consumed by modern AI applications is mind-boggling. The energy cost has been growing exponentially and, in my view, will become unsustainable without more efficient chips. For example, each of Nvidia's H100 chips draws 700W of power at peak operation. That is on the same order as the average power draw of an entire American household. Collectively, NVDA’s high-performance AI chips are estimated to consume more energy than many small nations, as illustrated by the chart below. Even a single data center, which typically employs thousands to tens of thousands of these chips, can have a peak power consumption that overloads the power grid of a region or even a state, according to the following remarks from Microsoft engineers (see the second chart below).
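To put the 700W figure in perspective, here is a rough back-of-the-envelope sketch in Python. The chip counts are illustrative picks from the "thousands to tens of thousands" range above. The assumption that every chip runs at its peak around the clock is mine and overstates real usage, and the numbers exclude cooling and other data center overhead.

```python
# Back-of-the-envelope power math for H100-class chips.
# Assumption (mine, not from any vendor data): every chip draws its
# 700 W peak continuously, with no cooling or networking overhead.
H100_PEAK_WATTS = 700
HOURS_PER_YEAR = 24 * 365


def fleet_peak_megawatts(num_chips, watts_per_chip=H100_PEAK_WATTS):
    """Peak electrical draw of a fleet of chips, in megawatts."""
    return num_chips * watts_per_chip / 1_000_000


def chip_annual_kwh(watts_per_chip=H100_PEAK_WATTS):
    """Energy one chip would use in a year if it ran at peak nonstop, in kWh."""
    return watts_per_chip * HOURS_PER_YEAR / 1_000


if __name__ == "__main__":
    # Illustrative fleet sizes for a single data center.
    for chips in (1_000, 10_000, 50_000):
        print(f"{chips:>6} chips -> {fleet_peak_megawatts(chips):.1f} MW peak")
    print(f"One chip at peak all year: {chip_annual_kwh():,.0f} kWh")
```

Even under these simplified assumptions, a data center with 10,000 chips would draw about 7 MW for the chips alone, before any cooling load is counted.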

The issue does not stop here. Out of all the energy consumed, only a tiny fraction is used for the actual computing and the dominant portion of it becomes heat, which then needs to be dissipated to prevent overheating and system failure (this is where my professional training lies). So energy-efficient AI chips, by reducing the energy consumption at the source, can have ripple effects to substantially lower the overall energy consumption for both computation and cooling.

Next, I will explain how NVDA’s Blackwell addresses these issues – in a very effective way.

NVDA stock: Blackwell’s 25x energy efficiency boost

NVDA has recently begun rolling out its next-generation platform, Blackwell, which promises significant performance upgrades as well as cost and energy savings. The performance upgrades have been the focus of many other articles, so I won’t add to them here and will concentrate on the energy aspect instead. Nvidia claims that Blackwell can run workloads involving massive AI models at up to 25x lower cost and energy consumption than Hopper (see its benchmarking data in the next chart below), which translates into a significant reduction in power consumption for a given level of performance.

I expect these improvements not only to help sustain NVDA’s dominant role in this space, but also to make the use of its chips economically sustainable. Buying its chips is expensive, but running them is even more so. For a budget breakdown, I gathered the following inputs from this Forbes report to help readers see the overall picture (slightly edited, with emphases added by me):

The cost of the chips amounted to millions of dollars. According to a technical overview of OpenAI’s GPT-3 language model, each training run required at least $5 million worth of GPUs.

These models require many, many training runs as they are developed and tuned, so the final cost is far in excess of this figure. When asked at an MIT event in July whether the cost of training foundation models was on the order of $50 million to $100 million, OpenAI’s cofounder Sam Altman answered that it was “more than that” and is getting more expensive.

The cost doesn’t end there. Running inference on the models, once trained, is also expensive. Estimates suggest that in January 2023, ChatGPT used nearly 30,000 GPUs to handle hundreds of millions of daily user requests. Sajjad Moazeni, a University of Washington assistant professor of electrical and computer engineering, says those queries may consume around 1 GWh each day.

The average cost of electricity in the U.S. falls in the range of 12–17 cents per kilowatt-hour (kWh), depending on the source you consult. I pay about 14 cents, so I will plug this number in. At this rate, 1 GWh (1 million kWh) each day translates into a cost of $140,000 per day and about $51 million per year. That is about 10x the chips’ upfront cost and almost on par with the cost to train the model itself.
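The arithmetic above can be checked with a short Python sketch. It uses only the figures already quoted (1 GWh per day, 12 to 17 cents per kWh, and the $5 million GPU figure per training run); the function and constant names are my own, chosen for illustration.

```python
# Rough check of the electricity bill implied by 1 GWh of inference per day.
KWH_PER_GWH = 1_000_000
GPU_COST_PER_TRAINING_RUN = 5_000_000  # USD, the "at least" figure quoted above


def electricity_cost(gwh_per_day, usd_per_kwh):
    """Return (daily_cost, annual_cost) in USD for a given daily energy use."""
    daily = gwh_per_day * KWH_PER_GWH * usd_per_kwh
    return daily, daily * 365


if __name__ == "__main__":
    # Low end, my own rate, and high end of the quoted U.S. price range.
    for rate in (0.12, 0.14, 0.17):
        daily, annual = electricity_cost(1.0, rate)
        ratio = annual / GPU_COST_PER_TRAINING_RUN
        print(f"${rate:.2f}/kWh: ${daily:,.0f}/day, "
              f"${annual / 1e6:.1f}M/year, {ratio:.1f}x the GPU figure")
```

At 14 cents, this reproduces the $140,000 per day and roughly $51 million per year cited above. The loop also shows how sensitive the bill is to the local electricity rate.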

With a 25-fold reduction (plus the ripple effects on cooling mentioned earlier), I view this as a game changer. And I don’t use this term often. I think this is the first time I have used it to describe a development in all my Seeking Alpha articles. It fundamentally changes the budget allocation for end users.
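As a rough illustration of what such a reduction could mean for the budget, the sketch below applies several reduction factors to the annual electricity bill of 1 GWh per day at 14 cents per kWh. The 25x case assumes Nvidia's "up to" claim applies in full to this workload, which is a best-case assumption on my part; the smaller factors are hypothetical and included only for sensitivity.

```python
# Sensitivity of the annual electricity bill to an energy-efficiency gain.
# Baseline: 1 GWh/day (1,000,000 kWh) at $0.14 per kWh, 365 days a year.
BASELINE_ANNUAL_USD = 1_000_000 * 0.14 * 365


def cost_after_reduction(baseline_usd, reduction_factor):
    """Annual cost if the same work needs `reduction_factor` times less energy."""
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    return baseline_usd / reduction_factor


if __name__ == "__main__":
    # 1 = no change; 5 and 10 are hypothetical; 25 is Nvidia's best-case claim.
    for factor in (1, 5, 10, 25):
        cost = cost_after_reduction(BASELINE_ANNUAL_USD, factor)
        saved = BASELINE_ANNUAL_USD - cost
        print(f"{factor:>2}x reduction: ${cost / 1e6:.2f}M/year "
              f"(saves ${saved / 1e6:.2f}M)")
```

Under the best-case 25x assumption, the roughly $51 million annual bill falls to about $2 million, which is below the $5 million GPU figure quoted earlier.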

And Blackwell’s successor, dubbed Rubin, is already in the pipeline and penciled in for a 2026 launch. I expect Rubin to improve further upon Blackwell in terms of performance and energy saving thanks to its various technological improvements (e.g., Vera CPU, HBM4 memory, etc.) and to further widen NVDA’s moat.

Other risks and final thoughts

In terms of downside risks, valuation risk is a top one. This is a risk that many bears tend to emphasize, and for good reasons. Notably, as illustrated by the next chart below (a chart often quoted to highlight the valuation risks), NVDA is currently trading at close to 40x its sales, a multiple Cisco Systems (CSCO) enjoyed at the peak of the dot-com bubble. What happened to Cisco, which is history now, as the saying goes, could certainly happen to NVDA.

However, this chart fails to show a crucial difference: NVDA is actually making a lot of money now, far more than CSCO did at the time. CSCO’s P/E ratio was over 200x, as seen in the second chart, compared to NVDA’s 47x (on an FWD basis).

Another downside risk is intensifying competition. Besides the pressure from other chip companies (such as AMD and Intel), many non-chip companies (such as Apple, Google, Meta, etc.) are also developing their own AI chips. Among the chip companies, AMD’s R&D budget has more than doubled in the past 3 years, as seen in the chart below, outpacing NVDA. Notably, the acquisition of Xilinx significantly strengthens AMD’s market presence with critical IP and the potential to expand into adaptive computing markets. Xilinx is a leader in (and the inventor of) FPGA (field-programmable gate array) chips, which I consider a promising technology in the AI era too.

The top advantage of FPGA chips in my mind is flexibility: the flexibility to be reprogrammed or functionally upgraded even after they are manufactured and installed. This is a key strategic advantage considering the costs of advanced chips and the quickly evolving computing landscape.

All told, despite the competitive landscape and a high valuation (although note that AMD trades at about 46x FWD P/E too), my verdict is that a BUY thesis can be justified for Nvidia Corporation stock given the growth catalysts. As argued in this article, I view the 25x energy saving from Blackwell as a game changer, paving the way for widespread deployment of NVDA’s chips. A good pipeline of new products, such as the software ecosystem and the Rubin platform, helps to sustain the growth trajectory.