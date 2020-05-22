Intel will one-up Ampere within 1.5 years with Ponte Vecchio: on a 2x denser process with small, high-yielding chiplets and other innovations.

Nvidia announced its newest Ampere architecture and GPU with much fanfare after three years of Volta. The chip measures 826mm2, as large as any monolithic chip can be.

Nvidia (NASDAQ:NVDA) recently announced Ampere after three successful years with Volta in the data center. Its newest GPU architecture will be featured first in the A100 data center GP-GPU.

Nvidia has used pretty much all of the available silicon real estate for the 826mm2 chip with 54 billion transistors, as it is close to the so-called reticle size limit. This (likely) makes it a bit larger than the Xilinx (XLNX) Versal Premium, which has about 50 billion transistors.
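To put those two numbers in perspective, here is a rough sketch of the average transistor density they imply, and of what a process that is "2x denser" would mean on the same basis. It is a chip-wide average only, not a statement about peak logic density on either process.

```python
# Average transistor density of the A100, from the figures quoted above.
a100_transistors = 54e9   # 54 billion transistors
a100_area_mm2 = 826.0     # die size in mm^2

# Million transistors per mm^2, averaged over the whole die.
a100_density = a100_transistors / a100_area_mm2 / 1e6

# Illustrative only: what a 2x denser process would imply on this basis.
doubled_density = 2 * a100_density

print(f"A100 average density: {a100_density:.1f} MTr/mm^2")
print(f"2x denser process:    {doubled_density:.1f} MTr/mm^2")
```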

Furthermore, while Intel (INTC) has been criticized in recent years for the power draw of its 14nm chips, the A100 does not fare much better with its 400W TDP, despite the 7nm process. That said, performance per watt is likely substantially higher than Volta’s, since performance has increased by more than power draw.

While Intel’s work on the Xe HP Arctic Sound GPU and Xe HPC Ponte Vecchio is still underway, and final specifications are unknown, some preliminary comparisons can already be made. Here’s how Intel’s data center GPU chips might stack up against Ampere.

Architecture: numeric format support

Ampere's third-gen Tensor Cores support INT1 all the way up to FP64, a much wider range of numeric formats than Volta. Compared to Turing, FP64 is new. It also adds Google’s (NASDAQ:GOOG) (NASDAQ:GOOGL) BF16, which Intel is also backing, and further adds Nvidia’s new TF32 format.

Ponte Vecchio supports INT8 up to FP64 as well as BF16. So, while it lacks ultra-low precision support, for large-scale commercial deployments, this should not be an issue as INT8 likely remains the standard for inference.

One novelty in Ponte Vecchio is that it also contains SIMD units, the acceleration hardware as implemented in CPUs. Intel’s reasoning is that, by including SIMD units, it can cover a wide range of vector widths. Intel claims that the combination of both units can improve performance by over 2x in certain workloads.

Architecture: FLOPS

The A100 tops out at 20 TFLOPS FP32.

According to leaks, Arctic Sound consists of four chiplets, each with 512 EUs. At a relatively conservative 1.3GHz, this would yield 10.4 TFLOPS (FP32 precision) per chiplet. Four chiplets would thus deliver roughly 40 TFLOPS of ‘classical’ performance. This means Arctic Sound would deliver double the performance of the A100 within a year or so of its launch.
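The arithmetic behind that estimate can be sketched as follows. Note the assumptions, which are not stated in the article: each EU is taken to have 8 FP32 lanes, and each lane is taken to perform one fused multiply-add (2 FLOPs) per clock. With those inputs the result lands within a few percent of the per-chiplet figure quoted above.

```python
# Back-of-the-envelope peak FP32 throughput for a chiplet-based GPU.
# ASSUMPTION: 8 FP32 lanes per EU and 2 FLOPs per lane per clock (one FMA).
# Neither figure comes from the article; both are illustrative defaults.

def peak_tflops(eus: int, clock_ghz: float,
                lanes_per_eu: int = 8, flops_per_lane: int = 2) -> float:
    """Theoretical peak in TFLOPS: EUs * lanes * FLOPs per lane per clock * GHz / 1000."""
    return eus * lanes_per_eu * flops_per_lane * clock_ghz / 1000.0

per_chiplet = peak_tflops(eus=512, clock_ghz=1.3)  # one 512-EU chiplet
four_chiplets = 4 * per_chiplet                    # leaked four-chiplet configuration
a100_fp32 = 20.0                                   # TFLOPS, as quoted in the article

print(f"per chiplet:   {per_chiplet:.1f} TFLOPS")
print(f"four chiplets: {four_chiplets:.1f} TFLOPS")
print(f"ratio vs A100: {four_chiplets / a100_fp32:.1f}x")
```

Changing `clock_ghz` or `lanes_per_eu` shows how sensitive the "double the A100" conclusion is to the final specifications, which remain unknown.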

This (comparison) does not include accelerator hardware such as Tensor Cores. Without such additional hardware, Arctic Sound tops out at 80 TFLOPS FP16, well below even the V100 using Tensor Cores.

The inclusion of the Tensor Cores, and potentially Intel’s own variant in Arctic Sound, makes comparisons difficult, but the tentative victory in 'classical' compute power goes to Intel.

AI acceleration

Ampere introduces the third generation of Nvidia’s Tensor Cores, meant for AI acceleration. Since Turing, they have been targeted at both training and inference of deep neural networks.

The A100 achieves 312 TFLOPS of FP16 acceleration, and half that number using Nvidia’s new TF32 format. For FP16, this implies a 2.5x improvement in performance over the V100.
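As a quick check on these ratios, the sketch below recomputes the generational speed-up and a rough performance-per-watt comparison. The V100 figures (125 TFLOPS of FP16 Tensor Core throughput and a 300W TDP) are assumptions taken from Nvidia's published V100 SXM2 specifications, not from this article.

```python
# Peak Tensor Core throughput in TFLOPS, and TDP in watts.
a100_fp16 = 312.0          # quoted in the article
a100_tf32 = a100_fp16 / 2  # "half that number" for TF32
a100_tdp = 400.0           # quoted in the article

v100_fp16 = 125.0          # ASSUMPTION: published V100 SXM2 Tensor Core peak
v100_tdp = 300.0           # ASSUMPTION: published V100 SXM2 TDP

speedup = a100_fp16 / v100_fp16
perf_per_watt_gain = (a100_fp16 / a100_tdp) / (v100_fp16 / v100_tdp)

print(f"TF32 peak:          {a100_tf32:.0f} TFLOPS")
print(f"FP16 speed-up:      {speedup:.2f}x")
print(f"perf/W improvement: {perf_per_watt_gain:.2f}x")
```

On these peak numbers, throughput grows faster than power draw, which is consistent with the earlier point that performance per watt should improve despite the 400W TDP.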

One thing to note, though, is that theoretical peak performance is just that, theoretical. When Intel announced its Nervana accelerator, one of its claims to success was that it could achieve much higher utilization of its hardware. So, despite having a similar amount of TFLOPS, Intel claimed it could deliver higher performance. Since that time, Nvidia may have improved its software stack, and Ampere might further include changes to help achieve higher utilization, so in the absence of benchmarks, the FLOPS number is the best way to compare across vendors.

For Intel, its Xe HPC architecture introduces a ‘data parallel matrix engine’, Intel’s equivalent of Tensor Cores. As Arctic Sound is based on Xe HP, it is unlikely that this will find its way into this product.

Based on Intel’s claims, we can do some basic math. Intel claims up to 32x vector rate per EU. This is likely a comparison of FP16 vs. INT8, the smallest supported formats of the regular EUs and the matrix engine, respectively. Since INT8 typically runs at twice the FP16 rate, the relative improvement for FP16 might be 16x.

If we assume that Arctic Sound does have this matrix engine, then 16x 80 TFLOPS would be 1.3 PFLOPS, roughly 4x as much as the A100. As I said though, it’s unlikely/unconfirmed that Xe HP includes this engine, but it may give some indications about what could be expected from Ponte Vecchio.
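The extrapolation can be written out as a short sketch. Every input is either a figure from the article or one of its assumptions: that the 32x claim compares FP16 on regular EUs against INT8 on the matrix engine, that INT8 runs at twice the FP16 rate, and that a chip with Arctic Sound's 80 TFLOPS FP16 would carry the engine at all.

```python
# Hypothetical matrix-engine throughput, following the article's reasoning.
claimed_gain_int8 = 32.0   # Intel's "up to 32x vector rate per EU"
int8_vs_fp16_rate = 2.0    # ASSUMPTION: INT8 runs at twice the FP16 rate
fp16_gain = claimed_gain_int8 / int8_vs_fp16_rate

baseline_fp16 = 80.0       # TFLOPS, Arctic Sound without a matrix engine
a100_fp16 = 312.0          # TFLOPS, A100 Tensor Cores

# Speculative: Xe HP is not confirmed to include the matrix engine.
hypothetical_fp16 = fp16_gain * baseline_fp16
ratio_vs_a100 = hypothetical_fp16 / a100_fp16

print(f"FP16 gain:         {fp16_gain:.0f}x")
print(f"hypothetical peak: {hypothetical_fp16 / 1000:.2f} PFLOPS")
print(f"ratio vs A100:     {ratio_vs_a100:.1f}x")
```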

Lastly, Intel claims this engine delivers a 40x improvement in FP64 performance per EU. The A100’s Tensor Cores achieve 20 TFLOPS FP64, while the CUDA cores reach 10 TFLOPS; a speed-up of just 2x. This suggests Ponte Vecchio will have the uncontested lead in FP64 performance.

Summing up, Ponte Vecchio’s data parallel matrix engine, tentatively, might give the A100’s Tensor Cores a hard time.

Interconnect

For inter-GPU interconnection:

Nvidia has NVSwitch.

Intel has Xe Link, based on the CXL standard.

Other features

As already mentioned, a differentiating feature for Intel is the inclusion of larger-width CPU SIMD units. Nvidia’s differentiator is its support for lower-precision formats.

Intel further has a Rambo Cache. In Ponte Vecchio, there is one such chip per two compute chiplets. It has high bandwidth and serves to sustain peak performance across all matrix sizes. Intel claims that, in general, performance falls off as matrix sizes become larger.

The Rambo Cache also contains Xe Memory Fabric hardware, Intel’s intra-GPU interconnect. Intel says it is scalable to thousands of EUs, so as Moore’s Law progresses throughout the decade, Intel should be able to keep scaling its Xe architecture and its successors to ever more transistors.

In any case, Ponte Vecchio might be the most feature-complete GPU when it launches.

Time to market

Ampere A100 is in production, according to Nvidia, so general availability is likely to follow in the second half of the year.

Intel’s 10nm Arctic Sound might see an early 2021 launch, at best. The 7nm Ponte Vecchio is slated for Q4 2021.

So, Nvidia has a considerable lead in time to market. However, as Ponte Vecchio is built on Intel’s 7nm process, which is likely to fall between TSMC’s 5nm and 3nm processes, a lag of less than 1.5 years might be considered competitive, especially given some of the preliminary specs discussed.

Summary

With Ampere, Nvidia has unified its Turing and Volta architectures, which served inference and training in the data center – as well as other graphics workloads.

Intel, on the other hand, has created one architecture, Xe, with three derivative micro-architectures: Xe LP, Xe HP and Xe HPC.

In the data center, this means there will be two GPUs to go up against Ampere. Arctic Sound and Ponte Vecchio seem set to launch in the first and second half of 2021, respectively. This will give them a lag in time to market, but they seem to pack advanced technology and features to make up for this.

While Arctic Sound may lack the data parallel matrix engine, Intel’s equivalent of Tensor Cores, it consists of up to four chiplets. This should give Intel an inherent manufacturing cost advantage. It also gives Intel the ability to go beyond the ~850mm2 monolithic silicon area limit that Ampere faces.

Indeed, on its general-purpose CUDA cores, the A100 tops out at 20 TFLOPS FP32, while Arctic Sound may go up to ~40 TFLOPS, assuming four 512EU chiplets.

Ponte Vecchio, within 1.5 years from now, will go even further. It has 4x as many chiplets as Arctic Sound, although likely smaller ones, on its 7nm process node, which doubles the transistor density compared to the A100. It has some other unique features such as SIMD units, Rambo Cache and the Xe Link, which is based on the CXL standard.

Scorecard:

Numeric format support: Nvidia

AI acceleration: Ponte Vecchio?

Interconnect: Similar?

Other features: Ponte Vecchio

Time to market: Ampere vs. Arctic Sound, but Ponte Vecchio is likely to arrive before Nvidia's 5nm successor.

Takeaway

At 54 billion transistors and 826mm2 in size, the A100 may seem like a chip that will dominate the data center for years to come.

However, the landscape has changed drastically since Volta in 2017. Numerous dedicated AI accelerators have entered the market, including Intel’s own Habana effort, and these do not have to carry the burden of legacy graphics hardware.

Also entering the market, as described above, is Intel with its own full-stack Xe effort. Intel will detail more about Xe at HotChips this summer, but what Intel has revealed so far is promising. Most notably, Intel will go above and beyond Nvidia by combining multiple smaller chips in one large-scale GPU. Just like Nvidia, it can then combine multiple GPUs with its Xe Link.

While Intel is a newcomer to dedicated GPUs, it is bringing novel innovations to bear, which are sure to give Nvidia a hard time. This holds even though Intel starts from zero customer momentum and may need several more years to catch up in its software ecosystem, as its oneAPI only launched in beta late last year.

When Ponte Vecchio launches in late 2021, it may be the most advanced GPU on the market, as I also noted in a previous article.

So, in terms of financial implications, it is hard to make forecasts, but once benchmarks are run, it is unlikely that Ampere will still be at the top by the end of 2021. Hence, Nvidia investors should hope that Ampere is on a cycle closer to two years than Volta’s three. To that end, initial 5nm rumors concerning Nvidia have popped up recently.

