Nvidia Releases Its Pascal Weapon

Summary

Nvidia was previously expected to release its Pascal architecture several weeks or months after AMD's Polaris, but Nvidia is already delivering its new products to cloud and HPC customers.

Pascal's new architecture and flagship chip show specifications well above what was expected.

Very high frequency, 16 nm FF+, Async Compute available, 3584 CUDA cores (3840 on the full GP100 die), HBM2 on a 4096-bit interface, 15B transistors, 56 SMs, 28 TPCs, 5.3 FP64 TFLOPS, and 10.6 FP32 TFLOPS.

Tesla P100 (full production), DGX-1, Drive PX2, GP104 (GTX 1080 - in production), GP106 (GTX 1070 - in production), and HBM2.0 - this is a lot of irons in the fire.

P100 is already being delivered to supercomputer and cloud customers, and the GTX 1080 and 1070 are to be released before Computex. AMD has lost the predicted timing advantage.

Nvidia (NASDAQ:NVDA) has recently announced and shown its new Pascal architecture. This architecture looks heavily beefed up: it provides very high performance per watt, shows a 2:1 ratio between FP32 and FP64 performance, seems to have corrected the Async Compute issues and, most importantly, is already on the enterprise market, while it will hit the consumer market between May and June.

Some analysts have been caught off guard, since they expected Pascal to arrive well after AMD's (NASDAQ:AMD) Polaris. But given the actual situation, AMD has lost at least one of the expected advantages: timing.

The Pascal Architecture - Overview

GPU PASCAL GP100

Credits: TechSpot

The Pascal architecture has been highly optimized and tweaked, using the latest technologies and following the Maxwell optimization path. We can compare the GP100 GPU against the older flagship GM200.

  • 16nm - First of all, Pascal is produced by TSMC on its enhanced 16 nm FF+ process. This technology offers 65% higher speed, nearly double the density and 70% lower power consumption in comparison to the old 28 nm HPM technology.

    This made it possible to massively reduce power consumption and increase performance, but the process node accounts for only part of the power efficiency gains.

  • FP32:FP64 ratio - The Streaming Multiprocessor (SM) has been modified to provide a 2:1 ratio between FP32 and FP64 units. This change targets the HPC and enterprise segments, which rely heavily on FP64 calculation. The FP32 units per SM have been reduced to 64, and the CUDA cores of each SM have been divided into 2 partitions, each with 2 dispatch units, a warp scheduler and a large instruction buffer.

    This modification makes the upcoming Pascal enterprise solution a serious contender in the HPC market, which is highly dependent on FP64 calculation.

  • Registers, multi-threading, throughput, CUDA subset - Another tweak is that each SM has the same number of registers as in the Maxwell architecture, but Pascal has half the CUDA cores per SM (and more CUDA cores per GPU). This means each Pascal CUDA core has access to twice the registers, each Pascal GPU can keep more threads in flight, each thread has access to twice the registers and overall throughput is a lot higher.

    Nvidia Pascal SM

    The registers amount to 14 MB, and the L2 cache (shared across the whole GPU) amounts to 4 MB, resulting in double the bandwidth inside the chip compared to the Maxwell GM200. All of this translates into more warps for each instruction scheduler to choose from, more bandwidth per thread to shared memory and more loads to initiate, implying much higher efficiency in code execution.

    The overall result is that each new SM requires less energy and area for data transfer, even compared to the Kepler SMX, so power efficiency improves on this front as well: the new 16 nm FF+ process node is very important, but the various architecture tweaks play a fundamental role too.

  • Frequency, CUDA - The 16 nm FF+ process node enabled Nvidia to push the GPU frequency hard without compromising power consumption: the transistors are able to switch faster, allowing higher frequencies and faster response. As a result, the GP100 shows a frequency increase of about 33% over GM200, along with +17% CUDA cores (that is, FP32 units) and +1767% FP64 units (or +87% against the GK110 Kepler). In terms of power consumption, the added FP64 units weigh about as much as twice their number in FP32 units, and they also eat up a lot of precious die area. Such a high frequency on such a big die makes me think that the consumer graphics cards will run well beyond 1.5 GHz, reaching astonishing performance while still leaving plenty of overclocking potential (2.0 GHz with air cooling looks like an achievable target).
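
The unit counts and percentages in the list above can be cross-checked with a few lines of arithmetic. This is only a rough sketch: the 56 SMs, 64 FP32 units per SM and the 2:1 ratio come from the article, while the 256 KB register file per SM and the GM200 and GK110 unit counts are assumptions taken from commonly published reference specifications, not from the article itself.

```python
# Figures quoted in the article for the Tesla P100 (GP100).
GP100_SMS = 56
GP100_FP32_PER_SM = 64
GP100_FP64_PER_SM = GP100_FP32_PER_SM // 2  # 2:1 FP32:FP64 ratio

# Assumptions (not stated in the article): widely published reference specs.
REGISTER_FILE_KB_PER_SM = 256  # assumed identical for a Maxwell and a Pascal SM
MAXWELL_FP32_PER_SM = 128      # Pascal has half of this per SM
GM200_FP32_UNITS = 3072
GM200_FP64_UNITS = 96
GK110_FP64_UNITS = 960


def percent_gain(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


gp100_fp32_units = GP100_SMS * GP100_FP32_PER_SM
gp100_fp64_units = GP100_SMS * GP100_FP64_PER_SM

# Total register file across the chip, in MB.
register_file_mb = GP100_SMS * REGISTER_FILE_KB_PER_SM / 1024

# Same register file shared by half the cores: registers per core, Pascal vs Maxwell.
registers_per_core_ratio = MAXWELL_FP32_PER_SM / GP100_FP32_PER_SM

print(f"FP32 units: {gp100_fp32_units}, FP64 units: {gp100_fp64_units}")
print(f"Register file: {register_file_mb:.0f} MB")
print(f"Registers per core vs Maxwell: {registers_per_core_ratio:.0f}x")
print(f"FP32 units vs GM200: +{percent_gain(gp100_fp32_units, GM200_FP32_UNITS):.0f}%")
print(f"FP64 units vs GM200: +{percent_gain(gp100_fp64_units, GM200_FP64_UNITS):.0f}%")
print(f"FP64 units vs GK110: +{percent_gain(gp100_fp64_units, GK110_FP64_UNITS):.0f}%")
```

Under these assumptions, the 14 MB register total and the +17%, +1767% and +87% figures all follow directly from the SM layout.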

FP32:FP64 Ratio - Memory Coherency

I have already mentioned that Pascal (at least the professional version) implements 2 FP32 units for every FP64 unit, while Maxwell implements 32 FP32 units for every FP64 unit. This is a great enhancement that will help Nvidia compete in the HPC and server market, where double-precision calculation (FP64) is one of the most important factors.

It is interesting to note that each SM now has 64 CUDA cores - a figure that resembles the AMD GCN architecture. The scheduler has also been tweaked to support advanced scheduling and advanced asynchronous computation. In addition, Pascal is compliant with IEEE 754-2008 single- and double-precision calculations, and it supports Fused Multiply-Add at full speed.

Nvidia has also decided to widen memory coherency, supporting atomic FP64 add instructions in global memory - a big advance over Maxwell's capabilities. This is a coherent choice, given the inherent FP64 potential of the Pascal architecture.

Real FP16 Finally

The Pascal architecture finally implements full-rate FP16 (half-precision) calculation. FP16 calculations are executed at double the rate of FP32 calculations, since the CUDA cores are able to use the FP32 units to perform two FP16 calculations at once, while each 32-bit register is able to store two FP16 values: this translates into pure FP16 performance at twice the FP32 rate.
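
To make the "two FP16 values per 32-bit register" point concrete, here is a small host-side illustration using Python's standard struct module. It only shows the storage layout, not how the GPU actually executes paired FP16 operations; the helper names pack_half2 and unpack_half2 are hypothetical, chosen for this example.

```python
import struct

FP32_TFLOPS = 10.6  # Tesla P100 FP32 peak, from the article


def pack_half2(a, b):
    """Round two values to IEEE 754 half precision and pack them into one 32-bit word."""
    raw = struct.pack("<2e", a, b)      # two 16-bit halves, 4 bytes in total
    return struct.unpack("<I", raw)[0]  # reinterpret the 4 bytes as one unsigned 32-bit integer


def unpack_half2(word):
    """Recover the two half-precision values stored in a 32-bit word."""
    raw = struct.pack("<I", word)
    return struct.unpack("<2e", raw)


word = pack_half2(1.5, -2.0)
print(hex(word))           # one 32-bit word holding both values
print(unpack_half2(word))  # the two FP16 values read back

# Two FP16 operations per FP32 unit per cycle doubles the peak rate.
fp16_tflops = 2 * FP32_TFLOPS
print(f"Peak FP16: {fp16_tflops:.1f} TFLOPS")
```

Doubling the 10.6 FP32 TFLOPS peak this way gives the 21.2 FP16 TFLOPS quoted for GP100.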

And where does such FP16 performance pay off, while its low precision does not really matter?

Deep learning is the obvious answer: half-precision calculation is sufficient to handle the massive data involved in deep learning, and it is very convenient, since FP16 calculation consumes less energy. FP16 also consumes less memory, which enables working on bigger networks and provides faster and more efficient deep learning. Consider that GP100 is able to deliver 21.2 TFLOPS at FP16, while the Drive PX2 (the flagship SKU for the automotive and autonomous-driving field) is able to deliver something beyond 15 TFLOPS at FP16.

But deep learning and the automotive/autonomous fields are not the only sectors able to exploit FP16. FP16 calculation is fundamental in the mobile sector, where high precision is not required, while power efficiency is very important. The Tegra X1 had some limitations caused by the Maxwell architecture: Maxwell does not have full FP16 capability - in fact, it is able to run FP16 calculations at double rate only for restricted and specific kinds of operations, and only if the two operations paired in each register are the same. This limitation prevented Maxwell from achieving its full potential in FP16 calculation.

However, Pascal is now able to fully exploit FP16, which is in high demand in mobile graphics. Considering the good performance achieved by Maxwell with Tegra X1 (at 20 nm, quite similar to the old 28 nm), a Tegra X2 (or Tegra P1) built at 16 nm and powered by a Pascal GPU with full-rate FP16 is set to deliver astonishing performance per watt. Nvidia has therefore finally created an SoC able to run at its maximum from the graphical and production point of view: it will be interesting to see if the company will also release it for multimedia players, tablets and other multimedia devices.

HBM 2.0

It is well known that modern GPUs have long suffered from feeding issues, since memory bandwidth has not been able to grow efficiently at reasonable power consumption. Nvidia has partially counterbalanced this issue using effective memory compression, but GP100 is a totally disruptive GPU, and such tweaks are no longer sufficient.

In order to feed the new architecture efficiently, the company decided to adopt HBM 2.0 modules. These modules enable GP100 to reach 720 GB/s of bandwidth (1.4 GHz frequency and 4096-bit interface), which is three times that of the Tesla M40.
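
The bandwidth figure follows from the interface width and the per-pin data rate. The sketch below assumes that the article's "1.4 GHz" means an effective 1.4 Gbit/s per pin; the function name is hypothetical.

```python
def memory_bandwidth_gb_s(bus_width_bits, data_rate_gbit_per_pin):
    """Peak bandwidth in GB/s: one bit per pin per transfer, 8 bits per byte."""
    return bus_width_bits * data_rate_gbit_per_pin / 8


BUS_WIDTH_BITS = 4096  # from the article

# Assumption: 1.4 "GHz" is read as 1.4 Gbit/s per pin.
p100_bandwidth = memory_bandwidth_gb_s(BUS_WIDTH_BITS, 1.4)

# The same interface at the 2.0 Gbit/s per pin that the article attributes to the JEDEC specification.
full_speed_bandwidth = memory_bandwidth_gb_s(BUS_WIDTH_BITS, 2.0)

print(f"Tesla P100: {p100_bandwidth:.1f} GB/s")
print(f"At 2.0 Gbit/s per pin: {full_speed_bandwidth:.1f} GB/s")
```

On this reading the result is about 717 GB/s, which the article rounds to 720 GB/s, and the same 4096-bit interface at full speed would reach roughly 1 TB/s.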

With HBM 2.0, Nvidia has been able to massively improve memory power efficiency, space requirements, GPU feeding and memory capacity, which is a very important aspect for deep-learning workloads.

But the very interesting point is that GP100 is already hitting the market (new Teslas are being delivered to HPC and deep-learning customers right now), meaning that Nvidia already has an HBM 2.0 supply, while AMD expects to employ HBM 2.0 only in 2017.

In addition, this Tesla P100 with GP100 does not look like the ultimate product from Nvidia: the HBM 2.0 memory frequency is set at only 1.4 GHz (while the JEDEC specification allows for 2.0 GHz), and the memory amounts to 16 GB, well below the Tesla M40 with 24 GB. Nvidia is likely to deliver additional boosted Tesla cards in the following months.

Async Compute And Scheduler

Another aspect that matters to the audience (but is actually marginal from the performance point of view) is that Nvidia has heavily tweaked and updated the scheduler, meaning that Asynchronous Compute has probably been corrected and tweaked too.

Async compute enables the GPU to simultaneously run different tasks from different queues, making better use of the GPU's potential. The Maxwell architecture couldn't use it to the fullest due to scheduler issues: once the queue depth exceeds 32, its scheduler gets overloaded and slows down (since it is partially software-driven).

Pascal comes with a tweaked and improved scheduler and a more modular and balanced CUDA/register/warp composition that more closely resembles the shader balance of the GCN architecture. Pascal is very likely to exploit async compute with deeper and longer queues, given also the massive register file and the massive bandwidth inside the GPU.

But what does this mean? Async compute is likely to improve gaming performance by 5-10% if draw calls are massively used - a performance boost that can be easily counterbalanced with a stronger focus on architectural brute force and driver optimization. In any case, the async compute gap between Polaris and Pascal is likely to be smaller than the gap between Maxwell and Fiji.

And what about the professional field? Async compute has no relevant impact in professional fields like HPC, deep learning, rendering, 3D motion and so on. Future revenue growth is set to be driven by deep learning, HPC and automotive, while the consumer market is stagnant and Nvidia already has 80% of that market. So even if async compute still makes a performance difference in the consumer market, it will have practically no effect on the company's financial performance.

Tesla P100 And Delivery

Another very important topic is product delivery. Months ago, the most common expectation was that AMD would release the Polaris architecture some months before Nvidia's Pascal. In previous articles, I wrote that the release time difference, if any, would be a matter of weeks.

But Nvidia decided to surprise everyone by not only showing Pascal before Polaris, but releasing it for real: GP100, which powers the top professional Tesla cards, is already being delivered to HPC customers, and it is paired directly with HBM 2.0 memory.

Here, we can see a double advantage for Nvidia: it is able to exploit a substantial time advantage in the professional field, which is highly profitable (given that AMD will release Polaris for the consumer market first). In addition, it has a time advantage in HBM 2.0 adoption, since AMD has already announced it will employ HBM 2.0 starting from 2017.

Such a time and technological advantage in the professional field is substantial, and it may be a serious issue for AMD's professional hopes.

DGX-1

Not only is Nvidia already delivering GP100 to HPC customers, but the company also offers its brand new small supercomputer system.

The DGX-1 is powered by 8 Tesla P100 (GP100) cards and dual Intel (NASDAQ:INTC) Xeon E5-2698 v4 processors, and it delivers roughly 170 TFLOPS at FP16. This system enables customers to heavily accelerate deep learning and increase efficiency, also thanks to the NVLink system: 4 NVLink interconnections link a group of four GP100 GPUs, which is connected to a second group of four GP100 GPUs through 2 additional NVLinks. This lets the GPUs sustain a massive aggregate bandwidth that meets the feeding requirements needed to better exploit Pascal's potential.
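
As a quick sanity check on the DGX-1 figure, the aggregate FP16 throughput is simply eight times the per-GPU peak quoted earlier. The sketch also derives a peak efficiency figure from the 3.2 kW maximum power draw given in the article; this is back-of-the-envelope arithmetic on peak numbers, not a measured result.

```python
GPUS_PER_DGX1 = 8          # from the article
P100_FP16_TFLOPS = 21.2    # per-GPU FP16 peak, from the article
DGX1_MAX_POWER_KW = 3.2    # maximum system power, from the article

# Aggregate peak throughput of the whole system.
dgx1_fp16_tflops = GPUS_PER_DGX1 * P100_FP16_TFLOPS

# Peak GFLOPS per watt for the whole system (GPUs, CPUs and everything else included).
dgx1_gflops_per_watt = dgx1_fp16_tflops * 1000 / (DGX1_MAX_POWER_KW * 1000)

print(f"DGX-1 peak FP16: {dgx1_fp16_tflops:.1f} TFLOPS")
print(f"Peak efficiency: {dgx1_gflops_per_watt:.0f} GFLOPS/W")
```

The 169.6 TFLOPS total is what gets rounded to "roughly 170 TFLOPS".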

DGX-1

Credits: QTH.com

Anyway, this is essentially a solution aimed at deep learning, and given the high performance/size ratio and the very high performance/power consumption ratio (the DGX-1 consumes a maximum of 3.2 kW), I have no doubt it will be a successful product in this research field.

Against Knights Landing

Tesla P100 will have to face the Knights Landing solution from Intel. If you want deeper insight into what Knights Landing is offering, you can read this article.

Knights Landing takes a different approach from GPU cards: GPUs do better when parallelization is fundamental and taken to its fullest, while Knights Landing does better when programs and calculations depend on Haswell binary-compatible code. In addition, Knights Landing is able to run in standalone mode, without needing a central processor to drive it (because it is essentially a CPU).

We can say that Pascal and Knights Landing cover different aspects of the same market (as far as HPC is concerned), and they are able to coexist. That said, a Tesla P100 is able to calculate 5,300 FP64 GFLOPS and 10,600 FP32 GFLOPS, while Knights Landing is able to calculate only 3,000+ FP64 GFLOPS and 8,000+ FP32 GFLOPS. Pascal's performance is even beyond my previous expectations. Therefore, Nvidia may even be able to increase its market share in the HPC field a little, attacking where there is partial but substantial parallelization, or where Haswell binary synergies are not present or required.
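
The Tesla P100 peak figures can be reproduced from numbers given elsewhere in this article (56 SMs, 64 FP32 units per SM, a 2:1 FP32:FP64 ratio and a 1480 MHz clock), assuming each unit retires one fused multiply-add, counted as 2 floating-point operations, per cycle. The Knights Landing values are the article's lower bounds, so the resulting ratios are indicative only.

```python
def peak_gflops(units, clock_mhz, flops_per_unit_per_cycle=2):
    """Theoretical peak: every unit completes one FMA (2 FLOPs) per clock cycle."""
    return units * flops_per_unit_per_cycle * clock_mhz / 1000.0


SMS = 56
FP32_PER_SM = 64
FP64_PER_SM = 32     # 2:1 FP32:FP64 ratio
CLOCK_MHZ = 1480

p100_fp32 = peak_gflops(SMS * FP32_PER_SM, CLOCK_MHZ)
p100_fp64 = peak_gflops(SMS * FP64_PER_SM, CLOCK_MHZ)

# Lower bounds quoted in the article for Knights Landing.
KNL_FP32_GFLOPS = 8000
KNL_FP64_GFLOPS = 3000

print(f"P100 FP32: {p100_fp32:.0f} GFLOPS, FP64: {p100_fp64:.0f} GFLOPS")
print(f"P100 vs Knights Landing, FP32: up to {p100_fp32 / KNL_FP32_GFLOPS:.2f}x")
print(f"P100 vs Knights Landing, FP64: up to {p100_fp64 / KNL_FP64_GFLOPS:.2f}x")
```

These are theoretical peaks on both sides; as noted above, which chip does better in practice depends on how well the code parallelizes.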

Drive PX2

And here it comes, another important product that has a lot of interesting specs.

DrivePX2

CREDITS: WCCFtech

Drive PX2 is equipped with two Tegra X2 (or Tegra P1) SoCs and two discrete Pascal GPUs. Each Tegra X2 has 2 tweaked Denver cores and 4 A57 cores, and is built on the 16 nm FF+ process node. The discrete GPUs are believed to be similar (or identical) to the future GP106 or GTX 1060, which makes sense up to a point: Drive PX2 delivers 8 TFLOPS at FP32 precision, and the pair of Tegra X2 chips can probably deliver something beyond 3 TFLOPS. That leaves 5 TFLOPS (or less) for the two discrete GPUs, which is too little for a pair of future GTX 1060 cards running at full speed, given that two GTX 960 cards already provide 4.7 TFLOPS FP32. I therefore expect Drive PX2 to employ a pair of detuned GP106 chips (aka GTX 1050) or a pair of GP107 chips without strict thermal limitations.
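
The reasoning above is a simple throughput budget, sketched below with the article's own figures. The 3 TFLOPS assigned to the two Tegra chips is my estimate rather than a published specification, so the result should be read as an upper bound on what the discrete GPUs contribute.

```python
DRIVE_PX2_FP32_TFLOPS = 8.0   # total, from the article
TEGRA_PAIR_TFLOPS = 3.0       # estimate for the two Tegra X2 chips ("something beyond 3")
GTX_960_PAIR_TFLOPS = 4.7     # two GTX 960 cards, from the article

# What is left for the two discrete Pascal GPUs, in total and per GPU.
discrete_pair_budget = DRIVE_PX2_FP32_TFLOPS - TEGRA_PAIR_TFLOPS
per_gpu_budget = discrete_pair_budget / 2

# Throughput of a single previous-generation GTX 960 for comparison.
per_gtx_960 = GTX_960_PAIR_TFLOPS / 2

# How much faster than a GTX 960 each discrete GPU needs to be, at most.
required_uplift = per_gpu_budget / per_gtx_960 - 1.0

print(f"Budget per discrete GPU: {per_gpu_budget:.2f} TFLOPS")
print(f"Single GTX 960: {per_gtx_960:.2f} TFLOPS")
print(f"Required uplift over a GTX 960: {required_uplift:.1%}")
```

Each discrete GPU needs only a few percent more throughput than a GTX 960, which is why a full-speed GTX 1060 class chip looks oversized for this budget.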

In any case, such a powerful mobile deep-learning solution, which is able to calculate 8 TFLOPS FP32 and 16 TFLOPS FP16, will be a very interesting product for self-driving car developers and industries. There are already a lot of Drive PX customers: given that Drive PX2 nearly quadruples the FP32 throughput and more than quadruples the FP16 throughput, customers will very likely move to this faster product. This product is also AUTOSAR compliant, and it is able to perform 24 trillion deep-learning operations per second - a suitable specification for car safety mechanisms. In addition, the overall system is optimized for redundancy and mission-critical system safety.

This is a real and consistent business, and the latest Nvidia financials are a great confirmation. The company increased automotive revenues by 68% QoQ, and given the early-stage development, it is only the tip of the iceberg.

And here comes the business change for Nvidia. The focus of Drive PX was on powering infotainment. With Drive PX2, the focus shifts to a strong entry into the ADAS business, powering DNN-based autonomous systems, with a price tag of nearly $15K, and it is already being shipped to high-end customers. Rest assured that a LIDAR/DNN system for autonomous driving at $15K is extremely cheap nowadays. I expect automotive revenue to grow at a consistent, high double-digit rate in the next quarters.

Upcoming GTX 1080, 1070, 1060

And what about the consumer market? Nvidia and AMD are going to release their solutions, but timing and product range are very important aspects to consider.

While AMD was believed to be well ahead of Nvidia in releasing its architecture, not only has the latter already hit the HPC market with Pascal (while AMD will hit the same market only after several months), but it is going to beat AMD to the consumer market too.

While AMD will probably release some mid-range products at Computex (Polaris 10 for desktop and high-end laptop products, Polaris 11 for low-end and mid-end laptop products), Nvidia is expected to release mid-end and high-end products in May at a dedicated event, with product availability in early June.

Nvidia is likely to release the GTX 1080 and GTX 1070. Judging from the leaked die images, the GP104 die will hold nearly 8 billion transistors, about half as many as the bigger GP100. Given that FP64 units are expensive (bigger die size), and that they are useless in the consumer field, the GP104 is likely to provide an FP32:FP64 ratio similar to the Maxwell architecture, in order to reduce costs and power consumption.

I personally expect to see the GP104 chip powering the GTX 1080 (GP104-400-A1) and GTX 1070 (GP104-200-A1) only, while the GTX 1080 Ti may be powered by a GP100 chip.

The upcoming GTX cards are likely to use more CUDA cores, while the frequency is likely to be well beyond 1.5 GHz. At the same time, power consumption will sit at very competitive levels, thanks to the tweaks described in the first part of the article. The interesting point is that Nvidia is going to hit the market starting from the top end in the HPC sector, and from the mid-end and high-end in the consumer sector: in this way, the company will push on marketing, showing what its GPUs are capable of. As for the mid/low-end market, Nvidia will release the GTX 1060 starting from Fall 2016. AMD, instead, will start from the low-/mid-end market between June and July, while it will attack the high-end range from Fall 2016.

I think Nvidia's strategy is more sensible, since it shows from the start the company's performance capability, while AMD will take additional months to really provide its entire "picture". This different timing is likely to drive sales in Nvidia's favor or maintain its share advantage.

In addition, the GTX 1060 sounds very interesting, given the rumors and the capabilities of the bigger GP100 chip. GP106 is expected to be 200 mm² with 1280 CUDA cores, and at the GTX 960 frequency, the GTX 1060 would hit 2,885 FP32 GFLOPS. But the Pascal architecture and the process node enable Nvidia to run the GPU at very high frequencies: considering that the top GP100 runs at 1480 MHz, the GP106 is expected to run at around 1500 MHz or more, reaching something around 3,700-3,900 FP32 GFLOPS. This would be a great achievement, because it would reach and surpass the 3,494 FP32 GFLOPS of the GTX 970, which is Nvidia's minimum requirement to meet VR specifications. In practice, with a mid-entry level price and a power consumption around 100W, Nvidia would be able to greatly widen the VR customer base. Given that the VR market is the next "big thing" for the graphics producers, providing such an economical but VR-capable solution would be a great and smart move. (Honestly, I would not exclude that Nvidia may use a GP104-150-A1 chip instead of a GP106 in order to power the GTX 1060.)
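
The GFLOPS estimates above come from the usual peak formula (CUDA cores x 2 operations per cycle x clock). The sketch below reproduces them; the 1280-core count is the rumored GP106 figure from the article, while the GTX 960 base clock (1127 MHz) and the GTX 970 configuration (1664 cores at 1050 MHz) are assumptions taken from the published reference specifications.

```python
def peak_fp32_gflops(cuda_cores, clock_mhz):
    """Theoretical FP32 peak: one FMA (2 FLOPs) per CUDA core per cycle."""
    return cuda_cores * 2 * clock_mhz / 1000.0


def clock_for_gflops(target_gflops, cuda_cores):
    """Clock in MHz needed to reach a given theoretical FP32 peak."""
    return target_gflops * 1000.0 / (cuda_cores * 2)


GP106_CORES = 1280          # rumored figure quoted in the article
GTX_960_CLOCK_MHZ = 1127    # assumption: GTX 960 reference base clock
GTX_970_CORES = 1664        # assumption: GTX 970 reference specification
GTX_970_CLOCK_MHZ = 1050    # assumption: GTX 970 reference base clock

gp106_at_960_clock = peak_fp32_gflops(GP106_CORES, GTX_960_CLOCK_MHZ)
gp106_at_1500 = peak_fp32_gflops(GP106_CORES, 1500)
gtx_970 = peak_fp32_gflops(GTX_970_CORES, GTX_970_CLOCK_MHZ)

# Clock at which the rumored GP106 would match the GTX 970 VR baseline.
vr_clock_mhz = clock_for_gflops(gtx_970, GP106_CORES)

print(f"GP106 at GTX 960 clock: {gp106_at_960_clock:.0f} GFLOPS")
print(f"GP106 at 1500 MHz: {gp106_at_1500:.0f} GFLOPS")
print(f"GTX 970: {gtx_970:.0f} GFLOPS")
print(f"GP106 clock needed to match the GTX 970: {vr_clock_mhz:.0f} MHz")
print(f"Clock range for 3,700-3,900 GFLOPS: "
      f"{clock_for_gflops(3700, GP106_CORES):.0f}-{clock_for_gflops(3900, GP106_CORES):.0f} MHz")
```

Under these assumptions, the 3,700-3,900 GFLOPS range corresponds to clocks of roughly 1.45 to 1.52 GHz, and a 1280-core chip would clear the GTX 970 baseline at well under 1.4 GHz.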

AMD?

AMD is going to release its Polaris architecture at Computex (or around that time), and it will hit the consumer market with the mid-end range between June and July. As for the professional and HPC field, we will have to wait some additional months, while the company will release ZEN CPUs in 4Q 2016. As for the new APUs, AMD will release them only by H1 2017.

Setting aside the performance/watt ratio for now, since neither architecture has benchmarks or real performance data yet, timing becomes a very important consideration, and AMD faces a difficult problem. AMD was expected to beat Nvidia to market on multiple fronts, but Nvidia looks to be at a far more advanced stage in the HPC and professional field, has no real competition in deep learning at its price levels and has a timing advantage even in the consumer graphics release. Until consistent performance, power consumption and price comparisons are available, it is difficult to see AMD as a serious competitor for Nvidia. We have to wait a while longer.

This means that Nvidia "may" retain its market share, while generating more revenue from the HPC field (where Maxwell was not well suited, and Knights Landing takes a completely different approach) and from the deep-learning field (automotive too). Also considering the massive buyback, the company is set to grow even more if it continues to deliver on schedule. NVDA is a long-term buy.

Disclosure: I/we have no positions in any stocks mentioned, but may initiate a long position in NVDA over the next 72 hours.

I wrote this article myself, and it expresses my own opinions. I am not receiving compensation for it (other than from Seeking Alpha). I have no business relationship with any company whose stock is mentioned in this article.

Additional disclosure: The author does not guarantee the performance of any investments and potential investors should always do their own due diligence before making any investment decisions. Although the author believes that the information presented here is correct to the best of his knowledge, no warranties are made and potential investors should always conduct their own independent research before making any investment decisions. Investing carries risk of loss and is not suitable for all individuals.