AMD: The APU And High Bandwidth Memory - Maintaining Graphics And Total Compute Performance Leadership

AMD is set to transition to High Bandwidth Memory (HBM) across its GPUs and APUs to address future bandwidth requirements.

AMD could implement a high capacity/high bandwidth L3 cache using 2.5D stacked memory on a silicon interposer for its first HBM implementation on APUs.

HBM should allow AMD to maintain graphics and total compute performance leadership over upcoming Intel system on chips like Broadwell and Skylake.

AMD is working on a new high performance x86 CPU microarchitecture, which will debut in 2016. This should improve the competitiveness of AMD APUs on pure CPU workloads.

AMD remains a good long-term investment for investors who are looking at a timeframe of 2+ years.

AMD's (NYSE:AMD) Accelerated Processing Unit (APU) strategy revolves around heterogeneous compute and maintaining graphics and total compute performance leadership. Total compute performance is the total performance of a system measured in FLOPS (FLoating-point Operations Per Second). This includes the combined performance of the Central Processing Unit (CPU) and Graphics Processing Unit (GPU) integrated in a modern system on chip (SOC).
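As a rough illustration of the definition above, the sketch below adds up theoretical peak single precision FLOPS for a CPU and a GPU. The core counts, clock speeds and per-cycle throughput figures are assumptions chosen to resemble a Kaveri-class part; they are not taken from this article, and real workloads land well below the theoretical peak.

```python
def peak_gflops(units, flops_per_cycle, clock_ghz):
    """Theoretical peak GFLOPS = execution units x FLOPs per cycle x clock in GHz."""
    return units * flops_per_cycle * clock_ghz


# Assumed, illustrative inputs (not from the article).
# GPU: 512 stream processors, one fused multiply-add (2 FLOPs) per cycle, 720 MHz.
gpu = peak_gflops(units=512, flops_per_cycle=2, clock_ghz=0.72)
# CPU: 4 cores, assumed 8 single precision FLOPs per core per cycle, 3.7 GHz.
cpu = peak_gflops(units=4, flops_per_cycle=8, clock_ghz=3.7)

total = gpu + cpu
gpu_share = gpu / total

print(f"GPU: {gpu:.1f} GFLOPS, CPU: {cpu:.1f} GFLOPS, total: {total:.1f} GFLOPS")
print(f"GPU share of total compute: {gpu_share:.0%}")
```

Under these assumed inputs the GPU contributes well over four fifths of the total, which is why GPU performance dominates any total compute comparison.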

AMD and Intel (NASDAQ:INTC) have been adding significant GPU performance with each generation of SOCs. The GPU provides the vast majority of FLOPS in a SOC or APU. AMD had an unquestionable lead in graphics and total compute performance until Intel launched the Haswell SOC with Iris Pro graphics. Intel's high end Iris Pro graphics more than doubled the number of graphics execution units compared with previous generation chips like Ivy Bridge.

Intel used embedded Dynamic Random Access Memory (eDRAM) on the CPU package in a multi chip module design. The eDRAM serves as a high bandwidth L4 cache that addresses the bandwidth requirements of the powerful Iris Pro GPU, allowing it to perform well by reducing bandwidth bottlenecks. Intel will bring more models with on package eDRAM in its upcoming SOC generations, Broadwell and Skylake. This poses a significant threat to future AMD APUs, as leadership graphics performance is one of AMD's key strengths.

AMD has designed an elegant and efficient architectural integration of CPU and GPU in the APU with the Heterogeneous System Architecture (HSA). AMD Kaveri is the first HSA APU to support heterogeneous unified memory (hUMA) and heterogeneous queuing (hQ). Anandtech has a detailed write-up on the HSA features.

AMD's APU roadmap calls for further architectural improvements, which give the GPU within the APU a greater degree of control in the overall system architecture. AMD is also investing significant effort in getting independent software vendors (ISVs) like Microsoft (NASDAQ:MSFT) and Adobe (NASDAQ:ADBE) to utilize HSA features and improve performance in mainstream applications. AMD's Mantle Application Programming Interface (API) reduces software bottlenecks on AMD APUs and should expand the performance lead in graphics and gaming. AMD has quickly gathered excellent support for Mantle from the top game development studios.

The GPU in the AMD Kaveri APU is powerful and has a sophisticated general purpose compute architecture. It uses the same industry leading Graphics Core Next (GCN) architecture found in AMD's high end GPUs like the R9 290X and in next generation consoles like the PlayStation 4 and Xbox One. In spite of the superior GCN architecture, the Kaveri APU faces significant bottlenecks due to low system main memory bandwidth and cannot beat Intel Iris Pro graphics consistently. The dual channel DDR3 memory controller supports memory speeds up to DDR3-2133 (2133 megatransfers per second) and provides a maximum bandwidth of 34 Gigabytes/second (GB/s).
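The 34 GB/s figure can be reproduced with a short calculation. This is a minimal sketch that assumes the standard 64-bit width of a DDR3 channel, which the article does not state; the function name is purely illustrative.

```python
def peak_bandwidth_gbs(transfer_rate_mts, bus_width_bits, channels=1):
    """Theoretical peak bandwidth in GB/s (1 GB = 1000 MB).

    transfer_rate_mts: data rate in megatransfers per second
    bus_width_bits:    width of one channel in bits
    channels:          number of independent channels
    """
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * bytes_per_transfer * channels / 1000


# Dual channel DDR3-2133, assuming 64-bit channels.
ddr3_2133_dual = peak_bandwidth_gbs(2133, 64, channels=2)
print(f"Dual channel DDR3-2133: {ddr3_2133_dual:.1f} GB/s")
```

This is a theoretical ceiling; sustained bandwidth is lower, and on an APU it is shared between the CPU cores and the GPU.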

The GPU in the Kaveri APU has 512 GCN cores (stream processors) and is similar to the HD 7750 GPU, although the HD 7750 has 16 render output units (ROPs) while the Kaveri APU has 8. I refer to this comparison of the HD 7750 GDDR5 and HD 7750 DDR3 to give an idea of how critical memory bandwidth is to gaming performance. The HD 7750 GDDR5 version is 1.8x-2x as fast as the HD 7750 DDR3 in the most demanding titles like "Battlefield 3," "Crysis 2," "Sleeping Dogs," "Witcher 2 EE," and "Alan Wake." The average performance increase across the test suite of games was just above 50% (1.5x).

Kaveri benefits greatly from extra memory bandwidth but barely scales with higher GPU core clock speeds, because the GPU is severely bottlenecked by memory bandwidth. This article from PC Perspective highlights the bandwidth problem on the AMD Kaveri APU. Increasing the memory bandwidth by 50%, from DDR3-1600 to DDR3-2400, brings a 22%-33% increase in performance at stock GPU core clock. In contrast, increasing the GPU core clock by 42% brings only a 3%-8% performance improvement at stock memory clock. The memory overclock had a much more significant impact on gaming performance. All of this points to the fact that AMD's current APUs are bottlenecked by system memory bandwidth.
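One way to make the comparison concrete is to divide each performance gain by the resource increase that produced it. The sketch below uses only the percentages quoted above; the "scaling efficiency" metric and the function name are illustrative choices, not something taken from the PC Perspective article.

```python
def scaling_efficiency(perf_gain_pct, resource_increase_pct):
    """Fraction of a resource increase that shows up as extra performance."""
    return perf_gain_pct / resource_increase_pct


# Memory overclock: +50% bandwidth gave +22% to +33% performance.
memory = [scaling_efficiency(gain, 50) for gain in (22, 33)]
# GPU core overclock: +42% clock gave +3% to +8% performance.
core = [scaling_efficiency(gain, 42) for gain in (3, 8)]

print(f"Memory scaling:   {memory[0]:.2f} to {memory[1]:.2f}")
print(f"GPU core scaling: {core[0]:.2f} to {core[1]:.2f}")
```

A value near 1.0 would mean performance tracks the resource almost linearly. Even the worst memory result here is more than twice the best core clock result, which is the signature of a bandwidth bound GPU.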

AMD's past 3 generations of APUs (Llano/Trinity/Richland) were built on Globalfoundries' 32 nm silicon on insulator (SOI) process, and the current generation Kaveri is built on Globalfoundries' 28 nm Super High Performance bulk process. The next 2 generations of APUs should feature a transition to the Globalfoundries 20 nm Low Power Mobility (LPM) process in 2015 and the Samsung (OTC:SSNLF) / Globalfoundries 14 nm Low Power Enhanced (LPE) FinFET process in 2016. These process nodes bring twice the transistor density and a significant power reduction (30%-55%). AMD APUs could easily break the 4 billion transistor mark at these nodes and get close to 5 billion transistors.

The GPU in AMD's next generation APUs can easily double in core count to 1024 stream processors. This would put the next generation APUs' stream processor count on par with the HD 7850 and R7 265 graphics cards. AMD could also implement 32 render output units (ROPs) on the APU to achieve smooth 1080p gaming at high quality in the latest games. The PlayStation 4 (PS4) APU and the HD 7800 series graphics cards sport 32 ROPs and 256 bit GDDR5 memory, with memory bandwidth of 153.6-176 GB/s.
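The 153.6-176 GB/s range follows from the 256 bit bus width and the GDDR5 data rate per pin. The sketch below assumes per-pin rates of 4.8 and 5.5 Gbps, which are not stated in the article but reproduce the quoted figures, and compares the result with the 34 GB/s available to Kaveri.

```python
def gddr5_bandwidth_gbs(bus_width_bits, pin_rate_gbps):
    """Theoretical peak bandwidth in GB/s: bus width x per-pin data rate / 8."""
    return bus_width_bits * pin_rate_gbps / 8


KAVERI_DDR3_GBS = 34.0  # dual channel DDR3-2133 figure quoted in the article

# Assumed per-pin data rates (not from the article).
low = gddr5_bandwidth_gbs(256, 4.8)
high = gddr5_bandwidth_gbs(256, 5.5)

print(f"256 bit GDDR5: {low:.1f} to {high:.1f} GB/s")
print(f"Ratio to Kaveri DDR3: {low / KAVERI_DDR3_GBS:.1f}x to {high / KAVERI_DDR3_GBS:.1f}x")
```

On these numbers, a GPU of this class would want roughly 4.5 to 5 times the bandwidth that dual channel DDR3 can supply.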

AMD's memory bandwidth requirements grow by a factor of more than 4x when considering upcoming APUs. The solution lies with JEDEC standard High Bandwidth Memory. AMD has long been the driver of new memory standards in the graphics market: it was the first to ship GDDR4 memory (X1950XTX) and GDDR5 memory (HD 4870). The bandwidth requirements on AMD APUs are primarily driven by the massive GPU component. Given its expertise in building GPUs, AMD must have anticipated this problem and must have been working on a solution for quite some time.

AMD has been working for at least 3-4 years on future memory technologies for its GPUs and APUs. The first evidence of these efforts came in 2011, when AMD showcased 2.5D stacked memory on a silicon interposer using through-silicon vias (TSVs). These efforts were discussed in presentations from Amkor (NASDAQ:AMKR) and Hynix (OTC:HXSCF) at AMD's Technical Forum and Exhibition (TFE) 2011. Last year AMD made a bold statement in one of its keynote presentations that HBM and die stacking are finally going mainstream. AMD will be partnering with Hynix to use HBM in high end GPUs and APUs.

The question that arises now is how AMD would use HBM in its APU products. The first option is to transition system main memory to 2.5D stacked HBM. The second option is to use a 2.5D HBM stack as a high bandwidth/high capacity cache. This patent, filed by AMD on June 25, 2012 and published on December 26, 2013, could give us clues as to how AMD will integrate HBM in APUs initially.

AMD describes an "Integrated circuit with high reliability cache controller and method therefor" in patent application US 20130346695 A1:

"FIG. 3 illustrates in block diagram form a computer system 300 that supports a high reliability mode according to the present invention. Computer system 300 generally includes an accelerated processing unit ("APU") 310 and a dynamic random access memory ("DRAM") memory store 340. APU 310 generally includes a first central processing unit (CPU) core 312 labeled "CPU0", a second CPU core 316 labeled "CPU1", a shared L2 cache 320, an L3 cache and memory controller 322, a main memory controller 328, and a register 330. CPU core 312 includes an L1 cache 314 and CPU core 316 includes an L1 cache 318. DRAM memory store 340 generally includes low power, high-speed operation DRAM chips, including a DRAM chip 342, a DRAM chip 344, a DRAM chip 346, and a DRAM chip 348. DRAM memory store 340 uses commercially available DRAM chips such as double data rate ("DDR") SDRAMs"

A single 4 chip (4 Hi) HBM stack will have a capacity of 1 GB and memory bandwidth of 128 GB/s. This is more than sufficient to handle the bandwidth requirements of AMD APUs in the near future. AMD could maintain compatibility with the existing FM2+ socket, but there is no guarantee of that. AMD could also support HBM on its high end A10/A8 chips alone. On the low end A4/A6 chips, AMD could disable the L3 cache controller and portions of the GPU and ship without any HBM based L3 cache. This could be done to keep costs low on the entry level APUs and for yield reasons. The other reason for HBM to be used initially as L3 cache is that only 2 Gigabit chips are available, in a single 4 Hi stack configuration. This limits a single stack to 1 GB, which is very low for system main memory but ideal for a high capacity/high bandwidth L3 cache. SK Hynix expects to launch higher capacity 8 Gigabit stacked DRAM chips in 4 and 8 chip stack configurations in 2016, at double the bandwidth (256 GB/s). This would allow capacities of 4 GB and 8 GB with a single HBM stack, which would be suitable for system main memory. Here are a couple of slides from a recent SK Hynix presentation that I found to be very important:

The eventual transition to stacked HBM based system main memory on AMD APUs could happen sometime in 2017 or later, when 4 GB and 8 GB HBM stacks are available in high volume and at economical prices.
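The stack capacities and bandwidths discussed above follow from simple arithmetic on die count, die density and interface width. The sketch below assumes a 1024 bit interface per stack and per-pin data rates of 1 and 2 Gbps; those values are not given in this article and are used here only because they reproduce the quoted 128 and 256 GB/s.

```python
def stack_capacity_gb(dies, die_density_gbit):
    """Capacity of one stack in gigabytes (8 gigabits = 1 gigabyte)."""
    return dies * die_density_gbit / 8


def stack_bandwidth_gbs(interface_bits, pin_rate_gbps):
    """Theoretical peak bandwidth of one stack in GB/s."""
    return interface_bits * pin_rate_gbps / 8


INTERFACE_BITS = 1024  # assumed per-stack interface width

# (label, dies per stack, die density in Gbit, assumed per-pin rate in Gbps)
configs = [
    ("4 Hi, 2 Gbit dies", 4, 2, 1.0),
    ("4 Hi, 8 Gbit dies", 4, 8, 2.0),
    ("8 Hi, 8 Gbit dies", 8, 8, 2.0),
]

for label, dies, density, rate in configs:
    capacity = stack_capacity_gb(dies, density)
    bandwidth = stack_bandwidth_gbs(INTERFACE_BITS, rate)
    print(f"{label}: {capacity:.0f} GB, {bandwidth:.0f} GB/s")
```

The first configuration gives the 1 GB stack that suits a cache, while the 8 Gbit dies give the 4 GB and 8 GB stacks that would make HBM practical as main memory.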

AMD is also working on a new high performance x86 CPU microarchitecture, which will debut in 2016. This should improve the performance and efficiency of AMD APUs in CPU workloads, particularly single thread integer performance, which is the main weakness of current AMD CPUs and APUs. It should also improve total compute performance and efficiency.

In summary, AMD will transition to HBM on their high end APUs and discrete GPUs. The indications are that the transition could happen as soon as 2015. These developments will bolster AMD's graphics and total compute performance leadership on upcoming APUs against Intel SOCs like Broadwell and Skylake. AMD's discrete GPU products will also benefit immensely from HBM and this development augurs well for their discrete GPU business. AMD remains a good long-term investment on the strength of the various developments happening now. The transition to HBM followed by a new high performance x86 CPU microarchitecture in 2016 should help AMD to maintain and extend the performance leadership of their APUs. The adoption of HSA features by ISVs will bring improved performance in mainstream applications. These developments combined should help drive market share gains against Intel and also increase the average selling price of AMD APUs.

Disclosure: The author has no positions in any stocks mentioned, and no plans to initiate any positions within the next 72 hours. The author wrote this article themselves, and it expresses their own opinions. The author is not receiving compensation for it (other than from Seeking Alpha). The author has no business relationship with any company whose stock is mentioned in this article.