This is the second in a three-part series on the Micron (NASDAQ:MU) and Samsung (OTC:SSNLF) Hybrid Memory Cube technology. Please read Yodeling Into The Hybrid Memory Cube if you aren't familiar with the rationale behind the HMC's design and why it does not compete directly with the parallel High Bandwidth Memory that recently hit the streets.
The Hybrid Memory Cube ("HMC") is enabled only by recently-developed, high-bandwidth 3D chip stacking and manufacturing technologies. While these technologies are initially expected to add significant cost, that cost is expected to come down dramatically. From a related Micron HMC patent application, we can see that Micron fully expects these designs to span from smartphones to supercomputers:
End products of the above designs will find a wide variety of applications including, among others, in mobile electronic devices such as so-called "smart phones," laptop and notebook computers, supercomputers, BLACKBERRY® devices, iPHONE® and iPAD® devices, and DROID® devices.
Extrapolating is fun. And while the initial HMC products will be standalone in design, it doesn't take much extrapolating in order to figure out that, now that the HMC's logic base chip is built on the same fab line as a CPU, the natural progression of integration will marry the two into a single chip (or 3D stack). For the overwhelming majority of the market, we're already past the tipping point for this.
As pictured above, Intel's (NASDAQ:INTC) tiny Xeon D system on a chip illustrates the sheer miniaturization of computing. This is an extremely powerful device and, yet, it is only the size of a dime. Clearly, in some instances, there is an opportunity to integrate a CPU into a stack of memory in order to avoid the need for complex external memory altogether.
A smartphone, tablet, laptop or desktop, for example, could easily get by with a bunch of CPU cores that sit beneath a stack of DRAM and 3D XPoint. Using technologies that are available today, you can conjure 1 gigabyte of DRAM sitting on top of 100 gigabytes of 3D XPoint. In turn, this all sits on an Intel or ARM (NASDAQ:ARMH) CPU, for example.
- Side Bar: I fully realize that the vast majority of people still think of 3D XPoint as "storage" instead of "system memory" so it is important to note that this stack of technology is going to perform light years better for the overwhelming majority of use cases out there. Regardless, just know that the performance and energy efficiency will be astounding.
Apps will always be in a running state, ready to hop into the high-speed DRAM cache, if required. And, when you shut off the screen on your battery-powered device, this DRAM will evacuate into the 3D XPoint in an instant and the whole device will enter a powerless hibernation (awaiting a wake-up call from the user or the cellular modem).
The stack of chips will be connected just like the interior of the HMC - with an extremely high-speed parallel bus. Although the HMC was invented in order to facilitate an external serial memory bus, its very achievement eventually mitigates such a need in many computing applications (though the high-speed serial bus may be retained for other useful non-memory purposes like discrete graphics in some cases).
The separate chips in the stack will be so closely coupled that they will function as if they are a single chip. This reduces the need for CPU cache, further increasing performance and power efficiency. The whole idea is called Processor in Memory ("PIM") and Micron has already arrived at this conclusion of CPU and memory integration. From a Micron HMC patent:
The logic base layer 215 may include one or more processors to implement the functions described, and an HMC can be a processor in memory ("PIM") device.
PIM isn't a new idea. The industry has been chasing it for years but it is only recently enabled by the aforementioned 3D chip stacking technologies. Furthermore, Micron has become an active participant in the renewed interest in Near Data Processing ("NDP"), a technique which leverages the power that is enabled by PIM and other ways of getting computing elements close to the memory.
To clarify the comment at the bottom of this slide: the "NFAs" are the processing elements in the Automata memory. "NDP" is the aforementioned "near data processing" capability and, in this case, is provided by way of an Altera (NASDAQ:ALTR) FPGA. Automata has a very narrow set of processing capabilities and, as such, makes heavy use of the FPGA in accomplishing work without involving the distant CPU.
System of a DIMM
As processing moves closer and closer to the data, the need for the dated "central processing unit" architecture is reduced. Over the past decade, this new NDP architecture ecosystem has been developing. When the FPGA makers decided to embed general purpose ARM processor cores in their FPGAs, Intel was forced to react. They paid a premium for Altera, but this really keeps an ARM-based data center ecosystem from developing.
Recently, Intel showed off a 512GB Optane DIMM that they are readying for shipment. Since the Optane line uses the 3D XPoint chips, we know that they are using 32 of the 16-gigabyte chips to achieve this density.
See very important related pictures and commentary here. There's an FPGA on the back of the DIMM and little else. All of the 3D XPoint is on the front in just four packages (three packages under the big clip and one under the little clip - the thermal paste can be seen oozing out from each).
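The density figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the chip size, chip count and package count come from the text, while the even split of chips across the four packages is my assumption, not something Intel has stated.

```python
# Figures taken from the article text.
CHIP_GB = 16       # capacity of one 3D XPoint chip, in gigabytes
CHIP_COUNT = 32    # chips needed to reach the DIMM's density
PACKAGES = 4       # packages visible on the front of the DIMM

# Total DIMM capacity implied by the chip count.
dimm_gb = CHIP_GB * CHIP_COUNT

# ASSUMPTION: chips are spread evenly across the packages.
chips_per_package = CHIP_COUNT // PACKAGES
gb_per_package = chips_per_package * CHIP_GB

print(f"DIMM capacity: {dimm_gb} GB")
print(f"Chips per package (if evenly split): {chips_per_package}")
print(f"Capacity per package (if evenly split): {gb_per_package} GB")
```

Under that even-split assumption, each package would have to hold several chips, which is why stacking looks unavoidable.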
Now we know that the Optane DIMM has what must be stacks of 3D XPoint. Are these stacks in the form of HMC? Since we know that 3D XPoint is transistor-less, we know that the control logic must reside in the form of an HMC "logic base". Just prior to the 3D XPoint announcement last summer, Micron's Ward Parkinson filed this patent application for transistor-less phase change memory. Some quotes (where "single-crystal" and "crystalline" are simply references to conventional silicon transistor technology):
A thin-film memory may include a thin-film transistor-free address decoder in conjunction with thin-film memory elements to yield an all-thin-film memory. Such a thin-film memory excludes all single-crystal electronic devices and may be formed, for example, on a low-cost substrate, such as fiberglass, glass or ceramic. The memory may be configured for operation with an external memory controller. [...]
A memory controller 605 may be formed on the same substrate 604 and configured to operate the standalone memories 602 in a manner previously described. The memory controller may be formed using a conventional CMOS process, then connected to the standalone memories formed on the non-crystalline substrate through processes such as those employed in hybrid circuit manufacture, for example. The memory system 600 may communicate with other components using conventional interconnection components, such as edge connector 608 or other self-contained connector 610 which may be a high speed optical or coaxial connector, for example.
From this, I can see that they have left the door open for both options - 3D XPoint "may include" chalcogenide switch technology in order to completely replace the need for conventional silicon transistor logic. Alternatively, they might use regular silicon transistors in the form of an HMC logic base chip and then stack the 3D XPoint on top. From what we know about the Optane DIMM, it looks like they're using HMC.
But the inclusion of an FPGA on the Optane DIMM is extremely noteworthy as it not only provides near data processing (like encryption, data compression, checkpointing and scrubbing) but also reminds me that an Automata/3D XPoint/HMC integration is possible. To revisit a Micron patent from Purple Swan:
Autonomous memory has a profound advantage in the case where a linear search is performed on a large database. By way of example, using pipelining for one autonomous memory device 102 having 1 GB memory density containing 8 banks of 2M pages of 64 B each, a page can be compared to a target pattern at a beat rate of about 10 nsec per page resulting in a possible search time for the 1 GB die of about 20 mS. While this is an impressive result by itself, the value is that this solution is scalable, and thus, the search time for two autonomous memory devices 102 each having 1 GB memory density would also be about 20 mS as would the search time for a peta-byte of memory, or for any sized pool of memory. Using autonomous memory devices 102 in a distributed sub-system 10 to perform linear searches would be limited by the cost of the array of memory devices 102, along with thermal management and power constraints.
System administration functions also may take advantage of autonomous memory devices 102 in a distributed sub-system 10. For example, a data center may perform a virus scan on distributed sub-system 10 and when a virus is detected, the data center would be "downed" for 20 mS during which time a search and destroy algorithm would be executed on every byte to isolate and disable any occurrence of the target virus.
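The numbers in the patent excerpt can be reproduced with a short sketch. Two readings here are my own assumptions rather than anything the patent spells out: "2M pages" is taken as 2 x 2^20 pages, and the 8 banks are taken to scan in parallel (which is what makes the quoted figure of about 20 ms work out). The function name `search_time_ms` is purely illustrative.

```python
# Figures quoted in the patent excerpt.
BANKS = 8
PAGES_PER_BANK = 2 * 2**20   # ASSUMPTION: "2M pages" read as 2 Mi pages
PAGE_BYTES = 64
BEAT_NS = 10                 # one page compared roughly every 10 ns

# Capacity of one autonomous memory device.
capacity_bytes = BANKS * PAGES_PER_BANK * PAGE_BYTES


def search_time_ms(devices, parallel=True):
    """Linear search time over a pool of identical devices.

    ASSUMPTION: all banks inside a device scan concurrently, so one
    device takes PAGES_PER_BANK beats. With parallel=True every device
    scans its own memory at the same time (the patent's scenario);
    with parallel=False the devices are scanned one after another.
    """
    per_device_ms = PAGES_PER_BANK * BEAT_NS / 1e6
    return per_device_ms if parallel else devices * per_device_ms


print(f"Device capacity: {capacity_bytes / 2**30:.0f} GiB")
print(f"1 device:         {search_time_ms(1):.2f} ms")
print(f"1,000,000 devices: {search_time_ms(1_000_000):.2f} ms (parallel)")
print(f"1,000,000 devices: {search_time_ms(1_000_000, parallel=False) / 1000:.0f} s (serial)")
```

The point of the comparison is that the search time stays flat only because each device searches its own memory. Scan the same pool serially from a central processor and the time grows with the size of the pool.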
And with Intel integrating their Xeon server CPUs with Altera FPGAs, a complete "System on a DIMM" becomes possible. The FPGAs could marshal all of the IO and memory traffic over the "DIMMfrastructure", which could then be used for the interconnection between devices in addition to general purpose stuff like Ethernet. To revisit this Micron HMC patent application with my changes in red/blue:
Note: "SoC" is an acronym for "system on a chip" (like the Xeon D).
This hypothetical system on a DIMM would require some application-dependent mix of DRAM "near memory" in order to fulfill the vision of the "two level memory" system that Intel has outlined. And it appears that they have plenty of room for this.
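To see why the right amount of near memory depends on the application, here is a toy model of DRAM acting as a cache in front of a larger far memory. Everything in it is an assumption for illustration: the LRU replacement policy, the page-granular accesses, the looping workload and the name `near_memory_hit_rate` are mine, and nothing here describes how Intel's actual hardware manages its near memory.

```python
from collections import OrderedDict


def near_memory_hit_rate(accesses, near_pages):
    """Fraction of page accesses served from near memory (DRAM).

    ASSUMPTION: near memory behaves as an LRU cache of `near_pages`
    pages (near_pages must be at least 1); every miss is fetched
    from far memory and installed in near memory.
    """
    near = OrderedDict()  # pages ordered from least to most recently used
    hits = 0
    for page in accesses:
        if page in near:
            hits += 1
            near.move_to_end(page)        # mark as most recently used
        else:
            if len(near) >= near_pages:
                near.popitem(last=False)  # evict the least recently used page
            near[page] = None
    return hits / len(accesses)


# Hypothetical workload: loop over a 100-page working set ten times.
workload = [page for _ in range(10) for page in range(100)]

small = near_memory_hit_rate(workload, near_pages=50)
large = near_memory_hit_rate(workload, near_pages=100)
print(f"50 pages of near memory:  {small:.0%} hits")
print(f"100 pages of near memory: {large:.0%} hits")
```

In this contrived case, near memory that is smaller than the working set thrashes and serves nothing, while near memory that just fits the working set serves everything after the first pass. Real workloads fall somewhere in between, which is why the mix would need to be tuned per application.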
If you were to scour the US Patent office for related work, you'd come across a very interesting patent that is assigned to Xockets IP, LLC, a stealth-mode startup that was founded in 2012 and is staffed by a prominent former Cisco (NASDAQ:CSCO) engineer. They're using "offload processor" logic that "in one very particular embodiment, can be an FPGA" in order to timeshare the memory bus with self-contained systems on a DIMM (where "host processor" is a conventional CPU):
Such offload processors can be in addition to any host processors connected to the system memory bus, and, in some embodiments, process packets transferred over the system memory bus independent of any host processors. In very particular embodiments, processing modules can populate physical slots for connecting in-line memory modules (e.g., DIMMs) to a system memory bus.
Their diagrams should hopefully look familiar:
In another "very particular" example in this patent, the system on a DIMM is able to boot an OS and run apps independently of the host system (the Apache web server is cited as an example). So the inventors have expectations that are in line with what I have hypothesized above (i.e. - a general purpose CPU is included with the FPGA on the Optane DIMM). Figures 2-3 and 2-4 show "Mem." instead of DRAM, which is what you'd normally expect to see on a DIMM. So my assumption is that the inventors are very involved with Intel with respect to the Optane DIMM.
Even if the Optane DIMM doesn't yet contain a general purpose CPU, it still provides extremely high-bandwidth, FPGA-based near-data processing capabilities in what appears to be the first high-volume use of the HMC.
This radically changes the structure, power and efficiency of the data center - add more DIMM channels and DRAM cache to scale up performance or daisy chain the DIMMs with less cache to save on cost. Slap these on the other end of Knights Landing's six memory channels and you'll have a supercomputing node that will chew through data like a tornado in a trailer park.
Google (NASDAQ:GOOG), Facebook (NASDAQ:FB), Amazon (NASDAQ:AMZN), Microsoft, Cisco and others must be salivating over this forthcoming Optane-based near data processing architecture. New, inventive capabilities will leverage this for sure. With HMC's eventual CPU and Automata integration, it is a wonder that Micron is still an independent company. It is a wonder that they aren't trading at a substantial premium.
Avago (NASDAQ:AVGO) holds the SerDes communication technology at the core of the HMC, for example, and they're rightfully trading at their 52-week high. In 30 years, I'll be telling the grandkids about how computers used to cost hundreds of dollars and couldn't even drive a car or perform surgery. And Micron holds the bulk of what is going to enable this very soon.
While the HMC Consortium has many members, it was initially spearheaded by rivals Micron and Samsung. With Samsung taking a back seat role, Micron has amassed the bulk of the surrounding intellectual property for HMC. I believe that this IP is extremely valuable (the markets, not so much).
I don't believe that Samsung missed this opportunity accidentally. I believe that this was coordinated by a larger entity. There is no DDR5 incremental improvement coming - the next step is a radical overhaul of memory, compute and networking. While Intel is absolutely joined at the hip with Micron to this end, I don't believe that Intel is responsible for this coordination.
Disclosure: I am/we are long MU, INTC. I wrote this article myself, and it expresses my own opinions. I am not receiving compensation for it (other than from Seeking Alpha). I have no business relationship with any company whose stock is mentioned in this article.
Editor's Note: This article discusses one or more securities that do not trade on a major U.S. exchange. Please be aware of the risks associated with these stocks.