Intel: The 3 Failures Of Brian Krzanich - Part 2
Summary
- Failure #2: GPU computing.
- The Larrabee misfire.
- The Koduri recruitment.
- Investor takeaways.
Rethink Technology business briefs for June 29, 2018.
Failure #2: GPU Computing

Summit at Oak Ridge. Source: servethehome.com.
The state of high-performance computing today is exemplified by the U.S. Department of Energy's Summit supercomputer at Oak Ridge National Laboratory. With the recent completion of Summit, the U.S. now possesses the world's fastest supercomputer, according to Top500.org. On the Linpack benchmark used for the Top500 rankings, Summit is rated at 122.3 petaflops (a petaflop is 10^15 floating point operations per second). The previous record holder was the Sunway TaihuLight, located in the People's Republic of China, rated at 93 petaflops. According to the DOE, Summit has a maximum theoretical performance of 200 petaflops.
The main computing engine of Summit is not made by Intel (NASDAQ:INTC) or any other CPU vendor; instead, Summit relies on the Volta GV100 graphics processing unit (GPU) made by Nvidia (NASDAQ:NVDA). Supervision of the GPUs is provided by IBM's (IBM) Power9 RISC processor. Summit consists of 4,356 nodes, essentially rack-mount servers, each containing two Power9 processors and six Volta GPUs.
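As a quick sanity check on those numbers, here is a minimal Python sketch that multiplies the node count by the GPUs per node and an assumed per-GPU double-precision peak of roughly 7.8 teraflops for a Volta GV100 (the per-GPU figure is my assumption from Nvidia's published specs, not something stated above):

```python
# Back-of-the-envelope estimate of Summit's GPU-only theoretical peak,
# using the node counts quoted above. The per-GPU figure is assumed.
NODES = 4356                # Summit nodes
GPUS_PER_NODE = 6           # Volta GV100 GPUs per node
FP64_TFLOPS_PER_GPU = 7.8   # assumed peak double-precision TFLOPS per GV100

peak_petaflops = NODES * GPUS_PER_NODE * FP64_TFLOPS_PER_GPU / 1000
print(f"GPU-only theoretical peak: ~{peak_petaflops:.0f} petaflops")
# ~204 petaflops, consistent with the DOE's ~200 petaflop figure;
# the 122.3 petaflops Top500 rating is the measured Linpack result.
```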
To facilitate data flow and management of the vast fleet of GPUs, each Power9 has three built-in second-generation Nvidia NVLink high-speed data interfaces. Each GPU, in turn, uses NVLink to communicate with two other GPUs and with a Power9, as shown in the diagram below from a presentation at the Supercomputing 2017 (SC17) conference.
Source: Tom's Hardware
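To make the wiring concrete, the sketch below captures one node's NVLink connections as just described: each Power9 drives three links, and each GPU talks to one CPU and two peer GPUs. The split into two CPU-plus-three-GPU halves is my reading of the diagram, not a detail spelled out in the text.

```python
# Minimal sketch of the NVLink topology within one Summit node, as
# described above. The grouping into two halves is assumed from the diagram.
node_links = {
    "CPU0": ["GPU0", "GPU1", "GPU2"],   # three NVLink interfaces per Power9
    "CPU1": ["GPU3", "GPU4", "GPU5"],
    "GPU0": ["CPU0", "GPU1", "GPU2"],   # each GPU: one CPU plus two peer GPUs
    "GPU1": ["CPU0", "GPU0", "GPU2"],
    "GPU2": ["CPU0", "GPU0", "GPU1"],
    "GPU3": ["CPU1", "GPU4", "GPU5"],
    "GPU4": ["CPU1", "GPU3", "GPU5"],
    "GPU5": ["CPU1", "GPU3", "GPU4"],
}

# Sanity check: every link is symmetric (if A lists B, then B lists A).
assert all(a in node_links[b] for a, peers in node_links.items() for b in peers)
```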
Summit has a sister supercomputer located at Lawrence Livermore National Laboratory called Sierra. Sierra consists of 4,320 nodes, each containing two IBM Power9 processors and four Nvidia Volta GPUs. Sierra is now ranked third in the world with a rating of 71.6 petaflops.
As noted by top500.org, Nvidia GPU accelerators are used in 98 of the top 500 supercomputers, including 5 of the top 10. Intel's hoped-for competitor to Nvidia's GPUs, the Xeon Phi, is present in only 27 of the top 500 supercomputers, including two Cray XC40s ranked ninth and tenth (the highest rankings for computers using Xeon Phi). The performance of those Cray XC40s pales in comparison to Summit, with ratings of about 14 petaflops each.
During Brian Krzanich's tenure as CEO, Intel failed to field a viable competitor to Nvidia's GPU for general purpose compute acceleration. Even as Intel saw expanding revenue in the datacenter, a quiet disruption was taking place in high-performance computing. This disruption has begun to filter into hyperscale cloud providers as they embrace machine learning.
Only belatedly, with the recruitment in November 2017 of Raja Koduri, who headed AMD's (AMD) Radeon Technologies Group, has Intel seemed to grasp the value of the GPU architecture.
The Larrabee Misfire
Larrabee was to be Intel's answer to the GPU, a discrete graphics chip for the consumer market. Instead of the massively parallel architecture of a GPU, Intel used a smaller number of simple x86 cores, which Intel claimed offered greater programmability:

Source: Anandtech.
In 2010, Intel announced that it was abandoning the consumer GPU effort to focus on using the Larrabee architecture for high-performance computing. Although Intel wouldn't admit it, it was clear at that point that Larrabee was not competitive as a graphics processor with GPUs from Nvidia or AMD.
An article in PC Perspective from 2012 summed up Larrabee:
The problem with Larrabee and the consumer space was a matter of focus, process decisions, and die size. Larrabee is unique in that it is almost fully programmable and features really only one fixed function unit. In this case, that fixed function unit was all about texturing. Everything else relied upon the large array of x86 processors and their attached vector units. This turns out to be very inefficient when it comes to rendering games, which is the majority of work for the consumer market in graphics cards. While no outlet was able to get a hold of a Larrabee sample and run benchmarks on it, the general feeling was that Intel would easily be a generation behind in performance. When considering how large the die size would have to be to even get to that point, it was simply not economical for Intel to produce these cards.
Intel rebranded Larrabee as Xeon Phi and trumpeted its performance as an accelerator for high-performance computing.
Source: PC Perspective.
Larrabee was already water under the bridge by the time Krzanich became CEO in 2013, but under Krzanich, Intel continued to push Xeon Phi as the superior alternative to GPUs. Intel continues to make these claims to this day, based on performance measurements from June 2016.
The performance comparison was self-serving and specious since it compared Xeon Phi to an already obsolete Nvidia Tesla K80. The K80 consisted of a pair of Kepler GPUs that were two generations behind Nvidia's latest Pascal GPU architecture, which had been announced at its GPU Technology Conference in April 2016.
Not only was the K80 obsolete in terms of architecture, it was also obsolete in terms of process. The Pascal generation used TSMC's 16 nm FinFET process, whereas Kepler had used the 28 nm process. The first Pascal accelerator that Nvidia announced, shown below, was light years ahead of anything Nvidia had done before.
Nvidia's P100 accelerator. Source: Anandtech.
In addition to being faster and more energy efficient than its predecessors, the P100 used Nvidia's proprietary NVLink data interface to connect with other P100s or a CPU. NVLink provided much higher bandwidth than PCIe.
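To put a rough number on that gap, the sketch below compares aggregate per-direction bandwidth for the P100's first-generation NVLink against a PCIe 3.0 x16 slot. The figures (about 20 GB/s per direction per NVLink link, four links on the P100, and roughly 16 GB/s per direction for PCIe 3.0 x16) are my approximations from public specs, not numbers taken from this article.

```python
# Approximate per-direction bandwidth: NVLink 1.0 on the P100 vs. PCIe 3.0 x16.
# All figures are assumed round numbers from public specs.
NVLINK1_GBPS_PER_LINK = 20   # ~20 GB/s per direction per NVLink 1.0 link
NVLINK_LINKS_ON_P100 = 4     # the P100 exposes four NVLink links
PCIE3_X16_GBPS = 16          # ~16 GB/s per direction for PCIe 3.0 x16

nvlink_total = NVLINK1_GBPS_PER_LINK * NVLINK_LINKS_ON_P100
print(f"NVLink (4 links): ~{nvlink_total} GB/s per direction")
print(f"PCIe 3.0 x16:     ~{PCIE3_X16_GBPS} GB/s per direction")
print(f"Advantage:        ~{nvlink_total / PCIE3_X16_GBPS:.0f}x")
```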
P100s were probably difficult to obtain at the time Intel published its comparison, but Intel has never bothered to update the comparison since. In fact, the P100 already outperformed Xeon Phi in the all-important metric of performance per watt, and the gap has only widened since the release of the Volta-generation V100 accelerator. Intel claims about 12 gigaflops/watt for Xeon Phi, whereas Volta is capable of about 26 gigaflops/watt.
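Using those efficiency figures, the short sketch below shows what the gap implies for the power budget of a fixed amount of compute; the one-petaflop target is purely illustrative.

```python
# Power needed to sustain 1 petaflop at the quoted efficiency figures.
XEON_PHI_GFLOPS_PER_WATT = 12   # Intel's claimed figure for Xeon Phi
VOLTA_GFLOPS_PER_WATT = 26      # approximate figure for Volta
TARGET_GFLOPS = 1_000_000       # 1 petaflop = 1,000,000 gigaflops (illustrative)

for name, eff in [("Xeon Phi", XEON_PHI_GFLOPS_PER_WATT),
                  ("Volta V100", VOLTA_GFLOPS_PER_WATT)]:
    kilowatts = TARGET_GFLOPS / eff / 1000
    print(f"{name}: ~{kilowatts:.0f} kW per petaflop")
# Roughly 83 kW vs. 38 kW: Volta delivers the same throughput for
# well under half the power.
```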
In a "white paper" published by Citron in June 2017, Xeon Phi (Knights Mill) was listed as a threat to Nvidia's GPU dominance of high-performance computing. Xeon Phi is nothing of the sort. Xeon Phi is a dead end.
The Koduri Recruitment
The fact that Xeon Phi was a dead end was probably fairly apparent to Intel management under Krzanich even before the release of the Nvidia P100 in 2016. In the area of machine learning, it already was clear that GPUs outperformed CPUs. Intel cast about for any processor architecture that might confer advantage in machine learning or the datacenter, as long as it wasn't a GPU.
Intel spent considerable treasure doing this. In December 2015, Intel bought FPGA maker Altera for $16.7 billion, and it currently offers an Altera FPGA-based accelerator.
In August 2016, Intel bought deep learning startup Nervana Systems, reportedly for more than $350 million. Although the Nervana Neural Network Processor is often cited as a threat to Nvidia, it has yet to see the light of day. Intel recently announced that its first product based on the Nervana chip, now designated Spring Crest, will be released in 2019.
None of Intel's initiatives, from Larrabee on down, have served to inhibit Nvidia's growth in the datacenter, as can be seen from Nvidia's datacenter revenue history.
Source: Nvidia earnings data.
If the hiring of Raja Koduri signifies anything, it's the belated realization of the durability and power of the GPU. Up until Intel made him SVP of a newly formed Core and Visual Computing Group in November 2017, Koduri had led GPU design and development at AMD as head of the Radeon Technologies Group.
Although AMD fans have been rather unkind to Koduri following his departure, I think Koduri worked a near miracle in keeping AMD as close to Nvidia as it is in graphics, given the dearth of resources. But developing GPUs competitive with Nvidia's has become a huge, multi-year, billion-dollar effort. Intel is not expected to have its first discrete GPU out until 2020.
Given that Nvidia's R&D is very well funded and the company won't be standing still over the next couple of years, I consider it unlikely that even Koduri can catch Nvidia.
Investor Takeaways
There's a major disruption underway in the datacenter right now, and it has nothing to do with the contest between Intel CPUs and AMD CPUs. This disruption is due to the growth of GPU acceleration, as exemplified by Summit. In the new GPU compute paradigm, the CPU is relegated to a supervisory role for high-performance computing.
The reader may ask: isn't that just supercomputing? No, it isn't, owing to the ongoing growth of cloud-based machine learning and big data analytics. Hyperscale cloud providers such as Google (GOOG) (GOOGL), Amazon (AMZN), and Microsoft (MSFT) are rapidly expanding their machine learning capabilities. And all of them are using Nvidia GPUs to do it.
Source: Nvidia.
This missed opportunity in GPUs is part of the Krzanich legacy. The overall growth of cloud computing has served to mask the impact of GPUs on Intel's datacenter business in the near term, but I don't expect that to continue for much longer.
If the CPU is to play a minor supervisory role in the datacenter of the future, then its specific architecture will become less important. Datacenters will tend to prefer the most energy-efficient architecture, and compatibility with legacy software will recede as a priority.
The energy efficiency advantage of ARM architecture, which I described in part 1 of this series, will tend to become decisive.
Nvidia is part of the Rethink Technology Portfolio and is rated a buy. Consider joining Rethink Technology for exclusive reports on technology companies and developments.
Analyst’s Disclosure: I am/we are long NVDA. I wrote this article myself, and it expresses my own opinions. I am not receiving compensation for it (other than from Seeking Alpha). I have no business relationship with any company whose stock is mentioned in this article.
Seeking Alpha's Disclosure: Past performance is no guarantee of future results. No recommendation or advice is being given as to whether any investment is suitable for a particular investor. Any views or opinions expressed above may not reflect those of Seeking Alpha as a whole. Seeking Alpha is not a licensed securities dealer, broker or US investment adviser or investment bank. Our analysts are third party authors that include both professional investors and individual investors who may not be licensed or certified by any institute or regulatory body.