Apple Does Something Amazing

Oct.21.13 | About: Apple Inc. (AAPL)

With both Intel (NASDAQ:INTC) and Apple (NASDAQ:AAPL) designing world-class microprocessor cores, it's worthwhile to compare and contrast the design choices and trade-offs in Intel's "Silvermont" micro-architecture and Apple's "Cyclone" micro-architecture. The field of processor micro-architecture (that is, the set of design choices that determine the performance and power of a chip) is just as much an art as it is a science, and I believe that investors, industry analysts, and chip geeks alike will be interested in understanding what Apple and Intel were likely thinking when they designed their latest generation microprocessor cores.

I realize that most investors likely do not have a background in computer architecture, so I will begin with a brief primer. After that, I will talk about the design choices that Intel and Apple each made for their respective processors (I have detailed information on the Intel chip, but Apple's chip requires some guesswork on my part from its performance characteristics and die-size).

Instruction Set Architecture: Addressing The Myth

Many investors and press are often confused about the difference between "instruction set architecture" and "micro-architecture". So, here's the best way to think of this if you're new to this stuff. The instruction set architecture (or "ISA" for short) determines what operations a particular processor can execute. Instruction sets typically evolve over time as the instruction set designers identify common operations in software that matters to people and decide to give those operations a boost by implementing them directly in hardware.

The "myth" that ARM (NASDAQ:ARMH) and the other RISC players have tried to propagate for years is that their instruction sets are inherently more power efficient than the X86 instruction set architecture. While it is very difficult to meaningfully quantify this, let me explain the major reason why folks believe that ARM may offer a slight efficiency edge.

So, the bottom line is that all processors do basically the same thing: they grab instructions from memory, figure out what operations they need to do and what data they need to do it, execute those operations, and then, if applicable, write the result back to memory. This is the basis of the classic "5-stage RISC pipeline" which (simplified) says that a processor does the following:

  1. Fetch instruction - grab instruction to be executed
  2. Decode instruction - figure out what the instruction is supposed to do
  3. Execute - carry out the instruction
  4. Memory access - if need be, read/write to the memory
  5. Write back - write the results of the calculation to the "register file" (think of registers as memory that the processor can access really quickly - even faster than caches)

Of course, most pipelines are much more complex than this (the "Silvermont" pipeline is 13 stages long, and the ARM Cortex A15 has a 15 stage integer/17-25 stage floating point pipeline), but this is the general flavor of what's going on here.
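To make the pipeline idea concrete, here is a minimal Python sketch of an idealized 5-stage pipeline. It assumes one instruction enters per cycle and that nothing ever stalls, which real chips cannot guarantee; the stage abbreviations and function names are mine, for illustration only.

```python
# Idealized 5-stage pipeline: one new instruction enters per cycle, no stalls.
# Stage abbreviations (assumed names): fetch, decode, execute, memory, write back.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c."""
    n_cycles = num_instructions + len(stages) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * n_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i reaches stage s in cycle i + s
        rows.append(row)
    return rows


def total_cycles(num_instructions, depth=len(STAGES)):
    """Cycles needed to finish N instructions in an ideal pipeline of given depth."""
    return num_instructions + depth - 1


if __name__ == "__main__":
    for i, row in enumerate(pipeline_diagram(4)):
        print(f"instr {i}: " + " ".join(f"{cell:>3}" for cell in row))
    print("cycles for 4 instructions, 5 stages:", total_cycles(4))
    print("cycles for 1000 instructions, 13 stages:", total_cycles(1000, depth=13))
```

The point of the sketch is that once the pipeline is full, one instruction completes every cycle regardless of depth, so a deeper pipeline mostly costs you when it has to be refilled.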

Turning to the notion of "RISC" vs. "CISC", the idea is simple: in a RISC machine, each operation is very simple, and to accomplish a complex task, you simply chain up a bunch of these smaller operations to get to the desired result. These smaller, bite-sized operations typically execute very quickly, but you need a lot of them. On the other hand, something like Intel's X86 is a "CISC" design, in which each operation is more complex and does more work, so you need fewer of them.

The first major source of inefficiency that some point to in X86 against, say, ARM, is that (at least for non-SIMD instructions) all of ARM's instructions are 32 bits in length, while X86 instructions can vary wildly from 8 bits to 120 bits (1 to 15 bytes). So the disadvantage on the power side is that the decoder needs to expend extra power figuring out the length of every instruction that it deals with before it actually decodes it. The advantage, however, is that variable-length encoding uses instruction memory more efficiently, since common operations can be given short encodings.
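The decode trade-off can be sketched in a few lines of Python. The variable-length format below is a toy encoding I made up for illustration (the low 4 bits of the first byte give the instruction's length); it is not the real X86 encoding, which is far more involved.

```python
def fixed_boundaries(code: bytes, width: int = 4):
    """Start offsets when every instruction is `width` bytes long.

    The offsets are known up front, without looking at the bytes at all.
    """
    return list(range(0, len(code), width))


def variable_boundaries(code: bytes):
    """Start offsets for a TOY variable-length encoding (not real X86).

    Assumption: the low 4 bits of an instruction's first byte hold its
    total length in bytes (1 to 15).
    """
    offsets = []
    pos = 0
    while pos < len(code):
        length = code[pos] & 0x0F
        if length == 0:
            raise ValueError(f"invalid length at offset {pos}")
        offsets.append(pos)
        pos += length  # the next start is only known after inspecting this one
    return offsets


if __name__ == "__main__":
    fixed = bytes(12)                                    # three 4-byte instructions
    packed = bytes([0x01, 0x03, 0xAA, 0xBB, 0x02, 0xCC])  # three instructions in 6 bytes
    print(fixed_boundaries(fixed))
    print(variable_boundaries(packed))
```

In the fixed-width case every boundary is known immediately, so several instructions can be handed to decoders in parallel. In the variable-length case each boundary depends on the previous instruction, which is the extra work the decoder pays for, in exchange for three instructions fitting in 6 bytes rather than 12.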

While these differences may have meaningfully impacted power/performance in smaller designs, today's chips are so large and complex that something like instruction decode is just not all that big a portion of the die size nor of the chip's power budget. Further, Intel and AMD have come up with a number of clever "tricks" to mitigate the effects of the instruction decode complexity over the years. So, when it comes to mobile system-on-chip designs, "ARM vs. X86" or "RISC vs. CISC" really just doesn't matter all that much.

So, what does matter? Micro-architecture and physical layout.

Micro-architecture is what matters

What really defines the power and performance of a chip today is the micro-architecture, which is the actual design of the processor. Think of it this way: the Instruction Set Architecture defines what operations the machine can perform, but the micro-architecture defines how they're executed. You can have a great micro-architecture from a performance/watt standpoint or a garbage one, even if both implement the same instruction set architecture. For example, AMD's X86 chips get their butts handed to them 10 ways to Sunday by Intel's X86 chips. They're both X86, but Intel's "Haswell" is a superior design to AMD's "Piledriver".

But really, most instruction set architectures fundamentally do the same thing: compute things and spit out results. So, what matters is how they do it, rather than what "language" they speak to do so. There are hundreds - if not thousands - of decisions that go into defining a micro-architecture, and these designs are informed by a very rigorous analysis of the kind of software ("traces") that the chip designer expects will be run on said chip.

Now, there's no single "right" solution for a power efficient, low cost design. Depending on the strengths and weaknesses of a given chip design team, the problem of providing the best power/performance in a given area and power envelope can be solved in very different ways. In this article, I'd like to talk about the design decisions that Apple made for its "Cyclone" CPU core and the decisions that Intel made with its "Silvermont" CPU core, and how two radically different approaches can lead to two very good outcomes.

Brainiac Versus Speed Demon

In CPU design, there are generally two "extremes". On one hand, a chip house can go for what is called a "brainiac" design. That is, a very wide design that can execute a very high number of instructions per clock cycle. However, these designs are usually restricted to a fairly low clock speed for power and complexity reasons (i.e. you can't just assume that you can turn up the clock speed on a brainiac design and suddenly get massive amounts more performance in any reasonable power envelope).

On the other hand, you can go for a "speed demon" style design. These are typically not as complex and can't do as much work per clock cycle, but they can clock much higher within the same power envelope.

At the end of the day, CPU designers need to find the "optimal" balance between high performance per clock and clock speed for a given power envelope and die-size budget on a given manufacturing process. A pure "brainiac" design might run at 500 MHz, have a very high performance per clock, but still fall short of a more balanced design that still has "good" performance per clock but clocks at 2GHz. The operative word really is "balance".
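The balance described above comes down to a simple product: throughput is roughly instructions per clock (IPC) times clock speed. Here is a minimal Python sketch. The 500 MHz and 2GHz clocks come from the example above, but the IPC figures are hypothetical numbers I picked for illustration, not measurements of any real chip.

```python
def throughput_gips(ipc, clock_ghz):
    """Billions of instructions per second = instructions per clock * clock in GHz."""
    return ipc * clock_ghz


def breakeven_ipc(rival_ipc, rival_clock_ghz, own_clock_ghz):
    """IPC a design needs at its own clock to match a rival's throughput."""
    return rival_ipc * rival_clock_ghz / own_clock_ghz


# Hypothetical IPC values, chosen only to illustrate the trade-off.
brainiac = throughput_gips(ipc=4.0, clock_ghz=0.5)   # very wide, slow clock
balanced = throughput_gips(ipc=1.5, clock_ghz=2.0)   # narrower, faster clock

if __name__ == "__main__":
    print("pure brainiac:", brainiac, "GIPS")
    print("balanced design:", balanced, "GIPS")
    print("IPC the brainiac would need to catch up:",
          breakeven_ipc(rival_ipc=1.5, rival_clock_ghz=2.0, own_clock_ghz=0.5))
```

With these made-up numbers, the pure brainiac would need four times the balanced design's work per clock just to break even, which is why neither extreme wins on its own.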

So, as you can probably guess from the title, Apple went with a "brainiac" design. The chip doesn't do much in the way of fancy voltage/frequency scaling, and it doesn't clock above 1.3GHz, but its performance per clock is absolutely staggering - if Geekbench 3 is to be believed, then it is competitive with even Intel's "Haswell" on a per-clock basis (although Haswell is designed for 3GHz+ operation, while Cyclone is likely to top out at 1.5-1.6GHz).

Silvermont, on the other hand, is a very narrow design, and as a result, its per-clock, per-core performance is nowhere near what Apple's "Cyclone" can do. That being said, while we don't have power numbers for the A7, Silvermont can clock all the way up to 2.4GHz and still consume ~0.8-0.9W while doing so.

So, why would Apple go "brainiac" and why would Intel go "speed demon"? Good question.

Explaining Intel's versus Apple's Choices

According to Hiroshige Goto, an expert microprocessor analyst, a dual core "Silvermont" complete with 1MB of L2 cache weighs in at about 8mm^2, with a single core w/o the bus or cache weighing in at just ~2mm^2.

Compare this, then, to Apple's "Cyclone": according to yet another microprocessor expert (this time, Hans de Vries), a pair of Cyclone cores with 1MB of L2 cache weighs in at 14.5mm^2. Now, do note that Intel has a process advantage here, but it's tough to say how big Apple's A7 would be if it were built on Intel's 22nm process, as the design methodologies themselves may differ (and methodology contributes substantially to density). But do keep in mind that Apple's chip would likely see a material increase in transistor density if it were to move from Samsung's 28nm process to Intel's 22nm FinFET process.
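For a rough sense of scale, the two area estimates quoted above can be compared directly. This small Python sketch just divides the figures from the text; it deliberately ignores the difference between the 22nm and 28nm processes, so treat the ratio as a ballpark rather than a like-for-like comparison.

```python
# Area estimates quoted in the text, each for two cores plus 1MB of L2 cache.
SILVERMONT_PAIR_MM2 = 8.0   # approximate figure attributed to Hiroshige Goto
CYCLONE_PAIR_MM2 = 14.5     # figure attributed to Hans de Vries


def area_ratio(a_mm2, b_mm2):
    """How many times larger area a is than area b."""
    return a_mm2 / b_mm2


if __name__ == "__main__":
    ratio = area_ratio(CYCLONE_PAIR_MM2, SILVERMONT_PAIR_MM2)
    print(f"Cyclone pair is about {ratio:.2f}x the area of the Silvermont pair")
```

In other words, by these estimates the Cyclone pair takes up a bit under twice the silicon of the Silvermont pair, before any adjustment for process.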

The thing is, Intel probably went with a "speed demon", but narrower, design in a bid to stay cost effective. Now, while super high clock speed designs often mean that something's got to give on density, Intel still probably came out ahead with respect to actual transistor counts by going with a narrow, fairly high clocking design rather than a wider, slower clocking design. It is also important to note that Intel's chip has a very sophisticated power management unit that does very interesting things in dynamically scaling voltage/frequency, while the Apple chip appears not to have such scaling, according to Anandtech.

Also note that Intel's 22nm FinFET process has better performance/power characteristics than Samsung's 28nm HKMG process, which means that the individual transistors can switch faster (i.e. clock higher) on Intel's 22nm process than Samsung's 28nm process. So, why go wide, when your process lead lets you stay small and narrow?

Apple is currently limited to Samsung's 28nm HKMG process, so trying to compete on clock speed is silly, particularly as higher clock speeds generally require higher voltage, and dynamic power scales quadratically with voltage (to a first approximation, P = C * V^2 * f, where P = dynamic power, C = switched capacitance, V = supply voltage, and f = clock frequency). Apple, instead, wisely chose to forget about chasing clock speed and to keep clocks low while doing a lot of work per clock. Make no mistake, "more work per clock" drives power up, but it was likely a much better power/performance trade-off than doing a less brainiac architecture and then trying to clock it up. Just ask ARM how its Cortex A15 does power wise when it goes past about 1.2GHz.
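To see why chasing clock speed gets expensive, here is a minimal Python sketch using the standard first-order approximation for CMOS switching power, P = a * C * V^2 * f (a = activity factor, C = switched capacitance, V = supply voltage, f = clock frequency). The model is a textbook simplification that ignores leakage, and the scaling numbers in the example are hypothetical, not measurements of the A7 or of Bay Trail.

```python
def dynamic_power(activity, capacitance_f, voltage_v, freq_hz):
    """First-order CMOS switching power in watts: a * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz


def relative_dynamic_power(voltage_scale, freq_scale):
    """Power relative to a baseline when voltage and frequency are both scaled.

    Activity factor and capacitance cancel out, leaving V_scale^2 * f_scale.
    """
    return voltage_scale ** 2 * freq_scale


if __name__ == "__main__":
    # Hypothetical: a 20% clock bump that needs 10% more voltage to be stable.
    scale = relative_dynamic_power(voltage_scale=1.10, freq_scale=1.20)
    print(f"20% more clock at 10% more voltage costs about {scale:.2f}x the power")
```

Because the voltage term is squared, a modest clock increase that also needs a voltage bump costs disproportionately more power, which is the heart of the argument for keeping clocks low and going wide instead.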

A Look At The Performance Numbers

It's difficult to find a good set of benchmarks to objectively compare the performance of Apple's "Cyclone" at 1.3GHz and Intel's "Silvermont" at 2.4GHz, but - flawed as it is - Geekbench 3 is probably the closest that we can get today. So, with that, let's take a look at a comparison of the two chips in the three major buckets that Geekbench 3 tests: integer performance, floating point performance, and memory performance.


Here are the Geekbench 3 results for A7 against Atom Z3770:

On a single-core basis, the Silvermont wins 4/13 tests while A7 wins 9/13 tests, with the Cyclone leading the Silvermont by 28% in aggregate.

On a multicore basis (remember: A7 has two cores, Z3770 has four), the quad Silvermont wins by 45% in aggregate and wins 11/13 tests.

Floating Point

In floating point, on a per core basis, the A7 wins by 63% (!), losing only one test. On a multicore basis, Intel's quad core Silvermont edges out ahead scoring 2982 in aggregate against the A7's 2633.
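The multicore floating point scores quoted above can be broken down with a little arithmetic. This Python sketch uses only the two aggregate scores and the core counts from the text. Dividing a multicore score evenly by the core count is a naive assumption on my part (real scaling is never perfect), so the per-core figure will not match the measured single-core gap exactly.

```python
# Aggregate multicore floating point scores and core counts from the text.
Z3770_SCORE, Z3770_CORES = 2982, 4   # quad-core Silvermont
A7_SCORE, A7_CORES = 2633, 2         # dual-core Cyclone


def lead(a, b):
    """Fractional lead of score a over score b (0.10 means 10% ahead)."""
    return a / b - 1.0


def naive_per_core(score, cores):
    """Multicore score split evenly across cores (assumes perfect scaling)."""
    return score / cores


if __name__ == "__main__":
    print(f"Z3770 multicore lead: {lead(Z3770_SCORE, A7_SCORE):.1%}")
    a7_core = naive_per_core(A7_SCORE, A7_CORES)
    z_core = naive_per_core(Z3770_SCORE, Z3770_CORES)
    print(f"naive per-core scores: A7 {a7_core}, Z3770 {z_core}")
    print(f"A7 naive per-core lead: {lead(a7_core, z_core):.1%}")
```

So the quad-core Silvermont's aggregate win is on the order of 13%, and it takes twice as many cores to get there.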

Apple's "Cyclone" is a floating point monster.

Memory Performance

In the final round of tests, Apple simply cleans up, winning by about 50% in aggregate and losing no test.

So, What To Think?

Now this is a real dilemma - did Apple just out-design Intel on the CPU side of things? Well, sorta. On one hand, Geekbench is not the be-all and end-all of CPU benchmarks, and other benchmarks suggest that Silvermont and Cyclone have very similar single threaded CPU performance. On the other hand, it's hard to ignore such consistently good results from the A7 across Geekbench's sub-tests.

The real question is power. How much power does the "Cyclone" core draw at full load? How long can it sustain full performance? Do four Silvermont cores fit into the same power envelope as the A7? The fact that Apple's A7 powers a svelte smartphone suggests that it is probably very good on power, but I have seen Bay Trail power measured at sub-2.5W at full load. If 4 Silvermont cores can fit in the same power envelope as 2 Cyclones, then Intel's design makes more sense for multithreaded workloads, while Apple's makes more sense for single-threaded ones.

What's not up for debate is that Apple's design team has done a damn good job with its second low power processor design, proving that Apple is a more deeply technical and innovative company than ever before. Intel's processor team also has done a fine job, particularly as "Silvermont" competes very well with the merchant chips from Qualcomm (NASDAQ:QCOM) and Nvidia (NASDAQ:NVDA). But it's clear that Apple will be using its own designs for generations to come as its design team truly is best in class.

Disclosure: I am long INTC, NVDA. I wrote this article myself, and it expresses my own opinions. I am not receiving compensation for it (other than from Seeking Alpha). I have no business relationship with any company whose stock is mentioned in this article.