One of the chief criticisms that I received for my recent article, "Apple Outguns Intel," was that I was showing Geekbench results for the Apple (NASDAQ:AAPL) A7 running on 64-bit iOS while only showing 32-bit Android results for Intel's (NASDAQ:INTC) recently announced Atom Z3480 platform. There was a reason that I did this, and it was actually, believe it or not, to provide a more apples-to-apples comparison between the Intel chip and the Apple chip in that particular benchmark.
You'll see below that "64-bit" doesn't actually do much (if anything) for the Intel Silvermont core (Z3480 packs two of them) in that benchmark, but it makes a pretty big difference for the ARMv8-based "Cyclone" core inside the Apple A7. That is because the ARMv8 instruction set adds dedicated cryptography instructions (for AES and SHA-1 in particular, comparable to Intel's AES-NI on x86) that aren't available on 32-bit ARMv7 chips and that the 32-bit ARMv7 build of the benchmark does not use.
Let me illustrate what I mean.
Apple A7: ARMv7 (32-bit) vs. ARMv8 (64-bit)
Since the ARMv8 instruction set adds a number of important features and, in particular, dedicated cryptography instructions, the performance delta between the A7 in 32-bit mode and 64-bit mode in Geekbench 3 is actually pretty staggering:
all images from Primate Labs website
Breaking it down by the benchmark's 3 sections (Integer, Floating Point, and Memory), you can see where the improvements are coming from:
In the Integer results, we get a huge speedup in AES and SHA1 (both cryptography algorithms) because ARMv8 implements these in hardware (and Apple seems to have done a good job implementing them). The rest of the benchmarks see mild improvements (or, in the Dijkstra case, a slight regression).
Let's take a look at floating point:
You can see that the improvements (AArch64 has twice as many 128-bit SIMD registers) are much more broad-based and in aggregate lead to a pretty significant improvement in the total score.
Finally, the memory results:
The difference here is largely due to the improved Stream Copy with the rest of the tests identical.
The point remains, though, that because Apple implemented ARMv8 (and in particular, AArch64), it was able to gain a pretty sizable speedup thanks to the much more powerful instruction set. It's also worth noting that the CPU core itself is much improved, even in 32-bit mode over the "Swift" core inside of the Apple A6.
Intel 32-Bit vs. Intel 64-Bit
With the release of the first 64-bit enabled Atom tablet (the HP ElitePad 1000 G2), we now have some legitimate 64-bit Windows 8.1 results for a new chip known as the Atom Z3795. I will show you in a moment that the "32-bit" versus "64-bit" question really doesn't matter at all for the Intel chips (particularly as Intel's chips can make use of those dedicated cryptography instructions):
Indeed, there is quite literally no difference between the 32-bit and 64-bit versions of Geekbench for Intel "Silvermont" based SoCs. This is why I did not bother to show 64-bit results for Intel in the prior article.
Intel: GCC vs. ICC
One thing that does make a difference above all else is what compiler is used to generate the Geekbench program (i.e. the translation from source code to a program that users can run). The difference between Microsoft's compiler/GCC and Intel's own compiler is pretty staggering. Let's go through it by benchmark section:
When the benchmark is compiled with Intel's own compiler, the integer section sees a pretty nice speedup across all of its constituent tests, although the section score per core sees only a modest improvement.
In floating point, the difference is staggering. Intel's chip sees a 29% speedup in that section, with one test seeing about a 3x improvement! Here's what Mr. John Poole, lead programmer/author of Geekbench, had to say:
We haven't done a lot of work with ICC, but we've found that ICC builds running on Silvermont are faster, with a 5% increase in integer performance and a 29% increase in floating point performance.
Most of the floating point increase comes from the Sharpen Filter workload which ICC is able to vectorize. Also, the MKL library, which ICC uses in place of the standard math library, helps improve the performance of the BlackScholes, N-Body, and Ray Trace workload
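To put those two percentages in perspective, here is a rough back-of-the-envelope sketch of what they could mean for the overall score. It rests on assumptions that are not in the quote: that the Geekbench 3 overall score is a weighted mean of the section scores with weights of 35% integer, 35% floating point, and 30% memory, that the memory section is unchanged by the compiler, and that all three sections start from the same (hypothetical) baseline score. Only the 5% and 29% figures come from Mr. Poole.

```python
# Assumed section weights for the Geekbench 3 overall score (an assumption,
# not stated in the quote above).
WEIGHTS = {"integer": 0.35, "floating_point": 0.35, "memory": 0.30}

# Section speedups quoted for ICC builds on Silvermont; the memory section
# is assumed to be unaffected by the compiler.
ICC_SPEEDUP = {"integer": 1.05, "floating_point": 1.29, "memory": 1.00}


def overall_score(section_scores, weights=WEIGHTS):
    """Weighted arithmetic mean of the section scores."""
    return sum(weights[name] * section_scores[name] for name in weights)


def overall_gain(baseline, speedup=ICC_SPEEDUP, weights=WEIGHTS):
    """Fractional change in the overall score after applying the speedups."""
    improved = {name: baseline[name] * speedup[name] for name in baseline}
    return overall_score(improved, weights) / overall_score(baseline, weights) - 1.0


# Hypothetical baseline: equal section scores, chosen only for illustration.
baseline = {"integer": 1000.0, "floating_point": 1000.0, "memory": 1000.0}
gain = overall_gain(baseline)
print(f"Estimated overall gain: {gain:.1%}")
```

Under those assumptions the overall score would rise by roughly 12%. With real section scores the figure would shift, since the baseline sections are not equal in practice.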
In short, the sort of advantages that the Apple A7 enjoys (thanks to ARMv8) are exposed on Silvermont when the program is compiled with Intel's compiler but, for whatever reason, not with Microsoft's compiler or GCC. With that in mind, the A7 is still faster per core in Integer/Floating Point workloads by a non-trivial amount (although it probably draws more power at peak load), but Silvermont does look a lot better when Geekbench 3 is compiled with Intel's compiler than it does with GCC/MSVC++, for what appear to be legitimate reasons/optimizations and NOT by "cheating" (again, Geekbench 3's developer explains it nicely).
However, it is puzzling that Intel has not been more aggressive in making sure all of the popular benchmarks are compiled with ICC, particularly as these benchmarks do influence buyers and OEMs alike.
I still stand by my previous article. Merrifield is slower than the A7, and Moorefield, with four cores, is likely to be faster, although it will be facing the Apple A8 and not the A7. With that in mind, I think that what really happened here is that Intel under-designed Silvermont by targeting a 1 watt/core power envelope while the Apple engineers were much more aggressive. In a mobile device, the screen is by far the biggest drain of power, and an extra 0.5W to 1W per core at peak performance would have been enough on Intel's 22nm process to match the power envelopes of its competition but dramatically outperform them.
The irony here is not lost on me. Intel built the industry's most power-efficient mobile CPU core, but ended up with a core too frugal on power and not aggressive enough on raw CPU performance.