About a month ago, I penned what turned out to be the second most popular article that I had ever written, "Apple Does Something Amazing". In that piece, I presented what I believed to be an accurate comparison of Intel's (NASDAQ:INTC) "Bay Trail" chip for tablets against Apple's (NASDAQ:AAPL) Cyclone core, found inside the A7 SoC that powers the iPhone 5s (as well as the iPad Air and iPad mini with Retina Display). The conclusions that I came to were the following:
- On a per core basis, A7 was on the order of 50% faster than Intel's Z3770
- In a comparison of 4 cores against 2 cores, Intel's Z3770 was slightly faster
- Apple had done something truly remarkable
Since that time, I have conducted a more thorough investigation and would like to update my conclusions. I'll spoil it upfront, though, for those who do not wish to read the entire article:
- On a per core basis, Apple's A7 is still meaningfully faster than Intel's Silvermont, although not by as much as I had originally seen
- In a multi-core comparison (4 Silvermont cores against 2 Cyclone cores), Intel offers more performance in tablet power envelopes
- Apple's chip offers superior GPU performance, but the delta is not large (and the comparison may not even be fair/apples-to-apples)
In short, I have been able to - with much thanks to John Poole, founder of Primate Labs (the folks who write Geekbench) - get a much better picture of exactly what's going on with the performance numbers of these chips and to frame them in a better context.
Before we go any further, understand that I own shares of Intel and have no position in Apple, but do have a deep respect for both companies. I am not in the business of attempting to twist reality to fit what I "want" to happen, but am instead interested in ascertaining the cold, hard truth - whether it suits my stock positions or not. If I find something that contradicts a thesis of mine, I will have absolutely no problem issuing a note to my readers to that effect.
With that out of the way, I wish to provide an update on A7 versus Z3770 and to then consider the longer-term implications that can be reasonably extrapolated from the current trends.
Testing A Hypothesis
Stepping back a bit, it's important to understand a little bit about how software works. When a programmer sits down to write code in a language such as C or C++, this code actually looks a lot like English. The CPU understands not a word of it until it is translated into the low-level machine language that the CPU can understand. The tool that does this translation is known as a compiler.
Now, not all compilers are created equal, and not all projects compiled with the same compiler will perform the same, as developers can set "flags" that tell the compiler what to optimize for and which hardware to target. When I did the comparison of Intel's Z3770 against Apple's A7, I noted the following:
- The Intel chip, on Windows, was running code compiled with Microsoft's Visual Studio 2012
- The Apple chip, on iOS, was running code compiled with Apple's modified version of the GNU Compiler Collection compiler ("GCC")
The results were less-than-flattering for the Intel Silvermont core, with Cyclone beating Silvermont out by 28% in integer workloads and 63% in floating point workloads. Frankly, it was embarrassing.
My hypothesis upon seeing these results was that Intel would see a rather sizable speedup if the application were built with Intel's own compiler rather than Microsoft's or the GCC compiler. I didn't have any proof of this, though, and I was resigned to the fact that unless the developer of Geekbench 3 recompiled his code with Intel's compiler, I had no way to test the hypothesis.
Contacting Primate Labs, Getting Better Results
On November 13th, I reached out via e-mail to Mr. John Poole, the lead developer of Geekbench, and asked him whether he had compiled Geekbench 3 with Intel's compiler and whether he could give me any indication of the performance delta - if it even existed. Much to my surprise, not only had Mr. Poole already compiled the test with Intel's compiler and run it, he even had the results up in the Geekbench 3 database (although you'd never be able to find them without knowing what you're looking for)! Here is a comparison of the same hardware running Geekbench 3, but with the program built with different compilers (along with some relevant competitive comparisons, to boot):
|Result||Atom Z3770 4C/4T 2.4GHz (MSVC++, Windows 8.1)||Atom Z3770 4C/4T 2.4GHz (ICC, Windows 8.1)||Apple A7 1.3GHz 2C/2T (iPad Air, Apple compiler, iOS)||Qualcomm Snapdragon 800 2.26GHz 4C/4T (Nexus 5, GCC, Android)|
|Floating Point (single-thread)||816||1054||1433||822|
|Floating Point (multi-thread)||2993||3945||2833||2961|
Now, note a couple of things before we continue:
- This test is short enough that these chips don't throttle. The A7 in the iPad Air performs almost identically to the A7 in the iPhone 5s, even though Anandtech has confirmed that under sustained workloads, the A7 in the iPhone clocks itself at about half of what it does in the iPad Air. In other words, Qualcomm's score is unlikely to improve if results from a tablet were used.
- The OS does make a difference, but it doesn't seem to be too large - Z3770 compiled with the GCC compiler doesn't perform too differently from Z3770 running code compiled with Visual C++ on Windows 8.1 (see comparison here).
Interestingly enough, there's not much of a speedup on integer code for Silvermont with ICC versus Microsoft's compiler or the GCC compiler - it's on the order of 5%. But what we see going on in the single-thread floating point results is pretty startling - Intel sees a 29% speedup. What the heck?
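For readers who want to check the arithmetic, here is a minimal sketch that recomputes the floating point deltas from the scores in the table. The dictionary layout and the function names (`pct_gain`, `compare`) are my own illustrative choices and are not part of Geekbench.

```python
# Floating point scores copied from the comparison table in this article.
FP_SCORES = {
    "Z3770 (MSVC++)": {"single": 816, "multi": 2993},
    "Z3770 (ICC)": {"single": 1054, "multi": 3945},
    "Apple A7": {"single": 1433, "multi": 2833},
    "Snapdragon 800": {"single": 822, "multi": 2961},
}


def pct_gain(new, old):
    """Return how much higher `new` is than `old`, in percent."""
    return (new / old - 1.0) * 100.0


def compare(faster, slower, kind):
    """Percent lead of one table column over another for 'single' or 'multi'."""
    return pct_gain(FP_SCORES[faster][kind], FP_SCORES[slower][kind])


if __name__ == "__main__":
    pairs = [
        ("Z3770 (ICC)", "Z3770 (MSVC++)", "single"),
        ("Z3770 (ICC)", "Z3770 (MSVC++)", "multi"),
        ("Apple A7", "Z3770 (ICC)", "single"),
        ("Z3770 (ICC)", "Apple A7", "multi"),
    ]
    for faster, slower, kind in pairs:
        lead = compare(faster, slower, kind)
        print(f"{faster} over {slower} ({kind}-thread): {lead:.1f}%")
```

The single-thread ICC gain works out to roughly 29%, and with ICC the A7's single-thread floating point lead over the Z3770 shrinks to roughly 36%.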
Digging more deeply into the results, we can see something interesting:
While the gains appear across the board, the "Sharpen Filter" test sees a 3x speedup with ICC over VC++, and the BlackScholes result cleanly doubles. According to Mr. Poole, ICC was able to vectorize the "Sharpen" workload (Microsoft's compiler was not) - in layman's terms, the code uses Silvermont's execution resources more effectively by performing the same operation on multiple sets of operands (i.e. input data) at once. This workload is vectorized on Apple's Cyclone core/ARMv8, but it is unclear whether it is on ARMv7 chips. It's amazing how an outlier can lead to misleading comparisons.
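To make "vectorized" a little more concrete, here is a toy sketch that counts how many multiply "instructions" are issued with and without grouping the data. It is plain Python that only emulates the idea; it is not Geekbench's Sharpen code, and the lane width of 4 (four 32-bit floats in a 128-bit register) is an assumption chosen for illustration.

```python
LANES = 4  # assumed SIMD width: four 32-bit floats per 128-bit register


def scale_scalar(values, factor):
    """Multiply each element separately: one operation issued per element."""
    out, ops = [], 0
    for v in values:
        out.append(v * factor)
        ops += 1
    return out, ops


def scale_vectorized(values, factor):
    """Emulate SIMD: one operation issued per group of LANES elements."""
    out, ops = [], 0
    for start in range(0, len(values), LANES):
        chunk = values[start:start + LANES]
        # On real hardware this whole chunk would be one vector multiply.
        out.extend(v * factor for v in chunk)
        ops += 1
    return out, ops


if __name__ == "__main__":
    pixels = [float(i) for i in range(1, 9)]
    scalar_out, scalar_ops = scale_scalar(pixels, 2.0)
    vector_out, vector_ops = scale_vectorized(pixels, 2.0)
    print("same results:", scalar_out == vector_out)
    print("operations issued:", scalar_ops, "vs", vector_ops)
```

Both versions produce identical output; the difference is only in how many operations are issued, which is why a compiler that manages to vectorize a loop can post a much higher score on the same hardware.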
According to Anandtech, the Apple A7 is a bit of a power-hungry beast at full load - chugging about 8 watts in a power-virus scenario:
Now, this power virus stresses both the CPU and the GPU, so the dual Cyclone cores are obviously sucking down less than 8 watts combined, but it's clear that these are not cores designed for the 1 watt operating point. Indeed, if I had to hazard a guess, I would say that at the full 1.3GHz, each Cyclone core draws between 2 and 2.5W, with much of its life spent well under that.
Interestingly enough, when I was at IDF, I saw a very sophisticated power demonstration of Intel's Bay Trail. Running the very intense PC CPU benchmark, Cinebench, the Z3770 did not exceed 2.5W. And, in Intel's presentation at IDF, Silvermont's lead architect claimed that at 2.4GHz (max turbo), Silvermont consumed "less than 1 watt":
From what I saw measured at IDF, "less than 1 watt" meant about 0.850W. Given the kind of performance that a single Silvermont core can pull off, this is extremely impressive. It suggests that as far as "performance per watt" goes, Silvermont may be ahead of Cyclone, but as far as absolute performance goes, Cyclone seems to be able to push much higher.
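As a back-of-the-envelope illustration of that performance-per-watt point, the sketch below divides the single-thread floating point scores from the table by the per-core power figures discussed here. Treat it as a rough sketch only: the Cyclone range of 2 to 2.5W is my guess rather than a measurement, the Silvermont figure was observed under a different workload than Geekbench, and the function name `perf_per_watt` is purely illustrative.

```python
def perf_per_watt(score, watts):
    """Benchmark points delivered per watt of core power."""
    return score / watts


# Single-thread floating point scores from the table (ICC build for the Z3770).
SILVERMONT_FP = 1054
CYCLONE_FP = 1433

# Per-core power figures discussed in the text.
SILVERMONT_WATTS = 0.85            # observed at IDF, different workload
CYCLONE_WATTS_RANGE = (2.0, 2.5)   # a guess, not a measurement

if __name__ == "__main__":
    silvermont = perf_per_watt(SILVERMONT_FP, SILVERMONT_WATTS)
    print(f"Silvermont: about {silvermont:.0f} points per watt")
    for watts in CYCLONE_WATTS_RANGE:
        cyclone = perf_per_watt(CYCLONE_FP, watts)
        print(f"Cyclone at {watts}W: about {cyclone:.0f} points per watt")
```

Under these assumptions Silvermont comes out ahead per watt even at the low end of the Cyclone power guess, while Cyclone keeps the higher absolute score.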
The fundamentals of my conclusion do change, but in a subtle way. I do not believe that Intel is "behind" any of the ARM SoC players as far as performance per watt goes, and on a per-core basis I believe only Apple is pushing higher. As far as multi-core performance goes within a ~2.5W envelope, it seems to be no contest in favor of Silvermont. This is no doubt helped by the 22 nanometer FinFET process, but there's more to it. The ARM A15 and Qualcomm Krait 400 seem to be really pushing the limits on power/frequency in order to win benchmarks - even if that performance level cannot be sustained.
I now strongly suspect that Intel's Bay Trail (and the upcoming Merrifield for phones) can stay near the top of its frequency/performance range in battery-constrained environments much better than the ARM/Qualcomm/Apple designs can. While I originally questioned some of the design decisions for Silvermont (in-order FPU cluster, 2-issue, and no SMT), it is clear to me that Intel was designing a core that could peg max turbo in an iPhone-like design (perhaps with lesser/cheaper cooling). From what I hear about Merrifield, it is now increasingly apparent that the company was trying to woo Apple and/or enable Apple competitors to have a chip with similar design points as Apple's A-series chips (except at lower power to accommodate worse cooling solutions).
Intel has been vindicated in its design decisions on Silvermont, and the value of the 22nm FinFET process is also apparent. It's clear, though, that the 22 nanometer generation in mobile will only act as a stopgap until the 14 nanometer generation, where the big guns with respect to integration, graphics, and even CPU can really be deployed. Intel's mobile strategy is playing out as planned, and it's going well. With great technology and amazing marketing muscle, there is no reason that the company will not eventually succeed in this space. Just how successful remains to be seen, but I'm expecting good things.