An Update on Apple’s A7: It's Better Than I Thought

When I reviewed the iPhone 5s I didn’t have much time to do the sort of in-depth investigation into Cyclone (Apple’s 64-bit custom ARMv8 core) that I did with Swift (Apple’s custom ARMv7 core from A6) the year before. I had heard rumors that Cyclone was substantially wider than its predecessor, but I had no proof other than hearsay, so I left it out of the article. Instead I surmised in the 5s review that the A7 was likely an evolved Swift core rather than a brand new design. After all, what sense would it make to design a new CPU core and then do it all over again for the next one? It turns out I was quite wrong.

Armed with a bit of custom code and a bunch of low level tests I think I have a far better idea of what Apple’s A7 and Cyclone cores look like now than I did a month ago. I’m still toying with the idea of doing a much deeper investigation into A7, but I wanted to share some of my findings here.

The first task is to understand the width of the machine. With Swift I got lucky in that Apple had left a bunch of public LLVM documentation uncensored, referring to Swift’s 3-wide design. It turns out that although the design might be capable of decoding, issuing and retiring up to three instructions per clock, in most cases it behaved like a 2-wide machine. Mix FP and integer code and you’re looking at a machine that’s more like 1.5 instructions wide. Obviously Swift did very well in the market and its competitors at the time, including Qualcomm’s Krait 300, were similarly capable.

With Cyclone Apple is in a completely different league. As far as I can tell, peak issue width of Cyclone is 6 instructions. That’s at least 2x the width of Swift and Krait, and at best more than 3x the width depending on instruction mix. Limitations on co-issuing FP and integer math have also been lifted as you can run up to four integer adds and two FP adds in parallel. You can also perform up to two loads or stores per clock.

I don’t yet have a good understanding of the number of execution ports and how they’re mapped, but Cyclone appears to be the widest ARM architecture we’ve ever seen at this point. I’m talking wider than Qualcomm’s Krait 400 and even ARM’s Cortex A15.
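Issue width of this kind is usually probed by timing the same number of adds arranged first as one dependent chain and then as several independent chains. What follows is a minimal sketch of that idea in C, not the code used for the tests in this article. The names `chain1`, `chain4`, `time_it` and the `KEEP` macro are hypothetical, and the empty inline-asm barrier is a GCC/Clang extension assumed here to stop the compiler from folding the adds away.

```c
#include <stdint.h>
#include <time.h>

/* Opaque register barrier (GCC/Clang extension, an assumption of this sketch):
   forces the value to exist in a register after every add. */
#define KEEP(x) __asm__ volatile("" : "+r"(x))

/* One dependent chain: every add must wait for the previous result. */
uint64_t chain1(uint64_t iters) {
    uint64_t a = 0;
    for (uint64_t i = 0; i < iters; i++) {
        a += 1; KEEP(a);
        a += 1; KEEP(a);
        a += 1; KEEP(a);
        a += 1; KEEP(a);
    }
    return a;
}

/* Four independent chains: the same number of adds per iteration,
   but none depends on another, so a wide core can issue them together. */
uint64_t chain4(uint64_t iters) {
    uint64_t a = 0, b = 0, c = 0, d = 0;
    for (uint64_t i = 0; i < iters; i++) {
        a += 1; KEEP(a);
        b += 1; KEEP(b);
        c += 1; KEEP(c);
        d += 1; KEEP(d);
    }
    return a + b + c + d;
}

/* CPU time in seconds spent running fn(iters). */
double time_it(uint64_t (*fn)(uint64_t), uint64_t iters) {
    clock_t t0 = clock();
    volatile uint64_t sink = fn(iters);  /* keep the call from being dropped */
    clock_t t1 = clock();
    (void)sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Built with optimization enabled, comparing `time_it(chain1, n)` against `time_it(chain4, n)` for a large `n` gives a rough idea of how many independent integer adds a core sustains per clock. Extending the pattern to more chains, or mixing in FP adds, probes wider configurations, though loop overhead and compiler choices make this an estimate rather than a precise measurement.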

I did have some low level analysis in the 5s review, where I pointed out the significantly reduced memory latency and increased bandwidth to the A7. It turns out that I was missing a big part of the story back then as well…

A Large System Wide Cache

In our iPhone 5s review I pointed out that the A7 now featured more computational GPU power than the 4th generation iPad. For a device running at roughly 1/4 the resolution of the iPad (1136 x 640 vs. 2048 x 1536), the A7’s GPU either meant that Apple had an application that needed tons of GPU performance or it planned on using the A7 in other, higher resolution devices. I speculated it would be the latter, and it turns out that’s indeed the case. For the first time since the iPad 2, Apple once again shares common silicon between the iPhone 5s, iPad Air and iPad mini with Retina Display.

As Brian found out in his investigation after the iPad event last week, all three devices use the exact same silicon with the exact same internal model number: S5L8960X. There are no extra cores, no change in GPU configuration and, the biggest one, no increase in memory bandwidth.

Previously both the A5X and A6X featured a 128-bit wide memory interface, with half of it seemingly reserved for GPU use exclusively. The non-X parts by comparison only had a 64-bit wide memory interface. The assumption was that a move to such a high resolution display demanded a substantial increase in memory bandwidth. With the A7, Apple takes a step back in memory interface width - so is it enough to hamper the performance of the iPad Air with its 2048 x 1536 display?

The numbers alone tell us the answer is no. In all available graphics benchmarks the iPad Air delivers better performance at its native resolution than the outgoing 4th generation iPad (as you'll soon see). Now many of these benchmarks are bound more by GPU compute than by memory bandwidth, a side effect of the relative lack of memory bandwidth on modern day mobile platforms. Across the board though I couldn’t find a situation where anything was smoother on the iPad 4 than the iPad Air.
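For reference on the interface widths discussed above, theoretical peak DRAM bandwidth is simply the bus width in bytes multiplied by the transfer rate. A minimal sketch in C follows. The function name `peak_bandwidth_gbs` is hypothetical, and the transfer rates in the note after it (1600 MT/s for a 64-bit part, 1066 MT/s for a 128-bit part) are assumptions chosen for illustration, not figures stated in this article.

```c
/* Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).
   bus_width_bits:        width of the memory interface in bits
   mega_transfers_per_sec: data rate in MT/s (assumed by the caller) */
double peak_bandwidth_gbs(unsigned bus_width_bits, double mega_transfers_per_sec) {
    double bytes_per_transfer = bus_width_bits / 8.0;
    return bytes_per_transfer * mega_transfers_per_sec * 1e6 / 1e9;
}
```

Under those assumed rates, `peak_bandwidth_gbs(64, 1600.0)` works out to 12.8GB/s and `peak_bandwidth_gbs(128, 1066.0)` to roughly 17.1GB/s, which shows how a faster memory type can recover part of what a narrower interface gives up.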

There’s another part of this story. Something I missed in my original A7 analysis. When Chipworks posted a shot of the A7 die many of you correctly identified what appeared to be a 4MB SRAM on the die itself. It's highlighted on the right in the floorplan diagram below:


A7 Floorplan, Courtesy Chipworks

While I originally assumed that this SRAM might be reserved for use by the ISP, it turns out that it can do a lot more than that. If we look at memory latency (from the perspective of a single CPU core) vs. transfer size on A7 we notice a very interesting phenomenon between 1MB and 4MB:

That SRAM is indeed some sort of a cache before you get to main memory. It’s not the fastest thing in the world, but it’s appreciably quicker than going all the way out to main memory. Available bandwidth is also pretty good:

We’re only looking at bandwidth seen by a single CPU core, but even then we’re talking about 8GB/s. Lookups in this third level cache don’t happen in parallel with main memory requests, so the impact on worst case memory latency is additive unfortunately (a tradeoff of speed vs. power).
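A latency vs. transfer size curve like the one described above is typically produced with a pointer-chasing loop: every load depends on the previous one, and the access order is randomized so prefetchers can't help. The sketch below shows one common way to do it in C. It is not the tool used for this article, and `build_cycle`, `chase` and `ns_per_load` are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Fill next[] with one random cycle that visits all n slots (Sattolo's
   algorithm). rand() has limited range on some platforms, which reduces
   randomness for very large n but still yields a single cycle. */
void build_cycle(size_t *next, size_t n) {
    for (size_t i = 0; i < n; i++) next[i] = i;
    if (n < 2) return;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;      /* j in [0, i) */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` dependent loads, starting at slot 0. */
size_t chase(const size_t *next, size_t steps) {
    size_t p = 0;
    for (size_t s = 0; s < steps; s++) p = next[p];
    return p;
}

/* Average nanoseconds per dependent load over a working set of `bytes`.
   Returns -1.0 if the arguments are unusable or allocation fails. */
double ns_per_load(size_t bytes, size_t steps) {
    size_t n = bytes / sizeof(size_t);
    if (n == 0 || steps == 0) return -1.0;
    size_t *next = malloc(n * sizeof *next);
    if (next == NULL) return -1.0;
    build_cycle(next, n);
    clock_t t0 = clock();
    volatile size_t sink = chase(next, steps);  /* keep the loop alive */
    clock_t t1 = clock();
    (void)sink;
    free(next);
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)steps;
}
```

Calling `ns_per_load` with working sets from a few KB up to tens of MB, and plotting the result against size, should show a step at each cache boundary. TLB misses also contribute at larger sizes, so the numbers are an upper bound on pure cache or DRAM latency.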

I don’t yet have the tools needed to measure the impact of this on-die memory on GPU accesses, but in the worst case scenario it’ll help free up more of the memory interface for use by the GPU. It’s more likely that some graphics requests are cached here as well, with intelligent allocation of bandwidth depending on what type of application you’re running.

That’s the other aspect of what makes A7 so very interesting. This is the first Apple SoC that’s able to deliver good amounts of memory bandwidth to all consumers. A single CPU core can use up 8GB/s of bandwidth. I’m still vetting other SoCs, but so far I haven’t come across anyone in the ARM camp that can compete with what Apple has built here. Only Intel is competitive.
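Single-core bandwidth figures like the one quoted above can be approximated by streaming through a buffer much larger than the caches and dividing bytes read by elapsed time. Here is a minimal sketch in C under stated assumptions: `sum_words` and `read_gb_per_sec` are hypothetical names, the memory-clobber barrier is a GCC/Clang extension, and this is not the benchmark used for this article.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Read every 64-bit word once and return the sum. */
uint64_t sum_words(const uint64_t *buf, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) s += buf[i];
    return s;
}

/* Approximate read bandwidth in GB/s (1 GB = 1e9 bytes) for one thread.
   Returns -1.0 on bad arguments, allocation failure, or if the run was
   too short for clock() to resolve. */
double read_gb_per_sec(size_t bytes, int passes) {
    size_t n = bytes / sizeof(uint64_t);
    if (n == 0 || passes <= 0) return -1.0;
    uint64_t *buf = malloc(n * sizeof *buf);
    if (buf == NULL) return -1.0;
    for (size_t i = 0; i < n; i++) buf[i] = i;   /* touch every page first */

    volatile uint64_t sink = 0;
    clock_t t0 = clock();
    for (int p = 0; p < passes; p++) {
        sink += sum_words(buf, n);
        /* Compiler barrier (GCC/Clang extension): forces memory to be
           re-read on the next pass instead of reusing the first sum. */
        __asm__ volatile("" ::: "memory");
    }
    clock_t t1 = clock();
    free(buf);

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0) return -1.0;
    return (double)(n * sizeof(uint64_t)) * passes / secs / 1e9;
}
```

Something like `read_gb_per_sec(64u << 20, 8)` streams a 64MB buffer eight times. Sweeping the buffer size the same way as in a latency test shows how bandwidth falls off as the working set spills out of each cache level.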

 


444 Comments


  • rituraj - Wednesday, October 30, 2013

    What?
  • dugbug - Wednesday, October 30, 2013

    astroturf
  • Kevin G - Wednesday, October 30, 2013

    Whoa, 6 issue architecture in a phone/tablet? Apple wasn't kidding when they said 'desktop class' performance. I'm wondering what low level power management voodoo they have going to pull that off.

    The flip side is that if Apple wanted to build a real desktop/server class chip, they look like they could pull it off and be competitive with Intel. Disable Turbo and throttle down a Haswell to 1.4 GHz and do a performance and performance per watt comparison. I suspect that Intel still leads but Apple's A7 design will be seriously competitive.

    I am in agreement that Apple should have moved to 2 GB of memory here. One could argue the merits of keeping with 1 GB on the phone but in the age of retina displays on tablets, it'll seem constrained over the long term. This would have been an ideal way to distinguish the iPad's hardware from the iPhone in terms of hardware features/performance. Ditto for not going with a 128 bit wide memory interface. Hell, it would have made sense for Apple to build the die with a 128 bit wide bus but only use the full width in the iPad.
  • ananduser - Wednesday, October 30, 2013

    Intel's Atom runs a full size desktop OS. That's more of a load on it than simple mobile software like iOS. The best ARM can muster is not even close to Intel.
  • Arbee - Wednesday, October 30, 2013

    iOS is OS X (true BSD UNIX) with a different top level GUI. Similarly, Android is creeping towards feature parity with desktop Linux, although they have farther to go on audio and MIDI, and Windows Phone runs the real NT kernel. They're all a lot less different from a "full size desktop OS" than you seem to think.
  • Kevin G - Wednesday, October 30, 2013

    There is an Android port for x86 which would put Atom SoCs like Bay Trail on equal footing.

    The thing is that the Cyclone core is wider than even Haswell: 6 vs. 4. (For reference Silvermont is 2 issue.) Haswell likely has a higher throughput of instructions considering its x86 ISA (more load/stores for example) and different balance of execution units.
  • errorr - Wednesday, October 30, 2013

    It is reportedly a very buggy port and Dalvik is broken, which means it is useless. Plus it is not 64-bit enabled yet, which hurts Bay Trail.
  • Wilco1 - Thursday, October 31, 2013

    Cyclone may well be 6-issue, but that's not unusual: Cortex-A15 is 8-issue. This is a design decision based on whether to use a single big issue queue or multiple separate issue queues (there are advantages/disadvantages either way). However it seems likely it is 4-way decode, just like Haswell. And the decode rate determines the sustained performance.
  • KPOM - Wednesday, October 30, 2013

    Perhaps battery life came into play. Remember, Apple doesn't add specs for the sake of winning spec wars. They may also be trying to discourage developers from simply writing RAM-hungry apps that will leave the iPhone 5c and iPad 2 behind. 64-bit is supposed to be a smooth transition.

    Plus, I'm sure they're looking to keep some reasons to upgrade to an "iPad Air 2" or "iPad Pro" next year. 2GB would be nice, but I don't think 1GB will be a problem for most users.
  • YuLeven - Wednesday, October 30, 2013

    As we're kindly reminded in every other tablet's review about things like 'this tablet is good, but it doesn't have the number of tablet-optimized apps that the iPad has', I would kindly like to remind you of a couple of small... well, shortcomings of the iPad. Oops, I said that.

    The iPad is good, but it can't open two apps at once.
    The iPad screen is great and sharp, but its 4:3 aspect ratio is far worse than 16:9 for video watching, especially TV (Netflix, Hulu) shows.
    The iPad is bright, but it's far more reflective than Surface's.
    The iPad is good, but you don't have USB mass storage mode.
    The iPad is good, but you can't expand your memory.
    The iPad is good, but you can't use your external HD, pendrive, printer, mouse and other hardware stuff via USB port.
    The iPad is good, but office experience on it falls short of the one on Windows RT.
    The iPad is good, but you can't have a browser running in the background, for example for listening to some YouTube music video while you have two other apps running on the front end.
    The iPad gestures are good, but multitasking by a simple swipe from the left feels better than having to use four fingers at once.
    The iPad is light and it's ok to use the cover as a stand, but it feels less comfortable than having a real, sturdy kickstand.
    The iPad thousands of apps are great, but some of them are worse than using the actual website: Facebook and Pandora, for example.
    The iPad is good, but its experience using a remote desktop is worse than on other tablets.

    Well, the list goes on.

    Not that the iPad is a bad tablet, quite the opposite actually. But as reviews usually like to remind us of things that Android/Windows RT tablets can't do (and the iPad always can, as a matter of fact), I wanted to recall some things that Android/Windows RT do superbly (and the iPad doesn't, as a matter of fact).
