HiSilicon Kirin 960: A Closer Look at Performance and Power

Author: Matt Humrick

by Matt Humrick on March 14, 2017 7:00 AM EST

CPU Performance

We’ll begin our Kirin 960 performance evaluation by investigating the A73’s integer and floating-point IPC with some synthetic tests. Then we’ll see how the changes to its memory system affect memory latency and bandwidth. Finally, after completing the lower-level tests, we’ll see how Huawei’s Mate 9 and its Kirin 960 SoC perform when running some real-world workloads.

Our first look at the A73’s integer performance comes from SPECint2000, the integer component of the SPEC CPU2000 benchmark developed by the Standard Performance Evaluation Corporation. This collection of single-threaded tests allows us to compare IPC for competing CPU microarchitectures. The scores below are not officially validated numbers, which requires the test to be supervised by SPEC, but we’ve done our best to choose appropriate compiler flags and to get the tests to pass internal validation.

SPECint2000 - Estimated Scores ARMv8 / AArch64
	Kirin 960	Kirin 950 (% Advantage)	Exynos 7420 (% Advantage)	Snapdragon 821 (% Advantage)
164.gzip	1217	1094 (11.3%)	940 (29.5%)	1273 (-4.4%)
175.vpr	4118	3889 (5.9%)	2857 (44.1%)	1687 (144.1%)
176.gcc	2157	1864 (15.7%)	1294 (66.7%)	1746 (23.5%)
181.mcf	1118	664 (68.3%)	928 (20.5%)	1200 (-6.8%)
186.crafty	2222	2083 (6.7%)	1176 (88.9%)	1613 (37.8%)
197.parser	1395	1208 (15.5%)	933 (49.5%)	1059 (31.8%)
252.eon	3421	3333 (2.6%)	2453 (39.5%)	3714 (-7.9%)
253.perlbmk	1748	1651 (5.8%)	1216 (43.8%)	1513 (15.5%)
254.gap	1930	1667 (15.8%)	1264 (52.6%)	1594 (21.1%)
255.vortex	2111	1863 (13.3%)	1473 (43.3%)	1712 (23.3%)
256.bzip2	1402	1220 (15.0%)	1079 (29.9%)	1172 (19.6%)
300.twolf	2479	2521 (-1.7%)	1887 (31.4%)	847 (192.6%)

The Kirin 960’s A73 CPU is about 11% faster on average than the Kirin 950’s A72. In addition to the front-end changes discussed on the previous page and the changes to the memory system discussed in the next section, the A73’s integer pipelines have undergone a few tweaks as well. Where the A72 had 3 integer ALUs—2 simple ALUs for basic operations such as addition and shifting and 1 dedicated multi-cycle ALU for complex operations such as multiplication, division, and multiply-accumulate—the A73 only has 2 integer ALUs that are capable of performing both basic and complex operations. This affects performance in different ways. For example, because only one of the A73’s ALUs can handle multiplication while the other handles division, the time to execute multiply or division operations sees no change; however, while an ALU is occupied with a multi-cycle instruction, it cannot execute simple instructions like the A72’s dedicated pipelines can, leading to a potential performance loss. Multiply-accumulate operations, which require both of the A73’s pipelines, incur a similar penalty. It’s not all bad, however. Workloads that perform parallel arithmetic or use certain other complex instructions can see double the execution throughput on A73 versus A72.
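The trade-off between the two ALU layouts can be illustrated with a toy occupancy model. This is only a sketch under strong assumptions, not a cycle-accurate description of either core: every instruction is independent, scheduling is perfect, a complex operation blocks its pipeline for its full latency, and either general-purpose pipeline can run any complex operation (which, as noted above, is not true of multiplication and division on the A73). The function names and the instruction mixes are hypothetical.

```python
from math import ceil

def a72_like_cycles(n_simple, n_complex, complex_latency):
    """Idealized bound for 2 simple pipes plus 1 dedicated multi-cycle pipe.

    Simple and complex work proceed in parallel, so the slower side dominates.
    """
    simple_cycles = ceil(n_simple / 2)
    complex_cycles = n_complex * complex_latency
    return max(simple_cycles, complex_cycles)

def a73_like_cycles(n_simple, n_complex, complex_latency):
    """Idealized bound for 2 general-purpose pipes sharing all the work.

    ASSUMPTION: both pipes accept every operation type.
    """
    busy_cycles = n_simple + n_complex * complex_latency
    longest_single_op = complex_latency if n_complex else 0
    return max(ceil(busy_cycles / 2), longest_single_op)

# Hypothetical mixes: (simple ops, complex ops, complex latency in cycles)
for label, mix in {"mostly simple": (100, 10, 4), "all complex": (0, 10, 4)}.items():
    print(label, a72_like_cycles(*mix), a73_like_cycles(*mix))
```

In this model the mixed workload favors the layout with the dedicated multi-cycle pipe, while the all-complex workload runs in half the cycles on the two general-purpose pipes, which mirrors the qualitative behavior described above.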

Note that the table above does not account for differences in CPU frequency. The Kirin 960’s frequency advantage over the Kirin 950 and Snapdragon 821 is less than 3%, making these numbers easier to compare, but its advantage over the Exynos 7420 is a little over 12%. The chart below accounts for this by dividing the estimated SPECint2000 ratio score by CPU frequency, making IPC comparisons easier.

SPECint2000 64b/32b Estimated Ratio/MHz

Despite the substantial microarchitectural differences between the A73 and A72, the A73’s integer IPC is only 11% higher than the A72’s. This is likely the result of improvements in one area being partially offset by regressions in another. Still, assuming ARM’s power reduction claims hold true, this is not a bad result.

The gap between the A73 and A57 increases to 29%. The integer performance for Qualcomm’s custom Kryo core is well behind ARM’s A73 and A72 cores, essentially matching the A57’s IPC.
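For readers who want to reproduce a Ratio/MHz style comparison from the table, here is a minimal sketch. It assumes the overall score is a geometric mean of the per-test scores (as SPEC's own composite is), and it assumes nominal peak clocks of 2360 MHz for the Kirin 960 and 2300 MHz for the Kirin 950, which are not stated on this page. The helper names are illustrative only.

```python
from math import exp, log

# Estimated SPECint2000 scores copied from the table above (same test order).
KIRIN_960 = [1217, 4118, 2157, 1118, 2222, 1395, 3421, 1748, 1930, 2111, 1402, 2479]
KIRIN_950 = [1094, 3889, 1864, 664, 2083, 1208, 3333, 1651, 1667, 1863, 1220, 2521]

# ASSUMPTION: nominal peak big-core clocks in MHz, not taken from this page.
FREQ_MHZ = {"Kirin 960": 2360, "Kirin 950": 2300}

def geomean(values):
    """Geometric mean of an iterable of positive numbers."""
    values = list(values)
    return exp(sum(log(v) for v in values) / len(values))

def speedup(a, b):
    """Geometric mean of the per-test score ratios a/b."""
    return geomean(x / y for x, y in zip(a, b))

def per_mhz_advantage(a, b, freq_a, freq_b):
    """Overall speedup of a over b after dividing out the clock ratio."""
    return speedup(a, b) / (freq_a / freq_b)

raw = speedup(KIRIN_960, KIRIN_950)
ipc = per_mhz_advantage(KIRIN_960, KIRIN_950,
                        FREQ_MHZ["Kirin 960"], FREQ_MHZ["Kirin 950"])
print(f"raw advantage: {(raw - 1) * 100:.1f}%")
print(f"per-MHz advantage: {(ipc - 1) * 100:.1f}%")
```

With these assumed clocks the per-MHz figure lands close to the A73-over-A72 IPC advantage quoted above; the published chart may differ slightly because it is presumably based on unrounded scores and measured frequencies.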

Geekbench 4 - Integer Performance Single Threaded
	Kirin 960	Kirin 950 (% Advantage)	Exynos 7420 (% Advantage)	Snapdragon 821 (% Advantage)
AES	911.3 MB/s	935.6 MB/s (-2.59%)	795.8 MB/s (14.52%)	559.1 MB/s (63.00%)
LZMA	3.03 MB/s	2.87 MB/s (5.69%)	2.28 MB/s (33.33%)	2.20 MB/s (38.09%)
JPEG	16.1 Mpixels/s	15.5 Mpixels/s (3.66%)	14.1 Mpixels/s (13.95%)	21.6 Mpixels/s (-25.62%)
Canny	22.5 Mpixels/s	26.8 Mpixels/s (-16.06%)	23.6 Mpixels/s (-4.80%)	30.3 Mpixels/s (-25.77%)
Lua	1.70 MB/s	1.55 MB/s (10.13%)	1.20 MB/s (41.94%)	1.47 MB/s (16.14%)
Dijkstra	1.53 MTE/s	1.14 MTE/s (33.53%)	0.92 MTE/s (65.12%)	1.39 MTE/s (9.57%)
SQLite	51.6 Krows/s	43.5 Krows/s (18.62%)	34.0 Krows/s (51.99%)	36.7 Krows/s (40.73%)
HTML5 Parse	8.30 MB/s	6.79 MB/s (22.19%)	6.37 MB/s (30.25%)	7.61 MB/s (9.02%)
HTML5 DOM	2.17 Melems/s	1.92 Melems/s (12.82%)	1.26 Melems/s (72.91%)	0.37 Melems/s (489.09%)
Histogram Equalization	48.7 Mpixels/s	57.0 Mpixels/s (-14.56%)	50.6 Mpixels/s (-3.66%)	51.2 Mpixels/s (-4.82%)
PDF Rendering	44.8 Mpixels/s	45.5 Mpixels/s (-1.47%)	39.7 Mpixels/s (12.93%)	53.0 Mpixels/s (-15.36%)
LLVM	194.4 functions/s	167.9 functions/s (15.76%)	128.6 functions/s (51.14%)	113.5 functions/s (71.20%)
Camera	5.45 images/s	5.45 images/s (0.00%)	4.95 images/s (10.17%)	7.19 images/s (-24.12%)

The updated Geekbench 4 workloads give us a second look at integer IPC. Similar to the SPECint2000 results, we see Kirin 960 showing 5% to 15% gains over Kirin 950 in several of the tests, but there’s a bit more variation overall. The Kirin 960 is actually slower than Kirin 950 in some tests, and, in the case of Canny and Histogram Equalization, its A73 is even slower than the Exynos 7420’s A57. It also falls behind Qualcomm’s Kryo in the JPEG, PDF Rendering, and Camera tests. The tests where the Kirin 960 does well—HTML5 Parse, HTML5 DOM, and SQLite—are very common workloads, though, which should translate into better real-world performance.

Geekbench 4 (Single Threaded) Integer Score/MHz

The chart above accounts for differences in CPU frequency, making it easier to directly compare IPC. Overall the A73 shows only about a 4% improvement over the A72 and about a 12% improvement over the A57 in this group of workloads, considerably less than what we saw in SPECint2000; however, with margins ranging from 33.5% in Dijkstra to -16.1% in Canny, it’s impossible to make any sweeping statements about the A73’s integer performance being better or worse than the A72’s.
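The spread is easy to see by sorting the Kirin 960's margins over the Kirin 950 from the table. The short sketch below uses only the percentages printed above; the function name `summarize` is illustrative.

```python
# Kirin 960 advantage over Kirin 950 (%), copied from the Geekbench 4 integer table.
ADVANTAGE_VS_KIRIN_950 = {
    "AES": -2.59, "LZMA": 5.69, "JPEG": 3.66, "Canny": -16.06, "Lua": 10.13,
    "Dijkstra": 33.53, "SQLite": 18.62, "HTML5 Parse": 22.19, "HTML5 DOM": 12.82,
    "Histogram Equalization": -14.56, "PDF Rendering": -1.47, "LLVM": 15.76,
    "Camera": 0.00,
}

def summarize(advantages):
    """Return the best test, the worst test, and the sorted list of regressions."""
    best = max(advantages, key=advantages.get)
    worst = min(advantages, key=advantages.get)
    regressions = sorted(name for name, pct in advantages.items() if pct < 0)
    return best, worst, regressions

best, worst, regressions = summarize(ADVANTAGE_VS_KIRIN_950)
print(f"largest gain: {best} ({ADVANTAGE_VS_KIRIN_950[best]:+.2f}%)")
print(f"largest loss: {worst} ({ADVANTAGE_VS_KIRIN_950[worst]:+.2f}%)")
print("slower than Kirin 950 in:", ", ".join(regressions))
```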

Qualcomm’s Kryo CPU falls just behind the A57 once again despite posting better results in many of the Geekbench integer tests. Its poor performance in LLVM and HTML5 DOM weighs heavily on its overall score.

I’ve also included results for ARM’s in-order A53 companion core. The A73’s integer IPC is 1.7x to 2x higher overall, which illustrates why octa-core A53 SoCs are so much slower, particularly in Web browsing, than designs that use 2-4 big cores (A73/A72/A57) instead of 4 additional A53s.

Geekbench 4 - Floating Point Performance Single Threaded
	Kirin 960	Kirin 950 (% Advantage)	Exynos 7420 (% Advantage)	Snapdragon 821 (% Advantage)
SGEMM	10.7 GFLOPS	13.9 GFLOPS (-23.44%)	11.9 GFLOPS (-10.36%)	12.2 GFLOPS (-12.57%)
SFFT	2.89 GFLOPS	2.26 GFLOPS (27.73%)	2.62 GFLOPS (10.39%)	3.21 GFLOPS (-10.07%)
N-Body Physics	838.4 Kpairs/s	896.9 Kpairs/s (-6.52%)	634.5 Kpairs/s (32.14%)	1156.7 Kpairs/s (-27.51%)
Rigid Body Physics	5891.4 FPS	6497.4 FPS (-9.33%)	4662.7 FPS (26.35%)	7171.3 FPS (-17.85%)
Ray Tracing	221.9 Kpixels/s	216.9 Kpixels/s (2.30%)	136.1 Kpixels/s (63.07%)	298.3 Kpixels/s (-25.59%)
HDR	7.46 Mpixels/s	7.57 Mpixels/s (-1.45%)	7.17 Mpixels/s (4.09%)	10.8 Mpixels/s (-30.90%)
Gaussian Blur	23.6 Mpixels/s	28.6 Mpixels/s (-17.37%)	24.4 Mpixels/s (-2.94%)	48.5 Mpixels/s (-51.27%)
Speech Recognition	12.8 Words/s	8.9 Words/s (44.14%)	10.2 Words/s (25.49%)	10.9 Words/s (17.43%)
Face Detection	501.2 Ksubs/s	518.9 Ksubs/s (-3.42%)	435.5 Ksubs/s (15.09%)	685.0 Ksubs/s (-26.83%)

With the exception of SFFT and Speech Recognition, the Kirin 960 is generally a little slower than the Kirin 950 in Geekbench 4’s floating-point workloads. This is a bit of a surprise considering that the A73’s NEON execution units are relatively unchanged from the A72’s design, with reduced latency for specific instructions improving NEON performance by 5%, according to ARM. These results are even harder to interpret after factoring in the A73’s lower-latency front end and improvements to its fetch block and memory subsystems. It’s possible that some of these tests are limited by the A73’s narrower decode stage, but given the variation in workloads, this is probably not true for every case. It will be interesting to see if A73 implementations from other SoC vendors show similar results.

Geekbench 4 (Single Threaded) Floating Point Score/MHz

After accounting for the differences in CPU frequency, floating-point IPC for the Kirin 960’s A73 is 3% to 5% lower overall than the A72 but about 3% higher than the older A57. These results, which are a geometric mean of the floating-point subtest scores, are certainly closer to what I would expect, but hide the large performance variation from one workload to the next.
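To show how a geometric mean can mask that variation, the sketch below rebuilds per-test ratios from the Kirin 950 column of the floating-point table and compares the overall figure with the spread. The clock ratio is an assumption (nominal peak clocks of 2360 MHz and 2300 MHz, not stated on this page), and the result is only an approximation of the charted value.

```python
from math import exp, log

# Kirin 960 advantage over Kirin 950 (%), copied from the floating-point table.
FP_ADVANTAGE_VS_KIRIN_950 = {
    "SGEMM": -23.44, "SFFT": 27.73, "N-Body Physics": -6.52,
    "Rigid Body Physics": -9.33, "Ray Tracing": 2.30, "HDR": -1.45,
    "Gaussian Blur": -17.37, "Speech Recognition": 44.14, "Face Detection": -3.42,
}

# ASSUMPTION: Kirin 960 / Kirin 950 peak clock ratio, not taken from this page.
CLOCK_RATIO = 2360 / 2300

def geomean(values):
    """Geometric mean of an iterable of positive numbers."""
    values = list(values)
    return exp(sum(log(v) for v in values) / len(values))

# Convert each percentage advantage back into a score ratio (960 / 950).
ratios = {name: 1 + pct / 100 for name, pct in FP_ADVANTAGE_VS_KIRIN_950.items()}
overall = geomean(ratios.values())
per_mhz = overall / CLOCK_RATIO

print(f"geomean ratio: {overall:.3f}, per MHz: {per_mhz:.3f}")
print(f"per-test spread: {min(ratios.values()):.3f} to {max(ratios.values()):.3f}")
```

The overall ratio sits within a few percent of parity even though individual tests swing by more than 20% in either direction.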

It’s pretty obvious that floating-point performance was Qualcomm’s focus for its custom Kryo core. While integer IPC was no better than ARM’s A57, Kryo’s floating-point IPC is 23% higher than the A72 in Geekbench 4, with particularly strong results in the Gaussian Blur and HDR tests.

86 Comments

lilmoe - Tuesday, March 14, 2017 - link
I read things thoroughly before criticizing. You should do the same before jumping in to support an idiotic comment like fanofanand's. He's more interested in insulting people than finding the truth.

These tests are the ones which aren't working. No one gets nearly as much battery life as they report. Nor are the performance gains anywhere near what benchmarks like geekbench are reporting. If something isn't working, one should really look for other means. That's how progress works.

You can't test a phone the same way you test a workstation. You just can't. NO ONE leaves their phone lying on a desk for hours waiting on it to finish compiling 500K lines of code, or rendering a one-hour 3D project or a 4K video file for their channel on Youtube. But they do spend a lot of time watching video on Youtube, browsing the web with 30 second pauses between each scroll, and uploading photos/videos to social media after applying filters. Where are these tests??? You know, the ones that actually MATTER for most people? You know, the ones that ST performance matters less for, etc, etc...

Anyway, I did suggest what I believe is a better, more realistic, method for testing. Hint, it's in the fifth paragraph of my original reply. But who cares right? We just want to know "which is the fastest", which method confirms our biases, regardless of the means of how such performance is achieved. Who cares about the truth.

People are stubborn. I get that. I'm stubborn too. But there's a limit to how stubborn people can be, and they need to be called out for it.
Meteor2 - Wednesday, March 15, 2017 - link
I'm with fanof and close on this one. Here we have a consistent battery of repeatable tests. They're not perfectly 'real-world' but they're not far off either; there's only so many things a CPU can do.

I like this test suite (though I'd like to see GB/clock and SPi and GB/power calculated and graphed too). If you can propose a better one, do so.
close - Wednesday, March 15, 2017 - link
This isn't about supporting someone's comment, I was very clear which part I agree with: the one where you help come up with a practical implementation of your suggestion.

Phones can and should be tested like normal desktops, since the vast majority of desktops spend most of their time idling, just like phones. The next thing is running Office-like applications, normal browsing, and media consumption.

You're saying that "NO ONE leaves their phone lying on a desk for hours waiting on it to finish compiling 500K lines of code". But how many people would find even that relevant? How many people compile 500K lines of code regularly? Or render hours of 4K video? And I'm talking about percentage of the total.

Actually the ideal case for testing any device is multiple scenarios that would cover more user types: from light browsing and a handful of phone calls to heavy gaming or media consumption. These all produce vastly different results, as an SoC/phone might be optimized for sporadic light use or for heavier use, for example. So a phone that has the best battery life and efficiency while gaming won't do so while browsing. So just like benchmarks, any result would only be valid for people who follow the test scenario closely in their daily routine.

But the point wasn't whether an actual "real world" type scenario is better, rather how exactly do you apply that real world testing into a sequence of steps that can be reproduced for every phone consistently? How do you make sure that all phones are tested "equally" with that scenario and that none has an unfair (dis)advantage from the testing methodology? Like Snapchat or FB being busier one day and burning through the battery faster.

Just like the other guy was more interested in insults (according to you), you seem more interested in cheap sarcasm than in actually providing an answer. I asked for a clear methodology. You basically said that "it would be great if we had world peace and end hunger". Great for a beauty pageant, not so great when you were asked for a testing methodology. A one liner is not enough for this. A methodology is you describing exactly how you proceed with testing the phones, step by step, while guaranteeing reproducibility and fairness. Also please explain how opening a browser, FB, or Snapchat is relevant for people who play games 2 hours per day, watch movies or actually use the phone as a phone and talk to other people.

You're making this more difficult than it should be. You look like you had plenty of time to think about this. I had half a day and already I came up with a better proposal than yours (multiple scenarios vs. single scenario). And of course, I will also leave out the exact methodology part because this is a comment competition not an actual search for solutions.
lilmoe - Wednesday, March 15, 2017 - link
I like people who actually spend some time to reply. But, again, I'd appreciate it more if you read my comments more carefully. I told you that the answer you seek is in my first reply, in the fifth paragraph. If you believe I have "plenty of time" just for "cheap sarcasm", then sure we can end it here. If you don't, then go on reading.

I actually like this website. That's why I go out of my way to provide constructive criticism. If I were simply here for trolling, my comments wouldn't be nearly as long.

SoCs don't live in a vacuum, they come bundled with other hardware and software (Screen, radios, OS/Kernel), optimized to work on the device being reviewed. In the smartphone world, you can't come to a concrete conclusion on the absolute efficiency of a certain SoC based on one device, because many devices with the same SoC can be configured to run that SoC differently. This isn't like benchmarking a Windows PC, where the kernel and governor are fixed across hardware, and screens are interchangeable.

Authors keep acknowledging this fact, yet do very little to go about testing these devices using other means. It's making it hard for everyone to understand the actual performance of said devices, or the real bang for the buck they provide. I think we can agree on that.

"You're making this more difficult than it should be"
No, really, I'm not. You are. When someone is suggesting something a bit different, but everyone is slamming them for the sake of "convention" and "familiarity", then how are we supposed to make progress?

I'm NOT saying that one should throw benchmarks out. But I do believe that benchmarks should stay in meaningful context. They give you a rough idea about the snappiness of an ultra-mobile device, since it's been proven time after time that the absolute performance of these processors is ONLY needed for VERY short bursts, unlike workstations. However, they DO NOT give you anywhere near a valid representation of average power draw and device battery life, and neither do scripts written to run synthetic/artificial workloads. Period.

This is my point. I believe the best way to measure a specific configuration is by first specifying the performance point a particular OEM is targeting, and then measuring the power draw of that target. This comes in as the average clocks of the CPU/GPU under various workloads, from gaming, browsing, and playing video to social media. It doesn't matter how "busy" these content providers are at specific times, the average clocks will be the same regardless because the workload IS the same.

I have reason to believe that OEMs are optimizing their kernels/governors for each app alone. Just like they did with benchmarks several years ago, where they ramp clocks up when they detect a benchmark running. Except, they're doing it the right way now, and optimizing specific apps to run differently on the device to provide the user with the best experience.

When you've figured out the average the OEM is targeting for various workloads, you'd certainly know how much power it's drawing, and how much battery life to expect AFTER you've already isolated other factors, such as the screen and radios. It also makes for a really nice read, as a bonus (hence, "worth investigating").

This review leaves an important question unanswered about this SoC's design (I'm really interested to know the answer): did HiSilicon cheap out on the fab process to make more money and leech on the success of its predecessor? Or did they do that with good intentions to optimize their SoC further for modern, real world workloads that currently used benchmarks are not detecting? I simply provided a suggestion to answer that question. Does that warrant the language in his, or your reply? Hence my sarcasm.
fanofanand - Tuesday, March 14, 2017 - link
It's exciting to see the envelope being pushed, and though these are some interesting results I like that they are pushing forward and not with a decacore. The G71 looks like a botched implementation if it's guzzling power that heavily, I wonder if some firmware/software could fix that? A73 still looks awesome, and I can't wait to see a better implementation!
psychobriggsy - Tuesday, March 14, 2017 - link
TBH the issue with the GPU appears to be down to the clock speed it is configured with.

It's clear that this is set for benchmarking purposes, and it's good that this has been caught.

Once the GPU settles down into a more optimal 533MHz configuration, power consumption goes down significantly. Sadly it looks like there are four clock settings for the GPU, and they've wasted three of them on stupid high clocks. A better setup looks to be 800MHz, 666MHz, 533MHz and a power saving 400MHz that most Android games would still find overkill.
Meteor2 - Wednesday, March 15, 2017 - link
Performance/Watt is frankly rubbish whatever the clock speed. Clearly they ran out of time or money to implement Bifrost properly.
fanofanand - Wednesday, March 15, 2017 - link
That's what I'm thinking, I read the preview to Bifrost and thought "wow this thing is going to be killer!" I was right on the money, except that it's a killer of batteries, not competing GPUs.
Shadowmaster625 - Tuesday, March 14, 2017 - link
What is HTML5 DOM doing that wrecks the Snapdragon 821 so badly?
joms_us - Tuesday, March 14, 2017 - link
Just some worthless test that the Monkey devs put to show how awesome iPhones are. But if you do a real side-by-side website comparison between an iPhone and a phone with SD821, SD821 will wipe the floor.
