Floating point peak performance of Kaveri and other recent AMD and Intel chips

by Rahul Garg on January 22, 2014 8:30 AM EST

Posted in: CPUs, AMD, Intel, APUs, GPUs

With the launch of Kaveri, some people have been wondering if the platform is suitable for HPC applications. Floating point peak performance of the CPU and GPU on both fp32 and fp64 datatypes is one of the considerations. At launch time, we were not clear on the fp64 performance of Kaveri's GPU but now we have official confirmation from AMD that it is 1/16th the rate of fp32 (similar to most GCN based GPUs except the flagships) and we have verified this on our 7850K by running FlopsCL.

I am taking this opportunity to summarize the info about Kaveri, Trinity, Llano and Intel's competing platforms Haswell and Ivy Bridge on both the CPU and GPU side. We provide a per-cycle estimate for each chip as well as a calculated peak in gflops. The estimates are chip-wide, i.e. they already take into account the number of cores or modules. Because of turbo boost, it was difficult to decide what frequency to use for the peak calculations. For CPUs we use the base frequency, because in multithreaded and/or heterogeneous scenarios the CPU is less likely to turbo; for GPUs we use the boost frequency. In any case, we believe our readers are smart enough to calculate peaks at any frequency they want, given that we already supply per-cycle peaks :)
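As a minimal sketch of that calculation: peak gflops is simply the chip-wide flops per cycle multiplied by the clock in GHz. The function name `peak_gflops` is made up for illustration, and the 3.9 GHz turbo clock is an assumed value chosen only to show how the number changes; the per-cycle figure and the 3.5 GHz base clock are the i7-4770K values from the CPU table.

```python
def peak_gflops(flops_per_cycle, freq_ghz):
    """Peak gflops = chip-wide flops per cycle x clock in GHz."""
    return flops_per_cycle * freq_ghz


# Per-cycle figure from the CPU table: i7-4770K, AVX FMA fp32.
HASWELL_FMA_FP32_PER_CYCLE = 128

# Base clock, as used for the CPU peaks in the table.
base = peak_gflops(HASWELL_FMA_FP32_PER_CYCLE, 3.5)

# Assumed turbo clock, for illustration only (not a figure from the article).
turbo = peak_gflops(HASWELL_FMA_FP32_PER_CYCLE, 3.9)

print(f"base: {base:.1f} gflops, assumed turbo: {turbo:.1f} gflops")
```

The same function applies to the GPU figures, with the GPU frequency expressed in GHz (e.g. 0.72 for 720 MHz).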

The peak CPU performance will depend on the SIMD ISA that your code was written and compiled for. We consider three cases: SSE, AVX (without FMA) and AVX with FMA (either FMA3 or FMA4).

CPU floating-point peak performance
Platform	Kaveri	Trinity	Llano	Haswell	Ivy Bridge
Chip	7850K	5800K	3870K	4770K	3770K
CPU frequency	3.7 GHz	3.8 GHz	3.0 GHz	3.5 GHz	3.5 GHz
SSE fp32 (/cycle)	16	16	32	32	32
SSE fp64 (/cycle)	8	8	16	16	16
AVX fp32 (/cycle)	16	16	-	64	64
AVX fp64 (/cycle)	8	8	-	32	32
AVX FMA fp32 (/cycle)	32	32	-	128	-
AVX FMA fp64 (/cycle)	16	16	-	64	-
SSE fp32 (gflops)	59.2	60.8	96	112	112
SSE fp64 (gflops)	29.6	30.4	48	56	56
AVX fp32 (gflops)	59.2	60.8	-	224	224
AVX fp64 (gflops)	29.6	30.4	-	112	112
AVX FMA fp32 (gflops)	118.4	121.6	-	448	-
AVX FMA fp64 (gflops)	59.2	60.8	-	224	-
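
The per-cycle rows above can be reproduced from a simple model: units x FP pipes per unit x SIMD lanes x flops per lane (2 for a fused multiply-add, 1 otherwise). The sketch below is one way to write that down. The pipe counts and widths are my assumptions about each microarchitecture, chosen because they reproduce the table (two 128-bit FMA pipes per Bulldozer-family module, two 256-bit FP pipes per Intel core, two 128-bit pipes per Llano core); `FpConfig` and `flops_per_cycle` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FpConfig:
    units: int          # cores (Intel, Llano) or modules (Bulldozer family)
    pipes: int          # FP pipes per unit that can issue every cycle (assumed)
    width_bits: int     # SIMD width of each pipe for the chosen ISA
    fma: bool = False   # True if each pipe issues a fused multiply-add


def flops_per_cycle(cfg, elem_bits):
    """Chip-wide peak flops per cycle for elements of elem_bits (32 or 64)."""
    lanes = cfg.width_bits // elem_bits
    flops_per_lane = 2 if cfg.fma else 1
    return cfg.units * cfg.pipes * lanes * flops_per_lane


# Assumed configurations; Trinity uses the same numbers as Kaveri.
CONFIGS = {
    "Kaveri SSE/AVX": FpConfig(units=2, pipes=2, width_bits=128),
    "Kaveri AVX FMA": FpConfig(units=2, pipes=2, width_bits=128, fma=True),
    "Llano SSE": FpConfig(units=4, pipes=2, width_bits=128),
    "Ivy Bridge AVX": FpConfig(units=4, pipes=2, width_bits=256),
    "Haswell AVX FMA": FpConfig(units=4, pipes=2, width_bits=256, fma=True),
}

for name, cfg in CONFIGS.items():
    fp32 = flops_per_cycle(cfg, 32)
    fp64 = flops_per_cycle(cfg, 64)
    print(f"{name}: fp32 {fp32}/cycle, fp64 {fp64}/cycle")
```

This is a peak model only: it assumes every pipe issues a full-width operation on every cycle, which real code rarely sustains.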

It is no secret that AMD's Bulldozer family cores (Steamroller in Kaveri and Piledriver in Trinity) are no match for recent Intel cores in FP performance due to the shared FP unit in each module. As a comparison point, one core in Haswell has the same floating point performance per cycle as two modules (or four cores) in Steamroller.

Now onto GPU peaks. Here, for Haswell, we chose to include both GT2 and GT3e variants.

GPU floating-point peak performance
Platform	Kaveri	Trinity	Llano	Haswell GT3e	Haswell GT2	Ivy Bridge
Chip	7850K	5800K	3870K	4770R	4770K	3770K
GPU frequency	720 MHz	800 MHz	600 MHz	1.3 GHz	1.25 GHz	1.15 GHz
fp32/cycle	1024	768	800	640	320	256
fp64/cycle (OpenCL)	64	48**	0	0	0	0
fp64/cycle (Direct3D)	64	0?	0	160	80	64
fp32 gflops	737.3	614.4	480	832	400	294.4
fp64 gflops (OpenCL)	46.1	38.4**	0	0	0	0
fp64 gflops (Direct3D)	46.1	0?	0	208	100	73.6

The fp64 support situation is a bit of a mess because some GPUs only support fp64 under some APIs. The fp64 rate of Intel's GPUs does not appear to be published, but David Kanter provides an estimate of 1/4 the fp32 rate. However, Intel enables fp64 only under DirectCompute and not under OpenCL on any of its GPUs.
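
The fp64 rows in the GPU table follow from the fp32 rate and the fp64:fp32 ratio. A rough sketch, using the 1/16 ratio AMD confirmed for Kaveri and the estimated 1/4 ratio for Intel's GPUs; the dictionary and function names are hypothetical, and the Intel numbers inherit the uncertainty of that estimate.

```python
from fractions import Fraction

# fp32 flops/cycle and GPU clocks (GHz) taken from the GPU table.
FP32_PER_CYCLE = {
    "Kaveri": 1024,
    "Haswell GT3e": 640,
    "Haswell GT2": 320,
    "Ivy Bridge": 256,
}
FREQ_GHZ = {
    "Kaveri": 0.72,
    "Haswell GT3e": 1.3,
    "Haswell GT2": 1.25,
    "Ivy Bridge": 1.15,
}

# fp64:fp32 ratios: 1/16 confirmed by AMD for Kaveri, 1/4 is an estimate for Intel.
FP64_RATIO = {
    "Kaveri": Fraction(1, 16),
    "Haswell GT3e": Fraction(1, 4),
    "Haswell GT2": Fraction(1, 4),
    "Ivy Bridge": Fraction(1, 4),
}


def fp64_peak(chip):
    """Return (fp64 flops per cycle, fp64 peak gflops) for a GPU in the table."""
    per_cycle = int(FP32_PER_CYCLE[chip] * FP64_RATIO[chip])
    return per_cycle, per_cycle * FREQ_GHZ[chip]


for chip in FP32_PER_CYCLE:
    per_cycle, gflops = fp64_peak(chip)
    print(f"{chip}: {per_cycle} fp64/cycle, {gflops:.1f} gflops")
```

These are hardware rates; whether they are reachable depends on the API, since the Intel figures apply only where fp64 is actually exposed.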

The situation on AMD's Trinity/Richland is even more complicated. fp64 support under OpenCL is not standards-compliant and depends upon using a proprietary extension (cl_amd_fp64). Trinity/Richland do not appear to support fp64 under DirectCompute (or Microsoft's C++ AMP implementation) from what I can tell. From an API standpoint, Kaveri's GCN GPU should work fine for fp64 under all APIs.

Some of you might be wondering whether Kaveri is good for HPC applications. Applications that are already ported to discrete GPUs and work well there will continue to be best run on discrete GPUs. However, Kaveri and HSA will enable many more applications to be GPU accelerated.

Now we compare Kaveri against Haswell. In applications depending upon fp64 performance, conditions are generally not favorable to Kaveri. Kaveri's fp64 peak including both the CPU and GPU is only about 105 gflops (59.2 on the CPU with FMA plus 46.1 on the GPU), less than half of the 224 gflops of Haswell's CPU cores alone. You will generally be better off first optimizing your code for AVX and FMA instructions and running on Haswell's CPU cores. If you are using Windows 8, you might also want to explore using Iris Pro through C++ AMP in conjunction with the CPU. Overall I doubt we will see Kaveri being used for fp64 workloads.
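
The combined fp64 figure can be checked directly from the two tables. This is only a back-of-the-envelope sum of peaks, assuming the CPU and GPU can both be kept fully busy at once, which is an optimistic assumption; the variable names are mine.

```python
# fp64 peaks in gflops, taken from the CPU and GPU tables.
KAVERI_CPU_FP64 = 59.2    # 7850K, AVX FMA fp64
KAVERI_GPU_FP64 = 46.1    # 7850K GPU, fp64 at 1/16 of the fp32 rate
HASWELL_CPU_FP64 = 224.0  # 4770K, AVX FMA fp64, CPU cores only

# Optimistic combined peak: assumes CPU and GPU are both saturated.
kaveri_total = KAVERI_CPU_FP64 + KAVERI_GPU_FP64
ratio = HASWELL_CPU_FP64 / kaveri_total

print(f"Kaveri CPU + GPU fp64 peak: {kaveri_total:.1f} gflops")
print(f"Haswell CPU alone is about {ratio:.1f}x that")
```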

For heterogeneous fp32 applications, Kaveri should outperform Haswell GT2 and Ivy Bridge. Haswell GT3e will again be a strong contender on Windows given the extremely capable Haswell CPU cores and Iris Pro graphics. Intel's GPUs do not currently support OpenCL under Linux, but a driver is being worked on. Thus, on Linux, Kaveri will simply win out on fp32 heterogeneous applications. However, even on Windows, Haswell GT3e will get strong competition from Kaveri. While AMD has advantages such as the excellent GCN architecture and the HSA software stack (when ready), which will enable many more applications to take advantage of the GPU, Iris Pro has the eDRAM to potentially provide much improved bandwidth, as well as the backing of strong CPU cores.

I hope I have provided a fair overview of the FP capabilities of each platform. Application performance will of course depend on many more factors. Your questions and comments are welcome.

101 Comments

kantian - Monday, January 27, 2014 - link
Or if you prefer, you can calculate the first 2 numbers like that:
- i7-3770K, AVX fp32 (/cycle) -> 8*128/32 = 32 (8 FMA, 128-bit, fp32)
- i7-3770K, AVX fp64 (/cycle) -> 8*128/64 = 16 (8 FMA, 128-bit, fp64)
And the corresponding A10-7850K and i7-4770K numbers are correctly calculated like that:
- A10-7850K, AVX fp32 (/cycle) -> 4*128/32 = 16 (4 FMA, 128-bit, fp32)
- A10-7850K, AVX fp64 (/cycle) -> 4*128/64 = 8 (4 FMA, 128-bit, fp64)
- i7-4770K, AVX fp32 (/cycle) -> 8*256/32 = 64 (8 FMA, 256-bit, fp32)
- i7-4770K, AVX fp64 (/cycle) -> 8*256/64 = 32 (8 FMA, 256-bit, fp64)
rahulgarg - Monday, January 27, 2014 - link
Each Ivy Bridge core has two 256-bit ALU units and no FMA support. Ivy Bridge doesn't support FMA.
kantian - Monday, January 27, 2014 - link
Ok, you are right, I just didn't wish to go into such details. It doesn't change my calculations, because Intel Sandy Bridge/Ivy Bridge ALUs are used like that:
- 8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
- 16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication
If you multiply those numbers by the number of cores (i.e. 4) you get 4 x 4 = 16 (fp64) and 4 x 8 = 32 (fp32) Those are exactly the numbers in my previous comment.
kantian - Monday, January 27, 2014 - link
Or in other words, the Sandy Bridge/Ivy Bridge ALU can execute either one 256-bit addition or one 256-bit multiplication per cycle per core, while the two 128-bit Steamroller FMA units can group together to execute the same one 256-bit addition or one 256-bit multiplication per cycle per module. Hence, in most cases, one Steamroller module should have the same throughput as one Ivy Bridge core. As non-FMA AVX multiply and add operations are rarely mixed together, one could not expect many cases where both operations are performed on both 256-bit Ivy Bridge ALUs in the same cycle. In some ideal scenario, one of the Ivy Bridge hyperthreads would provide a 256-bit addition and the other a 256-bit multiplication. I can agree that in those cases the CPU will reach your peak performance numbers.
kantian - Monday, January 27, 2014 - link
"Or in other words, Sandy Bridge/Ivy Bridge FPU ..." above
FellTheSky - Thursday, February 6, 2014 - link
Would GDDR5 and a better memory bus help Kaveri in HSA-enabled applications?

There are some benchmarks around of opencalc and another app that supports HSA, but they are very simple tests, and I would like to know if memory speed has a direct impact on HSA applications.
crunchmore - Friday, May 23, 2014 - link
I'm not sure I understand this correctly, but maybe the FPU in Bulldozer doesn't work as a single core:
"What he could tell me was that the 128-bit FP units are symmetrical, and that, on any cycle, either integer core can dispatch a 256-bit AVX instruction (assuming software compiled to support AVX). Or, both integer cores can dispatch a single 128-bit instruction at the same time."

From: http://www.tomshardware.com/reviews/bulldozer-bobc...

Are there any tests that could be run to make the situation clear? Thanks.
