AMD’s Kaveri: Pre-Launch Information
by Ian Cutress on January 6, 2014 8:00 PM EST
On the back of AMD’s Tech Day at CES 2014, all of which was under NDA until the launch of Kaveri, AMD have supplied us with some information that we can talk about today. For those not following the AMD roadmap, Kaveri is the natural progression of the AMD A-Series APU line, from Llano, Trinity to Richland and now Kaveri. At the heart of the AMD APU design is the combination of CPU cores (‘Bulldozer’, ‘Steamroller’) and a large dollop of GPU cores for on-chip graphics prowess.
Kaveri is the next iteration in that line. It uses the FM2+ socket, an update of the FM2 socket used by Richland, and brings an updated architecture for Q1 2014. AMD are attacking with Kaveri on four fronts:
Redesigned Compute Cores* (Compute = CPU + GPU)
Kaveri uses an enhanced version of the Richland CPU core, codename Steamroller. As with every new CPU generation or architecture update, the main goal is better performance and lower power – preferably both. AMD is quoting a 20% better x86 IPC with Kaveri compared to Richland when put clock to clock. For the purposes of this information release, we were provided with several AMD benchmarking results to share:
These results border pretty much on the synthetic – AMD did not give any real world examples today but numbers will come through in time. AMD is set to release two CPUs on January 14th (date provided in our pre-release slide deck), namely the A10-7700K and the A10-7850K. Some of the specifications were also provided:
AMD APUs | Richland A8-6600K | Richland A10-6800K | Kaveri A10-7700K | Kaveri A10-7850K |
Release | June 4 '13 | June 4 '13 | Jan 14th '14 | Jan 14th '14 |
Frequency | 3900 MHz | 4100 MHz | ? | 3700 MHz |
Turbo | 4200 MHz | 4400 MHz | ? | ? |
DRAM | DDR3-1866 | DDR3-2133 | DDR3-2133 | DDR3-2133 |
Microarchitecture | Piledriver | Piledriver | Steamroller | Steamroller |
Manufacturing Process | 32nm | 32nm | ? | ? |
Modules | 2 | 2 | ? | 2 |
Threads | 4 | 4 | ? | 4 |
Socket | FM2 | FM2 | FM2+ | FM2+ |
L1 Cache | 2 x 64 KB I$, 4 x 16 KB D$ | 2 x 64 KB I$, 4 x 16 KB D$ | ? | ? |
L2 Cache | 2 x 2 MB | 2 x 2 MB | ? | ? |
Integrated GPU | HD 8570D | HD 8670D | R7 | R7 |
IGP Cores | 256 | 384 | ? | 512 |
IGP Architecture | Cayman | Cayman | GCN | GCN |
IGP Frequency | 844 MHz | 844 MHz | ? | 720 MHz |
Power | 100W | 100W | ? | 95W |
All the values marked ‘?’ have not been confirmed at this point, although it is interesting to see that the CPU clock speed has decreased from Richland. A lot of the APU die goes to the integrated GPU, which as we can see above becomes fully GCN, rather than the Cayman-derived graphics of the Richland APUs. This comes with a core bump as well, with 512 GPU cores on the high end model. At 64 cores per GCN compute unit, this equates to 8 CUs on die, which together with the four CPU cores makes up what AMD calls ‘12 Compute Cores’ overall. These GCN cores are AMD Mantle ready, suggesting that performance gains could be had directly from Mantle enabled titles.
Described in AMD’s own words: ‘A compute core is an HSA-enabled hardware block that is programmable (CPU, GPU or other processing element), capable of running at least one process in its own context and virtual memory space, independently from other cores. A GPU Core is a GCN-based hardware block containing a dedicated scheduler that feeds four 16-wide SIMD vector processors, a scalar processor, local data registers and data share memory, a branch & message processor, 16 texture fetch or load/store units, four texture filter units, and a texture cache. A GPU Core can independently execute work-groups consisting of 64 work items in parallel.’ This suggests that if we were to run asynchronous kernels on the AMD APU, we could technically run twelve on the high end APU, given that each Compute Core is capable of running at least one process in its own context and virtual memory space independent of the others.
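The ‘12 Compute Cores’ figure can be reproduced from the numbers quoted so far. The sketch below is only back-of-the-envelope arithmetic: the function name is hypothetical, and it assumes that each of the four CPU threads on the A10-7850K counts as one compute core and that one GPU Core is the 4 x 16 lane block from AMD’s description.

```python
# Figures taken from the article; the function itself is a hypothetical helper.
GPU_SHADER_CORES = 512       # "IGP Cores" row for the A10-7850K
SIMD_UNITS_PER_GPU_CORE = 4  # "four 16-wide SIMD vector processors"
LANES_PER_SIMD = 16
CPU_CORES = 4                # assumption: 2 Steamroller modules = 4 compute cores


def compute_core_count(shader_cores, simds_per_gpu_core, lanes_per_simd, cpu_cores):
    """Return (gpu_cores, total_compute_cores) for an APU."""
    lanes_per_gpu_core = simds_per_gpu_core * lanes_per_simd  # 64 lanes per GPU Core
    if shader_cores % lanes_per_gpu_core != 0:
        raise ValueError("shader count is not a whole number of GPU Cores")
    gpu_cores = shader_cores // lanes_per_gpu_core
    return gpu_cores, gpu_cores + cpu_cores


gpu_cores, total = compute_core_count(
    GPU_SHADER_CORES, SIMD_UNITS_PER_GPU_CORE, LANES_PER_SIMD, CPU_CORES
)
print(f"GPU Cores: {gpu_cores}, Compute Cores: {total}")
```

The 64 lanes per GPU Core also line up with the 64 work items per work-group mentioned in AMD’s description.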
The reason AMD calls them Compute Cores lies in the second prong of its four pronged attack: hUMA.
HSA, hUMA, and all that jazz
AMD went for the heterogeneous system architecture early on to exploit the fact that many compute intensive tasks can be offloaded to parts of the chip that are designed to run them faster or at lower power. By combining CPU and GPU on a single die, the system should be able to shift work around to complete the process quicker. When this was first envisaged, AMD had two issues: a lack of software out in the public domain to take advantage (as with any new computing paradigm) and restrictive OS support. Now that Windows 8 is built to support HSA, all that leaves is the programming. However AMD have gone one step further with hUMA, giving the system access to all the memory, all of the time, from any location.
Now that Kaveri offers a proper HSA stack, and can call upon 12 compute cores to do work, applications that are designed (or have code paths) to take advantage of this should emerge. One such example that AMD are willing to share today is stock calculation using LibreOffice's Calc application – calculating the BETA (return) of 21 fake stocks and plotting 100 points on a graph of each stock. With HSA acceleration on, the system performed the task in 0.12 seconds, compared to 0.99 seconds when turned off.
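To put the LibreOffice numbers in context, here is a rough sketch of the kind of per-stock calculation involved and of the quoted speedup. It assumes that ‘BETA’ refers to the standard finance beta coefficient (covariance of stock and market returns divided by market variance); AMD did not share the actual spreadsheet, so the sample returns and function names are made up for illustration, and nothing here uses HSA.

```python
def mean(values):
    return sum(values) / len(values)


def beta(stock_returns, market_returns):
    """Beta = cov(stock, market) / var(market), assumed definition."""
    if len(stock_returns) != len(market_returns) or len(market_returns) < 2:
        raise ValueError("need two equally long series with at least 2 points")
    stock_mean = mean(stock_returns)
    market_mean = mean(market_returns)
    covariance = sum(
        (s - stock_mean) * (m - market_mean)
        for s, m in zip(stock_returns, market_returns)
    )
    variance = sum((m - market_mean) ** 2 for m in market_returns)
    if variance == 0:
        raise ValueError("market returns have zero variance")
    return covariance / variance


def speedup(baseline_seconds, accelerated_seconds):
    return baseline_seconds / accelerated_seconds


# Hypothetical data: a stock that always moves twice as far as the market.
market = [0.010, -0.020, 0.015, 0.030, -0.010]
stock = [2 * m for m in market]

print(f"beta: {beta(stock, market):.2f}")
print(f"HSA speedup from AMD's timings: {speedup(0.99, 0.12):.2f}x")
```

Each stock’s beta is independent of the others, which is the sort of data-parallel work that maps well onto GPU cores; the timings AMD quotes work out to a little over an 8x speedup.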
Prong 3: Gaming Technologies
In a year where new gaming technologies are at the forefront of design, along with gaming power, AMD are tackling the issue on one front with Kaveri. By giving it a GCN graphics backbone, features from the main GPU line can fully integrate (with HSA) into the APU. As we have seen in previous AMD releases and talks, this means several things:
- Mantle
- AMD TrueAudio
- PCIe Gen 3
AMD wants to revolutionize the way that games are played and shown with Mantle. It is a small shame that the Mantle release was delayed and that AMD did not provide any numbers to share with us today, though results should find their way online after release.
Prong 4: Power Optimisations
With Richland we had CPUs in the range of 65W to 100W, and using the architecture in the FX range produced CPUs up to 220W. Technically we had 45W Richland APUs launch, but to date I have not seen one for sale. This time around, however, AMD are focusing on a slightly lower power segment: 45W to 95W. Chances are the top end APUs (A10-7850K) will be 95W, suggesting that we have a combination of a 20% IPC improvement, a 400 MHz clock decrease and a 5% TDP decrease for the high end chip. Bundle in some HSA and let’s get this thing on the road.
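As a rough sanity check on how those three numbers combine, the sketch below multiplies them out. It is a deliberately naive model: it assumes CPU throughput scales linearly with IPC times base clock, ignores turbo (not yet confirmed for Kaveri), and treats TDP as a stand-in for power draw, which it is not. The function name is hypothetical.

```python
def relative_throughput(ipc_gain, old_mhz, new_mhz):
    """Naive estimate: performance ~ IPC x clock (assumption, not a benchmark)."""
    return (1.0 + ipc_gain) * new_mhz / old_mhz


# A10-6800K vs A10-7850K base clocks and TDPs from the table above.
perf = relative_throughput(ipc_gain=0.20, old_mhz=4100, new_mhz=3700)
tdp_ratio = 95 / 100
perf_per_tdp = perf / tdp_ratio

print(f"Estimated x86 throughput vs Richland: {perf:.3f}x")
print(f"Estimated throughput per TDP watt:    {perf_per_tdp:.3f}x")
```

Under these assumptions the clock drop eats more than half of the quoted IPC gain, leaving a single-digit percentage uplift on the CPU side before any HSA or GPU gains are counted.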
Release Date
AMD have given us the release date for the APUs: January 14th will see the launch of the A10-7850K and the A10-7700K. Certain system builders should be offering pre-built systems based on these APUs from today as well.
133 Comments
abufrejoval - Tuesday, January 7, 2014
That's very much how I see it, too. But I see the gamble as being extremely problematic, because so far all of that only works in a very small niche: where one APU is enough.
With 1080p becoming the lower end in everything from smartphones to TVs or beamers, I don't see a single APU powerful enough to become mainstream.
Yes, I can run Unigine Valley at 30FPS in a 1024x576 window on Trinity and Richland and perhaps with Kaveri and Mantle it will work at 720p, but that still falls short of what most people will want. It’s half a PS4 and it needs to scale to twice a PS4 at least.
Now if there was a *natural* scaling path, like the ability to simply add another APU or three to gain resolution (1080p and 4k), then they'd have me convinced.
But currently the only way is to add a discrete GPU and HSA won’t scale with that.
Well, compared to the current situation, where 50% of the APU die area becomes useless as soon as you add a dGPU (which made Trinity/Richland a hard sell for gaming IMHO), with HSA code could still use the iGPU portion of the APU for something useful. But basically a developer would again have to distribute their code into at least two distinct and individually sized buckets of compute resources, and unless 90% of all PCs out there had it, very likely nobody will go through the effort.
Kaveri needs to be multi processor and designed with an interconnect which allows creating a single image SMP HSA system with at least four nodes in a gaming rig and perhaps a little more for HPC or even server use. I also believe that Kaveri should be sold in GPU like modules with DRAM built in, soldered on and optimized for that specific module. GDDR5 or DDR4 depending on where you want to wind up in price and power. These modules should then either be mounted flat for the single or stuck into a passive backplane to create the dual, quad or whatever sized rigs.
With Mantle AMD has game developers ready to invest some fundamental work to redo their engines, if now they could make it scale I could see this turn into a stampede.
As a single APU only design, it could die because the size of the ecosystem is too small to sustain it.
mikato - Wednesday, January 8, 2014
About scaling... they can actually just add another CPU module or two, or GPU cores. These would be bigger more expensive APUs but they'd be what you want. This is sort of what they did with the Jaguar APUs that are in the new consoles. With their module based architecture, and APU marriage, they know they have this flexibility. We'll see what they choose to do with it.
silverblue - Tuesday, January 7, 2014
AMD's weak FPU performance is more of a Bulldozer thing. In any case, in SSE calculations, it should still equal or beat Phenom II, even when referring to an FX-4xxx CPU.
Dribble - Tuesday, January 7, 2014
Pushing everyone to adopt a different architecture only works if you control the market (i.e. in this case you are Intel). AMD are a small time player, for most companies it's simply not worth all the effort it would take to develop stuff for HUMA if 90%+ of your market can't use it. Hence while the tech may be great you know it will fail like the last few iterations of AMD cpu's which also had power point slides that promised great things for the cpu/gpu combo but have actually come to nothing.
Mathos - Tuesday, January 7, 2014
Actually people aren't taking something into consideration here. They do now have the ability to control the market when it comes to gaming. All XBOne, and PS4 games are running said AMD processors, with hUMA already built into them. Those 8 core jaguar apu's were designed with that in mind. Any games ported from those consoles, to the PC will have support for HSA by default. Just the same as any game designed to run on all 3 will have it by default. This is apparent when you look at the ps4 where both the cpu and gpu in the APU use the same bank of GDDR5 memory. http://www.anandtech.com/show/5493/amd-outlines-hs... Something else people forgot about.
Something else to digest: Intel has been doing this for a while, it's called quick sync on their cpu's. So it's no surprise that AMD would make an effort to utilize similar tech on their APU's as well.
To the person saying it'd have to be twice a jaguar apu: those cpu cores in the jaguar apu are minimal function x86-64 cores. Plus, they run at about half the frequency of these full steamroller cores. Which would effectively make a 4 core / 2 module Kaveri APU equal to that console apu other than the gpu component.
Now about AMD's weak fpu. I have to look back at the older reviews comparing the PII and the Bulldozer/piledriver. Every time I look at those, I realize that people were forgetting they were comparing 4 FPU's to 6 in the previous generation. Since BD/PD 8xxx CPU's were 4 module, they only had 4 FPU's. Whereas the older PII X6 had 6 full cores, meaning 6 FPU's. On a per FPU basis, BD/PD was actually a lot stronger than Thuban/Deneb.
silverblue - Wednesday, January 8, 2014
BD's FlexFPU could do double the work of the FPU inside a Phenom II - two 128-bit instructions instead of one.
abufrejoval - Thursday, January 9, 2014
The problem is screen resolution: 60FPS or even 30FPS on 1080p can't be done with a single 128-Bit DDR3 bus. And that's all APUs can offer today. PS4 using GDDR5 and Xbox using eSRAM should prove that to the less technically inclined. At the moment the *top* Trinity/Kaveri APUs are 720p or 1K only for reasonable gaming performance. And while AMD has a whole plethora of APU bins going down, only dGPU is available going up and that doesn't include HSA.
2K is the bare minimum you need today for anything stationary, consoles or PCs and this CES is about going from 4K to 8K for TV screens. So if you don't have a clear answer, vision and growth path today to address these resolutions any chance to come out of the niche is severely hampered.
It doesn't mean you have to deliver 4K yet, 2K is enough, but unless developers know it will be there by the time they need it, they won't take the risk of jumping for HSA. Nor perhaps the consumers, who would certainly prefer a simple seamless upgrade for higher resolution or would like to play the same games on the differently sized screens around the house.
silverblue - Thursday, January 9, 2014
Doesn't the Radeon Bus mitigate the bandwidth limitation somewhat?
lmcd - Tuesday, January 7, 2014
Which is a terrible idea because as weak at compute as Kepler is, Nvidia can upend their roadmap and go back to some of the ideas behind Fermi which would wipe out the compute advantage really quickly.
And then there's Intel's mammoth Knight's Landing looming overhead.
dragonsqrrl - Tuesday, January 7, 2014
Where's that guy bitching about 'doctored' die shots over in the Tegra K1 announcement article? lol, Oh ya, I forgot this is an AMD product, so it's okay.