AMD Zen 2 Microarchitecture Analysis: Ryzen 3000 and EPYC Rome
by Dr. Ian Cutress on June 10, 2019 7:22 PM EST
Posted in: CPUs, AMD, Ryzen, EPYC, Infinity Fabric, PCIe 4.0, Zen 2, Rome, Ryzen 3000, Ryzen 3rd Gen
Editor's Note: With Zen 2 set to launch tomorrow (7/7), here's our architecture analysis from last month for some timely background information.
We have been teased with AMD’s next generation processor products for over a year. The new chiplet design has been heralded as a significant breakthrough in driving performance and scalability, especially as it becomes increasingly difficult to create large silicon with high frequencies on smaller and smaller process nodes. AMD is expected to deploy its chiplet paradigm across its processor line, through Ryzen and EPYC, with those chiplets each having eight next-generation Zen 2 cores. Today AMD went into more detail about the Zen 2 core, providing justification for the +15% clock-for-clock performance increase over the previous generation that the company presented at Computex last week.
AMD’s Zen 2 Product Portfolio
The current products that AMD has announced that have Zen 2 cores include the Ryzen 3rd Generation consumer CPUs, known as the Ryzen 3000 family, and AMD’s next generation enterprise EPYC processor, known as Rome. As of today, AMD has announced explicit details of six consumer Ryzen 3000 processors, including core counts, frequencies, memory support, and power. Details about the server processor, aside from some peak values, are expected in due course over the next few months.
AMD 'Matisse' Ryzen 3000 Series CPUs

| AnandTech | Cores / Threads | Base Freq | Boost Freq | L2 Cache | L3 Cache | PCIe 4.0 | DDR4 | TDP | Price (SEP) |
|---|---|---|---|---|---|---|---|---|---|
| Ryzen 9 3950X | 16C / 32T | 3.5 GHz | 4.7 GHz | 8 MB | 64 MB | 16+4+4 | 3200 | 105 W | $749 |
| Ryzen 9 3900X | 12C / 24T | 3.8 GHz | 4.6 GHz | 6 MB | 64 MB | 16+4+4 | 3200 | 105 W | $499 |
| Ryzen 7 3800X | 8C / 16T | 3.9 GHz | 4.5 GHz | 4 MB | 32 MB | 16+4+4 | 3200 | 105 W | $399 |
| Ryzen 7 3700X | 8C / 16T | 3.6 GHz | 4.4 GHz | 4 MB | 32 MB | 16+4+4 | 3200 | 65 W | $329 |
| Ryzen 5 3600X | 6C / 12T | 3.8 GHz | 4.4 GHz | 3 MB | 32 MB | 16+4+4 | 3200 | 95 W | $249 |
| Ryzen 5 3600 | 6C / 12T | 3.6 GHz | 4.2 GHz | 3 MB | 32 MB | 16+4+4 | 3200 | 65 W | $199 |
The Zen 2 design paradigm, compared to the first generation of Zen, has changed significantly. The new platform and core implementation is designed around small 8-core chiplets built on TSMC’s 7nm manufacturing process, each measuring around 74-80 square millimeters. Each chiplet contains two groups of four cores arranged as a ‘core complex’, or CCX, which pairs those four cores with a slice of L3 cache – the L3 cache per CCX is doubled for Zen 2 over Zen 1.
Each full CPU, regardless of how many chiplets it has, is paired with a central IO die through Infinity Fabric links. The IO die acts as the central hub for all off-chip communications: it houses all of the processor's PCIe lanes, the memory channels, and the Infinity Fabric links to the chiplets and to other CPUs. The IO die for the EPYC Rome processors is built on Global Foundries' 14nm process, while the consumer processor IO die (which is smaller and contains fewer features) is built on Global Foundries' 12nm process.
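This CCX-level partitioning of the L3 is visible on a running system. As a rough illustration, the sketch below is a minimal Python example assuming a Linux machine and the standard /sys/devices/system/cpu cache layout (the function name l3_groups is just illustrative); it groups logical CPUs by the L3 slice they report sharing, so on a Zen or Zen 2 part each group should correspond to one CCX.

```python
# Minimal sketch: group logical CPUs by the L3 cache they share (Linux sysfs).
# Assumes the standard /sys/devices/system/cpu/cpuN/cache layout; on Zen 2,
# each resulting group should correspond to one CCX sharing its own L3 slice.
import glob
import os

def l3_groups():
    groups = {}
    for cpu_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"):
        cpu = os.path.basename(cpu_dir)
        for idx in glob.glob(os.path.join(cpu_dir, "cache", "index*")):
            with open(os.path.join(idx, "level")) as f:
                if f.read().strip() != "3":
                    continue  # only interested in L3 entries
            with open(os.path.join(idx, "shared_cpu_list")) as f:
                shared = f.read().strip()
            groups.setdefault(shared, set()).add(cpu)
    return groups

if __name__ == "__main__":
    for shared, cpus in sorted(l3_groups().items()):
        print(f"L3 shared by CPUs {shared}: reported by {len(cpus)} logical CPUs")
```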
The consumer processors, known as ‘Matisse’, Ryzen 3rd Gen, or the Ryzen 3000 series, will be offered with up to two chiplets for sixteen cores. AMD is launching six versions of Matisse on July 7th, ranging from six cores to sixteen. The six- and eight-core processors have one chiplet, while the higher core count parts have two, but in all cases the IO die is the same. This means that every Zen 2 based Ryzen 3000 processor will have access to 24 PCIe 4.0 lanes and dual-channel memory. Based on the announcements today, prices will range from $199 for the Ryzen 5 3600 up to $749 for the sixteen-core Ryzen 9 3950X.
The EPYC Rome processors, built from these Zen 2 chiplets, will have up to eight of them, enabling a platform that can support up to 64 cores. As with the consumer processors, the chiplets do not communicate directly with one another – each chiplet connects only to the central IO die. That IO die houses the links for eight memory channels and up to 128 lanes of PCIe 4.0 connectivity.
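As a rough sense of scale for those interfaces, the back-of-the-envelope sketch below estimates the peak numbers they imply, assuming DDR4-3200 on all eight channels and roughly 1.97 GB/s per PCIe 4.0 lane per direction; sustained real-world figures will be lower.

```python
# Back-of-the-envelope peak bandwidth for an EPYC Rome socket (assumptions noted).
DDR4_MT_S = 3200            # assumed DDR4-3200 on every channel
CHANNELS = 8
BYTES_PER_TRANSFER = 8      # 64-bit channel width

PCIE4_LANES = 128
PCIE4_GBPS_PER_LANE = 1.97  # ~16 GT/s with 128b/130b encoding, per direction

mem_gb_s = DDR4_MT_S * 1e6 * BYTES_PER_TRANSFER * CHANNELS / 1e9
pcie_gb_s = PCIE4_LANES * PCIE4_GBPS_PER_LANE

print(f"Peak DRAM bandwidth   : {mem_gb_s:.1f} GB/s")                 # ~204.8 GB/s
print(f"Peak PCIe 4.0 bandwidth: {pcie_gb_s:.1f} GB/s per direction")  # ~252 GB/s
```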
AMD’s Roadmap
Before diving into the new product line, it is worth recapping where we currently sit in AMD’s planned roadmap.
In previous roadmaps, showcasing AMD’s movement from Zen to Zen 2 and Zen 3, the company has explained that this multi-year structure will showcase Zen in 2017, Zen 2 in 2019, and Zen 3 by 2021. The cadence isn’t exactly a year, as it has depended on AMD’s design and manufacturing abilities, as well as agreements with its partners in the foundries and the current market forces.
AMD has stated that its plan for Zen 2 was always to launch on 7nm, which ended up being TSMC’s 7nm (Global Foundries wasn’t going to be ready in time, and ultimately pulled the plug on its 7nm process). The next generation Zen 3 is expected to align with an updated 7nm process. AMD has not commented on a potential ‘Zen 2+’ design in the works, although at this point we do not expect to see one.
Beyond Zen 3, AMD has already stated that Zen 4 and Zen 5 are currently in various levels of their respective design stages, although the company has not committed to particular time frames or process node technologies. AMD has stated in the past that the paradigms of these platforms and processor designs are being set 3-5 years in advance, and the company states it has to make big bets every generation to ensure it can remain competitive.
For a small insight into Zen 4: in an interview at Computex, Forrest Norrod, SVP of AMD’s Enterprise, Embedded, and Semi-Custom group, exclusively revealed to AnandTech the code name of AMD’s Zen 4 EPYC processor: Genoa.
AMD EPYC CPU Codenames

| Gen | Year | Name | Cores |
|---|---|---|---|
| 1st | 2017 | Naples | 32 x Zen 1 |
| 2nd | 2019 | Rome | 64 x Zen 2 |
| 3rd | 2020 | Milan | ? x Zen 3 |
| 4th | ? | Genoa | ? x Zen 4 |
| 5th | ? | ? | ? x Zen 5 |
Forrest explained that the Zen 5 code name follows a similar pattern, but would not comment on the time frame for the Zen 4 product. Given that the Zen 3 design is expected in mid-2020, that would put a Zen 4 product in late 2021/early 2022, if AMD follows its cadence. How this will play into AMD’s consumer roadmap plans is unclear at this point, and will depend on how AMD approaches its chiplet paradigm and any future adjustments to its packaging technology in order to enable further performance improvements.
216 Comments
eek2121 - Wednesday, June 19, 2019 - link
I think what people are getting at is having an L4 cache. Such a cache would be slower than L3, but would be much faster than DRAM (for now; DDR-5133 was recently demonstrated, which is a 2566 MHz clock at double data rate). HBM2 is a prime candidate because you can stick 8 GB on a CPU for $60, and with some engineering work it would help performance massively; 8 GB could hold practically everything needed in cache. That being said, there are engineering challenges to overcome and I doubt this will ever be a thing. Once JEDEC approves RAM running at DDR-5600 with reasonable timings it won't matter anyway: AMD can simply bump the IF speed up to 1:1, and with shortened RAM traces, performance penalties can be minimized.
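For reference on the figures in the comment above: a quoted DDR data rate (in MT/s) is twice the underlying memory clock, and per-channel peak bandwidth is the data rate times the 8-byte channel width. The small sketch below just runs that arithmetic for DDR4-3200 and the higher rates mentioned; the helper name ddr_summary is illustrative.

```python
# Quick check of the DDR figures above: data rate (MT/s) = 2 x memory clock (MHz),
# per-channel peak bandwidth = data rate x 8 bytes (64-bit channel).
def ddr_summary(data_rate_mt_s):
    clock_mhz = data_rate_mt_s / 2
    bandwidth_gb_s = data_rate_mt_s * 8 / 1000
    return clock_mhz, bandwidth_gb_s

for rate in (3200, 5133, 5600):  # DDR4-3200 baseline, plus the rates mentioned above
    clock, bw = ddr_summary(rate)
    print(f"DDR-{rate}: {clock:.0f} MHz clock, ~{bw:.1f} GB/s per channel")
```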
jamescox - Saturday, June 22, 2019 - link
For an interposer-based Epyc package in the next generation, I would expect perhaps an active interposer with all of the external interface transistors in the interposer. They could do similar things with a passive interposer too, and a passive interposer could be an intermediate step between Zen 3 and Zen 4. Then they could place a large number of 7nm+ chiplets on the interposer.

As I said, it is hard to speculate, but an option I thought of, based on the AdoredTV 15-chiplet rumor, would be to have 4 memory controller chips, each running 2 channels (128-bit) of DDR5. Those chips would just be the memory controller logic (if on an active interposer) plus the interfaces to the interposer connections. That isn't much, so at 7nm and below they could place massive L4 SRAM caches on the memory controller chips. Current ~75 mm² Zen 2 chiplets hold 16 MB of L3 per CCX (32 MB per chiplet) plus 8 CPU cores, so it could be a large amount of cache; perhaps something like 64 or 128 MB per chip. It wouldn't be a cheap device, but AMD's goal is to get into the high-end market eventually.

The other chiplets could be one or two die to manage connections out to the CPU chiplets. This would just be the logic with an active interposer; with a regular interposer it would also need the IO transistors, but those interfaces are quite small. A single Infinity Fabric switch chip handling all CPU chiplets could provide very low latency. They may have another chip with a switch to tie everything together, or they could actually place a couple of CPU chiplets on the interposer. Two extra CPU chiplets, or one 16-core chiplet, could be where the 80-core rumor came from. A possible reason to do that is to allow an HBM-based GPU to be mounted on either side; that would make an exceptional HPC product with 16 cores (possibly 64 threads if they go to 4-way SMT) and 2 HBM GPUs.

Another way to get 80 cores would be to make a 3-CCX chiplet with 12 cores. It looks like the Epyc package will not fit eight 12-core die, though; a mixture of 4 12-core and 4 8-core die looks like it would fit, but it wouldn't be symmetric. That would allow a quick Zen 2+ style upgrade: desktop might be able to go to 24 cores and Epyc to 80. The confusion could be mixing up a Zen 2+ rumor and a Zen 3 rumor, or something like that. The interposer makes a lot of sense for the giant IO die that cannot easily be implemented at 7nm; the yields probably don't support that large a die, so you use an interposer and make a bunch of ~100 mm² die instead.
I can’t rule out placing HBM on an IO interposer, but due to the latency not really being that much better than off package DRAM, especially at DDR5 speeds, it just doesn’t seem like they would do it.
nandnandnand - Sunday, July 7, 2019 - link
"That being said, there are engineering challenges to overcome and I doubt this will ever be a thing."Putting large amounts of DRAM ever closer to the CPU will definitely be a thing:
https://www.darpa.mil/attachments/3DSoCProposersDa...
Intel is already moving in this direction with Foveros, and AMD is also working on it:
https://www.tomshardware.com/news/amd-3d-memory-st...
It doesn't matter how fast DDR5 is. The industry must move in this direction to grab performance and power efficiency gains.
AdrianMel - Sunday, June 16, 2019 - link
I would like to see these AMD chips used in laptops; it would be a breakthrough in computing power and low power consumption. I think that if HBM2 or a larger memory were integrated into the processor, it would double the computing power. It would also be worth studying and implementing two high-speed ports, like the old ExpressCard/54, into which we could insert two video cards in laptops.
nandnandnand - Sunday, July 7, 2019 - link
AMD needs to put out some 6-8 core Zen 2 laptop chips.
peevee - Monday, June 17, 2019 - link
Does it mean that AVX2 performance doubles compared to Zen+? At least on workloads where the data for the inner loop fits into L1D$ (hierarchical dense matrix multiplication, etc.)?
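One way to frame that question is peak floating-point throughput per core: Zen and Zen+ execute 256-bit AVX2 operations as two 128-bit micro-ops, while Zen 2 has native 256-bit FP datapaths, so peak FLOPs per clock roughly doubles. The sketch below is a back-of-the-envelope comparison under those assumptions (two FMA pipes per core, double precision, data resident in L1D); sustained results depend heavily on the kernel and memory behaviour, and the function name peak_gflops_dp is illustrative.

```python
# Rough peak double-precision FLOPS estimate.
# Assumptions: two FMA-capable pipes per core, 128-bit FP datapaths on Zen/Zen+
# versus 256-bit on Zen 2, and all inner-loop data resident in L1D.
def peak_gflops_dp(cores, ghz, simd_bits):
    doubles_per_vec = simd_bits // 64            # DP elements per vector register
    flops_per_clock = 2 * 2 * doubles_per_vec    # 2 FMA pipes x 2 FLOPs per element
    return cores * ghz * flops_per_clock

for name, simd in (("Zen+ (128-bit FP)", 128), ("Zen 2 (256-bit FP)", 256)):
    print(f"{name}: ~{peak_gflops_dp(8, 4.0, simd):.0f} GFLOPS peak at 8C / 4.0 GHz")
```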
"AMD manages its L3 by sharing a 16MB block per CCX, rather than enabling access to any L3 from any core."Does it mean that for code and shared data caches, 64MB L3 on Ryzen 9 behaves essentially like 16MB cache (say, all 12/16 cores run the same code as it usually is in performance-critical client code and not 4+ different processes/VMs in parallel)? What a waste it is/would be...
jamescox - Saturday, June 22, 2019 - link
The caches on different CCXs can communicate with each other. In Zen 2, those on the same die probably communicate at core clock rather than at memory clock; there is no memory clock on the CPU chiplet. The link speeds between chiplets have essentially more than doubled versus Zen 1, and there is a possibility that the widths have been doubled as well. There just about isn't any other way to scale to such core counts.

An Intel monolithic high-core-count device will have trouble competing. The latency of their mesh network goes up with more cores and it burns a lot of power, and the latency of the L3 across a mesh network will be higher than the latency within a 4-core CCX. Problems with the CCX architecture are mostly due to OS scheduler issues and badly written multithreaded code; many applications performed significantly better on Linux than on Windows because of this.
The mesh network is also not workable across multiple chiplets. A 16-core (or even a 10-core) monolithic device would be quite large on 10 nm, and they would be wasting a bunch of expensive 10 nm capacity on IO. With the large die size and questionable yields, it will be a much more expensive chip than AMD's MCM. Also, current Intel chips top out at 38.5 MB of L3 cache on 14 nm, and those are mostly expensive Xeon processors, whereas AMD will have a 32 MB part for $200 and a 64 MB part for $500. Even when Intel actually gets a 10 nm part onto the desktop, it will likely be much more expensive, and they are also going to have serious problems getting their 10 nm parts up to clock speeds competitive with the 14 nm parts. They have been tweaking 14 nm for something like 5+ years now; pushing the clocks on their problematic 10 nm process doesn't sound promising.
peevee - Monday, June 17, 2019 - link
"One of the features of IF2 is that the clock has been decoupled from the main DRAM clock....For Zen 2, AMD has introduced ratios to the IF2, enabling a 1:1 normal ratio or a 2:1 ratio that reduces the IF2 clock in half."
I have news for you - 2:1 is still COUPLED. False advertising in the slides.
And besides, who in their right mind would want to halve the IF clock to go from DDR4-3200 to even DDR4-4000 (with the requisite higher timings)?
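For concreteness, the two IF2 ratios work out as in the small sketch below, assuming FCLK equals the memory clock (half the DDR data rate) at 1:1 and half of that at 2:1; it illustrates the point above, since DDR4-4000 at 2:1 yields a lower fabric clock than DDR4-3200 at 1:1. The helper name fclk_mhz is illustrative.

```python
# Effective Infinity Fabric clock (FCLK) for the two IF2 ratios discussed above.
# Assumption: FCLK = memory clock (data rate / 2) at 1:1, and half of that at 2:1.
def fclk_mhz(ddr_data_rate, ratio):
    mclk = ddr_data_rate / 2
    return mclk if ratio == "1:1" else mclk / 2

for rate in (3200, 3733, 4000):
    print(f"DDR4-{rate}: FCLK {fclk_mhz(rate, '1:1'):.0f} MHz at 1:1, "
          f"{fclk_mhz(rate, '2:1'):.0f} MHz at 2:1")
```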
BMNify - Saturday, June 22, 2019 - link
The only real-world test that matters is the UHD2/8K Rec. 2020/BT.2020 live NHK/BBC broadcast of the 2020 Summer Olympics (beginning Friday, 24 July) and the related video streams: can AMD Zen 2, or any PC core, do real-time x264/x265/ffmpeg software encoding and x264/x265 spec-compliant decoding (notice how many hardware-assisted encoders today don't decode to spec, as seen when you re-encode their output with the latest ffmpeg)? How many 8K encodes can it sustain, and what overhead remains, if any core can even manage one...