AMD Unveils CDNA GPU Architecture: A Dedicated GPU Architecture for Data Centersby Ryan Smith on March 5, 2020 5:25 PM EST
- Posted in
- Machine Learning
- AMD FAD 2020
Over the last decade, the industry has seen a boom in demand for GPUs for the data center. Driven in large part by rapid progress in neural networking, deep learning, and all things AI, GPUs have become a critical part of some data center workloads, and their role continues to grow with every year.
Unfortunately for AMD, they’ve largely been bypassed in that boom. The big winner by far as been NVIDIA, who has gone on to make billions of dollars in the field. Which is not to say that AMD hasn’t had some wins with their previous and current generation products, including the Radeon Instinct series, but their share of that market and its revenue has been a fraction of what NVIDIA has enjoyed.
AMD’s fortunes are set to change very soon, however. We already know that AMD (as a supplier to Cray) has scored two big supercomputer wins with the United States – totaling over $1 billion for CPUs and GPUs – so there have been a lot of questions on just what AMD has been working on that has turned the heads of the US government. The answer, as AMD is revealing today, is their new dedicated GPU architecture for data center compute: CDNA.
The Compute counterpart to the gaming-focused RDNA, CDNA is AMD’s compute-focused architecture for data center and other uses. Like everything else being presented at today’s Financial Analyst Day, AMD’s reveal here is at a very high level. But even at that high level, AMD is making it clear that there’s a fission of sorts going on in their GPU development process, leading to CDNA and RDNA becoming their own architectures.
Just how different these architectures are (and over time, will be) remains to be seen. AMD has briefly mentioned that CDNA is going to have less “graphics-bits”, so it’s likely that these parts will have limited (if any) graphics capabilities, making them quite dissimilar from RDNA GPUs in some ways. So broadly speaking, AMD is now on a path similar to what we’ve seen from other GPU vendors, where compute GPUs are increasingly becoming a distinct class of product, as opposed to repurposed gaming GPUs.
AMD’s goals for CDNA are simple and straightforward: build a family of big, powerful GPUs that are specifically optimized for compute and data center usage in general. This is a path AMD already started to go down with GPUs such as Vega 20 (used in the Radeon Instinct MI 50/60), but now with even more specialization and optimization. A big part of this will of course be machine learning performance, which means supporting faster execution of smaller data types (e.g. INT4/INT8/FP16), and AMD even goes as far as to explicitly mention tensor ops. But this can’t come at the cost of traditional FP32/FP64 compute either; those supercomputers that AMD’s GPUs will be going in will be doing a whole lot of high precision math. So AMD needs to perform well across the compute and machine learning spectrum, across many data types.
To get there, AMD will also need to improve their performance-per-watt, as this is an area they have frequently trailed at. Today’s Financial Analyst Day announcement isn’t going into any real detail on how AMD is going to do this – beyond the obvious improvements in manufacturing processes, at least – but AMD is keenly aware of their need to improve.
All the while CDNA will also differentiate itself with features, including some things only AMD can do. Enterprise-grade reliability and security will be one leg here, including support for ever-popular virtualization needs.
But AMD will also be leaning on their Infinity Fabric to give them an edge in performance scaling and CPU/GPU integration. Infinity Fabric has been a big part of AMD’s success story this far on the CPU side of matters, and AMD is applying this same logic to the GPU side of matters. This means using IF to not only link GPUs to other GPUs, but using IF to link GPUs to CPUs. Which is something we’ve already seen in the works for AMD’s supercomputer wins, where both systems will be using IF to team up 4 GPUs with a single CPU.
AMD’s big win, however, will be a bit further down the line, when their 3rd gen Infinity Fabric is ready. It’s at that point where AMD intends to deliver a fully unified CPU/GPU memory space, fully leveraging their ability to provide both the CPUs and GPUs for a system. Unified memory can take a few different forms, so there are some important details that are missing here that will be saved for another day, but ultimately having a unified memory space should make programming heterogenous systems a whole lot easier, which in turn makes incorporating GPUs into servers all the better choice.
And since CDNA is now its own branch of AMD’s GPU architecture – with command of it falling under data center boss Forrest Norrod, interestingly enough – it also has its own roadmap with multiple generations of GPUs. With AMD treating Vega 20 as the branching point here, the company is revealing two generations of CDNA to come, aptly named CDNA (1) and CDNA 2.
CDNA (1) is AMD’s impending data center GPU. We believe this to be AMD’s “Arcturus”, and according to AMD it will be optimized for machine learning and HPC uses. This will be an Infinity Fabric-enabled part, using AMD’s second-generation IF technology. Keeping in mind that this is a high level overview, at this point it’s not super clear whether this part is going in either of AMD’s supercomputer wins; but given what we know so far about the later El Capitan – which is now definitely using CDNA 2 – CDNA (1) may be what’s ending up in Frontier.
Following CDNA (1) of course is CDNA 2. AMD is not sharing too much in the way of details here – after all, they haven’t yet shipped the first CDNA let alone the second – but they have confirmed that it will incorporate AMD’s third generation Infinity Fabric. As well, it will use a newer manufacturing node, which AMD is calling "Advanced Node" for now, as they are not disclosing the specific node they intend to use. So in a few different respects, CDNA 2 will be the piece de resistance of AMD’s heterogeneous compute plans, where they finally get to have a unified, coherent memory system across discrete CPUs and GPUs.
As for shipping dates, while AMD isn’t disclosing exact dates at this time, the roadmap itself only extends to the end of 2022, meaning that AMD expects to be shipping CDNA 2 in volume by then. This aligns fairly well with this week’s El Capitan announcement, which has the supercomputer being delivered in 2023.
Overall, AMD has some significant ambitions for their future data center GPUs. And while they have a lot of catching up to do to realize those ambitions, they’ve certainly laid out a promising roadmap to get there. AMD isn’t wrong about the importance of the data center market from both a technology perspective and a revenue perspective, and having a dedicated branch of their GPU architecture to get there may be just what AMD needs to finally find the success they seek.
Post Your CommentPlease log in or sign up to comment.
View All Comments
Threska - Friday, March 6, 2020 - linkGuess the compute crowd will still be hanging onto their Vegas then.
mode_13h - Sunday, March 8, 2020 - linkJust get a Radeon VII, while you still can. Not long ago, I saw some deals on them for a bit over $500, new.
mode_13h - Sunday, March 8, 2020 - linkRight now, I see an XFX-branded card selling for $600. That's $100 cheaper than where it launched, a year ago, and the most memory bandwidth & 64-bit floating point horsepower you're going to find at that price point for at least a couple more years. Similar gaming performance to a RTX 2080.
Gc - Thursday, March 5, 2020 - linkFor what kinds of algorithms will each CDNA design's floating point operations per second be optimized? The customers of El Capitan might prioritize different algorithms compared to the customers of Frontier. Such as different weights for dense vs. sparse matriices, or bandwidth vs. routing. Are exaflop supercomputers large enough that AMD can customize the GPU design for each supercomputer?
Maybe these contracts are the primary source of AMD's funding for ROCm development, so the algorithms for which the customers require implementations has a strong influence on what parts of the ROCm ecosystem are developed by AMD. For example, the HPE El Capitan supercomputer announcement also mentioned LLNL will work with National Cancer Institute through the ATOM consortium, which has been using tensorflow, so that might explain why most ROCm ML work has been on tensorflow ahead of winning this contract.
sing_electric - Friday, March 6, 2020 - linkI think that's an interesting question and I wonder if they've figured it out yet. I wonder how big the product stack will be, and if part of the reason they're splitting compute and graphics GPUs is so they can offer products targeting half, single and double FP performance (rather than just doing what has frequently happened with consumer GPUs, and creating a base design and offering more/fewer streaming processors/CUDA cores to meet a certain performance or price target).
mode_13h - Sunday, March 8, 2020 - link> Are exaflop supercomputers large enough that AMD can customize the GPU design for each supercomputer?
I seriously doubt it. Those supercomputers use thousands of GPUs, not millions.
That said, the AI market is already branching out into training and inference-specific chips. AMD could do the same, pairing training horsepower with their HPC GPUs and relegating inferencing to their consumer dies. Or, maybe they'll surprise us with 2 or 3 different CDNA dies. Though, I wouldn't be surprised if AMD wants to get some traction in the AI market, before going down the path of specialization.
Gc - Sunday, March 8, 2020 - linkNextPlatform reported: "While the Frontier system that is being installed in 2021 and put into production in 2022 is based on custom Epyc CPU and custom Radeon Instinct GPU motors, the contract with Lawrence Livermore specifies that El Capitan will be built with standard Epyc CPU and standard Radeon Instinct GPU parts, according to Forrest Norrod, general manager of the Datacenter and Embedded Systems Group at AMD." https://www.nextplatform.com/2020/03/04/lawrence-l...
mode_13h - Sunday, March 8, 2020 - linkWhat the mean by "custom" is worth questioning. Google uses "custom" Vega GPUs in Stadia, but the hardware specs make it pretty darn clear that it's basically a Vega 56, but with faster memory. As such, I think it's a custom card, maybe even down to the GPU package level, but I highly doubt anything at the die-level.
haukionkannel - Friday, March 6, 2020 - linkThis Also means that when there is rumor/leak. It is only for datacenter or Gaming. Not for both!
It is very likely that cu80 version is only datacenter. And Gamin GPUs gets different amounth of CUs.
scineram - Friday, March 6, 2020 - linkNo, an 80 CU GPU is probably for rendering. CDNA1 has 128 CU.