NVIDIA Announces CUDA 6: Unified Memory for CUDA
by Ryan Smith on November 14, 2013 9:00 AM EST
Kicking off next week will be the annual International Conference for High Performance Computing, Networking, Storage, and Analysis, better known as SC. For NVIDIA, next to their annual GPU Technology Conference, SC is their second biggest GPU compute conference, and is typically the venue for NVIDIA’s summer/fall announcements. To that end NVIDIA has a number of announcements lined up for this year, so many in fact that they’re pushing out some of them ahead of the conference just to keep them from being overwhelming. The most important of those announcements in turn will be the announcement of the next version of CUDA, CUDA 6.
Unlike some prior CUDA releases, NVIDIA isn’t touting a large number of new features for this version of CUDA. But what few elements NVIDIA is working on are going to be very significant.
The big news here – and the headlining feature for CUDA 6 – is that NVIDIA has implemented complete unified memory support within CUDA. The toolkit has possessed unified virtual addressing support since CUDA 4, allowing the disparate x86 and GPU memory pools to be addressed together in a single space. But unified virtual addressing only simplified memory management; it did not get rid of the required explicit memory copying and pinning operations necessary to bring over data to the GPU first before the GPU could work on it.
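For reference, the explicit model described above looks roughly like the following. This is a minimal sketch built on the standard CUDA runtime calls `cudaMalloc`, `cudaMemcpy`, and `cudaFree`; the `scale` kernel and the `check` and `scale_explicit` helpers are hypothetical names made up for illustration and are not taken from NVIDIA's announcement.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: multiply every element by a constant.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(1);
    }
}

// Scale a host array on the GPU with programmer-managed allocation and copies.
void scale_explicit(float* host, int n, float factor) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* dev = nullptr;

    check(cudaMalloc(&dev, bytes), "cudaMalloc");            // separate GPU buffer
    check(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice),
          "copy to device");                                 // explicit copy in

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(dev, n, factor);
    check(cudaGetLastError(), "kernel launch");

    check(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost),
          "copy to host");                                   // explicit copy out
    check(cudaFree(dev), "cudaFree");
}

int main() {
    const int n = 1024;
    static float host[n];
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    scale_explicit(host, n, 2.0f);

    // Every element should have been doubled on the GPU.
    for (int i = 0; i < n; ++i) assert(host[i] == 2.0f * i);
    std::printf("explicit copy path ok\n");
    return 0;
}
```

Two pointers (`host` and `dev`) refer to the same logical data, and keeping them in sync is entirely the programmer's job.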
With CUDA 6 NVIDIA has finally taken the next step towards removing those memory copies entirely, by making it possible to abstract the memory management away from the programmer. This is achieved through the CUDA 6 unified memory implementation, which layers a unified memory system on top of the existing memory pool structure. With unified memory, programmers can access any resource or address within the legal address space, regardless of which pool the address actually resides in, and operate on its contents without first explicitly copying the memory over.
Now to be clear here, CUDA 6’s unified memory system doesn’t resolve the technical limitations that require memory copies – specifically, the limited bandwidth and latency of PCIe – rather it’s a change in who’s doing the memory management. Data still needs to be copied to the GPU to be operated upon, but whereas CUDA 5 required explicit memory operations (higher level toolkits built on top of CUDA notwithstanding), CUDA 6 offers the ability to have CUDA do it instead, freeing the programmer from the task.
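To make the change concrete, here is a hedged sketch of what letting CUDA do the copying can look like. NVIDIA's announcement as covered here does not name an API; this example assumes `cudaMallocManaged`, the runtime call that exposes managed memory in the CUDA 6 toolkit. The `scale` kernel and the `check` and `run_managed` helpers are hypothetical names used only for illustration.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: multiply every element by a constant.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(1);
    }
}

// Fill, scale, and verify an array through a single managed pointer.
// Returns true if every element holds the expected value afterwards.
bool run_managed(int n, float factor) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* data = nullptr;

    // One allocation, one pointer, usable from both CPU and GPU code.
    check(cudaMallocManaged(&data, bytes), "cudaMallocManaged");

    for (int i = 0; i < n; ++i) data[i] = static_cast<float>(i);  // CPU writes

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(data, n, factor);                  // GPU updates
    check(cudaGetLastError(), "kernel launch");

    // The CPU must not touch the data again until the GPU has finished.
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");

    bool ok = true;
    for (int i = 0; i < n; ++i) {                                 // CPU reads
        if (data[i] != static_cast<float>(i) * factor) ok = false;
    }
    check(cudaFree(data), "cudaFree");
    return ok;
}

int main() {
    assert(run_managed(1024, 2.0f));
    std::printf("managed memory path ok\n");
    return 0;
}
```

There is no `cudaMemcpy` anywhere in this version, but the data still crosses PCIe; the runtime simply performs the transfers on the program's behalf.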
The end result as such isn’t necessarily a shift in what CUDA devices can do or how fast they do it, since the memory copies didn’t go away; rather, it further simplifies CUDA programming by removing the need for programmers to manage those copies themselves. This in turn is intended to make CUDA programming more accessible to wider audiences that may not have been interested in doing their own memory management, and to free existing CUDA developers from having to do it in the future, speeding up code development.
With that said NVIDIA isn’t talking about the performance impact at this time. Memory abstractions such as these typically have some kind of performance penalty over manual memory management – after all, who knows more about the memory needs of an application than an application itself – but of course manual memory management isn’t going anywhere, as there will still be scenarios where the higher complexity is worth the tradeoff.
Meanwhile it’s interesting to note that this comes ahead of NVIDIA’s upcoming Maxwell GPU architecture, whose headline feature is also unified memory. From what NVIDIA is telling us they developed the means to offer a unified memory implementation today entirely in software, so they went ahead and developed that ahead of Maxwell’s release. Maxwell will have some kind of hardware functionality for implementing unified memory (and presumably better performance for it), though it’s not something NVIDIA is talking about until Maxwell is ready for its full unveiling. In the interim NVIDIA has laid the groundwork for what Maxwell will bring by getting unified memory into the toolkit before Maxwell even ships.
Moving on, there is a pair of further, smaller additions that will be coming to CUDA with CUDA 6. The first of these is that CUDA 6 will come with new BLAS and FFT libraries that are further tuned for multi-GPU scaling, with these new libraries supporting scaling up to 8 GPUs in a node. Meanwhile NVIDIA will also be releasing drop-in compatible libraries for BLAS and FFTW, allowing applications that use those libraries to use the GPU accelerated versions of their respective routines just by replacing the library.
Wrapping things up, NVIDIA will be showing off CUDA 6 and the rest of their announcements at SC13 next week. Meanwhile we’ll be back on Monday with coverage of the rest of NVIDIA’s SC13 announcements.
43 Comments
Klimax - Saturday, November 16, 2013 - link
So they should just support another closed proprietary API, where they gain no advantage and just give competition another advantage, right...
Maxwell_88 - Thursday, November 14, 2013 - link
While this shouldn't change performance in any appreciable way this is going to enable a whole class of applications to be run on the GPU. For example currently no data structures with embedded pointers work on the GPU because the pointer is meaningless when we copy it over to the GPU.
With unified memory architecture CUDA takes care of this translation meaning u can now run applications that use complex data structures such as trees, linked lists etc.
Morawka - Thursday, November 14, 2013 - link
This guy is the only one who knows what he's talkin about.
p1esk - Thursday, November 14, 2013 - link
That was my thought as well.
Senti - Friday, November 15, 2013 - link
Yup, except performance would be horrible. Even now it's possible to map GPU memory to CPU and work with it like with regular memory with pointers etc., but no one sane would treat that mapped memory the same as system one if you care about the performance. It's not like you can't use unified pointers – you don't want to.
Hung_Low - Thursday, November 14, 2013 - link
More GPU scaling? Stacking 8? Mama mia!!! The high end gaming community will go nuts again
looncraz - Thursday, November 14, 2013 - link
Uh huh... so basically it hides the copies from the developer so it can change the underlying mechanics at a later date. This is NOT unified memory, it is simply memory management.
I've written things like this for almost two decades now to make my life a little easier... nothing innovative (though it may allow for some minimal performance CHANGES - gains here, losses there...).
AMD has it right, nVidia is doing damage control...
Krysto - Thursday, November 14, 2013 - link
This proves once again that Nvidia's chip engineers got lazy and stayed behind the software engineers, who were working on CUDA 6. Maxwell was supposed to arrive at the same time with CUDA 6, but I think they've delayed it until 2H2014.
How does CUDA 6 compare to OpenCL 2.0?
Yojimbo - Friday, November 15, 2013 - link
Aren't AMD and NVidia's new architectures delayed by TSMC's process cost issues? From what seems to be floating around, the new architectures are designed to be (or were designed to be) implemented on a <=20nm process technology, but NVidia is unhappy with the cost of 20nm production. Thermal density scaling is not great, costs are higher and projected to stay high, so I guess the main benefit is smaller die size? I think TSMC's 16nm finfet node is supposed to have about the same areal density as the 20nm, but with better thermal characteristics. I could have things wrong, though.
maximumGPU - Friday, November 15, 2013 - link
Doesn't C++AMP kinda do some of this already? you can define an array such that it can live on either the cpu or gpu memory, and copying will take place behind the scenes.
Anyway even if it's not truly unified memory, the lower entry barrier is a boon, and you can always opt for handling memory management yourself if performance is critical.