When we last discussed the NVIDIA Titan V in our preview, it was only a few weeks after its surprise launch at the 2017 Neural Information Processing Systems conference. We came away with the understanding that the Volta-based Titan V was a new breed of NVIDIA’s prosumer line of video cards, one that essentially encapsulated NVIDIA’s recent datacenter/compute achievements and how they got there. Which is to say, deep learning and neural networking have quickly become the driving force behind NVIDIA GPUs as state-of-the-art compute accelerators, which now incorporate built-in hardware and software acceleration for machine learning operations. Deep learning prowess is the calling card of the Titan V and of Volta in general, and that performance is what we will be investigating today.

The most eye-catching of Volta’s new features are the new specialized processing blocks – tensor cores – but as we will see, they are very much integrated with the rest of Volta’s microarchitectural improvements and the surrounding software/framework support for deep learning (DL) and high performance compute (HPC). Matching up against the NVIDIA Titan V are the Titan Xp and GeForce GTX Titan X (Maxwell), with the AMD Radeon RX Vega 64 also present for some tests.
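
For a concrete sense of what a tensor core operation looks like from the programmer’s side, below is a minimal sketch using CUDA’s warp matrix multiply-accumulate (WMMA) API from mma.h, which is how Volta exposes its 16x16x16 mixed-precision matrix math to device code. The kernel and pointer names are our own illustration rather than anything from NVIDIA’s libraries, and the kernel assumes it is launched with a single warp per tile and compiled for sm_70.

```cuda
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// One warp computes D = A*B + C on a single 16x16 output tile.
// A and B are FP16, the accumulator C/D is FP32 -- the mixed-precision
// mode that Volta's tensor cores are built around.
__global__ void tile_mma(const half* a, const half* b, const float* c, float* d)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                      // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major);

    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);             // the tensor core operation

    wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_row_major);
}
```

In practice, most users will only ever hit this path indirectly through cuBLAS, cuDNN, or a framework with tensor core support enabled, rather than writing WMMA kernels by hand.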

NVIDIA GPU Specification Comparison
  Titan V Titan Xp GTX Titan X (Maxwell) GTX Titan
CUDA Cores 5120 3840 3072 2688
Tensor Cores 640 N/A N/A N/A
ROPs 96 96 96 48
Core Clock 1200MHz 1485MHz 1000MHz 837MHz
Boost Clock 1455MHz 1582MHz 1075MHz 876MHz
Memory Clock 1.7Gbps HBM2 11.4Gbps GDDR5X 7Gbps GDDR5 6Gbps GDDR5
Memory Bus Width 3072-bit 384-bit 384-bit 384-bit
Memory Bandwidth 653GB/sec 547GB/sec 336GB/sec 288GB/sec
VRAM 12GB 12GB 12GB 6GB
L2 Cache 4.5MB 3MB 3MB 1.5MB
Single Precision 13.8 TFLOPS 12.1 TFLOPS 6.6 TFLOPS 4.7 TFLOPS
Double Precision 6.9 TFLOPS
(1/2 rate)
0.38 TFLOPS
(1/32 rate)
0.2 TFLOPS
(1/32 rate)
1.5 TFLOPS
(1/3 rate)
Half Precision 27.6 TFLOPS
(2x rate)
0.19 TFLOPs
(1/64 rate)
N/A N/A
Integer (INT8) 55.2 TOPS
(4x rate)
48.4 TOPS
(4x rate)
26.4 TOPS
(4x rate)
N/A
Tensor Performance
(Deep Learning)
110 TFLOPS N/A N/A N/A
Other Native INT Operations INT32, DP4A, DP2A DP4A, DP2A N/A N/A
GPU GV100 (815mm2) GP102 (471mm2) GM200 (601mm2) GK110 (561mm2)
Transistor Count 21.1B 12B 8B 7.1B
TDP 250W 250W 250W 250W
Manufacturing Process TSMC 12nm FFN TSMC 16nm FinFET TSMC 28nm TSMC 28nm
Architecture Volta Pascal Maxwell 2 Kepler
Launch Date 12/07/2017 04/07/2017 08/02/2016 02/21/13
Price $2999 $1299 $999 $999
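
As a sanity check on the throughput figures above, the peak rates fall directly out of core counts and per-clock throughput. Judging by the quoted numbers, NVIDIA rates the Titan V against a typical sustained clock of roughly 1.35GHz rather than the 1455MHz maximum boost; the ratios between precisions are fixed by the GV100 SM itself, where each of the Titan V’s 80 enabled SMs pairs 64 FP32 cores (128 FLOPS per clock) with 8 tensor cores (1,024 FLOPS per clock).

\[
\begin{aligned}
\text{FP32} &\approx 5120~\text{cores} \times 2~\tfrac{\text{FLOPS}}{\text{core}\cdot\text{clock}} \times 1.35~\text{GHz} \approx 13.8~\text{TFLOPS} \\
\text{FP64} &= \tfrac{1}{2} \times \text{FP32} \approx 6.9~\text{TFLOPS}, \qquad \text{FP16} = 2 \times \text{FP32} \approx 27.6~\text{TFLOPS} \\
\text{Tensor} &= 8 \times \text{FP32} \approx 110~\text{TFLOPS}
\end{aligned}
\]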

Circling back to NVIDIA’s compute endeavors, with the Titan V, the Titan brand became closer than ever to workstation-class compute, featuring a high-end compute-centric GPU for the first time: the gargantuan 815mm2 GV100. Complete with a workstation-class price tag of $3000, the Titan V doubled down on high performance compute (HPC) and deep learning (DL) acceleration in hardware and software, while maintaining the fastest graphics performance around. Looking back, it’s a far cry from the original Kepler-based GeForce GTX Titan, a jack-of-all-trades video card that acted as the enthusiast flagship with full double precision (FP64) compute for prosumers. Up until the Titan V, NVIDIA’s Titan lineup more-or-less represented that design methodology, where a big GPU served as the lynchpin for both the compute and consumer lines.

NVIDIA Tesla/Titan Family Specification Comparison
  Tesla V100 (SXM2) Tesla V100 (PCIe) Titan V (PCIe) Tesla P100 (SXM2)
CUDA Cores 5120 5120 5120 3584
Tensor Cores 640 640 640 N/A
Core Clock ? ? 1200MHz 1328MHz
Boost Clock 1455MHz 1370MHz 1455MHz 1480MHz
Memory Clock 1.75Gbps HBM2 1.75Gbps HBM2 1.7Gbps HBM2 1.4Gbps HBM2
Memory Bus Width 4096-bit 4096-bit 3072-bit 4096-bit
Memory Bandwidth 900GB/sec 900GB/sec 653GB/sec 720GB/sec
VRAM 16GB/32GB 16GB/32GB 12GB 16GB
ECC Yes Yes No Yes
L2 Cache 6MB 6MB 4.5MB 4MB
Half Precision 30 TFLOPS 28 TFLOPS 27.6 TFLOPS 21.2 TFLOPS
Single Precision 15 TFLOPS 14 TFLOPS 13.8 TFLOPS 10.6 TFLOPS
Double Precision 7.5 TFLOPS 7 TFLOPS 6.9 TFLOPS 5.3 TFLOPS
Tensor Performance (Deep Learning) 120 TFLOPS 112 TFLOPS 110 TFLOPS N/A
GPU GV100 GV100 GV100 GP100
Transistor Count 21B 21B 21.1B 15.3B
TDP 300W 250W 250W 300W
Form Factor Mezzanine (SXM2) PCIe PCIe Mezzanine (SXM2)
Cooling Passive Passive Active Passive
Manufacturing Process TSMC 12nm FFN TSMC 12nm FFN TSMC 12nm FFN TSMC 16nm FinFET
Architecture Volta Volta Volta Pascal
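
The memory bandwidth figures are likewise simple arithmetic – bus width times per-pin data rate – and show where the Titan V’s one missing HBM2 stack (3072-bit versus 4096-bit) costs it against the Tesla parts:

\[
\text{Tesla V100: } \frac{4096~\text{bit} \times 1.75~\text{Gbps}}{8~\text{bit/byte}} \approx 900~\text{GB/s}, \qquad
\text{Titan V: } \frac{3072~\text{bit} \times 1.7~\text{Gbps}}{8~\text{bit/byte}} \approx 653~\text{GB/s}
\]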

With Volta, there’s little sign of anything other than GV100 existing, outside of Tegra Xavier’s Volta iGPU, which is also part of Drive PX Pegasus. So as it stands, Volta is available to the broader public only in the form of the Titan V, though depending on the definition of ‘broader public,’ the $9000 32GB Quadro GV100 released in March might fall under that category too.

Remaking of a Titan: Less Flagship, More Compute

Deep learning and compute aside, there are a few more factors involved in this iteration of the Titan brand. NVIDIA has less need to make a name for itself with the Titan line; the original GTX Titan did exactly that by invoking the NVIDIA K20Xs powering Oak Ridge National Laboratory’s Titan supercomputer, and then setting a new high in performance (and price). Nor is there any particular competitive pressure in pricing or performance – the GeForce GTX 1080 Ti has no direct competition, while the Pascal-based Titan X/Xp has carved out a $1200 price bracket above the previous $1000 mark.

Meanwhile, it’s fair to assume that pushing the reticle limit (815mm2) on a new process node (12nm FFN), with a new microarchitecture and additional HBM2 packaging, results in poor-yielding silicon, and thus fewer options for salvage parts, especially ones that need to be validated at the enterprise level (i.e. Teslas and Quadros). So a more-prosumer-than-consumer Titan V part would be the best – and only – fit, given that its gaming performance isn’t at a $3000 level. Ultimately, as we’ve discussed before, NVIDIA seeds academics, developers, and other researchers at a lower cost-of-entry than Tesla V100s, with the feedback contributing to ecosystem support for Volta. And on that note, while the Titan V’s non-ECC HBM2 and GeForce driver stack are more consumer oriented, the card still directly benefits from software support with frameworks and APIs as part of NVIDIA’s overall deep learning development efforts. Other than NVLink, the Titan V’s main compute functions (FP64, FP16, tensor cores) are uncrippled, which makes sense, as single-node Titan Vs don’t quite cannibalize sales of NVIDIA’s other compute products. And if that were to change with the Quadro GV100, cryptomining demand will ensure that prices are kept apart.

Taking a step back, the approach with Volta doesn’t mesh with NVIDIA’s previous approaches with Pascal and its predecessors. Instead of leading with a compute-centric big die design that could naturally cascade down the consumer stack as smaller GDDR5(X) designs for enthusiast graphics, they went for a gargantuan, low-yielding die with a good amount of silicon area dedicated to brand-new non-graphics functions (i.e. tensor cores). We noted that tensor cores were a calculated bet, and broadly speaking it was the usual tradeoff of lower-margin consumer graphics performance against lower-volume compute, one that NVIDIA could easily afford. The past couple of years have put NVIDIA in pole position for raw consumer graphics performance and mindshare, while years of continued involvement at the forefront of GPU-accelerated deep learning have put them in prime position to implement DL-specialized hardware with corresponding software support.

And as a side note, cryptomining demand has also thrown a wrench into matters, depleting much of the current generation of products for extended periods of time. In turn, the consumer market hasn’t quite been saturated with current generation video cards, leaving NVIDIA in no rush to push out a new GeForce generation. Though with all the microarchitectural improvements over Pascal, I’m sure that Volta with disabled tensor cores could be fielded as a very capable gaming product if necessary – the Titan V is still king of the hill – just not at the same margins as last generation. In any case, NVIDIA’s quarterly financials continue to cite high Pascal GeForce sales, and like all marquee silicon designers, NVIDIA has leapfrogging design teams, the fruits of which we might just see in a few months.

Thinking Deep with GPUs

Whatever the case may be with the next generation of consumer GeForce, the big picture is that both NVIDIA and AMD have publicly stated the necessity of GPU architecture bifurcation – one line for HPC/ML, and one for graphics/gaming. For NVIDIA, considering that Pascal has been around for over two years now, Volta is conspicuously absent from recent speculation over the next GeForce generation. In looking at the Titan V today, it almost seems that NVIDIA’s divergence is imminent. Even in the case of a Volta-based GeForce launch, the implementation of consumer Volta would be a very big hint at the future direction of GPUs, gaming and compute alike. At the very least, it would be a smaller design with far fewer tensor cores – NVIDIA’s RTX technology all but guarantees that at least some tensor cores will show up in consumer parts – and with a GDDR memory controller, at which point the question becomes how much of Volta was optimized for tensor core operations.

As this is our first analysis of the DL performance of any GPU, we have not yet settled on a standard set of benchmark tests, particularly given Volta’s unique tensor cores and mixed precision capability. For this Titan V deep dive, we will be utilizing Baidu DeepBench, as well as tests from NVIDIA’s Caffe2 Docker image, Stanford DAWNBench implementations, and the HPE Deep Learning Benchmark Suite (DLBS).
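
To give a flavor of what a DeepBench-style microbenchmark actually times, the sketch below pushes a single mixed-precision GEMM through cuBLAS with tensor core math enabled – essentially the kind of isolated kernel that DeepBench’s GEMM tests measure. The matrix sizes are arbitrary illustrative values, the API usage reflects the CUDA 9-era interfaces, and input initialization and error checking are omitted for brevity.

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main()
{
    const int m = 4096, n = 4096, k = 4096;       // illustrative problem size
    half  *A, *B;                                  // FP16 inputs
    float *C;                                      // FP32 accumulator/output
    cudaMalloc((void**)&A, sizeof(half)  * m * k);
    cudaMalloc((void**)&B, sizeof(half)  * k * n);
    cudaMalloc((void**)&C, sizeof(float) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Opt in to tensor core math; without this the GEMM runs on the regular CUDA cores.
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,
                 B, CUDA_R_16F, k,
                 &beta,
                 C, CUDA_R_32F, m,
                 CUDA_R_32F,                       // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);   // allow tensor core algorithms
    cudaDeviceSynchronize();                       // a benchmark would time around this call

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

A real harness would initialize the inputs, warm up, and average over many iterations, then report the rate as 2·m·n·k floating-point operations over the measured time – which is why kernel-level DeepBench results are typically presented in TFLOPS rather than wall-clock time.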

But before we dive into the numbers, this is an opportune time to provide some context, of which there is plenty: deep learning and GPUs, the Volta microarchitecture, and the current state of benchmarking DL performance.

Deep Learning, GPUs, and NVIDIA: A Brief Overview
Comments

  • Nate Oh - Wednesday, July 11, 2018 - link

    Thanks for your inquisitive responses throughout :)

    And yes, I was trying to be impartial with AMD's claims about deep learning. Until I have results myself, I offer them a degree of the benefit of the doubt, considering their traditional GPGPU capabilities. Meaning that "image classification for machine learning..." essentially falls under all the deep learning investigations I did for the review. My personal opinion is that 8-bit SAD will be as useful as it was with Kepler/Maxwell in terms of DL acceleration, except with lesser software support; you can make of that what you will. It really gets into the weeds to put AMD's 'machine intelligence' terminology under the scope, and I'd feel more comfortable doing so in an AMD-focused DL/ML investigation. I want to emphasize again that new instructions matter much less in the context of software/library/API support, so the fact that they are absent from the whitepaper directly adds to that observation. If this were a Vega FE DL review, I would certainly pester AMD about that, as much as I put an effort towards TensorRT and FP16 storage/tensor cores here. So encourage AMD to sample me :D

    >TFLOPS

    It is TFLOPS just for DeepBench because that is how Baidu and NV/AMD/Intel present their DeepBench results; you can see for yourselves at the DeepBench Github. We have not independently configured results (for DeepBench) that way, and I apologize if that's how it came across. This also makes it easier to keep us accountable by comparing our results to Baidu's Github. DeepBench is, as stated in the article, completely framework and model agnostic. We use TFLOPS when it is floating point, and we actually use TOPS when it is integer :) I've generalized a bit only because that comment had become so lengthy. This TFLOPS/TOPS usage is limited to solely DeepBench because of how they use pure math kernels, and precisely the reason I included end-to-end results with DAWNBench implementations.

    >Open source

    Indeed, like I've said, I've actually gone and attempted (poorly) to do some dev work myself. The article could *easily* have ballooned to double the length, as well. The point I wanted to convey is exactly what you've picked up with AMD. Given the limited scope of the article (and the lack of direct AMD DL investigations), I want to refrain from saying something outright like, 'one of the main reasons we don't currently use AMD,' but I am just as aware as you are on this point :) This deduction is unsaid but present throughout.
  • Nate Oh - Wednesday, July 11, 2018 - link

    Clarification: "so the fact that citations are absent from the whitepaper"
  • mode_13h - Thursday, July 12, 2018 - link

    > I was trying to be impartial with AMD's claims about deep learning. Until I have results myself, I offer them a degree of the benefit of the doubt, considering their traditional GPGPU capabilities.

    As a member of the tech press, please don't forget your privileged position of being able to request guidance on how to exercise claimed product features. I think this is a fair question and wouldn't impart any bias. Rather, it would help inform readers of how to exploit these features, and also quantify product performance when used as the designers intended.

    I think it's also fair to ask if they can provide any references (either implementations or papers) to support their claims regarding how SAD can be utilized in machine learning, in cases of doubt.

    Again, I'm saying this mostly in anticipation of your future Vega coverage, whether you choose to follow up with Vega 10, or perhaps you only revisit the matter with Vega 20.

    As for searching & sifting through the sources of MIOpen, I think that's "over and above" what's expected. I'm just pointing out that, sometimes, it's actually surprisingly easy to answer questions by doing simple text searches on the source code. Sometimes, like when checking whether a certain instruction is emitted, it's also possible to save the generated assembly language and search *that*.
  • Demiurge - Friday, July 20, 2018 - link

    Nate gets paid to educate and discuss with you, I don't, but more importantly to me, I made my point that Vega is not "underwhelming" for DL.

    Why should I *convince* you? I don't *need* to convince you. You didn't state Vega was "underwhelming" for DL.
  • Nate Oh - Monday, July 9, 2018 - link

    To put it lightly, use of FP16 in DL training is not on the same level as use of INT8 in training; the latter is basically pure research and highly niche to those specific implementations. FP16 training (with NVIDIA GPUs) has reached a level of maturity and practicality where there is out-of-the-box support for most major frameworks. FP16 training and INT8 inferencing is the current understanding of lower-precision applicability in DL.

    More specifically, the whole field of lower-precision DL training/inference is all about making lower-precision datatypes more important, so of course that's the case for INT8/FP8. FP16 is already relevant for real-world training in certain scenarios; some researchers are *trying* to make INT8 relevant for real-world training in certain scenarios. As mode_13h said, that paper is a custom 8-bit datatype used to approximate 32-bit gradients for parameter updates during the backprop, specifically to speed-up inter-GPU communication for greater parallelism. AKA it is not usage of 8-bit datatypes all around, it's very specific to one aspect. It's essentially a proof-of-concept and pure research. Using INT16 for everything is hard enough; some people (see below) were able to use a custom INT16 format and use INT16/INT32 FMA. And yes, sometimes, companies don't distinguish inference and training as clearly as they should, with the resulting perception of superior general DL performance.

    In any case, DP4A is not really used in training at all, and it wasn't designed to do so anyway. You can 'make' an exception for research papers like the one you cited, but you can always find niche exceptions in research because that is its purpose. It was designed for inferencing acceleration and as product segmentation for non-GP100 GPUs. Even now, it's pushed for working with a model that TensorRT has converted from higher-precision to INT8.

    (I am splitting this comment up to respond separately on the topic of Vega/instruction set support, but both comments should be considered in tandem)

    References/Links

    https://software.intel.com/en-us/articles/lower-nu...
    https://ai.intel.com/lowering-numerical-precision-...
    http://dawn.cs.stanford.edu/2018/03/09/low-precisi...
    https://www.tensorflow.org/performance/quantizatio...
    https://arxiv.org/pdf/1802.00930.pdf (Custom datatype for INT16/INT32 mixed precision training)
    http://on-demand.gputechconf.com/gtc/2017/presenta...
    https://devblogs.nvidia.com/int8-inference-autonom...
    https://devblogs.nvidia.com/mixed-precision-progra... (Introduction of DP4A/DP2A)
  • mode_13h - Tuesday, July 10, 2018 - link

    > ... DP4A is not really used in training at all ... It was designed for inferencing acceleration and as product segmentation for non-GP100 GPUs.

    You mean segmentation of GP100 vs. GP102+ ? Or are you saying it's lacking in some of the smaller Pascal GPUs, like GP107? And *why* isn't it listed in the CUDA compute capabilities table (https://docs.nvidia.com/cuda/cuda-c-programming-gu... Grrr!

    Regardless, given that GV100 has it, I get the sense that it was simply an evolution that came too late for the GP100.

    Finally, thank you for another thoughtful and detailed reply.
  • Ryan Smith - Tuesday, July 3, 2018 - link

    The Titan V is such a niche card that I'm not surprised to hear NV hasn't prepared macOS drivers. There are good reasons for them to have drivers ready for their consumer hardware - they need to do the work anyhow to support existing products and make sure they're ready to take on a new Apple contract if they win it - but the Titan V/GV100 will never end up in a Mac. So adding it to the Mac drivers would be of little benefit.
  • Flunk - Tuesday, July 3, 2018 - link

    I'm surprised any cards not shipped in Mac Models have Mac drivers anymore. It's not like you can add a PCI-E video card to any recent Mac.
  • Strunf - Wednesday, July 4, 2018 - link

    Thunderbolt allows for an external PCI-E card but there's probably just a few guys ready to do this kind of thing...
  • ImSpartacus - Tuesday, July 3, 2018 - link

    Is the new 32GB V100 still on SXM2?

    Several sites mentioned SXM3 in reference to the 32GB refresh of V100, but it's hard to find details on what improved (if anything).
