NVIDIA Launches Tesla K20 & K20X: GK110 Arrives At Last
by Ryan Smith on November 12, 2012 9:00 AM EST

Continuing our SC12-related coverage today, while AMD made the first GPU announcement of the day, they are not the only one with news. NVIDIA is also using the venue to launch their major GPU compute product for the year: Tesla K20.
We first saw Tesla K20 at NVIDIA’s 2012 GPU Technology Conference, where NVIDIA announced the K20 alongside the already-shipping K10. At that point NVIDIA was still bringing up the GPU behind K20 – GK110 – and the early announcement at GTC served as a preview of its functionality, priming the pump for developers. So while we knew quite a bit about what GK110 could do, we did not yet know its pricing, configuration, or performance.
More recently, with K20 complete, NVIDIA dedicated most of the initial allocation to Oak Ridge National Laboratory’s Titan supercomputer, completing delivery on a contract years in the making. As it turned out, K20 was quite powerful indeed: with the GPUs providing some 90% of Titan’s computational throughput, the system has taken the #1 spot on the fall Top500 supercomputer list.
This brings us to today. With Titan complete, NVIDIA can now turn their attention and their GPU allocations towards making the Tesla K20 family available to the public at large. With SC12 and the announcement of the new Top500 list as their backdrop, NVIDIA is officially launching the Tesla K20 family of compute GPUs.
NVIDIA Tesla Family Specification Comparison

| | Tesla K20X | Tesla K20 | Tesla M2090 | Tesla M2070Q |
|---|---|---|---|---|
| Stream Processors | 2688 | 2496 | 512 | 448 |
| Core Clock | 732MHz | 706MHz | 650MHz | 575MHz |
| Shader Clock | N/A | N/A | 1300MHz | 1150MHz |
| Memory Clock | 5.2GHz GDDR5 | 5.2GHz GDDR5 | 3.7GHz GDDR5 | 3.13GHz GDDR5 |
| Memory Bus Width | 384-bit | 320-bit | 384-bit | 384-bit |
| VRAM | 6GB | 5GB | 6GB | 6GB |
| Single Precision | 3.95 TFLOPS | 3.52 TFLOPS | 1.33 TFLOPS | 1.03 TFLOPS |
| Double Precision | 1.31 TFLOPS (1/3) | 1.17 TFLOPS (1/3) | 665 GFLOPS (1/2) | 515 GFLOPS (1/2) |
| Transistor Count | 7.1B | 7.1B | 3B | 3B |
| TDP | 235W | 225W | 250W | 225W |
| Manufacturing Process | TSMC 28nm | TSMC 28nm | TSMC 40nm | TSMC 40nm |
| Architecture | Kepler | Kepler | Fermi | Fermi |
| Launch Price | >$3199 | $3199? | N/A | N/A |
When NVIDIA first announced K20 back in May we were given a number of details about the GK110 GPU that would power it, but because NVIDIA was still in the process of bringing up the final silicon for GK110, we knew little about the shipping configuration. What we could say for sure was that GK110 was being built with 15 SMXes, 6 memory controllers, and 1.5MB of L2 cache, and that it would offer double precision (FP64) performance at 1/3rd its single precision (FP32) rate. Now, with the launch of the K20 family, we finally have the details on the shipping configurations.
First and foremost, K20 will not be a single GPU but rather a family of GPUs. NVIDIA has split what was previously announced as a single product into two: K20 and K20X. K20X is the more powerful of the two, featuring 14 active SMXes along with all 6 memory controllers and 1.5MB of L2 cache, attached to 6GB of GDDR5. It will be clocked at 732MHz for the core clock and 5.2GHz for the memory clock. This sets a very high bar for theoretical performance: FP32 performance of 3.95 TFLOPS and FP64 performance of 1.31 TFLOPS, fed by some 250GB/sec of memory bandwidth. For those of you who have kept an eye on Titan, these are the same specs as the GPUs in Titan, and though NVIDIA would not name it at the time we can now confirm that Titan is in fact composed of K20X GPUs and not K20.
Below K20X will be the regular K20. K20 gives up 1 SMX and 1 memory controller, leaving it with 13 SMXes, 5 memory controllers, 1.25MB of L2 cache, and 5GB of GDDR5. It will also be clocked slightly lower than K20X, with a shipping core clock of 706MHz, while the memory clock is held at 5.2GHz. This gives K20 theoretical performance numbers of around 3.52 TFLOPS for FP32 and 1.17 TFLOPS for FP64, fed by 208GB/sec of memory bandwidth.
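As a quick sanity check on where those theoretical numbers come from – assuming the standard GK110 arrangement of 192 FP32 CUDA cores and 64 FP64 units per SMX – K20X works out to 2688 cores × 2 FLOPs per clock × 732MHz ≈ 3.94 TFLOPS FP32 (rounded up to 3.95 in NVIDIA’s spec sheets) and 896 FP64 units × 2 × 732MHz ≈ 1.31 TFLOPS FP64, while a 384-bit bus at 5.2GHz effective delivers 48 bytes × 5.2 ≈ 250GB/sec. Run the same math on K20’s 2496 cores, 706MHz core clock, and 320-bit bus and you land on roughly 3.52 TFLOPS, 1.17 TFLOPS, and 208GB/sec.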
This split ends up being very similar to what NVIDIA eventually did with the Fermi generation of Tesla products such as the M2090 and M2075, spacing their products not only by performance and pricing, but also by power consumption. K20X will be NVIDIA’s leading Tesla K20 product, offering the best performance at the highest power consumption (235W). K20 meanwhile will be cheaper, a bit slower, and perhaps most importantly lower power at 225W. On that note, despite the fact that the difference is all of 10W, 225W is a very important cutoff in the HPC space – many servers and chassis are designed around that being their maximum TDP for PCIe cards – so it was important for NVIDIA to offer as fast a card as possible at this TDP, alongside the more powerful but more power hungry K20X. This tiered approach also enables the usual binning tricks, allowing NVIDIA to do something with chips that won’t hit the mark for K20X.
Moving on, at the moment NVIDIA is showing off the passively cooled K20 family design, confirming in the process that both K20 and K20X can be passively cooled, as is the standard for servers. NVIDIA’s initial focus for the Tesla K20 is going to be on servers (it is SC12, after all), but with K20 also being an integral part of NVIDIA’s next-generation Maximus strategy, we’re sure to see actively cooled workstation models soon enough.
73 Comments
dcollins - Monday, November 12, 2012 - link
It should be noted that recursive algorithms are not always more difficult to understand than their iterative counterparts. For example, the quicksort algorithm used in NVIDIA's demos is extremely simple to implement recursively but somewhat tricky to get right with loops.

The ability to directly spawn sub-kernels has applications beyond supporting recursive GPU programming. I could see how the ability to create your own workers would simplify some problems and leave the CPU free to do other work. Imagine an image processing problem where a GPU kernel could do the work of sharding an image and distributing it to local workers, instead of relying on a (comparatively) distant CPU to perform that task.
In the end, this gives more flexibility to GPU compute programs which will eventually allow them to solve more problems, more efficiently.
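To make the sharding idea concrete, here is a minimal sketch of what a kernel-spawned tiling scheme could look like using CUDA dynamic parallelism, the GK110 feature under discussion. The kernel names, tile size, and the trivial per-pixel operation are invented for illustration; it assumes a compute capability 3.5 device and compilation with `-arch=sm_35 -rdc=true -lcudadevrt`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-tile worker: one thread per pixel of its tile.
__global__ void processTile(float* image, int width, int height,
                            int tileX, int tileY, int tileSize)
{
    int x = tileX * tileSize + blockIdx.x * blockDim.x + threadIdx.x;
    int y = tileY * tileSize + blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height &&
        x < (tileX + 1) * tileSize && y < (tileY + 1) * tileSize)
        image[y * width + x] *= 0.5f;     // placeholder per-pixel work
}

// Parent kernel: one thread per tile shards the image and launches a child
// grid for its tile -- no round trip to the CPU in between.
__global__ void shardImage(float* image, int width, int height, int tileSize)
{
    int tileX = blockIdx.x * blockDim.x + threadIdx.x;
    int tileY = blockIdx.y * blockDim.y + threadIdx.y;
    if (tileX * tileSize >= width || tileY * tileSize >= height) return;

    dim3 block(16, 16);
    dim3 grid((tileSize + 15) / 16, (tileSize + 15) / 16);
    processTile<<<grid, block>>>(image, width, height, tileX, tileY, tileSize);
}

int main()
{
    const int width = 1024, height = 1024, tileSize = 256;
    float* d_image;
    cudaMalloc(&d_image, width * height * sizeof(float));
    cudaMemset(d_image, 0, width * height * sizeof(float));

    dim3 tiles(4, 4);                        // 16 tiles of 256x256 pixels
    shardImage<<<1, tiles>>>(d_image, width, height, tileSize);
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_image);
    return 0;
}
```

The point is simply that the parent kernel decides how the image is sharded and launches the child grids itself, with no host involvement in between.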
mayankleoboy1 - Monday, November 12, 2012 - link
We need compilers that can work on GPGPU to massively speed up compilation times.

Loki726 - Tuesday, November 13, 2012 - link
I'm working on this. It actually isn't as hard as it might seem at first glance.

The amount of parallelism in many compiler optimizations scales with program size, and the simplest algorithms basically boil down to for(all instructions/functions/etc) { do something; } (a minimal sketch of that pattern appears after the lists below). Everything isn't so simple though, and it still isn't clear whether there are parallel versions of some algorithms that are as efficient as their sequential implementations (value numbering is a good example).
So far the following work very well on a massively parallel processor:
- instruction selection
- dataflow analysis (live sets, reaching defs)
- control flow analysis
- dominance analysis
- static single assignment conversion
- linear scan register allocation
- strength reduction, instruction simplification
- constant propagation (local)
- control flow simplification
These are a bit harder and need more work:
- register allocation (general)
- instruction scheduling
- instruction subgraph isomorphism (more general instruction selection)
- subexpression elimination/value numbering
- loop analysis
- alias analysis
- constant propagation (global)
- others
Some of these might end up being easy, but I just haven't gotten to them yet.
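For anyone curious what the "for(all instructions) { do something; }" shape looks like in practice, below is a minimal, invented sketch of a local pass – constant folding plus one strength-reduction rewrite – written in CUDA with one thread per instruction over a toy flat IR. The `Instr` struct and the rules are made up purely for illustration and are not from Loki726's codebase; a real IR carries operand references rather than inline constants, which is exactly where the harder dataflow and aliasing problems come in.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy flat IR: an opcode, two immediate operands, and pass-output fields.
enum Op { ADD, MUL, SHL };

struct Instr {
    Op   op;
    int  lhs, rhs;        // operand values (pretend they are immediates)
    bool constOperands;   // true if both operands are compile-time constants
    bool folded;          // set by the pass when the result is computed
    int  result;          // folded value
};

// One thread per instruction: the "for (all instructions) { do something; }"
// pattern -- local constant folding plus one strength-reduction rewrite.
__global__ void simplify(Instr* code, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Instr ins = code[i];

    if (ins.constOperands) {                     // constant folding
        switch (ins.op) {
            case ADD: ins.result = ins.lhs + ins.rhs; break;
            case MUL: ins.result = ins.lhs * ins.rhs; break;
            case SHL: ins.result = ins.lhs << ins.rhs; break;
        }
        ins.folded = true;
    } else if (ins.op == MUL && ins.rhs == 8) {  // strength reduction: x*8 -> x<<3
        ins.op  = SHL;
        ins.rhs = 3;
    }
    code[i] = ins;
}

int main()
{
    Instr host[] = {
        { ADD, 2, 3, true,  false, 0 },   // foldable: 2 + 3
        { MUL, 0, 8, false, false, 0 },   // strength-reducible: x * 8
        { MUL, 4, 5, true,  false, 0 },   // foldable: 4 * 5
    };
    const int n = sizeof(host) / sizeof(host[0]);

    Instr* dev;
    cudaMalloc(&dev, sizeof(host));
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
    simplify<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i)
        printf("instr %d: op=%d folded=%d result=%d\n",
               i, host[i].op, host[i].folded, host[i].result);
    cudaFree(dev);
    return 0;
}
```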
The language frontend would also require a lot of work. It has been shown that it is possible to parallelize parsing, but writing a parallel parser for a language like C++ would be a very challenging software design project. It would probably make more sense to build a parallel parser generator for a framework like Bison or ANTLR than to do it by hand.
eachus - Wednesday, November 14, 2012 - link
I've always assumed that the way to do compiles on a GPU or other heavily parallel CPU is to do the parsing in a single sequential process, then spread the semantic analysis and code generation over as many threads as you can.

I may be biased in this since I've done a lot of work with Ada, where adding (or changing) a 10 line file can cause hundreds of re-analysis/code-generation tasks. The same thing can happen in any object-oriented language. A change to a class library, even just adding another entry point, can cause all units that depend on the class to be recompiled to some extent. In Ada you can often bypass the parser, but there are gotchas when the new function has the same (simple) name as an existing function, but a different profile.
Anyway, most Ada compilers, including the GNAT front-end for GCC will use as many CPU cores as are available. However, I don't know of any compiler yet that uses GPUs.
Loki726 - Thursday, November 15, 2012 - link
The language frontend (semantic analysis and IR generation, not just parsing) for C++ is generally harder than for languages that have a concept of imports/modules or interfaces, because you typically need to parse all included files for every object file. This is especially true for big code bases (with template libraries).

GPUs need thousands of threads' worth of parallelism rather than just one dimension per file/object, so it is necessary to extract parallelism at a much finer granularity (e.g. per instruction/value).
A major part of the reason why compilers that run on the GPU don't exist yet is that compilers are typically large/complex codebases that don't map well onto parallel models like OpenMP/OpenACC etc. The compilers for many languages like OpenCL are also immature enough that writing and debugging a large codebase like this would be intractable.
CUDA is about the only language right now that is stable enough and has enough language features (dynamic memory allocation, object-oriented programming, templates) to try. I'm writing all of the code in C++ right now and hoping that CUDA will eventually cease to be a restricted subset of C++ and just become C++ (all it is missing is exceptions, the standard library, and some minor features that other compilers are also lax about supporting).
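As a small illustration of the CUDA C++ feature set being described – templates, classes, and device-side dynamic memory allocation – here is a toy sketch of a templated stack allocated from the device heap inside a kernel. The class and kernel are hypothetical, not taken from any real compiler codebase; it should build for any compute capability 2.0+ device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A tiny templated stack that lives in device heap memory. Illustrative only:
// not tuned, and not shared safely across threads.
template <typename T>
class DeviceStack {
public:
    __device__ explicit DeviceStack(int capacity)
        : data_(new T[capacity]), size_(0) {}
    __device__ ~DeviceStack() { delete[] data_; }
    __device__ void push(T v)       { data_[size_++] = v; }
    __device__ T    pop()           { return data_[--size_]; }
    __device__ bool empty() const   { return size_ == 0; }
private:
    T*  data_;
    int size_;
};

__global__ void worklistDemo(int* out)
{
    if (threadIdx.x != 0) return;       // single thread, for clarity
    DeviceStack<int> stack(16);         // allocated from the device heap
    for (int i = 1; i <= 4; ++i) stack.push(i * i);
    int sum = 0;
    while (!stack.empty()) sum += stack.pop();
    *out = sum;                         // 1 + 4 + 9 + 16 = 30
}

int main()
{
    int* d_out;
    int  h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    worklistDemo<<<1, 32>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_out);        // expect 30
    cudaFree(d_out);
    return 0;
}
```

Device-side new/delete draw from a fixed-size device heap (adjustable via cudaDeviceSetLimit with cudaLimitMallocHeapSize), which is one of the practical constraints a compiler running on the GPU would have to live with.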
CeriseCogburn - Thursday, November 29, 2012 - link
Don't let the AMD fans see that about OpenCL sucking so badly and being immature.

It's their holy grail of hatred against nVidia/Cuda.
You might want to hire some protection.
I can only say it's no surprise to me, as the amd fanboys are idiots 100% of the time.
Now as amd crashes completely, gets stuffed in a bankruptcy, gets dismantled and bought up as its engineers are even now being pillaged and fired, the sorry amd fanboy has "no drivers" to look forward to.
I sure hope their 3G ram 79xx "futureproof investment" they wailed and moaned about being the way to go for months on end will work with the new games...no new drivers... 3rd tier loser engineers , sparse crew, no donuts and coffee...
*snickering madly*
The last laugh is coming, justice will be served !
I'd just like to thank all the radeon amd ragers here for all the years of lies and spinning and amd tokus kissing, the giant suction cup that is your pieholes writ large will soon be able to draw a fresh breath of air, you'll need it to feed all those delicious tears.
ROFL
I think I'll write the second edition of "The Joys of Living".
inaphasia - Tuesday, November 13, 2012 - link
Everybody seems to be fixated on the fact that the K20 doesn't have ALL its SMXes enabled, assuming this is the result of binning/poor yields, whatever...

AFAICT the question everybody should be asking, and the one I'd love to know the answer to, is:
Why does the TFLOP/W ratio actually IMPROVE when nVidia does that?
Watt for Watt the 660Ti is slightly better at compute than the 680 and far better than the 670, and we all know they are based on the "same" GK104 chip. Why? How?
My theory is that even if TSMC's output of the GK110 was golden, we'd still be looking at disabled SMXes. Of course, since it's just a theory, it could very well be wrong.
frenchy_2001 - Tuesday, November 13, 2012 - link
No, you are probably right.

Products are more than their raw capabilities. When GF100 came out, Nvidia placed a 480-core version (out of 512) in the consumer market (at 700MHz+) and a 448-core version at 575MHz in the Quadro 6000. Power consumption, reliability and longevity were all part of that decision.
This is part of what was highlighted in the article as a difference between K20X and K20, the 235W vs 225W makes a big difference if your chassis is designed for the latter.
Harry Lloyd - Tuesday, November 13, 2012 - link
Can you actually play games with these cards (drivers)?

I reckon some enthusiasts would pick this up.
Ryan Smith - Wednesday, November 14, 2012 - link
Unfortunately not. If nothing else, because there aren't any display outputs.