Investigating Performance of Multi-Threading on Zen 3 and AMD Ryzen 5000

Author: Dr. Ian Cutress

by Dr. Ian Cutress on December 3, 2020 10:00 AM EST

CPU Performance

For simplicity, we are listing the percentage performance differentials in all of our CPU testing – the number shown is the % performance of having SMT2 enabled compared to having the setting disabled. Our benchmark suite consists of over 120 tests, full details of which can be found in our #CPUOverload article.
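
As a rough illustration of how such a differential can be derived, the sketch below normalises an SMT-on result against its SMT-off baseline. The function name, the `higher_is_better` flag, and the sample scores are hypothetical and are not taken from our test harness; for time-based results, where lower is better, the ratio is simply inverted.

```python
def relative_performance(smt_off: float, smt_on: float,
                         higher_is_better: bool = True) -> float:
    """Return SMT-on performance as a percentage of the SMT-off baseline.

    100.0 means no change; values above 100.0 mean SMT on was faster.
    """
    if smt_off <= 0 or smt_on <= 0:
        raise ValueError("benchmark results must be positive")
    # Scores: on / off. Run times: off / on, so that "faster" is still > 100%.
    ratio = smt_on / smt_off if higher_is_better else smt_off / smt_on
    return round(ratio * 100.0, 1)


# Hypothetical score-based result (higher is better).
print(relative_performance(1000.0, 1186.0))
# Hypothetical time-based result in seconds (lower is better).
print(relative_performance(200.0, 160.0, higher_is_better=False))
```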

Here are the single threaded results.

Single Threaded Tests AMD Ryzen 9 5950X
Benchmark	SMT Off (Baseline)	SMT On
y-Cruncher	100%	99.5%
Dwarf Fortress	100%	99.9%
Dolphin 5.0	100%	99.1%
CineBench R20	100%	99.7%
Web Tests	100%	99.1%
GeekBench (4+5)	100%	100.8%
SPEC2006	100%	101.2%
SPEC2017	100%	99.2%

Interestingly enough, our single threaded performance was within roughly a single percentage point across the stack (SPEC2006 being the largest difference at +1.2%). Given that ST mode should arguably give more resources to each thread for consistency, the fact that we see no meaningful difference means that AMD’s implementation of giving a single thread access to all the resources even in SMT mode is quite good.

The multithreaded tests are a bit more diverse:

Multi-Threaded Tests AMD Ryzen 9 5950X
Benchmark	SMT Off (Baseline)	SMT On
Agisoft Photoscan	100%	98.2%
3D Particle Movement	100%	165.7%
3DPM with AVX2	100%	177.5%
y-Cruncher	100%	94.5%
NAMD AVX2	100%	106.6%
AIBench	100%	88.2%
Blender	100%	125.1%
Corona	100%	145.5%
POV-Ray	100%	115.4%
V-Ray	100%	126.0%
CineBench R20	100%	118.6%
HandBrake 4K HEVC	100%	107.9%
7-Zip Combined	100%	133.9%
AES Crypto	100%	104.9%
WinRAR	100%	111.9%
GeekBench (4+5)	100%	109.3%

Here we have a number of different factors affecting the results.

Starting with the two tests that scored statistically worse with SMT2 enabled: y-Cruncher and AIBench. Both tests are memory-bound in some parts and compute-bound in others, and the memory bandwidth per thread can become a limiting factor in overall run-time. y-Cruncher is arguably a synthetic math benchmark, and AIBench is still a set of early-beta AI workloads for Windows, so quite far away from real-world use cases.

Most of the rest of the benchmarks see a gain of between +5% and +35%, which includes a number of our rendering tests, molecular dynamics, video encoding, compression, and cryptography. This is where we can see both threads on each core interleaving inside the buffers and execution units, which is the goal of an SMT design. There are still some bottlenecks in the system that prevent both threads from getting absolute full access, which could be buffer size, retire rate, op-queue limitations, memory limitations, etc – each benchmark is likely different.

The outliers are 3DPM, 3DPM with AVX2, and Corona. All three gain 45% or more, with 3DPM at roughly +66% and its AVX2 variant at +77.5%. These tests are very light on cache and memory requirements, and put the increased Zen 3 execution port distribution to good use. These benchmarks are compute heavy as well, so splitting some of that memory access and compute in the core helps SMT2 designs mix those operations to a greater effect. The fact that 3DPM in AVX2 mode gets a higher benefit might be down to coalescing operations for an AVX2 load/store implementation – there is less waiting to pull data from the caches, and less contention, which adds to some extra performance.
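
To make the three groups above easier to see, here is a minimal sketch that sorts the multi-threaded table into buckets. The percentages are copied from the table; the ±2% noise band is our own assumption for the purpose of illustration (it is what keeps Agisoft at 98.2% out of the regression bucket), and the 45% outlier cut-off simply mirrors the grouping described in the text.

```python
# SMT-on results from the multi-threaded table (SMT off = 100%).
MT_RESULTS = {
    "Agisoft Photoscan": 98.2,
    "3D Particle Movement": 165.7,
    "3DPM with AVX2": 177.5,
    "y-Cruncher": 94.5,
    "NAMD AVX2": 106.6,
    "AIBench": 88.2,
    "Blender": 125.1,
    "Corona": 145.5,
    "POV-Ray": 115.4,
    "V-Ray": 126.0,
    "CineBench R20": 118.6,
    "HandBrake 4K HEVC": 107.9,
    "7-Zip Combined": 133.9,
    "AES Crypto": 104.9,
    "WinRAR": 111.9,
    "GeekBench (4+5)": 109.3,
}

NOISE_BAND = 2.0     # assumption: changes smaller than 2% are treated as noise
OUTLIER_GAIN = 45.0  # mirrors the "45% or more" group discussed in the text


def classify(percent: float) -> str:
    """Place one SMT-on result into a bucket based on its gain over 100%."""
    gain = percent - 100.0
    if gain <= -NOISE_BAND:
        return "regression"
    if gain < NOISE_BAND:
        return "neutral"
    if gain >= OUTLIER_GAIN:
        return "outlier"
    return "moderate"


groups = {}
for name, pct in MT_RESULTS.items():
    groups.setdefault(classify(pct), []).append(name)

for label in ("regression", "neutral", "moderate", "outlier"):
    print(f"{label}: {', '.join(groups.get(label, []))}")
```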

Overall

In an ideal world, both threads on a core will have full access to all resources, and not block each other. However, that just means that the second thread looks like it has its own core completely. The reverse SMT method, of using one global core and splitting it into virtual cores with no contention, is known as VISC, and the company behind that was purchased by Intel a few years ago, but nothing has come of it yet. For now, we have SMT, and by design it will accelerate some key workloads when enabled.

In our CPU results, the single threaded benchmarks showed no meaningful difference between SMT enabled and disabled in our real-world or synthetic workloads. This means that even in SMT enabled mode, if one thread is running, it gets everything the core has on offer.

For multi-threaded tests, there is clearly a spectrum of workloads that benefit from SMT.

Those that don’t are either hyper-optimized on a one-thread-per-core basis, or memory latency sensitive.

Most real-world workloads see a small uplift, an average of 22%. Rendering and ray tracing can vary depending on the engine, and how much bandwidth/cache/core resources each thread requires, potentially moving the execution bottleneck somewhere else in the chain. Execution-limited tests that don’t probe memory or the cache at all, which to be honest are most likely to be hyper-optimized compute workloads, scored up to +77% in our testing.
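
For readers who want to check the arithmetic, the sketch below computes the mean uplift over just the 16 multi-threaded rows listed on this page. It is an illustration over the published table only; the variable names are our own, and because the average quoted above presumably draws on a wider set of tests, the two figures are not expected to match exactly.

```python
from statistics import geometric_mean, mean

# SMT-on percentages from the multi-threaded table, in table order.
smt_on_percent = [
    98.2, 165.7, 177.5, 94.5, 106.6, 88.2, 125.1, 145.5,
    115.4, 126.0, 118.6, 107.9, 133.9, 104.9, 111.9, 109.3,
]

# Arithmetic mean: every test counts equally on a linear scale.
arithmetic_uplift = mean(smt_on_percent) - 100.0

# Geometric mean: the usual choice for averaging ratios, and less
# sensitive to a few very large gains such as 3DPM.
geometric_uplift = geometric_mean(smt_on_percent) - 100.0

print(f"arithmetic mean uplift: {arithmetic_uplift:+.1f}%")
print(f"geometric mean uplift:  {geometric_uplift:+.1f}%")
```

The geometric mean comes out below the arithmetic mean because the handful of large outliers pull the latter up.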

126 Comments

GeoffreyA - Tuesday, December 8, 2020 - link
There's a single set of 4 decoders. In SMT mode, I believe some sharing is going in. This is from the original Zen design:

https://images.anandtech.com/doci/10591/HC28.AMD.M...
GeoffreyA - Tuesday, December 8, 2020 - link
* going on
naive dev - Wednesday, December 9, 2020 - link
Right, I found that article as well and from that slide it looks like the decoder would be shared. But then that slide was from 2017, so that might have changed.

It looks though as if the decoder could decode those 4 instructions from a single program counter only, right? It's not like the decoder could decode e.g. 2 instructions from program counter 1 and another 2 instructions from program counter 2?
GeoffreyA - Thursday, December 10, 2020 - link
I'm not too sure how the implementation works, but I expect they're shuffling both threads through the decoder at roughly the same time. The decoder has four units (I think 1 complex and 3 simple). As far as I'm aware, that has stayed the same in both Zen 2 and 3.
mapesdhs - Thursday, December 10, 2020 - link
Ian, a question about Handbrake, though it may not apply to the type of test you used. I've read that Handbrake doing an h264 encode can only use 16 threads max. Does this mean that in theory one could run two separate h264 encodes on a 5950X and thus obtain a good overall throughput speedup? Have you tried such a thing? Or might this only work if it were possible to force one encode to only use the 16 threads of one 8c block (CCX?), and the other encode to use the rest? ie. so that the separate encodes are not fighting over the same cores or indeed the same CCX-shared L3? Is it possible to force this somehow? Also, if the claimed 16 thread limit for h264 is true, is there a performance difference for a single h264 encode between SMT on vs. off just in general? ie. with it on, is the OS smart enough to ensure that the 16 threads are spread across all the cores evenly rather than being scrunched onto fewer cores because reasons? If not, then turning SMT off might speed it up. Note that I'm using Windows for all this.

I don't know if any of this applies to h265, but atm the encoding I do is still 1080p. I did an analysis of all available Ryzen CPUs based on performance, power consumption and cost (I ruled out Intel partly due to the latter two factors but also because of a poor platform upgrade path) and found that although the 5900X scored well, overall it was beaten by the 2700X, mainly because the latter is so much cheaper. However, the 5950X would look a lot better if one could run two encodes on it at the same time without clashing, but review articles naturally never try this. I wish I could test it, but the only 16c system I have is a dual-socket S2011 setup with two 2560 v2s, so the separate CPUs introduce all sorts of other issues (NUMA and suchlike).

I found something similar a long time ago when I noticed one could run six separate Maya frame renders on a 24-CPU SGI rack Onyx (essentially one render per CPU board), compared to running a single render on a quad-CPU (single board) deskside Onyx, giving a good overall throughput increase (the renderer being limited to 4 CPUs per job). See:

http://www.sgidepot.co.uk/perfcomp_RENDER4_maya1.h...

Funny actually, re what you say about an overly good speedup perhaps implying a less than optimal core design. Something odd about SGIs is how many times on a multi CPU system one can obtain better results by using more threads than there are CPUs, bearing in mind MIPS CPUs from that era did not have SMT, ie. the CPUs kinda behave as if they do have SMT even though they don't. I found this behaviour occurred most for Blender and C-Ray.

So anyway, it would be great if it were possible to run two h264 encodes on a 5950X at the same time, but there's probably no point if the OS doesn't spread out the loads in a sensible manner, or if in that circumstance there isn't a way to force each encode to use a separate CCX.

All very specific to my use case of course, but I have hundreds of hours of material to convert, so the ability to get twice the throughput from a 5950X would make that CPU a lot more interesting; so far reviews I've read show it to be about 2x faster than the 2700X for h264 Handbrake (just one encode of course), but it costs 4.4x more, rather ruining the price/performance angle. And if it does work then I guess one could ask the same question of TR - could one run eight separate h264 encodes on a future Zen3 TR without the thread management being a total mess? :D I'm assuming it probably wouldn't be so good with the older Zen2 design given the split L3.
GeoffreyA - Sunday, December 13, 2020 - link
Interesting question. Would be nice if someone could give this a test on 16-core Ryzen or TR, and see what happens. Yesterday, I was able to take both FFmpeg and Handbrake up to 128 threads, and it does work; but, having only a 4-core, 4-thread CPU, can't comment.*

As for x264's performance limit, I'm not sure at what number of threads it begins to flag; but, quality wise, using too many (say, over 16 at 1080p) is not advisable. According to the x264 developers, vertical resolution / threads shouldn't fall below 40-50 and certainly not below 30.

https://forum.doom9.org/showthread.php?p=1213185#p...

forum.doom9.org/showthread.php?p=1646307#post1646307

More posts on high core counts:

forum.doom9.org/showthread.php?t=173277

forum.doom9.org/showthread.php?t=175766

* As far as I know, Windows schedules threads all right. From 1903, on Zen 2, one CCX is supposed to be filled up, then another. I imagine 16 threads will be spread across two CCXs in the 5950X. FFmpeg's --threads switch could prove useful too.
GeoffreyA - Sunday, December 13, 2020 - link
-threads, not --threads

Here are links set out better (thought they'd link in the comment):

https://forum.doom9.org/showthread.php?p=1213185#p...

https://forum.doom9.org/showthread.php?p=1646307#p...

https://forum.doom9.org/showthread.php?t=173277

https://forum.doom9.org/showthread.php?t=175766
karthikpal - Friday, December 11, 2020 - link
Nice content bro
deil - Sunday, December 13, 2020 - link
I wonder when SMT4 will hit the market: a model with 3 copies of most things on the die, in a ring configuration (fp/int/fp/int) with the cache inside the ring. ST would then have a chance to use 2 FP modules for a single int processor part (when others don't use it, ofc).
This kind of setup would have very interesting performance numbers at least. I am not saying it's a good idea, but an interesting one for sure.
Machinus - Sunday, December 13, 2020 - link
This article omits one of the basic considerations in any manually-configured and custom-cooled desktop system: achieving uniform, predictable thermal behavior. Unless you are building servers to perform only one or two specific types of mathematical operations, and can build, configure, and stress test on those instruction types alone, you need high confidence that the chip will never exceed the thermal flux densities of the cooling system you built. Fixed-clock systems with a static number of available cores have much more consistent thermal performance than chips whose clocks, and number of threads, are free-floating. This reduces your peak flops, but it significantly extends system lifetime. HEDT and HPC systems have double or triple-digit core counts per socket in 2020; SMT is not worth paying the price of reduced hardware lifetime unless you are building extremely specialized calculation servers.
