AMD Comments on Threadripper 2 Performance and Windows Scheduler
by Ian Cutress on January 14, 2019 9:00 AM EST- Posted in
- CPUs
- AMD
- Trade Shows
- EPYC
- Threadripper 2
- CES 2019
Users may have been following the efforts of Wendell from Level1Tech to work out why some benchmarks show regressed performance on quad-die Threadripper 2 compared to dual-die configurations. Through his research, he found that the problem was limited to Windows, as cross-platform software did not show the issue on Linux, and that it was not limited to Threadripper 2: quad-die EPYCs were also affected.
At the time, most journalists and analysts noted that performance was lower and that the Linux/Windows differences existed, but pointed the finger at the reduced memory performance of the large Threadripper 2 CPUs. Wendell then discovered that removing CPU 0 from the thread pool after the program had started running regained all of the performance lost on Windows.
After some discussions about what exactly the issue was, I helped Wendell with some additional testing by running our CPU suite through an affinity mask that removed CPU 0 from the available options for the whole of each run. The results were negative, suggesting that the key was not excluding CPU 0 as such, but changing the affinity while the program was running.
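To make the idea of an affinity mask concrete, here is a minimal sketch of how a mask that excludes CPU 0 can be computed. The function names are hypothetical and chosen for illustration only; this is not the tooling used for the tests described above.

```python
def mask_without_cpu0(cpu_count: int) -> int:
    """Return a bitmask with one bit per logical CPU and bit 0 (CPU 0) cleared."""
    if cpu_count < 2:
        raise ValueError("need at least two logical CPUs to exclude CPU 0")
    all_cpus = (1 << cpu_count) - 1   # e.g. 4 CPUs -> 0b1111
    return all_cpus & ~1              # clear the lowest bit -> 0b1110


def cpus_from_mask(mask: int) -> list:
    """List the logical CPU indices whose bits are set in the mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


if __name__ == "__main__":
    mask = mask_without_cpu0(4)
    print(hex(mask), cpus_from_mask(mask))
```

A mask like this can be handed to an OS facility when a program is launched, for example the `start /affinity` command in the Windows command prompt, which takes a hexadecimal mask. That only fixes the allowed CPUs up front, which is the case that showed no benefit here.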
After this, Wendell did his testing on an EPYC 7551 processor, one of the big four-die parts, and confirmed this was not limited to just Threadripper – the problem wasn’t memory, it was almost certainly the Windows Scheduler.
'Best NUMA Node' and Windows Hotfix for 2-NUMA
The conclusion was that in a NUMA environment, the Windows scheduler assigns a ‘best NUMA node’ to each piece of software and is programmed to move its threads to that node as often as possible, readily kicking out other threads that share the same ‘best NUMA node’ setting. When a single binary spawns 32 or 64 threads, every thread from that binary is assigned the same ‘best NUMA node’, so these threads are continually pushed onto that node, displacing threads that also want to be there. This leads to core contention, and a fully multi-threaded program could spend half of its time shuffling threads around to comply with its ‘best NUMA node’ assignment.
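The scale of that contention can be illustrated with some simple arithmetic. The sketch below is a toy model only, not the Windows scheduler's algorithm: it assumes a hypothetical layout of 4 NUMA nodes with 16 logical CPUs each, and compares 64 threads spread across all nodes against 64 threads all steered to one node.

```python
def oversubscription(threads: int, nodes: int, cpus_per_node: int) -> dict:
    """Toy count of runnable threads per logical CPU under two placements.

    Assumption: every thread is always runnable and all CPUs are identical.
    """
    spread = threads / (nodes * cpus_per_node)   # threads spread over every node
    one_node = threads / cpus_per_node           # every thread steered to one node
    displaced = max(0, threads - cpus_per_node)  # threads that cannot fit on that node at once
    return {"spread": spread, "one_node": one_node, "displaced": displaced}


if __name__ == "__main__":
    print(oversubscription(threads=64, nodes=4, cpus_per_node=16))
```

Under these assumptions, steering everything to one node leaves four threads competing for each logical CPU there, with 48 threads unable to run on the preferred node at any given moment.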
The ‘best NUMA node’ mechanism was originally intended for running VMs, such that each VM would run in its own runtime and be assigned a different ‘best NUMA node’ depending on what else was currently running on the system.
One would expect this issue to come up in any NUMA environment, such as dual processors or dual-die AMD processors. It turns out that Microsoft has a hotfix in place in Windows for dual-NUMA environments that disables this ‘best NUMA node’ situation. Ultimately at some point there were enough dual-socket workstation platforms on the market that this made sense, pushing the ‘best NUMA node’ implementation down the road to 3+ NUMA environments. This is why we see it in quad-die Threadripper and EPYC, and not dual-die Threadripper.
Wendell has been working with Jeremy from Bitsum, creator of the CorePrio software, on developing a way of soft-fixing this issue. The CorePrio software now has an option called ‘NUMA Disassociator’, which probes which software is active every few seconds and adjusts the thread affinity while the software is running (rather than running with a fixed affinity mask, which has no effect).
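The general pattern described here, polling every few seconds and adjusting affinity while the target is running, can be sketched as below. This is an illustration of the pattern only and not CorePrio's implementation; the function name, its parameters, and the injected callbacks are all hypothetical.

```python
import time
from typing import Callable, Iterable


def adjust_periodically(
    list_targets: Callable[[], Iterable[int]],
    set_affinity: Callable[[int, int], None],
    mask: int,
    interval_s: float = 3.0,
    rounds: int = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-apply an affinity mask to every current target, once per round.

    list_targets and set_affinity are supplied by the caller (assumption:
    they wrap whatever OS call is appropriate). Returns how many
    adjustments succeeded.
    """
    applied = 0
    for round_index in range(rounds):
        for target_id in list_targets():
            try:
                set_affinity(target_id, mask)
                applied += 1
            except OSError:
                # The target may have exited between listing and adjusting.
                continue
        if round_index < rounds - 1:
            sleep(interval_s)
    return applied


if __name__ == "__main__":
    log = []
    count = adjust_periodically(
        list_targets=lambda: [101, 102],            # stand-in IDs
        set_affinity=lambda t, m: log.append((t, m)),
        mask=0b1110,
        rounds=2,
        sleep=lambda s: None,                       # skip real waiting in the demo
    )
    print(count, log)
```

The OS-specific work is deliberately left to the callbacks. On Windows they could wrap the Win32 `SetThreadAffinityMask` or `SetProcessAffinityMask` calls; on Linux, `os.sched_setaffinity`, which takes a set of CPU indices rather than a bitmask.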
This is a good temporary solution, but the underlying issue needs to be fixed in the Windows scheduler.
AMD Comments On The Findings
There have been questions about how much AMD/Microsoft know about this issue, who they are in contact with, and what is being done. AMD was happy to make some comments on the record.
AMD stated that they have support and update tickets open with Microsoft’s Windows team on the issue. They believe they know what the issue is, and commend Wendell for being very close to the actual cause (they declined to go into detail). They are currently comparing notes with Bitsum, and actually helped Bitsum to develop the original tool for affinity masking, although the ‘NUMA Disassociator’ is obviously new.
The timeline for a fix will depend on a number of factors between AMD and Microsoft, but there will be announcements about when the fix is ready and exactly how it will affect performance. Other improvements to help optimize performance will also be included. AMD is still very pleased with Threadripper 2 performance, and is keen to stress that in the most popular performance-related tests, reviews show rendering performance still well above the competition. The company is also working with software vendors to push that performance even further.
39 Comments
edwaleni - Monday, January 14, 2019
Phoronix already tested the Coreprio app and they found small improvements in some apps, some worse. The only app showing a gain was Indigo, which oddly is the "only" app that Wendell can get appreciable improvement in.
Microsoft needs to fix their NUMA support and stop kicking the can down the road. If Linux can support it "out of the box", then Microsoft needs to get up to speed.
Byyo - Monday, January 14, 2019
Those results at Phoronix were suspect. He didn't see the performance regression from Indigo on Win10, for unknown reasons (maybe lingering impact from Coreprio), so there was nothing for Coreprio to fix. He *did* see the perf regression in Win2019, where it only had 50% of the performance, though he didn't go into this discrepancy. Everyone else gets the same results from Indigo.
7-Zip is also impacted the same, though the fix has been more challenging to consistently apply in NUMA mode (though does work): https://bitsum.com/forum/index.php/topic,8526.msg2...
edwaleni - Monday, January 14, 2019
I am not sure what at Phoronix is "suspect". He ran the tool, he ran the tests. Other than Indigo, nothing caught or overtook Linux's performance in any appreciable way.
The NumaPref detail you link to was just posted yesterday.
For those wondering what AMD is doing about it, they opened the ticket, they elevated it at MSFT. Not aware of the timelines involved, but it really is in MSFT's hands.
PeachNCream - Monday, January 14, 2019
*buys 32-core processor for compute-intensive workloads*
*disables multiple cores to achieve acceptable performance*
Makes sense to me. Now fix your junk Microsoft.
coder543 - Monday, January 14, 2019
After all these years, I really wish that AnandTech would do some of their benchmarks on Linux as well. Windows just has really bad performance at certain things, like listing directories with lots of small files, or apparently NUMA scheduling. Intrinsic issues like these make me question the value of the Chrome compilation benchmark and others that I want to care about, since those numbers could be wildly different on Linux. For consumer hardware, perhaps Windows benchmarks are fine, but for reviews of professional-grade desktop hardware, Linux should absolutely have a place in the benchmark results.
At a minimum, on issues like the article addresses, it would make it easier for the reviewer to differentiate hardware problems from software problems when they can look at the benchmark results from two operating systems, instead of only one.
PeachNCream - Monday, January 14, 2019
Your wish is about to be granted. Per Ian in the recent $60 CPU review article located here:
https://www.anandtech.com/show/13660/amd-athlon-20...
"Linux (when feasible)
When in full swing, we wish to return to running LinuxBench 1.0. This was in our 2016 test, but was ditched in 2017 as it added an extra complication layer to our automation. By popular request, we are going to run it again."
Dragonstongue - Monday, January 14, 2019
They should just leave Core 0 out of the equation in the first place when it comes to anything but the primary task user demand in question, example, I launch a game to play, Core 0 takes this game as priority, I launch a media player, it gets assigned to Core 0 as default etc, the things windows, background process, auto-launch programs etc get sent to all other Cores except for Core 0, there, problem solved, should be easy enough to do via windows KB update to Vista all the way through to Win 10.
Lakados - Wednesday, January 16, 2019
I have an OLD quad socket Intel server that is being decommissioned. It has run RHEL 5.x its whole life. I would be half tempted to stick Windows on it just to see if this can be replicated there.
Ozymankos - Saturday, February 9, 2019
oh you ran explorer in windows with all 16 cores?
well that is a true achievement
you shall post it together with
-7 flies in a single strike:))