A new update to the Intel document for software developers indicates that the company will begin to introduce various AVX-512 instruction set extensions to its consumer CPUs soon. This will start from the codenamed Cannon Lake (CNL) and Ice Lake (ICL) processors, made using 10 nm process technologies. The new extensions will enable future chips to improve performance in certain applications. One of the main questions on AVX-512 is which consumer programs will actually support the AVX-512 when these CNL and ICL processors hit the market. In addition to the AVX-512, the upcoming processors will introduce a host of other new non-AVX-512 instructions.

AVX-512 Coming to Consumer CPUs

According to the Intel Architecture Instruction Set Extensions and Future Features Programming Reference document, Intel’s Cannon Lake CPUs will support AVX512F, AVX512CD, AVX512DQ, AVX512BW, and AVX512VL. This will bring the feature set of these CPUs to the current level of the Skylake-SP based processors. In addition, the Cannon Lake microarchitecture will support the AVX512_IFMA and AVX512_VBMI commands, but at this point, it is unclear whether the support will be limited to servers, or will also be featured in the consumer processors (the latter scenario is likely based on the document wording, but remains unclear).

Intel originally promised to release Cannon Lake processors in 2016 – 2017 timeframe, but delayed introduction of its 10 nm process technology to 2018, thus postponing the CPU launch as well. Initially it was expected that the Cannon Lake CPUs would generally resemble the Kaby Lake and Coffee Lake chips with some refinements, but the addition of the AVX-512 support means a rather tangible architecture improvement. For AVX-512, large the chunks of data require massive memory bandwidth, which the Skylake-SP cores get due to large caches and more memory controllers. Keeping in mind memory bandwidth and power consumption factors, the AVX-512 might not be supported by all Cannon Lake client CPUs, but only by those aimed at higher-performance machines (i.e., no AVX-512 for ULP mobile parts as well as entry-level desktop SKUs, but this is a speculation at this point). Meanwhile, a good news is that by the time AVX-512-supporting Cannon Lake processors arrive, programs for client PCs that take advantage of the latest extensions will likely be available.

The evolution of the AVX-512 on general-purpose CPUs is not going to stop. Intel’s Ice Lake processors will support AVX512_VPOPCNTDQ (which will also be supported by the Xeon Phi ‘Knights Mill’) commands as well as AVX512_VNNI, AVX512_VBMI2, AVX512+VPCLMULQDQ and AVX512_BITALG instructions. The ICL chips will also feature AVX-512 versions of known AES and GFNI algorithms for encryption and error corrections — AVX512+VAES and AVX512+GFNI.

Meanwhile, the Knights Mill will exclusively support AVX512_4FMAPS and AVX512_4VNNI (at least for a while, because an Intel filing with the Linux kernel states that the upcoming Xeon Phi and Xeon CPUs will support both commands, but descriptions of Linux patches are not always accurate, plus, plans tend to change).

AVX-512 Support Propogation by Various Intel CPUs
  Xeon, Core X General Xeon Phi  
Skylake-SP AVX512BW
Knights Landing
Cannon Lake AVX512VBMI
Knights Mill
Ice Lake AVX512_VNNI
Source: Intel Architecture Instruction Set Extensions and Future Features Programming Reference (pages 12 and 13)

As it turns out from Intel’s document, the Cannon Lake and Ice Lake processors will have an up-to-date AVX-512 support. It is unknown whether the CNL and the ICL cores will be used inside the future server processors (remember that Intel has server-specific 'Cascade Lake' product incoming), but if this is the case, then it looks like Intel’s cores for server and client computers will have the same feature-set going forward, at least when it comes to the AVX-512 support.

Adding the AVX-512 to consumer processors looks like an important development even though the instruction set was primarily designed to process large amounts of data common for servers and, to a degree, workstations (such as encoding, rendering, cryptography, deep learning, etc.). Apparently, Intel believes that 512-bit INT/FP calculations will be important for mainstream PCs as well. A big question is how exactly Intel plans to implement the AVX-512 in various Cannon Lake and Ice Lake processors going forward. Keep in mind that Intel’s six and eight-core Skylake-X CPUs officially support one fused FMA for AVX-512-F, but the chips with 10+ cores officially support dual 512-bit AVX-512-F ports and can offer up to two times higher performance. So in that respect, there is potential for further differentiation between products.

In the meantime, Intel’s Cannon Lake and Ice Lake CPUs will have a number of other new instructions for various matters and they are certainly worth looking at.

New Instructions to Improve Security, Performance of Upcoming CPUs

In a bid to speed up certain cryptography algorithms, Cannon Lake will feature the SHA-NI instruction set that is already supported by the Goldmont cores. SHA-NI is of a similar base to AES-NI, that was added several generations prior. Based on Intel’s publications, SHA-NI can speed up SHA1, SHA256 and SHA224 algorithms. In addition, the new CPUs will also support the UMIP security mechanism that prevents the execution of certain instructions in if their privilege level is insufficient for that, preventing certain apps from accessing the OS settings.

The Ice Lake chips will bring support for Fast Short REP MOV instruction that will enable fast moves of large amounts of data from one location to another, which will benefit optimized memory-intensive applications. Keep in mind that we are moving towards persistent memory for a number of server applications and therefore large amounts of data located in DRAM and/or NVDIMMs will be more common in the future.

Another interesting feature supported by the Ice Lake consumer processors is CLWB (Cache Line Write Back) command for NVMe programming. The feature is already supported by the Skylake-SP cores and is required to better handle SSDs connected to the processor, but will come into consumer products with Ice Lake. CLWB flushes the write caches, but does not invalidate the data, making it available if it is needed after the line is flushed, thus improving performance in certain situations. Given the Purley/Skylake-SP context, CLWB is something required for upcoming NVDIMMs (based on 3D XPoint), but it is not completely clear how Intel expects to use it in case of consumer platforms (they make sense for certain workstation applications and for that reason CLWB is supported by SKL-SP). In any case, the addition of CLWB will add some speed in certain cases when very fast SSDs are used and cache miss is an issue.

There are other features coming in the Goldmont Plus (the heart of upcoming Gemini Lake SoCs) and Ice Lake processors, namely PTWRITE and RDPID, which seem to be aimed mostly at software developers and which purpose may not benefit end users right away.

Instruction Set Extensions of Cannon Lake, Ice Lake and Goldmont+ CPUs
  Instruction Purpose Description
Cannon Lake SHA-NI Security Cryptography acceleration.

User-Mode Instruction Prevention
Security Prevents execution of certain instructions if the Current Privilege Level (CPL) is greater than 0. If these instructions were executed while in CPL > 0, user space applications could have access to system-wide settings such as the global and local descriptor tables, the task register and the interrupt descriptor table.
Ice Lake CLWB

Cache Line
Write Back
Performance Writes back modified data of a cache line similar to CLFLUSHOPT, but avoids invalidating the line from the cache (and instead transitions the line to non-modified state). CLWB attempts to minimize the compulsory cache miss if the same data is accessed temporally after the line is flushed if the same data is accessed temporally after the line is flushed.
Fast Short REP MOV Performance Enables fast moves of data from one location to another.

Read Processor ID
General Quickly reads processor ID to discover its feature set and apply optimizations/use specific code path if possible.
Goldmont Plus PTWRITE

Write Data to a Processor Trace Packet
Debugging Unclear.
UMIP Security See above
RDPID General See above
Source: Intel Architecture Instruction Set Extensions and Future Features Programming Reference (pages 12 and 13)

Some History

Intel and AMD have been adding various instruction set extensions to the x86 architecture since the mid-1990s. Throughout the recent 20 years, both companies have brought in hundreds of new instructions designed to improve performance in various applications by SIMD instructions and feeding CPU cores large amounts of data at once or by using special-purpose hardware. Intel’s latest mainstream extensions are called the AVX/AVX2 and their main purposes were increasing the width of the register file (both SIMD and integer) to 256 bits and the introduction of commands like the FMA3 (that serves the same purpose — does relatively complex computations in one instruction). To perform 256-bit AVX2 operations, CPUs have to lower their frequency to maintain stability, as cores tend to draw a lot of power under such workloads, but even at lower clock rates AVX/AVX2 make a lot of sense and increase overall throughput.

The next step in the evolution of the instruction set extensions that Intel made was the AVX-512. With AVX-512 the company decided to introduce different sets of instructions for different applications and implemented them in different products. Some of the AVX-512 extensions are aimed primarily at enterprise workloads, whereas the others are needed for supercomputers or high performance compute. Implementing all of them in in all products hardly makes a lot of sense for Intel and its customers, so the latest Skylake-SP Xeons (and the high-end desktop processors) support one set of AVX-512 commands and the Xeon Phis support another one. In the meantime, contemporary mainstream consumer CPUs do not support AVX-512 at all.  One of the reasons for this is because the physical implementation significantly increases die size (by up to 15% in case of the Skylake core). Other factors such as the cost associated with a die increase, and partly because client applications today cannot take advantage of such instructions, are also in the mix. In the future, this is going to change as Intel plans to enable support of certain AVX-512 variations in its future Cannon Lake and Ice Lake processors for mainstream consumers.

Wrapping Up

The addition of the AVX-512 to the future consumer CPUs is a good news for those who use such processors for things like video encoding, rendering or other applications that are common for workstations. Meanwhile, with the Ice Lake consumer chips, Intel is adding a deep learning-specific (AVX512_VNNI) 512-bit instructions as well as the NV-DIMM-oriented features such as CLWB, although immediate advantages for this market segment are unclear. Intel is opening this information up to allow developers to prepare for these processors and develop software in advance. In any case, all new features are always welcome by many because at some point they start to bring certain advantages.

Related Reading

Source: Intel (via WikiChip Twitter).

Comments Locked


View All Comments

  • James5mith - Thursday, October 19, 2017 - link

    And which CPU generation gets the onboard TB3 controller we were promised?
  • smilingcrow - Thursday, October 19, 2017 - link

    There have been slides posted online about that and I think it's the next one which is not backwardly compatible with CL.
  • HStewart - Thursday, October 19, 2017 - link

    I would think TB3 integration is part of motherboard - and since it has functionality related to Video - it probably aim more for mobile platforms.

    Unlike AMD, Desktop is a limited marked for Intel - mobile is where it is at for Intel now days.
  • smilingcrow - Thursday, October 19, 2017 - link

    They are integrating the controller which is currently a separate chip directly into the actual CPU.
    Of course the motherboard will need to support its features which is the case for all I/O that a CPU offers.

    Your fallacy that the desktop is not important for Intel shows how out of touch you are.
    TB3 will be integrated in desktop CPUs for sure.
  • moozoo - Thursday, October 19, 2017 - link

    Wake me up when they add AVX512ER to a consumer CPU, until then I'll stick with GPU's
  • mode_13h - Friday, October 20, 2017 - link

    Do GPUs have hardware acceleration of exponential functions? Surely, they must accelerate reciprocal and probably even x^-0.5.

    Regardless, if you can currently use GPUs, then any AVX-512 support is unlikely to change things for you. Xeon Phi (KNL) has it and still gets stomped on raw compute perf by big GPUs.
  • peevee - Friday, October 20, 2017 - link

    What a mess. Instead of a general highly efficient general vectorized command set they have to have a separate instruction for every possible need.
  • mode_13h - Tuesday, October 24, 2017 - link

    You're being unrealistic. The combination of possible operations and operands is dizzying. It took Intel several generations to flesh out their previous vector instructions, and those didn't even support data types like fp16.
  • peevee - Wednesday, October 25, 2017 - link

    I am being realistic. There are principal architectural decisions made in the 70s (and even as early as 40s) which need to be ditched to extract full potential from modern manufacturing technologies, but it is possible, and much better than this constant abomination of inventing new commands.
  • mode_13h - Wednesday, October 25, 2017 - link

    No, you're not being realistic if you think they're going to just flush x86 down the toilet and start with a clean sheet of paper. There are CPUs like that, but targeted more towards hyperscale and other specialized niches.

Log in

Don't have an account? Sign up now