The U8-Series Microarchitecture

We’ve had the pleasure of being briefed on the key aspects of the U8 microarchitecture, so we’re able to take a closer, if still high-level, look at how the new CPU design functions.

At the highest level, the U8 is a 3-wide issue out-of-order CPU with a pipeline depth of 12 stages, feeding 3 execution units. It’s a fairly traditional OoO design, and the noteworthy design choice here is the core’s use of physical register files instead of an architectural one such as that seen in earlier Arm designs like the A72.

One thing to note as we’re covering the microarchitecture is that SiFive didn’t disclose the exact sizes of some of the structures. That is somewhat natural given the core’s purported scalable configuration design, where one can change many aspects of the IP. We’re only covering the generic U8-Series microarchitecture here, as individual implementations (such as a U84) will have different configurations.

The fetch unit of the core is able to request instructions out of the L1I at 16 bytes per cycle and put them into the fetch queue of the front-end. The RISC-V ISA has a variable instruction encoding size, so it’s not possible to map this to an exact number of instructions as one can on the Arm ISA, but if we naively assume a 32-bit average, it would correspond to 4 instructions per cycle. Of course, this isn’t surprising as the decoder on the U8 is 4-wide, feeding expanded instructions into the instruction queue.
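To make the fetch arithmetic concrete, here is a minimal sketch of how the average number of instructions per 16-byte fetch changes with the share of 16-bit compressed instructions. The compressed fractions are purely illustrative assumptions, not SiFive figures, and the model ignores instructions that straddle a fetch boundary.

```python
FETCH_BYTES_PER_CYCLE = 16  # L1I fetch bandwidth quoted in the article


def instructions_per_fetch(compressed_fraction: float) -> float:
    """Average number of instructions contained in one 16-byte fetch.

    compressed_fraction is the assumed share of instructions that use the
    16-bit compressed encoding; the remainder are 32-bit instructions.
    """
    if not 0.0 <= compressed_fraction <= 1.0:
        raise ValueError("compressed_fraction must be within [0, 1]")
    # Weighted average instruction size in bytes (2 bytes vs. 4 bytes).
    avg_bytes = 2 * compressed_fraction + 4 * (1.0 - compressed_fraction)
    return FETCH_BYTES_PER_CYCLE / avg_bytes


if __name__ == "__main__":
    # Hypothetical instruction mixes, chosen only for illustration.
    for fraction in (0.0, 0.5, 1.0):
        rate = instructions_per_fetch(fraction)
        print(f"{fraction:.0%} compressed: {rate:.2f} instructions per fetch")
```

With no compressed instructions this reproduces the naive figure of 4 instructions per cycle; any share of 16-bit encodings pushes the average above the 4-wide decoder's capacity.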

The interesting thing about the core is that the instruction queue is only able to issue 3 instructions out to the rename stage. Having a fetch width higher than the issue rate helps in the case of branch mispredictions and bubbles, as it allows the front-end to catch up with the execution back-end, something we’ve also seen in other cores. However, we have rarely seen an implementation in which the decoder is wider than the issue rate (only Intel's recent Tremont microarchitecture would also fit this characteristic). Beyond being a deliberate design decision for the balance of the microarchitecture, it may also be a forward-looking choice on the part of the decoder, as we may see wider-issue configurations in future U8 designs.
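The catch-up effect can be illustrated with a toy queue model. This is only a sketch under invented assumptions: the bubble pattern (2 empty front-end cycles out of every 8) and the 16-entry queue are made up for illustration, since SiFive did not disclose structure sizes, and the model ignores dependencies and back-end stalls entirely.

```python
def simulate(decode_width: int, issue_width: int = 3, cycles: int = 1000,
             bubble_every: int = 8, bubble_len: int = 2,
             queue_capacity: int = 16) -> float:
    """Return average instructions issued per cycle in a toy front-end model.

    Every `bubble_every` cycles the front-end delivers nothing for
    `bubble_len` cycles (standing in for a mispredict or fetch bubble).
    All parameters besides the 4-wide decode and 3-wide issue are assumed.
    """
    queue = 0   # instructions waiting in the instruction queue
    issued = 0  # total instructions sent on to rename
    for cycle in range(cycles):
        in_bubble = (cycle % bubble_every) < bubble_len
        if not in_bubble:
            # The decoder fills the queue, limited by its capacity.
            queue = min(queue_capacity, queue + decode_width)
        # The queue drains at most issue_width instructions per cycle.
        n = min(issue_width, queue)
        queue -= n
        issued += n
    return issued / cycles


if __name__ == "__main__":
    for width in (3, 4):
        print(f"decode width {width}: {simulate(width):.3f} issued per cycle")
```

In this model a 3-wide decoder can never recover the cycles lost to bubbles, while the 4-wide decoder banks one spare instruction per busy cycle and keeps the 3-wide issue stage fed through the bubbles.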

Moving on to the mid-core, we see a traditional design with a rename stage, a re-order buffer and three dispatch engines feeding into the execution pipelines. The diagram here is a bit misleading in terms of the arrows going into the issue queues: it doesn’t mean that only one instruction goes to each issue queue; the core can still dispatch up to 3 instructions into the integer issue queues, for example.

It would have been interesting to hear about the exact structure sizes on this part of the core but SiFive didn’t cover these details during the presentation.

On the integer execution block, we see that it’s actually composed of three execution pipelines. Each has its own issue queue, feeding into three ALU pipelines with different capabilities. One pipeline serves just as a regular ALU, a second one shares the port with the branch unit, while the third pipeline is a more complex one capable of integer multiplication and division.
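As a rough sketch of how such asymmetric pipelines constrain scheduling, the model below maps operation classes to the pipelines able to execute them and greedily assigns one operation per pipeline. The pipeline names (`int0`, `int1`, `int2`) are hypothetical, and the assumption that all three pipelines accept simple ALU operations is our reading of the description rather than something SiFive confirmed. The per-pipeline issue queues are not modeled.

```python
from enum import Enum, auto


class Op(Enum):
    ALU = auto()
    BRANCH = auto()
    MUL = auto()
    DIV = auto()


# Hypothetical names; capabilities follow the article's description.
PIPELINES = {
    "int0": {Op.ALU},                   # plain ALU
    "int1": {Op.ALU, Op.BRANCH},        # ALU port shared with the branch unit
    "int2": {Op.ALU, Op.MUL, Op.DIV},   # complex pipe: multiply and divide
}


def eligible_pipelines(op: Op) -> list[str]:
    """Names of the pipelines that can execute the given operation class."""
    return [name for name, ops in PIPELINES.items() if op in ops]


def dispatch(ops: list[Op]) -> dict[int, str]:
    """Greedily assign at most one operation to each pipeline.

    Operations with the fewest eligible pipelines are placed first so that
    flexible ALU operations do not take a slot a branch or multiply needs.
    Returns a mapping from operation index to pipeline name; operations
    that found no free pipeline are left out and would wait.
    """
    free = set(PIPELINES)
    assignment: dict[int, str] = {}
    order = sorted(range(len(ops)), key=lambda i: len(eligible_pipelines(ops[i])))
    for i in order:
        for name in eligible_pipelines(ops[i]):
            if name in free:
                assignment[i] = name
                free.remove(name)
                break
    return assignment


if __name__ == "__main__":
    print(dispatch([Op.ALU, Op.BRANCH, Op.MUL]))
    print(dispatch([Op.MUL, Op.DIV]))
```

A mix of one ALU operation, one branch and one multiply fills all three pipelines, whereas a multiply and a divide compete for the single complex pipeline, so only one of them is placed.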

Unfortunately, SiFive didn’t go into any detail of the floating-point pipelines or the L/S units. On the FP side, things should be relatively simple in terms of the execution capabilities, at least on the U84 core. Currently, RISC-V does not have any SIMD/Vector instructions as that ISA extension has not been finalized yet. SiFive explains that this might happen at the end of the year, and the U87 is poised to adopt the new vector capabilities next year.


  • zmatt - Friday, November 1, 2019 - link

    Stop calling it a MIPS variant. Just because they reached similar conclusions doesn't mean they are related. By your logic Ryzen is a variant of Core.

    Furthermore, I'd argue that your criticisms of RISC-V and MIPS lacking instructions miss the entire point of RISC. Storage is cheap. Who cares if the code is bigger? Mobile devices are packing hundreds of gigs of storage and PCs have terabytes today. Save the silicon; every bit counts there when it's making heat, drawing power and complicating clock propagation.
  • Wilco1 - Friday, November 1, 2019 - link

    Would you prefer it being called a MIPS clone instead? I haven't seen two ISAs with such a great similarity as MIPS and RISC-V.

    You're applying 80's RISC dogma which is no longer relevant. Transistors are cheap and efficient today, so we don't need to minimize them. We no longer optimize just the core or decoder but optimize the system as a whole. Who cares if you saved a few mW in the decoder when moving the extra instructions between DRAM and caches costs 10-100 times as much?

    The RISC-V focus on simple instructions and decode is as crazy as a cult. They even want to add instruction fusion for eg. indexed accesses. So first simplify decode by leaving out useful instructions, then make it more complex again to try to make up for the missing instructions...
  • zmatt - Monday, November 4, 2019 - link

    I've no problem with making comparisons to aspects of MIPS, but saying it's a clone or derivative of it is reductionist.

    "You're applying 80's RISC dogma which is no longer relevant"

    You do realize that RISC won, right? Since Pentium Pro all x86 cores have been internally RISC with a big decoder slapped on the front so nobody had to rehash years of work in their compilers or break legacy code.

    "The RISC-V focus on simple instructions and decode is as crazy as a cult. They even want to add instruction fusion for eg. indexed accesses. So first simplify decode by leaving out useful instructions, then make it more complex again to try to make up for the missing instructions..."

    You accuse everyone else of holding on to an 80's dogma, yet you are the one who sounds like they are from the 80's. Like some die-hard greybeard who won't give up their VAX.
  • Threska - Wednesday, November 6, 2019 - link

    Storage is cheap. Bandwidth isn't. Moving more around to get the same effect isn't always better.
  • rahvin - Friday, November 1, 2019 - link

    Neither ARM nor any other instruction set has any inherent advantage over any other. Anybody making a statement like that is just plain ignorant of how modern CPUs are designed. This is besides the fact that if ARM were inherently better than x86 as you claim, it would have already displaced x86 on the desktop and server. In fact, every single desktop and server ARM architecture developed so far has fallen on its face in competition against the x86 processors.

    x86 CPUs haven't used x86 instructions internally since the Pentium Pro in the mid 90's. The shift to out of order execution required that an x86 instruction decoder be added, and abstraction from the instruction set became the norm. Since the x86 instruction set was abstracted with a hardware abstraction layer, I dare say every single Intel CPU since the Pentium Pro has used a different internal RISC architecture than every other generation, with no two being exactly identical. This has allowed Intel massive flexibility to pursue whatever internal architecture works best with their fab process while maintaining x86 compatibility through the decoder, which occupies almost no space anymore. On modern processors that decoder occupies something like 0.001% of the die and simply translates all those x86 instructions into whatever internal architecture the CPU actually uses.

    If I'm not mistaken, ARM moved to an instruction decoder with the shift to out of order execution as well, and their designs since no longer use pure ARM instructions within the core, although the simplicity of the ARM RISC architecture means they don't need as much abstraction as x86. There is no point in being anchored to the design parameters of the instruction set when hardware decoders are so cheap.

    The only reason ARM dominates the markets it does without Intel competition is that Intel is unwilling to compete in those markets at those prices. If Intel were to produce and sell smartphone chips that were competitive in both performance and price with the ARM chips, they'd cannibalize their higher margin products when OEMs grabbed those chips and started making higher end products by stapling 10 inexpensive cell phone processors together and ending up with a product that's competitive with the chips they sell for $1000. That's why on a lot of the cheaper products Intel sells they put restrictions on their use.

    You might not remember, but Intel went on a design spree in 2008 when there were market indications and predictions that the tablet and smartphone were going to destroy the PC marketplace. They had almost a dozen design teams producing low power and high performance CPUs. The products that came out of that ranged from Edison on the low end to the server Atoms like Avoton that were 25 watt 8 core CPUs. Intel's executives canceled most of these products or put major restrictions (such as amount of RAM, wattage, etc) on their use to try to avoid cannibalizing higher margin products (for example Avoton had some ridiculous restrictions such as no more than two memory slots). In this time period they produced a mostly competitive product for smartphones (it was about 5% slower than the highest end Qualcomm chip at the time) but they didn't sell any because they set the price higher than what Qualcomm wanted for their ARM chip. You can find articles on those chips on Google and you will note the reviewers that lamented the price and restrictions Intel put on the chip because they destroyed its competitiveness. But that's the thing, Intel's executives and board didn't want to compete in this market.

    Intel has always struggled with competing in these lower margin products because they know that if they produce a performant low power chip and sell it ARM cheap (ARM chips typically sell with single digit margins) there will be a dozen OEMs like Dell, HP or Lenovo that start stapling a dozen together and selling them as replacements for very high margin x86 products (Intel has 60% margins on their higher end products and can push margins as high as 75% on their server chips).

    ARM doesn't have any inherent advantage over Intel or AMD because of their instruction set. They do have a slight advantage because their business structure allows them to avoid the production side and focus on design, and they have a lot of partners to help advance the ecosystem, while ARM the company isn't affected by Qualcomm or Broadcom selling ARM chips with 5% margins. But make no mistake, IMO if Intel wanted to slash their margins to the level that the ARM chip makers get (and watch their stock price crater) they could easily put an x86 chip into every market ARM dominates right now and become the number one seller. They choose not to because of the damage it would do to their stock price and the high end market.
  • Wilco1 - Saturday, November 2, 2019 - link

    That's quite a long-winded way of saying "I don't believe ISA matters"...

    But the fact is, it does. Intel spent over $10 Billion to get into the phone/tablet market. They didn't just lower their margins, they slashed them - they literally paid $100 for each chip they "sold"! And despite having a process advantage at the time, the mobile Atoms still weren't competitive on power or performance. Given how hard they tried and how much money they spent, it's safe to say the x86 ISA complexity prevented them making competitive chips.

    The same is true at the high end. Mobile phones already have the same single-threaded performance as the fastest x86 CPU you can buy today. Do you think (or hope) it will end there? Arm consistently improves performance by 20-30% per year. In the next few years both Intel and AMD are in for some serious competition from much faster Arm cores in laptops and servers.
  • vladpetric - Wednesday, October 30, 2019 - link

    Classic SIMD (SSE/AVX or Neon) is not nearly as helpful as Dynamic Scheduling (or Out of order execution). Yes, you can have hand-coded loops with good performance, but that's it. And they only work for very regular code.

    In the 80s, instruction sets made a significant difference.

    But in the 90s, superscalar out-of-order came out and it beat everything else, by a large margin. These days, that's how you get performance, pretty much (high IPC from dynamic scheduling).
  • Threska - Friday, November 1, 2019 - link

    "But in the 90s, superscalar out-of-order came out and it beat everything else, by a large margin."

    And now we're paying the security price.
  • vladpetric - Thursday, November 7, 2019 - link

    At this time, turn off hyper-threading and you'll be fine.
  • Findecanor - Sunday, November 3, 2019 - link

    With that "classic SIMD", the instruction set and register width sometimes increased a lot with each generational jump, and developers have been limited to producing code for an ISA a couple of generations back: for the lowest-spec hardware that users were expected to own.
    There have also not been very good development tools and compilers, which have forced developers to hand-code or to use libraries that were geared towards only certain kinds of loops.

    The first of these is about to change with new ISAs. RISC-V's leading SIMD proposal and the SVE extension to ARM processors use _scalable_ vectors, where the register width is not limited by the ISA but by the specific processor it runs on. These ISAs are therefore expected to remain more stable than classic SIMD ISAs have.
    Compilers are also now much better than before at auto-vectorising code to run on SIMD hardware.
    These two improvements together mean that more code could be SIMD instructions, and that more of a processor's potential could be taken advantage of.

    High-performance computing has been largely taken over by GPUs, which are in essence super-wide SIMD machines, using predicate vectors for much of their flow control. (Predicates being only late additions to SSE and Neon)
    The scalable vector proposal for RISC-V is by some considered so promising that there have even been talks about building GPUs based around the RISC-V SIMD ISA -- optimised for SIMD first and general-compute second.
