Over the past decade I have been involved in several projects that have designed either instruction set architecture (ISA) extensions or clean-slate ISAs for various kinds of processors (you will even find my name in the acknowledgments for the RISC-V spec, right back to the first public version). When I started, I had very little idea about what makes a good ISA, and, as far as I can tell, this is not formally taught anywhere. With the rise of RISC-V as an open base for custom instruction sets, however, the barrier to entry has become much lower and the number of people trying to design some, or all, of an instruction set has grown immeasurably.
An instruction set is a lingua franca between compilers and microarchitecture. As such, it has a lot in common with compiler intermediate languages, a subject on which Fred Chow has written an excellent overview.2
Programmers see details of the target platform at three levels:
The application binary interface (ABI) is a set of conventions that define how compilers use visible hardware features. This may be private to a single compiler or shared as a convention between multiple interoperable compilers.
The architecture defines everything the hardware guarantees. This is a contract between the people implementing compilers and operating systems and those implementing the hardware. The architecture includes mechanisms for enumerating devices, configuring interrupts, and so on. The ISA is the core part of the architecture that defines the encoding and behavior of instructions and the operands they consume.
The microarchitecture is a specific implementation of the architecture. Ideally, programmers do not care about the specific details of microarchitectures, but these details often leak. For example, cache-line sizes may be a microarchitectural detail, but they impact false sharing and so can have a large performance impact. If you care about side channels, then you may find the microarchitecture is very important.
Conventions can often live in either the ABI or ISA. There is no hard-and-fast rule for where any of these should live, but here are a couple of helpful rules of thumb:
If different languages are going to want to do something different, it should be in the ABI.
If software needs to do a specific task to take advantage of a microarchitectural feature, that belongs in the ISA and not the ABI.
No Such Thing as a General-Purpose ISA
I have written before that there is no such thing as a general-purpose processor,1 but there is also no such thing as a general-purpose ISA. An ISA must be efficient for compilers to translate a set of source languages into. It must also be efficient to implement in the kinds of microarchitecture that hardware will adopt.
Designing an ISA for all possible source languages is difficult. For example, consider C, Erlang, and Compute Unified Device Architecture (CUDA). Each has a very different abstract machine. C has large amounts of mutable state and a bolted-on concurrency model that relies on shared everything, locking, and, typically, small numbers of threads. Erlang has a shared-nothing concurrency model and scales to very large numbers of processes. CUDA has a complex sharing model that is tightly coupled to its parallelism model.
You can compile any of these languages to any Turing-complete target (by definition), but that may not be effective. If it were easy to compile C code to GPUs (and take advantage of the parallelism), then CUDA would not need to exist. Any family of languages has a set of implicit assumptions that drive decisions about the most efficient targets.
Algol-family languages, including C, typically have good locality of reference (both spatial and temporal), but somewhat random-access patterns. They have a single stack, and a large proportion of memory accesses will be to the current stack frame. They allocate memory in objects that are typically small, and most are not shared between threads. Object-oriented languages typically do more indirect branches and more pointer chasing. Array-processing languages and shading languages typically do a lot of memory accesses with predictable access patterns.
If you do not articulate the properties of the source languages you are optimizing for, then you are almost certainly baking in some implicit assumptions that may or may not actually hold.
Similarly, looking down toward the microarchitecture, a good ISA for a small, embedded microcontroller may be a terrible ISA for a large superscalar out-of-order processor or a massively parallel accelerator. There are good reasons why 32-bit Arm failed to compete with Intel for performance, and why x86 has failed to displace Arm in low-power markets. The things that you want to optimize for at different sizes are different.
Designing an ISA that scales to both very large and very small cores is difficult. Arm’s decision to separate its 32- and 64-bit ISAs meant it could assume a baseline of register renaming and speculative execution in its 64-bit A profile and in-order execution in its 32-bit M profile, and tune both, assuming a subset of possible implementations. RISC-V aims to scale from tiny microcontrollers up to massive server processors. It’s an open research question whether this is possible (certainly no prior architecture has succeeded).
Business Is Not a Separable Concern
One kind of generality does matter: Is the ISA a stable contract? This is more a business question than a technical one. A stable ISA can enter a feedback cycle where people buy it because they have software that runs on it and people write software to run on it because they have it. Motorola benefited from this with its 68000 line for a long time, Intel with its x86 line for even longer.
This comes with a cost: In every future product you will be stuck with any design decision you made in the current generation. When it started testing simulations of early Pentium prototypes, Intel discovered that a lot of game designers had found they could shave one instruction off a hot loop by relying on a bug in the flag-setting behavior of Intel’s 486 microprocessor. This bug had to be made part of the architecture: If the Pentium did not run popular 486 games, customers would blame Intel, not the game authors.
If you buy an NVIDIA GPU, you do not get a document explaining the instruction set. It, and many other parts of the architecture, are secret. If you want to write code for it and do not want to use NVIDIA’s toolchain, you are expected to generate PTX, which is a somewhat portable intermediate language that the NVIDIA drivers can consume. This means that NVIDIA can completely change the instruction set between GPU revisions without breaking your code. In contrast, an x86 CPU is expected to run the original PC DOS (assuming it has BIOS emulation in the firmware) and every OS and every piece of user-space software released for PC platforms since 1978.
This difference impacts the degree to which you can overfit your ISA to the microarchitecture. Both x86 and 32-bit Arm were heavily influenced by what was feasible to build at the time they were created. If you are designing a GPU or workload-specific accelerator, however, then the ISA may change radically between releases. Early AMD GPUs were very long instruction word (VLIW) architectures; modern ones are not but can still run shaders written for the older designs.
A stable ISA also impacts how experimental you can be. If you add an instruction that might not be useful (or might be difficult to implement in future microarchitectures) to x86 or AArch64, then you will find that some popular bit of code uses it in some critical place and you will be stuck with it. If you do the same in a GPU or AI accelerator, then you can quietly remove it in the next generation.
Architecture Matters
A belief that has gained some popularity in recent years is that the ISA does not matter. This belief is largely the result of an oversimplification of an observation that is obviously true: Microarchitecture makes more of a difference than architecture in performance. A simple in-order pipeline may execute around 0.7 instructions per cycle. A complex out-of-order pipeline may execute five or more per cycle (per core), giving almost an order of magnitude difference between two implementations of the same ISA. In contrast, in most of the projects that I have worked on, I have seen the difference between a mediocre ISA and a good one giving no more than a 20% performance difference on comparable microarchitectures.
Two parts of this comparison are worth pointing out. The first is that designing a good ISA is a lot cheaper than designing a good microarchitecture. These days, if you go to a CPU vendor and say, “I have a new technique that will produce a 20% performance improvement,” they will probably not believe you. That kind of overall speedup does not come from a single technique; it comes from applying a load of different bits of very careful design. Leaving that on the table is incredibly wasteful.
The second key point is contained in the caveat at the end: “… on comparable microarchitectures.” The ISA constrains the design space of possible implementations. It’s possible to add things to the architecture that either enable or prevent specific microarchitectural optimizations.
For example, consider an arbitrary-length vector extension that operates with the source and destination operands in memory. If the user writes a + b * c (where all three operands are large vectors), then a pipelined implementation is going to want to load from all three locations, perform the multiply, perform the add, and then store the result. If you have to take an interrupt in the middle and you are only halfway done, what do you do? You might just say, “Well, add and multiply are idempotent, so we can restart and everything is fine,” but that introduces additional constraints. In particular, the hardware must ensure the destination does not alias any of the source values. If these values overlap, simply restarting is difficult. You can expose registers that report the progress through the add, but that prevents the pipelined operation because you cannot report that you are partway through the add and the multiply. If you are building a GPU, then this is less important because typically, you are not handling interrupts within kernels (and if you are, then waiting a few hundred cycles to flush all in-flight state is fine).
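The aliasing hazard can be made concrete with a small software model. This is only a sketch under stated assumptions: the flat `mem` list, the base offsets, and the function names are hypothetical stand-ins for a memory-to-memory vector operation, and the "interrupt" is modeled as stopping after a fixed number of elements and then naively restarting from element zero.

```python
def vector_mul_add(mem, dst, a, b, c, n, stop=None):
    """Store mem[a+i] + mem[b+i] * mem[c+i] to mem[dst+i] for i in [0, stop)."""
    stop = n if stop is None else stop
    for i in range(stop):
        mem[dst + i] = mem[a + i] + mem[b + i] * mem[c + i]


def run_uninterrupted(mem, dst, a, b, c, n):
    mem = list(mem)  # work on a copy so the caller's memory is untouched
    vector_mul_add(mem, dst, a, b, c, n)
    return mem


def run_with_restart(mem, dst, a, b, c, n, interrupt_at):
    """Do part of the work, take an 'interrupt', then restart from scratch."""
    mem = list(mem)
    vector_mul_add(mem, dst, a, b, c, n, stop=interrupt_at)  # partial progress
    vector_mul_add(mem, dst, a, b, c, n)                      # naive restart
    return mem


# Hypothetical layout: a at 0, b at 4, c at 8, spare destination at 12.
memory = [1, 2, 3, 4,  2, 2, 2, 2,  3, 3, 3, 3,  0, 0, 0, 0]

# Disjoint destination: restarting is harmless.
disjoint_ok = (run_with_restart(memory, 12, 0, 4, 8, 4, interrupt_at=2)
               == run_uninterrupted(memory, 12, 0, 4, 8, 4))

# Destination aliases source a: the restart re-reads values it already overwrote.
aliased_clean = run_uninterrupted(memory, 0, 0, 4, 8, 4)[:4]
aliased_restarted = run_with_restart(memory, 0, 0, 4, 8, 4, interrupt_at=2)[:4]
```

With a disjoint destination the restarted run matches the uninterrupted one. When the destination overlaps `a`, the first two elements are accumulated twice, which is why an architecture that relies on restart has to rule out (or detect) overlap.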
The same problem applies to microcode. You must be able to take an interrupt immediately before or after a microcoded instruction. A simple microcode engine pauses the pipeline, issues a set of instructions expanded from the microcoded instruction, and then resumes. On a simple pipeline, this is fine (aside from the impact on interrupt latency) and may give you better code density. On a more complex pipeline, this prevents speculative execution across microcode and will come with a big performance penalty. If you want microcode and good performance from a high-end core, you need to use much more complicated techniques for implementing the microcode engine. This then applies pressure back on the ISA: If you have invested a lot of silicon in the microcode engine, then it makes sense to add new microcoded instructions.
What Do Small Cores Want?
If you are designing an ISA for a simple single-issue in-order core, you have a clear set of constraints. In-order cores do not worry much about data dependencies; each instruction runs with the results of the previous one available. Only the larger ones do register renaming, so using lots of temporaries is fine.
They typically do care about decoder complexity. The original RISC architectures had simple decoders because CISC (complex instruction set computer) decoders took a large fraction of total area. An in-order core may consist of a few tens of thousands of gates, whereas a complex decoder can easily double the size (and, therefore, cost and power consumption). Simple decoding is important on this scale.
Small code is also important. A small microcontroller core may occupy as little area as 10KB of SRAM (static random-access memory). A small decrease in encoding efficiency can dwarf everything when considering the total area cost: If you need 20% more SRAM for your code, then that can be equivalent to doubling the core area. Unfortunately, this constraint almost directly contradicts the previous one. This is why Thumb-2 and RISC-V focused on a variable-length encoding that is simple to decode: They save code size without significantly increasing decoder complexity.
This is a complex tradeoff that is made even more complicated when considering multiple languages. For example, Arm briefly supported Jazelle DBX (direct bytecode execution) on some of its mobile cores. This involved decoding Java bytecode directly, with Java VM (virtual machine) state mapped into specific registers. A Java add instruction, implemented in a software interpreter, requires at least one load to read the instruction, a conditional branch to find the right handler, and then another to perform the add. With Jazelle, the load happens via instruction fetch, and the add would add the two registers that represented the top of the Java stack. This was far more efficient than an interpreter but did not perform as well as a JIT (just-in-time) compiler, which could do a bit more analysis between Java bytecodes.
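To see where the interpreter overhead comes from, here is a minimal sketch of a stack-machine interpreter loop. The opcode numbers and the `interpret` function are hypothetical and are not real Java bytecode encodings; the point is only that each bytecode costs a fetch from memory and a dispatch before any useful work happens.

```python
# Hypothetical opcodes for illustration; these are not JVM encodings.
PUSH, ADD, HALT = 0, 1, 2


def interpret(code):
    """Run a tiny stack-machine program and return the final operand stack."""
    stack = []
    pc = 0
    while True:
        op = code[pc]              # load: read the bytecode from memory
        pc += 1
        if op == PUSH:             # dispatch: find the right handler
            stack.append(code[pc])
            pc += 1
        elif op == ADD:
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(lhs + rhs)  # the only "real" work: one add
        elif op == HALT:
            return stack
        else:
            raise ValueError(f"unknown opcode {op}")


result = interpret([PUSH, 2, PUSH, 3, ADD, HALT])
```

In a direct-execution scheme such as the one described above, the fetch and dispatch steps are handled by the instruction fetch and decode hardware, leaving just the add.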
Jazelle DBX is an interesting case study because it made sense only in the context of a specific set of source languages and microarchitectures. It provided no benefits for languages that did not run in a Java VM. By the time devices had more than about 4MB of RAM, Jazelle was outperformed by a JIT. Within that envelope, however, it was a good design choice.
Jazelle DBX should serve as a reminder that optimizations for one size of core can be incredibly bad choices for other cores.
What Do Big Cores Want?
As cores get bigger, other factors start to dominate. We have seen the end of Dennard scaling but not of Moore’s Law. Each generation still gets more transistors for a fixed price, but if you try to power them all, then your chip catches fire (the so-called “dark silicon” problem). This is part of the reason that on-SoC (system-on-a-chip) accelerators have become popular in recent years. If you can add hardware that makes a particular workload faster but is powered off entirely at other times, then that can be a big win for power consumption. Components that need to be powered all of the time are the most likely to become performance-limiting factors.
On a lot of high-end cores, the register rename logic is often the single biggest consumer of power. Register rename is what enables speculative and out-of-order execution. Rename registers are similar to the static single assignment (SSA) form that compilers use. When an instruction is dispatched, a new rename register is allocated to hold the result. When another instruction wants to consume that result, it is dispatched to use this rename register. Architectural registers are just names for mapping to SSA registers.
A rename register consumes space from the point at which an instruction that defines it enters speculative execution until another instruction that writes to the same architectural register exits speculation (that is, definitely happens). If a temporary value is live at the end of a basic block, then it continues to consume a rename register. The branch at the end of the basic block will start speculatively issuing instructions somewhere else, but until that branch is no longer speculative and a following instruction has written to the register, the core may need to roll back everything up to that branch and restore that value. The ISA can have a big impact on the likelihood of encountering this kind of problem.
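The lifetime rule can be illustrated with a toy rename table. This is a deliberately simplified sketch, not a description of any real core: the `RenameTable` class, its free list, and the in-order commit queue are all assumptions made for illustration, and rollback on misprediction is omitted.

```python
class RenameTable:
    """Toy model: architectural registers are names for physical registers."""

    def __init__(self, num_arch, num_phys):
        # Initially architectural register i maps to physical register i.
        self.mapping = list(range(num_arch))
        self.free = list(range(num_arch, num_phys))
        # In-flight writes in program order: (arch_reg, new_phys, previous_phys)
        self.in_flight = []

    def dispatch_write(self, arch_reg):
        """Allocate a fresh physical register for an instruction writing arch_reg."""
        if not self.free:
            raise RuntimeError("no free rename registers: dispatch would stall")
        new_phys = self.free.pop(0)
        self.in_flight.append((arch_reg, new_phys, self.mapping[arch_reg]))
        self.mapping[arch_reg] = new_phys
        return new_phys

    def commit_oldest(self):
        """Retire the oldest write; only now can the mapping it replaced be freed."""
        _arch_reg, _new_phys, previous_phys = self.in_flight.pop(0)
        self.free.append(previous_phys)


table = RenameTable(num_arch=4, num_phys=8)
temp = table.dispatch_write(1)      # a temporary is written to architectural r1
table.commit_oldest()               # the write commits...
still_held = temp not in table.free # ...but the temporary's register stays allocated
table.dispatch_write(1)             # a later instruction overwrites r1
table.commit_oldest()               # once that commits, the old value is dead
released = temp in table.free
```

In this model the physical register holding the temporary is returned to the free list only after a second write to the same architectural register commits, which is why dead-but-not-overwritten temporaries are costly.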
Complex addressing modes often end up being useful on big cores. AArch64 and x86-64 both benefit from them, and the T-Head extensions add them to RISC-V. If you are doing address calculation in a loop (for example, iterating over an array), then folding this into the load-store pipeline provides two key benefits: First, there is no need to allocate a rename register for the intermediate value; second, this computed value is never accidentally live across loop iterations. The power consumption of an extra add is less than that of allocating a new rename register.
Note that this is less the case for very complex addressing modes, such as the pre- and post-increment addressing modes on Arm, which update the base and thus still require a rename register. These modes still win to a degree because it’s cheaper (particularly for pre-increment) to forward the result to the next stage in a load-store pipeline than to send it via the rename logic.
One microarchitect building a high-end RISC-V core gave a particularly insightful critique of the RISC-V C extension, observing that it optimizes for the smallest encoding of instructions rather than for the smallest number of instructions. This is the right thing to do for small embedded cores, but large cores have a lot of fixed overheads associated with each executed instruction. Executing fewer instructions to do the same work is usually a win. This is why SIMD (single instruction, multiple data) instructions have been so popular: The fixed overheads are amortized over a larger amount of work.
Even if you do not make the arithmetic logic units (ALUs) the full width of the registers and take two cycles to push each half through the execution pipeline, you still save a lot of the bookkeeping overhead. SIMD instructions are a good use of longer encodings in a variable-length instruction set: For four instructions’ worth of work, a 48-bit encoding is probably still a big savings in code size, leaving the denser encodings available for more frequent operations.
Complex instruction scheduling causes additional pain. Even moderately large in-order cores suffer from branch misprediction penalties. The original Berkeley RISC project analyzed the output of C compilers and found that, on average, there was one branch per seven instructions. This has proven to be a surprisingly durable heuristic for C/C++ code.
With a seven-stage dual-issue pipeline, you might have 14 instructions in flight at a time. If you incorrectly predict a branch, half of these will be the wrong ones and will need to be rolled back, making your real throughput only half of your theoretical throughput. Modern high-end cores typically have around 200 in-flight instructions—that is over 28 basic blocks, so a 95% branch predictor accuracy rate gives less than a 24% probability of correctly predicting every branch being executed. Big cores really like anything that can reduce the cost of misprediction penalties.
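The arithmetic behind that figure is easy to reproduce. The sketch below assumes, as the text does, one branch per seven instructions and independent predictions; the function name is hypothetical.

```python
def all_branches_correct(in_flight, instrs_per_branch, accuracy):
    """Return (branches in flight, probability that every one is predicted correctly)."""
    branches = in_flight // instrs_per_branch  # whole basic blocks in flight
    return branches, accuracy ** branches


branches, probability = all_branches_correct(200, 7, 0.95)
print(f"{branches} branches in flight, P(all correct) = {probability:.3f}")
```

With 200 instructions in flight there are 28 complete basic blocks, and 0.95 raised to the 28th power comes out a little under 0.24.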
The 32-bit Arm ISA allowed any instruction to be predicated (conditionally executed depending on the value in a condition-code register). This was great for small to medium in-order cores because they could avoid branches, but the complexity of making everything predicated was high for big cores. The encoding space consumed by predication was large. For AArch64, Arm considered eliminating predicated execution entirely, but conditional move and a few other conditional instructions provided such a large performance win that Arm kept them.
You Don’t Win Points for Purity
Bjarne Stroustrup said, “There are only two kinds of languages: the ones people complain about and the ones nobody uses.” This holds for instruction sets (the lowest-level programming languages most people will encounter) just as much as for higher-level ones. Good instruction sets are always compromises.
For example, consider the jump-and-link instructions in RISC-V. These let you specify an arbitrary register as a link register. RISC-V has 32 registers, so specifying one requires a full five-bit operand in a 32-bit instruction. Almost 1% of the total 32-bit encoding space is consumed by the RISC-V jump-and-link instruction. RISC-V is, as far as I am aware, unique in this decision.
Arm, MIPS, and PowerPC all have a designated link register that their branch-and-link instructions use. Thus, they require one bit to differentiate between jump-and-link and plain jump. RISC-V chooses to avoid baking the ABI into the ISA but, as a result, requires 16 times as much encoding space for this instruction.
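These figures follow from the field widths. In the base RISC-V encoding, the jump-and-link instruction has a 20-bit immediate and a five-bit destination register alongside its opcode, so 25 of the 32 bits are free operand bits. The comparison below is a sketch: the "fixed link register" variant is a hypothetical encoding that spends one bit on link/no-link instead of five bits on a register number.

```python
def encoding_fraction(operand_bits, instruction_bits=32):
    """Fraction of the encoding space used by an instruction with this many operand bits."""
    return 2 ** operand_bits / 2 ** instruction_bits


IMMEDIATE_BITS = 20

# RISC-V JAL: 5-bit link register operand plus the immediate.
jal_fraction = encoding_fraction(5 + IMMEDIATE_BITS)

# Hypothetical alternative: a single link/no-link bit plus the same immediate.
fixed_link_fraction = encoding_fraction(1 + IMMEDIATE_BITS)

ratio = jal_fraction / fixed_link_fraction
print(f"JAL uses {jal_fraction:.2%} of the space, {ratio:.0f}x the fixed-link variant")
```

The result is 1/128 of the 32-bit space (about 0.78%), and the four extra operand bits account for the factor of 16.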
This decision is even worse because the ABI leaks into the microarchitecture but not the architecture. RISC-V does not have a dedicated return instruction, but implementations will typically (and the ISA specification notes that this is a good idea) treat a jump-register instruction with the ABI-defined link register as a return. This means that using any link register other than the one defined in the ABI will likely result in branch mispredictions. The result is dealing with all of the downsides of baking the ABI into the ISA but enjoying none of the benefits.
This kind of reasoning applies even more strongly to the stack pointer. AArch64 and x86 both have special instructions for operating on the stack. In most code from C-like languages, the stack pointer is modified only in function prologs and epilogs, but there are many loads and stores relative to it. This has the potential for optimization in the encoding, which can lead to further optimization in the microarchitecture. For example, modern x86 chips accumulate the stack-pointer displacement for push and pop instructions, emitting them as offsets to the rename register that contains the stack pointer (so they are independent and can be issued in parallel) and then doing a single update to the stack pointer at the end.
This kind of optimization is possible even if the stack pointer is just an ABI convention, but this again is a convention that is shared by the ABI and the microarchitecture, so why not take advantage of it to improve encoding efficiency in the ISA?
Finally, big cores really care about parallel decoding. Apple’s M2, for example, benefits hugely from the fixed-width ISA because it can fetch a block of instructions and start decoding them in parallel. The x86 instruction set, at the opposite extreme, needs more of a parser than a decoder. Each instruction is between one and 15 bytes, which may include a number of prefixes. High-end x86 chips cache decoded instructions (particularly in hot loops), but this consumes power and area that could be used for execution.
This is not necessarily a bad idea. As with small cores and instruction density, a variable-length instruction encoding may permit a smaller instruction cache, and that savings may offset the cost of the complex decoder.
Although RISC-V uses variable-length encoding, it’s very cheap to determine the length. This makes it possible to build an extra pipeline stage that reads a block of words and forwards a set of instructions to the real decoder. This is nowhere near as complex as decoding x86.
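A sketch of the length check shows how little work is involved. It covers only the 16-bit and 32-bit cases, which are distinguished by the low bits of the first halfword; longer encodings are deliberately left unhandled here, and the function names are hypothetical.

```python
def instruction_length(first_halfword):
    """Return the instruction length in bytes, given its lowest 16 bits."""
    if first_halfword & 0b11 != 0b11:
        return 2  # compressed encoding: low two bits are not 11
    if first_halfword & 0b11100 != 0b11100:
        return 4  # standard 32-bit encoding
    raise NotImplementedError("longer encodings are not handled in this sketch")


def instruction_starts(halfwords):
    """Return the halfword offsets at which instructions begin in a fetched block."""
    starts = []
    offset = 0
    while offset < len(halfwords):
        starts.append(offset)
        offset += instruction_length(halfwords[offset]) // 2
    return starts


# c.nop (0x0001), a 32-bit nop (0x00000013, low halfword first), then c.nop.
block = [0x0001, 0x0013, 0x0000, 0x0001]
starts = instruction_starts(block)
```

This software loop walks the block serially. Hardware can instead evaluate the same few-bit test on every halfword at once and then select the valid starting points, which is what keeps the extra pipeline stage cheap.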
Some Source Languages Are Not Really Source Languages
A new ISA often takes a long time to gain widespread adoption. The simplest way of bootstrapping a software ecosystem is to be a good emulation target. Efficient emulation of x86 was an explicit design goal of both AArch64 and PowerPC for precisely this reason (although AArch64 had the advantage of a couple of decades more research in binary translation to draw on in its design). Apple’s Rosetta 2 manages to translate most x86-64 instructions into one or two AArch64 ones.
A few of its features make AArch64 (and, especially, Apple’s slight variation on it) amenable to fast and lightweight x86-64 emulation. The first is having more registers, which allows all x86-64 state to be stored in registers. Second, Apple has an opt-in TSO (total store ordering) model, which makes the memory model the same as x86. (RISC-V has this as an option as well, although I am not aware of an extension that allows dynamically switching between the relaxed memory model and TSO, as Apple’s hardware permits.)
Without this mode, you either need variants of all of your loads and stores that can provide the relevant barriers or you need to insert explicit fences around all of them. The former consumes a huge amount of encoding space (loads and stores make up the largest single consumer of encoding space on AArch64); the latter, many more instructions.
After TSO, flags are the second-most annoying feature of x86 from the perspective of an emulator. Lots of x86 instructions set flags. Virtual PC for Mac (x86 on PowerPC) put a lot of effort into dynamically avoiding setting flags if nothing has consumed them (for example, if two flag-setting instructions were back to back).
QEMU does something similar, preserving the source operands and the opcode of operations that set flags and computing the flags only when something checks the flags’ value. AArch64 has a similar set of flags to x86, so flag-setting instructions can be translated into one or two instructions. Arm did not get this right (from an emulation perspective) in the first version of the ISA. Both Microsoft and Apple (two companies that ship operating systems that run on Arm and need to run a lot of legacy x86 code) provided feedback, and ARMv8.4-CondM and ARMv8.5-CondM added extra modes and instructions for setting these flags differently. Apple goes further with an extension that sets the two flags present in x86 but not Arm in some unused bits of the flags register, where they can be extracted and moved into other flag bits when needed.
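The lazy approach can be sketched in a few lines. This is an illustration of the general idea only, not QEMU's actual data structures: the `LazyFlags` class, its method names, and the restriction to 64-bit add with just zero and carry flags are all assumptions.

```python
MASK64 = (1 << 64) - 1


class LazyFlags:
    """Remember the last flag-setting add; compute flags only when asked."""

    def __init__(self):
        self.pending = None   # operands of the most recent flag-setting add
        self.evaluations = 0  # how many times flags were actually computed

    def add(self, lhs, rhs):
        """Emulate a flag-setting add: record the operands, compute no flags yet."""
        self.pending = (lhs & MASK64, rhs & MASK64)
        return (lhs + rhs) & MASK64

    def flags(self):
        """Materialize the flags from the recorded operands on demand."""
        self.evaluations += 1
        lhs, rhs = self.pending
        total = lhs + rhs
        return {"zero": (total & MASK64) == 0, "carry": total > MASK64}


state = LazyFlags()
state.add(1, 2)                  # flags from this add are never consumed
wrapped = state.add(MASK64, 1)   # back-to-back flag-setting instruction
observed = state.flags()         # only now is any flag computed
```

Two flag-setting operations run, but the flags are evaluated once, and only for the operation whose result something actually inspects.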
RISC-V made the decision not to have condition codes. These have always been a feature that microarchitects hate, for a few reasons. In the worst case (and, somehow, the worst case is always x86), instructions can set only some of the flags. In the case of x86, this is particularly painful because the carry flag and the interrupts-disabled flag are in the same word (which led to some very entertaining operating-system bugs, because the ABI states that the flags register is not preserved across calls, so calling a function to disable interrupts in the kernel was followed by the compiler helpfully reenabling them to restore the flags).
Anything that updates part of a register is painful because it means allocating a new rename register and then doing the masked update from the old value. Even without that, condition codes mean that a lot of instructions update more than one register.
Arm, even in AArch32 days, made this a lot less painful by having variants of instructions that set flags and not setting them for most operations. RISC-V decided to avoid this and instead folds comparisons into branches and has instructions that set a register to a value (typically one or zero) that can then be used with a compare-and-branch instruction such as branch if [not] equal (which can be used with register zero to mean branch if [not] zero).
Emulating x86-64 quickly on RISC-V is likely to be much harder because of this choice.
Avoiding flags also has some interesting effects on encoding density. Conditional branch on zero is incredibly common in C/C++ code for checking that parameters are not null. On x86-64, this is done as a testq (three-byte) instruction, followed by a je (jump if the test set the condition flags for equality), which is a two-byte instruction. This incurs all the annoyances of allocating a new rename register for the flags mentioned previously, including the fact that the flags register remains live until the next flag-setting instruction exits speculation.
The decision to avoid condition codes also makes adding other predicated operations much more difficult. The Arm conditional select and increment instruction looks strange at first glance, but using it provides more than a 10% speedup on some compression benchmarks. This is a moderately large instruction in AArch64: three registers and a four-bit field indicating the condition to test. This means that it consumes 19 bits in operand space. An equivalent RISC-V instruction would either need an additional source register and variants for the comparisons to perform or take a single-source operand but need a comparison instruction to set that register to zero or non-zero first.
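As a sketch of what the instruction does and why it needs 19 operand bits: conditional select and increment yields the first source if the condition holds and the second source plus one otherwise. The Python function below models only that result; the `condition_holds` parameter stands in for testing the flags against the four-bit condition field, and the names are hypothetical.

```python
def csinc(rn, rm, condition_holds, width=64):
    """Model the result: rn if the condition holds, otherwise rm + 1 (wrapping)."""
    mask = (1 << width) - 1
    return rn & mask if condition_holds else (rm + 1) & mask


REGISTER_FIELD_BITS = 5   # enough to name one of 32 registers
CONDITION_FIELD_BITS = 4  # which flag condition to test

# Destination plus two sources, plus the condition field.
operand_bits = 3 * REGISTER_FIELD_BITS + CONDITION_FIELD_BITS

taken = csinc(10, 20, condition_holds=True)
not_taken = csinc(10, 20, condition_holds=False)
```

Because the condition comes from the flags, no operand bits are spent naming the values being compared, which is the part a flag-free ISA has to pay for in some other way.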
Always Measure
In 2015, I supervised an undergraduate student extending an in-order RISC-V core with a conditional move and extending the LLVM back end to take advantage of it. His conclusion was that, for simple in-order pipelines, the conditional move instruction gave a 20% performance increase on several benchmarks, no performance reduction on any of them, and a tiny area overhead. Or, examining the results in the opposite direction, achieving the same performance without a conditional move required around four times as much branch predictor state.
This result, I am told, reflected the analysis that Arm conducted (although didn’t publish) on larger and wider pipelines when designing AArch64. This is, apparently, one of the results that every experienced CPU designer knows but no one bothers to write down.
AArch64 removed almost all the predication but kept a few instructions that had a disproportionately high benefit relative to the microarchitectural complexity. The RISC-V decision to omit conditional move was based largely on a paper by the authors of the Alpha, who regretted adding conditional move because it required an extra read port on their register file. This is because a conditional move must write back either the argument or the original value.
The interesting part of this argument is that it applies to an incredibly narrow set of microarchitectures. Anything that is small enough to not do forwarding does not need to read the old value; it just does not write back a value. Anything that is doing register renaming can fold the conditional move into the register rename logic and get it almost for free. The Alpha happened to be in the narrow gap between the two.
It’s very easy to gain intuitions about what makes an ISA fast or slow based on implementations for a particular scale. These can rapidly go wrong (or start out wrong if you are working on a completely different scale or different problem space). New techniques, such as the way that NVIDIA Project Denver and Apple M-series chips can forward outputs from one instruction to another in the same bundle, can have a significant impact on performance and change the impact of different ISA decisions. Does your ISA encourage compilers to generate code the new technique can accelerate?
If you come back to this article in five to 10 years, remember that technology advances. Any suggestions that I have made here may have been rendered untrue by newer techniques. If you have a good idea, measure it on simulations of different microarchitectures and see whether it makes a difference.