The Mips R4000

This superpipelined RISC processor will be the spearhead of the ACE consortium

Brett Glass

On October 1, Mips Computer Systems formally announced the R4000--the newest member of the Mips family of RISC processors. Like Sun Microsystems' SPARC (see ``SPARC Revealed,'' April BYTE) and IBM's RISC System/6000, this architecture is a strong contender for the hearts, minds, and checkbooks of workstation designers and users. It's also the core of the Advanced Computing Environment (ACE), a workstation standard currently under development by a consortium that includes Compaq, DEC, and Microsoft.
What does the R4000 have to offer, and why did the ACE vendors choose it--before it existed in silicon--over other established (and proven) architectures? This month, I'll take a look at some of the novel features that make the R4000 a competitive RISC for the 1990s.

Extending the Line

The R4000's architecture wasn't created ex nihilo. Rather, it's an extension of earlier Mips designs--the 32-bit R2000, R3000, and R6000--all of which have solid track records. The most important difference, however, is that the R4000 increases the widths of the internal and external data paths--as well as the widths of the addresses, registers, and ALUs--to 64 bits.

Although this expansion makes the chip larger, it has several advantages. First, it allows a bigger address space--large enough to let an operating system map more than a terabyte of files directly into the memory space for easy access. In contrast, a 32-bit address can map ``only'' 4 gigabytes. One-gigabyte drives are now common and relatively inexpensive; 32-bit micros lack sufficient address space to use more than four such drives for virtual memory. (For more on the advantages of 64-bit addressing, see ``64-bit Computing,'' September BYTE.)
With 64-bit data paths, the R4000 can process certain types of data--such as double-precision IEEE floating-point numbers and strings up to eight characters long--in single gulps. Certain important algorithms, such as the Data Encryption Standard, benefit greatly from the enlarged word size. Moreover, the larger word size helps reduce the impact of a minor weakness in the Mips architectures: Because the ALU cannot propagate a carry, multiple-precision additions, subtractions, and shifts take several more instructions than they do on some other RISCs.
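To make the carry limitation concrete, here is a minimal Python sketch (a model, not Mips code) of adding two 128-bit integers held as pairs of 64-bit words when no carry flag exists. The carry out of the low word is recovered with an unsigned comparison, a common technique on flagless architectures. The function name and the (high, low) word layout are illustrative assumptions.

```python
MASK64 = (1 << 64) - 1  # models the width of a 64-bit register

def add128(a_hi, a_lo, b_hi, b_lo):
    """Add two 128-bit values given as (high, low) 64-bit words."""
    lo = (a_lo + b_lo) & MASK64       # low-word add; the carry out is discarded
    carry = 1 if lo < a_lo else 0     # unsigned compare recovers the lost carry
    hi = (a_hi + b_hi + carry) & MASK64
    return hi, lo
```

The extra compare and add are the "several more instructions" the text refers to; a wider word simply means this sequence is needed less often.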

Downward Compatibility

Users with 32-bit Mips software won't be disappointed when the new chips come out. Mips has wisely protected customers' investments in previous generations of Mips software by making the R4000 fully downward-compatible with 32-bit software. All the original 32-bit instructions function exactly as they did before, and the same machine can run 32- and 64-bit software simultaneously with no glitches. This is done in an especially clever way: Operations performed in the R4000's 32-bit mode actually do work on all 64 bits of each register, but in a downward-compatible fashion. Also, software written for ``big-endian'' Mips chips (which store the most significant byte of a variable in the lowest address) and ``little-endian'' Mips chips (which store the least significant byte in the lowest address) can run concurrently on the R4000, as long as programs that have opposite orientations do not try to communicate through shared memory.
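The shared-memory caveat is easier to see with a small example. The following Python sketch (an illustration only, using an arbitrary test value) lays out the same 32-bit word in both byte orders and shows what a reader with the opposite orientation would see.

```python
value = 0x11223344

big = value.to_bytes(4, "big")        # most significant byte at the lowest address
little = value.to_bytes(4, "little")  # least significant byte at the lowest address

def read_word(raw, byteorder):
    """Interpret four bytes of memory using the given byte order."""
    return int.from_bytes(raw, byteorder)

# A little-endian program reading a word stored by a big-endian program
# gets the bytes reversed.
misread = read_word(big, "little")
```

Here `misread` comes out as 0x44332211 rather than 0x11223344, which is why programs with opposite orientations cannot exchange raw words through shared memory.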

Like the R3000, the R4000 has a bank of 32 integer registers (see figure 1), all but two of which are completely general-purpose. As on many RISC chips, register 0--written $0--is hardwired to 0 and discards whatever is written to it. The last of the registers, $31, receives the return address during a subroutine call and thus must be reserved for that purpose. Also, there is a program-counter register and two special arithmetic registers called hi and lo. These last two hold the 64- or 128-bit result after a multiplication and the quotient and remainder after a division.
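As a rough model of what ends up in hi and lo, consider this Python sketch of the unsigned 64-bit case. The function names are hypothetical. The assignment of the quotient to lo and the remainder to hi follows the usual Mips convention, which the text above does not spell out.

```python
MASK64 = (1 << 64) - 1

def multiply_unsigned64(a, b):
    """Return (hi, lo): the two halves of the 128-bit product."""
    product = (a & MASK64) * (b & MASK64)
    return product >> 64, product & MASK64

def divide_unsigned64(a, b):
    """Return (hi, lo): remainder in hi, quotient in lo."""
    return a % b, a // b
```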
All R4000 instructions are 32 bits wide, so they can be fetched from memory two at a time. In typical RISC fashion, all the instructions are laid out in one of three formats, as shown in figure 2: I-type, for instructions that contain immediate constants; J-type, for jumps to absolute addresses; and R-type, for instructions that work on values found in registers.
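Because every instruction is one 32-bit word, decoding amounts to a handful of shifts and masks. The Python sketch below assumes the standard Mips field layout (a 6-bit opcode, 5-bit register fields, a 16-bit immediate, and a 26-bit jump target); the function name and the dictionary keys are illustrative choices.

```python
def decode(word):
    """Split a 32-bit instruction word into the fields used by all three formats."""
    return {
        "opcode": (word >> 26) & 0x3F,     # common to I-, J-, and R-type
        "rs": (word >> 21) & 0x1F,         # I- and R-type source register
        "rt": (word >> 16) & 0x1F,         # I- and R-type second register
        "rd": (word >> 11) & 0x1F,         # R-type destination register
        "shamt": (word >> 6) & 0x1F,       # R-type shift amount
        "funct": word & 0x3F,              # R-type function code
        "immediate": word & 0xFFFF,        # I-type 16-bit constant
        "target": word & 0x3FFFFFF,        # J-type 26-bit target
    }

# ADDI $8, $0, 5 encodes as 0x20080005 (opcode 8, rs 0, rt 8, immediate 5).
fields = decode(0x20080005)
```

Which fields are meaningful depends on the opcode; the sketch extracts all of them and leaves that choice to the caller.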

Load/Store Architecture

The R4000, like most RISCs, is a load/store architecture. All load and store instructions are I-type, and there's only one addressing mode: base register plus 16-bit offset. This addressing mode adds the contents of a register to the 16-bit immediate field to produce the memory address for the load or store.
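The address calculation can be sketched in a few lines of Python. This model assumes, as is standard for Mips loads and stores, that the 16-bit offset is treated as a signed value and sign-extended before the add; the helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def sign_extend16(imm):
    """Interpret a 16-bit field as a signed value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def effective_address(base, offset16):
    """Base register plus sign-extended 16-bit offset, wrapped to 64 bits."""
    return (base + sign_extend16(offset16)) & MASK64
```

A single load or store can therefore reach 32 KB on either side of the base register.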
To load a constant into a register, the programmer (or compiler) can use I-type instructions such as ADDI (for ADD Immediate) with $0 as the source register. The LUI (for Load Upper Immediate) instruction, inherited from earlier processors, loads bits 16 to 31 of a register, so any 32-bit constant can be produced in two instructions without the need to do a load from memory. There's no analogous instruction to load the upper 32 bits of a 64-bit register, though, because instructions are still only 32 bits wide. Therefore, loading a number of 64-bit constants is best done via a memory access or a longer sequence of immediate and arithmetic instructions.
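The two-instruction sequence can be modeled as follows. This Python sketch pairs LUI with ORI (OR Immediate, a standard Mips instruction not named above, which zero-extends its constant) and models only the low 32 bits of the register; on the 64-bit chip the LUI result is also sign-extended into the upper half, which the sketch ignores.

```python
def lui(imm16):
    """Load Upper Immediate: place a 16-bit constant in bits 16 to 31."""
    return (imm16 & 0xFFFF) << 16

def ori(reg, imm16):
    """OR Immediate: merge a zero-extended 16-bit constant into the low bits."""
    return reg | (imm16 & 0xFFFF)

def load_const32(value):
    """Build any 32-bit constant with one LUI followed by one ORI."""
    upper = (value >> 16) & 0xFFFF
    lower = value & 0xFFFF
    return ori(lui(upper), lower)
```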

Most complex instruction-set computers handle ``misaligned'' loads and stores by performing extra memory accesses invisibly, while RISCs often make them illegal. The R4000, however, has explicit instructions that perform misaligned loads and stores in, at most, two instructions.

The R4000 does integer computations using R-type and I-type instructions--depending on whether the operands are two registers or one register and one immediate value. The available operations are the same as in most microprocessors--however, the R4000 implements a NOR instruction, which isn't found in many architectures.

Transfers of Control

J-type instructions execute an unconditional jump or subroutine call, but not in the way you might expect. The 26-bit target field is shifted 2 bits to the left (instructions must be aligned on 32-bit boundaries) and is combined with the high-order bits of the current program counter. This means that a J-type jump instruction must always lead to an instruction in the same 256-MB block of memory. Jumps that use the R-type instructions get their target addresses from 32-bit or 64-bit registers and thus are less restrictive. Jumps on the R4000 execute after a one-instruction delay; in other words, the instruction immediately following the jump instruction is executed before control is transferred to the target address.
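The target calculation for a J-type jump can be written out directly. In this Python sketch, `pc` is assumed to be the address of the instruction in the delay slot, which is the program-counter value the hardware combines with the target field; the function name is illustrative.

```python
def jump_target(pc, target26):
    """Combine the high-order bits of pc with the shifted 26-bit target field."""
    region = pc & ~0x0FFFFFFF                  # keep the bits that select the 256-MB block
    offset = (target26 & 0x3FFFFFF) << 2       # 26-bit word index becomes a 28-bit byte offset
    return region | offset
```

Since 28 bits address 2^28 bytes, the 256-MB limit follows directly from the field width.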

Branches (i.e., conditional jumps and calls) use I-type instructions. Because the R4000 does not have condition-code flags, each branch instruction compares a register either to 0 or to another register and bases its decision about whether to jump on the results of the comparison. One form of a branch, also supported by the R3000, is a delayed branch, in which the instruction immediately following the branch instruction (said to occupy the delay slot) is always executed whether or not the branch is taken.

Programmers (and compilers) can use one of three tactics to fill the delay slot. One way is to move an instruction forward from the block of code before the branch (since the delay slot is logically part of that block). Another is to insert a NOP, effectively wasting the delay slot. The third is to move the instruction at the target of the branch into the delay slot. However, in this case, the moved instruction must not cause any untoward effects if the branch is not taken--a constraint that often precludes this tactic.

To make it easier to shift an instruction from the target of a branch into the delay slot, the R4000 adds a feature found in SPARC but not in earlier Mips processors: annulling branches. These instructions annul, or cancel, the effects of the instruction in the delay slot if a branch is not taken, so it's always possible to shift an instruction from the target to the delay slot. The names of annulling branch instructions have the word likely at the end (e.g., ``Branch on Less Than Zero and Link Likely'') to indicate that they offer the greatest benefit when a branch is likely to be taken.
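The difference between an ordinary delayed branch and an annulling one shows up clearly in a toy interpreter. The Python sketch below is a deliberately simplified model: instructions are tuples, branch targets are list indexes rather than addresses, the only arithmetic operation is an add-immediate, and a delay slot is assumed to hold an "addi". The operation names "bltz" and "bltzl" stand for Branch on Less Than Zero and its likely variant.

```python
def execute(op, regs):
    """Run one ("addi", register, value) instruction."""
    _, reg, value = op
    regs[reg] = regs.get(reg, 0) + value

def run(program):
    """Return the final registers and the indexes of the instructions executed."""
    regs, trace, pc = {}, [], 0
    while pc < len(program):
        op = program[pc]
        if op[0] == "addi":
            execute(op, regs)
            trace.append(pc)
            pc += 1
            continue
        kind, reg, target = op                 # "bltz" or "bltzl"
        taken = regs.get(reg, 0) < 0           # condition is tested before the slot runs
        trace.append(pc)
        if taken or kind == "bltz":            # "bltzl" annuls the slot when not taken
            execute(program[pc + 1], regs)
            trace.append(pc + 1)
        pc = target if taken else pc + 2
    return regs, trace

def sample(start, branch):
    """A five-instruction program whose delay slot is at index 2."""
    return [("addi", "a", start), (branch, "a", 4),
            ("addi", "b", 10), ("addi", "c", 100), ("addi", "d", 1)]
```

Running `sample(1, "bltz")` executes the delay slot even though the branch falls through, while `sample(1, "bltzl")` skips it; with a negative starting value, both forms execute the slot and then jump.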

Subroutine Calls: Jumping and Linking

The R4000 has six instructions--two unconditional and four conditional--that can call subroutines. When executed, these instructions leave a link, or return address, in $31. If the called routine wants to call yet another subroutine, it is responsible for saving the old return address either in another register or on a stack. There are no explicit stack-oriented instructions on the R4000, however, so the logistics of maintaining a stack pointer can be handled by the compiler designer or the operating system in any way either sees fit.

To see how a typical language might create and use a stack, I took a peek at the Unix System V release 4.0 ABI (for Application Binary Interface). This document designates $29 as a stack pointer and specifies a stack-frame structure similar to those of C implementations on other processors. Updating the stack pointer takes slightly more work on the R4000 than on a processor with explicit stack instructions like PUSH and POP. However, the R4000 has the advantage of being completely general. Languages like Forth (which uses two stacks) and Prolog (which can use several) don't need to force themselves to behave like other languages to work on this CPU.
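To illustrate the "slightly more work," here is a minimal Python model of saving and restoring the return address by hand. It assumes a stack that grows toward lower addresses and 8-byte slots; the starting stack address, the dictionary-based memory, and the function names are all hypothetical choices made for the sketch.

```python
regs = {29: 0x7FFF0000, 31: 0}   # $29 is the stack pointer, $31 the link register
memory = {}                       # sparse model of memory, keyed by address

def push_return_address():
    """Two steps where a PUSH would be one: adjust $29, then store $31."""
    regs[29] -= 8
    memory[regs[29]] = regs[31]

def pop_return_address():
    """Two steps where a POP would be one: load $31, then adjust $29."""
    regs[31] = memory[regs[29]]
    regs[29] += 8
```

A routine that calls another would push on entry, let the nested call overwrite $31, and pop before returning. Nothing in the model ties it to $29, which is the generality the text describes: a language that wants two stacks simply keeps two pointers.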

Accessories Included

Mips was the first RISC manufacturer to place a memory management unit on the same chip as the central processor, and the R4000 continues this trend with a still-higher degree of integration. It contains an MMU, two 8-KB caches (one for instructions and one for data), and an FPU built right in.

The R4000 is designed to allow expansion of the internal caches: They can be enlarged to 16 KB on a 0.8-micron implementation and 32 KB on a 0.6-micron design. A controller for a secondary cache (as well as for the two primary caches) is also implemented on-chip. All these devices are managed as coprocessors, which are given numbers from Coprocessor 0 (CP0) to CP3.

The CPU accesses CP0 to control memory management and caching; CP1 is the FPU. The FPU has 32 floating-point registers--twice as many as on older Mips FPUs--and there's a control bit that can be used to mask the new registers from software that doesn't expect them to be present. The FPU is fast, but its instruction set is unremarkable; it's very similar to what you see on other FPUs.

The R4000 was designed to fit into a wide range of systems, from desktops to supermicrocomputers. The cache controllers on the R4000 were designed to support every coherency scheme the designers could imagine--including several that work well in multiprocessing systems. The external bus that communicates with the secondary cache (or with main memory) can be configured to stagger its accesses--a doubleword every three cycles, two every four cycles, four every five cycles, and so on--so that it can match the speed of virtually any memory subsystem. This makes it possible to design systems in which the user can plug in a new CPU and speed up the processor clock without worrying about overtaxing the main memory. The external buses can provide bytewise parity checking or error-correction codes to ensure data integrity, and they offer a peak throughput of 400 MBps.
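A little arithmetic shows what the staggered patterns mean in practice. The Python sketch below assumes a 50-MHz bus clock, a figure inferred from the 400-MBps peak (one 8-byte doubleword per cycle) rather than stated above; the function name is illustrative.

```python
BUS_CLOCK_HZ = 50_000_000   # assumption: 400 MBps peak divided by 8 bytes per cycle
DOUBLEWORD_BYTES = 8

def bus_throughput(doublewords, cycles):
    """Bytes per second when `doublewords` are transferred every `cycles` cycles."""
    return doublewords * DOUBLEWORD_BYTES * BUS_CLOCK_HZ / cycles

patterns = [(1, 3), (2, 4), (4, 5)]
rates_mbps = [bus_throughput(d, c) / 1_000_000 for d, c in patterns]
```

Under that assumption, the three patterns quoted in the text work out to roughly 133, 200, and 320 MBps, so a designer can pick the one that the memory subsystem can sustain.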

The R4000's external buses also contain a unique hardware feature that guarantees accurate timing in a wide range of silicon and board designs. When laying out an R4000 motherboard, a designer adds a trace that leaves the R4000, loops around the motherboard, and returns to another nearby pin--traveling the same distance as the longest connection between the R4000 and the bus control logic. The R4000 uses an internal, high-speed phase-locked loop to measure the length of the loop and adjusts the slew rate of its outputs (i.e., the rate at which they turn on and off) so that the signals arriving at the bus controller have exactly the right timing relationship.

Superscalar vs. Superpipeline

Perhaps the most controversial aspect of the R4000 is the structure of its internal pipeline. Most RISC designers are moving toward superscalar designs, but Mips eschewed this approach and made the R4000 a superpipelined CPU instead.

Most current RISC processors can achieve execution speeds that approach one instruction per system clock cycle, but the race is on to do better. So far, two classes of CPUs have evolved to offer multiple instructions per clock cycle: superscalar and superpipelined architectures.
To understand the relationship between the two approaches, imagine that you own a car wash business and want to increase the number of cars you can wash per hour. You know that each car passes through a series of stages: soapy water, various scrub brushes, a rinse, and a dryer--followed by stations where employees perform manual tasks such as polishing the windows and vacuuming the carpets. Finally, a foreman inspects the car before it's delivered to the customer.

One way to speed things up would be to install a second car wash next to the first one, so that two cars could be processed at one time. This is how a superscalar processor works: The chip designer literally builds a second processor pipeline beside the first. This works well as long as there are no dependencies between what goes on in the two lines. However, if there is a dependency--for instance, if there's only one foreman and he has to run back and forth between the two lines--it's possible for things to bog down due to contention for the shared resource. Similarly, if instructions in two microprocessor pipelines compete for resources--an ALU, a bus interface, or control of a register--a superscalar processor may not be able to deliver optimal performance.

Another way to speed up your car wash would be to increase the rate at which cars enter and leave the car wash. After watching the car wash in operation, you realize that the maximum speed at which cars can advance depends on the time it takes to finish the task that's done at the slowest station--for example, the one at which an employee polishes the front and rear windows. You realize that if you split this job between two stations--one for the front window and another for the rear window--neither one would be the limiting factor any longer, and the line could advance as quickly as the next fastest job could be done. Repeating the procedure, you subdivide the next slowest job, and then the next, until it does not make sense to divide the work any more. This is how superpipelining works: The CPU's internal pipeline is broken up into small, fast stages so that instructions can advance through it more quickly. However, as with the car wash, you can only divide the stages so much before it becomes unfeasible to divide them any more. At that point, it would be time to consider adding a second line--that is, going superscalar.
The R4000's eight-stage pipeline evolved from the five-stage pipeline of the R3000 (see figure 3). The larger number of steps means that the R4000 can process eight instructions simultaneously. Also, because the pipeline advances at double the system clock rate, two instructions can be issued on every clock cycle.
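The car wash arithmetic can be checked with a short Python sketch. It idealizes matters by assuming that a stage can be split into two exactly equal halves and that nothing stalls the line; the function names and the stage times (in minutes) are made up for the example.

```python
def pipeline_throughput(stage_times):
    """Cars per minute: the line advances only as fast as its slowest stage."""
    return 1 / max(stage_times)

def split_slowest(stage_times):
    """One superpipelining step: divide the slowest stage into two equal halves."""
    stages = list(stage_times)
    i = stages.index(max(stages))
    half = stages[i] / 2
    stages[i:i + 1] = [half, half]
    return stages

car_wash = [2, 4, 2]                       # the 4-minute window polish limits the line
superpipelined = split_slowest(car_wash)   # stages of 2, 2, 2, and 2 minutes
superscalar_rate = 2 * pipeline_throughput(car_wash)   # two unmodified lines side by side
```

In this idealized case both approaches double the rate from 0.25 to 0.5 cars per minute. The difference lies in the cost (one extra station versus a whole second line) and in how far each approach can be pushed.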

Mips's architects admit that the superpipelined approach will take them only so far in their efforts to produce a top-performing CPU. Mips believes, however, that superpipelining yields a more uniform increase in overall performance (superscalar CPUs, they say, speed up floating-point operations more than integer operations) and requires less silicon for comparable throughput. Time will tell if they're right. Since most of Mips's competitors are taking the superscalar approach, it won't be long before users will be able to evaluate the trade-offs with real-world benchmarks.

A ``Portable'' Design

Although Mips makes and sells computer systems, it doesn't make chips. Instead, it operates as a ``fabless'' design center and licenses other companies to create the silicon. Thus, it's important to Mips that the design for the R4000 be portable (i.e., manufacturable by a large number of vendors). The companies that actually make Mips chips--including IDT, PSC, LSI Logic, Siemens-Nixdorf, and NEC--all receive masks from Mips as well as copies of the CAD database that was used to produce them.

To make certain that mainstream computer vendors have a choice, each chip maker is contractually obligated to do a plug-compatible ``generic'' version of the chip but may then do others as well. IDT, for instance, produces versions of the R3000 with reduced pin-outs for embedded applications, and LSI Logic produces single-chip microcomputers with integrated peripherals connected to a Mips core.

ACE in the Hole?

On April 9--long before Mips officially announced the R4000--21 companies held a press conference to announce the formation of the ACE consortium. Its purpose was to create a ``nonproprietary, standards-based computing environment that includes two powerful operating systems and two open computer hardware platforms.'' The two hardware platforms were to include PC-compatible systems and a new architecture called Advanced RISC Computing, or ARC, based on the Mips R4000, while the two operating systems were to be Microsoft's NT (for New Technology) environment and SCO Unix. Microsoft also owns a substantial share of The Santa Cruz Operation, so it would have a lock on the operating-system market for the new platform.

A political reason that ACE chose the R4000 over competing RISC architectures was that the other members of the consortium--including Microsoft, DEC, and Compaq--saw Sun as a fierce competitor. They also felt that Sun's close ties to SunSoft (makers of Solaris--formerly SunOS), its control of SPARC distribution channels, and its early access to new and better chips gave it too much of an edge in the SPARC marketplace.

Other reasons were technological: The members of the consortium accepted Mips's arguments that the lean, mean R4000--with its superpipelined architecture--would be able to successfully compete with the first generation of superscalar SPARCs. An important logistical concern was that the R4000's capability to operate as a ``little-endian'' chip set, like the Intel 80xxx series, allowed greater portability of software and data from the PC to the new platform.

At this writing, few technical details of the proposed ARC platform have been announced. Given its technical merits, the R4000 is likely to succeed as a processor in its own right. If the ACE consortium realizes its goals, it will guarantee the R4000 a place on many desktops for the foreseeable future.

ACKNOWLEDGMENT

Many thanks to John Mashey, Andy Keane, Carleen LeVasseur, and Joanne Hasegawa for their help in preparing this article.

Figure 1: Only two of the R4000's registers have special functions: Register 0, following RISC convention, is a bit bucket, and register 31 gets the return address during subroutine calls. The multiply and divide registers store the 64- or 128-bit results of integer multiplication operations or the quotient and remainder of integer division operations.

Figure 2: Every instruction is a 32-bit word, but there are three distinct formats: I-type for immediate operations, J-type for jumps, and R-type for register-to-register operations.

Figure 3: The R4000's eight-stage pipeline advances twice per clock cycle; the R3000's five-stage pipeline advances once per cycle. A superpipelined processor requires less silicon and less logic design effort than a superscalar processor. But since superpipelining depends on fast logic, circuit design is tougher. And since the pipeline can be split into only so many stages, the approach eventually runs out of steam--at which point the designers must contemplate a superscalar design.

Brett Glass is a freelance programmer, author, and hardware designer residing in Palo Alto, California. You can contact him on BIX as ``glass.''