CPU Architecture

The 5 Classic CPU Stages

If we return to our processor schematic, we can break down all the things that a CPU needs to do for every instruction. These responsibilities are called stages:

Fetch the instruction from the instruction memory.
Decode the instruction bits, producing control signals to orchestrate the rest of the processor. Read the operand values from the register file. For example, this stage needs to convert from a binary encoding of each register index into a “one-hot” signal to read from the appropriate register.
EXecute the actual computation for the instruction, using the arithmetic logic unit (ALU): add the numbers, shift the values, whatever the instruction requires.
Access Memory, reading or writing an address in the external data memory. Only some instructions need this stage—just loads and stores.
Write results back into the register file. The result could come from the ALU or from memory, if it’s a load instruction.

As the bolding in this list implies, computer architects often abbreviate these stages with a single letter: F, D, X, M, or W.

Architecture Styles

To design a processor, we have to decide how to map these stages for each instruction onto clock cycles.

One basic trade-off at work here is that we could do more work in a given cycle, but this inevitably makes the clock period longer. There is no “free lunch” available by breaking the work up across multiple cycles. However, doing this does open up an opportunity for a secondary optimization.

There are three main architecture styles:

Single-cycle processor. This is the most obvious approach: do all the work for a single instruction in one cycle. Because it’s a lot of work, the clock period is long, but you can execute $n$ instructions in $n$ cycles.
Multi-cycle processor. Do just one stage per cycle. If we use the 5 stages above, now every instruction takes 5 cycles to execute—but those cycles can be much shorter than for a single-cycle processor. A multi-cycle processor requires adding registers to hold signals between cycles.
Pipelined processor. If you build a multi-cycle processor, you quickly notice that much of your circuit remains idle most of the time. For example, the part of the processor for the Fetch stage is only active every 5th cycle. We can exploit that idle time!

The idea is to overlap the executions of different instructions. While we Decode one instruction, we can simultaneously Fetch the next instruction. If everything overlaps perfectly in a 5-stage pipeline, it takes only $4 + n$ cycles to execute $n$ instructions. And the clock period can be nearly as fast as for a multi-cycle processor.

Pipelining is such a useful idea that the vast majority of real processors use it. Real processors actually tend to break instruction processing into many more than 5 stages. It’s difficult to find public information about the specifics, but, as one data point, this reliable source claims that an oldish Intel processor had somewhere between 14 and 19 stages.

CS 3410

CPU Architecture

The 5 Classic CPU Stages

Architecture Styles