Pipelining & Performance
In this lecture we will consider the massively important topic of processor performance. We’ll first learn how to quantitatively estimate performance. Afterwards, we will analyze the performance of three architecture styles: single-cycle, multi-cycle, and pipelined CPUs.
Iron Law of Processor Performance
First, let’s define what we mean by processor performance. The performance of a processor is simply the amount of time it takes to execute a program, denoted by $\frac{\text{Time}}{\text{Program}}$. The Iron Law of Processor Performance breaks this down into three parts:

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$
In English, the performance of a processor is the product of:
- the number of instructions in the program,
- the number of clock cycles it takes to execute a single instruction (a.k.a., cycles per instruction or CPI),
- and how long a clock cycle is (a.k.a., the clock period1).
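The Iron Law lends itself to quick back-of-the-envelope calculations. Here is a minimal sketch in Python (the function name and the example numbers are illustrative, not from the lecture):

```python
def execution_time_ns(instructions: int, cpi: float, clock_period_ns: float) -> float:
    """Iron Law: Time/Program = Instructions/Program x Cycles/Instruction x Time/Cycle."""
    return instructions * cpi * clock_period_ns

# A hypothetical 1,000-instruction program with a CPI of 1 and a 900 ns clock period:
print(execution_time_ns(1_000, 1, 900))  # 900000 ns total
```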
With the Iron Law of Processor Performance in mind, how can we make a processor that runs programs faster?
We can’t usually change the number of instructions in a program, as that is largely determined by the ISA and the compiler. We do have some control over the CPI and the clock period, but there is a trade-off: we can do more work in a given cycle by decreasing the CPI, but this inevitably makes the clock period longer. Alternatively, we can make the clock period shorter, but this generally means doing less work each cycle. As we’ll see, there is also a third option.
Architecture Styles
Recall our processor schematic depicting the five stages of a CPU: Fetch, Decode, EXecute, Memory, and Writeback. To design a processor, we have to decide how to map these stages for each instruction onto clock cycles.
There are three main architecture styles: single-cycle, multi-cycle, and pipelined.
Single-Cycle Processors
This is the most obvious approach to designing a processor: all the work for a single instruction is done in one cycle. Because there’s a lot of work that needs to be done, the clock period is long. In fact, the clock period must be long enough such that the slowest instruction can complete in a single cycle. As we saw in the last lecture, data transfer instructions take the longest to execute, in particular load instructions2.
Let’s analyze the performance of a single-cycle CPU. Since each instruction takes one cycle to execute, the CPI for single-cycle processors is 1. This means that we can execute n instructions in n (long) cycles.
Multi-Cycle Processors
The key downside to single-cycle processors is that the clock period is tied to the latency3 of the slowest instruction (e.g., load instructions). This means that relatively fast instructions (e.g., instructions that don’t access memory) take the same amount of time as the slowest instruction.
Multi-cycle processors get around this restriction by running just one stage per cycle instead of one instruction per cycle. In this setup, one instruction executes over multiple cycles. To facilitate this, registers must be inserted at the end of each stage to hold control signals and values between cycles4.
These registers allow instructions to take different numbers of cycles to execute, depending on which stages they need to run. For example, the `ld` instruction has work to do in each of the five stages, so it takes five cycles to execute. On the other hand, the `add` instruction can skip the Memory stage and so takes only four cycles to run.
Regarding performance, multi-cycle processors are the opposite of single-cycle processors: they boast a very short clock period but a high CPI, as instructions now take multiple cycles to execute.
Single-Cycle vs. Multi-Cycle
Let’s now compare the performance of single-cycle and multi-cycle processors by comparing their clock periods and CPIs.
The clock period of a single-cycle processor is equal to the time it takes to run each of the five CPU stages (i.e., the latency of the slowest instruction). In comparison, the clock period of a multi-cycle processor is equal to the time it takes to run the longest CPU stage plus some ϵ to account for the overhead of accessing the registers between stages.
The CPI of single-cycle processors is always 1 as each instruction takes one cycle to execute. For multi-cycle processors, the CPI is wholly dependent on what programs are run as different instructions take a different number of cycles to run. Since each program is different, we often use the average CPI to estimate the performance of multi-cycle CPUs.
For example, suppose that we have a program that consists of 20% branch instructions, 20% load instructions, and 60% ALU instructions. On a multi-cycle processor, branch instructions take three cycles, load instructions take five cycles, and ALU instructions take four cycles. The average CPI of a multi-cycle processor given this workload would be
0.2×3+0.2×5+0.6×4=4
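The same weighted average can be checked in a couple of lines of Python (the workload fractions and cycle counts are the ones from the example above):

```python
# (fraction of program, cycles per instruction) for branches, loads, and ALU ops
workload = [(0.2, 3), (0.2, 5), (0.6, 4)]
avg_cpi = sum(fraction * cycles for fraction, cycles in workload)
print(avg_cpi)  # average CPI for this workload: 4
```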
Pipelined Processors
For most workloads, multi-cycle processors are faster than single-cycle processors. But can we do better?
If you build a multi-cycle processor, you quickly notice that much of your circuit remains idle most of the time. For example, the part of the processor for the Fetch stage is only active every ~5th cycle. We can exploit that idle time using pipelining.
The general idea behind pipelining is to overlap the executions of different tasks. In fact, you all likely use pipelining when you do laundry. There are three “stages” to doing laundry: washing, drying, and folding. Let’s assume that it takes 20 minutes for the washing machine to run, 30 minutes for the dryer to run, and 10 minutes for you to fold the dry clothes. A single load of laundry then takes 60 minutes: we first wash the clothes for 20 minutes, move the wet clothes to the dryer to dry for 30 minutes, and lastly spend 10 minutes folding the clothes once the dryer finishes.
Suppose you’re backed up and need to do multiple loads of laundry. You start the same by putting the first load of laundry into the washer. After 20 minutes, you move the wet clothes into the dryer as before. However, at this point you probably put the second load of laundry in the washing machine so that the washing machine and the dryer are running at the same time. It would be inefficient if you waited until after you folded the first load of laundry to start the next load of laundry.
Pipelined processors do very nearly the same thing! While we Decode one instruction, we can simultaneously Fetch the next. Then, in the following cycle, we can eXecute the instruction we just Decoded and Decode the instruction we just Fetched, all while Fetching the next instruction.
We can build pipelined processors in a similar way to multi-cycle ones. Like multi-cycle processors, pipelined processors break the datapath into multiple cycles where each stage completes in one cycle. We also need to add pipeline registers between the stages.
Pipelining is such a useful idea that the vast majority of real processors use it. Real processors actually tend to break instruction processing into many more than 5 stages. It’s difficult to find public information about the specifics, but, as one data point, one reliable source claims that an oldish Intel processor had somewhere between 14 and 19 stages.
Performance of Pipelined Processors
Now let’s consider the performance of a pipelined processor.
Suppose that all of the instructions overlap perfectly in a 5-stage pipeline. In this scenario, the first instruction finishes after the 5th cycle, the second after the 6th cycle, the third after the 7th cycle, and so on. So, on average, an instruction finishes executing every cycle, resulting in a CPI of 1! More precisely, it takes only 4+n cycles to execute n instructions.
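Under this ideal-overlap assumption, the cycle count is easy to sketch (the helper function below is hypothetical, not part of the lecture):

```python
def pipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    # The first instruction needs n_stages cycles to fill the pipeline;
    # after that, one instruction finishes every cycle: (n_stages - 1) + n total.
    return (n_stages - 1) + n_instructions

print(pipelined_cycles(1))     # 5 cycles for one instruction
print(pipelined_cycles(1000))  # 1004 cycles for a thousand instructions
```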
The clock period of a pipelined processor can be nearly as short as that of a multi-cycle processor too! As before, the clock period only needs to be long enough for the slowest stage to execute, plus some additional time to account for the overhead of accessing the pipeline registers.
The table below compares the clock period and the CPI of single-cycle, multi-cycle, and pipelined processors.
Metric | Single-Cycle | Multi-Cycle | Pipelined |
---|---|---|---|
Clock Period | F+D+X+M+W | max(F,D,X,M,W)+ϵM | max(F,D,X,M,W)+ϵP |
Cycles Per Instruction (CPI) | 1 | It depends! | 1 |
As you can see, pipelined processors are the best of both worlds! They have the clock period of multi-cycle processors with the CPI of single-cycle ones!
Single-Cycle vs. Multi-Cycle vs. Pipelined
To drive home the point, let’s see a concrete example!
Suppose that you stumble upon a mysterious program alongside a README containing the following table:
Instruction Type | Stages | Percentage of Program |
---|---|---|
Branches | F,D,X | 20% |
Memory | F,D,X,M,W | 20% |
Arithmetic & Logical | F,D,X,W | 60% |
Something compels you to estimate the performance ($\frac{\text{Time}}{\text{Instruction}}$) of this mystery program. Luckily, you have single-cycle, multi-cycle, and pipelined versions of the same base processor with the following stage latencies:
Stage | Latency (ns) |
---|---|
Fetch | 170 |
Decode | 180 |
EXecute | 200 |
Memory | 200 |
Writeback | 150 |
In the multi-cycle and pipelined versions, let the overhead of the registers between the stages be 5 nanoseconds (ϵM=ϵP=5 ns). We now have everything we need to estimate the performance of our mystery program on each architecture style!
Metric | Single-Cycle | Multi-Cycle | Pipelined |
---|---|---|---|
Clock Period | 900 ns | 205 ns | 205 ns |
Cycles Per Instruction (CPI) | 1 | 4 | 1 |
Performance ($\frac{\text{Time}}{\text{Instruction}}$) | 900 ns | 820 ns | 205 ns |
Notice how the pipelined processor is 4X faster than the multi-cycle processor and ~4.39X faster than the single-cycle processor! Wow!!
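If you want to double-check the table, a short script can reproduce it from the stage latencies and the workload mix (the variable names are mine; the numbers come from the tables above):

```python
stage_latency_ns = {"F": 170, "D": 180, "X": 200, "M": 200, "W": 150}
epsilon_ns = 5  # overhead of the registers between stages

single_cycle_period = sum(stage_latency_ns.values())        # 900 ns
short_period = max(stage_latency_ns.values()) + epsilon_ns  # 205 ns

# (fraction of program, stages used) from the mystery program's README
workload = [(0.2, "FDX"), (0.2, "FDXMW"), (0.6, "FDXW")]
multi_cycle_cpi = sum(frac * len(stages) for frac, stages in workload)  # 4 cycles

print(single_cycle_period * 1)         # single-cycle: 900 ns per instruction
print(short_period * multi_cycle_cpi)  # multi-cycle:  820 ns per instruction
print(short_period * 1)                # pipelined:    205 ns per instruction
```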
It is important to note that pipelined processors don’t execute any one instruction faster than a multi-cycle processor. Actually, the instruction latency of pipelined processors is generally worse than that of multi-cycle processors. What makes pipelined processors fast is their high throughput: they execute multiple instructions in parallel.
Hazards
This is the part of the lecture where I have to come clean and admit that I lied to you. Unfortunately, pipelining isn’t that straightforward.
To see why, suppose that our program contained the following two RISC-V assembly instructions:

```
j EXIT
addi x10, x11, 1
```

After `j EXIT` is done, the next instruction that should be run is not `addi x10, x11, 1`; rather, it should be whatever instruction is after the `EXIT` label. But a pipelined processor will have just finished running the Memory stage of the `addi` instruction! Now all of that work needs to be thrown away, and we need to start again by Fetching the instruction at `EXIT`.
This is just one of the many ways in which pipelining can go wrong; such failure modes are appropriately named hazards! However, they are out of scope for this class. If you’re interested, see Sections 4.8–4.9 in [P&H].
1. The clock period is the inverse of the clock frequency (or clock speed). That is, the clock period is how long a single clock cycle takes, whereas the clock frequency is how many cycles can be run during a fixed unit of time. Clock frequency is often used as a measure of how fast a CPU is, usually in GHz.
2. Load instructions take the longest because the processor needs to do work in every stage to execute a load instruction. On the other hand, the processor doesn’t need to do any work in the Writeback stage for store instructions, which shaves off a couple of nanoseconds.
3. The latency of an instruction is the time it takes to execute that instruction.
4. What would go wrong if we omitted the registers at the end of each stage? Why don’t we need a register at the end of the Writeback stage?