

# **Pipelining & Performance**

CS 3410: Computer System Organization and Programming

Spring 2025







[K. Bala, A. Bracy, E. Sirer, Z. Susag, and H. Weatherspoon]

#### Today's Goals



- How to quantitatively estimate performance
  - Analyze performance / behavior with diagrams
- How to design processors with **better performance**
  - Single-cycle CPU
  - Multi-cycle CPU
  - Pipelined CPU





### Single-Cycle RISC-V Datapath



Clock frequency must be slow enough for the very **slowest** instruction to complete in **1 cycle** 



#### PollEverywhere



#### Which instruction, on average, will take the longest?





#### **Ever been to Chipotle?**





#### That's more like it!





#### Iron Law of Processor Performance

How do we make **a processor** that runs **programs faster**?



**TODAY**: tradeoff between CPI and clock period!
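For reference, the Iron Law is conventionally written as a product of three factors. This is the standard textbook form, given here as a sketch rather than a reproduction of the slide's figure:

$$
\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}
$$

The last two factors are the CPI and the clock period, which are the two quantities traded off in this lecture.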



#### First Step: Shorten Clock Period





# First Step: Multi-Cycle RISC-V Datapath



- Break datapath into **multiple cycles** (here 5)
- Add **registers** to store results at the end of each cycle
- Fetch, decode, and execute **1** instruction over multiple cycles
- Allows instructions to take *different* numbers of cycles
- Opposite of single-cycle: short clock period, high CPI













| Metric                          | Single Cycle       | Multi Cycle                |
|---------------------------------|--------------------|----------------------------|
| Clock Period<br>(time / cycle)  | F + D + X + M + WB | MAX (F, D, X, M, WB) + ε   |
| Cycles Per Instruction<br>(CPI) | 1                  | <b>??</b><br>(It depends!) |

ε is the overhead of accessing stage registers

#### **Use Average CPI –** Depends on what programs (workloads) you run!

- E.g.: Branch: 20% (3 cycles), Load: 20% (5 cycles), ALU: 60% (4 cycles)
  - $CPI = 0.2 \times 3 + 0.2 \times 5 + 0.6 \times 4 = 4$
- *Caveat:* calculation ignores many effects
  - Back-of-the-receipt arguments only (i.e., it's a rough estimate)
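The average-CPI calculation above can be sketched in a few lines of Python. The function name `average_cpi` and the dictionary layout are assumptions made for this illustration; percentages are kept as integers so the arithmetic stays exact.

```python
def average_cpi(mix):
    """Weighted-average CPI for a workload.

    `mix` maps an instruction class to (percent_of_instructions, cycles).
    This is a rough estimate only: it ignores every effect other than the mix.
    """
    total_percent = sum(percent for percent, _ in mix.values())
    if total_percent != 100:
        raise ValueError("instruction mix must sum to 100%")
    return sum(percent * cycles for percent, cycles in mix.values()) / 100


# Instruction mix from the example above.
workload = {
    "branch": (20, 3),
    "load": (20, 5),
    "alu": (60, 4),
}

print(average_cpi(workload))  # 4.0
```

Changing the mix (for example, a load-heavy workload) changes the result, which is why the multi-cycle CPI "depends".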



| Metric                              | Single Cycle                   | Multi Cycle              |
|-------------------------------------|--------------------------------|--------------------------|
| Clock Period<br>(time / cycle)      | F + D + X + M + WB             | MAX (F, D, X, M, WB) + ε |
| Cycles Per Instruction<br>(CPI)     | 1                              | (It depends!)            |
| Performance<br>(time / instruction) | Multiply down to see who wins! |                          |



| Metric                              | Single Cycle | Multi Cycle |
|-------------------------------------|--------------|-------------|
| Clock Period<br>(time / cycle)      | 900 ns       | 205 ns      |
| Cycles Per Instruction<br>(CPI)     | 1            | 4           |
| Performance<br>(time / instruction) | 900 ns       | 820 ns      |

- Some concrete numbers:
  - Stage latency: **F** = 170 ns, **D** = 180 ns, **X** = 200 ns, **M** = 200 ns, **W** = 150 ns, Register (ε) = 5 ns
  - Branch: 20% (3 cycles), Load: 20% (5 cycles), ALU: 60% (4 cycles)
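The numbers in the table follow directly from the stage latencies. Below is a minimal sketch of "multiplying down" each column; the helper names are hypothetical, and the multi-cycle CPI of 4 is taken from the instruction mix listed above.

```python
def single_cycle_period(stage_latencies_ns):
    # One long cycle must cover every stage back to back.
    return sum(stage_latencies_ns.values())


def multi_cycle_period(stage_latencies_ns, register_overhead_ns):
    # The cycle must fit the slowest stage plus the stage-register overhead.
    return max(stage_latencies_ns.values()) + register_overhead_ns


def time_per_instruction(clock_period_ns, cpi):
    # Performance (time / instruction) = clock period x CPI.
    return clock_period_ns * cpi


stages = {"F": 170, "D": 180, "X": 200, "M": 200, "W": 150}
REGISTER_OVERHEAD_NS = 5
MULTI_CYCLE_CPI = 4  # average CPI for the 20/20/60 mix above

single = time_per_instruction(single_cycle_period(stages), 1)
multi = time_per_instruction(
    multi_cycle_period(stages, REGISTER_OVERHEAD_NS), MULTI_CYCLE_CPI
)
print(single, multi)  # 900 820
```

With a different mix (say, an average CPI of 4.5), the multi-cycle design would take 922.5 ns per instruction and lose to the single-cycle design, so the winner depends on the workload.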



#### Which of the following statements is true?

- A. A multi-cycle CPU is *always* faster than a single-cycle CPU
- B. Multi-cycle CPUs are *always* more efficient than single-cycle CPUs (i.e., less ti...
- C. Adding more stages *always* makes the clock period shorter
- D. Multi-cycle CPUs have more complex *control logic* than single-cycle CPUs
- E. B & D




#### Is multi-cycle better?

#### "When you see a good move, look for a better one."



-Emanuel Lasker





















#### Only **one stage** of the CPU is active per cycle!



# Pipelining

An implementation technique in which multiple instructions are **overlapped** in execution.



# **Pipelining Example: Laundry**

- Doing 1 load of laundry requires the sequence:
  - Wash
  - Dry
  - Fold





#### Laundry Example



































## Multi-Cycle $\rightarrow$ Pipelined



**Multi-cycle:** Each instruction takes a variable number of **short cycles**, based on the work that needs to be done

**Pipelined:** Each instruction takes *n* **short cycles** no matter what, but **multiple** instructions run **in parallel**





## **Principles of Pipelining**

Break datapath into **multiple cycles** (5 for our RISC-V example)

- Parallel execution increases throughput
- Balanced pipeline very important
  - Slowest stage determines clock rate
  - Imbalance kills performance

#### Add **pipeline registers (flip-flops)** for isolation

- Stage *begins* by reading values *from* previous register
- Stage *ends* by writing values *to* next register

**Throughput:** the number of tasks completed in a fixed period of time
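To see why overlapping work raises throughput, here is a minimal sketch that compares total time with and without overlap. It assumes perfectly balanced stages and no stalls; the function names and the choice of 1000 instructions are illustrative assumptions, while the 205 ns cycle comes from the concrete numbers earlier in the lecture.

```python
def sequential_cycles(num_tasks, num_stages):
    # Without overlap, each task occupies the whole datapath for all stages.
    return num_tasks * num_stages


def pipelined_cycles(num_tasks, num_stages):
    # The first task needs num_stages cycles to fill the pipeline;
    # after that, one task finishes every cycle.
    if num_tasks == 0:
        return 0
    return num_stages + num_tasks - 1


NUM_INSTRUCTIONS = 1000
NUM_STAGES = 5
CYCLE_NS = 205

sequential_ns = sequential_cycles(NUM_INSTRUCTIONS, NUM_STAGES) * CYCLE_NS
pipelined_ns = pipelined_cycles(NUM_INSTRUCTIONS, NUM_STAGES) * CYCLE_NS
print(sequential_ns, pipelined_ns)  # 1025000 205820
```

The latency of a single instruction is unchanged (5 cycles either way); only the number of instructions completed per unit of time improves, approaching a factor of `NUM_STAGES` as the instruction count grows.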



#### **Pipeline Stages**

| Stage               | Functionality | Values of Interest<br>(to be latched) |
|---------------------|---------------|---------------------------------------|
|                     |               |                                       |
|                     |               |                                       |
|                     |               |                                       |
|                     |               |                                       |
|                     |               |                                       |

Consider a non-pipelined processor with a clock period p (e.g., 50ns). If you divide the processor into n stages (e.g., 5), your new clock period would be:
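One way to reason about this question is to reuse the clock-period expression from the performance tables in this lecture. The symbols $t_i$ (latency of stage $i$) and $T_{\text{new}}$ are assumptions introduced here for illustration:

$$
T_{\text{new}} = \max_i(t_i) + \varepsilon \;\ge\; \frac{p}{n} + \varepsilon, \qquad \text{where } \sum_{i=1}^{n} t_i = p
$$

With $p = 50$ ns and $n = 5$, perfectly balanced stages would give $10\,\text{ns} + \varepsilon$; any imbalance between stages makes the new period longer than that.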






# RISC-V is *Designed* for Pipelining

Instructions are all the same length (32 bits)

- Easy to fetch
- Easy to decode

Few instruction formats

- Easy to decode
- Easy to route bits between stages
- Can read a register source before even knowing what the instruction is!

Memory accessed through lw and sw only

- Access memory after ALU


| Instruction fields, from high bits (left) to low bits (right) |
|---------------------------------------------------------------|
| funct, rs2, rs1, funct3, rd, op                               |
| imm, rs1, funct3, rd, op                                      |
| imm, rs2, rs1, funct3, imm, op                                |
| imm, rd, op                                                   |

#### **RISC-V** Pipelining in Action!

1. add x3, x1, x3
2. and x6, x4, x5
3. lw x4, 20(x7)
4. sub x5, x2, x5
5. sw x7, 12(x3)



# Pipelining in Action (1)

- F: add x3, x1, x3

# Pipelining in Action (2)

- F: and x6, x4, x5
- D: add x3, x1, x3

# Pipelining in Action (3)

- F: lw x4, 20(x7)
- D: and x6, x4, x5
- X: add x3, x1, x3

# Pipelining in Action (4)

- F: sub x5, x2, x5
- D: lw x4, 20(x7)
- X: and x6, x4, x5
- M: add x3, x1, x3

# Pipelining in Action (5)

- F: sw x7, 12(x3)
- D: sub x5, x2, x5
- X: lw x4, 20(x7)
- M: and x6, x4, x5
- W: add x3, x1, x3
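The cycle-by-cycle snapshots above follow a simple rule: in cycle *c*, the stage at depth *d* holds instruction number *c - d*. The sketch below prints the full sequence, including the cycles in which the pipeline drains. It assumes an ideal 5-stage pipeline with no stalls; `STAGES` and `stage_occupancy` are names made up for this illustration.

```python
STAGES = ["F", "D", "X", "M", "W"]


def stage_occupancy(program, cycle):
    """Map each stage to the instruction it holds in the given 1-indexed cycle.

    Assumes one instruction is fetched per cycle and nothing ever stalls.
    A stage that holds no instruction maps to None.
    """
    occupancy = {}
    for depth, stage in enumerate(STAGES):
        index = cycle - 1 - depth  # which instruction has reached this stage
        occupancy[stage] = program[index] if 0 <= index < len(program) else None
    return occupancy


program = [
    "add x3, x1, x3",
    "and x6, x4, x5",
    "lw x4, 20(x7)",
    "sub x5, x2, x5",
    "sw x7, 12(x3)",
]

# n instructions on a k-stage pipeline take n + k - 1 cycles in total.
for cycle in range(1, len(program) + len(STAGES)):
    row = stage_occupancy(program, cycle)
    cells = [f"{stage}: {row[stage] or '-'}" for stage in STAGES]
    print(f"cycle {cycle}: " + " | ".join(cells))
```

Cycle 5 is the first cycle in which all five stages are busy, which matches the last snapshot above.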



#### Interface vs. Implementation

**Pipelining** is a powerful technique to mask latencies and increase **throughput** 

- Logically, instructions execute one at a time
- Physically, instructions execute in parallel

#### Abstraction promotes decoupling

- Interface (ISA) vs. implementation (pipeline)

Compiler Thinks About:

1. add x3, x1, x3
2. and x6, x4, x5
3. lw x4, 20(x7)
4. sub x5, x2, x5
5. sw x7, 12(x3)

Architect Builds:



#### Pipelining is great because...

- A. You can fetch and decode the same instruction at the same time.
- B. You can fetch two instructions at the same time.
- C. You can fetch one instruction while decoding another.
- D. Instructions only need to visit the pipeline stages that they require.
- E. C & D


#### **CPU Performance**

| Metric                              | Single Cycle                   | Multi Cycle                | Pipelined                  |
|-------------------------------------|--------------------------------|----------------------------|----------------------------|
| Clock Period<br>(time / cycle)      | F + D + X + M + W              | MAX (F, D, X, M, W)<br>+ ε | MAX (F, D, X, M, W)<br>+ ε |
| Cycles Per<br>Instruction<br>(CPI)  | 1                              | (It depends!)              | 1                          |
| Performance<br>(time / instruction) | Multiply down to see who wins! |                            |                            |

Pipelining is the best of both worlds!!
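Plugging in the concrete numbers used earlier in the lecture (900 ns single-cycle period, 205 ns multi-cycle period, average multi-cycle CPI of 4) gives a rough comparison. The pipelined row is an idealized estimate: it assumes the same 5 ns register overhead, a pipeline that is always full, and no stalls.

$$
\begin{aligned}
\text{Single-cycle:} &\quad 900\,\text{ns} \times 1 = 900\,\text{ns per instruction} \\
\text{Multi-cycle:} &\quad 205\,\text{ns} \times 4 = 820\,\text{ns per instruction} \\
\text{Pipelined (ideal):} &\quad 205\,\text{ns} \times 1 = 205\,\text{ns per instruction}
\end{aligned}
$$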



#### **Pipeline Diagrams**

[Figure: pipeline diagram showing instructions moving through the pipeline stages over successive cycles]

What two-instruction sequence would not function correctly given this pipelined processor?

- `lui x1, imm` followed by `addi x2, x1, 3`
- `j LABEL` followed by `<any instruction>`

#### Today's Goals



- How to quantitatively estimate performance
  - Analyze performance / behavior with diagrams
- How to design processors with **better performance**
  - Single-cycle CPU
  - Multi-cycle CPU
  - Pipelined CPU



