# Tiered Memory Management: Access Latency is the Key!



Midhul Vuppalapati





**Rachit Agarwal** 

Cornell University.

## **Classical memory architecture in servers**



Modern applications demand larger memory capacity and bandwidth: in-memory caches, graph processing engines, ML frameworks, ...

- 37% of Meta's server costs
- 50% of Microsoft Azure server costs

## Memory contributes to large fractions of datacenter cost

## **Classical memory architecture: DRAM connected via DDR memory interconnect**

## **Classical memory architecture has reached scaling limits**

Memory interconnect is increasingly oversubscribed



Processor pin and signaling limitations

## **Processor core counts, concurrency-per-core are increasing**

## DDR memory interconnect bandwidth is difficult to scale

## **Emergence of tiered memory architectures**

New memory tiers via alternate interconnects



## **Example alternate interconnect:** Compute Express Link (CXL)

Transparent, cache-coherent access to memory (via standard load/store)

## Memory tiers have different performance characteristics

Example: CXL-attached memory (compared to DDR-attached memory)

- Up to 1.04x additional bandwidth
- 2x higher access latency

- [MemStrata, OSDI'24]: up to 1.61x
- [Pond, ASPLOS'23]: up to 1.85x
- [Demystifying CXL, MICRO'23]: up to 2x

## Data placement across tiers critically impacts application performance

## Software-based tiered memory management

Goal: Transparently adapt page placement across tiers to maximize application performance







Manual sweep of different possible page placements

HeMem [SOSP'21]: identifies and places all hot pages in the default tier

## **Application throughput**



## Implicit assumption: default tier access latency < alternate tier access latency

Despite default tier serving the hottest pages










## **Access Latency is the key!**



## Packing hottest pages in default tier is no longer optimal

## **Colloid overview**



- **Key principle: Principle of balancing access latencies.** Adapt page placement to balance (loaded) access latencies of tiers
- **Access latency with nanosecond-scale precision.** Low-overhead mechanism to measure per-tier access latency
- **Page placement algorithm.** Decide which set of pages to place in each tier
- **Integration with existing systems.** Leverages existing memory management innovations
- **Evaluation.** Understand effectiveness over a wide range of workloads

## **Principle of balancing access latencies**

Adapt page placement to balance (loaded) access latencies of tiers
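To make the principle concrete, here is a minimal sketch in Python. It is not taken from the talk: the latency curve (`loaded_latency`), the tier parameters, and the bisection search are all illustrative assumptions. It models each tier's loaded latency as growing with the load placed on it, then finds the split of accesses at which both tiers see the same loaded latency.

```python
# Toy model (assumption): loaded latency rises sharply as a tier's load
# approaches its capacity. Units are arbitrary; only the shape matters.
def loaded_latency(unloaded_ns, load, capacity):
    utilization = min(load / capacity, 0.99)  # clamp to avoid division by zero
    return unloaded_ns / (1.0 - utilization)


def average_latency(p, total_load, default, alternate):
    """Return (average, default, alternate) latency when a fraction p of
    accesses is served by the default tier and 1 - p by the alternate tier."""
    l_d = loaded_latency(default["unloaded_ns"], p * total_load, default["capacity"])
    l_a = loaded_latency(alternate["unloaded_ns"], (1.0 - p) * total_load,
                         alternate["capacity"])
    return p * l_d + (1.0 - p) * l_a, l_d, l_a


def balance_point(total_load, default, alternate, iters=50):
    """Bisect on p to find the split where both loaded latencies are equal.
    If the default tier is faster even at p = 1, the result converges to 1."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        _, l_d, l_a = average_latency(mid, total_load, default, alternate)
        if l_d < l_a:
            lo = mid  # default tier is faster: shift more accesses to it
        else:
            hi = mid  # default tier is slower: shift accesses away from it
    return (lo + hi) / 2.0


# Hypothetical tier parameters, chosen only for illustration.
DEFAULT_TIER = {"unloaded_ns": 80.0, "capacity": 100.0}
ALTERNATE_TIER = {"unloaded_ns": 160.0, "capacity": 50.0}

if __name__ == "__main__":
    load = 90.0
    p = balance_point(load, DEFAULT_TIER, ALTERNATE_TIER)
    print("balanced split:", p, average_latency(p, load, DEFAULT_TIER, ALTERNATE_TIER))
    print("all in default:", average_latency(1.0, load, DEFAULT_TIER, ALTERNATE_TIER))
```

In this toy setting, sending every access to the default tier drives its loaded latency far above the alternate tier's unloaded latency, while the balanced split yields a lower average. The sketch treats "fraction of accesses" as a free knob; a real system can only change it indirectly by migrating pages.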



## Access latency with nanosecond-scale precision

Fundamental design aspects of CPU-to-memory datapath enable fine-grained visibility into per-tier access latency

[Understanding the host network, SIGCOMM'24]



Insight #1: Transparent routing of requests to tiers provides vantage point

Insight #2: Requests remain queued until they are serviced from tier
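One way to turn the two insights above into a number is Little's law: if requests to a tier sit in a queue until they are serviced, the average time in the queue equals average occupancy divided by arrival rate. The sketch below assumes two hypothetical per-tier counters sampled over a window (a cumulative occupancy sum and a count of inserted requests); the counter names and the sampling interface are illustrative, not the hardware interface used in the talk.

```python
def per_tier_latency_ns(occupancy_sum, inserts, cycle_ns):
    """Estimate average access latency for one tier via Little's law.

    occupancy_sum: sum over cycles in the window of the number of queued
                   requests destined for this tier (hypothetical counter)
    inserts:       number of requests for this tier that entered the queue
                   during the window (hypothetical counter)
    cycle_ns:      duration of one counter cycle in nanoseconds

    Average occupancy = occupancy_sum / cycles and arrival rate =
    inserts / (cycles * cycle_ns), so the window length cancels out.
    """
    if inserts == 0:
        return None  # no traffic to this tier in the window
    return occupancy_sum / inserts * cycle_ns


if __name__ == "__main__":
    # Made-up counter deltas for two tiers over the same window.
    print(per_tier_latency_ns(2_000_000, 10_000, 0.5))
    print(per_tier_latency_ns(900_000, 2_000, 0.5))
```

Because only counter deltas are needed, an estimate like this can be refreshed every measurement window without tracing individual requests.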



## **Page placement algorithm**

Executes periodically at fixed time intervals (quanta) and adapts page placement based on access latencies
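The per-quantum loop can be sketched as a simple feedback controller. This is a deliberately simplified stand-in, not the algorithm presented in the talk: it uses a proportional step on the relative latency gap, and the latency model (`toy_latency_ns`), tier parameters, `gain`, and `tolerance` are all assumptions made for illustration.

```python
def toy_latency_ns(unloaded_ns, load, capacity):
    """Assumed latency curve: latency grows as load approaches capacity."""
    utilization = min(load / capacity, 0.99)
    return unloaded_ns / (1.0 - utilization)


def next_default_share(p, lat_default_ns, lat_alternate_ns, gain=0.1, tolerance=0.02):
    """One quantum: return the new fraction of accesses for the default tier.

    If the default tier is slower, shift accesses to the alternate tier;
    if it is faster, shift accesses back. Do nothing once the latencies
    are within the tolerance of each other."""
    gap = (lat_default_ns - lat_alternate_ns) / max(lat_default_ns, lat_alternate_ns)
    if abs(gap) <= tolerance:
        return p
    return min(1.0, max(0.0, p - gain * gap))


def run_quanta(p, total_load, quanta=100):
    """Simulate repeated quanta against the toy model (hypothetical tiers:
    80 ns / capacity 100 for default, 160 ns / capacity 50 for alternate)."""
    for _ in range(quanta):
        lat_d = toy_latency_ns(80.0, p * total_load, 100.0)
        lat_a = toy_latency_ns(160.0, (1.0 - p) * total_load, 50.0)
        p = next_default_share(p, lat_d, lat_a)
    return p


if __name__ == "__main__":
    p = run_quanta(1.0, total_load=90.0)  # start with everything in default
    print("share under heavy load:", p)
    print("share after load drops:", run_quanta(p, total_load=40.0))
```

Since the controller re-reads latencies every quantum, a change in load moves the share without any explicit change detection: in this toy run, a drop in load returns all accesses to the default tier.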





Handling dynamic changes in workload



## **Colloid integrates with existing systems**

Implemented on top of three state-of-the-art tiered memory management systems



| Design dimensions      | HeMem [SOSP'21] | TPP [ASPLOS'23] | MEMTIS [SOSP'23] |
|------------------------|-----------------|-----------------|------------------|
| Access tracking        |                 |                 |                  |
| Page migration         |                 |                 |                  |
| […] size determination |                 |                 |                  |
| […] latency measurement | Colloid        | Colloid         | Colloid          |
| Page placement         | Colloid         | Colloid         | Colloid          |
| Lines of code          | 520             | 411             | 315              |



Colloid enables existing systems to achieve near-optimal performance



Colloid benefits translate to end-to-end performance improvements for real applications



(Controlled load on default tier memory interconnect generated via memory antagonist)



Colloid consistently enables benefits for a wide variety of workloads and hardware parameters



**Varying object size, core count, read/write ratio**: Colloid continues to achieve near-optimal performance

**Dynamic workloads**: Colloid does not impact the timescale for reacting to changes in access patterns, and enables reacting to changes in memory interconnect contention at a similar timescale

**Varying alternate tier unloaded latency**: Colloid provides benefits even with larger alternate tier unloaded latencies

**Real applications**: Colloid achieves up to 2.1x improvement in end-to-end performance

- Graph processing engine
- In-memory Silo transactional database
- In-memory key-value cache

## Access latency is the key!



Enables existing systems to realize the principle of balancing access latencies

https://github.com/host-architecture/colloid