Page Reclaim

What happens when memory runs low

What Is Page Reclaim?

When the system runs low on free memory, the kernel must reclaim pages from existing users. Page reclaim finds pages that can be freed or written to swap, making room for new allocations.

Memory Pressure
      v
┌─────────────────┐
│  Page Reclaim   │
├─────────────────┤
│ - File pages    │──> Write dirty pages, drop clean
│ - Anonymous     │──> Swap out
│ - Slab caches   │──> Shrink caches
└─────────────────┘
      v
Free Pages Available

Two Paths to Reclaim

1. Background Reclaim (kswapd)

kswapd is a per-node kernel thread that reclaims pages in the background when memory falls below the low watermark.

Free pages fall below low watermark
           v
    Wake kswapd
           v
    Reclaim pages until high watermark reached
           v
    kswapd sleeps

Key characteristic: Processes don't block. kswapd works asynchronously.
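
The low/high hysteresis can be sketched as a small simulation (illustrative Python, not kernel code; the watermark values are arbitrary):

```python
# Toy model of kswapd's watermark hysteresis: wake below `low`,
# reclaim until `high`, then sleep again.
def kswapd_step(free_pages, low, high, reclaim_batch=32):
    """Return (new_free_pages, kswapd_woke) after one check."""
    if free_pages >= low:
        return free_pages, False       # above low: kswapd keeps sleeping
    while free_pages < high:           # reclaim past `low`, up to `high`,
        free_pages += reclaim_batch    # so it does not wake again immediately
    return free_pages, True

free, woke = kswapd_step(free_pages=100, low=128, high=256)
# woke is True and free is now at least the high watermark
```

Reclaiming up to high rather than just past low is what keeps kswapd from waking again on every allocation.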

2. Direct Reclaim

When kswapd can't keep up and free pages fall below the min watermark, allocating processes must reclaim pages themselves.

Allocation request
        v
    Free pages < min?
        │ yes
        v
    Direct reclaim (process blocks)
        v
    Retry allocation

Key characteristic: The allocating process blocks until pages are freed.
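
The same flow as code (illustrative Python; the real slow path in mm/page_alloc.c also retries with compaction and eventually falls back to the OOM killer):

```python
# Toy allocation slow path: below `min`, the caller reclaims synchronously.
def allocate(free_pages, min_wm, reclaim):
    """`reclaim` models direct reclaim; the caller blocks inside it."""
    while free_pages < min_wm:
        freed = reclaim()              # the process does the work itself
        if freed == 0:
            return free_pages, False   # nothing reclaimable: OOM territory
        free_pages += freed
    return free_pages - 1, True        # watermark met: take one page

calls = iter([3, 0])
free, ok = allocate(free_pages=2, min_wm=4, reclaim=lambda: next(calls))
# ok is True: one round of direct reclaim freed enough to satisfy the request
```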

Watermarks

Each zone has three watermarks controlling reclaim behavior:

             ┌──────────────────┐
  high ─────>│                  │  Zone healthy, no action
             ├──────────────────┤
  low  ─────>│  Wake kswapd     │  Background reclaim starts
             ├──────────────────┤
  min  ─────>│  Direct reclaim  │  Allocating process helps
             ├──────────────────┤
             │  OOM territory   │  Kill processes
             └──────────────────┘

Watermarks are derived from vm.min_free_kbytes:

# View watermarks
cat /proc/zoneinfo | grep -E "min|low|high"

# Adjust base value (affects all watermarks)
sysctl vm.min_free_kbytes=65536
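
The derivation can be sketched roughly as follows (illustrative Python; this ignores vm.watermark_scale_factor, added in v4.6, and lowmem reserves, so real values will differ):

```python
# Classic watermark derivation: min_free_kbytes is split across zones
# in proportion to zone size, then low/high are scaled up from min.
def zone_watermarks(min_free_kbytes, zone_fraction, page_kb=4):
    min_pages = int(min_free_kbytes * zone_fraction) // page_kb
    return {
        "min": min_pages,
        "low": min_pages + min_pages // 4,    # min * 5/4
        "high": min_pages + min_pages // 2,   # min * 3/2
    }

zone_watermarks(65536, zone_fraction=1.0)
# → {'min': 16384, 'low': 20480, 'high': 24576}
```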

LRU Evolution

Single-list LRU and the streaming problem

The original page cache LRU was a single list: recently accessed pages at the head, oldest pages at the tail. Eviction took from the tail. Simple and correct for workloads with stable working sets.

The problem: a process that reads a large file sequentially ("streaming") accesses thousands of pages exactly once. As each page lands at the head of the LRU, it pushes all the hot working-set pages toward the tail. By the time the streaming read finishes, the kernel has evicted the database buffer pool or the application's code pages to make room for file data that will never be read again.

This was observed early in Linux history. The fix — two lists — was in place by Linux 2.6.

Two-list LRU (active + inactive)

The two-list model divides the LRU into active and inactive lists:

  • New pages enter the inactive list (on first fault/read)
  • A page accessed a second time is promoted to active
  • Eviction takes pages from inactive only; active pages are protected

A streaming read fills the inactive list, but those pages never get a second access, so they're evicted without ever touching the active list. The hot working set stays in active and survives the stream.
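
The streaming resistance is easy to demonstrate with a toy model (illustrative Python; the kernel's lists also balance their sizes and track referenced bits differently):

```python
from collections import OrderedDict

class TwoListLRU:
    """First access -> inactive list; second access -> promoted to active."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.active = OrderedDict()     # protected hot pages
        self.inactive = OrderedDict()   # eviction candidates

    def access(self, page):
        if page in self.active:
            self.active.move_to_end(page)
        elif page in self.inactive:     # second access: promote
            del self.inactive[page]
            self.active[page] = True
        else:                           # first access: inactive tail
            self.inactive[page] = True
        while len(self.active) + len(self.inactive) > self.capacity:
            victim_list = self.inactive or self.active
            victim_list.popitem(last=False)   # evict oldest, inactive first

lru = TwoListLRU(capacity=4)
for p in ("hot1", "hot2", "hot1", "hot2"):     # hot set, accessed twice
    lru.access(p)
for i in range(100):                           # streaming: 100 one-shot pages
    lru.access(f"stream{i}")
# hot1/hot2 survive in active; stream pages churn through inactive only
```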

This worked well for two decades but had a fundamental limitation: a single bit of information ("has this page been accessed since it landed on the inactive list?") is a coarse approximation of "how hot is this page?" A page accessed continuously for the last hour looks identical to a page accessed just twice: both end up in active.

Multi-Gen LRU (MGLRU, Linux 6.1, 2022)

Yu Zhao (Google) designed MGLRU after observing memory-constrained Android devices thrashing between hot and warm pages under the two-list model. The core insight: instead of two categories (active/inactive), use multiple generations representing time bands.

Pages are assigned a generation number when faulted. Periodically, the kernel "ages" the generations: recently accessed pages get a newer generation number; pages that haven't been accessed stay in their old generation. Eviction always takes the oldest generation first.

This gives the kernel much finer-grained information about page age: a page in generation 8 is older than generation 12, and the eviction algorithm can make better choices. MGLRU also uses hardware dirty bits and access bits more aggressively to track page activity without explicit software tracking on every access.
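
A minimal sketch of the generation model (illustrative Python; the real implementation keeps per-lruvec generation counters and encodes the generation in page flags):

```python
class GenLRU:
    def __init__(self):
        self.gen_of = {}          # page -> generation number
        self.current_gen = 0

    def fault(self, page):
        self.gen_of[page] = self.current_gen    # join the youngest generation

    def age(self, accessed_pages):
        """Open a new generation; recently accessed pages move into it."""
        self.current_gen += 1
        for page in accessed_pages:
            if page in self.gen_of:
                self.gen_of[page] = self.current_gen

    def evict(self):
        """Evict the oldest generation first."""
        oldest = min(self.gen_of.values())
        victims = [p for p, g in self.gen_of.items() if g == oldest]
        for p in victims:
            del self.gen_of[p]
        return victims
```

After a few aging passes, pages sort themselves into bands: a page in an old generation has provably not been accessed for several aging intervals, which is strictly more information than a single active/inactive bit.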

LRU Lists

The kernel tracks page usage with LRU (Least Recently Used) lists. Pages move between lists based on access patterns.

Classic LRU (Two-List)

              Active                    Inactive
         ┌─────────────┐           ┌─────────────┐
  Hot    │   Recent    │           │   Recent    │
 pages   │   access    │ ────────> │   access    │ ──> Reclaim
         │             │  demote   │             │
         └─────────────┘           └─────────────┘
               ^                          │
               │        promote           │
               └──────────────────────────┘

Four lists per node (two types x two states):

  • LRU_INACTIVE_ANON - Inactive anonymous pages
  • LRU_ACTIVE_ANON - Active anonymous pages
  • LRU_INACTIVE_FILE - Inactive file-backed pages
  • LRU_ACTIVE_FILE - Active file-backed pages

Multi-Gen LRU (MGLRU)

Introduced in v6.1 behind CONFIG_LRU_GEN, MGLRU replaces the two-list model with multiple generations for better aging accuracy (commit ec1c86b25f4b).

Generation 0 (oldest) ──> Generation 1 ──> ... ──> Generation N (youngest)
      v
   Reclaim

Benefits:

  • Better detection of hot vs cold pages
  • Reduced CPU overhead from page table scanning
  • Improved performance under memory pressure

# Check if MGLRU is enabled (hex bitmask)
cat /sys/kernel/mm/lru_gen/enabled
# 0x0007 = fully enabled (0x0001 | 0x0002 | 0x0004)

# Enable all MGLRU features (if built with CONFIG_LRU_GEN)
echo 0x0007 > /sys/kernel/mm/lru_gen/enabled
# Bits: 0x0001=base, 0x0002=clear accessed in leaf PTEs, 0x0004=non-leaf PTEs

What Gets Reclaimed

File Pages (Page Cache)

Pages backed by files on disk:

State   Action
Clean   Drop immediately (can re-read from disk)
Dirty   Write to disk first, then drop

Anonymous Pages

Pages with no file backing (heap, stack):

Swap Available   Action
Yes              Write to swap, then free
No               Cannot reclaim without killing owner (OOM killer)
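
The two tables above collapse into one decision function (illustrative Python):

```python
def reclaim_action(page_type, dirty=False, swap_available=True):
    """What reclaim can do with a page, per the tables above."""
    if page_type == "file":
        return "write back, then drop" if dirty else "drop"
    if page_type == "anon":
        return "swap out, then free" if swap_available else "unreclaimable"
    raise ValueError(f"unknown page type: {page_type}")

reclaim_action("anon", swap_available=False)
# → 'unreclaimable': without swap, anonymous memory can only shrink via OOM
```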

Slab Caches

Kernel object caches can shrink via shrinkers:

/* Shrinkers registered by subsystems (pre-v6.7 API; newer kernels
 * allocate with shrinker_alloc() and call shrinker_register()) */
register_shrinker(&my_shrinker);

/* Called during reclaim */
struct shrinker {
    unsigned long (*count_objects)(struct shrinker *,
                                   struct shrink_control *sc);
    unsigned long (*scan_objects)(struct shrinker *,
                                  struct shrink_control *sc);
};

Examples: dentry cache, inode cache, vfs caches.
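
The count/scan protocol can be sketched in Python (illustrative; everything except the count_objects/scan_objects names is made up):

```python
class ToyShrinker:
    """Mimics the two-callback shrinker contract."""
    def __init__(self, cache):
        self.cache = cache                 # freeable objects

    def count_objects(self):
        return len(self.cache)             # cheap estimate of freeable objects

    def scan_objects(self, nr_to_scan):
        freed = min(nr_to_scan, len(self.cache))
        del self.cache[:freed]             # actually free up to nr_to_scan
        return freed

def shrink_slab(shrinkers, nr_to_scan):
    """Reclaim asks each shrinker to count, then scans the non-empty ones."""
    return sum(s.scan_objects(nr_to_scan)
               for s in shrinkers if s.count_objects())

dentries = ToyShrinker(cache=list(range(10)))
shrink_slab([dentries], nr_to_scan=4)
# → 4 objects freed; 6 remain cached
```

Splitting count from scan lets reclaim size its effort cheaply before committing to the more expensive freeing pass.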

# View shrinker stats
cat /sys/kernel/debug/shrinker/*

Swappiness

vm.swappiness controls the balance between reclaiming file pages vs anonymous pages:

Value   Behavior
0       Avoid swapping, prefer file cache reclaim
60      Default balance
100     Aggressively swap anonymous pages

# View current value
cat /proc/sys/vm/swappiness

# Adjust (persist it via /etc/sysctl.conf)
sysctl vm.swappiness=10
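
How swappiness biases scanning can be sketched from get_scan_count() (simplified Python; the real code also weighs how recently each list's pages were rotated):

```python
def scan_split(nr_to_scan, swappiness):
    """Split a scan budget between the anon and file LRU lists."""
    anon_prio = swappiness           # capped at 100 classically, 200 since v5.8
    file_prio = 200 - swappiness
    anon = nr_to_scan * anon_prio // (anon_prio + file_prio)
    return anon, nr_to_scan - anon   # (anon pages, file pages) to scan

scan_split(100, swappiness=60)
# → (30, 70): the default leans toward reclaiming file cache
```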

OOM Killer

When reclaim fails and memory is exhausted, the OOM (Out of Memory) killer terminates processes to free memory.

OOM Score

Each process has an OOM score (0-1000). Higher = more likely to be killed.

# View OOM score
cat /proc/<pid>/oom_score

# Adjust OOM priority (-1000 to 1000)
echo -500 > /proc/<pid>/oom_score_adj  # Less likely to be killed
echo 1000 > /proc/<pid>/oom_score_adj  # Kill this first

OOM Selection Criteria

The kernel considers:

  1. Memory usage (primary factor)
  2. oom_score_adj adjustments
  3. Whether killing frees memory (children, shared pages)
  4. Root-process protection (a 3% bonus, removed in v4.17)
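
Criteria 1 and 2 can be sketched from oom_badness() (simplified Python; the real function also counts page-table pages and skips unkillable tasks):

```python
def oom_badness(rss_pages, swap_pages, total_pages, oom_score_adj=0):
    """Higher score = more likely victim; -1000 means never kill."""
    if oom_score_adj == -1000:
        return 0
    points = rss_pages + swap_pages                   # memory usage dominates
    points += oom_score_adj * total_pages // 1000     # userspace adjustment
    return max(points, 0)

def pick_victim(tasks, total_pages):
    """tasks: name -> (rss_pages, swap_pages, oom_score_adj)."""
    return max(tasks, key=lambda n: oom_badness(*tasks[n][:2], total_pages,
                                                tasks[n][2]))

tasks = {"db": (5000, 0, -500), "leaky-app": (4000, 0, 0)}
pick_victim(tasks, total_pages=10000)
# → 'leaky-app': the db's -500 adjustment offsets its larger footprint
```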

# Watch OOM events
dmesg | grep -i "out of memory"

# Or via tracing
echo 1 > /sys/kernel/debug/tracing/events/oom/oom_score_adj_update/enable

Reclaim Behavior Tunables

Sysctl                      Description
vm.min_free_kbytes          Base for watermark calculation
vm.swappiness               Anon vs file reclaim balance
vm.vfs_cache_pressure       Pressure on dentry/inode caches
vm.dirty_ratio              Max dirty pages before blocking writes
vm.dirty_background_ratio   When background writeback starts

History

Origins

Page reclaim has existed since early Linux. The basic kswapd mechanism was added by Stephen Tweedie in January 1996 (noted in the vmscan.c header), predating git history.

rmap (Reverse Mapping, v2.5)

Problem: To reclaim a page, kernel needed to find all PTEs pointing to it. Required scanning all page tables.

Solution: Reverse mapping - each page tracks which PTEs map it.
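
A conceptual sketch of the idea (illustrative Python; the real rmap walks anon_vma chains and address_space interval trees, not a flat set):

```python
class Page:
    def __init__(self):
        self.rmap = set()           # (process, virtual address) pairs mapping us

def map_page(page, process, vaddr):
    page.rmap.add((process, vaddr))

def try_to_unmap(page):
    """With rmap, finding every PTE is O(number of mappers),
    not O(all page tables in the system)."""
    ptes = sorted(page.rmap)
    page.rmap.clear()
    return ptes                     # PTE locations the kernel would clear

p = Page()
map_page(p, "proc-a", 0x1000)
map_page(p, "proc-b", 0x2000)
try_to_unmap(p)
# → both mappings found directly from the page, no page-table scan
```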

Split LRU (v2.6.28, 2008)

Commit: 4f98a2fee8ac ("vmscan: split LRU lists into anon & file sets")

Author: Rik van Riel

Note: Pre-2009 LKML archives on lore.kernel.org are sparse. The commit message documents the rationale.

Split single LRU into separate anonymous and file lists for better reclaim decisions.

MGLRU (v6.1, 2022)

Commit: ac35a4902374 ("mm: multi-gen LRU: minimal implementation")

Author: Yu Zhao

Multi-generational LRU for improved page aging and reduced overhead.

Try It Yourself

Monitor Reclaim Activity

# Real-time reclaim stats
watch -n 1 'cat /proc/vmstat | grep -E "pgscan|pgsteal|kswapd"'

# Key metrics:
# pgscan_kswapd  - Pages scanned by kswapd
# pgscan_direct  - Pages scanned by direct reclaim
# pgsteal_*      - Pages successfully reclaimed
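
One derived number worth watching is reclaim efficiency, pgsteal/pgscan (hypothetical helper; it parses the /proc/vmstat key-value format, shown here on a canned sample):

```python
def reclaim_efficiency(vmstat_text):
    """pgsteal/pgscan ratio; low values mean reclaim scans many pages
    it cannot free (a sign of pressure or thrashing)."""
    stats = dict(line.split() for line in vmstat_text.splitlines() if line)
    scanned = sum(int(v) for k, v in stats.items() if k.startswith("pgscan_"))
    stolen = sum(int(v) for k, v in stats.items() if k.startswith("pgsteal_"))
    return stolen / scanned if scanned else 1.0

sample = """pgscan_kswapd 1000
pgscan_direct 200
pgsteal_kswapd 900
pgsteal_direct 100
"""
reclaim_efficiency(sample)
# → ~0.83 (1000 of 1200 scanned pages were reclaimed)
```

On a live system, pass open('/proc/vmstat').read().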

Trigger Memory Pressure

# Consume memory to trigger reclaim
stress --vm 1 --vm-bytes 2G --vm-keep

# Watch kswapd wake up
watch -n 1 'ps aux | grep kswapd'

Trace Reclaim Events

# Enable reclaim tracing
echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_vmscan_direct_reclaim_begin/enable
echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_vmscan_direct_reclaim_end/enable

# Watch events
cat /sys/kernel/debug/tracing/trace_pipe

Check Zone Pressure

# Detailed zone info including watermarks
cat /proc/zoneinfo

# Quick check
cat /proc/buddyinfo

Common Issues

Direct Reclaim Stalls

Processes blocked in direct reclaim, causing latency spikes.

Debug: Check allocstall_* in /proc/vmstat

Solutions:

  • Increase vm.min_free_kbytes
  • Add swap if missing
  • Reduce memory pressure

kswapd CPU Usage

kswapd consuming excessive CPU.

Debug: top or perf top

Causes:

  • Too little free memory
  • Workload constantly dirtying pages
  • Swap thrashing

OOM Kills Despite Free Memory

Free memory exists but in wrong zone or fragmented.

Debug: Check /proc/buddyinfo and /proc/zoneinfo

References

Key Code

File             Description
mm/vmscan.c      Core reclaim logic
mm/oom_kill.c    OOM killer
mm/page_alloc.c  Watermark checks

Further reading

  • mm/vmscan.c — the core reclaim engine: shrink_node(), shrink_lruvec(), shrink_folio_list(), balance_pgdat() (kswapd main loop), and try_to_free_pages() (direct reclaim entry point)
  • Documentation/admin-guide/mm/multigen_lru.rst — MGLRU architecture, the generation model, page table walk optimisation, and the built-in PID controller for balancing scan overhead
  • Documentation/mm/balance.rst — high-level design of how the kernel balances memory between reclaim actors, zones, and NUMA nodes
  • LWN: Multi-generational LRU — introduction to MGLRU's design goals, how it improves on the classic two-list model, and why it reduces CPU overhead under mixed workloads
  • LWN: Better active/inactive list balancing — background on the problems with the classic two-list LRU that motivated the split-LRU work and eventually MGLRU
  • reclaim-throttling — what happens when reclaim cannot keep pace with allocations: reclaim_throttle(), the memory.high proportional delay, and how kswapd coordinates with direct reclaimers
  • shrinker — how kernel subsystems (dentry/inode caches, driver object pools) register callbacks so reclaim can shrink them alongside page-cache and anonymous pages
  • memcg — per-cgroup reclaim triggered by memory.high and memory.max, and the memory.reclaim proactive reclaim interface