Memory Management: The Story
The Problem (1991)
When Linus Torvalds started Linux, he faced a fundamental question: how do you let multiple programs share limited RAM?
Every program thinks it has the entire computer's memory to itself. But there's only so much physical RAM. The memory management (mm) subsystem solves this with two key tricks:
- Virtual memory: Each program gets its own "fake" address space
- Paging: Only keep in RAM what's actually being used
This document tells the story of how Linux's memory allocators evolved over 30+ years.
The Big Picture
How Linux memory allocators relate to each other:
+------------------------------------------------------------+
|                         User Space                         |
|                 malloc() / mmap() / brk()                  |
+------------------------------+-----------------------------+
                               | system calls
+------------------------------v-----------------------------+
|                           Kernel                           |
|                                                            |
|   +-----------------------------+                          |
|   |       kmalloc / kfree       |  <--- Small objects      |
|   |   (physically contiguous)   |       (<page size)       |
|   +--------------+--------------+                          |
|                  |                                         |
|   +--------------v--------------+                          |
|   |        SLUB Allocator       |  <--- Object caches      |
|   | (carves pages into objects) |       (inodes, tasks)    |
|   +--------------+--------------+                          |
|                  |                                         |
|   +--------------v--------------+      +----------------+  |
|   |       Buddy Allocator       | <--- |    vmalloc     |  |
|   |   (physical page frames)    |      | (virtual only) |  |
|   |       mm/page_alloc.c       |      |  mm/vmalloc.c  |  |
|   +--------------+--------------+      +--------+-------+  |
|                  |                              |          |
+------------------+------------------------------+----------+
                   |                              |
                   v                              v
+------------------------------------------------------------+
|                        Physical RAM                        |
|      [page][page][page][page][page][page][page][page]      |
|        ^     ^     ^                 ^           ^         |
|        |     |     |                 |           |         |
|        +-----+-----+                 +-----------+         |
|         contiguous                    scattered            |
|          (kmalloc)                    (vmalloc)            |
+------------------------------------------------------------+
Key insight: All allocators ultimately get pages from the buddy allocator. The difference is how they present memory to callers:
- kmalloc/SLUB: Physically contiguous, fast, small allocations
- vmalloc: Virtually contiguous, can satisfy large requests from fragmented memory
Chapter 1: The Page Allocator (The Foundation)
The Original Problem
Physical memory is finite. You need a way to hand out chunks of it and get them back.
Note: The buddy allocator dates to Linux 0.01 (1991). This predates LKML - early Linux development happened on comp.os.minix and early mailing lists without good archives. Detailed mm discussions from this era are largely lost to history.
Historical Context: How Linux Started
On August 25, 1991, Linus Torvalds posted to comp.os.minix: "I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu) for 386(486) AT clones." He was a 21-year-old student in Helsinki, frustrated that MINIX (a teaching OS) couldn't be freely modified. Linux 0.01 had basic memory management - a simple buddy allocator for physical pages. Within months, others were contributing.
The Monolithic vs Microkernel Debate (1992)
In January 1992, Andrew Tanenbaum (MINIX creator, OS professor) declared "Linux is obsolete" on comp.os.minix. His argument: Linux's monolithic kernel (where mm/, filesystems, drivers all run in kernel space) was a "giant step back into the 1970s." Microkernels (like MINIX) were the future - minimal kernel, everything else in userspace.
Torvalds responded the next day: monolithic was a pragmatic choice. Microkernels had theoretical elegance but real-world performance overhead from context switches and message passing. He prioritized getting things working on real hardware (the 386) over architectural purity. History proved him right - Linux runs everywhere from phones to supercomputers, while pure microkernels remain niche.
Why this matters for mm/: Linux's monolithic design means memory management runs in kernel space with direct hardware access. No message-passing overhead.
kmalloc() is a function call, not an IPC. This shaped every allocator discussed below - they optimize for this tight integration.
The Solution: Buddy System
Linux uses a "buddy allocator" (borrowed from older systems). Here's the clever idea:
Imagine you have 16 pages of RAM:
[________________] <- 16 pages (order 4)
Someone asks for 2 pages. You split:
[________][________] <- two 8-page blocks
[____][____][________] <- split again: 4, 4, 8
[__][__][____][________] <- split again: 2, 2, 4, 8
 ^^
 Given to requester
When they free those 2 pages, you merge them back with their "buddy":
[__][__][____][________] -> [____][____][________] -> [________][________] -> [________________]
Why it works: Fast allocation (O(1) from a given order's free list), automatic merging reduces fragmentation.
What it can't do: Give you, say, 3 pages. You'll get 4 (next power of 2).
Chapter 2: The Slab Allocator (v2.0, ~1996)
The Problem
The page allocator works in page-sized chunks (4KB). But kernel objects are often tiny:
- An inode? ~600 bytes
- A dentry? ~200 bytes
- A socket buffer? ~200 bytes
Giving someone a whole 4KB page for a 200-byte object wastes memory.
The Solution: Slab
The "slab allocator" (borrowed from SunOS) carves pages into fixed-size object caches:
Page for 200-byte objects:
[obj][obj][obj][obj][obj][obj]...[obj] <- ~20 objects fit
Page for 600-byte objects:
[object][object][object][object]... <- ~6 objects fit
Each object type gets its own cache. Allocating is fast (grab from cache), freeing is fast (return to cache).
The Evolution
SLAB (Original, ~1996): Ported from SunOS, based on Bonwick's 1994 USENIX paper. Complex. Multiple levels of caches. Good debugging. But hard to maintain.
Note: SLAB predates good LKML archives. The Bonwick paper is the canonical design reference.
SLUB (v2.6.22, 2007): Simpler redesign by Christoph Lameter.
- Commit: 81819f0fc828
- Removed complex queuing
- Lower memory overhead
- Better NUMA support
- Same API, simpler internals
Note: The commit message contains the design rationale. Pre-2008 LKML archives are sparse.
Why SLUB won: Kernel developers learned that simpler code is easier to maintain and debug. SLAB's complexity wasn't worth the marginal performance gains.
SLOB (embedded, removed v6.4): Minimal allocator for tiny systems. Removed because SLUB became efficient enough for embedded too.
The API (kmalloc/kfree):
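A minimal sketch of typical usage, written in kernel style (it assumes a kernel build environment and won't compile standalone; `struct foo` and `foo_create()` are invented for illustration):

```c
#include <linux/slab.h>

struct foo {
	int id;
	char name[32];
};

static int foo_create(void)
{
	/* GFP_KERNEL: normal allocation, may sleep.
	 * Use GFP_ATOMIC from interrupt or other atomic context. */
	struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return -ENOMEM;	/* allocation can fail - always check */

	f->id = 1;
	/* ... use f ... */

	kfree(f);		/* returns the object to its cache */
	return 0;
}
```

kzalloc() is the zero-initializing variant; for allocations that might be large, kvmalloc() (Chapter 5) tries kmalloc first and falls back to vmalloc.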
Chapter 3: vmalloc (1993 → Present)
The Problem
kmalloc gives you physically contiguous memory. As the system runs, physical memory becomes fragmented. Eventually, you can't find a contiguous 1MB chunk even if total free memory is 100MB.
The Solution: Virtual Contiguity
What if the memory only needs to look contiguous to the user?
Virtual addresses: Physical pages:
0xFFFF0000 ──────────→ Page at 0x1234000
0xFFFF1000 ──────────→ Page at 0x8765000 (scattered!)
0xFFFF2000 ──────────→ Page at 0x2222000
The program sees contiguous 0xFFFF0000-0xFFFF3000, but the physical pages are scattered. This is vmalloc().
The Cost
Page tables must be set up. TLB (translation cache) entries are consumed. It's slower than kmalloc. But it can satisfy large allocations when kmalloc can't.
Major Evolution
v2.6.28 (2008): The Scalability Crisis
The problem: vunmap needed to flush stale TLB entries. On multi-core systems, this meant an IPI (inter-processor interrupt) to every CPU, under a global lock. As core counts grew from 4 to 64+, the cost grew roughly quadratically.
The fix: Nick Piggin rewrote vmalloc from scratch:
- Commit: db64fe02258f
- Lazy TLB flushing: Don't flush immediately, batch multiple unmaps
- RBTree for address lookup: O(log n) instead of O(n) list scan
- Per-CPU frontend: Reduce lock contention
Note: The commit message documents the detailed design rationale for lazy TLB flushing and RCU.
v5.13 (2021): Huge Pages
The problem: With huge vmalloc buffers (BPF programs, modules), each 4KB page needed a TLB entry. TLB is limited. Lots of TLB misses.
The fix: Use huge pages (2MB) when possible:
- Commit: 121e6f3258fe
- Fewer TLB entries needed
- Trade-off: More internal fragmentation
v6.12 (2024): vrealloc
The problem: You have a vmalloc buffer and want to resize it. Previously, you'd have to allocate new, copy, free old.
The fix: vrealloc() - resize in place when possible:
- Commit: 3ddc2fefe6f3
- Motivated by Rust's allocator needs (Vec resizing)
- See vrealloc for the full story and bugs found
See LKML patch series by Danilo Krummrich.
Chapter 4: When Things Broke
Real bugs teach more than perfect code. Here are instructive failures:
KASAN vs vrealloc (v6.12)
What broke: vrealloc reused existing memory but didn't update KASAN (memory sanitizer) annotations. Result: false-positive "use-after-free" reports.
The lesson: When you have memory safety tooling, you must keep it in sync with actual memory state. Annotations are code, not comments.
Fix: d699440f58ce
The Return Value Bug (v6.13)
What broke: vrealloc was refactored to support in-place growing. The code did all the work correctly... then returned the wrong variable. Growing allocations silently fell back to the slow path.
The lesson: After complex refactoring, double-check your return statements. The happy path isn't always the executed path.
Impact: BPF verifier performance regression - it uses vrealloc heavily.
Fix: f7a35a3c36d1 - one line change.
Redundant Zeroing (v6.13)
What broke: With init_on_alloc enabled, vrealloc was zeroing memory on both grow and shrink. But on grow, the new memory was already zeroed by the allocator.
The lesson: Understand what upstream code already did. Don't assume you need to do everything yourself.
Fix: 70d1eb031a68
Chapter 5: Choosing the Right Allocator
After all this history, here's the practical guidance:
| Situation | Use | Why |
|---|---|---|
| Small object (<page) | kmalloc() | Fast, low overhead |
| Large buffer (pages+) | vmalloc() | Doesn't need physical contiguity |
| Don't know size ahead | kvmalloc() | Tries kmalloc, falls back to vmalloc |
| DMA buffer | dma_alloc_coherent() | Need physical address for hardware |
| Resizable buffer | krealloc()/vrealloc() | Efficient resize |
Try It Yourself
Explore the Buddy Allocator
# View buddy allocator state per zone
cat /proc/buddyinfo
# Example output:
# Node 0, zone Normal 1024 512 256 128 64 32 16 8 4 2 1
# ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
# | | | | | | | | | | order 10 (4MB blocks)
# order 0 (4KB pages)
# Each number = count of free blocks at that order
# Order 0 = 4KB, Order 1 = 8KB, ... Order 10 = 4MB
Explore the Slab Allocator
# View all slab caches
cat /proc/slabinfo
# Better: use slabtop for live monitoring
slabtop
# Key columns:
# OBJS - Total objects (active + inactive)
# ACTIVE - Objects currently in use
# USE - Percentage utilization
# OBJ SIZE - Size of each object
# SLABS - Number of slabs
# Find biggest memory consumers
slabtop -s c # Sort by cache size
Check Overall Memory
# High-level memory overview
cat /proc/meminfo
# Key fields for mm/:
# MemTotal - Total usable RAM
# MemFree - Completely unused pages
# Buffers - Block device cache
# Cached - Page cache (file data)
# Slab - Kernel slab allocator total
# SReclaimable - Slab memory that can be reclaimed
# VmallocUsed - vmalloc usage
Run Kernel Memory Tests
# Build test modules (in kernel tree with CONFIG_TEST_* enabled)
make lib/test_vmalloc.ko
make lib/test_hmm.ko
# Run vmalloc tests
modprobe test_vmalloc run_test_mask=0xFFFF
dmesg | grep test_vmalloc
# Run via virtme-ng (recommended for testing)
virtme-ng --kdir . --append 'test_vmalloc.run_test_mask=0xFFFF'
Key Code Locations
| Subsystem | File | What to Look For |
|---|---|---|
| Buddy allocator | mm/page_alloc.c | __alloc_pages(), free_pages() |
| SLUB allocator | mm/slub.c | kmem_cache_alloc(), kmalloc() |
| vmalloc | mm/vmalloc.c | vmalloc(), vfree() |
| Memory reclaim | mm/vmscan.c | shrink_node(), kswapd() |
| OOM killer | mm/oom_kill.c | out_of_memory(), oom_badness() - see overcommit |
Timeline Summary
| Year | Kernel | What Happened |
|---|---|---|
| 1991 | 0.01 | Buddy allocator for pages |
| ~1996 | 2.0 | SLAB allocator for small objects |
| 1993-2002 | 2.x | vmalloc evolution (Linus, then Christoph Hellwig) |
| 2007 | v2.6.22 | SLUB replaces SLAB as default |
| 2008 | v2.6.28 | vmalloc rewrite (RBTree, lazy TLB) |
| 2021 | v5.13 | vmalloc huge page support |
| 2023 | v6.4 | SLOB removed |
| 2024 | v6.12 | vrealloc introduced |
| 2025 | v6.13+ | vrealloc bugs fixed, shrink optimization |
Further Reading
- vmalloc - Deep dive into virtual memory allocation
- vrealloc - The full story of vmalloc resizing
- page-allocator - Buddy system details
- slab - SLUB internals
- glossary - Terminology reference
Explainers
- virtual-physical-resident - VSZ vs RSS vs PSS explained
- overcommit - Why overcommit is the only sane default
- brk-vs-mmap - Two ways to get memory from the kernel
- contiguous-memory - Why large allocations fail despite free memory
- cow - Copy-on-write edge cases and when it breaks