Maple Tree: VMA Management Data Structure
The B-tree variant that replaced the rbtree + linked list for virtual memory area management in Linux 6.1
Key Source Files
| File | Purpose |
|---|---|
| `include/linux/maple_tree.h` | Public API: `struct maple_tree`, `struct ma_state`, all macros and inline functions |
| `lib/maple_tree.c` | Full implementation: `maple_tree_init()`, `mas_walk()`, `mas_find()`, `mas_store()`, `mas_erase()` |
| `include/linux/mm_types.h` | `struct mm_struct` embedding (`mm_mt`), `struct vma_iterator`, `MM_MT_FLAGS` |
| `mm/mmap.c` | `find_vma()`, `find_vma_intersection()`, process teardown with `__mt_destroy()` |
| `mm/vma.c` | VMA insertion via `vma_iter_store_gfp()`, removal via `vma_iter_clear()` |
| `kernel/fork.c` | `mm_init()`: maple tree initialization with `mt_init_flags()` |
Why the Maple Tree Replaced rbtree + Linked List
Before Linux 6.1, every mm_struct maintained two redundant data structures for VMAs:
- A red-black tree (`mm->mm_rb`) keyed on `vm_start`, providing O(log n) lookup by address.
- A doubly-linked list (`vma->vm_next` / `vma->vm_prev`) threading all VMAs in address order, used for sequential iteration.
Every VMA insertion, deletion, or modification had to update both structures atomically under mmap_lock. Beyond the extra code complexity, there were two concrete performance costs:
- **Cache misses during sequential scans.** Walking the linked list requires following `vm_next` pointers, one per VMA, with each pointer potentially targeting a different cache line in a different slab object. On a process with thousands of mappings this becomes a long chain of pointer-chasing loads.
- **Memory overhead.** Each `vm_area_struct` carried three rbtree node fields (`rb_node`, plus parent and colour embedded in the pointer) and two list pointers, costing ~40 bytes per VMA solely for indexing.
The maple tree (merged in v6.1, authored at Oracle by Liam Howlett and Matthew Wilcox) unifies both structures into a single B-tree variant that natively supports range-keyed lookup and efficient sequential iteration. Because B-tree nodes pack multiple pivots and slots into a single 256-byte node, one node access yields several pivot comparisons, dramatically reducing cache misses compared to an rbtree, where every comparison requires loading a different node.
What the Maple Tree Is
The maple tree is an RCU-safe, range-keyed B-tree. It stores non-overlapping ranges of unsigned long keys, mapping each range to an opaque pointer value. For VMA management the key is the address range [vm_start, vm_end - 1] and the value is struct vm_area_struct *.
Node types and sizes
Three node types are defined (from include/linux/maple_tree.h):
/* On 64-bit: 256 bytes per node, aligned to 256 bytes */
#define MAPLE_NODE_SLOTS 31 /* slots in a generic node */
#define MAPLE_RANGE64_SLOTS 16 /* range node: 15 pivots + 16 slots */
#define MAPLE_ARANGE64_SLOTS 10 /* alloc-range node: 9 pivots + 10 slots + gaps */
The two node variants in use:
maple_range_64 — the standard internal/leaf node
struct maple_range_64 {
struct maple_pnode *parent;
unsigned long pivot[MAPLE_RANGE64_SLOTS - 1]; /* 15 range boundaries */
union {
void __rcu *slot[MAPLE_RANGE64_SLOTS]; /* 16 child or value pointers */
struct {
void __rcu *pad[MAPLE_RANGE64_SLOTS - 1];
struct maple_metadata meta; /* end offset + largest gap */
};
};
};
A maple_range_64 node stores up to 15 pivots and 16 slots in 256 bytes. In B-tree terms, pivots are the inclusive upper bounds of the ranges stored in each slot. Slot i covers the range (pivot[i-1], pivot[i]], with slot 0 covering from the node's implied minimum.
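The in-node slot selection described above can be sketched in a few lines of user-space C. This is a toy model under stated assumptions (a zero pivot marks the unused tail of the node); `node_slot_for()` is a hypothetical helper, not the kernel's walk code:

```c
#include <assert.h>

#define SLOTS 16                        /* MAPLE_RANGE64_SLOTS */

/* Toy slot selection: pivots are inclusive upper bounds, so slot i
 * covers (pivot[i-1], pivot[i]]. A zero pivot past the last used
 * entry marks the end of the node (simplification: a genuine pivot
 * of 0 is possible in the real tree). */
static unsigned char node_slot_for(const unsigned long pivot[SLOTS - 1],
                                   unsigned long index)
{
    unsigned char i;

    for (i = 0; i < SLOTS - 1; i++) {
        if (!pivot[i])                  /* unused tail reached */
            break;
        if (index <= pivot[i])
            return i;
    }
    return i;                           /* last slot runs to the node max */
}
```

Because all 15 pivots live in the same 256-byte node, this whole scan touches memory that was fetched by a single node access.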
maple_arange_64 — the allocation-range node
struct maple_arange_64 {
struct maple_pnode *parent;
unsigned long pivot[MAPLE_ARANGE64_SLOTS - 1]; /* 9 range boundaries */
void __rcu *slot[MAPLE_ARANGE64_SLOTS]; /* 10 pointers */
unsigned long gap[MAPLE_ARANGE64_SLOTS]; /* largest NULL gap per slot */
struct maple_metadata meta;
};
Allocation-range nodes additionally track the largest contiguous gap of NULL entries within each subtree. This supports mas_empty_area() / mas_empty_area_rev(), which find a free address range of a given size without scanning the entire tree — critical for mmap() address selection.
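The gap-guided descent can be modelled as a first-fit scan over the per-slot gap maxima. A minimal sketch, assuming one gap value per child subtree; `first_fit_slot()` is an illustrative name, not the kernel's `mas_empty_area()` internals:

```c
#define ASLOTS 10                       /* MAPLE_ARANGE64_SLOTS */

/* Toy analogue of one descent step in free-range search: pick the
 * first child whose recorded largest gap can hold the request,
 * skipping entire subtrees otherwise. */
static int first_fit_slot(const unsigned long gap[ASLOTS],
                          unsigned long size)
{
    for (int i = 0; i < ASLOTS; i++)
        if (gap[i] >= size)
            return i;
    return -1;                          /* no subtree can satisfy it */
}
```

Repeating this step once per level is what makes free-range search O(log n): subtrees whose recorded maximum gap is too small are never entered.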
Node allocation
All nodes are allocated from a dedicated kmem_cache named maple_node, created at boot by maple_tree_init():
void __init maple_tree_init(void)
{
struct kmem_cache_args args = {
.align = sizeof(struct maple_node),
.sheaf_capacity = 32,
};
maple_node_cache = kmem_cache_create("maple_node",
sizeof(struct maple_node), &args, SLAB_PANIC);
}
The 256-byte alignment ensures that the bottom 8 bits of a node pointer are always zero. The maple tree uses these bits to encode the node type and tree metadata directly in the pointer, avoiding extra memory accesses.
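The pointer-tagging idea can be demonstrated in user-space C. This is a sketch of the technique only; the enum values and helper names below are made up, not the kernel's `maple_enode` encoding:

```c
#include <stdint.h>
#include <stdlib.h>

/* With 256-byte-aligned nodes, the low 8 bits of a node pointer are
 * always zero, so a type tag can ride along in the pointer itself. */
enum toy_node_type { TOY_RANGE_64 = 1, TOY_ARANGE_64 = 2 };

static void *encode(void *node, enum toy_node_type type)
{
    return (void *)((uintptr_t)node | type);
}

static void *decode(void *enode)                /* recover node pointer */
{
    return (void *)((uintptr_t)enode & ~(uintptr_t)0xff);
}

static enum toy_node_type type_of(void *enode)  /* recover the tag */
{
    return (uintptr_t)enode & 0xff;
}
```

A walk can thus learn a child's node type from the pointer it already loaded, without touching the child node's memory.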
Maximum tree height
With 16 slots per maple_range_64 node, a tree of height 31 can represent 16^31 distinct ranges — far beyond any realistic address space.
struct maple_tree
/* include/linux/maple_tree.h */
struct maple_tree {
union {
spinlock_t ma_lock;
#ifdef CONFIG_LOCKDEP
struct lockdep_map *ma_external_lock;
#endif
};
unsigned int ma_flags;
void __rcu *ma_root;
};
- `ma_lock` — internal spinlock, used when no external lock is supplied. For `mm->mm_mt` this is unused: the tree is configured with `MT_FLAGS_LOCK_EXTERN` so that `mmap_lock` acts as the serialising lock.
- `ma_flags` — bitmask of tree-wide properties (see below).
- `ma_root` — RCU-protected pointer to the root node (or a single stored value if the tree has exactly one entry at index 0).
Flags
#define MT_FLAGS_ALLOC_RANGE 0x01 /* track largest free gap per node */
#define MT_FLAGS_USE_RCU 0x02 /* nodes are freed via RCU */
#define MT_FLAGS_LOCK_IRQ 0x100 /* ma_lock acquired IRQ-safe */
#define MT_FLAGS_LOCK_BH 0x200 /* ma_lock acquired BH-safe */
#define MT_FLAGS_LOCK_EXTERN 0x300 /* external lock; ma_lock unused */
How mm_struct embeds the maple tree
/* include/linux/mm_types.h */
struct mm_struct {
...
struct maple_tree mm_mt;
...
struct rw_semaphore mmap_lock;
...
};
/* The flags used when initialising mm->mm_mt */
#define MM_MT_FLAGS (MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN | \
MT_FLAGS_USE_RCU)
All three flags are active for mm->mm_mt:
- `MT_FLAGS_ALLOC_RANGE` — enables gap tracking so that `mmap()` can find a free region efficiently.
- `MT_FLAGS_LOCK_EXTERN` — the tree's internal `ma_lock` spinlock is bypassed; callers must hold `mmap_lock` for write before any mutation.
- `MT_FLAGS_USE_RCU` — removed nodes are freed via `call_rcu()` rather than immediately, enabling lockless RCU readers (see RCU safety below).
The tree is initialised and bound to mmap_lock in mm_init() (kernel/fork.c):
static struct mm_struct *mm_init(struct mm_struct *mm, ...)
{
mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock);
...
}
RCU Safety
The maple tree achieves lockless reads by ensuring that nodes removed from the tree are not freed until all currently executing RCU read-side critical sections have completed. When MT_FLAGS_USE_RCU is set, the removal path calls call_rcu() instead of freeing immediately:
/* lib/maple_tree.c — node removal path */
if (mt_in_rcu(mt)) {
mt_destroy_walk(enode, mt, false);
call_rcu(&node->rcu, mt_free_walk); /* deferred free */
} else {
mt_destroy_walk(enode, mt, true); /* immediate free */
}
The struct maple_node reuses its slot[] array as an rcu_head after the node is removed from the tree (the two share a union). A reader that loaded a pointer from slot[] before the node was unlinked can safely continue dereferencing it; the memory is not returned to the allocator until an RCU grace period has elapsed.
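The union trick can be modelled in user-space C. Field names below are illustrative, not the kernel's actual `struct maple_node` layout:

```c
#include <stddef.h>

/* Toy model: while a node is live, its slots hold pointers; once it is
 * unlinked, the same storage is reinterpreted as the RCU callback head
 * used for deferred freeing. */
struct toy_rcu_head {
    void *next;
    void (*func)(struct toy_rcu_head *);
};

struct toy_node {
    void *parent;
    union {
        void *slot[16];                 /* live: child/value pointers */
        struct toy_rcu_head rcu;        /* dead: deferred-free bookkeeping */
    };
};
```

The overlap is safe precisely because the two uses never coexist: slots are only read while the node is reachable, and the rcu head is only written after it no longer is.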
RCU mode lifecycle in the mm
The mm->mm_mt operates with RCU enabled most of the time. During process teardown and the dup_mmap() path, the kernel temporarily disables RCU mode to avoid the overhead of deferred freeing when nodes are being discarded en masse:
/* mm/mmap.c — exit_mmap() */
mt_clear_in_rcu(&mm->mm_mt); /* switch to immediate free for teardown */
...
__mt_destroy(&mm->mm_mt); /* free all remaining nodes */
/* mm/mmap.c — dup_mmap() — after bulk duplication is complete */
mt_set_in_rcu(vmi.mas.tree); /* re-enable RCU on the new mm */
mt_set_in_rcu() and mt_clear_in_rcu() toggle MT_FLAGS_USE_RCU under the tree's lock.
Lockless VMA lookup (per-VMA lock fast path)
A reader holding only rcu_read_lock() — with no mmap_lock — can safely walk mm->mm_mt to find a VMA. This is the foundation of the per-VMA lock fast path (see per-vma-locks.md):
/* mm/mmap_lock.c — lock_vma_under_rcu() */
MA_STATE(mas, &mm->mm_mt, address, address);
rcu_read_lock();
vma = mas_walk(&mas); /* lockless maple tree walk under RCU */
if (!vma)
goto inval;
vma = vma_start_read(mm, vma); /* attempt per-VMA read lock */
...
rcu_read_unlock();
mas_walk() traverses from root to leaf purely by reading __rcu-annotated pointers and pivots. No lock is required because removed nodes remain valid for at least one RCU grace period.
MA_STATE and the Maple State Cursor
The struct ma_state is the operation cursor for all advanced maple tree APIs. It tracks the current position in the tree so that sequential operations (e.g. iterating over all VMAs) do not need to restart from the root on each call.
/* include/linux/maple_tree.h */
struct ma_state {
struct maple_tree *tree; /* the tree being operated on */
unsigned long index; /* range start of current operation */
unsigned long last; /* range end of current operation */
struct maple_enode *node; /* current node */
unsigned long min; /* implied minimum of current node */
unsigned long max; /* implied maximum of current node */
struct slab_sheaf *sheaf; /* pre-allocated node pool */
struct maple_node *alloc; /* single pre-allocated node (fast path) */
unsigned long node_request; /* number of nodes to pre-allocate */
enum maple_status status; /* ma_start / ma_active / ma_none / ma_error / ... */
unsigned char depth; /* depth during write descent */
unsigned char offset; /* slot/pivot index within current node */
unsigned char mas_flags;
unsigned char end; /* last used slot in current node */
enum store_type store_type; /* type of write needed */
};
The status field determines how the next operation treats the cursor:
| Status | Meaning |
|---|---|
| `ma_start` | Not yet positioned; next operation walks from root |
| `ma_active` | Cursor is valid at node/offset; in-place continuation is safe |
| `ma_none` | Search complete; no entry found at index |
| `ma_root` | Entry found in the root pointer (tree has only one entry) |
| `ma_pause` | Node data may be stale; restart required (e.g. lock dropped) |
| `ma_overflow` | Reached the upper search limit |
| `ma_underflow` | Reached the lower search limit |
| `ma_error` | Error occurred; node encodes the negative errno |
The MA_STATE macro initialises a cursor on the stack:
#define MA_STATE(name, mt, first, end) \
struct ma_state name = { \
.tree = mt, \
.index = first, \
.last = end, \
.node = NULL, \
.status = ma_start, \
.min = 0, \
.max = ULONG_MAX, \
... \
}
After an operation completes, mas->index and mas->last hold the full range of the entry that was found or stored.
The struct vma_iterator used throughout mm/mmap.c and mm/vma.c is a thin wrapper:
/* include/linux/mm_types.h */
struct vma_iterator {
struct ma_state mas;
};
#define VMA_ITERATOR(name, __mm, __addr) \
struct vma_iterator name = { \
.mas = { \
.tree = &(__mm)->mm_mt, \
.index = __addr, \
.status = ma_start, \
}, \
}
Key Operations
Tree lifecycle: mt_init() and mtree_destroy()
/* Initialise an empty tree with default flags (no external lock, no RCU) */
static inline void mt_init(struct maple_tree *mt)
{
mt_init_flags(mt, 0);
}
/* Initialise with specific flags (used for mm->mm_mt) */
static inline void mt_init_flags(struct maple_tree *mt, unsigned int flags);
/* Destroy a tree, freeing all nodes (acquires ma_lock internally) */
void mtree_destroy(struct maple_tree *mt);
/* Destroy without locking — caller must hold the external lock */
void __mt_destroy(struct maple_tree *mt);
For mm->mm_mt, mm_init() uses mt_init_flags() with MM_MT_FLAGS. Teardown in exit_mmap() and dup_mmap() error paths uses __mt_destroy(&mm->mm_mt) after the caller already holds mmap_lock for write.
mas_walk() — position cursor at index
Searches for mas->index in the tree. On return, mas->index and mas->last are set to the range of the found entry (or to 0 and ULONG_MAX if nothing was found). Does not advance the cursor; calling mas_walk() twice on the same index returns the same entry.
Safe to call under rcu_read_lock() alone (no write lock needed); the per-VMA lock fast path (lock_vma_under_rcu(), shown above) uses it directly.
mas_find() — forward iteration from index
Returns the next entry whose range overlaps [mas->index, max]. After a successful return, mas->index and mas->last describe the range of the returned entry and the cursor is positioned there; the next call starts scanning forward from mas->last + 1.
Returns NULL when no entry exists in the remaining range (sets status to ma_overflow).
Must be called with rcu_read_lock() or the write lock held.
The mas_for_each() macro uses this:
#define mas_for_each(__mas, __entry, __max) \
while (((__entry) = mas_find((__mas), (__max))) != NULL)
mas_store() / mas_store_gfp() — insert or update a range mapping
/* Store entry at [mas->index, mas->last]. Requires pre-allocated nodes. */
void *mas_store(struct ma_state *mas, void *entry);
/* Store with on-demand allocation (returns -ENOMEM on failure). */
int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp);
Both functions write entry into the tree for the range [mas->index, mas->last], which must be set before calling. If an existing entry overlaps the range, it is displaced (split, shrunk, or replaced as needed). Returns the first displaced entry (if any) for mas_store(); returns 0 or -ENOMEM for mas_store_gfp().
Requires the write lock.
In VMA code, vma_iter_store_gfp() wraps mas_store_gfp():
/* mm/vma.h */
static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
struct vm_area_struct *vma, gfp_t gfp)
{
__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
mas_store_gfp(&vmi->mas, vma, gfp);
if (unlikely(mas_is_err(&vmi->mas)))
return -ENOMEM;
vma_mark_attached(vma);
return 0;
}
mas_erase() — remove the entry at the current index
Finds the range containing mas->index, erases the entire range (stores NULL), and returns the previous entry. On return, mas->index and mas->last are set to the erased range.
Requires the write lock.
In VMA code the removal path uses vma_iter_clear(), which calls mas_store_prealloc() with NULL:
/* mm/vma.h */
static inline void vma_iter_clear(struct vma_iterator *vmi)
{
mas_store_prealloc(&vmi->mas, NULL);
}
mt_for_each() — simple full-tree iteration
/* include/linux/maple_tree.h */
#define mt_for_each(__tree, __entry, __index, __max) \
for (__entry = mt_find(__tree, &(__index), __max); \
__entry; \
__entry = mt_find_after(__tree, &(__index), __max))
mt_for_each uses the simpler mt_find() / mt_find_after() interface (which acquires and releases rcu_read_lock() internally on each call), rather than an explicit ma_state. It is convenient for read-only traversals that do not need to maintain cursor state between iterations.
Code Examples
VMA lookup by address
/* find_vma() — mm/mmap.c */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
unsigned long index = addr;
mmap_assert_locked(mm);
return mt_find(&mm->mm_mt, &index, ULONG_MAX);
}
mt_find() starts at index and returns the first entry whose range covers or follows addr. For address lookup this is equivalent to asking "which VMA covers addr, or if none, which is the next one".
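That "covers or follows" semantic can be expressed as a one-condition scan over sorted, non-overlapping ranges. A user-space sketch with hypothetical `toy_` names (the maple tree does this in O(log n), not a linear scan):

```c
#include <stddef.h>

struct toy_vma {
    unsigned long vm_start, vm_end;     /* [vm_start, vm_end) */
};

/* Toy find_vma(): first mapping whose vm_end lies above addr, i.e.
 * the VMA covering addr or, failing that, the next one in address
 * order. Assumes 'vmas' is sorted and non-overlapping, as the maple
 * tree guarantees. */
static struct toy_vma *toy_find_vma(struct toy_vma *vmas, size_t n,
                                    unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr < vmas[i].vm_end)
            return &vmas[i];
    return NULL;
}
```

This is why callers of find_vma() must still check `addr >= vma->vm_start` when they need the VMA to actually contain the address.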
VMA insertion during mmap
/* mm/vma.c — anonymous mapping creation (simplified) */
vma = vm_area_alloc(mm);
vma_set_range(vma, addr, addr + len, pgoff);
vm_flags_init(vma, vm_flags);
vma_start_write(vma); /* mark VMA write-locked */
if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
goto mas_store_fail;
mm->map_count++;
Iterating all VMAs (e.g. for /proc/PID/maps)
VMA_ITERATOR(vmi, mm, 0);
struct vm_area_struct *vma;
for_each_vma(vmi, vma) {
/* vma->vm_start, vma->vm_end, vma->vm_flags ... */
}
for_each_vma is built on vma_next(), which advances the vma_iterator forward through the maple tree without restarting from the root.
Finding a free address range (mmap hint search)
/* mas_empty_area() — finds a gap of at least `size` bytes in [min, max] */
MA_STATE(mas, &mm->mm_mt, 0, 0);
mas_empty_area(&mas, mmap_min_addr, mmap_end - len, len);
addr = mas.index;
This leverages the gap[] array in maple_arange_64 nodes to skip subtrees that cannot contain a gap large enough, making free-range search O(log n) rather than O(n).
Performance Compared to rbtree
| Property | rbtree + linked list | Maple tree |
|---|---|---|
| Point lookup | O(log n) | O(log n) |
| Sequential scan | O(n) pointer chasing | O(n) sequential node reads |
| Free-gap search | O(n) full scan | O(log n) via gap metadata |
| Entries per node | 1 (one VMA per rb_node) | Up to 16 per 256-byte node |
| Memory per VMA (index overhead) | ~40 bytes (rb_node + list pointers) | Amortised ~16 bytes (shared node) |
| Lock on reads | mmap_lock required | RCU read lock sufficient |
| Update complexity | Two structure updates | One tree update |
The key advantage for sequential scans (e.g. reading /proc/PID/maps or mprotect() over a large range) is that a maple_range_64 node holds 16 VMA pointers in one 256-byte allocation. A full-node traversal costs one cache miss to load the node, then up to 16 useful entries, compared to one cache miss per entry when following a linked list.
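A back-of-envelope comparison, assuming one cache miss per linked-list hop and fully packed leaves of 16 entries (both are idealisations; real trees are partially full):

```c
/* Sequential-scan cost model: linked list pays one dependent load per
 * VMA; the maple tree pays one node load per ~16 VMAs, and the node's
 * contents arrive together. Illustrative arithmetic only. */
static unsigned long list_misses(unsigned long nr_vmas)
{
    return nr_vmas;                     /* one vm_next dereference each */
}

static unsigned long maple_node_loads(unsigned long nr_vmas)
{
    return (nr_vmas + 15) / 16;         /* leaves visited, 16 slots each */
}
```

For a 4096-VMA process this is roughly 4096 dependent misses versus 256 node loads, an order-of-magnitude reduction in pointer-chasing even before accounting for hardware prefetching within each node.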
For address spaces with thousands of mappings (JVM, large applications, container runtimes), this reduction in cache pressure is measurable.
Debugging
/proc/PID/maps and /proc/PID/smaps
These files are generated by iterating mm->mm_mt directly. No change is required to read them; the maple tree is the authoritative source.
CONFIG_DEBUG_MAPLE_TREE
Enabled via lib/Kconfig.debug (CONFIG_DEBUG_MAPLE_TREE=y).
With this option:
- `MT_BUG_ON()`, `MAS_BUG_ON()`, and `MT_WARN_ON()` perform additional consistency checks at each tree operation and dump the full tree on failure.
- Extra range validation is performed inside `mas_store()` before each write.
- `mt_validate()` can be called explicitly to check the structural integrity of any tree.
mt_validate() checks that all pivot arrays are sorted, that no ranges overlap, that gap metadata is consistent, and that the height stored in ma_flags matches the actual tree depth.
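One of those invariants — strictly ascending pivots — is easy to sketch in user-space C. A toy check under the assumption that a zero pivot marks the node's unused tail; this is not the kernel's mt_validate():

```c
/* Toy version of one structural invariant: used pivots within a node
 * must be strictly ascending, otherwise ranges would overlap. */
static int pivots_sorted(const unsigned long *pivot, int n)
{
    for (int i = 1; i < n; i++) {
        if (!pivot[i])                  /* unused tail reached */
            break;
        if (pivot[i] <= pivot[i - 1])
            return 0;                   /* out of order: corrupt node */
    }
    return 1;
}
```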
mt_dump()
Prints the full tree to the kernel log in decimal (mt_dump_dec) or hexadecimal (mt_dump_hex) format. Available only when CONFIG_DEBUG_MAPLE_TREE=y. Called automatically by MT_BUG_ON() and MAS_BUG_ON() on failure.
mas_dump()
Prints the current state of an ma_state cursor (index, last, node pointer, status, min/max, depth, offset). Useful when debugging an incorrect cursor state.
Tracing maple tree operations
The maple tree has built-in tracepoints (trace_ma_read, trace_ma_write, trace_ma_op) instrumented throughout lib/maple_tree.c. These are visible via ftrace when the corresponding maple_tree event group is enabled and provide per-operation visibility into the tree without requiring CONFIG_DEBUG_MAPLE_TREE.
RCU mode and debugging
When CONFIG_DEBUG_MAPLE_TREE is combined with CONFIG_MAPLE_RCU_DISABLED (a compile-time option at the top of lib/maple_tree.c), RCU freeing is suppressed entirely and nodes are freed immediately. This can help expose use-after-free bugs in tree consumers but is not safe for production.
Relationship to Other Subsystems
- Per-VMA locks (`mm/mmap_lock.c`) depend on `MT_FLAGS_USE_RCU` to perform lockless VMA lookups under `rcu_read_lock()`. See per-vma-locks.md.
- `/proc/PID/maps` iterates `mm->mm_mt` via `for_each_vma()` under `mmap_lock` for read.
- `dup_mmap()` uses `__mt_dup()` to bulk-copy the parent maple tree into the child `mm`, then replaces any `VM_DONTCOPY` entries with `vma_iter_clear_gfp()`.
- `execve()` tears down the address space by calling `exit_mmap()`, which calls `__mt_destroy(&mm->mm_mt)` after clearing all page table entries.
- vmalloc uses a red-black tree (`free_vmap_area_root`, a `struct rb_root`) for tracking kernel virtual address space — not a maple tree.
Further reading
- LWN: The maple tree, a modern data structure for a complex problem — 2022 article by Liam Howlett and Matthew Wilcox explaining the design rationale, node layout, and benchmarks against the rbtree + linked list combination
- lib/maple_tree.c — the full implementation: `maple_tree_init()`, `mas_walk()`, `mas_find()`, `mas_store()`, `mas_erase()`, RCU node freeing via `mt_free_walk()`, and the `CONFIG_MAPLE_RCU_DISABLED` debug mode
- include/linux/maple_tree.h — public API: `struct maple_tree`, `struct ma_state`, `MA_STATE()`, `mt_for_each()`, node-type constants, and all flag definitions
- Commit 54a611b60590 — the Linux 6.1 merge of the maple tree replacing `mm->mm_rb` and the VMA linked list; the merge message links to the full patch series and benchmark results
- per-vma-locks.md — the per-VMA lock fast path that depends on `MT_FLAGS_USE_RCU` to perform lockless maple tree walks under `rcu_read_lock()` without holding `mmap_lock`
- LWN: Splitting VMAs — covers the VMA tree reorganisation work in Linux 6.7 that moved VMA manipulation helpers into `mm/vma.c`, enabled by the cleaner maple tree API