NUMA Zonelist Construction and Fallback Ordering
When a NUMA node runs out of memory, the kernel's pre-built fallback list determines where the next allocation lands — and memory policies can override that list entirely.
What Is a Zonelist?
Every NUMA node (pg_data_t, also called pgdat) carries an array of zonelists. A zonelist is an ordered sequence of zoneref entries — one entry per (zone, node) pair — that the allocator walks in priority order when fulfilling a request.
/* include/linux/mmzone.h */
/* Maximum number of zones on a zonelist */
#define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)
enum {
        ZONELIST_FALLBACK,      /* zonelist with fallback */
#ifdef CONFIG_NUMA
        ZONELIST_NOFALLBACK,    /* zonelist without fallback (__GFP_THISNODE) */
#endif
        MAX_ZONELISTS
};
/*
 * This struct contains information about a zone in a zonelist. It is stored
 * here to avoid dereferences into large structures and lookups of tables
 */
struct zoneref {
        struct zone *zone;      /* Pointer to actual zone */
        int zone_idx;           /* zone_idx(zoneref->zone) */
};
/*
 * One allocation request operates on a zonelist. A zonelist
 * is a list of zones, the first one is the 'goal' of the
 * allocation, the other zones are fallback zones, in decreasing
 * priority.
 */
struct zonelist {
        struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
};
Each pgdat holds two zonelists (node_zonelists[MAX_ZONELISTS]):
| Index | Name | Purpose |
|---|---|---|
| ZONELIST_FALLBACK | Fallback list | Normal allocations; preferred node first, then sorted by NUMA distance |
| ZONELIST_NOFALLBACK | No-fallback list | Used when __GFP_THISNODE is set; local node only, never crosses to another node |
The helper node_zonelist(nid, gfp_flags) selects the right list based on whether __GFP_THISNODE is present in the GFP flags:
/* include/linux/gfp.h */
static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
        return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}
where gfp_zonelist() returns ZONELIST_NOFALLBACK if __GFP_THISNODE is set, and ZONELIST_FALLBACK otherwise.
How Zonelists Are Built
Entry Point: build_all_zonelists()
Zonelists are constructed at two points in the kernel's lifetime:
- Boot — build_all_zonelists() is called during memory initialization. Because the system is still in SYSTEM_BOOTING, it delegates to build_all_zonelists_init(), which calls __build_all_zonelists(NULL) to rebuild all nodes.
- Memory hotplug — When a node comes online, build_all_zonelists(pgdat) is called again with the new node's pgdat. If that node is not yet online, only its own zonelist is rebuilt; otherwise all nodes are rebuilt so their fallback lists reflect the new topology.
/* mm/page_alloc.c */
void __ref build_all_zonelists(pg_data_t *pgdat)
{
        if (system_state == SYSTEM_BOOTING) {
                build_all_zonelists_init();
        } else {
                __build_all_zonelists(pgdat);
        }
        ...
}
The rebuild holds a write-side seqlock (zonelist_update_seq) so that concurrent allocators — which may run at IRQ level with GFP_ATOMIC — can detect a rebuild in progress and re-read the zonelist.
Node Ordering: build_zonelists()
For each pgdat, build_zonelists() constructs the ZONELIST_FALLBACK list by calling find_next_best_node() repeatedly until all nodes with memory have been visited:
/* mm/page_alloc.c */
static void build_zonelists(pg_data_t *pgdat)
{
        static int node_order[MAX_NUMNODES];
        int node, nr_nodes = 0;
        nodemask_t used_mask = NODE_MASK_NONE;
        int local_node, prev_node;

        local_node = pgdat->node_id;
        prev_node = local_node;

        while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
                if (node_distance(local_node, node) !=
                    node_distance(local_node, prev_node))
                        node_load[node] += 1;
                node_order[nr_nodes++] = node;
                prev_node = node;
        }

        build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
        build_thisnode_zonelists(pgdat);
        ...
}
The kernel prints the resulting order at boot: Fallback order for Node N: 0 1 2 ...
Choosing the Next Best Node: find_next_best_node()
find_next_best_node() scores every unvisited node using a composite metric:
- NUMA distance — the primary signal, read from the ACPI SLIT table via node_distance(). Closer nodes score lower.
- CPU presence penalty — nodes that have CPUs (PENALTY_FOR_NODE_WITH_CPUS) score slightly higher than headless memory-only nodes. Memory-only nodes see less allocation pressure from the kernel itself, making them better overflow targets.
- Load balancing — a small node_load increment is applied to the first node in each new distance group, spreading allocations round-robin among equidistant peers.
The preferred (local) node is always placed first.
The numa_zonelist_order Sysctl
The kernel exposes /proc/sys/vm/numa_zonelist_order. In current kernels, only "Node" order is implemented. An older zone-based ordering mode was removed because it was not useful in practice. The sysctl is retained for compatibility but ignores any value that does not begin with 'd', 'D', 'n', or 'N' (matching "Default" and "Node"), emitting a warning otherwise.
Zone Ordering Within a Node
Within a single node, build_zonerefs_node() iterates zones from the highest zone index down to zero and appends each populated zone to the list:
/* mm/page_alloc.c */
static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
{
        struct zone *zone;
        enum zone_type zone_type = MAX_NR_ZONES;
        int nr_zones = 0;

        do {
                zone_type--;
                zone = pgdat->node_zones + zone_type;
                if (populated_zone(zone)) {
                        zoneref_set_zone(zone, &zonerefs[nr_zones++]);
                        check_highest_zone(zone_type);
                }
        } while (zone_type);

        return nr_zones;
}
The zone type enum in ascending order is:
/* include/linux/mmzone.h */
enum zone_type {
        ZONE_DMA,       /* CONFIG_ZONE_DMA — small DMA-capable window */
        ZONE_DMA32,     /* CONFIG_ZONE_DMA32 — 32-bit DMA window */
        ZONE_NORMAL,    /* main addressable memory */
        ZONE_MOVABLE,   /* pages that can be migrated or offlined */
        __MAX_NR_ZONES
};
Because the loop counts down from MAX_NR_ZONES - 1, a typical x86-64 node's zones appear in the zonelist as: ZONE_MOVABLE → ZONE_NORMAL → ZONE_DMA32 → ZONE_DMA.
Why this order? The allocator tries the largest-capacity zones first (ZONE_NORMAL and ZONE_MOVABLE together hold the vast majority of RAM). ZONE_DMA32 and ZONE_DMA are preserved for hardware devices that have strict addressing constraints. Using normal memory for ordinary allocations protects these limited DMA-capable regions.
Fallback Algorithm: Iterating the Zonelist
The for_each_zone_zonelist Family
The allocator's fast path (get_page_from_freelist()) iterates the zonelist using:
/* include/linux/mmzone.h */
/* Iterate all zones at or below highest_zoneidx */
#define for_each_zone_zonelist(zone, z, zlist, highidx) \
        for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)

/* Same, but filtered to zones whose node is in nodemask */
#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
        for (z = first_zones_zonelist(zlist, highidx, nodemask), zone = zonelist_zone(z); \
             zone; \
             z = next_zones_zonelist(++z, highidx, nodemask), \
             zone = zonelist_zone(z))
z is a cursor (struct zoneref *) into the zonelist's _zonerefs array. Advancing ++z moves to the next entry without re-scanning from the beginning — this is O(1) per step.
Watermark Gating
At each zone the allocator checks the watermark before attempting an allocation:
/* mm/page_alloc.c — inside get_page_from_freelist() */
mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
if (!zone_watermark_fast(zone, order, mark,
                         ac->highest_zoneidx, alloc_flags,
                         gfp_mask)) {
        /* watermark failed — skip this zone, continue the loop */
        ...
        continue;
}
/* watermark passed — allocate from this zone */
zone_watermark_ok() checks that free_pages - requested > watermark + lowmem_reserve[highest_zoneidx]. The lowmem_reserve array provides a per-zone buffer that prevents higher-zone allocations from completely exhausting DMA-capable zones.
The allocator does not retry the full list
Once the watermark check fails for a zone, that zone is skipped — the loop simply moves to the next entry. There is no re-scan from the beginning. If all zones in the fallback list fail the watermark check, the allocation falls into __alloc_pages_slowpath(), which triggers reclaim, compaction, and ultimately OOM if memory cannot be freed.
How Memory Policies Modify Zonelist Traversal
Memory policies (struct mempolicy) intercept the allocation path in mempolicy.c and can change either the preferred node or the nodemask that filters the zonelist iterator.
/* include/linux/mempolicy.h */
struct mempolicy {
        atomic_t refcnt;
        unsigned short mode;    /* MPOL_DEFAULT, MPOL_BIND, MPOL_PREFERRED, ... */
        unsigned short flags;   /* MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES, ... */
        nodemask_t nodes;       /* interleave/bind/preferred/etc. */
        int home_node;          /* home node for MPOL_BIND and MPOL_PREFERRED_MANY */
        union {
                nodemask_t cpuset_mems_allowed;
                nodemask_t user_nodemask;
        } w;
        struct rcu_head rcu;
};
The policy_nodemask() function in mempolicy.c translates the policy mode into a (preferred_nid, nodemask) pair that the core allocator uses:
/* mm/mempolicy.c */
static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
                                   pgoff_t ilx, int *nid)
{
        nodemask_t *nodemask = NULL;

        switch (pol->mode) {
        case MPOL_PREFERRED:
                *nid = first_node(pol->nodes);
                break;
        case MPOL_BIND:
                if (apply_policy_zone(pol, gfp_zone(gfp)) &&
                    cpuset_nodemask_valid_mems_allowed(&pol->nodes))
                        nodemask = &pol->nodes;
                ...
                break;
        case MPOL_INTERLEAVE:
                *nid = (ilx == NO_INTERLEAVE_INDEX) ?
                        interleave_nodes(pol) : interleave_nid(pol, ilx);
                break;
        ...
        }
        return nodemask;
}
MPOL_DEFAULT
No policy object is set (current->mempolicy == NULL). The allocator uses the current CPU's node as the preferred node and no nodemask filter. The full fallback list is available.
MPOL_PREFERRED
The nodes field holds exactly one node. policy_nodemask() sets *nid to that node, making it the preferred starting point. The nodemask filter remains NULL, so if the preferred node is exhausted, the allocator falls back through the full ZONELIST_FALLBACK list.
MPOL_BIND
policy_nodemask() returns &pol->nodes as the nodemask filter. for_each_zone_zonelist_nodemask skips any zone whose node is not in that mask. The allocator therefore only considers the bound nodes. If all bound nodes are exhausted, no fallback to other nodes occurs — the allocation fails and triggers OOM rather than spilling onto an unbound node.
MPOL_BIND and OOM
A process bound with MPOL_BIND to a node that is full will be killed by the OOM killer rather than falling back to remote memory. This is intentional: the application has explicitly declared that it requires local memory.
MPOL_INTERLEAVE
interleave_nodes() advances current->il_prev to the next node in pol->nodes and returns it as the preferred node for this allocation. No nodemask filter is applied to the zonelist — normal fallback is still available if the interleaved node is temporarily exhausted. Over many allocations, pages are spread round-robin across the allowed nodes.
MPOL_LOCAL
Allocate from the current CPU's NUMA node. policy_nodemask() does not modify *nid for MPOL_LOCAL (it falls through to the default case), so numa_node_id() is used. Normal fallback applies.
Summary Table
| Policy | Preferred Node | Nodemask Filter | On Exhaustion |
|---|---|---|---|
| MPOL_DEFAULT | Current CPU's node | None | Falls back normally |
| MPOL_PREFERRED | Specified node | None | Falls back normally |
| MPOL_BIND | First node in mask (or home_node) | pol->nodes | OOM — no fallback |
| MPOL_INTERLEAVE | Round-robin across pol->nodes | None | Falls back normally |
| MPOL_LOCAL | Current CPU's node | None | Falls back normally |
Practical Effect: Two-Node Example
Consider a system with node 0 (32 GB) and node 1 (32 GB). NUMA distance: local = 10, remote = 20. Assuming the DMA-capable zones sit on node 0, as is typical on x86-64, node 0's ZONELIST_FALLBACK list is: ZONE_NORMAL/node0 → ZONE_DMA32/node0 → ZONE_DMA/node0 → ZONE_NORMAL/node1.
Scenario 1 — Default policy, node 0 is full:
A task running on node 0 with no explicit policy fills node 0's ZONE_NORMAL. get_page_from_freelist() fails the watermark check on ZONE_NORMAL/node0, skips ZONE_DMA32/node0 (watermark check also fails or lowmem_reserve denies it), and succeeds on ZONE_NORMAL/node1. The allocation is served from node 1 at remote latency. numastat records a numa_miss on node 0 and a numa_foreign on node 1.
Scenario 2 — MPOL_BIND to node 0, node 0 is full:
The nodemask filter restricts the iterator to node 0 entries only. Both ZONE_NORMAL/node0 and ZONE_DMA32/node0 fail the watermark check. No other zones pass the nodemask filter. The slow path fails to reclaim enough memory and the OOM killer is invoked. The task does not fall back to node 1.
Scenario 3 — MPOL_PREFERRED for node 0, node 0 is full:
The preferred nid is overridden to node 0, but no nodemask filter is applied. When node 0's zones fail the watermark check, the iterator continues to ZONE_NORMAL/node1 and the allocation succeeds. Behaviour is identical to the default policy except that node 0 is tried first even if the task happens to be scheduled on node 1.
Observing Fallback in Practice
numastat
| numastat name | /proc/vmstat name | Meaning |
|---|---|---|
| numa_hit | numa_hit | Allocation succeeded on the intended node |
| numa_miss | numa_miss | Allocation landed on this node because the intended node was full |
| numa_foreign | numa_foreign | Allocation intended for this node was served from another |
| interleave_hit | numa_interleave | Interleave-policy allocation succeeded on the intended node |
| local_node | numa_local | Allocation on the same node as the requesting process |
| other_node | numa_other | Allocation on a different node than the requesting process |
High numa_miss counts indicate memory pressure on the preferred node and frequent fallback to remote memory.
/proc/zoneinfo
Per-zone free page counts and the three watermarks (min, low, high) are visible here. When free approaches min, the zone will fail zone_watermark_ok() for most allocations and the allocator will skip to the next zonelist entry.
Kernel Boot Log
At boot, the kernel prints each node's fallback order, e.g. Fallback order for Node 0: 0 1 on the two-node system above.
This comes directly from build_zonelists():
pr_info("Fallback order for Node %d: ", local_node);
for (node = 0; node < nr_nodes; node++)
        pr_cont("%d ", node_order[node]);
Key Source Files
| File | What to Look For |
|---|---|
| include/linux/mmzone.h | struct zoneref, struct zonelist, enum zone_type, ZONELIST_FALLBACK, ZONELIST_NOFALLBACK, MAX_ZONES_PER_ZONELIST, for_each_zone_zonelist, for_each_zone_zonelist_nodemask, enum numa_stat_item |
| include/linux/mempolicy.h | struct mempolicy — mode, flags, nodes, home_node |
| include/uapi/linux/mempolicy.h | MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE, MPOL_LOCAL, MPOL_PREFERRED_MANY |
| include/linux/gfp.h | node_zonelist(), gfp_zonelist() |
| mm/page_alloc.c | build_all_zonelists(), __build_all_zonelists(), build_zonelists(), build_zonelists_in_node_order(), build_thisnode_zonelists(), build_zonerefs_node(), find_next_best_node(), get_page_from_freelist(), zone_watermark_ok(), __zone_watermark_ok() |
| mm/mempolicy.c | policy_nodemask(), alloc_pages_mpol(), interleave_nodes(), mempolicy_slab_node() |
Further reading
Kernel source
- mm/page_alloc.c — build_all_zonelists(), build_zonelists(), find_next_best_node(), get_page_from_freelist(), zone_watermark_ok()
- mm/mempolicy.c — policy_nodemask(), alloc_pages_mpol(), and all memory policy mode implementations
- include/linux/mmzone.h — struct zonelist, struct zoneref, for_each_zone_zonelist_nodemask, ZONELIST_FALLBACK, ZONELIST_NOFALLBACK
Kernel documentation
- Documentation/admin-guide/mm/numa_memory_policy.rst — reference for all MPOL_* policy modes and how they interact with the zonelist
LWN articles
- NUMA zonelist ordering — the history of zonelist ordering modes and why node-order won out over zone-order
- Memory policies and the allocator — how MPOL_BIND, MPOL_INTERLEAVE, and MPOL_PREFERRED map to zonelist traversal
- GFP flags and memory zones — how GFP flags select zones and interact with the fallback list
Related docs
- NUMA Memory Management — memory policy overview: set_mempolicy(), mbind(), numactl, and automatic NUMA balancing
- NUMA Distance and Inter-Socket Latency — how node_distance() values drive find_next_best_node() scoring
- NUMA Effects on Memory Reclaim — per-node kswapd, watermarks, and how MPOL_BIND interacts with OOM
- Page Allocator — the full allocation path from __alloc_pages() through the zonelist to the buddy allocator