Memory Bandwidth Optimization
When latency is not the bottleneck — maximizing throughput across NUMA nodes
The bandwidth wall
Most NUMA documentation focuses on latency: keep data close to the CPU that uses it. But a different bottleneck appears in bandwidth-bound workloads — large matrix operations, streaming analytics, ML training data shuffles, in-memory caches serving millions of keys per second. These workloads do not suffer from high latency on individual accesses so much as they hit a hard ceiling on how many bytes per second a single NUMA node can deliver.
Each NUMA node has a fixed memory bandwidth determined by its DDR controller and channel configuration. A four-channel DDR5-4800 node delivers roughly 38.4 GB/s per channel × 4 channels ≈ 154 GB/s. Modern server CPU cores can each drive 20 GB/s or more of memory traffic at peak. The math closes quickly:
Node 0 peak bandwidth: ~154 GB/s (4-channel DDR5-4800)
Peak per CPU core: ~20 GB/s (sustained streaming)
Saturation point: ~8 cores driving max bandwidth on one node
A 32-core system on one node: 4x over-subscribed
When all threads allocate from the same NUMA node, you get local latency but artificially cap throughput at one node's bandwidth. Distributing allocations across nodes multiplies available bandwidth at the cost of some remote-access latency. For bandwidth-bound workloads this trade-off is favorable.
These numbers are illustrative
Actual bandwidth depends on your DDR generation, channel count, memory interleaving at the hardware level, and workload access pattern. Measure with the STREAM benchmark on your target hardware.
Bandwidth vs latency: choosing a strategy
The right strategy depends on whether your workload is latency-bound or bandwidth-bound:
| Workload type | Example | Bottleneck | Strategy |
|---|---|---|---|
| Latency-bound | OLTP database, in-memory cache with sub-ms SLO | Access latency per operation | MPOL_BIND to local node; keep data near CPU |
| Bandwidth-bound | Large matrix multiply, ML data loading, Kafka consumers, Redis cluster | Aggregate GB/s | MPOL_INTERLEAVE across nodes; trade latency for bandwidth |
| Mixed | Multi-tenant server with varied workloads | Depends on tenant | Per-VMA policy; profile first |
The crossover point is workload-specific. A workload that repeatedly accesses a 512 MB working set fits comfortably within one node's bandwidth budget; a workload that streams through data at 200 GB/s does not. Profile before tuning.
MPOL_INTERLEAVE: the kernel mechanism
MPOL_INTERLEAVE is Linux's primary tool for spreading memory allocations across NUMA nodes. It round-robins page allocations across a specified nodemask: page 0 goes to node 0, page 1 to node 1, and so on. Each node receives an approximately equal share of the region's physical pages.
The implementation lives in mm/mempolicy.c. For process-level interleaving, interleave_nodes() advances current->il_prev through the policy nodemask on each allocation:
/* mm/mempolicy.c */
static unsigned int interleave_nodes(struct mempolicy *policy)
{
	unsigned int nid;
	unsigned int cpuset_mems_cookie;

	do {
		cpuset_mems_cookie = read_mems_allowed_begin();
		nid = next_node_in(current->il_prev, policy->nodes);
	} while (read_mems_allowed_retry(cpuset_mems_cookie));

	if (nid < MAX_NUMNODES)
		current->il_prev = nid;
	return nid;
}
For VMA-based interleaving (used by mbind()), interleave_nid() derives the target node from the page's offset within the VMA, so the same page always maps to the same node even after a process restart with the same mapping layout.
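The offset-to-node mapping can be pictured as a pure function. The sketch below — a hypothetical `offset_to_node()`, mirroring the idea of `interleave_nid()` rather than its exact code — takes the page offset modulo the number of allowed nodes and returns the corresponding set bit of the nodemask:

```c
/* Sketch of VMA-based interleave node selection: the page at offset
 * `pgoff` maps to the (pgoff % nr_nodes)-th set bit of the nodemask.
 * Deterministic in the offset, so the same page always lands on the
 * same node for a given mapping layout. Not the kernel's exact code. */
unsigned int offset_to_node(unsigned long pgoff, unsigned long nodemask)
{
    unsigned int nr_nodes = (unsigned int)__builtin_popcountl(nodemask);
    unsigned int target = pgoff % nr_nodes;   /* which set bit we want */
    unsigned int nid = 0;

    for (;; nid++) {
        if (nodemask & (1UL << nid)) {
            if (target == 0)
                return nid;
            target--;
        }
    }
}
```

With nodemask `0x3` (nodes 0 and 1), pages 0, 1, 2, 3 map to nodes 0, 1, 0, 1; with `0x5` (nodes 0 and 2), odd page offsets land on node 2.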
See NUMA memory management for the full policy API including MPOL_BIND, MPOL_PREFERRED, and MPOL_LOCAL.
Applying MPOL_INTERLEAVE
Per-process (set_mempolicy) — all subsequent allocations by this process interleave across all nodes:
#include <numaif.h>
#include <numa.h>
/* Interleave across all available nodes */
struct bitmask *all_nodes = numa_all_nodes_ptr;
set_mempolicy(MPOL_INTERLEAVE,
              all_nodes->maskp,
              all_nodes->size + 1);
Per-region (mbind) — apply interleave to a specific already-allocated region:
#include <numaif.h>
#include <sys/mman.h>
void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
unsigned long nodemask = 0x3; /* nodes 0 and 1 */
mbind(buf, size, MPOL_INTERLEAVE,
      &nodemask, 3,      /* maxnode = 3 (bits 0 and 1 valid) */
      MPOL_MF_MOVE);     /* migrate existing pages now */
Command line (numactl) — wrap an existing binary with no code changes:
# Interleave across all nodes
numactl --interleave=all ./myapp
# Interleave across specific nodes
numactl --interleave=0,1 ./myapp
When to use MPOL_INTERLEAVE
Best-fit workloads for MPOL_INTERLEAVE
- Large in-memory datasets: memcached and Redis cluster instances serving traffic whose aggregate memory read rate exceeds one node's bandwidth
- Scientific computing arrays: NumPy/SciPy operations on arrays that fit in RAM but exceed single-node bandwidth
- ML training data loaders: preprocessing pipelines that stream large datasets into GPU memory
- Analytics engines: in-memory columnar stores (DuckDB, Spark on DRAM) performing full-table scans
MPOL_WEIGHTED_INTERLEAVE
Linux 6.9 added MPOL_WEIGHTED_INTERLEAVE (commit fa3bea4e1f82), which distributes pages proportionally rather than evenly. If node 0 has 256 GB and node 1 has 128 GB, you can weight allocations 2:1 to match available capacity. The weights are configured via /sys/kernel/mm/mempolicy/weighted_interleave/nodeN.
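The proportional split can be modeled with a small helper — `weighted_node()` below is a hypothetical illustration of the 2:1 behavior described above, not the kernel's algorithm:

```c
/* Illustrative model of weighted interleave: with weights {2, 1},
 * two pages go to node 0 for every one page on node 1. A simplified
 * picture of the policy's effect, not the kernel implementation. */
#define MAX_NODES 8

/* Map a page index to a node given per-node weights (0 = node unused). */
unsigned int weighted_node(unsigned long page, const unsigned int w[MAX_NODES])
{
    unsigned long total = 0;
    for (int n = 0; n < MAX_NODES; n++)
        total += w[n];

    unsigned long slot = page % total;   /* position within one weight cycle */
    for (int n = 0; n < MAX_NODES; n++) {
        if (slot < w[n])
            return n;
        slot -= w[n];
    }
    return 0;   /* unreachable when total > 0 */
}
```

With weights {2, 1}, pages 0-5 land on nodes 0, 0, 1, 0, 0, 1 — a 2:1 split matching the capacity ratio in the example above.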
First-touch policy and its bandwidth implications
Linux uses first-touch allocation: the page is allocated on the NUMA node where the faulting thread is running at the time of the first write to that page. This is the most common reason deliberate interleaving fails silently.
The problem pattern:
// Main thread (on node 0) initializes a large buffer
// All pages land on node 0 — MPOL_INTERLEAVE has no effect here
// because mbind was not called BEFORE the first touch
char *buf = malloc(10UL << 30); /* 10 GB */
memset(buf, 0, 10UL << 30); /* first touch: all pages go to node 0 */
// Now set policy and spawn workers — too late
set_mempolicy(MPOL_INTERLEAVE, ...);
spawn_workers(buf);
MPOL_INTERLEAVE via set_mempolicy() affects future allocations. Pages already faulted in are unaffected. Similarly, mbind() without MPOL_MF_MOVE changes the policy for future faults but does not migrate existing pages.
Fixing first-touch problems
Option 1 — Set policy before first touch:
set_mempolicy(MPOL_INTERLEAVE, &all_nodes_mask, max_node);
char *buf = malloc(size);
memset(buf, 0, size); /* first touch now interleaves correctly */
Option 2 — mbind with MPOL_MF_MOVE:
char *buf = malloc(size);
memset(buf, 0, size); /* all on node 0 */
mbind(buf, size, MPOL_INTERLEAVE,
&nodemask, max_node, MPOL_MF_MOVE); /* migrates existing pages */
Option 3 — Parallel initialization:
/* Pin workers to different nodes; each touches its own portion */
#pragma omp parallel for schedule(static)
for (size_t i = 0; i < n; i++)
buf[i] = 0;
Option 3 is the most effective when your code controls initialization. It cannot help C++ programs where static or global constructors touch the data on a single thread before main() runs: by then the pages are already placed, and Option 2 (mbind with MPOL_MF_MOVE) is the practical fix.
Hardware prefetchers and access pattern efficiency
The CPU's hardware prefetcher detects sequential access patterns and pre-fetches cache lines from memory before they are requested. This has direct implications for bandwidth utilization: a prefetcher-friendly access pattern saturates available bandwidth more effectively than a random one.
Access pattern efficiency (approximate ranking):
Sequential (stride 1): ~100% of peak bandwidth
Strided (stride 2-8): 50–80% of peak bandwidth
Strided (stride 16+): 25–50% of peak bandwidth
Random (pointer chase): 5–15% of peak bandwidth
For anonymous memory, the kernel has no visibility into the access pattern — the hardware prefetcher operates independently. But for file-backed pages, the kernel's readahead mechanism provides a software analogue: it detects sequential file reads and issues speculative page_cache_ra_unbounded() calls to bring pages into the page cache before they are faulted in. This is effective for streaming analytics on files but not for anonymous heap allocations.
Row-major vs column-major
Matrix code is the canonical example. For a row-major C array A[rows][cols]:
/* Bandwidth-friendly: sequential access, prefetcher happy */
for (int i = 0; i < rows; i++)
for (int j = 0; j < cols; j++)
sum += A[i][j];
/* Bandwidth-unfriendly: column-major stride = cols * sizeof(float) */
for (int j = 0; j < cols; j++)
for (int i = 0; i < rows; i++)
sum += A[i][j];
The column-major loop on a 10,000×10,000 float array has a stride of 40,000 bytes between accesses. This defeats the prefetcher and reduces effective bandwidth by 5–10x on a typical server.
For bandwidth-bound matrix workloads
Combine MPOL_INTERLEAVE with row-major access patterns and 2MB huge pages. Each component addresses a different bottleneck: interleave adds aggregate bandwidth, row-major access enables prefetching, and huge pages reduce TLB overhead on large arrays.
Transparent huge pages and bandwidth
Transparent huge pages reduce TLB pressure by mapping memory in 2MB blocks instead of 4KB. Fewer TLB entries means more of the CPU's memory access capacity goes to actual data transfers rather than page-table walks.
For large bandwidth-bound workloads, the combination THP + MPOL_INTERLEAVE addresses two separate bottlenecks:
| Mechanism | What it fixes |
|---|---|
| `MPOL_INTERLEAVE` | Single-node bandwidth ceiling |
| THP (2MB pages) | TLB thrashing on large working sets |
Enable THP for the relevant region:
void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
madvise(buf, size, MADV_HUGEPAGE); /* request THP */
unsigned long nodemask = (1UL << nr_nodes) - 1;
set_mempolicy(MPOL_INTERLEAVE, &nodemask, nr_nodes + 1);
memset(buf, 0, size); /* first touch: interleaved 2MB huge pages */
Note: with THP enabled, khugepaged may collapse adjacent 4KB pages into a 2MB page asynchronously. The resulting huge page lands on whichever node held most of its component 4KB pages. If interleaving is done correctly at first-touch time, this should not cause problems. See THP for the interaction between NUMA balancing and THP.
NUMA balancing and deliberate interleaving
NUMA automatic balancing (/proc/sys/kernel/numa_balancing) works by periodically unmapping pages to catch access faults, then migrating pages toward the nodes that access them most. For most workloads this is beneficial. For deliberately interleaved workloads, it can silently undo your tuning by migrating all pages toward the node where the majority of threads run.
# Check current state
cat /proc/sys/kernel/numa_balancing
# Disable for deliberately interleaved workloads
echo 0 > /proc/sys/kernel/numa_balancing
When to disable NUMA balancing
Disable NUMA balancing when:
- You have explicitly set `MPOL_INTERLEAVE` and want to preserve the distribution
- The workload accesses memory evenly from all nodes (e.g., each worker thread processes its own shard)
- You see unexpectedly high `numa_miss` counts after interleaving is set up correctly
Leave NUMA balancing enabled when:
- You have not set explicit policies
- The workload's access pattern is irregular or changes over time
- You need the kernel to automatically correct poor initial placement
Also note: kernel allocations are not subject to user mempolicies. set_mempolicy() and mbind() affect user-space virtual memory regions only. Kernel data structures (slab objects, page tables, network buffers) are allocated by the kernel's own NUMA-aware allocator and are not interleaved by user policy.
Measuring memory bandwidth
Benchmark tools
stream — the gold standard for measuring sustainable memory bandwidth:
# Build stream benchmark
wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -O3 -march=native -fopenmp stream.c -o stream
./stream
# Example output:
# Function Best Rate MB/s Avg time Min time Max time
# Copy: 145234.4 0.011004 0.010983 0.011031
# Scale: 140123.8 0.011388 0.011419 0.011375
# Add: 148902.1 0.016134 0.016117 0.016152
# Triad: 149015.3 0.016100 0.016105 0.016093
# Measure bandwidth per NUMA node
numactl --membind=0 ./stream # node 0 only
numactl --membind=1 ./stream # node 1 only
numactl --interleave=all ./stream # interleaved
mbw — a simpler memcpy-based bandwidth measurement, packaged in most distributions.
Topology inspection
# Inspect NUMA hardware topology
numactl --hardware
# Example output:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# node 0 size: 129020 MB
# node 0 free: 121432 MB
# node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
# node 1 size: 131072 MB
# node 1 free: 122804 MB
# node distances:
# node 0 1
# 0: 10 21
# 1: 21 10
# Per-process NUMA distribution
numastat -m -p <pid>
# Example output:
# Per-node process memory usage (MB)
# Node 0 Node 1 Total
# --------------- -------- -------- --------
# myapp (pid 12345) Private 987 991 1978
Hardware performance counters
# LLC misses indicate memory traffic leaving the cache hierarchy
perf stat -e LLC-load-misses,LLC-store-misses -p <pid>
# NUMA-specific events (Intel, where supported)
perf stat -e offcore_response.all_requests.l3_miss.any_snoop \
-p <pid>
/proc/vmstat NUMA counters
The key counters (confirmed in mm/vmstat.c):
| Counter | Meaning | What a high value means |
|---|---|---|
| `numa_hit` | Allocations satisfied on the preferred node | Good: memory landing where intended |
| `numa_miss` | Allocations that fell through to a non-preferred node | High miss rate: node pressure, or interleave not set |
| `numa_foreign` | Pages intended for this node that were placed elsewhere | Cross-node pressure |
| `numa_interleave` | Allocations placed by `MPOL_INTERLEAVE` | Confirms interleave policy is active |
| `numa_local` | Allocations on the same node as the running thread | High: good for latency-bound workloads |
| `numa_other` | Allocations on a different node than the running thread | High: remote allocations (may be intentional with interleave) |
Reading the counters
A healthy interleaved workload shows high numa_interleave and roughly equal N0 and N1 counts in numastat -m. A high numa_miss with low numa_interleave suggests first-touch placement is concentrating pages on one node despite your policy setting.
CXL memory and bandwidth tiering
CXL (Compute Express Link) memory appears to Linux as additional NUMA nodes with no attached CPUs. A CXL link over PCIe Gen 5 x16 delivers roughly 64 GB/s per direction — well below local DDR5 (≈ 150+ GB/s per node) but far above NVMe. CXL is appropriate for cold data that needs fast access but does not need the full bandwidth of local DRAM.
For bandwidth-bound workloads, CXL nodes are generally not suitable targets for MPOL_INTERLEAVE — spreading hot data to a lower-bandwidth tier reduces aggregate throughput. CXL is better used through the kernel's automatic memory tiering (demotion of cold pages from DRAM to CXL).
See CXL Memory Tiering for the kernel's tiering framework and how CXL nodes are assigned to tiers based on HMAT/CDAT performance data.
Decision flowchart
Is your workload memory bandwidth-bound?
|
No → Focus on MPOL_BIND / latency; bandwidth is not the issue
|
Yes
|
├── Single NUMA node?
│ Yes → Bandwidth ceiling is fixed; consider CXL or upgrade
│ No ↓
|
├── Are pages landing where you expect?
│ Check: numastat -m, /proc/vmstat numa_interleave
│ No → First-touch problem → see "First-touch" section above
│ Yes ↓
|
├── Is NUMA balancing migrating pages away from interleave?
│ Check: watch -n1 'cat /proc/vmstat | grep numa'
│ Yes → echo 0 > /proc/sys/kernel/numa_balancing
│ Already off ↓
|
└── Are access patterns sequential?
No → Optimize data layout (row-major, structure of arrays)
Yes → Add MADV_HUGEPAGE for TLB headroom
Key source files
| File | Description |
|---|---|
| `mm/mempolicy.c` | `MPOL_INTERLEAVE` and `MPOL_WEIGHTED_INTERLEAVE` implementation; `interleave_nodes()`, `interleave_nid()` |
| `include/linux/mempolicy.h` | `struct mempolicy`, policy mode constants |
| `include/uapi/linux/mempolicy.h` | Userspace-visible policy constants (`MPOL_INTERLEAVE`, `MPOL_WEIGHTED_INTERLEAVE`, etc.) |
| `mm/vmstat.c` | `numa_hit`, `numa_miss`, `numa_interleave` counter names |
| `mm/page_alloc.c` | Where `NUMA_HIT`/`NUMA_MISS` counters are incremented |
| `arch/x86/mm/numa.c` | x86 NUMA topology discovery from ACPI SRAT/SLIT |
References
Kernel documentation
- `Documentation/admin-guide/mm/numa_memory_policy.rst` — NUMA memory policy reference
- `Documentation/mm/numa.rst` — NUMA balancing implementation details
Mailing list discussions
- MPOL_WEIGHTED_INTERLEAVE introduction — Gregory Price (Qualcomm), Nov 2023
- NUMA balancing and THP interaction — Mel Gorman on THP NUMA migration design
External benchmarks
- STREAM benchmark — McCalpin, J.D., standard for memory bandwidth measurement
Related
- NUMA memory management — policy API, automatic balancing, monitoring
- Transparent huge pages — THP configuration, khugepaged, TLB pressure
- CXL Memory Tiering — heterogeneous memory tiers
- Tuning databases — huge pages, THP, and latency-sensitive workloads
Further reading
- Kernel docs: NUMA memory policy — complete reference for `set_mempolicy()`, `mbind()`, and all policy modes including `MPOL_WEIGHTED_INTERLEAVE`
- `Documentation/admin-guide/mm/numa_memory_policy.rst` — kernel source for the NUMA memory policy admin guide
- `mm/mempolicy.c` — `MPOL_INTERLEAVE` and `MPOL_WEIGHTED_INTERLEAVE` implementation; `interleave_nodes()` and `interleave_nid()`
- LWN: NUMA in a hurry — practical introduction to NUMA topology and the pitfalls of first-touch allocation (2012)
- LWN: Weighted interleave memory policy — rationale and design of `MPOL_WEIGHTED_INTERLEAVE` for heterogeneous memory capacity (2023)
- STREAM benchmark — McCalpin's standard for measuring sustainable memory bandwidth; essential for calibrating any bandwidth optimization
- numa.md — NUMA balancing implementation, `node_distance`, fallback zone lists, and the full mempolicy API
- cxl-memory-tiering.md — why CXL nodes are poor interleave targets and how the tier framework assigns bandwidth-appropriate placement
- thp.md — transparent huge pages and their interaction with NUMA balancing and bandwidth-bound workloads