TLB Optimization Techniques
How the Linux kernel minimizes the cost of keeping Translation Lookaside Buffers coherent across CPUs
Key Source Files
| File | Purpose |
|---|---|
| arch/x86/include/asm/tlbflush.h | flush_tlb_mm(), flush_tlb_range(), flush_tlb_page(), struct tlb_state, struct flush_tlb_info |
| arch/x86/mm/tlb.c | TLB flush batching, lazy TLB mode, PCID/ASID management, flush_tlb_mm_range(), enter_lazy_tlb(), switch_mm_irqs_off() |
| include/asm-generic/tlb.h | struct mmu_gather, tlb_gather_mmu(), tlb_finish_mmu(), tlb_flush_mmu() |
| mm/mmu_gather.c | mmu_gather implementation: tlb_gather_mmu(), tlb_gather_mmu_fullmm(), tlb_finish_mmu(), tlb_flush_mmu() |
Why the TLB Matters for Performance
The Translation Lookaside Buffer (TLB) is a small hardware cache that stores recent virtual-to-physical address translations. Without it, every memory access would require a full page-table walk — typically three to five additional memory reads depending on the paging depth.
The Cost Difference
- TLB hit: ~1 cycle (~0.3 ns)
- TLB miss: ~50 cycles (~17 ns) plus the page-table walk
- Page walk: up to 4 serialized cache misses on a cold hierarchy
On modern x86, the L1 dTLB holds 64–96 entries for 4KB pages and 32 entries for 2MB pages. A second-level TLB (STLB) provides 1024–3072 additional entries. These numbers are dwarfed by realistic working sets.
| Working Set | 4KB Pages Needed | TLB Entries Required | Fits in TLB? |
|---|---|---|---|
| 4MB | 1,024 | 1,024 | Barely (fills a 1,024-entry STLB) |
| 1GB | 262,144 | 262,144 | No (256x STLB) |
| 100GB | 26,214,400 | 26,214,400 | No |
Processes with large virtual address spaces, many VMAs, or frequent mmap()/munmap() activity generate significant TLB pressure. Every stale entry that must be invalidated and every subsequent miss is paid in cycles — and on multi-CPU systems, in IPI latency.
TLB Flush Basics
The kernel must flush TLB entries whenever the underlying mapping changes in a way that could allow a CPU to observe incorrect data:
- Page unmapped: a live translation must not be used after the physical page is freed.
- Permission downgraded: e.g., write-protecting a page; the old cached read-write entry must be evicted.
- PTE changed: any modification to a present PTE that narrows access rights or changes the physical address.
The rule is: flush before freeing the page, never after.
The Three Core Flush Functions (x86)
These are defined in arch/x86/include/asm/tlbflush.h. On x86 they are all implemented as thin wrappers around the underlying flush_tlb_mm_range():
```c
/* Flush the entire address space of mm on all CPUs */
#define flush_tlb_mm(mm) \
	flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true)

/* Flush a range of pages in vma->vm_mm, respecting huge page stride */
#define flush_tlb_range(vma, start, end) \
	flush_tlb_mm_range((vma)->vm_mm, start, end, \
			   ((vma)->vm_flags & VM_HUGETLB) \
			   ? huge_page_shift(hstate_vma(vma)) \
			   : PAGE_SHIFT, true)

/* Flush a single page */
static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
{
	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
}
```
For kernel mappings, flush_tlb_kernel_range(start, end) sends a flush to every CPU — there is no per-process scoping for kernel space.
IPIs to Remote CPUs
When a mapping changes, every CPU that might have a cached translation must be notified. The kernel uses inter-processor interrupts (IPIs) for this: flush_tlb_multi() dispatches flush_tlb_func() to every CPU in the target mm_cpumask(mm). Each remote CPU then executes INVLPG (single page) or a CR3 reload (full flush) to evict the stale entries.
IPI delivery has non-trivial cost: on a loaded system, round-trip latency can be several microseconds. On NUMA systems with many sockets, this cost multiplies.
mmu_gather — Batched TLB Flushing
Naively, unmapping N pages would require N separate TLB flushes. The kernel avoids this with mmu_gather, a per-operation accumulator that collects all address ranges to be flushed and delivers a single batched flush at the end.
The struct
Defined in include/asm-generic/tlb.h:
```c
struct mmu_gather {
	struct mm_struct *mm;
	unsigned long start;             /* lowest address unmapped */
	unsigned long end;               /* highest address unmapped */
	unsigned int fullmm : 1;         /* tearing down entire mm? */
	unsigned int need_flush_all : 1; /* arch needs full flush */
	unsigned int freed_tables : 1;   /* freed page table pages */
	unsigned int delayed_rmap : 1;   /* pending rmap removals */

	/* which page-table levels were cleared */
	unsigned int cleared_ptes : 1;
	unsigned int cleared_pmds : 1;
	unsigned int cleared_puds : 1;
	unsigned int cleared_p4ds : 1;

	/* VMA flags captured at tlb_start_vma() */
	unsigned int vma_exec : 1;
	unsigned int vma_huge : 1;
	unsigned int vma_pfn : 1;

	unsigned int batch_count;
	struct mmu_gather_batch *active; /* current page-free batch */
	struct mmu_gather_batch local;   /* on-stack initial batch */
	struct page *__pages[MMU_GATHER_BUNDLE]; /* small initial array */
};
```
Lifecycle
The gather lifecycle is three steps, implemented in mm/mmu_gather.c:
```
tlb_gather_mmu(tlb, mm)   ← begin gathering; resets range and page list
        |
        | ... page-table walk, PTE clears, tlb_remove_page() calls ...
        | Each cleared PTE updates tlb->start / tlb->end via __tlb_adjust_range()
        |
tlb_finish_mmu(tlb)       ← issue TLB flush, free queued pages
```
tlb_gather_mmu() initialises the structure (range reset to [TASK_SIZE, 0], page batch empty) and increments mm->tlb_flush_pending to signal that a nested TLB flush may be needed.
tlb_finish_mmu() calls tlb_flush_mmu(), which in turn:
1. Calls tlb_flush_mmu_tlbonly() — issues flush_tlb_range() or flush_tlb_mm() covering the full accumulated range.
2. Frees all batched pages (tlb_batch_pages_flush()).
Finally, dec_tlb_flush_pending() balances the counter.
For the full-mm case (exec, exit), tlb_gather_mmu_fullmm() sets fullmm = 1. This skips per-VMA range tracking — there is no point narrowing a flush that will cover the entire address space anyway. On architectures with ASIDs, fullmm also enables cycling to a new ASID rather than sending IPIs.
Why Batching Matters
Consider munmap() of a 1GB anonymous mapping (262,144 pages). Without batching:
- 262,144 × flush_tlb_page() = 262,144 IPI round-trips on an SMP system.
- Each IPI costs ~1–5 µs; the total is potentially over a second of pure IPI overhead.
With mmu_gather, the kernel walks all PTEs, accumulates the range [start, start+1GB), then issues one flush_tlb_range() call. The single IPI flushes the entire range on each remote CPU at once.
Performance tip
If your workload performs many small munmap() calls in a tight loop, consider batching them into a single larger unmap or replacing them with madvise(MADV_FREE), which defers the actual page release and avoids immediate TLB shootdowns.
Lazy TLB Mode
When a CPU switches from a user-space process to a kernel thread (or another process whose mm differs), it does not immediately need to flush the TLB for the previous mm — there are no user mappings to protect from a kernel thread. Flushing early wastes CPU cycles.
The kernel exploits this with lazy TLB mode, tracked via the is_lazy flag in struct tlb_state_shared (per-CPU, cpu_tlbstate_shared):
```c
struct tlb_state_shared {
	bool is_lazy;
};
DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
```
Entering Lazy Mode
enter_lazy_tlb() is called by the scheduler when switching to a kernel thread. On x86:
```c
void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
		return;

	this_cpu_write(cpu_tlbstate_shared.is_lazy, true);
}
```
The CPU's bit in mm_cpumask(mm) remains set — the CPU is still in the cpumask so it will eventually receive flush IPIs — but the is_lazy flag tells the IPI sender it can skip sending unless page tables are being freed.
Skipping IPIs to Lazy CPUs
In native_flush_tlb_multi() (arch/x86/mm/tlb.c), the IPI is sent only to CPUs that are actively running the mm:
```c
if (info->freed_tables || mm_in_asid_transition(info->mm))
	on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
else
	on_each_cpu_cond_mask(should_flush_tlb, flush_tlb_func,
			      (void *)info, 1, cpumask);
```
should_flush_tlb() returns false for CPUs with is_lazy == true (unless page tables are being freed, in which case all CPUs must be flushed to prevent speculative walks of freed memory).
When the lazy CPU next runs user code, switch_mm_irqs_off() resets is_lazy to false and checks tlb_gen — if the generation counter has advanced while the CPU was lazy, a full flush is issued at that point.
What this means in practice
Idle CPUs and kernel threads on a large machine contribute zero IPI cost for typical page-table changes. The savings scale linearly with the number of CPUs that are in kernel threads or idle at the time of the flush.
Huge Pages and TLB Coverage
A 2MB huge page is mapped by a single PMD-level TLB entry rather than 512 individual 4KB PTE entries. The TLB coverage per entry is 512× larger.
| Working Set | 4KB Page TLB Entries | 2MB Page TLB Entries | Reduction |
|---|---|---|---|
| 1GB | 262,144 | 512 | 512× |
| 4GB | 1,048,576 | 2,048 | 512× |
This is the primary performance benefit of Transparent Huge Pages (THP): reducing TLB pressure rather than reducing memory usage.
How the Kernel Tracks Huge-Page Entries
struct mmu_gather tracks which page-table levels were cleared during an unmap operation:
```c
unsigned int cleared_ptes : 1; /* 4KB pages */
unsigned int cleared_pmds : 1; /* 2MB pages */
unsigned int cleared_puds : 1; /* 1GB pages */
unsigned int cleared_p4ds : 1;
```
tlb_get_unmap_shift() uses these bits to return the smallest TLB entry size that was unmapped, so tlb_flush() can pass the correct stride_shift to flush_tlb_range(). For huge-page VMAs, flush_tlb_range() receives PMD_SHIFT (21) as the stride, telling the hardware to flush at 2MB granularity rather than 4KB.
The vma_huge bit in struct mmu_gather is set by tlb_update_vma_flags() when processing a VM_HUGETLB VMA, allowing the flush to pass the correct stride even when an intermediate tlb_flush_mmu() fires before tlb_finish_mmu().
TLB-aware allocation
For workloads with multi-gigabyte working sets, enabling THP (/sys/kernel/mm/transparent_hugepage/enabled = always) or using madvise(MADV_HUGEPAGE) on hot regions dramatically reduces TLB miss rates. The reduction in TLB pressure typically outweighs any memory fragmentation cost for large, long-lived allocations.
PCID — Process Context Identifiers
Without PCID, every context switch flushes the entire TLB (by reloading CR3 with the new page-table root). With PCID, the TLB is tagged: each entry carries the PCID of the process that created it, and the CPU can keep entries for multiple processes simultaneously. Context switches no longer require a flush — the incoming process simply starts using its own tagged entries.
PCID in the Hardware
x86 supports up to 4096 PCIDs (12-bit field in CR3). To switch to a process without flushing its existing TLB entries, the kernel sets the CR3_NOFLUSH bit (bit 63) when writing CR3.
Linux's ASID Model
Linux does not give each process a globally unique PCID. Instead it maintains a small per-CPU array of dynamic ASIDs (TLB_NR_DYN_ASIDS = 6), caching the last few processes that ran on each CPU:
```c
struct tlb_context {
	u64 ctx_id;  /* which mm owns this slot */
	u64 tlb_gen; /* generation when last flushed */
};

struct tlb_state {
	struct mm_struct *loaded_mm;
	u16 loaded_mm_asid;
	u16 next_asid;
	bool invalidate_other; /* flush non-loaded ctxs next switch */
	unsigned short user_pcid_flush_mask;
	struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
	/* ... */
};
DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
```
choose_new_asid() scans ctxs[] looking for an existing slot whose ctx_id matches next->context.ctx_id. If found, it checks whether tlb_gen is current; if so, need_flush = false and the CR3 load uses CR3_NOFLUSH:
```c
static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid,
					      unsigned long lam)
{
	VM_WARN_ON_ONCE(!boot_cpu_has(X86_FEATURE_PCID));
	return build_cr3(pgd, asid, lam) | CR3_NOFLUSH;
}
```
If the generation has advanced (a TLB flush happened while this CPU was running another process), need_flush = true and a full CR3 reload (without CR3_NOFLUSH) evicts stale entries for that ASID.
KPTI and Doubled PCIDs
Why KPTI exists
On January 3, 2018, the Meltdown vulnerability (CVE-2017-5754) was disclosed. The root cause: for performance, Linux had always mapped the kernel's virtual address space into every process's page table. Switching from user to kernel mode (syscall, interrupt) required no CR3 reload — the kernel's virtual addresses were already reachable. This saved a TLB flush on every kernel entry and exit.
Meltdown showed that a user process could speculatively read kernel memory through this shared mapping using a side channel (cache timing), bypassing the hardware permission bits. The fix — Kernel Page Table Isolation (KPTI) — was conceptually simple: give each process two separate sets of page tables. The "user" page table has almost no kernel mappings (just a tiny trampoline needed to handle the mode switch). The "kernel" page table is only active while in kernel mode.
The cost was severe: every syscall and interrupt required two CR3 writes (one to switch to kernel page tables on entry, one to switch back on return). On a pre-PCID processor, each CR3 write flushes the entire TLB. Benchmarks showed 5–30% performance regression on syscall-heavy workloads.
PCID made KPTI bearable
PCID hardware had existed since Intel Westmere (2010), but Linux left it unused for years — the gain from avoiding TLB flushes on ordinary context switches was judged too small to justify the complexity. Full PCID support was merged only in Linux 4.14, shortly before Meltdown changed that calculation completely.
With PCID active, the two CR3 writes per syscall no longer flush the TLB: the kernel and user page tables each get a distinct PCID tag, and the CPU keeps both in the TLB simultaneously. An entry to the kernel switches to the kernel PCID; a return to userspace switches back to the user PCID. Warm TLB entries survive the transition.
KPTI landed in Linux 4.15 (and was backported to stable kernels) on top of 4.14's PCID support. On PCID-capable hardware (Westmere and newer), the KPTI overhead dropped from ~30% to ~5–10% for syscall-heavy workloads. On older hardware without PCID (which still had to flush on every CR3 write), KPTI remained expensive and was sometimes disabled by administrators on trusted workloads.
When Kernel Page Table Isolation (KPTI) is active, each mm needs two PCIDs: one for user-space page tables and one for kernel-space. This halves the available PCID space to 2048 effective ASIDs (PTI_CONSUMED_PCID_BITS = 1).
Global ASIDs (INVLPGB)
On AMD processors supporting INVLPGB (broadcast TLB invalidation), the kernel can assign a global ASID to processes that are simultaneously active on four or more CPUs. A global ASID is identical on every CPU, enabling hardware-assisted remote TLB invalidation without IPIs:
```c
/* From arch/x86/mm/tlb.c */
static void consider_global_asid(struct mm_struct *mm)
{
	if (!cpu_feature_enabled(X86_FEATURE_INVLPGB))
		return;

	if (mm_active_cpus_exceeds(mm, 3))
		use_global_asid(mm);
}
```
broadcast_tlb_flush() then uses invlpgb_flush_user_nr_nosync() followed by __tlbsync() to flush the TLB across all CPUs in hardware, completely eliminating IPI overhead for heavily multi-threaded processes.
PCID benefit at a glance
On a workload with frequent context switches between the same small set of processes (e.g., a web server with a fixed worker pool), PCID avoids all TLB flushes on CPUs where the process's ASID slot is still valid and up to date. The savings are proportional to TLB warm-up time per context switch.
TLB Shootdown Cost on NUMA Systems
On multi-socket NUMA machines, a TLB shootdown IPI must cross inter-socket interconnects to reach remote CPUs. These cross-socket IPIs are substantially more expensive than intra-socket ones.
Cost Anatomy
A TLB shootdown for a single munmap() on a 64-CPU, 4-socket machine:
- IPI send: broadcast to up to 63 CPUs, crossing NUMA interconnects.
- IPI receive overhead: each CPU must enter the IPI handler, execute INVLPG or a CR3 reload, and return.
- Synchronization: the initiating CPU waits for all acknowledgements before returning from flush_tlb_multi().
The mmu_gather batching directly amortises this cost: the number of IPIs is bounded by the number of distinct flush events, not the number of pages unmapped. Regardless of whether 1 or 1,000,000 pages are unmapped in a single munmap(), at most one flush IPI round-trip is needed.
tlb_single_page_flush_ceiling
For small ranges, individual INVLPG instructions (one per page) are cheaper than a full flush. For larger ranges, a full flush wins. The kernel switches strategies at tlb_single_page_flush_ceiling pages (default: 33):
Inside get_flush_tlb_info(), if (end - start) >> stride_shift > tlb_single_page_flush_ceiling, the range is promoted to TLB_FLUSH_ALL.
This threshold is tunable via /sys/kernel/debug/x86/tlb_single_page_flush_ceiling.
Reducing shootdown cost
- Avoid frequent small unmaps in hot paths; coalesce munmap() calls over larger ranges.
- Use madvise(MADV_FREE) instead of munmap() where possible: the kernel defers the actual page release until memory pressure, avoiding an immediate TLB shootdown.
- Use madvise(MADV_DONTNEED) to drop page contents without unmapping the VMA; it still triggers a TLB flush, but avoids VMA teardown and the cost of re-mapping later.
- On workloads with many threads, evaluate whether the kernel can use global ASIDs (AMD INVLPGB) to replace IPIs with hardware broadcast flushing.
Kernel TLB Pressure: vmalloc and Global Entries
Kernel virtual memory allocated via vmalloc() uses the global page-table region. Kernel mappings are not tagged per-process — they appear in every address space via the shared kernel PGD entries. Flushing a kernel TLB entry requires invalidating it on every CPU, regardless of which process is running.
flush_tlb_kernel_range()
This function sends a flush to every CPU on the system. On a 64-CPU machine, every vfree() that triggers a kernel TLB flush costs 63 IPIs. The _PAGE_GLOBAL flag on kernel PTEs means these entries are not evicted by normal user-space CR3 reloads, so there is no opportunity to avoid the IPIs.
On processors with INVLPGB, flush_tlb_kernel_range() uses invlpgb_flush_addr_nosync() + __tlbsync() instead of IPIs, eliminating the per-CPU interrupt overhead.
Why vmalloc Is Expensive
vmalloc() call path:
1. Allocate physical pages (GFP_KERNEL)
2. Find a free range in the vmalloc virtual address space
3. Map pages into kernel page tables
4. flush_tlb_kernel_range() → IPI to ALL CPUs
vfree() call path:
1. Unmap pages from kernel page tables
2. flush_tlb_kernel_range() → IPI to ALL CPUs
3. Free physical pages
The global flush on every vmalloc()/vfree() pair makes vmalloc unsuitable for high-frequency allocation in hot paths.
vmalloc guidelines
- Avoid vmalloc in hot paths. Use kmalloc() (slab) for objects under ~128KB; it operates within the direct-mapped region and requires no global TLB invalidation.
- For large kernel buffers that are allocated once and reused, vmalloc is acceptable; the TLB cost is paid once at allocation and once at deallocation.
- vmalloc_huge() uses 2MB huge pages for vmalloc mappings, which reduces the number of kernel TLB entries required and can improve performance for large, persistent vmalloc areas.
Summary: TLB Optimization Strategies
| Technique | Mechanism | Kernel Implementation |
|---|---|---|
| Flush batching | Accumulate ranges; one flush per unmap operation | struct mmu_gather, tlb_gather_mmu() / tlb_finish_mmu() |
| Lazy TLB mode | Skip IPIs to CPUs running kernel threads | enter_lazy_tlb(), cpu_tlbstate_shared.is_lazy, should_flush_tlb() |
| PCID / ASID caching | Survive context switches without flushing | choose_new_asid(), build_cr3_noflush(), cpu_tlbstate.ctxs[] |
| Huge pages | 512× TLB coverage per entry | tlb_update_vma_flags(), vma_huge, PMD-level flush stride |
| Global ASIDs (INVLPGB) | Hardware broadcast; no IPIs | use_global_asid(), broadcast_tlb_flush() |
| Range-vs-full heuristic | Switch to full flush beyond 33 pages | tlb_single_page_flush_ceiling, get_flush_tlb_info() |
| Avoid kernel vmalloc | No global TLB flush for slab allocs | Use kmalloc() instead of vmalloc() in hot paths |
Further reading
Kernel source
- arch/x86/mm/tlb.c — flush_tlb_mm_range(), native_flush_tlb_multi(), enter_lazy_tlb(), switch_mm_irqs_off(), PCID/ASID management
- arch/x86/include/asm/tlbflush.h — flush_tlb_mm(), flush_tlb_range(), flush_tlb_page(), struct tlb_state
- include/asm-generic/tlb.h — struct mmu_gather, tlb_gather_mmu(), tlb_finish_mmu()
- mm/mmu_gather.c — mmu_gather implementation: batching, range accumulation, and page freeing
Kernel documentation
- Documentation/arch/x86/tlb.rst — x86 TLB flushing design notes: why certain shootdown paths are taken
Related pages
- page-tables.md — the page table hierarchy that TLB entries cache
- thp.md — how huge pages reduce TLB pressure by providing 512× larger entries
- boot-page-tables.md — PCID activation and the early CR3 setup that initializes ASID support
LWN articles
- LWN: Improving page-fault scalability — covers PCID benefits and context-switch TLB cost reduction
- LWN: Meltdown and Spectre, part 2 — KPTI's impact on TLB performance and the doubled-PCID workaround
- LWN: INVLPGB and broadcast TLB flushing — AMD INVLPGB support: eliminating cross-CPU IPIs for TLB invalidation