Page Tables

Mapping virtual addresses to physical memory

What Are Page Tables?

Page tables are hierarchical data structures that map virtual addresses (what programs see) to physical addresses (actual RAM locations). The MMU (Memory Management Unit) hardware uses them to translate every memory access.

Virtual Address                Physical Address
0x7fff12340000   ──────────>   0x1a2b3c000
       │                              │
       │    Page Table Walk           │
       └──────────────────────────────┘

The Five-Level Hierarchy

Linux uses a five-level page table hierarchy (since v4.14 for x86-64, see x86_64 memory map):

+-----+
| PGD |  Page Global Directory
+-----+
   v
+-----+
| P4D |  Page Level 4 Directory (5-level paging only)
+-----+
   v
+-----+
| PUD |  Page Upper Directory
+-----+
   v
+-----+
| PMD |  Page Middle Directory
+-----+
   v
+-----+
| PTE |  Page Table Entry
+-----+
   v
 PAGE    Physical memory page

Level Details

Level  Type   Bits (x86-64)                       Entries  Maps (per entry)
PGD    pgd_t  47-39 (4-level) or 56-48 (5-level)  512      512GB (4-level) or 256TB (5-level)
P4D    p4d_t  47-39                               512      512GB (folded if 4-level)
PUD    pud_t  38-30                               512      1GB
PMD    pmd_t  29-21                               512      2MB
PTE    pte_t  20-12                               512      4KB

Address Translation (x86-64, 4-level)

A 48-bit virtual address breaks down as:

 47      39 38      30 29      21 20      12 11       0
+----------+----------+----------+----------+----------+
| PGD idx  | PUD idx  | PMD idx  | PTE idx  |  Offset  |
+----------+----------+----------+----------+----------+
   9 bits     9 bits     9 bits     9 bits    12 bits

Page Table Folding

Not all architectures need all five levels. Linux handles this through folding: unused levels are optimized away at compile time.

Architecture       Levels  Notes
x86-64 (LA57)      5       57-bit virtual addresses
x86-64 (standard)  4       48-bit virtual addresses
x86-32 (PAE)       3       36-bit physical addresses
x86-32             2       Original 32-bit
ARM64              3-4     Configurable (depends on page size and VA bits)

When a level is folded, functions like p4d_offset() become no-ops that return the input unchanged.

Key Data Structures

Per-Process Page Tables

Each process has its own page tables via mm_struct:

struct mm_struct {
    pgd_t *pgd;              /* Top-level page table */
    atomic_t mm_users;       /* Users of this mm */
    atomic_t mm_count;       /* References to struct */
    /* ... */
};

The kernel has a single swapper_pg_dir for kernel mappings, shared across all processes.

Page Table Entry Format (x86-64 specific)

 63    62    52 51                           12 11  9 8 7 6 5 4 3 2 1 0
+----+--------+-------------------------------+-----+-+-+-+-+-+-+-+-+-+
| XD | (avail)|        Physical Address       |avail|G|S|D|A|C|W|U|W|P|
+----+--------+-------------------------------+-----+-+-+-+-+-+-+-+-+-+
  │                        │                          │ │ │ │ │ │ │ │ │
  │                        │                          │ │ │ │ │ │ │ │ └─ Present
  │                        │                          │ │ │ │ │ │ │ └─── Writable
  │                        │                          │ │ │ │ │ │ └───── User accessible
  │                        │                          │ │ │ │ │ └─────── Write-through
  │                        │                          │ │ │ │ └───────── Cache disable
  │                        │                          │ │ │ └─────────── Accessed
  │                        │                          │ │ └───────────── Dirty
  │                        │                          │ └─────────────── Page size (huge)
  │                        │                          └───────────────── Global
  │                        └──────────────────────────────────────────── PFN
  └───────────────────────────────────────────────────────────────────── Execute disable

Page Faults

When the MMU can't translate an address, it triggers a page fault. The kernel handles this in handle_mm_fault():

Page Fault
    v
handle_mm_fault()
    v
__handle_mm_fault()
    ├── pgd_offset() ─> p4d_alloc()
    ├── p4d_offset() ─> pud_alloc()
    ├── pud_offset() ─> pmd_alloc()
    ├── pmd_offset() ─> pte_alloc()
    v
handle_pte_fault()
    ├── do_read_fault()    (file read)
    ├── do_cow_fault()     (copy-on-write)
    └── do_shared_fault()  (shared mapping)

Fault Types

Type     Cause                        Resolution
Minor    Page in memory, PTE not set  Update PTE
Major    Page not in memory           Read from disk/swap
Invalid  Bad address or permissions   SIGSEGV

TLB (Translation Lookaside Buffer)

The TLB caches recent translations. Without it, every memory access would require multiple page table lookups.

TLB Flush Operations

/* Flush single page */
flush_tlb_page(vma, addr);

/* Flush range */
flush_tlb_range(vma, start, end);

/* Flush entire mm */
flush_tlb_mm(mm);

TLB flushes are expensive on SMP - they require IPIs (Inter-Processor Interrupts) to all CPUs running the affected process.

Huge Pages

Higher-level entries can map large pages directly, skipping lower levels:

Level  Page Size  Use Case
PUD    1GB        Large databases, VMs
PMD    2MB        General large allocations
PTE    4KB        Default

/* Check if PMD maps a huge page */
if (pmd_large(*pmd)) {
    /* 2MB page, no PTE walk needed */
}

Benefits:

  • Fewer TLB entries needed
  • Reduced page table memory
  • Fewer page faults

Trade-offs:

  • Internal fragmentation
  • Allocation challenges

History

Origins (1991-1994)

Early Linux on i386 used two-level page tables (PGD + PTE). The swapper_pg_dir was the top-level Page Global Directory, not a single flat table. The i386 hardware dictated this structure.

Three-Level (v2.3.23, 1999)

Added PMD for PAE (Physical Address Extension) on x86, enabling >4GB physical memory on 32-bit.

Four-Level (v2.6.11, 2005)

Added PUD for x86-64's 48-bit virtual address space. This predates git history (kernel moved to git at v2.6.12).

Five-Level (v4.14, 2017)

Commit: 77ef56e4f0fb ("x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y")

Added P4D for 57-bit virtual addresses (128PB), needed because 4-level paging limits the kernel's direct map to 64TB of physical memory.

Try It Yourself

View Process Page Tables

# Page table stats for a process
cat /proc/<pid>/smaps | grep -E "Size|Rss|Pss"

# Detailed page mapping
cat /proc/<pid>/pagemap  # Binary format

# Parse with the kernel's page-types tool (tools/mm in the source tree)
page-types -p <pid>

Check TLB Flushes

# TLB flush statistics
cat /proc/vmstat | grep tlb

# Trace TLB flushes
echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable
cat /sys/kernel/debug/tracing/trace_pipe

Check Huge Page Usage

# System huge page stats
cat /proc/meminfo | grep -i huge

# Per-process huge pages
cat /proc/<pid>/smaps | grep -i huge

Common Issues

TLB Shootdown Storms

Heavy mprotect() or munmap() causes excessive IPIs.

Debug: Check /proc/interrupts for TLB IPI counts.

Page Table Memory Overhead

Sparse address spaces waste page table memory.

Debug: Check PageTables in /proc/meminfo.

Huge Page Allocation Failures

Can't allocate huge pages due to fragmentation.

Solutions:

  • Reserve at boot: hugepages=N
  • Enable THP: /sys/kernel/mm/transparent_hugepage/enabled
  • Compact memory: echo 1 > /proc/sys/vm/compact_memory

Notorious bugs and edge cases

Page tables mediate all memory access. Bugs here range from privilege escalation to information disclosure to complete CPU security model breakdown.

Case 1: Meltdown (CVE-2017-5754)

What happened

In January 2018, researchers disclosed Meltdown, a hardware vulnerability affecting Intel CPUs that allowed unprivileged processes to read kernel memory.

The bug

Meltdown exploits speculative execution combined with cache timing. The CPU speculatively executes an illegal kernel read, and even though it raises a fault, the cache side effects remain, leaking the data.

The fix: KPTI

Commit: aa8c6248f8c7 ("x86/mm/pti: Add infrastructure for page table isolation")

Author: Thomas Gleixner

KPTI creates separate page tables for kernel and user mode. When entering the kernel, page tables are switched to include kernel mappings.

# Check KPTI status
cat /sys/devices/system/cpu/vulnerabilities/meltdown

Case 2: Spectre (CVE-2017-5753, CVE-2017-5715)

What happened

Disclosed alongside Meltdown, Spectre exploits speculative execution to leak data from other processes or the kernel.

Variant 1 (Bounds Check Bypass)

The CPU speculatively executes past bounds checks, allowing out-of-bounds reads via cache timing.

Fix: Array index masking with array_index_nospec().

Variant 2 (Branch Target Injection)

Attackers poison the branch predictor to redirect speculative execution to chosen gadgets.

Fix: Retpoline - replaces indirect calls with a sequence that traps speculative execution.

Commit: 76b043848fd2 ("x86/retpoline: Add initial retpoline support")


Case 3: TLB flush races

What happened

The TLB caches page table entries. When page tables change, the TLB must be flushed. Races between changes and flushes cause security vulnerabilities.

The bug pattern

A CPU can use stale TLB entries to access pages that should be unmapped:

  1. Page table entry cleared
  2. TLB flush pending (not yet complete)
  3. Another CPU accesses via stale TLB entry
  4. Access succeeds to freed/reallocated page

Example: CVE-2018-18281

The mremap() syscall moved page table entries but flushed TLBs too late.

Commit: eb66ae030829 ("mremap: properly flush TLB before releasing the page")


Case 4: KASLR bypasses

What is KASLR?

Kernel Address Space Layout Randomization randomizes where the kernel is loaded, making exploits harder.

Bypass techniques

Technique            Method
Info leak            Kernel pointer exposed to userspace
Timing side channel  Measure access times
dmesg leak           Kernel addresses in logs

Hardening

# Restrict kernel pointers
echo 2 > /proc/sys/kernel/kptr_restrict

# Restrict dmesg
echo 1 > /proc/sys/kernel/dmesg_restrict

Summary: Lessons learned

Bug           Year     Type       Mitigation
Meltdown      2017     Hardware   KPTI
Spectre v1    2017     Hardware   nospec barriers
Spectre v2    2017     Hardware   Retpoline
TLB races     Ongoing  Software   Proper flush ordering
KASLR bypass  Ongoing  Info leak  Pointer restrictions

The pattern: Page table bugs often involve timing (TLB races, speculation) or assumptions about hardware behavior that turn out to be wrong.

References

Key Code

File                            Description
include/linux/pgtable.h         Generic page table API
arch/x86/include/asm/pgtable.h  x86 page table definitions
mm/memory.c                     Page fault handling

Further reading

  • arch/x86/mm/pgtable.c — x86-specific page table allocation, PTE manipulation helpers, and KPTI page table switching
  • include/linux/pgtable.h — architecture-independent page table API: pgd_offset(), pte_present(), pte_mkwrite(), and the full set of PTE flag accessors
  • mm/memory.c — page fault entry point handle_mm_fault(), COW handling, and page table teardown on munmap()
  • Documentation/admin-guide/mm/concepts.rst — kernel documentation overview of virtual memory, page tables, and the TLB
  • Five-level page tables (LWN, 2017) — design and rationale for adding the P4D level to support 57-bit virtual address spaces
  • KPTI: kernel page-table isolation (LWN, 2017) — how the Meltdown mitigation works by maintaining separate page tables for kernel and user mode
  • page-fault — the complete page fault handling path from hardware trap through handle_pte_fault() to memory resolution
  • tlb-optimization — TLB batching, lazy flushing, and how the kernel minimises IPI-heavy TLB shootdowns