Page Tables
Mapping virtual addresses to physical memory
What Are Page Tables?
Page tables are hierarchical data structures that map virtual addresses (what programs see) to physical addresses (actual RAM locations). The MMU (Memory Management Unit) hardware uses them to translate every memory access.
Virtual Address Physical Address
0x7fff12340000 ──────────> 0x1a2b3c000
│ │
│ Page Table Walk │
└──────────────────────────────┘
The Five-Level Hierarchy
Linux uses a five-level page table hierarchy (since v4.14 for x86-64, see x86_64 memory map):
+-----+
| PGD | Page Global Directory
+-----+
│
v
+-----+
| P4D | Page Level 4 Directory (5-level paging only)
+-----+
│
v
+-----+
| PUD | Page Upper Directory
+-----+
│
v
+-----+
| PMD | Page Middle Directory
+-----+
│
v
+-----+
| PTE | Page Table Entry
+-----+
│
v
PAGE Physical memory page
Level Details
| Level | Type | Bits (x86-64) | Entries | Maps |
|---|---|---|---|---|
| PGD | pgd_t | 47-39 (4-level) or 56-48 (5-level) | 512 | 512GB or 128PB |
| P4D | p4d_t | 47-39 | 512 | 512GB (folded if 4-level) |
| PUD | pud_t | 38-30 | 512 | 1GB |
| PMD | pmd_t | 29-21 | 512 | 2MB |
| PTE | pte_t | 20-12 | 512 | 4KB |
Address Translation (x86-64, 4-level)
A 48-bit virtual address breaks down as:
47 39 38 30 29 21 20 12 11 0
+----------+----------+----------+----------+----------+
| PGD idx | PUD idx | PMD idx | PTE idx | Offset |
+----------+----------+----------+----------+----------+
9 bits 9 bits 9 bits 9 bits 12 bits
Page Table Folding
Not all architectures need all five levels. Linux handles this through folding - unused levels are compile-time optimized away.
| Architecture | Levels | Notes |
|---|---|---|
| x86-64 (LA57) | 5 | 57-bit virtual addresses |
| x86-64 (standard) | 4 | 48-bit virtual addresses |
| x86-32 (PAE) | 3 | 36-bit physical addresses |
| x86-32 | 2 | Original 32-bit |
| ARM64 | 3-4 | Configurable |
When a level is folded, functions like p4d_offset() become no-ops that return the input unchanged.
Key Data Structures
Per-Process Page Tables
Each process has its own page tables via mm_struct:
struct mm_struct {
pgd_t *pgd; /* Top-level page table */
atomic_t mm_users; /* Users of this mm */
atomic_t mm_count; /* References to struct */
/* ... */
};
The kernel has a single swapper_pg_dir for kernel mappings, shared across all processes.
Page Table Entry Format (x86-64 specific)
63 62 52 51 12 11 9 8 7 6 5 4 3 2 1 0
+----+--------+-------------------------------+-----+-+-+-+-+-+-+-+-+-+
| XD | (avail)| Physical Address |avail|G|S|D|A|C|W|U|W|P|
+----+--------+-------------------------------+-----+-+-+-+-+-+-+-+-+-+
│ │ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │ └─ Present
│ │ │ │ │ │ │ │ │ └─── Writable
│ │ │ │ │ │ │ │ └───── User accessible
│ │ │ │ │ │ │ └─────── Write-through
│ │ │ │ │ │ └───────── Cache disable
│ │ │ │ │ └─────────── Accessed
│ │ │ │ └───────────── Dirty
│ │ │ └─────────────── Page size (huge)
│ │ └───────────────── Global
│ └──────────────────────────────────────────── PFN
└───────────────────────────────────────────────────────────────────── Execute disable
Page Faults
When the MMU can't translate an address, it triggers a page fault. The kernel handles this in handle_mm_fault():
Page Fault
│
v
handle_mm_fault()
│
v
__handle_mm_fault()
│
├── pgd_offset() ─> p4d_alloc()
├── p4d_offset() ─> pud_alloc()
├── pud_offset() ─> pmd_alloc()
├── pmd_offset() ─> pte_alloc()
│
v
handle_pte_fault()
│
├── do_read_fault() (file read)
├── do_cow_fault() (copy-on-write)
└── do_shared_fault() (shared mapping)
Fault Types
| Type | Cause | Resolution |
|---|---|---|
| Minor | Page in memory, PTE not set | Update PTE |
| Major | Page not in memory | Read from disk/swap |
| Invalid | Bad address or permissions | SIGSEGV |
TLB (Translation Lookaside Buffer)
The TLB caches recent translations. Without it, every memory access would require multiple page table lookups.
TLB Flush Operations
/* Flush single page */
flush_tlb_page(vma, addr);
/* Flush range */
flush_tlb_range(vma, start, end);
/* Flush entire mm */
flush_tlb_mm(mm);
TLB flushes are expensive on SMP - they require IPIs (Inter-Processor Interrupts) to all CPUs running the affected process.
Huge Pages
Higher-level entries can map large pages directly, skipping lower levels:
| Level | Page Size | Use Case |
|---|---|---|
| PUD | 1GB | Large databases, VMs |
| PMD | 2MB | General large allocations |
| PTE | 4KB | Default |
Benefits:
- Fewer TLB entries needed
- Reduced page table memory
- Fewer page faults

Trade-offs:
- Internal fragmentation
- Allocation challenges
History
Origins (1991-1994)
Early Linux on i386 used two-level page tables (PGD + PTE). The swapper_pg_dir was the top-level Page Global Directory, not a single flat table. The i386 hardware dictated this structure.
Three-Level (v2.3.23, 1999)
Added PMD for PAE (Physical Address Extension) on x86, enabling >4GB physical memory on 32-bit.
Four-Level (v2.6.11, 2005)
Added PUD for x86-64's 48-bit virtual address space. This predates git history (kernel moved to git at v2.6.12).
Five-Level (v4.14, 2017)
Commit: 77ef56e4f0fb ("x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y") | LKML
Added P4D for 57-bit virtual addresses (128PB), needed for machines with >64TB physical memory.
Try It Yourself
View Process Page Tables
# Page table stats for a process
cat /proc/<pid>/smaps | grep -E "Size|Rss|Pss"
# Detailed page mapping
cat /proc/<pid>/pagemap # Binary format
# Parse with tools such as page-types from the kernel source tree
page-types -p <pid>
Check TLB Flushes
# TLB flush statistics
cat /proc/vmstat | grep tlb
# Trace TLB flushes
echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable
cat /sys/kernel/debug/tracing/trace_pipe
Check Huge Page Usage
# System huge page stats
cat /proc/meminfo | grep -i huge
# Per-process huge pages
cat /proc/<pid>/smaps | grep -i huge
Common Issues
TLB Shootdown Storms
Heavy mprotect() or munmap() causes excessive IPIs.
Debug: Check /proc/interrupts for TLB IPI counts.
Page Table Memory Overhead
Sparse address spaces waste page table memory.
Debug: Check PageTables in /proc/meminfo.
Huge Page Allocation Failures
Can't allocate huge pages due to fragmentation.
Solutions:
- Reserve at boot: hugepages=N
- Enable THP: /sys/kernel/mm/transparent_hugepage/enabled
- Compact memory: echo 1 > /proc/sys/vm/compact_memory
Notorious bugs and edge cases
Page tables mediate all memory access. Bugs here range from privilege escalation to information disclosure to complete CPU security model breakdown.
Case 1: Meltdown (CVE-2017-5754)
What happened
In January 2018, researchers disclosed Meltdown, a hardware vulnerability affecting Intel CPUs that allowed unprivileged processes to read kernel memory.
The bug
Meltdown exploits speculative execution combined with cache timing. The CPU speculatively executes an illegal kernel read, and even though it raises a fault, the cache side effects remain, leaking the data.
The fix: KPTI
Commit: aa8c6248f8c7 ("x86/mm/pti: Add infrastructure for page table isolation")
Author: Thomas Gleixner
KPTI creates separate page tables for kernel and user mode. When entering the kernel, page tables are switched to include kernel mappings.
Case 2: Spectre (CVE-2017-5753, CVE-2017-5715)
What happened
Disclosed alongside Meltdown, Spectre exploits speculative execution to leak data from other processes or the kernel.
Variant 1 (Bounds Check Bypass)
The CPU speculatively executes past bounds checks, allowing out-of-bounds reads via cache timing.
Fix: Array index masking with array_index_nospec().
Variant 2 (Branch Target Injection)
Attackers poison the branch predictor to redirect speculative execution to chosen gadgets.
Fix: Retpoline - replaces indirect calls with a sequence that traps speculative execution.
Commit: 76b043848fd2 ("x86/retpoline: Add initial retpoline support")
Case 3: TLB flush races
What happened
The TLB caches page table entries. When page tables change, the TLB must be flushed. Races between changes and flushes cause security vulnerabilities.
The bug pattern
A CPU can use stale TLB entries to access pages that should be unmapped:
1. Page table entry cleared
2. TLB flush pending (not yet complete)
3. Another CPU accesses via stale TLB entry
4. Access succeeds to freed/reallocated page
Example: CVE-2018-18281
The mremap() syscall moved page table entries but flushed TLBs too late.
Commit: eb66ae030829 ("mremap: properly flush TLB before releasing the page")
Case 4: KASLR bypasses
What is KASLR?
Kernel Address Space Layout Randomization randomizes where the kernel is loaded, making exploits harder.
Bypass techniques
| Technique | Method |
|---|---|
| Info leak | Kernel pointer exposed to userspace |
| Timing side channel | Measure access times |
| dmesg leak | Kernel addresses in logs |
Hardening
# Restrict kernel pointers
echo 2 > /proc/sys/kernel/kptr_restrict
# Restrict dmesg
echo 1 > /proc/sys/kernel/dmesg_restrict
Summary: Lessons learned
| Bug | Year | Type | Mitigation |
|---|---|---|---|
| Meltdown | 2017 | Hardware | KPTI |
| Spectre v1 | 2017 | Hardware | nospec barriers |
| Spectre v2 | 2017 | Hardware | Retpoline |
| TLB races | Ongoing | Software | Proper flush ordering |
| KASLR bypass | Ongoing | Info leak | Pointer restrictions |
The pattern: Page table bugs often involve timing (TLB races, speculation) or assumptions about hardware behavior that turn out to be wrong.
References
Key Code
| File | Description |
|---|---|
| include/linux/pgtable.h | Generic page table API |
| arch/x86/include/asm/pgtable.h | x86 page table definitions |
| mm/memory.c | Page fault handling |
Related
- overview - Memory management overview
- glossary - PFN, TLB, MMU definitions
- Bug Index - Index of all mm kernel bugs
Further reading
- arch/x86/mm/pgtable.c — x86-specific page table allocation, PTE manipulation helpers, and KPTI page table switching
- include/linux/pgtable.h — architecture-independent page table API: pgd_offset(), pte_present(), pte_mkwrite(), and the full set of PTE flag accessors
- mm/memory.c — page fault entry point handle_mm_fault(), COW handling, and page table teardown on munmap()
- Documentation/admin-guide/mm/concepts.rst — kernel documentation overview of virtual memory, page tables, and the TLB
- Five-level page tables (LWN, 2017) — design and rationale for adding the P4D level to support 57-bit virtual address spaces
- KPTI: kernel page-table isolation (LWN, 2017) — how the Meltdown mitigation works by maintaining separate page tables for kernel and user mode
- page-fault — the complete page fault handling path from hardware trap through handle_pte_fault() to memory resolution
- tlb-optimization — TLB batching, lazy flushing, and how the kernel minimises IPI-heavy TLB shootdowns