Page Tables
Mapping virtual addresses to physical memory
What Are Page Tables?
Page tables are hierarchical data structures that map virtual addresses (what programs see) to physical addresses (actual RAM locations). The MMU (Memory Management Unit) hardware uses them to translate every memory access.
Virtual Address Physical Address
0x7fff12340000 ──────────> 0x1a2b3c000
│ │
│ Page Table Walk │
└──────────────────────────────┘
The Five-Level Hierarchy
Linux uses a five-level page table hierarchy (since v4.14 for x86-64, see x86_64 memory map):
+-----+
| PGD | Page Global Directory
+-----+
│
v
+-----+
| P4D | Page Level 4 Directory (5-level paging only)
+-----+
│
v
+-----+
| PUD | Page Upper Directory
+-----+
│
v
+-----+
| PMD | Page Middle Directory
+-----+
│
v
+-----+
| PTE | Page Table Entry
+-----+
│
v
PAGE Physical memory page
Level Details
| Level | Type | Bits (x86-64) | Entries | Maps |
|---|---|---|---|---|
| PGD | pgd_t | 47-39 (4-level) or 56-48 (5-level) | 512 | 512GB or 128PB |
| P4D | p4d_t | 47-39 | 512 | 512GB (folded if 4-level) |
| PUD | pud_t | 38-30 | 512 | 1GB |
| PMD | pmd_t | 29-21 | 512 | 2MB |
| PTE | pte_t | 20-12 | 512 | 4KB |
Address Translation (x86-64, 4-level)
A 48-bit virtual address breaks down as:
47 39 38 30 29 21 20 12 11 0
+----------+----------+----------+----------+----------+
| PGD idx | PUD idx | PMD idx | PTE idx | Offset |
+----------+----------+----------+----------+----------+
9 bits 9 bits 9 bits 9 bits 12 bits
Page Table Folding
Not all architectures need all five levels. Linux handles this through folding - unused levels are compile-time optimized away.
| Architecture | Levels | Notes |
|---|---|---|
| x86-64 (LA57) | 5 | 57-bit virtual addresses |
| x86-64 (standard) | 4 | 48-bit virtual addresses |
| x86-32 (PAE) | 3 | 36-bit physical addresses |
| x86-32 | 2 | Original 32-bit |
| ARM64 | 3-4 | Configurable |
When a level is folded, functions like p4d_offset() become no-ops that return the input unchanged.
Key Data Structures
Per-Process Page Tables
Each process has its own page tables via mm_struct:
struct mm_struct {
pgd_t *pgd; /* Top-level page table */
atomic_t mm_users; /* Users of this mm */
atomic_t mm_count; /* References to struct */
/* ... */
};
The kernel has a single swapper_pg_dir for kernel mappings, shared across all processes.
Page Table Entry Format (x86-64 specific)
63 62 52 51 12 11 9 8 7 6 5 4 3 2 1 0
+----+--------+-------------------------------+-----+-+-+-+-+-+-+-+-+-+
| XD | (avail)| Physical Address |avail|G|S|D|A|C|W|U|W|P|
+----+--------+-------------------------------+-----+-+-+-+-+-+-+-+-+-+
│ │ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │ └─ Present
│ │ │ │ │ │ │ │ │ └─── Writable
│ │ │ │ │ │ │ │ └───── User accessible
│ │ │ │ │ │ │ └─────── Write-through
│ │ │ │ │ │ └───────── Cache disable
│ │ │ │ │ └─────────── Accessed
│ │ │ │ └───────────── Dirty
│ │ │ └─────────────── Page size (huge)
│ │ └───────────────── Global
│ └──────────────────────────────────────────── PFN
└───────────────────────────────────────────────────────────────────── Execute disable
Page Faults
When the MMU can't translate an address, it triggers a page fault. The kernel handles this in handle_mm_fault():
Page Fault
│
v
handle_mm_fault()
│
v
__handle_mm_fault()
│
├── pgd_offset() ─> p4d_alloc()
├── p4d_offset() ─> pud_alloc()
├── pud_offset() ─> pmd_alloc()
├── pmd_offset() ─> pte_alloc()
│
v
handle_pte_fault()
│
├── do_read_fault() (file read)
├── do_cow_fault() (copy-on-write)
└── do_shared_fault() (shared mapping)
Fault Types
| Type | Cause | Resolution |
|---|---|---|
| Minor | Page in memory, PTE not set | Update PTE |
| Major | Page not in memory | Read from disk/swap |
| Invalid | Bad address or permissions | SIGSEGV |
TLB (Translation Lookaside Buffer)
The TLB caches recent translations. Without it, every memory access would require multiple page table lookups.
TLB Flush Operations
/* Flush single page */
flush_tlb_page(vma, addr);
/* Flush range */
flush_tlb_range(vma, start, end);
/* Flush entire mm */
flush_tlb_mm(mm);
TLB flushes are expensive on SMP - they require IPIs (Inter-Processor Interrupts) to all CPUs running the affected process.
Huge Pages
Higher-level entries can map large pages directly, skipping lower levels:
| Level | Page Size | Use Case |
|---|---|---|
| PUD | 1GB | Large databases, VMs |
| PMD | 2MB | General large allocations |
| PTE | 4KB | Default |
Benefits:
- Fewer TLB entries needed
- Reduced page table memory
- Fewer page faults

Trade-offs:
- Internal fragmentation
- Allocation challenges
History
Origins (1991-1994)
Early Linux on i386 used two-level page tables (PGD + PTE). The swapper_pg_dir was the top-level Page Global Directory, not a single flat table. The i386 hardware dictated this structure.
Three-Level (v2.3.23, 1999)
Added PMD for PAE (Physical Address Extension) on x86, enabling >4GB physical memory on 32-bit.
Four-Level (v2.6.11, 2005)
Added PUD for x86-64's 48-bit virtual address space. This predates git history (kernel moved to git at v2.6.12).
Five-Level (v4.14, 2017)
Commit: 77ef56e4f0fb ("x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y") | LKML
Added P4D for 57-bit virtual addresses (128PB), needed for machines with >64TB physical memory.
Try It Yourself
View Process Page Tables
# Page table stats for a process
cat /proc/<pid>/smaps | grep -E "Size|Rss|Pss"
# Detailed page mapping
cat /proc/<pid>/pagemap # Binary format
# Parse with tools such as page-types from the kernel source tree
page-types -p <pid>
Check TLB Flushes
# TLB flush statistics
cat /proc/vmstat | grep tlb
# Trace TLB flushes
echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable
cat /sys/kernel/debug/tracing/trace_pipe
Check Huge Page Usage
# System huge page stats
cat /proc/meminfo | grep -i huge
# Per-process huge pages
cat /proc/<pid>/smaps | grep -i huge
Common Issues
TLB Shootdown Storms
Heavy mprotect() or munmap() causes excessive IPIs.
Debug: Check /proc/interrupts for TLB IPI counts.
Page Table Memory Overhead
Sparse address spaces waste page table memory.
Debug: Check PageTables in /proc/meminfo.
Huge Page Allocation Failures
Can't allocate huge pages due to fragmentation.
Solutions:
- Reserve at boot: hugepages=N
- Enable THP: /sys/kernel/mm/transparent_hugepage/enabled
- Compact memory: echo 1 > /proc/sys/vm/compact_memory
Notorious bugs and edge cases
Page tables mediate all memory access. Bugs here range from privilege escalation to information disclosure to complete CPU security model breakdown.
Case 1: Meltdown (CVE-2017-5754)
What happened
In January 2018, researchers disclosed Meltdown, a hardware vulnerability affecting Intel CPUs that allowed unprivileged processes to read kernel memory.
The bug
Meltdown exploits speculative execution combined with cache timing. The CPU speculatively executes an illegal kernel read, and even though it raises a fault, the cache side effects remain, leaking the data.
The fix: KPTI
Commit: aa8c6248f8c7 ("x86/mm/pti: Add infrastructure for page table isolation")
Author: Thomas Gleixner
KPTI creates separate page tables for kernel and user mode. When entering the kernel, page tables are switched to include kernel mappings.
Case 2: Spectre (CVE-2017-5753, CVE-2017-5715)
What happened
Disclosed alongside Meltdown, Spectre exploits speculative execution to leak data from other processes or the kernel.
Variant 1 (Bounds Check Bypass)
The CPU speculatively executes past bounds checks, allowing out-of-bounds reads via cache timing.
Fix: Array index masking with array_index_nospec().
Variant 2 (Branch Target Injection)
Attackers poison the branch predictor to redirect speculative execution to chosen gadgets.
Fix: Retpoline - replaces indirect calls with a sequence that traps speculative execution.
Commit: 76b043848fd2 ("x86/retpoline: Add initial retpoline support")
Case 3: TLB flush races
What happened
The TLB caches page table entries. When page tables change, the TLB must be flushed. Races between changes and flushes cause security vulnerabilities.
The bug pattern
A CPU can use stale TLB entries to access pages that should be unmapped:
1. Page table entry cleared
2. TLB flush pending (not yet complete)
3. Another CPU accesses via stale TLB entry
4. Access succeeds to freed/reallocated page
Example: CVE-2018-18281
The mremap() syscall moved page table entries but flushed TLBs too late.
Commit: eb66ae030829 ("mremap: properly flush TLB before releasing the page")
Case 4: KASLR bypasses
What is KASLR?
Kernel Address Space Layout Randomization randomizes where the kernel is loaded, making exploits harder.
Bypass techniques
| Technique | Method |
|---|---|
| Info leak | Kernel pointer exposed to userspace |
| Timing side channel | Measure access times |
| dmesg leak | Kernel addresses in logs |
Hardening
# Restrict kernel pointers
echo 2 > /proc/sys/kernel/kptr_restrict
# Restrict dmesg
echo 1 > /proc/sys/kernel/dmesg_restrict
Summary: Lessons learned
| Bug | Year | Type | Mitigation |
|---|---|---|---|
| Meltdown | 2017 | Hardware | KPTI |
| Spectre v1 | 2017 | Hardware | nospec barriers |
| Spectre v2 | 2017 | Hardware | Retpoline |
| TLB races | Ongoing | Software | Proper flush ordering |
| KASLR bypass | Ongoing | Info leak | Pointer restrictions |
The pattern: Page table bugs often involve timing (TLB races, speculation) or assumptions about hardware behavior that turn out to be wrong.
References
Key Code
| File | Description |
|---|---|
| include/linux/pgtable.h | Generic page table API |
| arch/x86/include/asm/pgtable.h | x86 page table definitions |
| mm/memory.c | Page fault handling |
Related
- overview - Memory management overview
- glossary - PFN, TLB, MMU definitions
- Bug Index - Index of all mm kernel bugs
Further reading
- arch/x86/mm/pgtable.c — x86-specific page table allocation, PTE manipulation helpers, and KPTI page table switching
- include/linux/pgtable.h — architecture-independent page table API: pgd_offset(), pte_present(), pte_mkwrite(), and the full set of PTE flag accessors
- mm/memory.c — page fault entry point handle_mm_fault(), COW handling, and page table teardown on munmap()
- Documentation/admin-guide/mm/concepts.rst — kernel documentation overview of virtual memory, page tables, and the TLB
- Five-level page tables (LWN, 2017) — design and rationale for adding the P4D level to support 57-bit virtual address spaces
- KPTI: kernel page-table isolation (LWN, 2017) — how the Meltdown mitigation works by maintaining separate page tables for kernel and user mode
- page-fault — the complete page fault handling path from hardware trap through handle_pte_fault() to memory resolution
- tlb-optimization — TLB batching, lazy flushing, and how the kernel minimises IPI-heavy TLB shootdowns