
What happens when you fork

The memory mechanics of process creation

fork() in 30 seconds

When a process calls fork(), the kernel creates a child process that's almost an exact copy of the parent. But copying gigabytes of memory would be slow and wasteful. Instead, Linux uses copy-on-write (COW): parent and child share the same physical pages until one of them writes.

flowchart TB
    subgraph before["Parent Process (before fork)"]
        VAS["Virtual Address Space<br/>[stack] [heap] [data] [code] [libs]"]
        PP["Physical pages (owned exclusively)"]
        VAS --> PP
    end

    subgraph after["After fork()"]
        subgraph Parent1["Parent"]
            PS1["[stack]"]
        end
        subgraph Child1["Child"]
            CS1["[stack]"]
        end
        SP["Shared physical pages (read-only)"]
        PS1 --> SP
        CS1 --> SP
    end

    subgraph cow["After child writes to stack"]
        subgraph Parent2["Parent"]
            PS2["[stack]"]
            OP["[Original]"]
            PS2 --> OP
        end
        subgraph Child2["Child"]
            CS2["[stack]"]
            CP["[Copy] (new page)"]
            CS2 --> CP
        end
    end

    before --> after
    after --> cow

This makes fork() fast - O(page tables) rather than O(memory used).

Before fork: The parent's memory

A typical process has multiple memory regions, each backed differently:

flowchart TB
    subgraph addr["Parent Process Address Space"]
        direction TB
        HIGH["High addresses"]
        STACK["[Stack]<br/>Anonymous, private, grows down"]
        MMAP["[Memory-mapped files]<br/>File-backed, may be shared"]
        LIBS["[Shared libraries]<br/>File-backed, shared across processes<br/>(libc.so, etc.)"]
        HEAP["[Heap]<br/>Anonymous, private, grows up"]
        BSS["[BSS]<br/>Anonymous (zero-initialized global variables)"]
        DATA["[Data]<br/>File-backed (initialized global variables)"]
        TEXT["[Text/Code]<br/>File-backed, read-only, executable"]
        LOW["Low addresses"]

        HIGH --- STACK
        STACK --- MMAP
        MMAP --- LIBS
        LIBS --- HEAP
        HEAP --- BSS
        BSS --- DATA
        DATA --- TEXT
        TEXT --- LOW
    end

Each region is represented by a VMA (Virtual Memory Area). The kernel must handle each VMA appropriately during fork.

The fork system call

Fork is implemented via the clone() system call internally:

// User calls fork()
pid_t child = fork();

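// On Linux, glibc implements fork() with the clone system call
// (raw argument order varies by architecture; x86-64 shown):
clone(CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, 0, NULL, child_tidptr, 0);

// The kernel also keeps a dedicated fork entry point in kernel/fork.c;
// both paths end up in kernel_clone():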
SYSCALL_DEFINE0(fork)
{
    struct kernel_clone_args args = {
        .exit_signal = SIGCHLD,
    };
    return kernel_clone(&args);
}

What gets copied

Fork creates copies of many process structures:

| Structure | What Happens |
|---|---|
| task_struct | New copy for child |
| mm_struct | New copy (address space descriptor) |
| VMAs | Duplicated (but reference same pages) |
| Page tables | Duplicated (pointing to same physical pages) |
| Physical pages | Shared (COW setup) |
| File descriptors | Duplicated (reference same files) |
| Signal handlers | Copied |

The critical insight: page tables are copied, physical pages are not.

Inside dup_mm(): Copying the address space

The memory copying happens in dup_mm():

// kernel/fork.c (simplified)
static struct mm_struct *dup_mm(struct task_struct *tsk,
                                struct mm_struct *oldmm)
{
    struct mm_struct *mm;

    // 1. Allocate new mm_struct
    mm = allocate_mm();

    // 2. Copy mm_struct fields
    memcpy(mm, oldmm, sizeof(*mm));

    // 3. Allocate new page tables (PGD)
    mm->pgd = pgd_alloc(mm);

    // 4. Copy VMAs and page tables
    dup_mmap(mm, oldmm);

    return mm;
}

VMA duplication

Each VMA is duplicated, but the handling depends on its type:

// kernel/fork.c dup_mmap() (simplified)
static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
{
    struct vm_area_struct *mpnt, *tmp;

    for_each_vma(oldmm, mpnt) {
        // Skip regions marked DONTFORK
        if (mpnt->vm_flags & VM_DONTFORK)
            continue;

        // Create new VMA for child
        tmp = vm_area_dup(mpnt);

        // VM_WIPEONFORK: the child starts with an empty, zero-fill mapping
        // (security feature), so no page tables are copied for this VMA
        if (tmp->vm_flags & VM_WIPEONFORK)
            tmp->anon_vma = NULL;
        else
            copy_page_range(mm, oldmm, tmp, mpnt);  // Copy page tables for this VMA

        // Add VMA to child's address space
        insert_vm_struct(mm, tmp);
    }

    return 0;
}

Page table copying and COW setup

The key function is copy_page_range(), which walks the parent's page tables and creates corresponding entries in the child:

// mm/memory.c (simplified)
int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                    struct vm_area_struct *dst_vma,
                    struct vm_area_struct *src_vma)
{
    // Walk page table levels: PGD → PUD → PMD → PTE
    // For each valid PTE...

    // If it's a private writable mapping:
    if (is_cow_mapping(src_vma->vm_flags)) {
        // Make BOTH parent and child PTEs read-only
        ptep_set_wrprotect(src_mm, addr, src_pte);
        pte = pte_wrprotect(pte);
    }

    // Copy the PTE to child (pointing to same physical page)
    set_pte_at(dst_mm, addr, dst_pte, pte);

    // Increment page reference count
    page_dup_rmap(page);
}

Critical detail: Both parent AND child are made read-only for private writable mappings. This ensures the first writer (whichever process) triggers the COW fault.

The COW setup in detail

flowchart TB
    subgraph before["Before fork"]
        PPTE1["Parent PTE<br/>[RW, present]<br/>refcount=1"]
        PAGE1["Physical Page"]
        PPTE1 --> PAGE1
    end

    subgraph after["After fork (COW setup)"]
        PPTE2["Parent PTE<br/>[RO, present]"]
        PAGE2["Physical Page<br/>refcount=2"]
        CPTE["Child PTE<br/>[RO, present]"]
        PPTE2 --> PAGE2
        CPTE --> PAGE2
    end

    before --> after

Both PTEs point to the same physical page. Both are marked read-only. The page's reference count increments to 2.

COW fault handling

When either process writes to a COW page, a page fault occurs:

// User code (in parent or child)
buffer[0] = 'x';  // Write to shared page

// CPU: "PTE says read-only, but this is a write!"
// → Page fault → Kernel handles it

The kernel handles this in do_wp_page():

// mm/memory.c (simplified)
static vm_fault_t do_wp_page(struct vm_fault *vmf)
{
    struct folio *old_folio = page_folio(vmf->page);

    // Check if we're the only user of this page
    if (wp_can_reuse_anon_folio(old_folio, vmf->vma)) {
        // We're the only mapper - just make it writable
        wp_page_reuse(vmf, old_folio);
        return 0;
    }

    // Multiple mappers - need to copy
    return wp_page_copy(vmf);
}

static vm_fault_t wp_page_copy(struct vm_fault *vmf)
{
    struct vm_area_struct *vma = vmf->vma;
    struct folio *old_folio = page_folio(vmf->page);
    struct folio *new_folio;
    pte_t entry;

    // 1. Allocate new page
    new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, ...);

    // 2. Copy content (the actual "copy" in copy-on-write)
    copy_user_highpage(&new_folio->page, &old_folio->page, ...);

    // 3. Build a dirty, writable PTE for the new page and install it
    entry = mk_pte(&new_folio->page, vma->vm_page_prot);
    entry = maybe_mkwrite(pte_mkdirty(entry), vma);
    set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);

    // 4. Drop this mapping's reference to the old page
    folio_put(old_folio);

    return 0;
}

After COW fault

flowchart LR
    subgraph parent["Parent (still read-only)"]
        PPTE["Parent PTE<br/>[RO, present]"]
        ORIG["Original Page<br/>refcount=1"]
        PPTE --> ORIG
    end

    subgraph child["Child (has its own copy)"]
        CPTE["Child PTE<br/>[RW, present]"]
        COPY["New Page (copy)<br/>refcount=1"]
        CPTE --> COPY
    end

Important: The parent's PTE stays read-only even though it's now the exclusive owner. If the parent writes later, do_wp_page() sees refcount=1, calls wp_page_reuse(), and simply makes the PTE writable - no copy needed. This lazy approach avoids modifying PTEs that might never be written to.

TLB considerations

After modifying page tables, stale TLB entries must be invalidated.

During fork

When making PTEs read-only for COW:

// The parent's PTE changes from RW to RO
// TLB might have cached the old RW entry

ptep_set_wrprotect(src_mm, addr, src_pte);
// The PTE update itself does not flush anything. The flush is batched:
// dup_mmap() calls flush_tlb_mm(oldmm) once all VMAs have been copied

On x86, modifying a PTE from writable to read-only requires flushing the TLB entry. Otherwise, the CPU might use a cached writable entry and skip the COW fault.

During COW fault

After updating the faulting process's PTE:

// Old entry (RO) → New entry (RW, different physical page)
// Flush the old TLB entry

flush_tlb_page(vma, address);

Multi-threaded processes

If the forking process is multi-threaded, all threads share the same mm_struct. During fork:

  • The forking thread holds mmap_lock for write
  • Other threads are prevented from faulting during the page table copy
  • TLB shootdowns may be needed across CPUs running sibling threads

This is one reason fork() in heavily multi-threaded processes can be slow.

Memory accounting after fork

After fork, memory metrics can be confusing:

# Parent and child both show:
ps -o pid,vsz,rss,comm -p $PARENT_PID,$CHILD_PID

# VSZ: Same (same virtual address space layout)
# RSS: Initially same, diverges as COW happens

RSS (Resident Set Size)

RSS counts physical pages mapped to a process. After fork:

  • Both processes count the shared pages in their RSS
  • The sum of parent + child RSS > actual physical memory used
  • This is expected! RSS is per-process, not system-wide

flowchart LR
    subgraph reality["Reality after fork"]
        PHYS["Physical memory: 100MB (shared)"]
        subgraph reported["Reported (misleading)"]
            PRSS["Parent RSS: 100MB"]
            CRSS["Child RSS: 100MB"]
            TRSS["Total RSS: 200MB"]
        end
        ACTUAL["Actual physical: 100MB"]
    end

PSS (Proportional Set Size)

PSS divides shared pages among sharing processes:

flowchart LR
    subgraph pss["With PSS"]
        PHYS["Physical memory: 100MB<br/>(shared between 2 processes)"]
        subgraph accurate["Accurate accounting"]
            PPSS["Parent PSS: 50MB"]
            CPSS["Child PSS: 50MB"]
            TPSS["Total PSS: 100MB"]
        end
    end

# View PSS
grep Pss /proc/$PID/smaps_rollup

Shared vs Private

The smaps file distinguishes:

cat /proc/$PID/smaps | grep -E "^(Shared|Private)"

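# Shared_Clean:  Shared, unmodified pages (e.g. library text)
# Shared_Dirty:  Shared, modified pages (e.g. heap written before fork, still COW-shared)
# Private_Clean: Private pages, not written
# Private_Dirty: Private pages, written (COW completed)

After fork, most pages are shared: pages the parent had already written (heap, stack) typically show up as Shared_Dirty, untouched file-backed pages as Shared_Clean. As COW happens, they become Private_Dirty in the writing process.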

exec() and COW cleanup

The common pattern fork() + exec() is highly optimized:

pid_t pid = fork();
if (pid == 0) {
    // Child process
    execl("/bin/ls", "ls", (char *)NULL);  // Replace address space entirely
    _exit(127);                            // Only reached if exec failed
}

When exec() runs:

  1. Entire address space is discarded
  2. COW pages are never copied (refcount decremented instead)
  3. New program's pages are mapped fresh

flowchart LR
    FORK["fork()"]
    COW["Shared COW pages"]
    EXEC["exec()<br/>Discards all!"]
    RESULT["No copies made!<br/>Just refcount up, then down."]

    FORK --> COW --> EXEC
    COW --> RESULT

This is why shells (which fork+exec constantly) are efficient despite creating many child processes.

vfork() optimization

vfork() goes further - it doesn't even copy page tables:

pid_t pid = vfork();
if (pid == 0) {
    // Child SHARES parent's address space entirely
    // Parent is BLOCKED until child exits or execs
    execl("/bin/ls", "ls", (char *)NULL);
    _exit(127);  // exec failed: leave without touching parent state
}

Warning: With vfork(), any memory modification in the child corrupts the parent. Only use if you exec() or _exit() immediately.

Special cases

VM_DONTFORK

Some VMAs shouldn't be inherited:

// After mmap, mark the region as don't-fork
void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
madvise(buf, size, MADV_DONTFORK);

The child simply won't have this mapping. Use for DMA buffers or security-sensitive regions.

VM_WIPEONFORK

The mapping exists in the child but content is zeroed:

madvise(addr, len, MADV_WIPEONFORK);

Added in kernel 4.14 (commit d2cd9ede6e19 | LKML) for security-sensitive data (encryption keys, random state) that shouldn't leak to children.

Huge pages and fork

Transparent Huge Pages (THP) have special COW handling:

  • Old behavior: Copy entire 2MB on first write (expensive!)
  • Modern behavior (v4.5+): Split THP into 4KB pages, COW only the written page

See the THP documentation and COW explainer for details.

Performance considerations

Fork latency

Fork time scales with:

  1. Number of VMAs: Each VMA is duplicated
  2. Page table size: All page table levels are copied
  3. TLB flush overhead: Especially on multi-CPU processes

For a process with 1GB of mapped memory (256K pages), fork must walk and copy page table entries for all of them.

Reducing fork overhead

  1. Use posix_spawn() when possible: Combines fork+exec more efficiently
  2. Avoid huge address spaces before fork: Unmapping reduces page table copying
  3. Consider vfork() for fork+exec: No page table copy, but dangerous

Measuring fork impact

# Count clone() calls (fork() shows up as clone)
perf stat -e 'syscalls:sys_enter_clone' ./your_program

# Trace fork latency
perf trace -e clone ./your_program

# Watch COW faults after fork
perf stat -e page-faults ./your_program

Try it yourself

Watch COW in action

cat > /tmp/cow_demo.c << 'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Print RSS (in kB) for one or two PIDs using ps; pass b = 0 for one PID */
static void show_rss(pid_t a, pid_t b) {
    char cmd[64];
    if (b > 0)
        snprintf(cmd, sizeof(cmd), "ps -o pid,rss -p %d,%d", (int)a, (int)b);
    else
        snprintf(cmd, sizeof(cmd), "ps -o pid,rss -p %d", (int)a);
    fflush(stdout);  // Keep our own output ahead of what ps prints
    system(cmd);
}

int main(void) {
    // Allocate and touch memory
    size_t size = 100 * 1024 * 1024;  // 100MB
    char *buf = malloc(size);
    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    memset(buf, 'A', size);  // Fault in all pages

    printf("Before fork - check RSS:\n");
    printf("  Parent PID: %d\n", getpid());
    show_rss(getpid(), 0);

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }

    if (pid == 0) {
        // Child
        printf("\nAfter fork - child PID: %d\n", getpid());
        show_rss(getppid(), getpid());

        printf("\nChild writing to all pages (triggering COW)...\n");
        memset(buf, 'B', size);

        printf("\nAfter COW - check both processes:\n");
        show_rss(getppid(), getpid());

        exit(0);
    }

    // Parent: stay alive so the pages remain shared until the child is done
    wait(NULL);
    free(buf);
    return 0;
}
EOF
gcc -o /tmp/cow_demo /tmp/cow_demo.c
/tmp/cow_demo

Monitor page sharing

# View shared vs private memory
cat /proc/$PID/smaps_rollup

# Before fork (in parent):
# Shared_Clean + Shared_Dirty ≈ 0 (no sharing)
# Private_Clean + Private_Dirty ≈ RSS

# After fork (before COW):
# Shared_Clean + Shared_Dirty ≈ most memory (COW shared; pages written before fork count as dirty)
# Private_* ≈ small

# After COW:
# Private_Dirty increases (copies made)

Trace fork system call

# Trace fork with timing
strace -f -T -e clone,execve ./your_program

# -f: follow forks
# -T: show time spent in syscall

Count page table pages

# Page table memory is in /proc/PID/status
grep -E "VmPTE|VmPMD" /proc/$PID/status

# VmPTE: PTE-level page table memory (newer kernels report all page table levels here)
# VmPMD: PMD-level page table memory (only reported by older kernels)

Key source files

| File | What It Does |
|---|---|
| kernel/fork.c | fork/clone implementation, dup_mm() |
| mm/memory.c | copy_page_range(), COW fault handling |
| mm/mmap.c | VMA duplication |
| include/linux/mm_types.h | mm_struct, vm_area_struct |

History

Early COW

Copy-on-write in Unix predates Linux. Linux inherited the technique from Unix tradition when Linus Torvalds wrote the initial implementation in 1991.

Per-VMA locks (v6.4, 2023)

Commit: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it") | LKML

Author: Suren Baghdasaryan

COW page faults can now be handled with per-VMA locks instead of the global mmap_lock, significantly improving scalability for multi-threaded processes after fork.

COW under VMA lock (v6.4+, 2023)

Commit: 164b06f238b9 ("mm: call wp_page_copy() under the VMA lock") | LKML

Author: Matthew Wilcox (Oracle)

Part of ongoing work to handle COW faults without taking mmap_lock, improving concurrent page fault handling.

Notorious bugs and edge cases

Fork and COW involve subtle race conditions between memory mapping, page table manipulation, and signal handling. These bugs have led to some of the most impactful Linux kernel vulnerabilities.


Case 1: Dirty COW (CVE-2016-5195)

What happened

In October 2016, Phil Oester discovered a race condition in the kernel's COW handling that had existed since kernel 2.6.22 (July 2007) - nine years undetected.

The vulnerability allowed an unprivileged local user to gain write access to read-only memory mappings, enabling privilege escalation to root.

The bug

The official Dirty COW page explains the race condition occurs between two operations:

  1. Thread A: Writing to a COW memory mapping (triggers COW fault)
  2. Thread B: Calling madvise(MADV_DONTNEED) to discard the page

When these race repeatedly, the kernel can be confused into writing data directly to the original read-only page instead of creating a private copy first.

sequenceDiagram
    participant T1 as Thread 1 (write)
    participant K as Kernel
    participant T2 as Thread 2 (madvise)

    T1->>K: Write to COW page
    K->>K: Start COW fault handling
    T2->>K: madvise(MADV_DONTNEED)
    K->>K: Discard the COW copy
    K->>K: Race! Write goes to original page
    Note over K: Read-only page now modified!

Real-world implications

The vulnerability was actively exploited in the wild. Attack scenarios included:

  • Overwriting /etc/passwd to gain root access
  • Modifying setuid binaries
  • Rooting Android devices before version 7 (Nougat)

Red Hat's analysis: "Although it is a local privilege escalation, remote attackers can use it in conjunction with other exploits that allow remote execution of non-privileged code to achieve remote root access."

The fix

Commit: 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()") | LKML

Author: Linus Torvalds

Linus wrote: "This is an ancient bug that was actually attempted to be fixed once (badly) by me eleven years ago... It was then undone due to problems on s390."

The fix removed the FOLL_WRITE flag manipulation that created the race window.

Why it went undetected

The bug was subtle:

  • Required precise timing between two threads
  • Only affected private mappings of read-only files
  • Normal testing wouldn't trigger the race

Legacy

Dirty COW spawned research into similar races. The name comes from "Dirty" (the write operation) + "COW" (Copy-On-Write). It earned its own logo, website, and became a textbook example of race condition vulnerabilities.


Case 2: StackRot (CVE-2023-3269)

What happened

In June 2023, Ruihan Li (Peking University) discovered a use-after-free vulnerability in the maple tree - the new data structure that replaced red-black trees for VMA management in kernel 6.1.

The bug

The StackRot disclosure explains:

When the stack grows (via MAP_GROWSDOWN), the kernel adjusts the VMA's start address. This adjustment modifies the maple tree without holding the MM write lock.

Because maple trees are RCU-safe, node modifications create new nodes and schedule old ones for deletion via RCU callbacks. But VMA access only requires the MM read lock - it doesn't enter RCU critical sections.

flowchart TD
    A["Stack grows<br/>(page fault below stack)"] --> B["Kernel adjusts VMA start"]
    B --> C["Maple tree node replaced"]
    C --> D["Old node scheduled for RCU free"]
    D --> E["Other thread holds MM read lock<br/>Still has pointer to old node"]
    E --> F["RCU callback frees old node"]
    F --> G["USE-AFTER-FREE!"]

Real-world implications

From the oss-security disclosure:

"An unprivileged local user could use this flaw to compromise the kernel and escalate their privileges."

The vulnerability affected kernels 6.1 through 6.4 - any system using maple trees for VMA management.

The fix

Linus merged the fix on June 28, 2023 during the 6.5 merge window, with backports to 6.1.37, 6.3.11, and 6.4.1.

Linus provided a detailed merge message explaining the technical approach: ensuring proper locking when modifying maple tree nodes that might be accessed under RCU.

Why this matters for fork

Fork duplicates VMAs and their maple tree representation. The maple tree's complexity (compared to the simpler red-black tree it replaced) introduced this subtle race. It's a reminder that data structure changes in MM code require extremely careful concurrency analysis.


Case 3: The mremap() disaster (CVE-2004-0077)

What happened

In February 2004, Paul Starzetz discovered a critical vulnerability in mremap() - the system call that moves or resizes memory mappings.

The bug

From the ISEC advisory:

"A critical security vulnerability was found in the Linux kernel memory management code in the mremap(2) system call due to incorrect bound checks."

The kernel failed to properly validate arguments to mremap(). By crafting specific arguments, an attacker could:

  1. Overflow page table counters
  2. Hijack another process's virtual memory segment
  3. If that segment ran with elevated privileges, gain those privileges

Real-world implications

Any local user could escalate to root. The vulnerability was trivially exploitable with public proof-of-concept code.

From the advisory: "No special privileges are required to use the mremap(2) system call, thus any process may misuse its unexpected behavior to disrupt the kernel memory management subsystem."

The fix

Multiple patches were required across kernel versions 2.4.x and 2.6.x. The fix added proper bounds checking in the mremap path.

Legacy

This bug highlighted the danger of complex MM system calls. It led to increased scrutiny of mmap/mremap/mprotect code paths and inspired research into MM syscall fuzzing.


Case 4: TLB flush races in mremap (CVE-2018-18281)

What happened

In 2018, Jann Horn (Google Project Zero) found a TLB (Translation Lookaside Buffer) race in mremap().

The bug

From the oss-security disclosure by Jann Horn:

When mremap() moves a mapping, it must:

  1. Update page tables
  2. Flush TLBs on all CPUs that might have cached the old translation

If the TLB flush is incomplete or races with concurrent access, a CPU might use stale translations, accessing the wrong physical page.

Real-world implications

An attacker could potentially read or write memory belonging to other processes, bypassing isolation.

The fix

Commit: eb66ae030829 ("mremap: properly flush TLB before releasing the page")

Author: Linus Torvalds

The fix ensures TLB flushes happen before releasing both source and destination page table locks, preventing stale TLB entries from accessing reallocated pages.


Case 5: fork() and userfaultfd races

The pattern

userfaultfd allows userspace to handle page faults. This is extremely useful for:

  • Live migration of VMs
  • Garbage collectors
  • Custom memory management

But it's also a powerful primitive for exploiting race conditions. By handling page faults in userspace, an attacker can pause kernel execution at precise moments.

Exploitation technique

From CVE-2019-18683 analysis:

  1. Attacker maps memory with userfaultfd
  2. Triggers a kernel code path that accesses that memory
  3. Kernel blocks waiting for userspace to handle the fault
  4. Attacker manipulates other kernel state while kernel is paused
  5. Attacker releases the fault
  6. Kernel continues with corrupted state

This technique has been used in many kernel exploits, including Dirty COW variants.

Mitigation

Since kernel 5.2, /proc/sys/vm/unprivileged_userfaultfd can restrict userfaultfd to privileged users:

# Disable unprivileged userfaultfd (reduces kernel attack surface)
echo 0 > /proc/sys/vm/unprivileged_userfaultfd

Many distributions now default to 0.


Summary: Lessons learned

| Bug | Year | Root Cause | Impact | Prevention |
|---|---|---|---|---|
| Dirty COW | 2016 | Race in COW + madvise | Root access | Kernel update, lived 9 years undetected |
| StackRot | 2023 | RCU vs MM lock mismatch | Root access | Careful data structure concurrency review |
| mremap bounds | 2004 | Missing bounds checks | Root access | Syscall argument validation |
| TLB flush race | 2018 | Incomplete TLB flush | Memory disclosure | Proper flush ordering |
| userfaultfd races | Ongoing | Kernel pause primitive | Various | Restrict unprivileged userfaultfd |

The common thread: fork and COW involve complex state shared between processes. Any race condition in this code has severe security implications because it can break process isolation.

Further reading

External resources