Skip to content

Life of a file read

Tracing a read() syscall through the page cache

What happens when you read a file?

When a program reads a file, the data rarely comes directly from disk. The kernel maintains a page cache - a cache of file data in memory. Most reads are satisfied from this cache without touching the disk at all.

flowchart TD
    A["read(fd, buf, 4096)"] --> B

    B["<b>VFS Layer</b><br/>- Find file from fd<br/>- Check permissions<br/>- Call file system's read method"]
    B --> C

    C["<b>Page Cache Lookup</b><br/>- Search file's address_space for page at offset<br/>- Using xarray (radix tree)"]
    C --> D{Cache Result}

    D -->|Cache HIT| E["Return cached<br/>data immediately<br/>(no disk I/O!)"]
    D -->|Cache MISS| F["Allocate page<br/>Submit I/O to disk<br/>Wait for completion<br/>Add page to cache<br/>Return data"]

    E --> G
    F --> G

    G["<b>Copy to User Buffer</b><br/>copy_to_user(buf, page_address + offset, count)"]
    G --> H["read() returns bytes read"]

The key insight: the page cache makes most reads fast. Once the kernel detects a sequential pattern, it prefetches pages asynchronously so that most subsequent reads are served from cache rather than disk. In steady state, readahead can achieve cache hit rates above 90% for sequential workloads.

The page cache

The page cache is the kernel's cache of file contents. Every file read goes through it.

Why cache file data?

Operation Latency
Memory access ~100 nanoseconds
SSD read ~100 microseconds (1000x slower)
HDD read ~10 milliseconds (100,000x slower)

(Order-of-magnitude figures from Latency Numbers Every Programmer Should Know, originally Jeff Dean via Peter Norvig.)

Even with fast SSDs, memory is orders of magnitude faster. The page cache exploits temporal locality - data read once is likely to be read again soon.

Page cache organization

Each open file has an address_space that manages its cached pages:

// From include/linux/fs.h
struct address_space {
    struct inode *host;           // Owning inode
    struct xarray i_pages;        // Cached pages (xarray/radix tree)
    atomic_t i_mmap_writable;     // Writable mappings
    struct rb_root_cached i_mmap; // Tree of private/shared mappings
    unsigned long nrpages;        // Number of cached pages
    const struct address_space_operations *a_ops;  // Operations
    // ...
};

Pages are indexed by file offset:

flowchart LR
    subgraph file["File (offset in pages)"]
        direction LR
        P0["[0]"]
        P1["[1]"]
        P2["[2]"]
        P3["[3]"]
        P4["[4]"]
        P5["[5]"]
        P6["[6]"]
        P7["..."]
    end

    subgraph xarray["address_space xarray"]
        direction LR
        O0["offset 0"] --> PG0["page_0"]
        O1["offset 1"] --> PG1["page_1"]
        O2["offset 2"] --> EMPTY["(empty - not cached)"]
        O3["offset 3"] --> PG3["page_3"]
        O4["..."]
    end

Stage 1: System call entry

A user program calls read():

// User space
ssize_t n = read(fd, buffer, 4096);

// Kernel entry: fs/read_write.c
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    struct fd f = fdget_pos(fd);

    // ... permission checks ...

    ret = vfs_read(f.file, buf, count, &pos);

    fdput_pos(f);
    return ret;
}

From fd to file

The kernel translates the file descriptor to a struct file:

flowchart LR
    subgraph fdtable["Process File Descriptor Table"]
        direction LR
        FD0["fd 0"] --> STDIN["stdin"]
        FD1["fd 1"] --> STDOUT["stdout"]
        FD2["fd 2"] --> STDERR["stderr"]
        FD3["fd 3"] --> FILE["/home/user/data.txt<br/>(Our file)"]
        FD4["..."]
    end
struct file {
    struct path f_path;           // Dentry and mount
    struct inode *f_inode;        // Inode (file metadata)
    const struct file_operations *f_op;  // read/write operations
    loff_t f_pos;                 // Current file position
    struct address_space *f_mapping;  // Page cache
    // ...
};

Stage 2: VFS read dispatch

vfs_read() calls the file system's read implementation:

// fs/read_write.c
ssize_t vfs_read(struct file *file, char __user *buf,
                 size_t count, loff_t *pos)
{
    // Check we can read
    if (!(file->f_mode & FMODE_READ))
        return -EBADF;

    // Check buffer is valid user memory
    if (!access_ok(buf, count))
        return -EFAULT;

    // Call file system's read method
    if (file->f_op->read)
        ret = file->f_op->read(file, buf, count, pos);
    else if (file->f_op->read_iter)
        ret = new_sync_read(file, buf, count, pos);

    return ret;
}

Most file systems use the generic read_iter path, which goes through generic_file_read_iter():

// mm/filemap.c
ssize_t generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
{
    if (iocb->ki_flags & IOCB_DIRECT)
        return mapping_direct_IO(...);  // Bypass cache

    return filemap_read(iocb, iter, 0);  // Normal path
}

Stage 3: Page cache lookup

The core of file reading is filemap_read():

// mm/filemap.c (simplified)
ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
                     ssize_t already_read)
{
    struct file *file = iocb->ki_filp;
    struct address_space *mapping = file->f_mapping;
    pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;  // Page index from offset

    for (;;) {
        // 1. Try to find page in cache
        folio = filemap_get_folio(mapping, index);

        if (IS_ERR(folio)) {
            // Cache miss - need to read from disk
            folio = page_cache_sync_readahead(...);
        }

        // 2. Wait for page to be up-to-date
        folio_wait_locked(folio);
        if (!folio_test_uptodate(folio))
            goto error;

        // 3. Copy data to user buffer
        copied = copy_folio_to_iter(folio, offset, bytes, iter);

        folio_put(folio);
        index++;
    }
}

The xarray lookup

Finding a page in the cache uses the xarray (evolved from radix tree):

// Finding page at index
struct folio *filemap_get_folio(struct address_space *mapping,
                                 pgoff_t index)
{
    return __filemap_get_folio(mapping, index, FGP_ENTRY, 0);
}

struct folio *__filemap_get_folio(struct address_space *mapping,
                                   pgoff_t index, fgf_t fgp_flags,
                                   gfp_t gfp)
{
    struct folio *folio;

    // Lock-free lookup in xarray
    folio = xa_load(&mapping->i_pages, index);

    if (!folio || xa_is_value(folio))
        return ERR_PTR(-ENOENT);  // Not in cache

    // Found - increment reference
    folio_get(folio);
    return folio;
}

The lookup is effectively O(1) for practical file sizes - xarray is a radix tree with bounded height, not a balanced tree, so performance doesn't degrade with more pages.

Stage 4: Cache hit (fast path)

When the page is in cache and up-to-date, reading is fast (order-of-magnitude estimates; actual latency depends on CPU and workload):

flowchart TD
    subgraph cachehit["Cache Hit"]
        direction TB
        S1["1. xa_load() finds page in xarray<br/>~50-100ns"]
        S1 --> S2["2. Check page is uptodate (PG_uptodate set)<br/>~10ns"]
        S2 --> S3["3. copy_to_user() copies data<br/>~1-10us<br/>(depends on size, may need multiple pages)"]
        S3 --> S4["<b>Total: ~1-10 microseconds for typical read</b>"]
    end

No disk I/O, no waiting, just memory copies.

The copy to userspace

// Copy from kernel page to user buffer
size_t copy_folio_to_iter(struct folio *folio, size_t offset,
                          size_t bytes, struct iov_iter *i)
{
    // Map the page temporarily
    // Copy bytes to user buffer
    // Handle partial pages (offset within page)
}

This is the primary cost of cached reads - copying data from kernel memory to user memory.

Stage 5: Cache miss (slow path)

When the page isn't in cache, the kernel must read from disk:

// Page not in cache - allocate and read
static struct folio *page_cache_sync_readahead(...)
{
    // 1. Allocate a new folio
    folio = filemap_alloc_folio(gfp, 0);

    // 2. Add to page cache
    error = filemap_add_folio(mapping, folio, index, gfp);

    // 3. Submit I/O to read from disk
    folio_mark_uptodate(folio);  // After I/O completes

    return folio;
}

I/O submission

The actual disk read goes through the file system and block layer:

flowchart TD
    subgraph cachemiss["Cache Miss Path"]
        direction TB
        A["filemap_read()"] --> B["page_cache_sync_readahead()"]
        B --> C["Address space operations<br/>(a_ops->readahead)"]
        C --> D["File system:<br/>ext4_readahead() / xfs_vm_readahead() / etc."]
        D --> E["Block layer:<br/>submit_bio()"]
        E --> F["Device driver:<br/>NVMe / SATA / etc."]
        F --> G["Physical disk I/O"]
    end

The process blocks until I/O completes (unless using async I/O).

Waiting for I/O

// Block until page is ready
folio_wait_locked(folio);

// Check it completed successfully
if (!folio_test_uptodate(folio)) {
    folio_put(folio);
    return -EIO;
}

Stage 6: Readahead

The kernel doesn't just read the requested page - it reads ahead, predicting you'll want subsequent pages:

flowchart TD
    subgraph request["Read request for page 5"]
        direction TB
        subgraph without["Without readahead"]
            direction LR
            W1["Read: [5]"]
            W2["Cache: [5]"]
        end
        subgraph with["With readahead (window = 8)"]
            direction LR
            R1["Read: [5][6][7][8][9][10][11][12]"]
            R2["Cache: [5][6][7][8][9][10][11][12]"]
        end
        subgraph next["Next read for page 6"]
            N1["Cache hit! No disk I/O needed."]
        end
    end

Readahead algorithm

The kernel uses an adaptive readahead algorithm:

// mm/readahead.c (simplified)
void page_cache_sync_readahead(struct address_space *mapping,
                                struct file_ra_state *ra,
                                struct file *file,
                                pgoff_t index,
                                unsigned long req_count)
{
    // Check if sequential access pattern
    if (index == ra->prev_pos + 1) {
        // Sequential - expand readahead window
        ra->size = min(ra->size * 2, ra->ra_pages);
    } else {
        // Random - shrink window
        ra->size = 4;  // Start small
    }

    // Submit readahead I/O
    do_page_cache_ra(ractl, ra->size, ra->async_size);
}

Readahead state

Each file has readahead state tracking its access pattern:

struct file_ra_state {
    pgoff_t start;          // Where readahead started
    unsigned int size;      // Current window size
    unsigned int async_size;// Async portion (trigger next readahead)
    unsigned int ra_pages;  // Maximum window
    unsigned int mmap_miss; // Cache miss in mmap
    loff_t prev_pos;        // Previous read position
};

Tuning readahead

# View default readahead size (in 512-byte sectors)
cat /sys/block/sda/queue/read_ahead_kb

# Adjust (e.g., for sequential workloads)
echo 256 > /sys/block/sda/queue/read_ahead_kb

# Per-file readahead via posix_fadvise
posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  // Hint: sequential
posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);      // Hint: random

mmap vs read

There's another way to read files - memory-mapping with mmap():

// read() approach
char buf[4096];
read(fd, buf, 4096);

// mmap() approach
char *ptr = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
// Access ptr directly - page faults load data

Both use the same page cache:

flowchart BT
    subgraph pagecache["Page Cache<br/>(single cache for file, regardless of access method)"]
    end

    subgraph readmethod["read()"]
        R1["Explicit<br/>copy to<br/>user buf"]
    end

    subgraph mmapmethod["mmap()"]
        M1["On-demand<br/>via page<br/>faults"]
    end

    readmethod --> pagecache
    mmapmethod --> pagecache

read(): Explicit copy from page cache to user buffer. mmap(): Map page cache pages directly into process address space; no copy needed.

See page cache and process address space for details.

Direct I/O

Sometimes you want to bypass the page cache entirely:

// Open with O_DIRECT
int fd = open("file.dat", O_RDONLY | O_DIRECT);

// Read bypasses page cache - goes directly to/from user buffer
read(fd, aligned_buffer, size);

When to use direct I/O: - Database engines (they have their own caching) - Very large sequential reads (cache would be wasteful) - When you need predictable I/O timing

Requirements: - Buffer must be aligned (typically 512 bytes or 4KB) - Offset and size often must be aligned too

The complete picture

Here's the full flow for a read() call:

flowchart TD
    A["read(fd, buf, 4096)"] --> B["fdget_pos(fd)<br/><i>Get struct file from fd table</i>"]
    B --> C["vfs_read()<br/><i>Permission checks, dispatch</i>"]
    C --> D["generic_file_read_iter()<br/><i>Most file systems</i>"]
    D --> E["filemap_read()<br/><i>Page cache logic</i>"]
    E --> F{Page in cache?}

    F -->|YES| G["Page in cache"]
    F -->|NO| H["Not in cache"]

    H --> I["<b>Readahead:</b><br/>- Allocate pages<br/>- Submit I/O<br/>- Wait for I/O<br/>- Add to cache"]

    G --> J["folio_wait_locked()<br/><i>ensure page is ready</i>"]
    I --> J

    J --> K["copy_folio_to_iter()<br/><i>copy to user buffer</i>"]
    K --> L["return bytes_read"]

Try it yourself

Watch cache hit ratio

# Clear the page cache (requires root)
echo 3 > /proc/sys/vm/drop_caches

# Read a file
cat /path/to/large/file > /dev/null

# Check cache stats
cat /proc/vmstat | grep -E "pgpg|pgfault"
# pgpgin  - pages read from disk
# pgpgout - pages written to disk

Compare cached vs uncached reads

# First read (cold cache)
time cat /path/to/large/file > /dev/null

# Second read (warm cache)
time cat /path/to/large/file > /dev/null

# The second read should be much faster

Monitor readahead

# Watch readahead activity
cat /proc/vmstat | grep -E "pgread|pgalloc"

# Trace readahead events
echo 1 > /sys/kernel/debug/tracing/events/filemap/mm_filemap_add_to_page_cache/enable
cat /sys/kernel/debug/tracing/trace_pipe

View cache contents for a file

# Using fincore (part of util-linux)
fincore /path/to/file

# Shows which pages of the file are in cache

# Using vmtouch (third-party tool)
vmtouch -v /path/to/file

Test direct I/O

# Read with direct I/O (bypasses cache)
dd if=/path/to/file of=/dev/null bs=4096 iflag=direct

# Compare with buffered I/O
dd if=/path/to/file of=/dev/null bs=4096

# Direct I/O timing is more consistent but potentially slower
# for cached data

Observe page cache size

# Current cache usage
cat /proc/meminfo | grep -E "Cached|Buffers"

# Watch it grow as you read files
watch -n 1 'cat /proc/meminfo | grep -E "Cached|Buffers"'

# Read a large file in another terminal
cat /path/to/large/file > /dev/null

Key source files

File What It Does
mm/filemap.c Page cache operations, filemap_read()
mm/readahead.c Readahead algorithm
fs/read_write.c VFS read/write entry points
include/linux/fs.h struct address_space, file operations
include/linux/pagemap.h Page cache APIs

History

Page cache evolution

Early Linux: Simple page cache with manual readahead.

v2.4: Unified buffer cache and page cache.

v2.6: Radix tree for page cache lookup, adaptive readahead.

v4.20 (2018): XArray replaced radix tree for page cache indexing.

Commit: 4c7472c0df2f ("page cache: Convert find_get_entry to XArray") | LKML

Author: Matthew Wilcox

In the flow above, filemap_get_folio() uses the xarray to look up cached pages by file offset. The xarray provides better cache utilization than the radix tree it replaced, speeding up the cache hit path.

Folio conversion (v5.16+)

In the flow above, functions like filemap_get_folio() and filemap_read() work with struct folio rather than raw struct page:

Commit: 7b230db3b8d3 ("mm: Introduce struct folio") | LKML

Author: Matthew Wilcox

  • What it enables: The page cache can now store large folios (e.g., 64KB or larger) for files, reducing metadata overhead and improving I/O efficiency for large sequential reads
  • In our flow: When read() triggers a cache miss, filemap_alloc_folio() may allocate a large folio, and readahead fills it with multiple pages worth of file data in a single I/O operation

Notorious bugs and edge cases

The page cache and read path involve complex interactions between filesystems, memory management, and I/O. Bugs here cause data corruption, stale reads, or crashes.


Case 1: filemap_fault() race conditions

What happened

The filemap_fault() function handles page faults for memory-mapped files. Race conditions between invalidation and fault handling have caused multiple bugs.

The bug

From kernel discussions:

A race can occur between: 1. Thread A: Handling a page fault, reading file data into page cache 2. Thread B: Invalidating/truncating the file

If Thread B truncates the file while Thread A is faulting in a page, Thread A might install a page beyond the new file end, or see stale data.

"We see a blank page from time to time, when we should be seeing valid data."

Real-world implications

Applications using mmap'd files (databases, VMs, etc.) could see: - Stale data after truncation - Blank pages - Crashes on invalid access

Mitigations

  • Proper page lock ordering between fault and truncate paths
  • page_mkwrite() coordination for writable mappings
  • Filesystem-specific locking (e.g., i_rwsem)

Case 2: 9p read corruption (2025)

What happened

Recent LKML discussion documented read corruption with 9p filesystem and mmapped content.

The bug

The 9p filesystem's splice implementation had a bug where page pinning during data transfer could race with page cache operations, causing:

  • Incorrect data being read
  • Data appearing in wrong file positions
  • Silent corruption without errors

Why this matters

Any filesystem implementing custom I/O paths (splice, direct I/O, io_uring) must coordinate carefully with the page cache. Bugs in this coordination are difficult to detect because:

  1. They require specific timing
  2. Corruption is silent
  3. Standard I/O testing doesn't trigger them

Case 3: Poisoned page cache data loss

What happened

When a hardware memory error (hwpoison) affects a page cache page, the kernel must handle it carefully to avoid silent data loss.

The bug

From kernel patch discussion:

"Solve silent data loss caused by poisoned page cache (shmem/tmpfs)"

If a page in the page cache gets a hardware error: 1. The page is marked poisoned 2. But readers might still see cached (potentially corrupted) data 3. Writes might silently fail

For tmpfs/shmem, there's no backing store to recover from - the data is simply lost.

Real-world implications

Hardware errors in RAM used for page cache could cause: - Database corruption - Lost user data - Inconsistent filesystem state

Mitigation

The patches ensure: - Poisoned pages are immediately evicted from cache - Subsequent reads go to disk (if file-backed) - Errors are properly propagated to userspace


Case 4: Readahead window explosion

What happened

The adaptive readahead algorithm tries to predict access patterns. Bugs in the heuristics can cause excessive I/O.

The bug

If readahead incorrectly predicts sequential access: 1. Kernel reads many pages ahead 2. Application doesn't use them 3. Useful pages get evicted to make room 4. Application triggers more I/O for the evicted pages 5. System thrashes despite available memory

Historical examples

  • Random access on large files: Readahead assumes sequential, wastes bandwidth
  • Multiple streams: Readahead doesn't track per-reader state well (improved in recent kernels)
  • SSD vs HDD: Optimal readahead differs; default often wrong

Tuning

# Check default readahead (KB)
blockdev --getra /dev/sda
# Typical: 256 (KB)

# Reduce for random workloads
blockdev --setra 64 /dev/sda

# Per-file hint in application
posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

Case 5: Page cache coherency with direct I/O

What happened

Applications mixing buffered I/O (page cache) and direct I/O can see stale data.

The bug

// Thread A: buffered write
write(fd, "AAAA", 4);

// Thread B: direct read (bypasses page cache)
char buf[4];
pread(fd_dio, buf, 4, 0);  // Might see old data!

// Why: Page cache is dirty but not yet written to disk
// Direct I/O reads from disk, sees old data

Real-world implications

Databases often use: - Direct I/O for data files (predictable latency) - Buffered I/O for logs (better small-write performance)

If not carefully coordinated, one subsystem sees stale data from the other.

Mitigation

// Before direct read, flush buffered writes
fsync(fd);  // Ensure page cache written to disk

// Or use O_DSYNC for writes
int fd = open(path, O_WRONLY | O_DSYNC);

// Or invalidate cache before direct read
posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

Case 6: BDI (Backing Device Info) writeback bugs

What happened

The BDI tracks per-device writeback state. Bugs in BDI handling have caused writeback stalls and data loss.

The bug

From a kernel BUG report:

"Kernel panic in __delete_from_page_cache() with a message 'kernel BUG at mm/filemap.c:240!'"

When a device is removed while dirty pages exist: 1. BDI structure may be freed 2. Writeback worker still references it 3. Use-after-free → crash

Why this matters

Hot-pluggable storage (USB, SATA hotplug, NVMe hot remove) must coordinate with the page cache. The page cache may hold dirty data for the device, and removing the device without proper synchronization causes:

  • Lost dirty data
  • Kernel crash
  • Filesystem corruption

Summary: Lessons learned

Bug Root Cause Impact Prevention
filemap_fault race Truncate vs fault race Stale data Proper lock ordering
9p splice corruption Page pinning bug Data corruption Careful splice implementation
Poisoned page cache hwpoison handling Silent data loss Error propagation
Readahead explosion Wrong heuristics Thrashing Tune readahead, use fadvise
DIO cache coherency Buffered vs direct mix Stale reads fsync before DIO
BDI removal Device hot-remove race Crash, data loss Proper teardown sequence

The common thread: the page cache is a critical performance optimization that adds complexity. Any operation that bypasses, invalidates, or races with the cache is a potential bug site.

Further reading

LWN articles