I/O Patterns

How Linux moves data between userspace, the page cache, and storage — and why the interface you choose shapes everything from latency to correctness

Getting Started

This documentation explains how Linux handles I/O — not just the system call API, but the implementation decisions underneath: where data lives at each stage, what copies happen (or don't), what the kernel defers, and what trade-offs each interface forces on application developers.

> **Browsing on GitHub?** See the [full site index](../site-index.md) for a complete listing of all documentation.

Prerequisites

This documentation assumes familiarity with:

  • C programming — The kernel is written in C; syscall behavior is best understood at the C level
  • Basic OS concepts — Processes, kernel vs. userspace, system calls, file descriptors
  • File descriptors — What they are, how open() / close() / dup() work, and that they reference kernel-side struct file objects

If you've used read() and write() from C and wondered what happens inside, you're in the right place. A passing familiarity with virtual memory (pages, the page cache) will help but is not required — the relevant concepts are introduced where needed. See Further Reading for recommended textbooks.

Getting the Kernel Source

git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux

The most relevant directories for I/O are mm/ (page cache, readahead, writeback) and fs/ (syscall entry points, splice, fallocate). For building and configuration, see Documentation/admin-guide/README.rst.

Running Tests

The fastest way to experiment with I/O behavior without rebooting your machine:

# Install virtme-ng (runs the kernel in QEMU with your host filesystem)
pip install virtme-ng

# Build and boot with a test module
virtme-ng --kdir . --append 'test_module.run_tests=1'

For tracing I/O at runtime on your current kernel, blktrace, bpftrace, and ftrace are the primary tools — see observability.md for how to use them.

Suggested Reading Order

Fundamentals — start here:

  1. Buffered I/O — How read() and write() interact with the page cache; what "buffered" actually means
  2. Direct I/O — O_DIRECT, alignment requirements, when to bypass the page cache, and when not to
  3. Readahead — How the kernel speculatively fetches pages before you ask for them, and how to control it
  4. Page Cache and Writeback — Dirty pages, flusher threads, fsync(), and the durability guarantees (or lack thereof) of write()

Memory-Mapped I/O:

  1. mmap I/O — Mapping files into the address space; page faults as the I/O mechanism; when mmap wins and when it doesn't

Zero-Copy and Vectored:

  1. splice and sendfile — Moving data between file descriptors without userspace copies; what "zero-copy" actually means here
  2. Vectored I/O — readv() / writev(), scatter-gather, iov_iter internals

End-to-End Walkthroughs:

  1. Life of a write — Tracing a write() call from userspace through the page cache to the block layer
  2. Async I/O — POSIX AIO (fake), Linux AIO (io_submit), and io_uring (real async); why the history matters

Observability and Tuning:

  1. Observability — Tools for measuring I/O: iostat, blktrace, bpftrace, perf, /proc/diskstats
  2. Tuning Storage I/O — Scheduler selection, readahead size, dirty ratio knobs, fadvise(), madvise()

War Stories:

  1. War Stories — Real bugs, surprising behaviors, and production incidents traced to I/O subsystem edge cases

What You'll Learn

| Textbook Concept | Linux Reality |
|---|---|
| "write() saves data to disk" | No — it writes into the page cache. The kernel may not flush to storage for seconds. Call fsync() if you need durability. |
| "O_DIRECT is faster" | Depends entirely on workload. Alignment requirements are strict, there is no kernel readahead, and for workloads with temporal locality the page cache usually wins. |
| "sendfile is zero-copy" | Zero userspace copies, yes, but the CPU still touches metadata and socket buffers. True zero-copy (DMA to NIC) requires additional hardware support. |
| "mmap is always faster than read()" | Not for sequential streaming. Page fault overhead and TLB pressure make read() competitive; mmap wins for random access and shared data. |
| "AIO is async" | It depends which AIO you mean. POSIX AIO (aio_read) is implemented with threads in glibc. Linux AIO (io_submit) is real kernel async, but in practice only with O_DIRECT; buffered submissions can block inside io_submit(). io_uring is the first truly general async I/O mechanism. |
| "splice avoids copies" | Only when no transformation is needed. Splice moves page references between kernel structures — the moment you touch the data, the copy happens. |

What Makes This Different

  • Real commits: Every claim links to actual kernel commits
  • Real bugs: Learn from what broke in production and why
  • Real discussions: Links to LKML where design decisions were debated
  • Real trade-offs: Not just "what" but "why this interface and not that one"
  • Hands-on: "Try It Yourself" sections with commands to run and code to read

Documentation

Fundamentals

| Document | What You'll Learn |
|---|---|
| buffered-io | How read() and write() move data through the page cache; what happens on a cache hit vs. miss |
| readahead | The kernel's speculative prefetch logic; POSIX_FADV_SEQUENTIAL, posix_fadvise(), manual readahead |
| page-cache-writeback | Dirty page accounting, flusher threads (wb_workfn), dirty throttling, and what fsync() actually flushes |
| direct-io | O_DIRECT path through the kernel, alignment constraints, DIO vs. buffered I/O performance trade-offs |
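
The gap between write() returning and data reaching storage, which the page-cache-writeback document covers, can be sketched in a few lines. This is a minimal example, assuming a helper named write_durably (a name invented here): it loops until every byte is accepted by write(), then calls fsync() before reporting success.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes to path and return only after fsync() has succeeded.
 * Returns 0 on success, -1 with errno set on failure.
 * (write_durably is an illustrative name.) */
int write_durably(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && (buf != NULL || len == 0));

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);  /* data lands in the page cache */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry */
            int saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        p += n;                         /* short writes are possible */
        len -= (size_t)n;
    }

    if (fsync(fd) < 0) {                /* ask the kernel to flush to storage */
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return close(fd);
}
```

A call such as write_durably("state.bin", data, size) covers the file's contents only. For a newly created file, the fsync(2) man page notes that the directory entry needs its own fsync() on a descriptor for the containing directory, which this sketch leaves out.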

Memory-Mapped I/O

| Document | What You'll Learn |
|---|---|
| mmap-io | File-backed mmap, page fault handling, MAP_SHARED vs. MAP_PRIVATE, msync(), and when mmap hurts |
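
To make "page faults as the I/O mechanism" concrete, here is a minimal sketch that maps a file read-only and sums its bytes. The function name sum_file_mmap is invented for this example. No read() call appears anywhere: the first touch of each page is what brings the data in.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum every byte of a file through a read-only private mapping.
 * Returns 0 and stores the sum in *out, or -1 with errno set.
 * (sum_file_mmap is an illustrative name.) */
int sum_file_mmap(const char *path, uint64_t *out)
{
    assert(path != NULL && out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    *out = 0;
    if (st.st_size == 0) {              /* mmap() rejects a zero length */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    void *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    int saved = errno;
    close(fd);                          /* the mapping outlives the descriptor */
    if (map == MAP_FAILED) {
        errno = saved;
        return -1;
    }

    const unsigned char *p = map;
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];                    /* first touch of a page faults it in */

    munmap(map, len);
    *out = sum;
    return 0;
}
```

One caveat the sketch does not handle: if another process truncates the file while it is mapped, touching pages past the new end raises SIGBUS.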

Zero-Copy and Vectored

| Document | What You'll Learn |
|---|---|
| splice-sendfile | splice(), sendfile(), tee(); pipe buffers as the kernel's zero-copy primitive; limitations |
| vectored-io | readv() / writev(), scatter-gather DMA, iov_iter as the kernel's unified buffer abstraction |
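
As a taste of vectored I/O, the sketch below hands a header and a body to the kernel in a single writev() call instead of copying them into one buffer first. The function name write_record and the header/body split are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Submit two separate buffers with one system call.
 * Returns the number of bytes written (possibly fewer than requested)
 * or -1 with errno set. (write_record is an illustrative name.) */
ssize_t write_record(int fd, const void *hdr, size_t hdr_len,
                     const void *body, size_t body_len)
{
    assert(fd >= 0);

    /* struct iovec has no const-qualified base pointer; writev() only
     * reads from the buffers, so the casts are safe here. */
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,  .iov_len = hdr_len  },
        { .iov_base = (void *)body, .iov_len = body_len },
    };
    return writev(fd, iov, 2);
}
```

A caller that needs every byte delivered still has to handle a short return value by advancing through the iovec array and retrying, which this sketch omits.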

Allocation and Space Management

| Document | What You'll Learn |
|---|---|
| fallocate | Pre-allocating space without writing data; FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE, sparse files |
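
The difference between a file's apparent size and the space reserved for it can be seen with a short sketch. The helper name reserve_space and its output parameters are invented for this example. It reserves space with FALLOC_FL_KEEP_SIZE and then reports both st_size and the allocated byte count so the two can be compared.

```c
#define _GNU_SOURCE                     /* fallocate() and FALLOC_FL_* are Linux-specific */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create path if needed and reserve len bytes for it without changing
 * its reported size. On success returns 0 and stores the apparent size
 * and the allocated byte count. On failure returns -1 with errno set
 * (EOPNOTSUPP if the filesystem does not support the operation).
 * (reserve_space is an illustrative name.) */
int reserve_space(const char *path, off_t len,
                  off_t *size_out, off_t *allocated_out)
{
    assert(path != NULL && len > 0);
    assert(size_out != NULL && allocated_out != NULL);

    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, len) < 0 ||
        fstat(fd, &st) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    *size_out = st.st_size;                     /* unchanged by KEEP_SIZE */
    *allocated_out = (off_t)st.st_blocks * 512; /* st_blocks is in 512-byte units on Linux */
    return close(fd);
}
```

On a freshly created file the reported size stays 0 while the allocated count typically grows, though the exact figure depends on the filesystem. Punching a hole uses the same call with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE and an offset/length range.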

End-to-End Walkthroughs

| Document | What You'll Learn |
|---|---|
| life-of-a-write | Tracing write() from the syscall entry through VFS, the address space, the block layer, and back |
| async-io | POSIX AIO implementation, Linux AIO (io_submit) restrictions, io_uring submission queues and completion rings |

Observability and Tuning

| Document | What You'll Learn |
|---|---|
| observability | Reading /proc/diskstats, iostat, blktrace event anatomy, bpftrace one-liners for I/O tracing |
| tuning-storage | Block scheduler selection, nr_requests, readahead tuning, dirty ratio knobs, fadvise / madvise hints |
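
For a first look at what /proc/diskstats contains, the sketch below parses one line of it into a small struct. The struct disk_io type and the function name parse_diskstats_line are invented for this example, and only four of the counters are kept. The field order follows the kernel's iostats documentation: major, minor, device name, then reads completed, reads merged, sectors read, milliseconds reading, writes completed, writes merged, sectors written.

```c
#include <assert.h>
#include <stdio.h>

/* A subset of the per-device counters (illustrative type). */
struct disk_io {
    char name[32];
    unsigned long long reads_completed;
    unsigned long long sectors_read;
    unsigned long long writes_completed;
    unsigned long long sectors_written;
};

/* Parse one line of /proc/diskstats.
 * Returns 0 on success, -1 if the line does not have the expected shape.
 * (parse_diskstats_line is an illustrative name.) */
int parse_diskstats_line(const char *line, struct disk_io *out)
{
    assert(line != NULL && out != NULL);

    unsigned int major, minor;
    unsigned long long reads_merged, ms_reading, writes_merged;

    /* Fields after the ten read here are ignored by this sketch. */
    int n = sscanf(line, "%u %u %31s %llu %llu %llu %llu %llu %llu %llu",
                   &major, &minor, out->name,
                   &out->reads_completed, &reads_merged,
                   &out->sectors_read, &ms_reading,
                   &out->writes_completed, &writes_merged,
                   &out->sectors_written);
    return n == 10 ? 0 : -1;
}
```

The sector counters are in 512-byte units, so multiplying sectors_read by 512 gives bytes read. Sampling the file twice and subtracting the counters gives rates, which is roughly what iostat reports.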

War Stories

| Document | What You'll Learn |
|---|---|
| war-stories | Production incidents and surprising behaviors: write ordering bugs, fsync gaps, mmap + truncate races |

Key Source Files

If you want to read the actual kernel code:

| File | What It Does |
|---|---|
| mm/filemap.c | Core page cache implementation — generic_file_read_iter(), generic_file_write_iter(), page lookup and insertion |
| mm/readahead.c | Readahead state machine, page_cache_async_readahead(), force_page_cache_readahead() |
| mm/page-writeback.c | Dirty page accounting, dirty throttling (balance_dirty_pages()), writeback control |
| fs/fs-writeback.c | Inode writeback, wb_workfn() flusher thread, sync_inodes_sb(), writeback bandwidth estimation |
| fs/read_write.c | read(), write(), pread(), pwrite(), sendfile() syscall entry points; vfs_read() / vfs_write() |
| fs/splice.c | splice(), tee(), and the pipe buffer movement between file descriptors that sendfile() is built on |
| fs/open.c | fallocate() syscall entry and vfs_fallocate(); mode flag validation; filesystem delegation |
| lib/iov_iter.c | iov_iter abstraction over iovec, kvec, bio_vec, and pipe buffers; copy to/from user |
| include/linux/fs.h | struct file, struct inode, struct address_space, struct file_operations — the central VFS types |

Further Reading

Blogs and Articles

Kernel Documentation

Textbooks

Papers