I/O Patterns
How Linux moves data between userspace, the page cache, and storage — and why the interface you choose shapes everything from latency to correctness
Getting Started
This documentation explains how Linux handles I/O — not just the system call API, but the implementation decisions underneath: where data lives at each stage, what copies happen (or don't), what the kernel defers, and what trade-offs each interface forces on application developers.
Prerequisites
This documentation assumes familiarity with:
- C programming — The kernel is written in C; syscall behavior is best understood at the C level
- Basic OS concepts — Processes, kernel vs. userspace, system calls, file descriptors
- File descriptors — What they are, how open() / close() / dup() work, and that they reference kernel-side struct file objects
If you've used read() and write() from C and wondered what happens inside, you're in the right place. A passing familiarity with virtual memory (pages, the page cache) will help but is not required — the relevant concepts are introduced where needed. See Further Reading for recommended textbooks.
Getting the Kernel Source
The most relevant directories for I/O are mm/ (page cache, readahead, writeback) and fs/ (syscall entry points, splice, fallocate). For building and configuration, see Documentation/admin-guide/README.rst.
Running Tests
The fastest way to experiment with I/O behavior without rebooting your machine:
```sh
# Install virtme-ng (runs the kernel in QEMU with your host filesystem)
pip install virtme-ng

# Build and boot with a test module
virtme-ng --kdir . --append 'test_module.run_tests=1'
```
For tracing I/O at runtime on your current kernel, blktrace, bpftrace, and ftrace are the primary tools — see observability.md for how to use them.
Suggested Reading Order
Fundamentals — start here:
- Buffered I/O — How `read()` and `write()` interact with the page cache; what "buffered" actually means
- Direct I/O — `O_DIRECT`, alignment requirements, when to bypass the page cache, and when not to
- Readahead — How the kernel speculatively fetches pages before you ask for them, and how to control it
- Page Cache and Writeback — Dirty pages, flusher threads, `fsync()`, and the durability guarantees (or lack thereof) of `write()`
Memory-Mapped I/O:
- mmap I/O — Mapping files into the address space; page faults as the I/O mechanism; when mmap wins and when it doesn't
Zero-Copy and Vectored:
- splice and sendfile — Moving data between file descriptors without userspace copies; what "zero-copy" actually means here
- Vectored I/O — `readv()` / `writev()`, scatter-gather, `iov_iter` internals
End-to-End Walkthroughs:
- Life of a write — Tracing a `write()` call from userspace through the page cache to the block layer
- Async I/O — POSIX AIO (fake), Linux AIO (`io_submit`), and io_uring (real async); why the history matters
Observability and Tuning:
- Observability — Tools for measuring I/O: `iostat`, `blktrace`, `bpftrace`, `perf`, `/proc/diskstats`
- Tuning Storage I/O — Scheduler selection, readahead size, dirty ratio knobs, `fadvise()`, `madvise()`
War Stories:
- War Stories — Real bugs, surprising behaviors, and production incidents traced to I/O subsystem edge cases
What You'll Learn
| Textbook Concept | Linux Reality |
|---|---|
| "write() saves data to disk" | No — it writes into the page cache. The kernel may not flush to storage for seconds. Call fsync() if you need durability |
| "O_DIRECT is faster" | Depends entirely on workload. Alignment requirements are strict, readahead is disabled, and for workloads with temporal locality the page cache usually wins |
| "sendfile is zero-copy" | Zero userspace copies, yes, but the CPU still touches metadata and socket buffers. True zero-copy (DMA to NIC) requires additional hardware support |
| "mmap is always faster than read()" | Not for sequential streaming. Page fault overhead and TLB pressure make read() competitive; mmap wins for random access and shared data |
| "AIO is async" | It depends which AIO you mean. POSIX AIO (aio_read) is implemented with threads in glibc. Linux AIO (io_submit) is real kernel async but restricted to O_DIRECT. io_uring is the first truly general async I/O mechanism |
| "splice avoids copies" | Only when no transformation is needed. Splice moves page references between kernel structures — the moment you touch the data, the copy happens |
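The first row of the table above is the one that bites most often: a successful `write()` only means the data reached the page cache. As a minimal sketch (the helper name `durable_write` is made up for illustration), a write that is actually durable pairs `write()` with `fsync()`:

```c
#include <fcntl.h>
#include <unistd.h>

/* Write buf to path and force it to stable storage.
 * write() alone only dirties pages in the page cache; fsync() blocks
 * until the data (and the metadata needed to reach it) hit the device.
 * Returns 0 on success, -1 on error. */
static int durable_write(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);
    if (n < 0 || (size_t)n != len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Note that even this sketch glosses over real edge cases (short writes in a loop, fsync of the parent directory after a rename) that the buffered-io and life-of-a-write documents cover in detail.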
What Makes This Different
- Real commits: Every claim links to actual kernel commits
- Real bugs: Learn from what broke in production and why
- Real discussions: Links to LKML where design decisions were debated
- Real trade-offs: Not just "what" but "why this interface and not that one"
- Hands-on: "Try It Yourself" sections with commands to run and code to read
Documentation
Fundamentals
| Document | What You'll Learn |
|---|---|
| buffered-io | How read() and write() move data through the page cache; what happens on a cache hit vs. miss |
| readahead | The kernel's speculative prefetch logic; POSIX_FADV_SEQUENTIAL, posix_fadvise(), manual readahead |
| page-cache-writeback | Dirty page accounting, flusher threads (wb_workfn), dirty throttling, and what fsync() actually flushes |
| direct-io | O_DIRECT path through the kernel, alignment constraints, DIO vs. buffered I/O performance trade-offs |
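To make the direct-io alignment constraints concrete, here is a hedged sketch (helper name `direct_read` is invented): the buffer, file offset, and transfer length must all be aligned, typically to the logical block size; 4096 bytes is the conservative choice. Filesystems that don't support `O_DIRECT` (tmpfs, for one) reject the open with `EINVAL`, which this sketch treats as "nothing to do":

```c
#define _GNU_SOURCE  /* for O_DIRECT */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DIO_ALIGN 4096  /* safe alignment for buffer, offset, and length */

/* Read the first block of path with O_DIRECT, copying up to len bytes
 * into out. Returns bytes read, 0 if the fs lacks O_DIRECT, -1 on error. */
static ssize_t direct_read(const char *path, void *out, size_t len)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return errno == EINVAL ? 0 : -1;  /* EINVAL: fs rejects O_DIRECT */

    void *buf;
    if (posix_memalign(&buf, DIO_ALIGN, DIO_ALIGN) != 0) {  /* aligned buffer */
        close(fd);
        return -1;
    }
    ssize_t n = read(fd, buf, DIO_ALIGN);  /* aligned length required */
    if (n > 0)
        memcpy(out, buf, (size_t)n < len ? (size_t)n : len);
    free(buf);
    close(fd);
    return n;
}
```

An unaligned buffer or length here would fail with `EINVAL` at `read()` time — the kernel does not round for you.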
Memory-Mapped I/O
| Document | What You'll Learn |
|---|---|
| mmap-io | File-backed mmap, page fault handling, MAP_SHARED vs. MAP_PRIVATE, msync(), and when mmap hurts |
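The "page faults as the I/O mechanism" point is easy to see in code. In this sketch (the helper name `mmap_peek` is made up), no `read()` ever happens: the `memcpy` from the mapping faults, and the fault handler pulls the page from the page cache, going to disk on a miss:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file and copy out its first bytes without calling read().
 * Returns the number of bytes copied, or -1 on error. */
static ssize_t mmap_peek(const char *path, char *out, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    size_t n = (size_t)st.st_size < len ? (size_t)st.st_size : len;
    void *map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after close */
    if (map == MAP_FAILED)
        return -1;
    memcpy(out, map, n);  /* first touch faults the pages in */
    munmap(map, (size_t)st.st_size);
    return (ssize_t)n;
}
```

This is also where the costs show up: each faulted page means a trap plus page-table and TLB work, which is why sequential streaming usually favors plain `read()`.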
Zero-Copy and Vectored
| Document | What You'll Learn |
|---|---|
| splice-sendfile | splice(), sendfile(), tee(); pipe buffers as the kernel's zero-copy primitive; limitations |
| vectored-io | readv() / writev(), scatter-gather DMA, iov_iter as the kernel's unified buffer abstraction |
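As a taste of the vectored-io material, here is a minimal `writev()` sketch (the helper name is invented): two separate userspace buffers go out in a single syscall, and the kernel walks the `iovec` array via its `iov_iter` machinery, so userspace never concatenates them:

```c
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Gather a header and a body into one write() syscall. The data lands
 * contiguously in the file; no userspace concatenation buffer needed.
 * Returns the total bytes written, or -1 on error. */
static ssize_t write_header_and_body(int fd, const char *hdr, const char *body)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,  .iov_len = strlen(hdr)  },
        { .iov_base = (void *)body, .iov_len = strlen(body) },
    };
    return writev(fd, iov, 2);  /* one syscall, two buffers */
}
```

Like `write()`, `writev()` may return a short count, so production code still needs a retry loop over the remaining iovecs.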
Allocation and Space Management
| Document | What You'll Learn |
|---|---|
| fallocate | Pre-allocating space without writing data; FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE, sparse files |
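A brief sketch of the `FALLOC_FL_KEEP_SIZE` behavior described above (the helper name `reserve_space` is invented; `EOPNOTSUPP` is treated as a soft failure since not every filesystem implements `fallocate()`): blocks get reserved, but `st_size` stays at zero, so readers see an unchanged file while a later append cannot fail with `ENOSPC` for that range:

```c
#define _GNU_SOURCE  /* for fallocate() and FALLOC_FL_* */
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Reserve len bytes of space without writing data and without changing
 * the visible file size. Returns 0 on success, -1 on error. */
static int reserve_space(const char *path, off_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    int ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, len);
    if (ret != 0 && errno == EOPNOTSUPP)
        ret = 0;  /* filesystem doesn't support fallocate; carry on */

    struct stat st;
    int size_unchanged = (fstat(fd, &st) == 0 && st.st_size == 0);
    close(fd);
    return (ret == 0 && size_unchanged) ? 0 : -1;
}
```

Dropping the `FALLOC_FL_KEEP_SIZE` flag would instead extend `st_size`, and `FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE` does the reverse: it deallocates a range, creating a sparse file.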
End-to-End Walkthroughs
| Document | What You'll Learn |
|---|---|
| life-of-a-write | Tracing write() from the syscall entry through VFS, the address space, the block layer, and back |
| async-io | POSIX AIO implementation, Linux AIO (io_submit) restrictions, io_uring submission queues and completion rings |
Observability and Tuning
| Document | What You'll Learn |
|---|---|
| observability | Reading /proc/diskstats, iostat, blktrace event anatomy, bpftrace one-liners for I/O tracing |
| tuning-storage | Block scheduler selection, nr_requests, readahead tuning, dirty ratio knobs, fadvise / madvise hints |
War Stories
| Document | What You'll Learn |
|---|---|
| war-stories | Production incidents and surprising behaviors: write ordering bugs, fsync gaps, mmap + truncate races |
Key Source Files
If you want to read the actual kernel code:
| File | What It Does |
|---|---|
| `mm/filemap.c` | Core page cache implementation — `generic_file_read_iter()`, `generic_file_write_iter()`, page lookup and insertion |
| `mm/readahead.c` | Readahead state machine, `page_cache_async_readahead()`, `force_page_cache_readahead()` |
| `mm/page-writeback.c` | Dirty page accounting, dirty throttling (`balance_dirty_pages()`), writeback control |
| `fs/fs-writeback.c` | Inode writeback, `wb_workfn()` flusher thread, `sync_inodes_sb()`, writeback bandwidth estimation |
| `fs/read_write.c` | `read()`, `write()`, `pread()`, `pwrite()` syscall entry points; `vfs_read()` / `vfs_write()` |
| `fs/splice.c` | `splice()`, `sendfile()`, `tee()` — pipe buffer movement between file descriptors |
| `fs/open.c` | `fallocate()` syscall entry and `vfs_fallocate()`; mode flag validation; filesystem delegation |
| `lib/iov_iter.c` | `iov_iter` abstraction over `iovec`, `kvec`, `bio_vec`, and pipe buffers; copy to/from user |
| `include/linux/fs.h` | `struct file`, `struct inode`, `struct address_space`, `struct file_operations` — the central VFS types |
Further Reading
Blogs and Articles
- Jens Axboe's blog — https://axboe.dk/ — Axboe is the author of io_uring and the block layer; his posts explain design decisions directly from the source
- LWN.net I/O coverage:
- Rethinking the page cache — folio conversion and the move away from raw page pointers
- io_uring in 5.1 — the original io_uring introduction
- Buffered vs. direct I/O — when O_DIRECT actually helps
- Writeback and the dirty limit — how dirty throttling was redesigned
Kernel Documentation
- Documentation/filesystems/vfs.rst — VFS layer internals, `file_operations`, `address_space_operations`
- Documentation/filesystems/locking.rst — Locking rules for filesystem code
- Documentation/block/ — Block layer, I/O schedulers, request queues
- Documentation/admin-guide/mm/ — `vm.dirty_ratio`, `vm.dirty_background_ratio`, and other tunables
Textbooks
- Silberschatz et al., Operating System Concepts — Ch. 13-14 (Storage and I/O Systems)
- Tanenbaum & Bos, Modern Operating Systems — Ch. 5 (Input/Output)
- Love, Linux Kernel Development — Ch. 13 (The Virtual Filesystem), Ch. 15 (The Block I/O Layer)
- Kerrisk, The Linux Programming Interface — Ch. 13 (File I/O Buffering), Ch. 49 (Memory Mappings)
Papers
- Axboe, Efficient IO with io_uring — The design document for io_uring from its author
- Pai et al., Flash: An Efficient and Portable Web Server — USENIX 1999; the paper that exposed the limits of POSIX AIO for real workloads