io_uring
True async I/O for everything — files, sockets, pipes, and more, without a syscall per operation
Getting Started
This documentation explains how io_uring works — not just the API, but the design decisions behind the ring layout, the trade-offs that shaped it, and the real-world patterns used in databases, storage engines, and network servers.
Prerequisites
This documentation assumes familiarity with:
- C programming — io_uring's API and the kernel implementation are both in C
- File descriptors and syscalls — how read(), write(), accept() work at the OS level
- Basic async concepts — event loops, non-blocking I/O, callbacks
If you've used epoll or libaio, you have a good foundation. If not, understanding blocking vs. non-blocking I/O is enough to start.
Getting the Kernel Source
The io_uring implementation lives in io_uring/ (~25,000 lines). The public ABI is in include/uapi/linux/io_uring.h.
Suggested Reading Order
Start here:
- Architecture and Rings — ring layout, SQE/CQE structures, submission and completion flow
- Life of a Request — end-to-end trace: from userspace SQE to kernel completion to CQE
Operations:
- Operations Reference — the full opcode table, flags, and per-operation semantics
Advanced topics:
- Fixed Buffers and Files — pre-registered buffers and file descriptors, zero-copy I/O
- Multishot Operations — one SQE, many CQEs: multishot accept, recv, and poll
- Networking — accept loops, zero-copy send, TCP zero-copy receive
Comparison and security:
- io_uring vs epoll — when to use each, what epoll can't do, migration patterns
- Security — privilege requirements, attack surface, seccomp filtering
- War Stories — real bugs, CVEs, and lessons from production
What You'll Learn
| Textbook Concept | Linux Reality |
|---|---|
| "async I/O requires threads" | io_uring needs no application-side I/O thread pool; in SQPOLL mode a kernel thread polls the SQ ring and submits on the application's behalf |
| "one syscall per I/O" | io_uring batches many submissions into one io_uring_enter() call; with SQPOLL, no call is needed while the poller thread is awake |
| "epoll handles all I/O" | epoll only reports readiness and rejects regular files (epoll_ctl() fails with EPERM), while poll() and select() report them as always ready; io_uring handles files, sockets, and pipes uniformly |
| "io_uring is just for servers" | Databases (RocksDB, ScyllaDB), storage engines, and game engines all use io_uring for file I/O |
| "zero-copy means no CPU" | Fixed buffers remove the per-I/O cost of pinning and mapping user pages, but buffered paths still copy data, and DMA setup and completion interrupts still consume CPU |
What Makes This Different
- Real commits: Every claim links to actual kernel commits
- Real bugs: Learn from what broke and why
- Real discussions: Links to LKML where design decisions were debated
- Real trade-offs: Not just "what" but "why this and not that"
- Hands-on: "Try It Yourself" sections with commands to run and code to read
What io_uring Is
io_uring was introduced in Linux 5.1 (May 2019) by Jens Axboe. It is a kernel interface for asynchronous I/O built around two lock-free rings in shared memory:
- The Submission Queue (SQ) — userspace writes SQEs describing I/O operations
- The Completion Queue (CQ) — the kernel posts CQEs with results
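Because both rings are mapped into userspace memory, the application publishes SQEs and reaps CQEs with plain memory reads and writes. Without SQPOLL, one io_uring_enter() call can submit a whole batch; with SQPOLL enabled and the poller thread awake, the submission path needs no syscall at all. Completions that have already been posted can always be harvested without entering the kernel.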
    Userspace                              Kernel
    ────────────────────────────────────────────────────────────────
                                   ┌──────────────────────┐
    SQ ring   (mmap'd) ──────────► │ io_uring instance    │
    SQE array (mmap'd) ──────────► │ (struct io_ring_ctx) │
                                   └──────────────────────┘
    CQ ring   (mmap'd) ◄──────────            │
                                   io_uring_enter() or SQPOLL
                                              │
                                   direct dispatch / io-wq
                                              │
                                   complete → post CQE
How io_uring Differs from epoll and POSIX AIO
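epoll is a readiness notification mechanism: it tells you when a file descriptor can be read or written without blocking. You still need a second syscall to actually transfer data. It does not work for regular files: epoll_ctl() rejects them with EPERM, and poll() or select() report them as always "ready", so readiness says nothing about whether a read will block on disk.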
POSIX AIO (aio_read, aio_write) is implemented in glibc as a thread pool, not a true kernel interface. It has poor performance, a complex signal-based notification model, and limited operation coverage.
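Linux native AIO (io_submit, io_getevents, usually reached through libaio) is a real kernel interface, but it is only reliably asynchronous for O_DIRECT I/O (buffered requests can block inside io_submit), lacks general socket support, and needs at least one syscall to submit each batch and another to reap completions.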
io_uring addresses all of these:
| Feature | epoll | libaio | io_uring |
|---|---|---|---|
| Regular file I/O | No | O_DIRECT only | Yes |
| Socket I/O | Yes | No | Yes |
| Zero syscall path | No | No | Yes (SQPOLL) |
| Batched submission | No | Limited | Yes |
| Linked operations | No | No | Yes |
| Fixed buffers | No | No | Yes |
| Multishot ops | No | No | Yes |
Documentation
Core Reading
| Document | What You'll Learn |
|---|---|
| io-uring-arch | Ring layout, SQE/CQE structures, io_uring_setup, submission and completion flow |
| life-of-request | End-to-end trace of a single read: SQE submission, kernel dispatch, CQE posting |
| io-uring-ops | Opcode table, per-operation flags and semantics, cancellation |
Advanced Topics
| Document | What You'll Learn |
|---|---|
| fixed-buffers | Pre-registered buffers and file descriptors — eliminating per-I/O setup cost |
| multishot-ops | One SQE generating many CQEs: multishot accept, recv, and poll |
| networking | High-throughput accept loops, zero-copy TCP send, recv with buffer selection |
Comparison and Security
| Document | What You'll Learn |
|---|---|
| io-uring-vs-epoll | When each model fits, what epoll can't do, migration patterns |
| security | Attack surface, privilege requirements, seccomp and io_uring, CVE history |
| war-stories | Real bugs, production failures, and what they teach about the implementation |
Key Source Files
If you want to read the actual kernel code:
| File | What It Does |
|---|---|
| io_uring/io_uring.c | Core ring management, io_uring_setup, io_uring_enter, context lifecycle |
| io_uring/rw.c | Read and write operations: IORING_OP_READ, IORING_OP_WRITE, vectored variants |
| io_uring/net.c | Network operations: IORING_OP_ACCEPT, IORING_OP_SEND, IORING_OP_RECV |
| io_uring/sqpoll.c | SQPOLL kernel thread: polling the SQ ring without userspace syscalls |
| io_uring/io-wq.c | Async worker thread pool for operations that cannot be dispatched inline |
| io_uring/rsrc.c | Fixed buffer and fixed file registration, resource table management |
| include/uapi/linux/io_uring.h | Userspace ABI: io_uring_sqe, io_uring_cqe, opcodes, flags, feature bits |
Quick Reference
A minimal single read using liburing:

    #include <liburing.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char buf[4096];
        int fd, ret;

        /* Initialize ring with 32 SQE slots; liburing returns -errno on failure */
        ret = io_uring_queue_init(32, &ring, 0);
        if (ret < 0) {
            fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
            return 1;
        }
        fd = open("/etc/hostname", O_RDONLY);
        if (fd < 0) {
            perror("open");
            io_uring_queue_exit(&ring);
            return 1;
        }

        /* Get a free SQE slot (never NULL on a fresh ring) and describe a read */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        sqe->user_data = 42; /* tag copied verbatim to CQE */

        /* Submit: tells the kernel there are pending SQEs */
        io_uring_submit(&ring);

        /* Wait for one completion */
        ret = io_uring_wait_cqe(&ring, &cqe);
        if (ret == 0) {
            /* cqe->res is the byte count, or -errno if the read failed */
            printf("read %d bytes (tag=%llu)\n", cqe->res,
                   (unsigned long long)cqe->user_data);
            io_uring_cqe_seen(&ring, cqe); /* advance CQ head */
        }

        close(fd);
        io_uring_queue_exit(&ring);
        return 0;
    }
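Compile and link against liburing (the source file name below is a placeholder):

    gcc -o uring_read uring_read.c -luring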
Key Patterns
SQPOLL (zero-syscall I/O):
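    struct io_uring_params params = {
        .flags = IORING_SETUP_SQPOLL,
        .sq_thread_idle = 2000, /* ms of inactivity before the poller thread sleeps */
    };
    int ring_fd = io_uring_setup(128, &params);
    /* While the poller thread is awake, new SQEs are picked up without
     * io_uring_enter(). Once it sleeps, the SQ ring flags show
     * IORING_SQ_NEED_WAKEUP and one io_uring_enter() call with
     * IORING_ENTER_SQ_WAKEUP is needed; liburing's io_uring_submit() handles this. */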
Linked requests (ordered chain):
    /* Write then fsync, guaranteed order: the fsync starts only after the
     * write completes; if the write fails, the fsync is canceled (-ECANCELED) */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, len, 0);
    sqe->flags |= IOSQE_IO_LINK; /* link to next SQE */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, 0);
    /* No IOSQE_IO_LINK here: this SQE ends the chain */
    io_uring_submit(&ring);
Further Reading
Papers and Guides
- Axboe, Efficient IO with io_uring (2019): the original design paper; explains the ring layout, SQPOLL, and fixed buffers
- Hussain, Lord of the io_uring: a practical guide with worked examples, written and hosted by Shuveb Hussain
LWN Articles
- The rapid growth of io_uring (2020): early overview of capabilities and design rationale
- An introduction to the io_uring asynchronous I/O framework (2019) — how it compares to existing async interfaces
- io_uring and seccomp (2021) — security model and seccomp interaction
Userspace Library
- liburing on GitHub — the standard userspace library; includes examples and wrappers for every opcode
Kernel Documentation
- Documentation/filesystems/io_uring.rst: in-tree documentation
- include/uapi/linux/io_uring.h: the definitive ABI reference
Textbooks and Background
- Love, Linux Kernel Development — Ch. 13 (Block I/O), background on the I/O stack io_uring builds on
- Stevens & Rago, Advanced Programming in the UNIX Environment — Ch. 14 (Advanced I/O), for understanding the pre-io_uring landscape
Appendix
- Glossary — Terminology reference (SQE, CQE, SQPOLL, io-wq, fixed buffer, multishot)