Async I/O Evolution
From POSIX AIO to Linux AIO to io_uring
The problem with synchronous I/O
Synchronous read()/write() block the calling thread until data is transferred. For high-throughput servers this means either:
- Many threads (each blocking on I/O) — high memory cost, context switch overhead
- Non-blocking I/O with epoll — works for sockets, but epoll rejects regular files outright (epoll_ctl fails with EPERM)
The goal of async I/O: submit I/O, do other work, collect results when ready.
POSIX AIO (glibc userspace)
POSIX AIO (aio_read, aio_write) is the standard API — but the Linux glibc implementation fakes it with threads:
#include <aio.h>
struct aiocb cb = {
.aio_fildes = fd,
.aio_buf = buf,
.aio_nbytes = 4096,
.aio_offset = 0,
/* Notify via signal: */
.aio_sigevent = {
.sigev_notify = SIGEV_SIGNAL,
.sigev_signo = SIGUSR1,
},
};
/* Submit async read */
aio_read(&cb);
/* ... do other work ... */
/* Check completion */
int err = aio_error(&cb);
if (err == 0) {
ssize_t n = aio_return(&cb); /* bytes transferred */
}
/* Or: wait for all */
const struct aiocb *list[] = { &cb };
aio_suspend(list, 1, NULL);
The problem: glibc implements POSIX AIO using a thread pool. Each aio_read() submits work to a worker thread that calls pread() synchronously. This adds:
- Thread creation/teardown overhead
- Memory per thread (stack)
- Context switches to the worker thread and back
- No kernel bypass — still goes through the kernel per-thread
Linux Kernel AIO
Linux added native async I/O in 2.6 (io_submit, io_getevents). Unlike POSIX AIO, this runs in the kernel — no threads.
#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <unistd.h>
/* glibc provides no wrappers for these syscalls: call them via
   syscall(2), e.g. syscall(SYS_io_setup, 128, &ctx), or link libaio. */
/* Setup: create an AIO context */
aio_context_t ctx = 0;
io_setup(128, &ctx); /* max 128 in-flight ops */
/* Prepare an iocb */
struct iocb cb = {
.aio_lio_opcode = IOCB_CMD_PREAD,
.aio_fildes = fd,
.aio_buf = (uint64_t)(uintptr_t)buf,
.aio_nbytes = 4096,
.aio_offset = 0,
};
struct iocb *cbs[] = { &cb };
/* Submit */
io_submit(ctx, 1, cbs);
/* Wait for completion */
struct io_event events[1];
struct timespec timeout = { .tv_sec = 1 };
int n = io_getevents(ctx, 1, 1, events, &timeout);
if (n == 1) {
printf("result: %ld\n", events[0].res);
}
io_destroy(ctx);
Linux AIO limitations
Linux AIO was designed specifically for O_DIRECT I/O. It has serious limitations:
| Limitation | Details |
|---|---|
| O_DIRECT only | Buffered I/O submits synchronously even with io_submit |
| No network sockets | Only block/file I/O, not sockets or pipes |
| io_submit can block | On buffered files or full submission ring |
| No fixed buffers | Each op requires mapping/unmapping user buffers |
| No chaining | Can't express "read A, then write B" without round-tripping to userspace |
| No fsync | IOCB_CMD_FSYNC exists but rarely works async |
The result: Linux AIO only helps databases already using O_DIRECT. It does not solve the general async I/O problem.
Kernel path for Linux AIO
/* fs/aio.c (simplified) */
long io_submit(aio_context_t ctx_id, long nr, struct iocb __user **iocbpp)
{
struct kioctx *ctx = lookup_ioctx(ctx_id);
for (i = 0; i < nr; i++) {
struct iocb __user *user_iocb = iocbpp[i];
ret = io_submit_one(ctx, user_iocb, &tmp, compat);
}
}
static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, ...)
{
struct aio_kiocb *req = aio_get_req(ctx);
switch (iocb.aio_lio_opcode) {
case IOCB_CMD_PREAD:
ret = aio_read(&req->rw, &iocb, false, compat);
/* For O_DIRECT: submits bio and returns immediately */
/* For buffered: may block right here in io_submit! */
break;
case IOCB_CMD_PWRITE:
ret = aio_write(&req->rw, &iocb, false, compat);
break;
}
}
io_uring: solving the problem properly
io_uring (merged in Linux 5.1, 2019) addresses all Linux AIO limitations. It uses two shared-memory ring buffers between kernel and userspace:
Userspace Kernel
│ │
│ SQ ring (submissions) │
│ ┌─────────────────────┐ │
│ │ SQE SQE SQE SQE ... │──▶│ io_uring_enter()
│ └─────────────────────┘ │ ↓
│ submits I/O
│ CQ ring (completions) │
│ ┌─────────────────────┐ │
│ │ CQE CQE CQE CQE ... │◀──│ kernel writes completions
│ └─────────────────────┘ │
No locking needed: SQ tail is written by userspace, SQ head by kernel; CQ tail by kernel, CQ head by userspace.
Basic io_uring usage
#include <liburing.h>
struct io_uring ring;
io_uring_queue_init(128, &ring, 0);
/* Submit a read */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, 4096, offset);
sqe->user_data = 42; /* tag to identify completion */
io_uring_submit(&ring);
/* Wait for completion */
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
printf("read %d bytes (tag=%llu)\n", cqe->res, (unsigned long long)cqe->user_data);
io_uring_cqe_seen(&ring, cqe);
io_uring_queue_exit(&ring);
io_uring vs Linux AIO
| Feature | Linux AIO | io_uring |
|---|---|---|
| Buffered I/O | Blocks in submit | Truly async |
| Network sockets | No | Yes (recv/send/accept/connect) |
| Fixed buffers | No | Yes (zero-copy after register) |
| Request chaining | No | Yes (IOSQE_IO_LINK) |
| Polled I/O (no IRQ) | No | Yes (IORING_SETUP_IOPOLL) |
| Kernel thread submit | No | Yes (IORING_SETUP_SQPOLL) |
| fsync | Partial | Yes |
| Splice/tee | No | Yes |
| openat/statx/unlinkat | No | Yes |
| Zero syscalls | No | Yes (with SQPOLL) |
io_uring SQPOLL: zero-syscall I/O
With IORING_SETUP_SQPOLL, a kernel thread polls the SQ ring:
struct io_uring_params params = {
.flags = IORING_SETUP_SQPOLL,
.sq_thread_idle = 2000, /* ms before kernel thread sleeps */
};
io_uring_queue_init_params(128, &ring, &params);
/* With SQPOLL, io_uring_submit() still updates the ring tail but skips
   the io_uring_enter() syscall while the kernel thread is awake */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, buf, 4096, 0);
io_uring_sqe_set_flags(sqe, IOSQE_FIXED_FILE); /* pre-5.11 kernels require registered files with SQPOLL */
/* No syscall needed if kernel thread is still polling: */
io_uring_submit(&ring); /* may return without syscall */
Useful for NVMe with polled queues (IORING_SETUP_IOPOLL + IORING_SETUP_SQPOLL): latency of 50-100µs with zero syscalls.
io_uring request chaining
/* Chain: read file → write to socket, only if read succeeds */
struct io_uring_sqe *read_sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(read_sqe, file_fd, buf, 4096, 0);
io_uring_sqe_set_flags(read_sqe, IOSQE_IO_LINK); /* link to next */
struct io_uring_sqe *send_sqe = io_uring_get_sqe(&ring);
io_uring_prep_send(send_sqe, sock_fd, buf, 4096, 0);
/* No IOSQE_IO_LINK: end of chain */
io_uring_submit(&ring);
/* Kernel executes read, then send (if read succeeded) */
io_uring kernel internals
/* io_uring/io_uring.c (simplified) */
static int io_read(struct io_kiocb *req, unsigned int issue_flags)
{
struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
struct kiocb *kiocb = &rw->kiocb;
/* Try inline (non-blocking first): */
ret = io_iter_do_read(rw, &iter);
if (ret == -EAGAIN) {
/* Would block: punt to io-wq thread pool */
if (issue_flags & IO_URING_F_NONBLOCK)
return -EAGAIN;
/* Or: submit to async worker */
io_req_task_work_add(req);
}
/* Complete inline if data was in page cache */
io_req_set_res(req, ret, 0);
return IOU_OK;
}
io_uring first attempts the operation inline (non-blocking). If it would block, it falls back to an io-wq worker thread: a kernel-managed thread pool, similar in spirit to but much cheaper than POSIX AIO's userspace worker threads.
epoll limitations for file I/O
For completeness: epoll does not solve async file I/O:
/* This does NOT work for regular files: */
int fd = open("file.dat", O_RDONLY | O_NONBLOCK); /* O_NONBLOCK is ignored for regular files */
int epfd = epoll_create1(0);
struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
/* Fails with EPERM: epoll refuses regular files outright */
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
/* poll() and select() do accept regular files, but report them as */
/* always "ready", and read() still blocks if data isn't in the page cache */
Regular files have no genuine readiness semantics: the VFS has no mechanism to notify when a page cache miss resolves. epoll_ctl therefore rejects them with EPERM, while poll() and select() simply report them as always ready. Only sockets, pipes, and similar objects support real readiness notification.
Choosing the right async I/O
| Scenario | Recommendation |
|---|---|
| O_DIRECT database I/O | io_uring (or Linux AIO for legacy) |
| Network + file I/O together | io_uring |
| High-throughput NVMe | io_uring + IOPOLL + SQPOLL |
| Portable POSIX code | POSIX AIO (accepts the thread overhead) |
| Socket-only server | epoll + non-blocking sockets |
| Buffered file I/O, simple | Sync read/write + thread pool |
Further reading
- Direct I/O — O_DIRECT, alignment requirements
- io_uring: Architecture — ring setup, SQE/CQE layout
- io_uring: Operations — full operation reference
- io_uring/io_uring.c — io_uring core
- fs/aio.c — Linux AIO implementation
- io_uring.pdf — Jens Axboe's design paper