io Cgroup Controller
Block I/O bandwidth throttling, weight-based scheduling, and latency SLAs
Overview
The cgroup v2 io controller manages block I/O resources for processes. It provides:
- io.max: hard rate limits (bytes/sec, IOPS)
- io.weight: proportional share scheduling (WFQ)
- io.latency: latency SLA enforcement (throttle sibling cgroups to protect a target latency)
# Enable the io controller for the child cgroups of mycgroup:
echo "+io" > /sys/fs/cgroup/mycgroup/cgroup.subtree_control
# Check which I/O scheduler a device is using:
cat /sys/block/sda/queue/scheduler
# [mq-deadline] kyber bfq none
# BFQ is the scheduler that implements io.weight proportional scheduling;
# io.max and io.latency are scheduler-independent (block-layer mechanisms)
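The active scheduler is the bracketed entry in that sysfs file. A small helper (hypothetical, plain POSIX shell) can extract it from the file's contents:

```shell
# Extract the active scheduler (the bracketed token) from a
# /sys/block/<dev>/queue/scheduler line, e.g. "[mq-deadline] kyber bfq none".
active_sched() {
    # $1: contents of the scheduler file
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^\[\(.*\)\]$/\1/p'
}

active_sched "[mq-deadline] kyber bfq none"   # prints: mq-deadline

# To let io.weight take effect, switch the device to BFQ (needs root):
#   echo bfq > /sys/block/sda/queue/scheduler
```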
io.max: Hard rate limits
# Format: MAJ:MIN rbps=N wbps=N riops=N wiops=N
echo "8:0 rbps=52428800" > /sys/fs/cgroup/mycgroup/io.max # 50MB/s read
echo "8:0 wbps=26214400" > /sys/fs/cgroup/mycgroup/io.max # 25MB/s write
echo "8:0 riops=1000" > /sys/fs/cgroup/mycgroup/io.max # 1000 read IOPS
echo "8:0 wiops=500" > /sys/fs/cgroup/mycgroup/io.max # 500 write IOPS
# Multiple limits at once:
echo "8:0 rbps=52428800 wbps=26214400 riops=1000 wiops=500" \
> /sys/fs/cgroup/mycgroup/io.max
# Remove a limit:
echo "8:0 rbps=max wbps=max riops=max wiops=max" \
> /sys/fs/cgroup/mycgroup/io.max
# Read current limits:
cat /sys/fs/cgroup/mycgroup/io.max
# 8:0 rbps=52428800 wbps=26214400 riops=1000 wiops=500
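io.max takes raw byte counts, which is why the examples above use values like 52428800. A tiny helper (hypothetical) converts human-readable rates into those values:

```shell
# Convert a size like 50M / 25M / 1G into bytes for io.max (powers of 1024).
to_bytes() {
    case "$1" in
        *K) echo $(( ${1%K} * 1024 )) ;;
        *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
        *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
        *)  echo "$1" ;;   # already plain bytes
    esac
}

to_bytes 50M   # prints: 52428800

# Usage (needs root):
#   echo "8:0 rbps=$(to_bytes 50M) wbps=$(to_bytes 25M)" > /sys/fs/cgroup/mycgroup/io.max
```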
Kernel implementation
/* block/blk-throttle.c (simplified sketch; helper and field names abridged) */
/* A group is idle when it has dispatched no bytes in the current slice: */
static bool throtl_tg_is_idle(struct throtl_grp *tg)
{
	return (tg->bytes_disp[READ] + tg->bytes_disp[WRITE]) == 0;
}

/* When a bio is submitted, blk-throttle checks whether the cgroup still has
 * bps/iops budget in the current slice; if not, the bio stays queued: */
static void tg_dispatch_one_bio(struct throtl_grp *tg, bool rw)
{
	struct throtl_data *td = tg->td;
	struct bio *bio;

	/* Over budget for this slice: arm a timer and dispatch later */
	if (tg_over_limit(tg, rw)) {
		tg->disptime = jiffies + tg->slice_remaining[rw];
		throtl_schedule_pending_timer(tg->service_queue, tg->disptime);
		return;
	}

	/* Within budget: pop the next queued bio and dispatch it */
	bio = bio_list_pop(&tg->service_queue->bio_lists[rw]);
	throtl_dispatch_bio(td, tg, bio);
}
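The effect of a bps limit is easy to sanity-check: at rbps=52428800, reading N bytes must take at least N/rbps seconds of wall-clock time. A quick helper (hypothetical) for that arithmetic:

```shell
# Minimum wall-clock seconds to move a given byte count under a bps cap
# (integer floor; real runs take at least this long).
min_seconds() {
    # $1: bytes, $2: bps limit
    echo $(( $1 / $2 ))
}

min_seconds 1073741824 52428800   # 1 GiB at 50 MB/s -> prints: 20

# Verify in practice (as root, from a task inside the throttled cgroup):
#   dd if=/dev/sda of=/dev/null bs=1M count=1024 iflag=direct
```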
io.weight: Proportional scheduling
io.weight requires an I/O scheduler with proportional-share support; in the mainline kernel that means BFQ (Kyber, mq-deadline, and none ignore weights):
# Set weight (1-10000, default 100):
echo "default 200" > /sys/fs/cgroup/high-prio/io.weight # 2x default
echo "default 50" > /sys/fs/cgroup/low-prio/io.weight # 0.5x default
# Per-device weight (overrides the default for that device):
echo "8:0 200" > /sys/fs/cgroup/high-prio/io.weight # only for /dev/sda
echo "default 100" > /sys/fs/cgroup/high-prio/io.weight # other devices keep default
# Under BFQ, io.weight implements WFQ (Weighted Fair Queuing).
# With three sibling cgroups competing on the same saturated device:
# high-prio gets 200/(200+50+100) = 57% of bandwidth
# low-prio gets 50/(200+50+100) = 14% of bandwidth
# normal gets 100/(200+50+100) = 29% of bandwidth
# (Weights only matter under contention; an idle device serves everyone.)
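The shares above can be recomputed for any set of sibling weights; a throwaway awk sketch:

```shell
# Print each sibling cgroup's bandwidth share from its io.weight.
# Shares are weight / sum(weights), and only apply under contention.
shares() {
    # args: one weight per sibling, e.g. "shares 200 50 100"
    echo "$@" | awk '{
        sum = 0
        for (i = 1; i <= NF; i++) sum += $i
        for (i = 1; i <= NF; i++)
            printf "weight %d -> %.0f%%\n", $i, 100 * $i / sum
    }'
}

shares 200 50 100
# weight 200 -> 57%
# weight 50 -> 14%
# weight 100 -> 29%
```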
BFQ weight-based scheduling
BFQ (Budget Fair Queueing) is a proportional-share I/O scheduler that implements weight-based fairness:
/* block/bfq-iosched.c */
struct bfq_group {
struct bfq_entity entity; /* scheduling entity */
struct bfq_sched_data sched_data; /* active/idle queues */
/* ... */
};
/* block/bfq-cgroup.c (simplified): when a group's policy data is set up,
 * its scheduling entity inherits the weight configured via io.weight */
static void bfq_pd_init(struct blkg_policy_data *pd)
{
	struct blkcg_gq *blkg = pd_to_blkg(pd);
	struct bfq_group *bfqg = blkg_to_bfqg(blkg);
	struct bfq_group_data *d = blkcg_to_bfqgd(blkg->blkcg);

	bfqg->entity.weight = d->weight; /* from io.weight */
	bfqg->entity.new_weight = bfqg->entity.weight;
}
io.latency: Latency SLA
io.latency provides latency-based I/O protection; it was introduced in Linux 4.19 by Josef Bacik. When a cgroup's observed completion latency exceeds its target, the controller throttles competing cgroups:
# Set a target latency for a cgroup (value in microseconds):
echo "8:0 target=10000" > /sys/fs/cgroup/important/io.latency # 10ms target
# The kernel samples completion latency for this cgroup in time windows.
# If observed latency exceeds the 10ms target, sibling cgroups are throttled
# by shrinking their effective queue depth.
# io.latency is an SLA-style protection, not a hard limit:
# - The protected cgroup's I/O is prioritized
# - Siblings are throttled even if they are within their io.max budget
# When io.latency is in use, io.stat gains extra fields for the group:
cat /sys/fs/cgroup/important/io.stat
# 8:0 rbytes=... wbytes=... rios=... wios=... dbytes=... dios=... depth=... avg_lat=...
# depth = queue depth currently allowed for this group (shrinks under throttling)
# avg_lat = running average completion latency, in microseconds
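Because io.latency takes its target in microseconds, a trivial helper (hypothetical) avoids unit mistakes:

```shell
# io.latency targets are written in microseconds; convert from milliseconds.
ms_to_us() {
    echo $(( $1 * 1000 ))
}

ms_to_us 10   # prints: 10000

# Usage (needs root), 10ms target on /dev/sda (8:0):
#   echo "8:0 target=$(ms_to_us 10)" > /sys/fs/cgroup/important/io.latency
```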
How io.latency works
/* block/blk-iolatency.c (abridged) */
/* The iolatency monitor tracks per-I/O completion latency for each cgroup: */
struct iolatency_grp {
	struct latency_stat __percpu *stats; /* per-cpu latency samples */
	struct blk_iolatency *blkiolat;
	u64 min_lat_nsec; /* target from io.latency */
	u64 cur_win_nsec; /* current monitoring window */
	atomic64_t window_start;
	atomic_t scale_cookie; /* drives throttling of competing cgroups */
	/* ... */
};
/* When an I/O completes (simplified; the real code walks the blkg
 * hierarchy and evaluates latencies once per window): */
static void blkcg_iolatency_done_bio(struct rq_qos *rqos, struct bio *bio)
{
	u64 now = ktime_get_ns();
	u64 issue_time = bio_issue_time(&bio->bi_issue);
	u64 lat = now - issue_time;

	/* Record the sample in the current window */
	iolatency_record_time(iolat, lat);

	/* At window end, compare against the target and rescale peers */
	if (now > atomic64_read(&iolat->window_start) + iolat->cur_win_nsec)
		iolatency_check_latencies(iolat, now);
}
io.stat: I/O statistics
# Per-device I/O stats for a cgroup:
cat /sys/fs/cgroup/mycgroup/io.stat
# 8:0 rbytes=1234567890 wbytes=987654321 rios=12345 wios=9876 dbytes=0 dios=0
# Fields:
# rbytes/wbytes: read/write bytes completed
# rios/wios: read/write I/O count
# dbytes/dios: discard (TRIM) bytes/count
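io.stat lines are flat key=value pairs per device; a helper (hypothetical) pulls out a single field:

```shell
# Extract one field (e.g. rbytes) from io.stat output for a given device.
iostat_field() {
    # $1: device MAJ:MIN, $2: field name, stdin: io.stat contents
    awk -v dev="$1" -v key="$2" '$1 == dev {
        for (i = 2; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == key) print kv[2]
        }
    }'
}

echo "8:0 rbytes=1234567890 wbytes=987654321 rios=12345 wios=9876 dbytes=0 dios=0" \
    | iostat_field 8:0 rbytes
# prints: 1234567890

# Live usage: iostat_field 8:0 wbytes < /sys/fs/cgroup/mycgroup/io.stat
```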
Writeback attribution
Dirty-page writeback is charged to the cgroup that dirtied the pages. This requires CONFIG_CGROUP_WRITEBACK, with both the memory and io controllers enabled on the cgroup:
# Writeback I/O appears in the owning cgroup's io.stat as wbytes/wios:
cat /sys/fs/cgroup/mycgroup/io.stat
# Writeback also honors the cgroup's io.max write limit:
echo "8:0 wbps=10485760" > /sys/fs/cgroup/mycgroup/io.max # throttle writeback to 10MB/s
# Global dirty thresholds remain system-wide knobs, not per-cgroup:
echo 80 > /proc/sys/vm/dirty_ratio # allow dirty pages up to 80% of RAM
/* fs/fs-writeback.c, mm/page-writeback.c: an inode is attributed to the
 * cgroup that dirties the majority of its pages; foreign-inode detection
 * switches ownership when another cgroup becomes the dominant dirtier */
Container I/O isolation
Docker/containerd use io.max and io.weight for container isolation:
# Docker: set device read rate
docker run --device-read-bps /dev/sda:50mb --device-write-bps /dev/sda:25mb \
my-image
# Docker: set I/O weight (range 10-1000, default 500; a relative share)
docker run --blkio-weight 500 my-image
# Kubernetes has no first-class block-I/O resource in pod specs; I/O limits
# must be applied through the container runtime or node-level configuration.
Further reading
- cgroup v2 Architecture — cgroup v2 hierarchy and controllers
- Resource Controllers — memory, cpu, and io controllers
- I/O Schedulers — BFQ scheduler that io.weight uses
- NVMe Driver — block device underneath the io controller
- block/blk-throttle.c — io.max throttling implementation
- block/blk-iolatency.c — io.latency implementation
- block/bfq-iosched.c — BFQ I/O scheduler with weight support