CPU Bandwidth Control
CFS bandwidth throttling: how the kernel enforces hard CPU limits
What bandwidth control does
CPU weight (shares) controls the proportion of CPU a group gets when the system is busy. Bandwidth control caps the absolute amount: a group capped at 50% of one CPU is throttled at 50% even if every other CPU in the machine is idle.
This matters for:
- Predictable latency: ensure a group never monopolizes the CPU
- Multi-tenant isolation: prevent one tenant from starving others
- Resource accounting: know exactly how much CPU a container is using
How it's configured
# v2 interface (unified)
# Format: "quota_us period_us" or "max period_us" for unlimited
cat /sys/fs/cgroup/mygroup/cpu.max
# 200000 100000 → 200ms every 100ms = 2 CPUs worth
echo "50000 100000" > /sys/fs/cgroup/mygroup/cpu.max # 50% of 1 CPU
# v1 interface (separate files)
echo 50000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us
echo 100000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us
The bandwidth is quota / period. With a quota of 200ms and a period of 100ms, the group can consume two CPUs' worth of time concurrently (200% utilization). The quota is a budget of total CPU time summed across all CPUs, not a per-CPU allowance.
The bandwidth pool: struct cfs_bandwidth
Each cgroup has one global cfs_bandwidth pool:
// kernel/sched/sched.h
struct cfs_bandwidth {
    raw_spinlock_t   lock;
    ktime_t          period;           // e.g., 100ms
    u64              quota;            // e.g., 50ms (in ns)
    u64              runtime;          // remaining this period
    u64              burst;            // burst allowance above quota
    struct hrtimer   period_timer;     // refills runtime each period
    struct hrtimer   slack_timer;      // deferred unthrottle batching
    struct list_head throttled_cfs_rq; // per-CPU cfs_rqs currently throttled
    int              nr_periods;       // statistics: periods elapsed
    int              nr_throttled;     // statistics: periods with throttling
    u64              throttled_time;   // statistics: total time spent throttled
};
At the start of each period, runtime is refilled from quota. As tasks run, runtime drains. Once the pool is empty, each CPU's cfs_rq for this group is throttled as soon as it exhausts its local slice and fails to obtain more.
Distribution: pool → per-CPU slice
Tasks don't drain the global pool directly. Instead, each CPU grabs a slice from the pool when needed:
// kernel/sched/fair.c
static u64 sched_cfs_bandwidth_slice(void)
{
    // Default: 5ms (sysctl: /proc/sys/kernel/sched_cfs_bandwidth_slice_us)
    return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
}
flowchart LR
POOL["Global pool<br/>runtime = 50ms remaining"]
CPU0["CPU 0's cfs_rq<br/>local_runtime = 5ms"]
CPU1["CPU 1's cfs_rq<br/>local_runtime = 5ms"]
CPU2["CPU 2's cfs_rq<br/>exhausted → throttled"]
POOL -->|"assign 5ms slice"| CPU0
POOL -->|"assign 5ms slice"| CPU1
POOL -->|"pool empty"| CPU2
When a CPU's local runtime runs out, it calls assign_cfs_rq_runtime() to grab another slice. If the pool is empty, that CPU's cfs_rq for the group is throttled.
assign_cfs_rq_runtime()
// kernel/sched/fair.c
// Returns nonzero if runtime was assigned; zero means the caller must throttle
static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
    struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
    int ret;

    raw_spin_lock(&cfs_b->lock);
    ret = __assign_cfs_rq_runtime(cfs_b, cfs_rq, sched_cfs_bandwidth_slice());
    raw_spin_unlock(&cfs_b->lock);

    return ret;
}
Throttling: throttle_cfs_rq()
When a CPU's local runtime drains to zero and the global pool is empty:
// kernel/sched/fair.c (simplified)
static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
    struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);

    // Add to the group's list of throttled runqueues
    list_add_tail(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
    cfs_rq->throttled = 1;

    // Walk up the group-scheduling hierarchy via for_each_sched_entity:
    // dequeue the group's sched_entity from each parent runqueue
    // so tasks in this group can no longer be selected
    return true;
}
After throttling, the group's sched_entity is removed from the parent runqueue. pick_next_task() will never select a task from this group until it's unthrottled.
Refill: the period timer
At the end of each period, period_timer fires and refills the pool:
// kernel/sched/fair.c (simplified)
static void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
{
    cfs_b->runtime += cfs_b->quota;  /* add a full quota's worth */
    /* Cap accumulated credit at quota + burst: runtime left over from
     * idle periods carries forward as burst allowance, but no further */
    cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
}
After refilling, unthrottle_cfs_rq() is called for each throttled CPU:
// kernel/sched/fair.c (simplified)
static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
{
    cfs_rq->throttled = 0;
    // Re-enqueue the group's sched_entity into each parent runqueue
    // via the for_each_sched_entity walk; tasks can run again
}
Burst: absorbing transient spikes
cpu.max.burst (v2) / cpu.cfs_burst_us (v1) allows a group to temporarily exceed its quota by consuming saved-up credit:
# Allow 20ms burst above the 50ms quota (max 70ms in a period)
echo "50000 100000" > /sys/fs/cgroup/mygroup/cpu.max # quota period
echo "20000" > /sys/fs/cgroup/mygroup/cpu.max.burst # burst (separate file)
This helps workloads with bursty behavior (startup, GC pauses) avoid throttling while still maintaining average rate limits.
Slack timer: batching unthrottles
Unthrottling each CPU individually as soon as runtime becomes available would cause many small lock acquisitions. The slack timer batches unthrottles:
// When runtime is returned to the pool (e.g., task blocks before using its slice),
// instead of unthrottling immediately, arm the slack timer for 5ms
// and batch multiple unthrottles together.
This reduces lock contention on cfs_b->lock at the cost of up to 5ms extra throttle time.
Diagnosing throttling
Reading cpu.stat
cat /sys/fs/cgroup/mygroup/cpu.stat
nr_periods 100 # bandwidth periods elapsed
nr_throttled 23 # periods where group was throttled
throttled_usec 450000 # total throttle time (450ms)
nr_bursts 5 # periods where burst was used
burst_usec 15000 # total burst time (15ms)
A high nr_throttled / nr_periods ratio (say, >10%) indicates the quota is too low for the workload.
Tracing bandwidth events
# Not all kernels expose a stable tracepoint for CFS throttling; one
# portable approach is to attach kprobes to the throttle/unthrottle
# functions (requires root and kernel debug symbols)
perf probe --add throttle_cfs_rq
perf probe --add unthrottle_cfs_rq
perf record -e probe:throttle_cfs_rq -e probe:unthrottle_cfs_rq -a sleep 10
perf report
Common causes of unexpected throttling
Multi-threaded workloads: If a group has 4 threads all running simultaneously, they collectively consume 4x the per-thread rate. A quota of 100ms/100ms = 100% is only 1 CPU — 4 threads will consume it in 25ms and be throttled for 75ms.
Solution: Set quota = target_utilization × period × number_of_CPUs_you_want_to_use.
# Allow 2 CPUs: 200ms quota / 100ms period = 2.0 CPUs
echo "200000 100000" > /sys/fs/cgroup/mygroup/cpu.max
Short-lived bursts: Tasks that do heavy work for 50ms then sleep will hit throttling during the heavy phase. Burst budget can absorb this.
Bandwidth slice tuning
The default 5ms slice is a trade-off:
- Too small (< 1ms): frequent slice refills mean high lock contention on cfs_b->lock
- Too large (> 10ms): each CPU can hold budget it never uses, stranding quota that other CPUs need, and throttle enforcement becomes coarser
# Default: 5000 µs
cat /proc/sys/kernel/sched_cfs_bandwidth_slice_us
# For latency-sensitive workloads, reduce to 1ms:
echo 1000 > /proc/sys/kernel/sched_cfs_bandwidth_slice_us
Further reading
- CPU cgroup v1 vs v2 — The cgroup interface for bandwidth and weight
- cpuset — Restrict which CPUs a group's tasks may run on
- CFS — The fair scheduler that bandwidth control wraps around