RT Scheduler: SCHED_FIFO and SCHED_RR
Fixed-priority real-time scheduling with priority-based preemption
What the RT scheduler provides
The RT scheduler (rt_sched_class) gives tasks strict priority guarantees: an RT task at priority N will always preempt any fair-class task and any RT task at priority < N. There is no fairness, no vruntime, no time-sharing between tasks of different priorities — the highest-priority runnable task runs, period.
Two policies share the RT class:
| Policy | Timeslice | Preemption |
|---|---|---|
| SCHED_FIFO | None: runs until it blocks or yields | By a higher-priority task only |
| SCHED_RR | 100 ms (default) | By a higher-priority task, or by a same-priority task when its slice expires |
RT priorities range from 1 (lowest) to 99 (highest). This is the reverse of nice values: higher number = higher priority.
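The range need not be hard-coded in user space: POSIX provides sched_get_priority_min() and sched_get_priority_max() to query it per policy. The sketch below wraps them in a hypothetical helper, rt_priority_is_valid(), which is an illustrative name and not part of any library.

```c
#include <assert.h>
#include <sched.h>
#include <stdbool.h>

/* Returns true if prio is a legal static priority for the given policy.
 * The range is queried at run time rather than hard-coded. */
bool rt_priority_is_valid(int policy, int prio)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    assert(lo != -1 && hi != -1); /* -1 means the policy is unknown */
    return prio >= lo && prio <= hi;
}
```

On Linux this accepts 1 through 99 for SCHED_FIFO and SCHED_RR, and only 0 for SCHED_OTHER.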
The RT runqueue
// kernel/sched/sched.h
struct rt_prio_array {
DECLARE_BITMAP(bitmap, MAX_RT_PRIO + 1); // 101-bit bitmap (+1 for delimiter)
struct list_head queue[MAX_RT_PRIO]; // 100 per-priority FIFO queues
};
struct rt_rq {
struct rt_prio_array active; // ready-to-run tasks by priority
unsigned int rt_nr_running; // total runnable RT tasks
int rt_queued; // is this rt_rq counted in the parent rq's nr_running?
// (throttling fields when CONFIG_RT_GROUP_SCHED)
};
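MAX_RT_PRIO = 100 (defined in include/linux/sched/prio.h). The bitmap has one bit per priority level. The kernel stores RT priorities inverted: a task with user-visible priority P has internal priority MAX_RT_PRIO - 1 - P, so a lower index means a higher priority. When a task becomes runnable, the bit for its internal priority is set. Task selection is: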
int idx = sched_find_first_bit(rt_rq->active.bitmap);
// idx is the highest-priority non-empty queue
next = list_first_entry(&rt_rq->active.queue[idx], ...);
This is O(1) regardless of how many RT tasks exist: the bitmap fits in a few machine words (two on 64-bit systems), and each word is scanned with a single find-first-set instruction (bsf or tzcnt on x86).
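The bitmap lookup can be modeled in user space. The following is a minimal sketch, not kernel code: struct prio_bitmap and its helpers are hypothetical names, the bitmap is two 64-bit words, and __builtin_ctzll is a GCC/Clang builtin standing in for the kernel's find-first-bit primitive. It assumes the kernel's inverted indexing, where index 0 is the highest priority.

```c
#include <assert.h>
#include <stdint.h>

#define NPRIO 100 /* mirrors MAX_RT_PRIO; index 0 is the highest priority */

struct prio_bitmap {
    uint64_t word[2]; /* 128 bits, enough for NPRIO levels */
};

/* Mark the queue at internal priority idx as non-empty. */
void prio_set(struct prio_bitmap *b, int idx)
{
    assert(idx >= 0 && idx < NPRIO);
    b->word[idx / 64] |= UINT64_C(1) << (idx % 64);
}

/* Mark the queue at internal priority idx as empty. */
void prio_clear(struct prio_bitmap *b, int idx)
{
    assert(idx >= 0 && idx < NPRIO);
    b->word[idx / 64] &= ~(UINT64_C(1) << (idx % 64));
}

/* Return the lowest set index (the highest priority), or NPRIO if none. */
int prio_find_first(const struct prio_bitmap *b)
{
    if (b->word[0])
        return __builtin_ctzll(b->word[0]);
    if (b->word[1])
        return 64 + __builtin_ctzll(b->word[1]);
    return NPRIO;
}

/* Convert a user-visible RT priority (1..99) to the internal index. */
int user_to_index(int user_prio)
{
    assert(user_prio >= 1 && user_prio <= 99);
    return NPRIO - 1 - user_prio;
}
```

The cost of prio_find_first() is bounded by the number of words, not by the number of queued tasks, which is the property the O(1) claim rests on.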
struct sched_rt_entity
Every RT task has a sched_rt_entity embedded in its task_struct:
// include/linux/sched.h
struct sched_rt_entity {
struct list_head run_list; // node in priority queue
unsigned long timeout; // RLIMIT_RTTIME watchdog
unsigned long watchdog_stamp; // jiffies at last check
unsigned int time_slice; // remaining RR timeslice
unsigned short on_rq; // currently queued?
unsigned short on_list; // on active priority list?
// (group scheduling fields when CONFIG_RT_GROUP_SCHED)
};
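The timeout and watchdog_stamp fields back the RLIMIT_RTTIME resource limit, which user space sets through setrlimit(2). A minimal sketch follows; set_rttime_limit() is a hypothetical wrapper name, and the limits are in microseconds of CPU time consumed without a blocking system call.

```c
#include <assert.h>
#include <sys/resource.h>

/* Cap the CPU time (in microseconds) an RT task may consume without
 * making a blocking system call. Returns 0 on success, -1 on error. */
int set_rttime_limit(rlim_t soft_us, rlim_t hard_us)
{
    struct rlimit lim = { .rlim_cur = soft_us, .rlim_max = hard_us };

    assert(soft_us <= hard_us);
    return setrlimit(RLIMIT_RTTIME, &lim);
}
```

For example, set_rttime_limit(500000, 1000000) asks for SIGXCPU after 0.5 s of uninterrupted RT execution and SIGKILL at 1 s, as described in setrlimit(2).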
Task selection: pick_task_rt()
// kernel/sched/rt.c
static struct task_struct *pick_task_rt(struct rq *rq)
{
struct sched_rt_entity *rt_se = pick_next_rt_entity(&rq->rt);
if (!rt_se)
return NULL;
return rt_task_of(rt_se);
}
pick_next_rt_entity() finds the first-set bit in the priority bitmap, then returns the head of that queue. For SCHED_RR tasks at equal priority, the queue is FIFO — the task at the front ran least recently (has been waiting longest) and goes to the back after its slice.
SCHED_RR timeslice rotation
On each timer tick, task_tick_rt() runs:
// kernel/sched/rt.c
static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
{
struct sched_rt_entity *rt_se = &p->rt;
// Watchdog: enforce RLIMIT_RTTIME (SIGXCPU at the soft limit, SIGKILL at the hard limit)
watchdog(rq, p);
// Only SCHED_RR uses timeslices
if (p->policy != SCHED_RR)
return;
if (--rt_se->time_slice)
return; // still has time left
// Timeslice expired: reset and requeue at back
rt_se->time_slice = sched_rr_timeslice;
if (rt_se->run_list.prev != rt_se->run_list.next) {
// Other tasks at same priority: move to back of queue
requeue_task_rt(rq, p, 0);
resched_curr(rq);
}
}
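The rotation logic can be exercised outside the kernel with a small model. This is a simplified sketch under stated assumptions: struct rr_queue and rr_tick() are hypothetical names, the queue holds tasks of a single priority in an array, and one shared remaining counter replaces the per-task time_slice field.

```c
#include <assert.h>

#define MAX_TASKS 8

struct rr_queue {
    int task[MAX_TASKS]; /* task IDs; task[0] is the running task */
    int len;
    int slice;           /* full timeslice, in ticks */
    int remaining;       /* ticks left for task[0] */
};

/* Account one timer tick. Returns 1 if the head task was rotated
 * to the back of the queue, 0 otherwise. */
int rr_tick(struct rr_queue *q)
{
    assert(q->len > 0 && q->len <= MAX_TASKS);

    if (--q->remaining)
        return 0;               /* slice not yet used up */

    q->remaining = q->slice;    /* refill for whoever runs next */
    if (q->len == 1)
        return 0;               /* alone at this priority: keep running */

    int head = q->task[0];
    for (int i = 1; i < q->len; i++)
        q->task[i - 1] = q->task[i];
    q->task[q->len - 1] = head; /* requeue at the tail */
    return 1;
}
```

With three tasks and a two-tick slice, the head changes every second tick, and a task that is alone at its priority is never rotated, matching the run_list check in task_tick_rt().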
The default SCHED_RR timeslice is 100ms:
// include/linux/sched/rt.h
#define RR_TIMESLICE (100 * HZ / 1000)
// Configurable via sysctl:
// /proc/sys/kernel/sched_rr_timeslice_ms (default: 100)
int sched_rr_timeslice = RR_TIMESLICE;
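RR_TIMESLICE is expressed in timer ticks (jiffies), so its numeric value depends on the kernel's HZ setting. The arithmetic can be sketched as follows; rr_timeslice_ticks() and ticks_to_ms() are hypothetical helper names, and integer division is used just as in the macro.

```c
#include <assert.h>

/* Timeslice in timer ticks for a given tick rate, following the
 * RR_TIMESLICE formula: 100 ms worth of ticks at hz ticks per second. */
int rr_timeslice_ticks(int hz)
{
    assert(hz > 0);
    return 100 * hz / 1000;
}

/* Convert a timeslice in ticks back to milliseconds. */
int ticks_to_ms(int ticks, int hz)
{
    assert(hz > 0);
    return ticks * 1000 / hz;
}
```

At HZ=250 the slice is 25 ticks, at HZ=1000 it is 100 ticks; both correspond to 100 ms. From user space, sched_rr_get_interval(2) reports the effective interval for a given process.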
RT throttling
An RT task that never sleeps will starve every normal task on its CPU. The kernel protects against this with RT throttling:
// kernel/sched/rt.c
int sysctl_sched_rt_period = 1000000; // 1 second (µs)
int sysctl_sched_rt_runtime = 950000; // 950ms (µs)
RT tasks collectively can use at most rt_runtime / rt_period = 95% of CPU time. The remaining 5% is reserved for SCHED_NORMAL tasks (including the shell, so you can kill -9 a runaway RT task).
# View/change RT throttle settings
cat /proc/sys/kernel/sched_rt_period_us # default: 1000000 (1s)
cat /proc/sys/kernel/sched_rt_runtime_us # default: 950000 (950ms)
# Disable throttling (dangerous — can freeze the system)
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
Setting sched_rt_runtime_us to -1 disables throttling entirely. Only do this if you fully trust your RT tasks.
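The relationship between the two sysctl values can be written out as plain arithmetic. The helpers below are a hypothetical sketch (the names are not kernel or libc functions); they take the values as read from the two /proc files and treat -1 as "throttling disabled".

```c
#include <assert.h>

/* CPU time per period available to RT tasks, as a whole percentage.
 * runtime_us == -1 means throttling is disabled (RT may use 100%). */
int rt_bandwidth_percent(long runtime_us, long period_us)
{
    assert(period_us > 0);
    if (runtime_us < 0)
        return 100;
    if (runtime_us > period_us)
        runtime_us = period_us;
    return (int)(runtime_us * 100 / period_us);
}

/* Microseconds per period left over for non-RT tasks. */
long non_rt_reserve_us(long runtime_us, long period_us)
{
    assert(period_us > 0);
    if (runtime_us < 0 || runtime_us > period_us)
        return 0;
    return period_us - runtime_us;
}
```

With the defaults (950000 and 1000000), RT tasks get 95% and non-RT tasks keep 50000 µs, that is 50 ms, of every second.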
Setting RT policy
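// Set SCHED_FIFO at priority 50
struct sched_param param = { .sched_priority = 50 };
if (sched_setscheduler(pid, SCHED_FIFO, &param) == -1)
    perror("sched_setscheduler");
// Or via sched_setattr (more flexible)
// Older glibc releases have no sched_setattr() wrapper, so use syscall(2)
struct sched_attr attr = {
    .size = sizeof(attr), // tells the kernel which struct version is in use
    .sched_policy = SCHED_FIFO,
    .sched_priority = 50,
};
if (syscall(SYS_sched_setattr, 0, &attr, 0) == -1)
    perror("sched_setattr");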
Setting an RT policy requires CAP_SYS_NICE (or running as root), unless the process has a nonzero RLIMIT_RTPRIO soft limit, in which case it may set RT priorities up to that limit.
# Set process to SCHED_FIFO priority 80
chrt -f 80 ./my_realtime_app
# Set existing process
chrt -f -p 80 $PID
# Check scheduling policy
chrt -p $PID
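In a program, the privilege check surfaces as an error from sched_setscheduler(). The sketch below shows one way to report it; try_set_fifo() and max_unprivileged_rt_prio() are hypothetical helper names. A process without CAP_SYS_NICE typically gets EPERM; on Linux the RLIMIT_RTPRIO soft limit (see setrlimit(2)) is the documented exception, and the second helper reads it.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/resource.h>

/* Try to switch the calling thread to SCHED_FIFO at the given priority.
 * Returns 0 on success or a negative errno value on failure. */
int try_set_fifo(int prio)
{
    struct sched_param param = { .sched_priority = prio };

    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1)
        return -errno;
    return 0;
}

/* Highest RT priority an unprivileged process may request, taken
 * from the RLIMIT_RTPRIO soft limit. Returns -1 on error. */
long max_unprivileged_rt_prio(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_RTPRIO, &lim) == -1)
        return -1;
    return lim.rlim_cur > 99 ? 99 : (long)lim.rlim_cur;
}
```

A caller can check max_unprivileged_rt_prio() first and fall back to a normal policy when it returns 0, instead of failing at startup.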
SMP considerations: pushable tasks
On multi-CPU systems, the RT scheduler maintains a pushable tasks list — RT tasks that could be migrated to another CPU if their current CPU is occupied by a higher-priority RT task.
When a higher-priority RT task arrives on a CPU that's already running an RT task, the lower-priority task is "pushed" to another CPU:
Before:                   After:
CPU 0: RT-90 (running)    CPU 0: RT-90 (running)
CPU 1: RT-70 (running)    CPU 1: RT-80 (running, migrated here)
                          CPU 1: RT-70 → pushed to CPU 2
CPU 2: normal task        CPU 2: RT-70 (migrated)
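The push_rt_task() and pull_rt_task() functions handle this migration. push_rt_task() moves a waiting RT task off an overloaded CPU to one running lower-priority work; pull_rt_task() runs when a CPU is about to switch to a lower-priority task and fetches waiting RT tasks from other CPUs.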
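The target choice for a push can be illustrated with a toy model. This is a simplified sketch, not the kernel's algorithm: find_push_target() is a hypothetical function, CPUs running non-RT work are modeled as priority 0, and CPU affinity and cache topology, which the kernel also takes into account, are ignored.

```c
#include <assert.h>

#define PRIO_NORMAL 0 /* model: CPU running a non-RT task */

/* Pick a CPU for a waiting RT task of priority prio (1..99).
 * cpu_prio[i] is the priority of the task running on CPU i.
 * Returns the CPU running the lowest-priority task, provided that
 * task has lower priority than prio; returns -1 if no CPU qualifies. */
int find_push_target(const int *cpu_prio, int ncpus, int prio)
{
    int best = -1;

    assert(ncpus > 0 && prio >= 1 && prio <= 99);
    for (int i = 0; i < ncpus; i++) {
        if (cpu_prio[i] >= prio)
            continue; /* cannot preempt an equal or higher priority */
        if (best == -1 || cpu_prio[i] < cpu_prio[best])
            best = i;
    }
    return best;
}
```

In the scenario above, once RT-80 is running on CPU 1 the per-CPU priorities are {90, 80, 0}, and the waiting RT-70 task is sent to CPU 2.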
Practical patterns
Audio/video processing
// Realtime audio thread: SCHED_FIFO, priority ~70
// High enough to preempt all normal tasks, with headroom above
// for more critical RT threads. SCHED_DEADLINE tasks preempt
// SCHED_FIFO tasks regardless of RT priority.
struct sched_param param = { .sched_priority = 70 };
sched_setscheduler(0, SCHED_FIFO, &param);
// Process audio in tight loop, sleep on condition variable
while (running) {
pthread_mutex_lock(&buf_lock);
while (buf_empty)
pthread_cond_wait(&buf_cond, &buf_lock);
process_audio();
pthread_mutex_unlock(&buf_lock);
}
Avoiding priority inversion
When using mutexes in RT tasks, set the mutex protocol to PTHREAD_PRIO_INHERIT with pthread_mutexattr_setprotocol() to enable priority inheritance (PI):
pthread_mutexattr_t attr;
pthread_mutexattr_init(&attr);
pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
pthread_mutex_init(&mutex, &attr);
Without PI, a low-priority task holding a mutex can block a high-priority RT task indefinitely: any medium-priority task can preempt the holder and keep it from releasing the lock. This is the classic priority inversion problem. See Priority Inversion and PI Mutexes.
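Each of the pthread calls involved returns an error number, which is easy to overlook. A minimal sketch with error handling follows; init_pi_mutex() is a hypothetical helper name, and _GNU_SOURCE is defined only to make sure PTHREAD_PRIO_INHERIT is visible under strict compiler modes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Initialize *m as a priority-inheritance mutex.
 * Returns 0 on success or a positive error number, following the
 * pthread convention. */
int init_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int err;

    assert(m != NULL);
    err = pthread_mutexattr_init(&attr);
    if (err)
        return err;

    err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (!err)
        err = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr); /* attr is no longer needed */
    return err;
}
```

A nonzero return, for example ENOTSUP on a system without PI futex support, is a signal to fail fast: silently falling back to a plain mutex would reintroduce the inversion risk.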
Further reading
- SCHED_DEADLINE — Even stricter guarantees with bandwidth enforcement
- Priority Inversion and PI Mutexes — The RT locking problem and solution
- Scheduler Classes — RT class in the class hierarchy