Life of a Context Switch
What happens inside __schedule(), from preemption to resumption
The big picture
A context switch moves the CPU from running one task to running another. It's triggered thousands of times per second and must be as fast as possible — yet it involves locking, memory barrier ordering, architecture-specific register saving, and careful pointer gymnastics to safely hand off execution.
flowchart TB
A["TIF_NEED_RESCHED set<br/>(timer tick / wakeup / yield)"]
B["Preemption point reached<br/>(syscall return / interrupt exit / schedule())"]
C["__schedule(sched_mode)"]
D["Pick next task<br/>(pick_next_task → class hierarchy)"]
E["context_switch(rq, prev, next)"]
F["switch_to(prev, next, prev)<br/>(assembly: save regs, swap stacks)"]
G["finish_task_switch(prev)<br/>(release rq lock, cleanup)"]
H["Next task resumes"]
A --> B --> C --> D --> E --> F --> G --> H
How a task gets marked for rescheduling
Before a context switch can happen, something must decide the current task should give up the CPU. This sets TIF_NEED_RESCHED on the current task:
// kernel/sched/core.c
static void resched_curr(struct rq *rq)
{
	struct task_struct *curr = rq->curr;

	if (test_tsk_need_resched(curr))
		return;
	// ...
	set_tsk_need_resched(curr);
	set_preempt_need_resched();
}
Common triggers:
- Timer tick (task_tick_fair()): current task has run too long
- Wakeup (wakeup_preempt()): a higher-priority task became runnable
- Explicit yield (sched_yield()): task voluntarily gives up the CPU
- Priority change: task's priority lowered below a runnable task
Voluntary vs involuntary switches
The kernel tracks two types of context switches per task:
// include/linux/sched.h
unsigned long nvcsw; /* voluntary context switches */
unsigned long nivcsw; /* involuntary (non-voluntary) context switches */
Voluntary (nvcsw): The task called schedule() directly because it has nothing to do — blocked on I/O, sleeping, waiting for a lock.
Involuntary (nivcsw): The scheduler preempted the task — its timeslice expired or a higher-priority task woke up.
# See voluntary/involuntary switches for a process
cat /proc/$PID/status | grep ctxt
# voluntary_ctxt_switches: 1234
# nonvoluntary_ctxt_switches: 567
A high involuntary count relative to voluntary means the task is being preempted frequently — it always has work to do but keeps losing the CPU to other tasks.
Schedule modes
__schedule() is called with a mode that determines what triggered it:
// kernel/sched/core.c
#define SM_NONE 0 // voluntary (task called schedule())
#define SM_PREEMPT 1 // involuntary (timer/interrupt forced it)
#define SM_RTLOCK_WAIT 2 // RT lock contention (PREEMPT_RT only)
// Recent kernels also define SM_IDLE (-1) for the idle loop's re-entry
// fast path; older kernels had the idle loop call __schedule(SM_NONE).
This affects two things: which counter is incremented (nvcsw vs nivcsw) and whether the task's state is consulted before dequeuing.
Inside __schedule()
// kernel/sched/core.c (simplified)
static void __sched notrace __schedule(int sched_mode)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;

	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;

	// 1. Disable IRQs and lock the runqueue
	local_irq_disable();
	rcu_note_context_switch(sched_mode == SM_PREEMPT);
	rq_lock(rq, &rf);

	// 2. Update the runqueue clock
	update_rq_clock(rq);

	// 3. If this is a voluntary switch and the task has a non-running
	//    state (it's blocking): dequeue it from the runqueue
	if (!(sched_mode & SM_MASK_PREEMPT) && prev->__state) {
		// Task is going to sleep — remove from runqueue
		try_to_block_task(rq, prev, &rf);
		switch_count = &prev->nvcsw;	// voluntary
	} else {
		switch_count = &prev->nivcsw;	// involuntary
	}

	// 4. Clear the resched flag
	clear_tsk_need_resched(prev);
	clear_preempt_need_resched();

	// 5. Pick the next task to run
	next = pick_next_task(rq, prev, &rf);

	// 6. If the same task was picked, there is nothing to do
	if (likely(prev != next)) {
		rq->nr_switches++;
		rq->curr = next;
		++*switch_count;

		// 7. Switch!
		context_switch(rq, prev, next, &rf);
		// --- execution now continues in 'next's context ---
		// When we return here, we are running as whatever
		// task resumed this CPU next.
	} else {
		rq_unpin_lock(rq, &rf);
		__balance_callbacks(rq);
		raw_spin_rq_unlock_irq(rq);
	}
}
context_switch()
Once next is selected, context_switch() does the actual switch:
// kernel/sched/core.c
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
{
	// 1. Pre-switch bookkeeping (tracing, perf, cgroup notifications)
	prepare_task_switch(rq, prev, next);

	// 2. Switch memory mappings
	if (!next->mm) {
		// Kernel thread: borrow prev's page tables (lazy TLB)
		next->active_mm = prev->active_mm;
		enter_lazy_tlb(prev->active_mm, next);
	} else {
		// User task: install next's page tables
		switch_mm_irqs_off(prev->active_mm, next->mm, next);
	}

	// 3. Switch CPU registers and stack
	switch_to(prev, next, prev);
	// ↑ After this line, we are running in next's context.
	//   'prev' now refers to the task that was running before
	//   *next* last yielded (not necessarily prev above).
	barrier();

	// 4. Post-switch cleanup — runs in next's context
	return finish_task_switch(prev);
}
The memory switch
For a user-space task, switch_mm_irqs_off() installs the new task's page tables. On x86-64 with PCID, this is a CR3 write with the PCID bits set — the TLB is not fully flushed if the PCID is still cached.
For a kernel thread (no mm), it borrows the previous task's active_mm via lazy TLB: the page tables aren't switched, but the kernel thread is told "don't expect any user-space TLB entries to be valid."
switch_to(): the assembly heart
switch_to() is architecture-specific. On x86-64:
// arch/x86/include/asm/switch_to.h
#define switch_to(prev, next, last)				\
do {								\
	((last) = __switch_to_asm((prev), (next)));		\
} while (0)
__switch_to_asm (in arch/x86/entry/entry_64.S) saves callee-saved registers and swaps the stack pointer:
# Simplified x86-64 __switch_to_asm
__switch_to_asm:
	# Save prev's callee-saved registers
	pushq %rbp
	pushq %rbx
	pushq %r12
	pushq %r13
	pushq %r14
	pushq %r15

	# Save prev's stack pointer into task_struct->thread.sp
	movq %rsp, TASK_threadsp(%rdi)	# rdi = prev
	# Load next's stack pointer
	movq TASK_threadsp(%rsi), %rsp	# rsi = next

	# Restore next's callee-saved registers
	popq %r15
	popq %r14
	popq %r13
	popq %r12
	popq %rbx
	popq %rbp

	# Jump to __switch_to (C function) for FPU/debug reg handling
	jmp __switch_to
After switch_to() returns, the CPU is running on next's stack with next's saved registers. From the perspective of the code after switch_to(), "the current task" is now next.
The prev pointer puzzle
The third argument to switch_to receives the last task that ran on this CPU before next was scheduled. This is necessary because:
- When next eventually yields and __schedule() runs again, next is now prev
- But what if next was previously on a different CPU? When it ran there, some other task was scheduled out
- The last variable gives the code after switch_to() the right prev to pass to finish_task_switch()
finish_task_switch()
finish_task_switch() runs in the new task's context, cleaning up after the task that was just switched out:
// kernel/sched/core.c
static struct rq *finish_task_switch(struct task_struct *prev)
{
	struct rq *rq = this_rq();
	struct mm_struct *mm = rq->prev_mm;

	rq->prev_mm = NULL;

	// Signal that prev is no longer running on any CPU
	finish_task(prev);	// smp_store_release(&prev->on_cpu, 0)

	// Release the runqueue lock (held since before pick_next_task)
	finish_lock_switch(rq);

	// Drop prev's mm if needed (lazy TLB cleanup)
	if (mm)
		mmdrop_lazy_tlb_sched(mm);

	// If prev is dead, release its resources
	if (unlikely(prev->__state == TASK_DEAD)) {
		// put_task_struct, cgroup cleanup, etc.
		put_task_struct_rcu_user(prev);
	}

	return rq;
}
The key ordering: prev->on_cpu = 0 is set with release semantics before the lock is dropped. Any CPU trying to wake prev spins on on_cpu — this ensures prev is fully off the CPU before a wakeup can re-enqueue it.
The full timeline
sequenceDiagram
participant CPU as CPU N
participant PrevTask as prev task
participant Sched as __schedule()
participant NextTask as next task
CPU->>PrevTask: Running normally
CPU->>PrevTask: Timer IRQ → TIF_NEED_RESCHED set
PrevTask->>Sched: Preemption point (syscall return / interrupt exit)
Sched->>Sched: Lock rq, update clock
Sched->>Sched: pick_next_task() → next
Sched->>CPU: context_switch(prev, next)
CPU->>CPU: switch_mm (page tables)
CPU->>CPU: switch_to (registers, stack)
Note over CPU: Now running in next's context
CPU->>NextTask: finish_task_switch(prev)
NextTask->>NextTask: prev->on_cpu = 0 (release)
NextTask->>NextTask: rq unlock
NextTask->>CPU: Resume executing next's code
Measuring context switch cost
# Count context switches system-wide
vmstat 1 | awk '{print "cs:", $12}'
# Per-process context switches
cat /proc/$PID/status | grep ctxt_switches
# Measure context switch latency with perf
perf stat -e context-switches ./workload
# Detailed scheduling events
perf sched record ./workload
perf sched latency --sort=avg
# Trace scheduler switches in real time
trace-cmd record -e sched:sched_switch sleep 5
trace-cmd report | head -50
Further reading
- Runqueues — struct rq, pick_next_task(), and __schedule() structure
- Scheduler Classes — Which class provides the next task
- What Happens When a Process Wakes Up — How tasks get back onto a runqueue