The Timer Wheel
jiffies, struct timer_list, and the hierarchical timer wheel for low-resolution timers
Two timer systems
Linux maintains two distinct timer implementations that serve different use cases:
| System | Precision | Data structure | Insert cost | Typical use |
|---|---|---|---|---|
| hrtimer | ~nanosecond | Per-CPU red-black tree | O(log n) | sleep(), audio, network pacing, real-time |
| timer_list (timer wheel) | ~1/HZ (1–10 ms) | Per-CPU hierarchical wheel | O(1) | TCP retransmit, watchdogs, delayed work |
hrtimers program hardware clockevents directly for precise wakeups. The timer wheel is designed for timeouts that are frequently cancelled before expiry (e.g., TCP retransmit timers that are cancelled when an ACK arrives). The O(1) insert and cancel cost is the key advantage — most timer-list timers never actually fire.
jiffies and HZ
jiffies is a global counter incremented once per hardware tick. It is the fundamental unit of time for the timer wheel.
/* include/linux/jiffies.h */
extern unsigned long volatile jiffies; /* wraps at ULONG_MAX */
extern u64 jiffies_64; /* 64-bit, no wraparound in practice */
HZ is the number of jiffies per second, configured at compile time:
| CONFIG_HZ | Tick period | Typical use |
|---|---|---|
| 100 | 10 ms | Servers, embedded (CONFIG_HZ_100) |
| 250 | 4 ms | Desktop default on many distros |
| 1000 | 1 ms | Low-latency desktops, real-time |
Conversion helpers:
msecs_to_jiffies(500) /* ms → jiffies; safe with HZ-independent code */
jiffies_to_msecs(jiffies) /* jiffies → ms */
usecs_to_jiffies(100) /* us → jiffies (rounds up) */
nsecs_to_jiffies(1000000UL) /* ns → jiffies */
Always use these helpers rather than computing HZ * seconds directly — the helpers handle rounding correctly and are portable across HZ configurations.
struct timer_list
/* include/linux/timer.h */
struct timer_list {
        struct hlist_node entry;                 /* linkage in the wheel bucket */
        unsigned long expires;                   /* absolute expiry in jiffies */
        void (*function)(struct timer_list *);   /* callback */
        u32 flags;                               /* encoded CPU affinity + TIMER_* flags */
};
Key fields:
- expires: absolute jiffies value when the timer should fire. Set via add_timer() or mod_timer().
- function: called from softirq context (TIMER_SOFTIRQ) on the CPU the timer is pinned to. Must not sleep.
- flags: encodes the target CPU in the upper bits and TIMER_* flags in the lower bits.
Initializing a timer:
/* Static init: declares and initializes in one step */
static DEFINE_TIMER(my_timer, my_callback);
/* Dynamic init: declare, then bind the callback */
struct timer_list my_timer;
timer_setup(&my_timer, my_callback, 0);
timer_setup() replaced the older init_timer() + setup_timer() pattern (changed in Linux 4.15) — the callback now receives the struct timer_list * directly rather than an opaque unsigned long data argument.
The public API
add_timer()
Inserts a timer into the wheel. timer->expires must be set before calling. The timer fires no earlier than expires but may fire slightly later if the CPU is busy.
timer_setup(&my_timer, my_callback, 0);
my_timer.expires = jiffies + msecs_to_jiffies(500);
add_timer(&my_timer);
mod_timer()
Changes the expiry of a pending timer atomically. If the timer is not pending, it is added. Returns 1 if the timer was pending, 0 if it was not. This is the preferred way to both add and modify a timer in one call.
timer_reduce()
Like mod_timer() but only moves the expiry earlier. If the timer is already set to expire at or before expires, the call is a no-op. Added in Linux 4.15. This prevents a driver from accidentally delaying a timer that was already scheduled to fire sooner — useful when multiple code paths may arm the same timer.
del_timer() vs del_timer_sync()
del_timer() removes the timer from the wheel. If the callback is currently running on another CPU, del_timer() returns immediately — the callback continues executing. This can cause use-after-free if you free data the callback touches immediately after del_timer().
del_timer_sync() waits until the callback has finished on all CPUs before returning. Always use del_timer_sync() before freeing resources a timer callback accesses.
Since Linux 6.2, timer_shutdown_sync() marks the timer as permanently dead after stopping it, so any subsequent add_timer() or mod_timer() calls are silently ignored. Use this when a timer must never fire again (e.g., during driver removal).
The hierarchical wheel (Linux 4.8+)
Linux 4.8 replaced the old five-level cascade wheel with a hierarchical wheel design (kernel/time/timer.c). The old design required expensive cascade operations when the wheel's hand advanced past a level boundary; the new design uses a simpler, more cache-friendly structure.
struct timer_base
/* kernel/time/timer.c (internal) */
struct timer_base {
        raw_spinlock_t lock;
        struct timer_list *running_timer;        /* callback executing now */
        unsigned long clk;                       /* last processed jiffies value */
        unsigned long next_expiry;               /* earliest pending expiry */
        unsigned int cpu;
        bool is_idle;
        bool timers_pending;
        DECLARE_BITMAP(pending_map, WHEEL_SIZE);
        struct hlist_head vectors[WHEEL_SIZE];   /* the buckets */
};
Each CPU has two timer_base instances: one for standard timers and one for deferrable timers (TIMER_DEFERRABLE).
Wheel geometry
The number of levels depends on the configured HZ:
- HZ > 100 (e.g., HZ=250 or HZ=1000, the default for desktop/server kernels): LVL_DEPTH = 9, giving 9 × 64 = 576 buckets.
- HZ ≤ 100 (e.g., embedded/low-power kernels): LVL_DEPTH = 8, giving 8 × 64 = 512 buckets.
The table below shows the geometry for HZ=250 (LVL_DEPTH = 9), each with 64 buckets (LVL_SIZE = 64):
| Level | Granularity | Range covered |
|---|---|---|
| 0 | 1 jiffy | 0 – 63 jiffies |
| 1 | 8 jiffies | 64 – 511 jiffies |
| 2 | 64 jiffies | 512 – 4,095 jiffies |
| 3 | 512 jiffies | 4,096 – 32,767 jiffies |
| 4 | 4,096 jiffies | 32,768 – 262,143 jiffies |
| 5 | 32,768 jiffies | 262,144 – 2,097,151 jiffies |
| 6 | 262,144 jiffies | 2,097,152 – 16,777,215 jiffies |
| 7 | 2,097,152 jiffies | 16,777,216 – 134,217,727 jiffies |
| 8 | 16,777,216 jiffies | 134,217,728 – 1,073,741,823 jiffies (~49 days at HZ=250) |
Each level is 8× coarser than the previous. When a timer is inserted, the kernel computes which level and bucket it belongs to based on the delta between now and expires. This is a pure bitmask operation — O(1) insert at any range.
Total wheel capacity: 576 buckets (9 × 64) for HZ=250. The pending_map bitmap lets the kernel quickly find the next non-empty bucket without scanning all entries.
__run_timers()
Called indirectly from run_timer_softirq() (the TIMER_SOFTIRQ handler) via the call chain run_timer_softirq() → run_timer_base() → __run_timer_base() → __run_timers() on each tick. It advances base->clk to the current jiffies value, identifies all buckets that have become due, and executes their callbacks. Unlike the pre-4.8 wheel, expired timers are never cascaded between levels: a timer fires directly from the bucket it was placed in, at the granularity of that level. A long timeout can therefore fire somewhat later than its exact expires value; that slack is the price of O(1) operation, and it is acceptable for the timeout-style timers the wheel serves.
The softirq is raised by the tick interrupt handler. On a NOHZ (tickless) system, the tick may be skipped entirely when the CPU is idle, and __run_timers() catches up when the CPU wakes.
Deferrable timers
A deferrable timer is stored in the second timer_base (the deferrable base). These timers expire only when the CPU is already running — they do not wake an idle CPU. If the CPU remains idle past the expiry time, the timer fires when the CPU next wakes for any reason.
Use deferrable timers for periodic maintenance work that is not time-critical (e.g., statistics gathering, buffer flushing). Using them reduces unnecessary wakeups and improves power efficiency on NOHZ systems.
Relationship to workqueues
schedule_delayed_work() combines the timer wheel with the workqueue system:
/* Internally: */
struct delayed_work {
        struct work_struct work;
        struct timer_list timer;   /* fires after the delay */
        /* ... */
};
schedule_delayed_work(&my_dwork, msecs_to_jiffies(100));
When the timer fires, its callback calls queue_work() to push work onto the system workqueue. The actual work runs in a workqueue thread (process context, can sleep) rather than in softirq context. Use schedule_delayed_work() when the deferred action needs to sleep or allocate memory — use a plain timer_list only when the callback is short and cannot sleep.
Observing timers
There is no direct procfs interface for inspecting pending timer-wheel entries (/proc/timer_list covers hrtimer bases and clockevent devices, not the wheel). Use kernel tracing instead:
# Trace __run_timers firings with ftrace
echo '__run_timers' > /sys/kernel/debug/tracing/set_ftrace_filter
echo function > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/trace_pipe
# Count timer callbacks by function with bpftrace
bpftrace -e '
kprobe:call_timer_fn {
@[ksym(arg1)] = count();
}
interval:s:5 { print(@); clear(@); exit(); }
'
# Show per-CPU timer softirq counts
grep TIMER /proc/softirqs
# Count softirq entry/exit events system-wide over 5 seconds
perf stat -e irq:softirq_entry,irq:softirq_exit -a -- sleep 5
Common pitfalls
Jiffies wraparound
jiffies is an unsigned long and wraps at ULONG_MAX. On 32-bit kernels with HZ=1000 this happens every ~49 days. Never compare jiffies with raw > or <:
/* WRONG: breaks at wraparound */
if (jiffies > my_timer.expires) { ... }
/* CORRECT: use the wraparound-safe macros */
if (time_after(jiffies, my_timer.expires)) { ... }
if (time_before(jiffies, deadline)) { ... }
if (time_after_eq(jiffies, start + HZ)) { ... }
time_after(a, b) is defined as shown below. The subtraction happens at unsigned long width first, then the result is cast to long — this is not the same as casting each operand individually:
/* include/linux/jiffies.h */
/* The subtraction is done at unsigned long width; the result is cast to long */
#define time_after(a,b) \
(typecheck(unsigned long, a) && \
typecheck(unsigned long, b) && \
((long)((b) - (a)) < 0))
This handles unsigned wraparound correctly by interpreting the result of the unsigned subtraction as a signed value.
Sleeping in a timer callback
Timer callbacks run in softirq context. They must not sleep, call schedule(), or acquire any lock that could sleep (e.g., a mutex). Use schedule_delayed_work() if the deferred action needs to sleep.