Kprobes and Tracepoints
Dynamic and static instrumentation points in the kernel
Why three mechanisms?
kprobes, tracepoints, and BPF are not redundant — they solve different parts of a problem that evolved over a decade.
kprobes: dynamic but fragile (2000s)
kprobes arrived in Linux 2.6.9 as a way to instrument production kernels without recompilation. A kprobe on tcp_sendmsg fires every time that function is called, giving you register state and the ability to run arbitrary code.
The problem: kprobes are inherently unstable. If tcp_sendmsg is renamed, inlined by the compiler, or its calling convention changes, the probe silently does nothing or attaches to the wrong place. A tool written against one kernel version often breaks on the next. This made kprobes useful for ad-hoc debugging but unreliable for production monitoring.
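A common defensive habit is to check the live kernel's symbol table before trusting a probe by name; a small sketch (tcp_sendmsg as the example target):

```shell
# Inlined, renamed, or un-built functions simply do not appear in
# /proc/kallsyms, so verify the symbol before registering a kprobe on it.
if grep -qw tcp_sendmsg /proc/kallsyms; then
    echo "tcp_sendmsg can be probed by name"
else
    echo "tcp_sendmsg not found: a kprobe on it would fail to attach"
fi
```

With root, /sys/kernel/tracing/available_filter_functions gives the definitive list of attachable functions on the running kernel.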
Tracepoints: stable ABI (Linux 2.6.28, 2008)
Mathieu Desnoyers introduced the TRACE_EVENT infrastructure to give tracing tools a stable surface. A tracepoint like sched_switch or block_rq_issue is a deliberate, named annotation placed by kernel developers in semantically meaningful locations.
The contract: the tracepoint name and its fields are stable across kernel versions (treated like a userspace-visible ABI). Tools that attach to sched_switch and read prev_pid/next_pid keep working release after release, regardless of how the surrounding code changes.
The zero-overhead design (jump labels that patch out the entire check when no consumers are registered) meant subsystem maintainers were willing to add tracepoints liberally — they cost nothing when idle.
The limitation: tracepoints only exist where developers placed them. There is no tracepoint for every interesting function, and adding one requires a kernel patch.
BPF: programmable logic at both layers (Linux 3.18+)
BPF programs can attach to both tracepoints and kprobes, but they add a critical capability: instead of just logging every event, a BPF program can aggregate, filter, and compute in-kernel — only sending a summary to userspace.
This turned tracepoints from "fire and log" into "fire and compute." A BPF program on block_rq_complete can maintain a per-device latency histogram entirely in kernel memory and expose it as a BPF map. Without BPF, collecting that histogram required copying every raw event to userspace and aggregating there — at high I/O rates, this was prohibitive.
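As a hedged sketch of the idea (field names assume the block_rq_issue/block_rq_complete tracepoint formats of recent kernels; needs root), a bpftrace program can keep the whole histogram in a BPF map and print it only on exit:

```
# Per-device block I/O latency histogram, aggregated entirely in-kernel.
bpftrace -e '
tracepoint:block:block_rq_issue {
    @start[args->dev, args->sector] = nsecs;
}
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
    @usecs[args->dev] = hist((nsecs - @start[args->dev, args->sector]) / 1000);
    delete(@start[args->dev, args->sector]);
}'
```

Only the final map contents cross the kernel/userspace boundary, no matter how many million events fired.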
fentry/fexit (Linux 5.5, 2020) added a fourth layer: BPF programs that attach to any kernel function entry/exit, using BTF type information to access typed arguments, without the INT3-trap overhead of kprobes and without needing an explicit TRACE_EVENT annotation.
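A minimal fentry sketch in CO-RE style (assumes a BTF-generated vmlinux.h, libbpf's headers, and a separate userspace loader; the program name is illustrative):

```c
#include "vmlinux.h"              /* BTF-generated kernel types (assumed present) */
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* Typed access to tcp_sendmsg's arguments, no INT3 trap involved */
SEC("fentry/tcp_sendmsg")
int BPF_PROG(on_tcp_sendmsg, struct sock *sk, struct msghdr *msg, size_t size)
{
    bpf_printk("tcp_sendmsg: %lu bytes", (unsigned long)size);
    return 0;
}
```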
kprobes → attach anywhere, unstable ABI, INT3 overhead
tracepoints → stable ABI, zero overhead, limited coverage
BPF (kprobe) → programmable kprobes, in-kernel aggregation
BPF (tp) → programmable tracepoints, structured arguments
fentry/fexit → typed function tracing, minimal overhead
The result is a layered system: use tracepoints for stable production instrumentation, kprobes/fentry for deep one-off investigation, and BPF programs to avoid flooding userspace with raw events.
Kprobes: dynamic probes on any instruction
A kprobe can be attached to any instruction in the kernel (not just function entries). When hit, it calls a handler in your code.
How kprobes work
/* kernel/kprobes.c: simplified */

/* At probe registration: */
register_kprobe(kp)
  → find instruction at kp->addr
  → copy original instruction (for single-stepping)
  → replace first byte(s) with INT3 (0xCC) breakpoint  /* on x86 */

/* At runtime, when INT3 fires: */
do_int3()
  → kprobe_int3_handler()
      → call kp->pre_handler(kp, regs)
      → set up single-step: restore original instruction, set TF flag
      → /* return to single-step the original instruction */
  → after single-step:
      → post_kprobe_handler()
          → call kp->post_handler(kp, regs, 0)
          → restore INT3 breakpoint
On x86-64 with CONFIG_KPROBES_ON_FTRACE, kprobes at function entry use the existing ftrace hook instead of INT3, eliminating the breakpoint overhead.
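Whether the running kernel has this enabled can be checked from its build config (the /boot/config-* path is a common distro convention, not guaranteed):

```shell
# Prints CONFIG_KPROBES_ON_FTRACE=y when the optimization is available
grep CONFIG_KPROBES_ON_FTRACE /boot/config-"$(uname -r)"

# With root, registered kprobes and their state are listed in:
#   /sys/kernel/debug/kprobes/list
```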
Registering kprobes from a kernel module
/* drivers/mymodule/mymodule.c */
#include <linux/kprobes.h>
static int pre_handler(struct kprobe *p, struct pt_regs *regs)
{
    /* Called BEFORE the probed instruction executes */
    /* regs contains register state at probe point */
    printk(KERN_INFO "kprobe: hit tcp_sendmsg, rdi=0x%lx\n", regs->di);
    return 0;
}
static void post_handler(struct kprobe *p, struct pt_regs *regs,
                         unsigned long flags)
{
    /* Called AFTER the probed instruction executes (single-stepped) */
}

static struct kprobe kp = {
    .symbol_name  = "tcp_sendmsg", /* or .addr = (kprobe_opcode_t *)addr */
    .pre_handler  = pre_handler,
    .post_handler = post_handler,
};
static int __init mymodule_init(void)
{
    int ret = register_kprobe(&kp);

    if (ret < 0) {
        pr_err("register_kprobe failed: %d\n", ret);
        return ret;
    }
    pr_info("kprobe registered at %p\n", kp.addr);
    return 0;
}

static void __exit mymodule_exit(void)
{
    unregister_kprobe(&kp);
}

module_init(mymodule_init);
module_exit(mymodule_exit);
MODULE_LICENSE("GPL");
kretprobe: hook on function return
static int entry_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
    /* Called on function entry; can store data in ri->data */
    *(ktime_t *)ri->data = ktime_get();
    return 0;
}

static int return_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
    /* Called on function return */
    ktime_t start = *(ktime_t *)ri->data;
    long retval = regs_return_value(regs);
    ktime_t elapsed = ktime_sub(ktime_get(), start);

    pr_info("tcp_sendmsg returned %ld, took %lld ns\n",
            retval, ktime_to_ns(elapsed));
    return 0;
}

static struct kretprobe my_kretprobe = {
    .handler        = return_handler,
    .entry_handler  = entry_handler,
    .data_size      = sizeof(ktime_t), /* per-instance data */
    .maxactive      = 20,              /* max concurrent instances */
    .kp.symbol_name = "tcp_sendmsg",
};
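Registration mirrors the plain kprobe case; a minimal sketch of the module hooks (function names here are illustrative):

```c
static int __init myretprobe_init(void)
{
    int ret = register_kretprobe(&my_kretprobe);

    if (ret < 0) {
        pr_err("register_kretprobe failed: %d\n", ret);
        return ret;
    }
    return 0;
}

static void __exit myretprobe_exit(void)
{
    unregister_kretprobe(&my_kretprobe);
    /* nmissed counts returns dropped because more than .maxactive
     * calls were in flight simultaneously */
    pr_info("missed %d probe instances\n", my_kretprobe.nmissed);
}
```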
Static tracepoints: TRACE_EVENT
Static tracepoints are annotation points compiled into the kernel at specific places. They provide:
- Zero overhead when disabled (a single no-op test)
- Rich structured data (not just register dumps)
- A stable ABI (unlike kprobes, which depend on function names)
TRACE_EVENT macro
/* include/trace/events/sched.h */
TRACE_EVENT(sched_switch,

    /* Prototype of the tracepoint call: */
    TP_PROTO(bool preempt,
             struct task_struct *prev,
             struct task_struct *next,
             unsigned int prev_state),

    /* Actual call arguments: */
    TP_ARGS(preempt, prev, next, prev_state),

    /* Fields stored in the ring buffer: */
    TP_STRUCT__entry(
        __array( char,  prev_comm, TASK_COMM_LEN )
        __field( pid_t, prev_pid )
        __field( int,   prev_prio )
        __field( long,  prev_state )
        __array( char,  next_comm, TASK_COMM_LEN )
        __field( pid_t, next_pid )
        __field( int,   next_prio )
    ),

    /* Code to copy data into the ring buffer entry: */
    TP_fast_assign(
        memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
        __entry->prev_pid = prev->pid;
        __entry->prev_prio = prev->prio;
        __entry->prev_state = __trace_sched_switch_state(preempt, prev_state, prev);
        memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
        __entry->next_pid = next->pid;
        __entry->next_prio = next->prio;
    ),

    /* Format string for human-readable output: */
    TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> "
              "next_comm=%s next_pid=%d next_prio=%d",
              __entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
              ...)
);
Adding a tracepoint to your code
/* 1. Define the tracepoint (once, in a header): */
/* include/trace/events/mysubsystem.h */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM mysubsystem
#if !defined(_TRACE_MYSUBSYSTEM_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_MYSUBSYSTEM_H
#include <linux/tracepoint.h>
TRACE_EVENT(my_event,
    TP_PROTO(int value, const char *name),
    TP_ARGS(value, name),
    TP_STRUCT__entry(
        __field(int, value)
        __string(name, name)   /* variable-length string */
    ),
    TP_fast_assign(
        __entry->value = value;
        __assign_str(name, name);
    ),
    TP_printk("value=%d name=%s", __entry->value, __get_str(name))
);
#endif /* _TRACE_MYSUBSYSTEM_H */
#include <trace/define_trace.h>
/* 2. Use in source files: */
/* kernel/mysubsystem.c */
#define CREATE_TRACE_POINTS
#include <trace/events/mysubsystem.h>
void my_function(int val, const char *name)
{
    /* Tracepoint: zero-overhead when disabled */
    trace_my_event(val, name);
    /* ... */
}
How tracepoints achieve zero overhead
/* Generated by TRACE_EVENT (simplified: the real code iterates an
 * array of registered probe functions): */
static inline void trace_my_event(int value, const char *name)
{
    /* Single branch: check if any consumer is registered */
    if (static_key_false(&__tracepoint_my_event.key)) {
        /* overhead only when enabled */
        __tracepoint_my_event.funcs(value, name);
    }
}
static_key_false uses a jump label: a single nop/jmp instruction patched in place at runtime. When no consumer is registered the site is a no-op and the trace body sits out of line; enabling the tracepoint live-patches the nop into a jump to that body.
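Conceptually, the two runtime states of the patched site look like this (an illustrative x86-64 sketch, not actual compiler output):

```asm
; tracepoint disabled: the static key compiles to a 5-byte nop,
; so the hot path falls straight through past the trace code
caller:
    nopl   0x0(%rax,%rax,1)   ; patched site, no branch taken
    ...                       ; rest of the function

; tracepoint enabled: the nop is live-patched into a jump
caller:
    jmp    .Ltrace            ; same 5 bytes, now a jump
    ...
.Ltrace:
    ; call the registered probe function(s), then jump back
```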
uprobes: userspace dynamic probes
uprobes work like kprobes but for userspace binaries, introduced in Linux 3.5. The kernel inserts a breakpoint into the mapped pages:
# Trace all calls to malloc in any process
echo 'p:malloc /lib/x86_64-linux-gnu/libc.so.6:0x9d430' \
> /sys/kernel/tracing/uprobe_events
echo 1 > /sys/kernel/tracing/events/uprobes/malloc/enable
# Trace by symbol name (requires debug symbols)
echo 'p:my_func /usr/bin/myapp:my_function' \
> /sys/kernel/tracing/uprobe_events
/* Kernel struct uprobe */
struct uprobe {
    struct rb_node         rb_node;        /* node in inode's uprobe tree */
    refcount_t             ref;
    struct rw_semaphore    register_rwsem;
    struct rw_semaphore    consumer_rwsem;
    struct list_head       pending_list;
    struct uprobe_consumer *consumers;
    struct inode           *inode;         /* target binary's inode */
    loff_t                 offset;         /* offset within file */
    loff_t                 ref_ctr_offset;
    unsigned long          flags;
    struct arch_uprobe     arch;           /* arch-specific saved instruction */
};
USDT: userspace statically defined tracepoints
Similar to kernel TRACE_EVENT, USDT probes are static markers in userspace code. BPF attachment to USDT probes via the uprobe mechanism became available in Linux 4.8.
/* In C code (requires systemtap-sdt-dev or dtrace probes): */
#include <sys/sdt.h>
DTRACE_PROBE2(myapp, request_start, req_id, req_type);
/* ... */
DTRACE_PROBE1(myapp, request_end, req_id);
# List USDT probes in a binary
readelf --notes /usr/bin/python3 | grep stapsdt
# Trace with bpftrace
bpftrace -e 'usdt:/usr/bin/python3:function__entry { printf("%s\n", str(arg1)); }'
# Trace with perf
perf probe -x /usr/bin/python3 sdt_pythonX_Y:function__entry
Observing probe activity
# Count kprobe hits
cat /sys/kernel/tracing/kprobe_profile
# tcp_sendmsg 42 0 ← hits, misses
# Event hit counts (requires a hist trigger to have been set on the event)
cat /sys/kernel/tracing/events/syscalls/sys_enter_read/hist
# Aggregate with bpftrace
bpftrace -e 'kprobe:tcp_sendmsg { @[comm] = count(); }'
# Attaching 1 probe...
# ^C
# @[nginx]: 1234
# @[postgres]: 567
# Dynamic function graph with kprobe
echo function_graph > /sys/kernel/tracing/current_tracer
echo 'tcp_sendmsg' > /sys/kernel/tracing/set_graph_function
echo 1 > /sys/kernel/tracing/tracing_on
Further reading
- ftrace — Ring buffer and tracefs interface
- perf Events — Hardware counters and sampling
- BPF: Architecture — kprobe/tracepoint BPF programs
- Documentation/trace/kprobes.rst — kernel kprobe documentation
- Documentation/trace/tracepoints.rst — static tracepoint documentation