BPF Architecture and Program Types
How eBPF programs are loaded, verified, and attached to kernel hooks
From cBPF to eBPF
Classic BPF (1993–2014)
BPF was not originally a Linux invention. It came from a paper by Steven McCanne and Van Jacobson at Lawrence Berkeley Laboratory presented at USENIX Winter 1993: "The BSD Packet Filter: A New Architecture for User-level Packet Capture." (paper)
The original BPF was a small virtual machine with two 32-bit registers (A and X), a small scratch memory array, and a simple instruction set designed for one purpose: filtering network packets in the kernel before copying them to userspace. Linux adopted it as SO_ATTACH_FILTER — attach a filter to a socket to drop packets you don't care about before they reach recvmsg().
That was its entire purpose. The implementation lived almost entirely in net/core/filter.c.
The problem with cBPF
As Linux kernel development accelerated in the 2000s and 2010s, multiple subsystems needed a way to let userspace express policies in the kernel without giving userspace arbitrary kernel code execution:
- seccomp: wanted to filter system calls. Its filter mode (seccomp-bpf, Linux 3.5, 2012) used cBPF as-is, but the 2-register model was awkward for real policies.
- tc (traffic control): needed programmable packet classification and mangling, not just filtering.
- perf/tracing: wanted to attach custom aggregation logic to probe points to avoid copying raw events to userspace.
Kernel modules could do all this, but they're dangerous: a buggy module panics the kernel, they require CAP_SYS_MODULE, and they're tied to a specific kernel version. There was no way to safely run "untrusted" kernel code.
eBPF (Linux 3.18, 2014)
Alexei Starovoitov and Daniel Borkmann rewrote the BPF virtual machine from scratch. The new design — retroactively called eBPF (extended BPF) (commit) (LWN) — kept backward compatibility with cBPF socket filters (they are auto-translated) but was otherwise a new architecture:
| | cBPF | eBPF |
|---|---|---|
| Registers | 2 (32-bit A, X) | 11 (64-bit r0–r10) |
| Instruction width | Variable | Fixed 64-bit |
| Shared state | None | Maps (hash, array, ring buffer…) |
| Attachment points | Socket filter only | 30+ program types |
| Verifier | None (simple validator) | Full safety proof |
| JIT | Partial | All major architectures |
| Userspace/kernel sharing | No | Yes (maps, BTF) |
The verifier is what makes eBPF safe: it walks every possible execution path in the BPF program and proves that it terminates, does not dereference invalid pointers, and does not access memory outside the permitted regions. A program the verifier rejects never runs.
The map abstraction solved the shared-state problem: BPF programs could accumulate statistics (per-IP counters, latency histograms) in a kernel-side data structure, and userspace could read that structure via the bpf() syscall — no raw kernel memory access needed.
The answer to "why not a kernel module?" is now concrete: a BPF program that passes the verifier cannot crash the kernel, can be loaded without CAP_SYS_MODULE, and is JIT-compiled to near-native speed.
The bpf() syscall
Everything in eBPF flows through a single syscall:
Key commands:
BPF_PROG_LOAD /* load and verify a BPF program */
BPF_MAP_CREATE /* create a BPF map */
BPF_MAP_LOOKUP_ELEM
BPF_MAP_UPDATE_ELEM
BPF_MAP_DELETE_ELEM
BPF_OBJ_PIN /* pin map/program to /sys/fs/bpf/ */
BPF_OBJ_GET /* retrieve pinned object */
BPF_PROG_ATTACH /* attach program to cgroup/hook */
BPF_PROG_DETACH
BPF_LINK_CREATE /* create a BPF link (managed attachment) */
BPF_LINK_UPDATE
BPF_BTF_LOAD /* load BTF type information */
Loading a program
/* kernel/bpf/syscall.c: bpf_prog_load() handles this command */
union bpf_attr attr = {
    .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
    .insn_cnt = prog_len / sizeof(struct bpf_insn),
    .insns = (uint64_t)(unsigned long)prog_insns, /* pointers travel as u64 */
    .license = (uint64_t)(unsigned long)"GPL",
    .log_level = 1, /* request verifier log */
    .log_buf = (uint64_t)(unsigned long)log_buf,
    .log_size = LOG_BUF_SIZE,
    .kern_version = 0, /* pre-5.0 kernels: must match for kprobes */
};
/* glibc has no bpf() wrapper; call through syscall(2) */
int prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
/* Returns a new program fd on success, or -1 with errno set */
The kernel flow after BPF_PROG_LOAD:
bpf_prog_load()
→ bpf_check_uarg_tail_zero() /* validate attr padding */
→ find_prog_type() /* look up type ops */
→ bpf_prog_alloc() /* allocate struct bpf_prog */
→ copy_from_bpfptr(insns) /* copy bytecode from userspace */
→ bpf_check() /* VERIFIER — safety proof */
→ bpf_prog_select_runtime() /* JIT or interpreter selection */
→ bpf_prog_kallsyms_add() /* expose in /proc/kallsyms */
→ anon_inode_getfd("bpf-prog") /* return file descriptor */
struct bpf_prog
The central kernel object for a loaded BPF program:
/* include/linux/filter.h */
struct bpf_prog {
u16 pages; /* number of pages */
u16 jited:1, /* true if JIT-compiled */
jit_requested:1,
gpl_compatible:1,
cb_access:1, /* accesses skb->cb area */
dst_needed:1,
blinding_requested:1,
blinded:1,
is_func:1, /* function in a program array */
kprobe_override:1,
has_callchain_buf:1,
enforce_expected_attach_type:1,
call_get_stack:1,
call_get_func_ip:1,
tstamp_type_access:1;
enum bpf_prog_type type; /* BPF_PROG_TYPE_* */
enum bpf_attach_type expected_attach_type;
u32 len; /* number of filter blocks */
u32 jited_len; /* size of JIT code */
u8 tag[BPF_TAG_SIZE]; /* SHA1-based identifier */
struct bpf_prog_stats __percpu *stats;
struct bpf_prog_aux *aux; /* verifier state, maps, BTF */
struct sock_fprog_kern *orig_prog; /* original Classic BPF */
unsigned int (*bpf_func)(const void *ctx,
const struct bpf_insn *insn);
/* JIT/interpreter code follows */
union {
struct sock_filter insns[0]; /* Classic BPF instructions */
struct bpf_insn insnsi[]; /* eBPF instructions */
};
};
BPF instruction set
eBPF has 11 64-bit registers (r0–r10) and a fixed-size instruction format:
/* include/linux/filter.h */
struct bpf_insn {
__u8 code; /* opcode */
__u8 dst_reg:4; /* destination register */
__u8 src_reg:4; /* source register */
__s16 off; /* signed offset */
__s32 imm; /* signed immediate */
};
Register conventions:
- r0 — return value from helper calls and program exit
- r1–r5 — function arguments (r1 = context pointer on entry)
- r6–r9 — callee-saved registers
- r10 — frame pointer (read-only)
Program types
Each program type has a specific context struct and allowed helpers:
/* include/uapi/linux/bpf.h */
enum bpf_prog_type {
BPF_PROG_TYPE_SOCKET_FILTER, /* socket filter, ctx=__sk_buff */
BPF_PROG_TYPE_KPROBE, /* kprobe/uprobe, ctx=pt_regs */
BPF_PROG_TYPE_SCHED_CLS, /* TC classifier, ctx=__sk_buff */
BPF_PROG_TYPE_SCHED_ACT, /* TC action */
BPF_PROG_TYPE_TRACEPOINT, /* static tracepoints */
BPF_PROG_TYPE_XDP, /* XDP, ctx=xdp_md */
BPF_PROG_TYPE_PERF_EVENT, /* PMU/perf events */
BPF_PROG_TYPE_CGROUP_SKB, /* cgroup socket buffer */
BPF_PROG_TYPE_CGROUP_SOCK, /* cgroup socket creation */
BPF_PROG_TYPE_LWT_IN, /* lightweight tunnels */
BPF_PROG_TYPE_SOCK_OPS, /* socket event callbacks */
BPF_PROG_TYPE_SK_SKB, /* socket map steering */
BPF_PROG_TYPE_CGROUP_DEVICE, /* device access control */
BPF_PROG_TYPE_SK_MSG, /* sockmap message redirect */
BPF_PROG_TYPE_RAW_TRACEPOINT, /* raw tracepoints (no arg rewrite) */
BPF_PROG_TYPE_CGROUP_SOCK_ADDR, /* cgroup bind/connect/sendmsg */
BPF_PROG_TYPE_LWT_SEG6LOCAL,
BPF_PROG_TYPE_LIRC_MODE2, /* IR decoding */
BPF_PROG_TYPE_SK_REUSEPORT, /* SO_REUSEPORT steering */
BPF_PROG_TYPE_FLOW_DISSECTOR, /* custom flow dissection */
BPF_PROG_TYPE_CGROUP_SYSCTL, /* sysctl access control */
BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
BPF_PROG_TYPE_CGROUP_SOCKOPT,
BPF_PROG_TYPE_TRACING, /* fentry/fexit/fmod_ret */
BPF_PROG_TYPE_STRUCT_OPS, /* implement kernel struct ops */
BPF_PROG_TYPE_EXT, /* function replacement */
BPF_PROG_TYPE_LSM, /* Linux Security Module hooks */
BPF_PROG_TYPE_SK_LOOKUP, /* socket lookup */
BPF_PROG_TYPE_SYSCALL, /* privileged syscall programs */
BPF_PROG_TYPE_NETFILTER, /* netfilter hook */
};
Tracing programs in detail
/* fentry: attaches at function entry via a BPF trampoline; cheaper than a kprobe */
SEC("fentry/tcp_sendmsg")
int BPF_PROG(trace_tcp_sendmsg, struct sock *sk,
struct msghdr *msg, size_t size)
{
bpf_printk("tcp_sendmsg: size=%lu\n", size); /* %zu is not a supported specifier */
return 0;
}
/* fexit: attaches to function return, can see return value */
SEC("fexit/tcp_sendmsg")
int BPF_PROG(trace_tcp_sendmsg_exit, struct sock *sk,
struct msghdr *msg, size_t size, int ret)
{
bpf_printk("tcp_sendmsg returned %d\n", ret);
return 0;
}
/* kprobe: can attach to any non-inlined function */
SEC("kprobe/tcp_sendmsg")
int kprobe_tcp_sendmsg(struct pt_regs *ctx)
{
struct sock *sk = (struct sock *)PT_REGS_PARM1(ctx);
/* access arguments through PT_REGS_PARM1..5 */
return 0;
}
XDP program
/* Minimal XDP program */
SEC("xdp")
int xdp_drop_icmp(struct xdp_md *ctx)
{
void *data_end = (void *)(long)ctx->data_end;
void *data = (void *)(long)ctx->data;
struct ethhdr *eth = data;
if ((void *)(eth + 1) > data_end)
return XDP_PASS;
if (eth->h_proto != bpf_htons(ETH_P_IP)) /* bpf_htons from bpf_endian.h */
return XDP_PASS;
struct iphdr *ip = (void *)(eth + 1);
if ((void *)(ip + 1) > data_end)
return XDP_PASS;
if (ip->protocol == IPPROTO_ICMP)
return XDP_DROP; /* drop ICMP */
return XDP_PASS;
}
XDP return codes:
- XDP_ABORTED — bug/error, drops packet + fires tracepoint
- XDP_DROP — drop immediately (before skb allocation)
- XDP_PASS — pass to normal networking stack
- XDP_TX — send back out the same interface
- XDP_REDIRECT — redirect to another interface or AF_XDP socket
BPF program lifecycle
Load: bpf(BPF_PROG_LOAD) → prog_fd (refcount=1)
│
Attach: bpf_link_create() → link_fd
or explicit attach │
│ prog refcount++
│
Running: hook fires → bpf_prog_run(prog, ctx)
│
Detach: close(link_fd) → link cleanup, prog refcount--
Unload: close(prog_fd) → prog refcount--
│
refcount==0: → bpf_prog_free()
JIT memory freed
BPF links (BPF_LINK_CREATE) are preferred over legacy attach methods because:
- Closing the link_fd automatically detaches (no dangling programs)
- Links survive the original loader process exiting (if pinned)
- Atomic program replacement via BPF_LINK_UPDATE
Helper functions
BPF programs can't call arbitrary kernel functions; instead, each program type gets an allowlisted set of helper functions:
/* Common helpers */
bpf_map_lookup_elem(map, key) → value pointer or NULL
bpf_map_update_elem(map, key, val, flags)
bpf_map_delete_elem(map, key)
bpf_ktime_get_ns() → nanoseconds since boot
bpf_get_current_pid_tgid() → (tgid << 32) | pid
bpf_get_current_comm(buf, sz) → process name string
bpf_get_current_task() → struct task_struct *
bpf_probe_read_kernel(dst, sz, src) → safe kernel memory read
bpf_probe_read_user(dst, sz, src) → safe userspace memory read
bpf_probe_read_kernel_str(dst, sz, src)
bpf_printk(fmt, ...) → /sys/kernel/debug/tracing/trace_pipe
/* Networking */
bpf_skb_load_bytes(skb, off, to, len)
bpf_skb_store_bytes(skb, off, from, len, flags)
bpf_redirect(ifindex, flags)
bpf_fib_lookup(ctx, params, plen, flags)
/* Ringbuf */
bpf_ringbuf_reserve(map, size, flags)
bpf_ringbuf_submit(data, flags)
bpf_ringbuf_output(map, data, size, flags)
Each helper is assigned a stable ID: new helpers receive new IDs, and existing IDs never change, which keeps the helper ABI stable.
Observing BPF programs
# List all loaded BPF programs
bpftool prog list
# 42: xdp name xdp_drop_icmp tag a04f5eef06a7f555 gpl
# loaded_at 2024-01-15T10:00:00+0000 uid 0
# xlated 112B jited 88B memlock 4096B map_ids 3
# Show disassembly (eBPF or JIT)
bpftool prog dump xlated id 42
bpftool prog dump jited id 42
# Show a program's details, including its map IDs (-p: JSON output; options go before the object)
bpftool -p prog show id 42
# BPF stats
bpftool prog profile id 42 duration 5 cycles instructions
# Trace output
cat /sys/kernel/debug/tracing/trace_pipe
Further reading
- BPF Maps — Sharing data between BPF and userspace
- BPF Verifier — How safety is enforced
- BTF and CO-RE — Writing portable BPF programs
- kernel/bpf/syscall.c — bpf() syscall implementation
- samples/bpf/ in the kernel tree — in-tree examples