AF_XDP Sockets
Sending and receiving packets directly to/from userspace with zero copies
What AF_XDP is
AF_XDP is a socket address family that allows a userspace application to receive packets directly from an XDP program — bypassing the kernel network stack entirely for those packets. A BPF program running in XDP redirects matched packets to an AF_XDP socket, and the application reads them from a shared-memory ring buffer.
This achieves near-DPDK performance while staying within the kernel security model: there is no full hardware bypass, and the kernel still owns the NIC, its driver, and the verifier-checked XDP program.
The architecture
NIC hardware → DMA → ring buffer
↓ (NAPI poll)
XDP program runs:
if packet matches: bpf_redirect_map(&xsk_map, queue_id, 0) → XDP_REDIRECT
else: XDP_PASS → normal kernel stack
↓
Shared UMEM (User Memory)
↓
AF_XDP socket (xsk) → userspace application reads/writes directly
No sk_buff is allocated. The packet DMA buffer is the same memory that userspace reads.
UMEM: the shared memory region
AF_XDP is built around UMEM — a userspace-allocated memory region divided into fixed-size frames that both the kernel and userspace share:
UMEM:
┌──────┬──────┬──────┬──────┬──────┬──────┐
│frame0│frame1│frame2│frame3│frame4│frame5│ ...
└──────┴──────┴──────┴──────┴──────┴──────┘
Each frame: configurable size (default 4096 bytes)
Four ring buffers mediate ownership of frames between kernel and userspace:
| Ring | Direction | Who produces | Who consumes |
|---|---|---|---|
| FILL | RX: give frames to kernel | Userspace | Kernel (DMA target) |
| RX | RX: kernel delivers packets | Kernel | Userspace |
| TX | TX: userspace sends packets | Userspace | Kernel |
| COMPLETION | TX: kernel signals sent | Kernel | Userspace |
Setting up an AF_XDP socket
// 1. Create UMEM (shared memory)
size_t umem_size = (size_t)NUM_FRAMES * 4096; // NUM_FRAMES frames of 4096 bytes each
void *umem_area = mmap(NULL, umem_size, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); // MAP_HUGETLB is optional and needs huge pages configured
struct xdp_umem_reg mr = {
.addr = (uint64_t)umem_area,
.len = umem_size,
.chunk_size = 4096,
.headroom = 0,
};
// 2. Create AF_XDP socket
int xsk_fd = socket(AF_XDP, SOCK_RAW, 0);
// 3. Register UMEM with socket
setsockopt(xsk_fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
// 4. Set up ring sizes
int ring_size = 2048;
setsockopt(xsk_fd, SOL_XDP, XDP_RX_RING, &ring_size, sizeof(ring_size));
setsockopt(xsk_fd, SOL_XDP, XDP_TX_RING, &ring_size, sizeof(ring_size));
setsockopt(xsk_fd, SOL_XDP, XDP_UMEM_FILL_RING, &ring_size, sizeof(ring_size));
setsockopt(xsk_fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &ring_size, sizeof(ring_size));
// 5. Map rings into userspace
struct xdp_mmap_offsets off;
socklen_t optlen = sizeof(off);
getsockopt(xsk_fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen);
void *fill_ring = mmap(NULL, off.fr.desc + ring_size * sizeof(__u64),
PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
xsk_fd, XDP_UMEM_PGOFF_FILL_RING);
// Similar for rx_ring, tx_ring, completion_ring
// 6. Bind to interface and queue
struct sockaddr_xdp sxdp = {
.sxdp_family = AF_XDP,
.sxdp_ifindex = ifindex,
.sxdp_queue_id = queue_id,
.sxdp_flags = XDP_COPY, // or XDP_ZEROCOPY for optimal performance
};
bind(xsk_fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
Zero-copy mode
In zero-copy mode (XDP_ZEROCOPY), the NIC DMA-writes packets directly into the UMEM frames. No data copying occurs between NIC, kernel, and userspace. Requires:
- Driver support (i40e, mlx5, ixgbe, etc.)
- UMEM frames pinned in memory
In copy mode (XDP_COPY), the kernel copies packet data from its internal buffers to UMEM. Works on any NIC but adds a copy.
Receiving packets (RX loop)
// Fill ring with addresses of free frames (give to kernel for DMA);
// at most ring_size frames can be outstanding at a time
for (int i = 0; i < ring_size && i < NUM_FRAMES; i++) {
fill_ring->addrs[fill_prod_idx % ring_size] = (uint64_t)i * FRAME_SIZE;
fill_prod_idx++;
}
// Update fill ring producer index
__atomic_store_n(fill_ring->producer, fill_prod_idx, __ATOMIC_RELEASE);
// Poll for received packets
struct pollfd pfd = { .fd = xsk_fd, .events = POLLIN };
poll(&pfd, 1, -1);
// Read from RX ring
uint32_t rx_idx = *rx_ring->consumer;
uint32_t avail = __atomic_load_n(rx_ring->producer, __ATOMIC_ACQUIRE) - rx_idx;
for (uint32_t i = 0; i < avail; i++) {
struct xdp_desc *desc = &rx_ring->descs[(rx_idx + i) % ring_size];
void *pkt = umem_area + desc->addr;
uint32_t len = desc->len;
// Process packet at pkt[0..len-1]
// Then return the frame to fill ring
}
__atomic_store_n(rx_ring->consumer, rx_idx + avail, __ATOMIC_RELEASE);
The XDP program (kernel side)
struct {
__uint(type, BPF_MAP_TYPE_XSKMAP);
__uint(max_entries, 64);
__type(key, __u32); // queue index
__type(value, __u32); // xsk socket fd
} xsk_map SEC(".maps");
SEC("xdp")
int xdp_redirect_to_xsk(struct xdp_md *ctx)
{
// Redirect packets arriving on this queue to the AF_XDP socket.
// The third argument is the fallback action: if no socket is
// registered for this queue, XDP_PASS hands the packet to the stack.
return bpf_redirect_map(&xsk_map, ctx->rx_queue_index, XDP_PASS);
}
libxdp / libbpf helper library
For production use, libxdp (part of xdp-tools) and libbpf provide abstractions that handle the ring management, UMEM setup, and XDP program loading:
// With libxdp
#include <xdp/xsk.h>
struct xsk_socket_config cfg = {
.rx_size = 2048,
.tx_size = 2048,
.libxdp_flags = XSK_LIBXDP_FLAGS__INHIBIT_PROG_LOAD,
.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST,
.bind_flags = XDP_ZEROCOPY,
};
struct xsk_umem *umem;
struct xsk_socket *xsk;
struct xsk_ring_prod fill_ring, tx_ring;
struct xsk_ring_cons comp_ring, rx_ring;
xsk_umem__create(&umem, umem_area, umem_size, &fill_ring, &comp_ring, NULL);
xsk_socket__create(&xsk, ifname, queue_id, umem, &rx_ring, &tx_ring, &cfg);
Use cases
- Kernel bypass for application protocols: Handle DNS/HTTP packet parsing in userspace at line rate
- Packet capture: Zero-copy alternative to libpcap for high-rate captures
- Custom load balancers: User-space decision logic with kernel-speed I/O
- Network function virtualization: Virtual switches, firewalls without full DPDK
Further reading
- XDP — The XDP framework that feeds AF_XDP sockets
- Network Device and NAPI — Where XDP programs hook into the driver
- TC and qdisc — Alternative for post-stack packet steering