sk_buff: The Network Buffer
The central data structure of the Linux network stack
What sk_buff is
struct sk_buff (socket buffer) is how the kernel represents a network packet. Every packet — from the moment it arrives from hardware to the moment it's delivered to a socket — lives in an sk_buff.
The structure has two parts:
1. Metadata header: pointers, length fields, protocol info, timestamps
2. Data buffer: the actual packet bytes (pointed to, not embedded)
Understanding sk_buff is fundamental to understanding the entire network stack, since every layer operates on it.
Memory layout
sk_buff metadata:
┌──────────────────────────────────────┐
│ next, prev (list linkage) │
│ sk (owning socket) │
│ dev (network device) │
│ len, data_len │
│ protocol, cloned, ... │
│ head, data, tail, end (data ptrs) │
│ cb[48] (layer-private scratch) │
│ ... │
└──────────────────────────────────────┘
Packet data buffer (allocated separately):
┌─────────────┬───────────────────────────────────────┬────────┐
│ headroom │ packet data (data ... tail) │tailroom│
└─────────────┴───────────────────────────────────────┴────────┘
↑ head ↑ data ↑ tail ↑ end
As headers are added (push down the stack):
data moves left: skb_push()
As headers are consumed (pop up the stack):
data moves right: skb_pull()
The headroom is reserved space before data — lower layers (ethernet, IP) push their headers into headroom as the packet goes down the stack. This avoids copying.
struct sk_buff key fields
// include/linux/skbuff.h
struct sk_buff {
// List linkage (in queues, hash chains)
struct sk_buff *next;
struct sk_buff *prev;
struct sock *sk; // owning socket (or NULL for forwarded packets)
struct net_device *dev; // network device this packet arrived on / will go out
ktime_t tstamp; // packet timestamp
// Per-layer scratch space: each layer stores its own data here.
// Only valid while the packet is at that layer.
char cb[48] __aligned(8);
unsigned int len; // total length (linear + paged data)
unsigned int data_len; // paged data length (scatter-gather)
__u16 mac_len; // MAC header length
__u16 hdr_len; // writable header length of a cloned skb
__u8 cloned:1; // shares data with another skb
__u8 head_frag:1; // head is a page fragment (not kmalloc)
__u8 pp_recycle:1; // owned by page_pool
// Protocol info
__be16 protocol; // Ethernet type (ETH_P_IP, ETH_P_IPV6, ...)
__u16 transport_header; // offset to L4 header
__u16 network_header; // offset to L3 header (IP)
__u16 mac_header; // offset to L2 header (Ethernet)
// Data pointers
sk_buff_data_t tail; // end of linear data
sk_buff_data_t end; // end of buffer (hard limit)
unsigned char *head; // start of buffer
unsigned char *data; // start of valid data
unsigned int truesize; // total memory charged to socket
// ...
// Note: paged fragments are NOT in struct sk_buff itself.
// They live in struct skb_shared_info at skb->end:
// skb_shinfo(skb)->frags[0..nr_frags-1]
};
Data pointer manipulation
// Push data onto the front (add a header going down the stack)
void *skb_push(struct sk_buff *skb, unsigned int len)
{
skb->data -= len;
skb->len += len;
return skb->data;
}
// Pull data off the front (consume a header going up the stack)
void *skb_pull(struct sk_buff *skb, unsigned int len)
{
skb->data += len;
skb->len -= len;
return skb->data;
}
// Add data at the end
void *skb_put(struct sk_buff *skb, unsigned int len)
{
void *tmp = skb_tail_pointer(skb);
skb->tail += len;
skb->len += len;
return tmp;
}
Header access macros
// Get pointer to L2/L3/L4 headers
static inline struct ethhdr *eth_hdr(const struct sk_buff *skb)
{
return (struct ethhdr *)skb_mac_header(skb);
}
static inline struct iphdr *ip_hdr(const struct sk_buff *skb)
{
return (struct iphdr *)skb_network_header(skb);
}
static inline struct tcphdr *tcp_hdr(const struct sk_buff *skb)
{
return (struct tcphdr *)skb_transport_header(skb);
}
// Example: reading the destination IP from a received packet
struct iphdr *iph = ip_hdr(skb);
__be32 dst = iph->daddr;
The cb[] scratch area
Each layer gets 48 bytes of private scratch storage in cb[]. This is a deliberate design choice: it avoids a separate allocation for per-layer metadata:
// TCP uses cb[] to store TCP-specific info
struct tcp_skb_cb {
union {
struct inet_skb_parm h4;
struct inet6_skb_parm h6;
} header;
__u32 seq; // Starting sequence number
__u32 end_seq; // SEQ + FIN + SYN + datalen
__u8 tcp_flags; // TCP header flags
__u8 sacked; // State of sack bits
// ...
};
#define TCP_SKB_CB(__skb) ((struct tcp_skb_cb *)&((__skb)->cb[0]))
// Usage:
TCP_SKB_CB(skb)->seq = tcp_seq;
Cloning and sharing
skb_clone() creates a new sk_buff that shares the same data buffer. The cloned bit is set on both. Useful for sending the same packet to multiple consumers (e.g., packet sockets + forwarding):
struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC);
// clone->data and skb->data point to the same buffer
// modifying headers requires a private copy first
// (skb_unshare() or pskb_expand_head())
skb_copy() creates a full independent copy.
Fragmented packets (scatter-gather)
For large packets (e.g., TCP offload), data may span multiple pages:
/* Paged portion lives in skb_shared_info at skb->end: */
struct skb_shared_info {
skb_frag_t frags[MAX_SKB_FRAGS];
__u8 nr_frags;
// ...
};
// Access via: skb_shinfo(skb)->frags[i]
// data_len = total length of paged portions
// len = linear + paged
// Total accessible data:
// skb->data ... skb->tail (linear)
// + frags[0] ... frags[nr_frags-1] (paged)
Functions like skb_linearize() collapse paged data into the linear portion when a driver doesn't support scatter-gather.
Allocation and freeing
// Allocate a new sk_buff with headroom for headers
struct sk_buff *alloc_skb(unsigned int size, gfp_t priority);
// Allocate in NAPI (softirq) context on the RX path, from a
// per-CPU page-fragment cache
struct sk_buff *napi_alloc_skb(struct napi_struct *napi, unsigned int length);
// Free on a drop/error path (decrements refcount, frees at zero,
// and fires the kfree_skb drop tracepoint)
void kfree_skb(struct sk_buff *skb);
// Free on the normal completion path (not counted as a drop)
void consume_skb(struct sk_buff *skb);
truesize is charged to the socket's receive buffer (sk->sk_rmem_alloc). If the socket buffer fills up, new packets are dropped.
Why this design?
The linear + paged split
Early sk_buff implementations stored packet data in a single contiguous kmalloc() buffer. This worked fine until NICs gained scatter-gather DMA capability in the early 2000s.
A NIC with scatter-gather can DMA-write a received packet into non-contiguous physical pages — the payload goes into page cache pages, the header goes into a small slab allocation. Copying everything into one contiguous buffer would waste memory and CPU cycles. The kernel needed a representation that could hold both.
The solution was the linear + fragment model:
- The linear portion (head…tail) holds packet headers — small, always contiguous, fast to access.
- The paged fragments (skb_shinfo(skb)->frags[]) hold payload — large, may be non-contiguous pages that came directly from DMA.
TCP Segmentation Offload (TSO, ~2002) solidified this: a NIC capable of TSO can take a single large buffer (e.g., 64KB) and split it into MTU-sized Ethernet frames. The kernel sends one sk_buff with the full TCP payload in the paged fragment area; the NIC handles the segmentation. Without paged fragments, the kernel would have had to pre-segment everything in software.
Why skb_shared_info lives at skb->end
/* net/core/skbuff.c: skb data layout */
/*
* head
* ↓
* [headroom][ packet data ][ skb_shared_info ]
* ↑ ↑ ↑
* data tail end
*/
static inline struct skb_shared_info *skb_shinfo(const struct sk_buff *skb)
{
return (struct skb_shared_info *)skb_end_pointer(skb);
}
Placing skb_shared_info immediately after the linear data buffer (at skb->end) avoids a separate heap allocation for the fragment array. The linear buffer and the fragment metadata are allocated together in one kmalloc() call. This matters for the hot path: every received packet allocates an sk_buff, and reducing that to one allocation saves significant overhead at high packet rates.
Cloning and the reference-counted design
skb_clone() creates a new sk_buff header that shares the same data buffer. This enables sending the same packet to multiple consumers — a packet socket listener and the normal forwarding path can both "have" the same sk_buff without copying the packet data.
The cloned bit signals that the data buffer is shared; any layer that wants to modify headers must call skb_unshare() or pskb_expand_head() to get its own copy first. This copy-on-write discipline prevents one subsystem from corrupting data that another is reading.
Further reading
- Socket Layer Overview — How sk_buff connects to sockets
- Network Device and NAPI — How sk_buffs are allocated during receive
- Life of a Packet (receive) — Full receive path using sk_buff
- Life of a Packet (transmit) — Full transmit path