What happens during swapping
The journey of a page from memory to disk and back
What is swapping?
When memory runs low, the kernel can move anonymous pages (heap, stack, private mappings) to a swap device - either a disk partition or a swap file. This frees physical memory for other uses while preserving the page's contents.
flowchart LR
subgraph SwapOut["Swap Out (Memory → Disk)"]
direction LR
subgraph PM1["Process Memory"]
AP["Anonymous Page"]
end
subgraph SS1["Swap Space (disk)"]
SL1["Swap slot (copy)"]
end
AP -->|write| SL1
PM1 -.->|Page freed!| PM1
end
flowchart RL
subgraph SwapIn["Swap In (Disk → Memory)"]
direction RL
subgraph SS2["Swap Space (disk)"]
SL2["Swap slot"]
end
subgraph PM2["Process Memory"]
NP["New Page (restored)"]
end
SL2 -->|read| NP
end
Unlike file pages (which can be re-read from their backing file), anonymous pages have nowhere to go except swap space.
Why this page was chosen
The kernel doesn't randomly pick pages to swap. It uses the LRU (Least Recently Used) lists to find cold pages - pages that haven't been accessed recently.
The selection process
flowchart TB
subgraph PageReclaimScanning["Page Reclaim Scanning"]
ActiveLRU["<b>Active Anonymous LRU</b><br/>(hot pages - recently used)<br/><i>Not scanned first - these pages are in use</i>"]
InactiveLRU["<b>Inactive Anonymous LRU</b><br/>(cold pages - candidates for swap)<br/><i>Scanned during reclaim</i>"]
Selection["<b>Selection Criteria:</b><br/>- Not recently accessed (PG_referenced clear)<br/>- Not locked (PG_locked clear)<br/>- Not being written back already<br/>- Can be unmapped from all page tables"]
ActiveLRU -->|"demote (if not recently accessed)"| InactiveLRU
InactiveLRU --> Selection
end
The referenced flag check
Pages get a "second chance" via the referenced flag:
flowchart TB
subgraph ReferencedFlagCheck["Page on Inactive LRU"]
Scan["Reclaim scan finds page"]
Check{"PG_referenced<br/>set?"}
Accessed["Page was accessed since last scan:<br/>- Clear PG_referenced<br/>- Rotate to end of list (give it more time)"]
Cold["Page is truly cold:<br/>- Select for swap out"]
Scan --> Check
Check -->|Yes| Accessed
Check -->|No| Cold
end
This prevents swapping out pages that are still being used occasionally.
Swap space organization
Linux supports two types of swap:
| Type | Device | Characteristics |
|---|---|---|
| Swap partition | /dev/sda2 | Dedicated, fixed size |
| Swap file | /swapfile | Flexible, can resize, easier to manage |
On modern kernels (2.6+), the kernel maps swap file blocks to disk sectors at swapon time and performs direct block I/O, bypassing all filesystem code. Performance between partition and file swap is effectively identical for non-fragmented files on non-COW filesystems (ext4, XFS). On SSDs, even fragmented swap files show negligible overhead since seek penalties don't apply.
Swap area structure
// From include/linux/swap.h
struct swap_info_struct {
unsigned long flags; // SWP_USED, SWP_WRITEOK, etc.
signed char prio; // Priority (higher = used first)
struct file *swap_file; // The swap file/device
struct block_device *bdev; // Block device (if partition)
unsigned long *swap_map; // Usage count per slot
unsigned long lowest_bit; // Optimization: first free slot hint
unsigned long highest_bit; // Optimization: last used slot
unsigned int pages; // Total slots in this area
unsigned int inuse_pages; // Currently used slots
// ...
};
Swap slots
Swap space is divided into page-sized slots:
flowchart TB
subgraph SwapArea["Swap Area (e.g., 1GB)"]
direction LR
subgraph Slots["Slot Numbers"]
S0["0<br/>used"]
S1["1<br/>free"]
S2["2<br/>used"]
S3["3<br/>free"]
S4["4<br/>free"]
S5["5<br/>used"]
S6["6<br/>used"]
S7["7<br/>free"]
SDOTS["..."]
SN["N-1<br/>free"]
end
end
subgraph SwapMap["swap_map[] tracks usage count for each slot"]
M0["0: count=2 (shared by 2 processes after fork)"]
M1["1: count=0 (free)"]
M2["2: count=1 (one owner)"]
M3["..."]
end
Viewing swap status
# View swap areas
cat /proc/swaps
# Filename Type Size Used Priority
# /dev/nvme0n1p3 partition 8388604 12345 -2
# Detailed swap info
swapon --show
# System-wide swap usage
free -h
Stage 1: Swap slot allocation
When the kernel decides to swap out a page, it first allocates a slot:
// mm/swap_slots.c / mm/swapfile.c (simplified)
swp_entry_t folio_alloc_swap(struct folio *folio)
{
struct swap_info_struct *si;
swp_entry_t entry;
unsigned long offset;
// Find a swap area with free slots
// Higher priority areas are checked first
si = swap_info[...]; // Based on priority
// Allocate a slot
offset = scan_swap_map_slots(si, ...);
// Create swap entry (type + offset encoded)
entry = swp_entry(si->type, offset);
return entry;
}
Swap entry encoding
A swap entry packs the swap area type and offset into a single value:
flowchart LR
subgraph SwapEntry["Swap Entry (64-bit)"]
direction TB
B0["Bit 0: Present = 0<br/>(not a valid page mapping)"]
B1["Bits 1-5: Swap type<br/>(which swap area, 0-31)"]
B2["Bits 6-57: Swap offset<br/>(slot number within the area)"]
B3["Bits 58-63: Flags and reserved"]
end
This entry will replace the page table entry (PTE) after the page is written out.
Stage 2: Writing to swap
Once a slot is allocated, the page content is written to the swap device:
// mm/vmscan.c shrink_folio_list() (simplified)
static unsigned int shrink_folio_list(struct list_head *folio_list, ...)
{
// For each folio...
// 1. Unmap from all page tables first
// (so no one can access while we're writing)
try_to_unmap(folio, TTU_BATCH_FLUSH);
// 2. If anonymous and we can swap...
if (folio_test_anon(folio)) {
if (!add_to_swap(folio))
goto activate; // Can't swap, keep in memory
// 3. Write page to swap device
pageout(folio, mapping);
}
}
The add_to_swap() function
// mm/swap_state.c
bool add_to_swap(struct folio *folio)
{
swp_entry_t entry;
// 1. Get a swap slot
entry = folio_alloc_swap(folio);
if (!entry.val)
return false; // No swap space available
// 2. Mark folio as being in the swap cache and record its slot
folio_set_swapcache(folio);
folio->swap = entry;
// 3. Insert into the swap cache so others can find it during I/O
xa_store(&swap_address_space->i_pages, ...);
return true;
}
Writing to disk
// Simplified I/O path
static int pageout(struct folio *folio, struct address_space *mapping)
{
// Submit write I/O to swap device
// This is asynchronous - we don't wait here
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE, // Don't wait
};
return mapping->a_ops->writepage(&folio->page, &wbc);
}
The actual I/O goes through the block layer:
flowchart TB
subgraph PageWritePath["Page Write Path"]
pageout["pageout()"]
swap_writepage["swap_writepage()"]
__swap_writepage["__swap_writepage()"]
submit_bio["submit_bio()"]
block["Block layer"]
driver["Device driver"]
disk["Disk"]
pageout --> swap_writepage
swap_writepage --> __swap_writepage
__swap_writepage --> submit_bio
submit_bio --> block
block --> driver
driver --> disk
end
Stage 3: Page table update
After the write completes, the PTE is updated to contain the swap entry instead of a page mapping:
flowchart LR
subgraph Before["Before swap out"]
direction LR
PTE1["<b>Page Table Entry (PTE)</b><br/>Present=1<br/>PFN=0x12345"]
PP["<b>Physical Page</b><br/>Frame 0x12345<br/>[page data]"]
PTE1 --> PP
end
flowchart LR
subgraph After["After swap out"]
direction LR
SE["<b>Swap Entry</b><br/>Present=0<br/>Type=0<br/>Offset=0x789"]
SD["<b>Swap Device</b><br/>Slot 0x789<br/>[page data]"]
SE --> SD
end
Freed["Physical page returned to buddy allocator!"]
The swap cache
During I/O, the page stays in the swap cache - an address space that allows finding in-flight pages:
flowchart LR
subgraph SwapCachePurpose["Swap Cache Purpose: While page is being written to swap"]
ProcessA["Process A"]
SwapCache["Swap Cache<br/>(finds it!)"]
Page["Page"]
ProcessA -->|tries to access| SwapCache
SwapCache --> Page
end
Note["Without swap cache, Process A would have to wait for write<br/>to complete, then read it back immediately (wasteful!)"]
After I/O completes successfully:
- Page is removed from swap cache
- Physical page is freed
- Only the swap slot remains with the data
Stage 4: Process accesses swapped page
Later, when a process accesses a swapped-out address, a page fault brings it back.
The fault handler detects a swap entry
// arch/x86/mm/fault.c → mm/memory.c
static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
pte_t entry = vmf->orig_pte;
if (!pte_present(entry)) {
if (pte_none(entry))
return do_anonymous_page(vmf); // New page
// PTE has swap entry
return do_swap_page(vmf); // ← Our path!
}
// ... handle other cases
}
do_swap_page() brings the page back
// mm/memory.c (simplified)
vm_fault_t do_swap_page(struct vm_fault *vmf)
{
swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
struct folio *folio;
// 1. Check swap cache first (maybe another thread faulted already)
folio = swap_cache_get_folio(entry, ...);
if (!folio) {
// 2. Not in cache - need to read from swap
folio = swapin_readahead(entry, ...);
}
// 3. Wait for page to be ready
folio_lock(folio);
// 4. Verify swap entry still valid (race check)
// ...
// 5. Update page table: swap entry → page mapping
pte = mk_pte(&folio->page, vmf->vma->vm_page_prot);
if (vmf->flags & FAULT_FLAG_WRITE)
pte = pte_mkwrite(pte_mkdirty(pte), vmf->vma);
set_pte_at(vmf->vma->vm_mm, vmf->address, vmf->pte, pte);
// 6. Update LRU
folio_add_lru(folio);
// 7. Free swap slot (if we're the last user)
swap_free(entry);
return 0;
}
Swap-in readahead
Like file readahead, the kernel predicts you might access nearby swapped pages:
// mm/swap_state.c
struct folio *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct vm_fault *vmf)
{
struct page *page;
// Read the requested page
page = read_swap_cache_async(entry, ...);
// Also read nearby pages (readahead)
if (should_readahead(...)) {
unsigned type = swp_type(entry);
pgoff_t entry_offset = swp_offset(entry);
for (long offset = -readahead_win; offset <= readahead_win; offset++) {
swp_entry_t ra_entry = swp_entry(type, entry_offset + offset);
read_swap_cache_async(ra_entry, ...);
}
}
return page_folio(page);
}
Stage 5: Swap slot freed
After successful swap-in, the swap slot can be freed:
// mm/swapfile.c
void swap_free(swp_entry_t entry)
{
struct swap_info_struct *si = swap_info[swp_type(entry)];
unsigned long count;
// Decrement usage count
// (might be >1 if COW shared the slot)
count = --si->swap_map[swp_offset(entry)];
if (count == 0) {
// Slot is now free for reuse
si->inuse_pages--;
}
}
Swap slots and COW
When processes share a swapped page (after fork + swap):
flowchart TB
subgraph SharedSlot["Parent and Child share swap slot"]
SwapSlot["<b>Swap slot 0x789</b><br/>swap_map[0x789] = 2 (two users)<br/>Contents: page data"]
ParentPTE["Parent's PTE: swap entry: slot 0x789"]
ChildPTE["Child's PTE: swap entry: slot 0x789"]
ParentPTE --> SwapSlot
ChildPTE --> SwapSlot
end
subgraph ParentSwapIn["When parent swaps in"]
P1["Reads from slot 0x789"]
P2["swap_map[0x789] = 1 (still used by child)"]
P3["Parent's PTE → physical page"]
P4["Slot NOT freed yet"]
P1 --> P2 --> P3 --> P4
end
subgraph ChildSwapIn["When child swaps in"]
C1["Reads from slot 0x789"]
C2["swap_map[0x789] = 0"]
C3["Slot is now free"]
C1 --> C2 --> C3
end
SharedSlot --> ParentSwapIn --> ChildSwapIn
The complete swap cycle
Putting it all together:
flowchart TB
subgraph SwapOutPhase["Swap Out Phase"]
Step1["<b>1. Memory Pressure</b><br/>kswapd wakes or direct reclaim starts"]
Step2["<b>2. Page Selection</b> (mm/vmscan.c)<br/>Scan inactive anonymous LRU<br/>Find cold page without PG_referenced"]
Step3["<b>3. Unmap Page</b><br/>Remove from all page tables via rmap<br/>Page now inaccessible to processes"]
Step4["<b>4. Allocate Swap Slot</b><br/>folio_alloc_swap() finds free slot<br/>Add page to swap cache"]
Step5["<b>5. Write to Swap</b><br/>submit_bio() queues I/O<br/>Async write to swap device"]
Step6["<b>6. Complete Swap Out</b><br/>PTEs now contain swap entries<br/>Physical page freed to buddy<br/>Page removed from swap cache"]
Step1 --> Step2
Step2 --> Step3
Step3 --> Step4
Step4 --> Step5
Step5 --> Step6
end
TimePasses["--- Time passes ---"]
subgraph SwapInPhase["Swap In Phase"]
Step7["<b>7. Process Accesses Address</b><br/>Page fault (PTE not present)<br/>do_swap_page() handles it"]
Step8["<b>8. Swap In</b><br/>Allocate new physical page<br/>Read from swap device<br/>Add to swap cache during I/O"]
Step9["<b>9. Restore Mapping</b><br/>Update PTE: swap entry → page mapping<br/>Add page to active LRU"]
Step10["<b>10. Free Swap Slot</b><br/>swap_free() decrements count<br/>Slot available for reuse (if count=0)"]
Step7 --> Step8
Step8 --> Step9
Step9 --> Step10
end
Step6 --> TimePasses
TimePasses --> Step7
Swap tunables
Swappiness
Controls how aggressively the kernel swaps:
# View current value
cat /proc/sys/vm/swappiness
# Default: 60
# Reduce swapping (prefer reclaiming file cache)
sysctl vm.swappiness=10
# Avoid swapping almost entirely
sysctl vm.swappiness=1
| Value | Effect |
|---|---|
| 0 | Swap only to avoid OOM |
| 1-30 | Prefer file cache reclaim |
| 60 | Default balance |
| 100 | Aggressive swapping |
Swap priority
Multiple swap areas can have different priorities:
# Higher priority = used first
swapon -p 10 /dev/fast_ssd # Priority 10 (used first)
swapon -p 5 /dev/slow_hdd # Priority 5 (used when fast is full)
# Same priority = round-robin
swapon -p 10 /dev/ssd1
swapon -p 10 /dev/ssd2 # Striped across both
VFS cache pressure
Affects the balance between reclaiming file cache vs swapping:
cat /proc/sys/vm/vfs_cache_pressure
# Default: 100
# Lower = keep file caches longer (swap more)
# Higher = drop file caches more readily (swap less)
Swap performance considerations
SSD vs HDD
| Metric | HDD | SSD |
|---|---|---|
| Random read | ~10ms | ~0.1ms |
| Sequential read | ~100MB/s | ~500-3000MB/s |
| Write endurance | Effectively unlimited | Limited (but usually fine) |
SSDs dramatically improve swap performance, making swapping much less painful.
Swap thrashing
When the working set exceeds RAM, the system constantly swaps pages in and out:
flowchart LR
subgraph Thrashing["Thrashing"]
direction TB
TimeArrow["Time →"]
subgraph PageA["Page A"]
A1["in memory"] --> A2["swap out"] --> A3["swap in"] --> A4["swap out"] --> A5["..."]
end
subgraph PageB["Page B"]
B1["swap in"] --> B2["swap out"] --> B3["swap in"] --> B4["swap out"] --> B5["..."]
end
subgraph PageC["Page C"]
C1["swap out"] --> C2["swap in"] --> C3["swap out"] --> C4["swap in"] --> C5["..."]
end
Result["System spends more time swapping than doing useful work!"]
end
Signs of thrashing:
- High swap I/O (vmstat si/so columns)
- High CPU time in kernel (top shows high sy/wa)
- Application responsiveness drops drastically
Solutions:
- Add more RAM
- Reduce memory usage
- Use zswap or zram for compression
zswap and zram
zswap: Compresses pages before writing to swap device:
flowchart LR
subgraph NormalSwap["Normal swap"]
NS_Page["Page"] --> NS_Device["Swap device"]
end
flowchart LR
subgraph WithZswap["With zswap"]
ZS_Page["Page"] --> ZS_Compress["Compress"] --> ZS_RAM["RAM pool"] -.->|if needed| ZS_Device["Swap device"]
end
zram: Creates a compressed RAM disk as swap:
# Enable zram
modprobe zram
echo lz4 > /sys/block/zram0/comp_algorithm
echo 2G > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon -p 100 /dev/zram0 # High priority
Try it yourself
Monitor swap activity
# Real-time swap I/O
vmstat 1
# si = swap in (KB/s)
# so = swap out (KB/s)
# Detailed swap stats
cat /proc/vmstat | grep -E "pswp|swap"
# pswpin - pages swapped in
# pswpout - pages swapped out
Watch a swap operation
# Terminal 1: Monitor swap
watch -n 1 'free -h; echo; cat /proc/swaps'
# Terminal 2: Force swapping
stress-ng --vm 1 --vm-bytes 90% --vm-keep &
# Terminal 3: Watch specific process swap usage
for pid in $(pgrep stress); do
echo "PID $pid: $(grep VmSwap /proc/$pid/status)"
done
Trace swap events
# Enable swap tracing
echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_vmscan_writepage/enable
echo 1 > /sys/kernel/debug/tracing/events/swap/swap_readpage/enable
# Watch events
cat /sys/kernel/debug/tracing/trace_pipe
Examine swap contents
# Per-process swap usage
for pid in /proc/[0-9]*; do
comm=$(cat $pid/comm 2>/dev/null)
swap=$(grep VmSwap $pid/status 2>/dev/null | awk '{print $2}')
[ "$swap" != "0" ] && [ -n "$swap" ] && echo "$comm: ${swap}kB"
done | sort -t: -k2 -n -r | head -10
Test swap in/out explicitly
# Allocate memory and force swap
cat > /tmp/swap_test.c << 'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main() {
size_t size = 200 * 1024 * 1024; // 200MB
char *buf = malloc(size);
if (!buf) { perror("malloc"); return 1; }
// Touch all pages so they become resident anonymous memory
memset(buf, 'A', size);
printf("Allocated and touched %zuMB\n", size / 1024 / 1024);
printf("Press Enter to trigger memory pressure...\n");
getchar();
// Let system swap it out
printf("Sleeping - watch with vmstat...\n");
sleep(30);
// Access again (swap in)
printf("Accessing memory (swap in)...\n");
volatile char c = buf[0]; // Force access (triggers swap in)
(void)c;
printf("Done. Press Enter to exit.\n");
getchar();
free(buf);
return 0;
}
EOF
gcc -o /tmp/swap_test /tmp/swap_test.c
/tmp/swap_test
Key source files
| File | What It Does |
|---|---|
| mm/swapfile.c | Swap area management, slot allocation |
| mm/swap_state.c | Swap cache operations |
| mm/page_io.c | Swap I/O operations |
| mm/memory.c | do_swap_page() - swap-in handling |
| mm/vmscan.c | Page reclaim, swap-out path |
| include/linux/swap.h | Swap data structures |
History
Swap evolution
Early Linux: Basic swap support from the beginning.
v2.6: Swap clustering for better sequential I/O.
v3.11 (2013): zswap - compressed swap cache.
Commit: 2b2811178e85 ("zswap: add to mm/")
Author: Seth Jennings (IBM)
In the swap-out flow (Stage 2-3 above), zswap intercepts pages before they hit the swap device. It compresses them and stores them in a RAM pool. Only when the pool fills up do pages get written to the actual swap device. This dramatically reduces swap I/O for compressible workloads.
v3.14 (2014): zram moved from staging to mainline.
Commit: cd67e10ac699 ("zram: promote zram from staging")
Author: Minchan Kim (LG Electronics)
Unlike zswap (which caches before a real swap device), zram is the swap device - a compressed RAM disk. The entire swap flow above happens, but the "disk" is actually compressed memory. This is popular on memory-constrained systems like Android phones and Chromebooks.
v5.8 (2020): Swap slot allocation improvements for SSD performance.
Commit: 490705888107 ("swap: reduce lock contention on swap cache from swap slots allocation")
Author: Huang Ying (Intel)
In Stage 1 (slot allocation) above, the kernel now uses a swap slot cache to batch slot allocations, reducing contention on SSDs where swap I/O is fast enough that allocation overhead becomes significant.
Modern swap improvements
The swap subsystem continues to evolve for better SSD utilization and reduced latency:
- Better slot allocation algorithms
- Improved swap readahead
- Integration with memory cgroups for per-cgroup swap accounting
Notorious bugs and edge cases
Swap involves complex interactions between page tables, swap slots, and caching. Race conditions in this code can cause data corruption or security vulnerabilities.
Case 1: The swap slot ABA problem (CVE-2024-26759)
What happened
In early 2024, Huang Ying (Intel) discovered a race condition in the "skip swapcache" optimization path that could cause data corruption.
The bug
The LKML discussion explains the ABA problem:
When SWP_SYNCHRONOUS_IO is enabled (common on fast NVMe drives), the kernel skips the swap cache for performance. But this creates a race:
- Thread T0: Starts swapping in entry X, allocates page A
- Thread T1: Also starts swapping in entry X (same entry), allocates page B
- Thread T1: Finishes first, installs page B, frees entry X via swap_free()
- Thread T1: Later swaps out a modified page, reuses entry X
- Thread T0: Finishes, sees PTE unchanged (still points to entry X)
- Thread T0: Installs stale page A into PTE
sequenceDiagram
participant T0 as Thread 0
participant T1 as Thread 1
participant Swap as Swap Slot X
T0->>Swap: Read entry X → Page A
T1->>Swap: Read entry X → Page B
T1->>Swap: Installs B, frees X
Note over Swap: Slot X is FREE
T1->>Swap: Swap out modified page → reuses X
Note over Swap: Slot X has NEW data
T0->>T0: PTE still says "swap entry X"
T0->>T0: Installs STALE page A
Note over T0: DATA CORRUPTION!
The PTE value hasn't changed (still swp_entry_t for slot X), so the pte_same() check passes, but the data is completely different.
Real-world implications
This is silent data corruption - applications would see old data instead of current data, with no errors reported. Databases, filesystems, or any application relying on data consistency could be affected.
The fix
Commit: 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
Author: Kairui Song
The fix adds proper locking and verification to prevent the ABA scenario when bypassing the swap cache.
Fixes commit: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
Case 2: ZSWAP race conditions
What happened
The zswap compressed swap cache introduced concurrency challenges. LKML bug reports documented theoretical races.
The bug
When a duplicate store and reclaim happen concurrently:
- Entry X with offset A exists in zswap
- Thread 0: Starts reclaiming entry X
- Thread 1: Stores a new page at offset A, allocates new entry Y
- Thread 0: Calls zswap_get_swap_cache_page() but gets the old data
The swap cache now contains old data for offset A instead of the new data.
Real-world implications
If do_swap_page() retrieves the page from swap cache, it returns stale data - another silent corruption scenario.
Mitigations
Zswap has undergone multiple locking improvements to address these races. The interactions between:
- Zswap entry lifecycle
- Swap cache operations
- Page writeback
...all need careful synchronization.
Case 3: Swap-over-NFS instability
What happened
Swapping to NFS has historically been problematic. When the system is under memory pressure and needs to swap, it also needs memory for network I/O.
The bug
The classic deadlock scenario:
- System under memory pressure
- Kernel tries to swap out pages to NFS
- NFS needs to allocate memory for network packets
- Memory allocation triggers more reclaim
- Deadlock: can't free memory without network, can't do network without memory
Mitigations
PF_MEMALLOC and memory reserves: The kernel reserves memory for critical I/O paths. Processes doing swap I/O can dip into reserves.
Swap-over-NBD patches: Multiple patches over the years improved network swap reliability.
Practical advice: Don't swap to NFS in production. Use local swap or zram.
Case 4: Swap partition vs. swap file races
What happened
Swap files introduce additional complexity compared to swap partitions because the filesystem layer is involved.
The bug
With swap files, swap_writepage() must go through the filesystem. This can race with:
- Filesystem operations on the same device
- Filesystem metadata updates
- Journal commits
Historical issues
- XFS + swap: Early XFS had issues with swap files due to delayed allocation
- Btrfs + swap: Only supported swap files from kernel 5.0 due to COW complexity
- F2FS: Added swap file support in kernel 5.10
Modern status
Most filesystems now support swap files correctly, but the added complexity means more potential for bugs compared to raw partitions.
Case 5: The swap readahead trap
What happened
Swap readahead (reading more than one page on swap-in) is an optimization that can backfire.
The problem
Unlike file readahead (where sequential access is common), swap access patterns are often random - especially on systems with many processes. Aggressive swap readahead:
- Reads pages that won't be used
- Evicts useful pages from memory
- Increases swap I/O
- Makes memory pressure worse
The bug class
Readahead tuning issues have caused performance regressions across kernel versions. The heuristics for "should we readahead?" are imperfect.
Mitigation
# Check current readahead (pages)
cat /proc/sys/vm/page-cluster
# Default: 3 (meaning 2^3 = 8 pages readahead)
# Disable swap readahead
echo 0 > /proc/sys/vm/page-cluster
# On SSDs, readahead matters less (fast random access)
# On HDDs, readahead helps if access is sequential
Case 6: Large folio + swap = data loss (2023)
What happened
The large folio support (kernel 6.1+) introduced a data loss bug affecting XFS and other filesystems. Organizations reported active data loss in production.
The bug
Large folios change how pages are tracked and written back. Interactions with the swap and writeback paths led to pages being marked clean when they were actually dirty, resulting in data loss.
From the LKML report:
"At least 16 instances of data loss in a fleet of 1.5k VMs... happens mostly on nodes running database or storage-oriented workloads, including PostgreSQL, MySQL, RocksDB."
Real-world implications
This is catastrophic for databases and storage systems. Data written to memory but not yet persisted could simply disappear.
The fix
Meta and other companies disabled large folios while fixes were developed:
Commit: a48d5bdc877b ("mm: fix for negative counter: nr_file_hugepages")
Author: Stefan Roesch
Multiple patches were required to fully address the issue.
Summary: Lessons learned
| Bug | Year | Root Cause | Impact | Prevention |
|---|---|---|---|---|
| Swap slot ABA | 2024 | Skip-swapcache race | Data corruption | Fixed in mainline |
| ZSWAP races | Ongoing | Compression + swap cache | Data corruption | Multiple fixes |
| Swap-over-NFS | Historical | Memory deadlock | System hang | Use local swap |
| Large folio + swap | 2023 | Writeback tracking | Data loss | Disable large folios or update kernel |
| Readahead trap | Ongoing | Wrong heuristics | Performance | Tune page-cluster |
The common thread: swap is a "slow path" that gets less testing than the normal memory path. When the system is under memory pressure, it's already in a degraded state, and bugs in swap handling compound the problem.
Further reading
Related docs
- Swap - Swap configuration and management
- Page reclaim - How pages are selected for swap
- Life of a page - Complete page lifecycle
- Running out of memory - What happens when swap isn't enough
LWN articles
- In defense of swap (2016) - Why swap matters
- zswap: compressed swap caching (2013) - Compressed swap
- Large folios for anonymous memory (2022) - The large folio work that led to bugs