Understanding /proc/vmstat
Counters and gauges for memory subsystem diagnostics
What is /proc/vmstat?
/proc/vmstat exposes the kernel's internal memory management counters. Unlike /proc/meminfo, which shows only current state, /proc/vmstat includes both gauges (current values) and counters (monotonically increasing since boot).
The output is generated by vmstat_show() in mm/vmstat.c, which aggregates per-zone, per-node, and per-CPU statistics.
$ cat /proc/vmstat | head -20
nr_free_pages 131072
nr_zone_inactive_anon 262144
nr_zone_active_anon 524288
nr_zone_inactive_file 196608
nr_zone_active_file 131072
...
pgfault 48392017
pgmajfault 12847
pgscan_kswapd 0
pgscan_direct 0
Counters vs Gauges
This distinction is critical for monitoring. /proc/vmstat mixes both types.
Gauges (current state — can go up or down)
Prefixed with nr_. Each reflects the current value rather than a cumulative total, so it can be read directly. Most are counted in pages.
| Field | What it measures |
|---|---|
| nr_free_pages | Free pages in the buddy allocator |
| nr_inactive_anon / nr_active_anon | Anonymous pages on LRU lists |
| nr_inactive_file / nr_active_file | File pages on LRU lists |
| nr_dirty | Pages modified but not yet written to disk |
| nr_writeback | Pages currently being written to disk |
| nr_slab_reclaimable | Slab pages the kernel can free |
| nr_slab_unreclaimable | Slab pages that cannot be freed |
| nr_page_table_pages | Pages used for page tables |
| nr_shmem | Shared memory / tmpfs pages |
| nr_file_pages | Total pages in the page cache |
| nr_anon_pages | Mapped anonymous pages |
| nr_unevictable | Locked/unevictable pages |
| nr_swapcached | Pages in the swap cache |
| nr_kernel_stack | Kernel stack usage (in KiB) |
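Because most gauges are page counts, comparing them with byte-based tools needs a conversion by the page size. The following is a minimal sketch, not a canonical tool: the helper name gauge_mib is hypothetical, the snapshot is read from stdin, and the page size is passed explicitly (for example from getconf PAGESIZE).

```shell
# gauge_mib FIELD PAGE_SIZE_BYTES < snapshot
# Prints the named gauge converted from pages to MiB (two decimals).
# Hypothetical helper. Do not use it for nr_kernel_stack, which is already in KiB.
gauge_mib() {
    awk -v field="$1" -v psize="$2" '
        $1 == field { printf "%.2f\n", $2 * psize / 1048576; found = 1 }
        END { if (!found) exit 1 }   # non-zero status if the field is absent
    '
}
```

Usage: `gauge_mib nr_free_pages "$(getconf PAGESIZE)" < /proc/vmstat`. With 131072 free pages and 4096-byte pages this prints 512.00.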
Counters (monotonically increasing since boot)
Defined in enum vm_event_item. You must compute deltas between two readings to get rates.
Gotcha: nr_dirtied and nr_written have the nr_ prefix but are actually counters (total pages dirtied/written since boot), not gauges.
Page Fault Counters
| Field | What it counts | Source |
|---|---|---|
| pgfault | Total page faults (minor + major) | handle_mm_fault() |
| pgmajfault | Major faults only (required disk I/O) | mm/memory.c, mm/filemap.c |
Minor faults = pgfault - pgmajfault. A minor fault means the page was already in memory (page cache or COW) and just needed a page table entry. A major fault means the kernel had to read from disk.
Diagnostic: If the pgmajfault / pgfault ratio, computed from deltas over a sampling interval rather than from since-boot totals, exceeds ~1%, I/O latency is likely impacting application performance.
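The ratio is easy to compute with awk. This is a sketch under stated assumptions: majfault_pct is a hypothetical helper name, and it works on whatever snapshot arrives on stdin, so feeding it /proc/vmstat directly gives the since-boot figure rather than a recent rate.

```shell
# majfault_pct < snapshot
# Prints pgmajfault as a percentage of pgfault, or "n/a" if no faults are recorded.
# Hypothetical helper name.
majfault_pct() {
    awk '
        $1 == "pgfault"    { total = $2 }
        $1 == "pgmajfault" { major = $2 }
        END {
            if (total > 0) printf "%.2f\n", major * 100 / total
            else print "n/a"
        }
    '
}
```

Usage: `majfault_pct < /proc/vmstat`. For a snapshot with pgfault 1000 and pgmajfault 25 it prints 2.50.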
Reclaim Counters
These are the most important counters for diagnosing memory pressure.
Scanning vs Stealing
| Field | What it counts |
|---|---|
| pgscan_kswapd | Pages examined by kswapd (background reclaim) |
| pgscan_direct | Pages examined by direct reclaim (allocating process, synchronous) |
| pgsteal_kswapd | Pages successfully reclaimed by kswapd |
| pgsteal_direct | Pages successfully reclaimed by direct reclaim |
All incremented in mm/vmscan.c in shrink_inactive_list() and related functions.
By page type:
| Field | What it counts |
|---|---|
| pgscan_anon | Anonymous pages scanned |
| pgscan_file | File-backed pages scanned |
| pgsteal_anon | Anonymous pages reclaimed |
| pgsteal_file | File-backed pages reclaimed |
Reclaim Efficiency
| Ratio (pgsteal / pgscan) | Meaning |
|---|---|
| > 0.8 | Efficient — plenty of cold/clean pages to reclaim |
| 0.3 - 0.8 | Moderate pressure — some pages are dirty or pinned |
| < 0.3 | Struggling — most scanned pages cannot be freed, OOM may follow |
Compute separately for kswapd and direct reclaim, and for anon vs file, to pinpoint the bottleneck.
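One way to get each of those ratios from a single snapshot is sketched below. The helper name reclaim_eff is hypothetical, and it assumes the pgscan_/pgsteal_ field pairs listed in the tables above are present in the snapshot on stdin.

```shell
# reclaim_eff SUFFIX < snapshot
# SUFFIX is one of: kswapd, direct, anon, file.
# Prints pgsteal_SUFFIX / pgscan_SUFFIX, or "n/a" when nothing was scanned.
# Hypothetical helper name.
reclaim_eff() {
    awk -v scan="pgscan_$1" -v steal="pgsteal_$1" '
        $1 == scan  { s = $2 }
        $1 == steal { t = $2 }
        END {
            if (s > 0) printf "%.2f\n", t / s
            else print "n/a"
        }
    '
}
```

Usage: `for s in kswapd direct anon file; do printf '%s ' "$s"; reclaim_eff "$s" < /proc/vmstat; done`. Run against /proc/vmstat this reports since-boot efficiency; run it on a file of deltas to see recent behaviour.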
Allocation Stalls
| Field | What it counts |
|---|---|
| allocstall_normal | Processes that entered direct reclaim in the Normal zone |
| allocstall_dma32 | Same for DMA32 zone |
| allocstall_dma | Same for DMA zone |
| allocstall_movable | Same for Movable zone |
Incremented in mm/page_alloc.c when __alloc_pages_direct_reclaim() is entered.
Every increment means a process blocked waiting for memory. This directly causes latency spikes. If allocstall_normal is incrementing at more than a few per second sustained, the system needs more free memory headroom.
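To turn the raw counters into a stalls-per-second figure, two snapshots can be compared. The sketch below assumes the snapshots were saved to files beforehand; the helper name allocstall_rate is hypothetical, and it sums every field whose name starts with allocstall.

```shell
# allocstall_rate BEFORE AFTER SECONDS
# Sums all allocstall_* counters in each snapshot file and prints
# the increase per second. Hypothetical helper name.
allocstall_rate() {
    awk -v secs="$3" '
        $1 ~ /^allocstall/ {
            if (NR == FNR) before += $2   # still reading the first file
            else           after  += $2
        }
        END { printf "%.1f\n", (after - before) / secs }
    ' "$1" "$2"
}
```

Usage: `cat /proc/vmstat > a; sleep 10; cat /proc/vmstat > b; allocstall_rate a b 10`.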
Swap Counters
| Field | What it counts |
|---|---|
| pswpin | Pages read from swap (swap-in) |
| pswpout | Pages written to swap (swap-out) |
| pgpgin | Data read from block devices, in KiB; includes filesystem I/O |
| pgpgout | Data written to block devices, in KiB |
pswpin/pswpout count swap operations only, in pages. pgpgin/pgpgout are broader (all block I/O) and are reported in KiB.
LRU Movement
| Field | What it counts |
|---|---|
| pgactivate | Pages promoted: inactive -> active (accessed while inactive) |
| pgdeactivate | Pages demoted: active -> inactive (candidate for reclaim) |
| pglazyfree | Pages marked for lazy free via MADV_FREE |
| pglazyfreed | Lazyfree pages actually reclaimed |
Compaction Counters
| Field | What it counts |
|---|---|
| compact_stall | Processes that blocked waiting for compaction |
| compact_success | Compaction runs that created a contiguous block |
| compact_fail | Compaction runs that failed |
| compact_migrate_scanned | Pages scanned by migration scanner |
| compact_free_scanned | Pages scanned by free scanner |
| compact_isolated | Pages isolated for migration |
| compact_daemon_wake | Times kcompactd was woken |
Source: mm/compaction.c.
Failure rate: compact_fail / (compact_fail + compact_success) above 50% means severe fragmentation with unmovable pages blocking compaction.
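The failure-rate formula above can be evaluated directly from a snapshot. This is an illustrative sketch; compact_fail_pct is a hypothetical helper name, and it reads the snapshot from stdin.

```shell
# compact_fail_pct < snapshot
# Prints compact_fail / (compact_fail + compact_success) as a percentage,
# or "n/a" when no compaction runs are recorded. Hypothetical helper name.
compact_fail_pct() {
    awk '
        $1 == "compact_fail"    { f = $2 }
        $1 == "compact_success" { s = $2 }
        END {
            if (f + s > 0) printf "%.1f\n", f * 100 / (f + s)
            else print "n/a"
        }
    '
}
```

Usage: `compact_fail_pct < /proc/vmstat`. A snapshot with compact_fail 30 and compact_success 10 prints 75.0.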
THP Counters
| Field | What it counts |
|---|---|
| thp_fault_alloc | THP allocated on page fault (synchronous) |
| thp_fault_fallback | THP allocation failed on fault, fell back to 4KB pages |
| thp_collapse_alloc | THP allocated by khugepaged (asynchronous collapse) |
| thp_collapse_alloc_failed | khugepaged failed to allocate THP |
| thp_split_page | THP split back into base pages |
| thp_split_pmd | PMD-level split (PTEs created to replace PMD entry) |
| thp_swpout | THP swapped out as a whole (since v4.13) |
| thp_swpout_fallback | THP swap-out fell back to splitting |
Source: mm/huge_memory.c, mm/khugepaged.c.
OOM Counter
| Field | What it counts |
|---|---|
| oom_kill | OOM kills: incremented each time the OOM killer kills a victim process |
Incremented in mm/oom_kill.c.
Diagnostic Patterns
Detecting memory pressure
# Watch reclaim activity (raw values refreshed every second; add -d to highlight changes)
watch -n 1 'grep -E "pgscan_|pgsteal_|allocstall" /proc/vmstat'
| Signal | Meaning |
|---|---|
| pgscan_kswapd increasing | Background reclaim active: moderate pressure |
| pgscan_direct increasing | Direct reclaim active: serious pressure, processes blocking |
| allocstall_normal increasing | Allocation stalls: latency impact on applications |
| pgsteal / pgscan ratio dropping | Reclaim efficiency falling: running out of easy-to-reclaim pages |
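The first two signals in the table can be reduced to a coarse label by comparing two saved snapshots. The sketch below is only one possible reading of the table: the helper name pressure_level and the three labels are hypothetical, and any increase at all counts as activity.

```shell
# pressure_level BEFORE AFTER
# Prints "direct" if pgscan_direct grew between the snapshot files,
# "kswapd" if only pgscan_kswapd grew, otherwise "none".
# Hypothetical helper name and labels.
pressure_level() {
    awk '
        NR == FNR { old[$1] = $2; next }            # first file: remember values
        $1 == "pgscan_kswapd" { k = $2 - old[$1] }
        $1 == "pgscan_direct" { d = $2 - old[$1] }
        END {
            if (d > 0)      print "direct"
            else if (k > 0) print "kswapd"
            else            print "none"
        }
    ' "$1" "$2"
}
```

Usage: `cat /proc/vmstat > a; sleep 5; cat /proc/vmstat > b; pressure_level a b`.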
Detecting swap thrashing
High pswpin + pswpout with high pgmajfault = the system is swapping actively and applications are faulting pages back in. The working set does not fit in RAM.
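This pattern can be checked mechanically by requiring all three counters to grow between two snapshots. The following is a rough sketch: swap_thrash is a hypothetical helper name, and the MIN_DELTA threshold is an assumption that has to be tuned per system, since the text above does not define what counts as high.

```shell
# swap_thrash BEFORE AFTER MIN_DELTA
# Prints "thrashing" when pswpin, pswpout and pgmajfault each grew by at
# least MIN_DELTA between the two snapshot files, otherwise "ok".
# Hypothetical helper; MIN_DELTA is an arbitrary, caller-chosen threshold.
swap_thrash() {
    awk -v min="$3" '
        NR == FNR { old[$1] = $2; next }            # first file: remember values
        $1 == "pswpin" || $1 == "pswpout" || $1 == "pgmajfault" {
            if ($2 - old[$1] >= min) hits++
        }
        END { if (hits == 3) print "thrashing"; else print "ok" }
    ' "$1" "$2"
}
```

Usage: `cat /proc/vmstat > a; sleep 10; cat /proc/vmstat > b; swap_thrash a b 1000`.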
Detecting THP problems
| Pattern | Problem |
|---|---|
| thp_fault_fallback rising rapidly | Memory too fragmented for 2MB allocations |
| thp_fault_alloc + thp_split_page both rising | THP churn: allocated then split, wasting CPU |
| compact_stall high + thp_fault_fallback high | Stalling on compaction but still failing |
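Fix: Switch THP to madvise mode (write madvise to /sys/kernel/mm/transparent_hugepage/enabled) or set defrag to defer (write defer to /sys/kernel/mm/transparent_hugepage/defrag).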
Detecting compaction problems
| Signal | Problem |
|---|---|
| compact_stall increasing | Processes blocking on compaction |
| compact_fail >> compact_success | Unmovable pages blocking defragmentation |
| compact_migrate_scanned >> compact_isolated | Migration scanner finds pages but cannot move them |
Using /proc/vmstat with Tools
Direct reading (most complete)
# Compute deltas (rate per second)
paste <(cat /proc/vmstat) <(sleep 1; cat /proc/vmstat) | \
awk '$1 == $3 && $2 != $4 {print $1, $4-$2, "delta/s"}'
vmstat command (procps)
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
| Column | /proc/vmstat source |
|---|---|
| si (swap in) | pswpin delta, converted to KB/s |
| so (swap out) | pswpout delta, converted to KB/s |
| bi (block in) | pgpgin delta |
| bo (block out) | pgpgout delta |
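The si/so conversion in the table can be approximated by hand from two saved snapshots. This is a sketch of the arithmetic only, not the procps implementation: the helper name swap_rates is hypothetical, and the page size is passed in explicitly (for example from getconf PAGESIZE).

```shell
# swap_rates BEFORE AFTER SECONDS PAGE_SIZE_BYTES
# Prints "si so" in KiB/s, derived from the pswpin/pswpout page deltas
# between two snapshot files. Hypothetical helper; results are truncated
# to whole KiB/s.
swap_rates() {
    awk -v secs="$3" -v psize="$4" '
        NR == FNR { old[$1] = $2; next }            # first file: remember values
        $1 == "pswpin"  { si = ($2 - old[$1]) * psize / 1024 / secs }
        $1 == "pswpout" { so = ($2 - old[$1]) * psize / 1024 / secs }
        END { printf "%d %d\n", si, so }
    ' "$1" "$2"
}
```

Usage: `cat /proc/vmstat > a; sleep 1; cat /proc/vmstat > b; swap_rates a b 1 "$(getconf PAGESIZE)"`.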
sar command (sysstat)
# Page fault and reclaim rates
sar -B 1
# fault/s = pgfault delta, majflt/s = pgmajfault delta
# pgscank/s = pgscan_kswapd delta, pgscand/s = pgscan_direct delta
# Swap rates
sar -W 1
# pswpin/s, pswpout/s from their respective counters
Try It Yourself
# See which counters are actively changing (memory pressure test)
# Terminal 1: generate pressure
stress-ng --vm 2 --vm-bytes 80% --timeout 30s
# Terminal 2: watch counters change
watch -d -n 1 'grep -E "pgscan|pgsteal|allocstall|pgmajfault|pswp|oom_kill|compact_stall" /proc/vmstat'
# Check reclaim efficiency
awk '/pgscan_kswapd/{scan=$2} /pgsteal_kswapd/{steal=$2} END{if(scan>0) printf "kswapd efficiency: %.1f%%\n", steal/scan*100; else print "No kswapd activity"}' /proc/vmstat
# Check THP success rate
awk '/thp_fault_alloc /{ok=$2} /thp_fault_fallback /{fail=$2} END{total=ok+fail; if(total>0) printf "THP success: %.1f%% (%d/%d)\n", ok/total*100, ok, total; else print "No THP faults"}' /proc/vmstat
Key Source Files
| File | What it contains |
|---|---|
| mm/vmstat.c | Output generation, per-CPU aggregation |
| include/linux/vm_event_item.h | Counter definitions |
| include/linux/mmzone.h | Gauge definitions (zone/node stat items) |
| mm/vmscan.c | Reclaim counters |
| mm/page_alloc.c | Allocation and stall counters |
| mm/compaction.c | Compaction counters |
Further Reading
- /proc/meminfo — complementary current-state view
- Reading an OOM log — what happens when these counters go critical
- Page reclaim — how kswapd and direct reclaim work
- vm sysctl docs — tunable parameters