Understanding /proc/vmstat

Counters and gauges for memory subsystem diagnostics

What is /proc/vmstat?

/proc/vmstat exposes the kernel's internal memory management counters. Unlike /proc/meminfo, which shows only current state, /proc/vmstat includes both gauges (current values) and counters (monotonically increasing since boot).

The output is generated by vmstat_show() in mm/vmstat.c, which aggregates per-zone, per-node, and per-CPU statistics.

$ head -20 /proc/vmstat
nr_free_pages 131072
nr_zone_inactive_anon 262144
nr_zone_active_anon 524288
nr_zone_inactive_file 196608
nr_zone_active_file 131072
...
pgfault 48392017
pgmajfault 12847
pgscan_kswapd 0
pgscan_direct 0

Counters vs Gauges

This distinction is critical for monitoring. /proc/vmstat mixes both types.

Gauges (current state — can go up or down)

Most are prefixed with nr_. Each reflects the current value, not a cumulative total, so a single read gives the current state.

Field	What it measures
`nr_free_pages`	Free pages in the buddy allocator
`nr_inactive_anon` / `nr_active_anon`	Anonymous pages on LRU lists
`nr_inactive_file` / `nr_active_file`	File pages on LRU lists
`nr_dirty`	Pages modified but not yet written to disk
`nr_writeback`	Pages currently being written to disk
`nr_slab_reclaimable`	Slab pages the kernel can free
`nr_slab_unreclaimable`	Slab pages that cannot be freed
`nr_page_table_pages`	Pages used for page tables
`nr_shmem`	Shared memory / tmpfs pages
`nr_file_pages`	Total pages in the page cache
`nr_anon_pages`	Mapped anonymous pages
`nr_unevictable`	Locked/unevictable pages
`nr_swapcached`	Pages in the swap cache
`nr_kernel_stack`	Kernel stack usage (in KiB)

Counters (monotonically increasing since boot)

Defined in enum vm_event_item. You must compute deltas between two readings to get rates.

Gotcha: nr_dirtied and nr_written have the nr_ prefix but are actually counters (total pages dirtied/written since boot), not gauges.
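Because counters only ever grow, a rate comes from subtracting two readings and dividing by the time between them. The following is a minimal sketch of that calculation, assuming two snapshot files saved from /proc/vmstat a known number of seconds apart; the helper name vmstat_rate and its argument order are illustrative, not part of any tool.

```shell
# vmstat_rate BEFORE AFTER SECONDS FIELD
# Prints the average per-second rate of FIELD between two saved snapshots.
# Hypothetical helper: BEFORE and AFTER are assumed to be plain copies of
# /proc/vmstat taken SECONDS apart.
vmstat_rate() {
    LC_ALL=C awk -v secs="$3" -v field="$4" '
        NR == FNR   { if ($1 == field) before = $2; next }  # first file: old value
        $1 == field { after = $2; found = 1 }               # second file: new value
        END {
            if (!found || secs <= 0) exit 1
            printf "%.1f\n", (after - before) / secs
        }
    ' "$1" "$2"
}
```

One way to use it: run `cat /proc/vmstat > /tmp/a; sleep 10; cat /proc/vmstat > /tmp/b`, then `vmstat_rate /tmp/a /tmp/b 10 pgfault`. Applying the same subtraction to a gauge gives its net change, not a rate of events.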

Page Fault Counters

Field	What it counts	Source
`pgfault`	Total page faults (minor + major)	`handle_mm_fault()`
`pgmajfault`	Major faults only (required disk I/O)	`mm/memory.c`, `mm/filemap.c`

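Minor faults = pgfault - pgmajfault. A minor fault is resolved without disk I/O: the page was already in memory (page cache hit or COW) or could be created on the spot (a zero-filled anonymous page), so the kernel only had to set up a page table entry. A major fault means the kernel had to read from disk.

Diagnostic: If the pgmajfault / pgfault ratio, computed from deltas over the same interval, exceeds ~1%, I/O latency is likely impacting application performance.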
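The arithmetic above can be sketched as a small awk helper. It assumes a file in /proc/vmstat format (either /proc/vmstat itself, which gives since-boot totals, or a file of deltas you produced yourself); the name fault_mix is hypothetical.

```shell
# fault_mix FILE
# Prints minor faults, major faults and the major-fault share in percent.
# Hypothetical helper: FILE is assumed to contain "name value" lines as in
# /proc/vmstat.
fault_mix() {
    LC_ALL=C awk '
        $1 == "pgfault"    { total = $2 }
        $1 == "pgmajfault" { major = $2 }
        END {
            if (total <= 0) { print "no page faults recorded"; exit }
            # minor faults are everything that was not a major fault
            printf "minor=%d major=%d major_share=%.3f%%\n", total - major, major, major / total * 100
        }
    ' "$1"
}
```

With the sample values shown at the top of this page (pgfault 48392017, pgmajfault 12847) the major share is a small fraction of one percent, well below the ~1% rule of thumb.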

Reclaim Counters

These are the most important counters for diagnosing memory pressure.

Scanning vs Stealing

Field	What it counts
`pgscan_kswapd`	Pages examined by kswapd (background reclaim)
`pgscan_direct`	Pages examined by direct reclaim (allocating process, synchronous)
`pgsteal_kswapd`	Pages successfully reclaimed by kswapd
`pgsteal_direct`	Pages successfully reclaimed by direct reclaim

All incremented in mm/vmscan.c in shrink_inactive_list() and related functions.

By page type:

Field	What it counts
`pgscan_anon`	Anonymous pages scanned
`pgscan_file`	File-backed pages scanned
`pgsteal_anon`	Anonymous pages reclaimed
`pgsteal_file`	File-backed pages reclaimed

Reclaim Efficiency

efficiency = pgsteal / pgscan

Ratio	Meaning
> 0.8	Efficient — plenty of cold/clean pages to reclaim
0.3 - 0.8	Moderate pressure — some pages are dirty or pinned
< 0.3	Struggling — most scanned pages cannot be freed, OOM may follow

Compute separately for kswapd and direct reclaim, and for anon vs file, to pinpoint the bottleneck.
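As a rough illustration of computing the ratio per source and per page type, the sketch below reads a /proc/vmstat-format file and labels each ratio with the bands from the table above. The helper name reclaim_efficiency and the band labels are illustrative; it assumes the pgscan_* and pgsteal_* fields listed earlier are present on your kernel.

```shell
# reclaim_efficiency FILE
# Prints steal/scan and a band label for kswapd, direct, anon and file.
# Hypothetical helper: FILE is a vmstat-format file. /proc/vmstat itself
# gives the average since boot; a file of deltas gives the current interval.
reclaim_efficiency() {
    LC_ALL=C awk '
        $1 ~ /^pgscan_/  { scan[substr($1, 8)]  = $2 }  # strip "pgscan_"
        $1 ~ /^pgsteal_/ { steal[substr($1, 9)] = $2 }  # strip "pgsteal_"
        END {
            n = split("kswapd direct anon file", keys, " ")
            for (i = 1; i <= n; i++) {
                k = keys[i]
                if (!(k in scan) || scan[k] <= 0) { printf "%s: no scanning\n", k; continue }
                r = steal[k] / scan[k]
                if (r > 0.8)       band = "efficient"
                else if (r >= 0.3) band = "moderate"
                else               band = "struggling"
                printf "%s: %.2f %s\n", k, r, band
            }
        }
    ' "$1"
}
```

Only the four suffixes named in the tables are reported, so unrelated fields that happen to start with pgscan_ are ignored.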

Allocation Stalls

Field	What it counts
`allocstall_normal`	Direct reclaim entries for allocations whose highest allowed zone is Normal
`allocstall_dma32`	Same for DMA32 zone
`allocstall_dma`	Same for DMA zone
`allocstall_movable`	Same for Movable zone

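Counted once each time an allocation falls into direct reclaim: __alloc_pages_direct_reclaim() in mm/page_alloc.c calls into mm/vmscan.c, where recent kernels increment the counter in do_try_to_free_pages().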

Every increment means a process blocked waiting for memory. This directly causes latency spikes. If allocstall_normal is incrementing at more than a few per second sustained, the system needs more free memory headroom.
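To compare against the "few per second" rule of thumb, the stall counters for all zones can be summed and turned into a rate. This is a sketch under the assumption that you have two /proc/vmstat snapshots saved a known number of seconds apart; allocstall_rate is a hypothetical helper name.

```shell
# allocstall_rate BEFORE AFTER SECONDS
# Prints direct-reclaim stalls per second, summed over all allocstall_* fields.
# Hypothetical helper: BEFORE and AFTER are assumed to be copies of
# /proc/vmstat taken SECONDS apart.
allocstall_rate() {
    LC_ALL=C awk -v secs="$3" '
        # subtract values from the first file, add values from the second
        $1 ~ /^allocstall_/ { sum += (NR == FNR ? -$2 : $2) }
        END {
            if (secs <= 0) exit 1
            printf "%.2f\n", sum / secs
        }
    ' "$1" "$2"
}
```

A single high reading is less telling than the same rate persisting across several consecutive intervals.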

Swap Counters

Field	What it counts
`pswpin`	Pages read from swap (swap-in)
`pswpout`	Pages written to swap (swap-out)
`pgpgin`	Data read from block devices, reported in KiB (not pages, despite the name); includes filesystem I/O
`pgpgout`	Data written to block devices, reported in KiB

pswpin/pswpout are specifically swap operations. pgpgin/pgpgout are broader (all block I/O).

LRU Movement

Field	What it counts
`pgactivate`	Pages promoted: inactive -> active (accessed while inactive)
`pgdeactivate`	Pages demoted: active -> inactive (candidate for reclaim)
`pglazyfree`	Pages marked for lazy free via `MADV_FREE`
`pglazyfreed`	Lazyfree pages actually reclaimed

Compaction Counters

Field	What it counts
`compact_stall`	Times a process stalled to run direct compaction
`compact_success`	Compaction runs that created a contiguous block
`compact_fail`	Compaction runs that failed
`compact_migrate_scanned`	Pages scanned by migration scanner
`compact_free_scanned`	Pages scanned by free scanner
`compact_isolated`	Pages isolated for migration
`compact_daemon_wake`	Times kcompactd was woken

Source: mm/compaction.c.

Failure rate: compact_fail / (compact_fail + compact_success) above 50% means severe fragmentation with unmovable pages blocking compaction.
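The failure-rate formula can be sketched as follows. The helper name compaction_failure_rate and the "severe"/"ok" labels are illustrative; the 50% cut-off is the one stated above, and the input is assumed to be a /proc/vmstat-format file.

```shell
# compaction_failure_rate FILE
# Prints compact_fail / (compact_fail + compact_success) as a percentage.
# Hypothetical helper: FILE is /proc/vmstat (since boot) or a file of deltas.
compaction_failure_rate() {
    LC_ALL=C awk '
        $1 == "compact_success" { ok = $2 }
        $1 == "compact_fail"    { fail = $2 }
        END {
            total = ok + fail
            if (total == 0) { print "no compaction attempts"; exit }
            rate = fail / total * 100
            printf "%.1f%% %s\n", rate, (rate > 50 ? "severe" : "ok")
        }
    ' "$1"
}
```

Run against /proc/vmstat this reports the average since boot; feeding it deltas shows whether fragmentation is a problem right now.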

THP Counters

Field	What it counts
`thp_fault_alloc`	THP allocated on page fault (synchronous)
`thp_fault_fallback`	THP allocation failed on fault, fell back to base (4 KiB) pages
`thp_collapse_alloc`	THP allocated by khugepaged (asynchronous collapse)
`thp_collapse_alloc_failed`	khugepaged failed to allocate THP
`thp_split_page`	THP split back into base pages
`thp_split_pmd`	PMD-level split (PTEs created to replace PMD entry)
`thp_swpout`	THP swapped out as a whole (since v4.13)
`thp_swpout_fallback`	THP swap-out fell back to splitting

Source: mm/huge_memory.c, mm/khugepaged.c.

OOM Counter

Field	What it counts
`oom_kill`	Times the OOM killer killed a process

Incremented in mm/oom_kill.c.

Diagnostic Patterns

Detecting memory pressure

# Watch reclaim activity (refreshes every second; values are cumulative, -d highlights the ones that changed)
watch -d -n 1 'grep -E "pgscan_|pgsteal_|allocstall" /proc/vmstat'

Signal	Meaning
`pgscan_kswapd` increasing	Background reclaim active — moderate pressure
`pgscan_direct` increasing	Direct reclaim active — serious pressure, processes blocking
`allocstall_normal` increasing	Allocation stalls — latency impact on applications
`pgsteal / pgscan` ratio dropping	Reclaim efficiency falling — running out of easy-to-reclaim pages

Detecting swap thrashing

grep -E "pswpin|pswpout|pgmajfault" /proc/vmstat

High pswpin + pswpout with high pgmajfault = the system is swapping actively and applications are faulting pages back in. The working set does not fit in RAM.
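Since all three fields are counters, "high" has to be judged from deltas. The sketch below compares two saved snapshots and flags the pattern when all three counters rose by at least a minimum amount. The helper name swap_thrash_check and the MIN_DELTA parameter (default 1) are assumptions for illustration; the text above does not define a numeric threshold, so pick one that suits your workload.

```shell
# swap_thrash_check BEFORE AFTER [MIN_DELTA]
# Prints the deltas of pswpin, pswpout and pgmajfault, then a verdict.
# Hypothetical helper: MIN_DELTA is an illustrative threshold, not a kernel value.
swap_thrash_check() {
    LC_ALL=C awk -v min="${3:-1}" '
        NR == FNR { old[$1] = $2; next }       # first file: remember every value
        $1 == "pswpin" || $1 == "pswpout" || $1 == "pgmajfault" { d[$1] = $2 - old[$1] }
        END {
            printf "pswpin=%d pswpout=%d pgmajfault=%d\n", d["pswpin"], d["pswpout"], d["pgmajfault"]
            if (d["pswpin"] >= min && d["pswpout"] >= min && d["pgmajfault"] >= min)
                print "thrashing suspected"
            else
                print "no thrashing pattern"
        }
    ' "$1" "$2"
}
```

Swap-out alone usually means cold pages are being evicted; it is the combination with swap-in and major faults in the same interval that points at a working set larger than RAM.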

Detecting THP problems

grep "thp_" /proc/vmstat

Pattern	Problem
`thp_fault_fallback` rising rapidly	Memory too fragmented for 2MB allocations
`thp_fault_alloc` + `thp_split_page` both rising	THP churn — allocated then split, wasting CPU
`compact_stall` high + `thp_fault_fallback` high	Stalling on compaction but still failing

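Fix: Switch to madvise mode or set defrag=defer (both require root):

echo madvise > /sys/kernel/mm/transparent_hugepage/enabled   # THP only where madvise() requests it
echo defer > /sys/kernel/mm/transparent_hugepage/defrag      # or: leave compaction to the background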

Detecting compaction problems

Signal	Problem
`compact_stall` increasing	Processes blocking on compaction
`compact_fail >> compact_success`	Unmovable pages blocking defragmentation
`compact_migrate_scanned >> compact_isolated`	Migration scanner finds pages but cannot move them

Using /proc/vmstat with Tools

Direct reading (most complete)

# Print the 1-second delta of every field that changed (bash; for gauges this is the net change)
paste <(cat /proc/vmstat) <(sleep 1; cat /proc/vmstat) | \
  awk '$1 == $3 && $2 != $4 {print $1, $4-$2, "delta/s"}'

vmstat command (procps)

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st

Column	/proc/vmstat source
`si` (swap in)	`pswpin` delta, converted to KB/s
`so` (swap out)	`pswpout` delta, converted to KB/s
`bi` (block in)	`pgpgin` delta
`bo` (block out)	`pgpgout` delta
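The unit conversion for si and so in the table above can be illustrated with plain shell arithmetic: pswpin and pswpout count pages, so a delta is multiplied by the page size and divided by the interval. This is a sketch of the arithmetic only, not vmstat's actual code; the helper name swap_kib_per_sec is hypothetical, and the page size defaults to whatever getconf PAGESIZE reports.

```shell
# swap_kib_per_sec PAGES SECONDS [PAGE_SIZE_BYTES]
# Converts a pswpin or pswpout delta (in pages) into KiB per second.
# Hypothetical helper; uses integer arithmetic, so results are truncated.
swap_kib_per_sec() {
    page_bytes=${3:-$(getconf PAGESIZE)}
    echo $(( $1 * page_bytes / 1024 / $2 ))
}
```

For instance, with 4096-byte pages a delta of 256 pages over one second corresponds to 1024 KiB/s.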

sar command (sysstat)

# Page fault and reclaim rates
sar -B 1
# fault/s = pgfault delta, majflt/s = pgmajfault delta
# pgscank/s = pgscan_kswapd delta, pgscand/s = pgscan_direct delta

# Swap rates
sar -W 1
# pswpin/s, pswpout/s from their respective counters

Try It Yourself

# See which counters are actively changing (memory pressure test)
# Terminal 1: generate pressure (stress-ng accepts a percentage for --vm-bytes; plain stress does not)
stress-ng --vm 2 --vm-bytes 80% --timeout 30s

# Terminal 2: watch counters change
watch -d -n 1 'grep -E "pgscan|pgsteal|allocstall|pgmajfault|pswp|oom_kill|compact_stall" /proc/vmstat'

# Check kswapd reclaim efficiency (average since boot)
awk '/pgscan_kswapd/{scan=$2} /pgsteal_kswapd/{steal=$2} END{if(scan>0) printf "kswapd efficiency: %.1f%%\n", steal/scan*100; else print "No kswapd activity"}' /proc/vmstat

# Check THP success rate
awk '/thp_fault_alloc /{ok=$2} /thp_fault_fallback /{fail=$2} END{total=ok+fail; if(total>0) printf "THP success: %.1f%% (%d/%d)\n", ok/total*100, ok, total; else print "No THP faults"}' /proc/vmstat

Key Source Files

File	What it contains
`mm/vmstat.c`	Output generation, per-CPU aggregation
`include/linux/vm_event_item.h`	Counter definitions
`include/linux/mmzone.h`	Gauge definitions (zone/node stat items)
`mm/vmscan.c`	Reclaim counters
`mm/page_alloc.c`	Allocation and stall counters
`mm/compaction.c`	Compaction counters