OOM scenarios and debugging guide
Practical playbook for preventing, detecting, and diagnosing out-of-memory kills
This is the hands-on companion to Running out of memory, which covers the kernel internals. This page focuses on what to do in practice.
Preventing OOM
Size memory for your workload
# Baseline: check current usage and headroom
free -h
grep -E "MemTotal|MemAvailable|SwapTotal|SwapFree" /proc/meminfo
# Per-process RSS (top consumers)
ps aux --sort=-%mem | head -20
# Sum of all process RSS (approximate committed memory)
awk '/Rss:/ {sum += $2} END {print sum/1024, "MB"}' /proc/[0-9]*/smaps_rollup 2>/dev/null
Rule of thumb: keep MemAvailable above 10-15% of MemTotal during peak load. If it regularly drops below that, add RAM or reduce workload.
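The headroom check is easy to script; a minimal sketch that prints MemAvailable as a percentage of MemTotal, for comparison against the 10-15% rule:

```shell
# Print current headroom as a percentage of total memory
awk '/^MemTotal/ {t=$2} /^MemAvailable/ {a=$2}
     END {printf "headroom: %.1f%% of %d MB\n", a * 100 / t, t / 1024}' /proc/meminfo
```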
Add swap
Running without swap is the single most common cause of unnecessary OOM kills. Even a small swap gives the kernel a pressure valve for cold anonymous pages.
# Check for existing swap
swapon --show
# Create a swap file (4 GB example)
# Note: fallocate may not work on Btrfs or XFS with reflinks.
# Use dd as a portable alternative: dd if=/dev/zero of=/swapfile bs=1M count=4096
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Make permanent
echo '/swapfile none swap sw 0 0' >> /etc/fstab
# Tune swappiness (default 60; lower = prefer reclaiming page cache)
sysctl vm.swappiness=30
See What happens during swapping for details on how the kernel uses swap.
Set cgroup memory limits
Cgroup limits contain runaway processes before they pressure the entire system. See Memory Cgroups for the full picture.
# systemd service (recommended)
# In the unit file [Service] section:
# MemoryMax=2G
# MemoryHigh=1800M
# Manual cgroup v2
echo 2G > /sys/fs/cgroup/myservice/memory.max
echo 1800M > /sys/fs/cgroup/myservice/memory.high
memory.high is a soft throttle -- the kernel slows allocations before hitting the hard memory.max limit. Always set both.
Tune watermarks
# Give kswapd more room to work (default is often too small for servers)
sysctl vm.min_free_kbytes=262144 # 256 MB
sysctl vm.watermark_scale_factor=50 # 0.5% of zone memory per step
See the watermark discussion in Running out of memory.
Detecting approaching OOM
Catching memory pressure before OOM fires is far better than debugging after the fact.
Monitor PSI (Pressure Stall Information)
PSI (available since kernel 4.20) is the best single metric for memory pressure.
# Current pressure (some = at least one task stalled, full = all tasks stalled)
cat /proc/pressure/memory
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# Per-cgroup pressure
cat /sys/fs/cgroup/myservice/memory.pressure
Operational guidelines (these are heuristics, not kernel-defined limits — tune for your workload):
| some avg10 | Meaning |
|---|---|
| < 5% | Healthy |
| 5-20% | Moderate pressure, investigate |
| > 20% | Severe pressure, OOM likely if sustained |
PSI trigger for automated response (cgroup v2):
# PSI triggers require holding the fd open — a shell echo will not work
# because closing the fd destroys the trigger (psi_fop_release() calls
# psi_trigger_destroy()). Use a program that holds the fd open and
# calls poll()/epoll(). Example trigger format:
# "some 500000 2000000" (500ms of stall time within a 2s window)
# See kernel docs: Documentation/accounting/psi.rst
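Since a shell echo cannot keep the fd open, one option is a small helper program. The sketch below (file path and thresholds are examples, not part of any standard tooling) arms a trigger and blocks in poll() until it fires:

```shell
# Write a hypothetical helper script that holds the trigger fd open
cat > /tmp/psi-watch.py <<'EOF'
import os, select

# Open for read/write: the write arms the trigger, and the fd must
# stay open -- closing it destroys the trigger.
fd = os.open("/proc/pressure/memory", os.O_RDWR | os.O_NONBLOCK)
os.write(fd, b"some 500000 2000000\0")  # 500ms of stall within a 2s window

p = select.poll()
p.register(fd, select.POLLPRI)
while True:
    for _, ev in p.poll():              # blocks until the trigger fires
        if ev & select.POLLERR:
            raise SystemExit("PSI trigger fd error")
        print("memory pressure threshold crossed")
EOF
# Run it (requires CONFIG_PSI; arming the trigger may need root):
# python3 /tmp/psi-watch.py
```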
Track MemAvailable
# One-liner: log MemAvailable every 10 seconds
while true; do
date +%H:%M:%S | tr '\n' ' '
awk '/MemAvailable/ {print $2/1024, "MB"}' /proc/meminfo
sleep 10
done
# Alert if MemAvailable drops below 5%
awk '/MemTotal/ {total=$2} /MemAvailable/ {avail=$2}
END {pct=avail*100/total; if (pct < 5) print "WARNING: MemAvailable at " pct "%"}' /proc/meminfo
Watch vmstat counters
# Real-time view (1-second interval)
vmstat 1
# Key columns:
# si/so = swap in/out (KB/s) -- nonzero means active swapping
# free = free memory (KB)
# buff/cache = reclaimable memory
# Detailed counters for reclaim activity
watch -n 1 'grep -E "allocstall|pgscan_direct|pgsteal|oom_kill" /proc/vmstat'
| Counter | What it means |
|---|---|
| allocstall_normal | Direct reclaim events (allocating process blocked) |
| pgscan_direct | Pages scanned during direct reclaim |
| pgsteal_direct | Pages successfully reclaimed in direct reclaim |
| oom_kill | Total OOM kills since boot |
If allocstall is rising, the system is under memory pressure and headed toward OOM.
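A rate is more telling than the raw counter; a quick sketch that samples the allocstall counters twice and prints the delta:

```shell
# Sum the allocstall_* counters, wait, then sample again
sample() { awk '/^allocstall/ {s += $2} END {print s + 0}' /proc/vmstat; }
a=$(sample)
sleep 5
b=$(sample)
echo "direct-reclaim stalls in the last 5s: $((b - a))"
```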
Analyzing OOM after the fact
Reading the OOM kill log
# Find OOM events in kernel log
dmesg -T | grep -i "out of memory"
dmesg -T | grep -i "oom"
# On systemd systems (persisted across reboots)
journalctl -k --grep="Out of memory"
journalctl -k --grep="oom" --since="2024-01-01"
A typical OOM message looks like:
[Thu Jan 4 03:14:15 2024] myapp invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[...]
[Thu Jan 4 03:14:15 2024] Mem-Info:
[...]
[Thu Jan 4 03:14:15 2024] Out of memory: Killed process 12345 (myapp)
total-vm:4194304kB, anon-rss:3145728kB, file-rss:2048kB,
shmem-rss:0kB, UID:1000 pgtables:8192kB oom_score_adj:0
Key fields to check:
| Field | What to look for |
|---|---|
| gfp_mask | Allocation type that triggered OOM. GFP_KERNEL = kernel allocation, GFP_HIGHUSER_MOVABLE = userspace page |
| order | 0 = single page (real OOM). Higher orders may be fragmentation, not true OOM |
| anon-rss | Anonymous resident memory of the killed process |
| oom_score_adj | Was the victim protected or targeted? |
| Mem-Info section | System-wide memory state at time of kill |
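For quick triage, the interesting fields can be pulled straight out of the kill line with grep; a sketch over the sample message shown above:

```shell
# Sample kill line in the format the kernel logs
line='Out of memory: Killed process 12345 (myapp) total-vm:4194304kB, anon-rss:3145728kB, file-rss:2048kB, shmem-rss:0kB, UID:1000 pgtables:8192kB oom_score_adj:0'
echo "$line" | grep -oE 'process [0-9]+ \([^)]+\)'     # victim PID and name
echo "$line" | grep -oE '(anon|file|shmem)-rss:[0-9]+kB'  # RSS breakdown
```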
order > 0 kills
If order is greater than 0, the system may not actually be out of memory -- it could be a fragmentation issue. Check if MemAvailable was still reasonable in the Mem-Info dump. See Compaction for details.
Correlate with monitoring
The OOM log only captures a snapshot. Correlate with external monitoring to understand the trend:
# Check if the cgroup hit its limit (not system-wide OOM)
cat /sys/fs/cgroup/myservice/memory.events
# Look for oom and oom_kill counters > 0
# Check peak usage
cat /sys/fs/cgroup/myservice/memory.peak
# Was swap available?
dmesg -T | grep -A 50 "invoked oom-killer" | grep -i swap
Questions to answer:
- Was it system-wide or cgroup-scoped? Check if the OOM log mentions a specific cgroup.
- Was swap available? If SwapFree was 0 in the Mem-Info dump, adding swap would have helped.
- Was memory steadily growing (leak) or did it spike? Correlate with your time-series monitoring.
- Which process was actually consuming memory? The killed process is the largest, not necessarily the leaker.
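To separate steady growth from a spike without historical data, repeated RSS samples of a suspect process give a crude trend (the pid here is the current shell, purely as an example):

```shell
pid=$$   # substitute the suspect process's PID
r1=$(awk '/^VmRSS/ {print $2}' /proc/$pid/status)
sleep 5
r2=$(awk '/^VmRSS/ {print $2}' /proc/$pid/status)
echo "RSS: ${r1} kB -> ${r2} kB (delta $((r2 - r1)) kB over 5s)"
```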
Common misconfigurations
No swap
Without swap, anonymous pages (heap, stack) cannot be reclaimed at all. The kernel can only reclaim page cache, and once that is exhausted, OOM is immediate.
This applies even on SSD-backed systems: the kernel preferentially swaps cold pages that haven't been touched in minutes or hours, so swap traffic is typically light.
Too many OOM-protected processes
Setting oom_score_adj=-1000 on multiple processes leaves the OOM killer with no good victims.
# Find all OOM-protected processes
for pid in /proc/[0-9]*; do
adj=$(cat "$pid/oom_score_adj" 2>/dev/null)
if [ "$adj" = "-1000" ]; then
comm=$(cat "$pid/comm" 2>/dev/null)
echo "PID $(basename $pid): $comm (oom_score_adj=$adj)"
fi
done
If the only killable processes are tiny (low RSS), the OOM killer may loop without freeing enough memory, or the system may hang. Protect only the one or two truly critical processes (e.g., database, init).
Cgroup limits too tight
A cgroup limit that matches normal usage with no headroom causes OOM kills during minor spikes.
# Check how close the cgroup is to its limit
echo "Current: $(cat /sys/fs/cgroup/myservice/memory.current)"
echo "Max: $(cat /sys/fs/cgroup/myservice/memory.max)"
echo "Peak: $(cat /sys/fs/cgroup/myservice/memory.peak)"
# Check for throttling events (memory.high breaches)
cat /sys/fs/cgroup/myservice/memory.events | grep high
Set memory.max at least 20-30% above typical peak usage. Use memory.high at the actual target to get throttling (graceful slowdown) rather than kills.
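As a worked example of that sizing rule (the numbers are illustrative), with an observed peak of 1500 MB:

```shell
peak_mb=1500                          # observed peak usage
max_mb=$(( peak_mb + peak_mb / 4 ))   # 25% headroom above peak
echo "memory.high = ${peak_mb}M"      # throttle at the actual target
echo "memory.max  = ${max_mb}M"       # hard limit with headroom
```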
Overcommit misconfiguration
# Check overcommit policy
cat /proc/sys/vm/overcommit_memory
# 0 = heuristic (default, usually fine)
# 1 = always overcommit (dangerous: no malloc failures, just OOM kills)
# 2 = strict accounting (may reject allocations too aggressively)
cat /proc/sys/vm/overcommit_ratio # Used with mode 2 (percentage of RAM)
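In strict mode, the limit the kernel enforces is (ignoring hugepages) CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100, visible in /proc/meminfo. A worked example with illustrative numbers:

```shell
mem_kb=8388608    # 8 GB RAM (example)
swap_kb=2097152   # 2 GB swap (example)
ratio=50          # default vm.overcommit_ratio
limit_kb=$(( swap_kb + mem_kb * ratio / 100 ))
echo "CommitLimit: $(( limit_kb / 1024 )) MB"   # 6144 MB -- less than RAM + swap
```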
See Memory Overcommit for the full explanation.
Testing OOM handling
stress-ng: controlled memory pressure
# Allocate 90% of available memory across 4 workers
stress-ng --vm 4 --vm-bytes 90% --vm-keep --timeout 60s
# Trigger OOM deliberately (allocate more than available)
stress-ng --vm 2 --vm-bytes 120% --vm-keep --timeout 30s
# Gradual pressure ramp (useful with monitoring)
stress-ng --vm 1 --vm-bytes 50% --vm-method all --vm-keep --timeout 120s
Cgroup-based controlled OOM
Test OOM within a cgroup without risking the host:
# Create a test cgroup with a small limit
mkdir -p /sys/fs/cgroup/oom-test
echo 100M > /sys/fs/cgroup/oom-test/memory.max
echo $$ > /sys/fs/cgroup/oom-test/cgroup.procs
# Allocate more than the limit (will trigger cgroup OOM)
stress-ng --vm 1 --vm-bytes 200M --vm-keep --timeout 10s
# Check that OOM fired within the cgroup
cat /sys/fs/cgroup/oom-test/memory.events | grep oom
# Clean up: move shell back to root cgroup
echo $$ > /sys/fs/cgroup/cgroup.procs
rmdir /sys/fs/cgroup/oom-test
oom_score_adj manipulation
# Make a specific process the OOM target
echo 1000 > /proc/$PID/oom_score_adj
# Verify scores
for pid in $(pgrep -x myapp); do
echo "PID $pid: score=$(cat /proc/$pid/oom_score) adj=$(cat /proc/$pid/oom_score_adj)"
done
Verify your application handles OOM gracefully
# Check if a process restarted after OOM (systemd)
systemctl status myservice
# Look for: "Main process exited, code=killed, status=9/KILL"
# Confirm systemd will restart it
systemctl show myservice | grep Restart=
Userspace OOM handlers
systemd-oomd
systemd-oomd (introduced in systemd 247, stabilized in 248) kills processes based on PSI pressure and swap usage before the kernel OOM killer fires. It acts earlier and more predictably.
# Check if running
systemctl status systemd-oomd
# View configuration
cat /etc/systemd/oomd.conf
# [OOM]
# SwapUsedLimit=90%
# DefaultMemoryPressureDurationSec=20s
# Per-service overrides (in unit file)
# [Service]
# ManagedOOMSwap=kill
# ManagedOOMMemoryPressure=kill
# ManagedOOMMemoryPressureLimit=80%
How it differs from the kernel OOM killer:
| Aspect | Kernel OOM | systemd-oomd |
|---|---|---|
| Trigger | Allocation failure after reclaim exhaustion | PSI threshold or swap usage |
| Scope | System-wide or per-cgroup | Per-cgroup (systemd slices) |
| Reaction time | Last resort (late) | Configurable (early) |
| Victim selection | Highest oom_score | Highest memory usage in cgroup |
| Configuration | /proc/*/oom_score_adj | Unit file directives |
Other userspace handlers
earlyoom -- lightweight daemon, popular on desktops:
# Install and configure
# earlyoom monitors MemAvailable% and SwapFree%
# Default: kill when both drop below 10%
earlyoom -m 5 -s 5 # Kill at 5% memory and 5% swap remaining
nohang -- more configurable alternative to earlyoom, supports PSI triggers and custom actions per-process.
When to use userspace OOM handlers
- Desktops/workstations: earlyoom or systemd-oomd prevent the system from becoming unresponsive.
- Container hosts: systemd-oomd with per-slice policies gives predictable behavior.
- Servers with critical services: userspace handlers can make smarter kill decisions than the kernel.
Quick reference
One-page diagnostic checklist
# 1. Is swap available?
swapon --show
# 2. Current memory state
free -h
# 3. Memory pressure (kernel 4.20+)
cat /proc/pressure/memory
# 4. Top memory consumers
ps aux --sort=-%mem | head -10
# 5. Recent OOM events
dmesg -T | grep -i "out of memory"
# 6. Direct reclaim activity (rising = trouble)
grep -E "allocstall|oom_kill" /proc/vmstat
# 7. Cgroup limits and events
for cg in /sys/fs/cgroup/*/; do
if [ -f "$cg/memory.max" ]; then
max=$(cat "$cg/memory.max")
[ "$max" != "max" ] && echo "$(basename $cg): current=$(cat $cg/memory.current) max=$max"
fi
done
# 8. OOM-protected processes
grep "^-1000$" /proc/*/oom_score_adj 2>/dev/null | head -20
Further reading
- Running out of memory -- kernel OOM internals (watermarks, reclaim stages, OOM killer scoring)
- Memory Cgroups -- cgroup memory limits and accounting
- Memory Overcommit -- how Linux promises more memory than it has
- What happens during swapping -- the swap subsystem in detail
- Page Reclaim -- how the kernel reclaims memory before resorting to OOM