What Happens When a Container Hits Its Memory Limit
The full sequence from normal operation to OOM kill and recovery
The Story in Brief
A container's memory limit is not a wall that appears without warning. The kernel's cgroup memory controller imposes a graduated series of interventions before killing anything: soft throttling at memory.high, escalating reclaim pressure, PSI signals visible to observability tooling, and finally — only if all else fails — a cgroup-scoped OOM kill that is entirely separate from the global OOM killer.
Understanding each stage helps you set limits intelligently, write alerting that fires early enough to be useful, and diagnose OOM kills that already happened.
flowchart TD
A["memory.current well below memory.high\nNormal operation"]
B["memory.current approaches memory.high\nReclaim begins; PSI 'some' rises"]
C["memory.current exceeds memory.high\nProcess throttled; sleeps until usage drops"]
D["Reclaim keeps up\nUsage pulled back below memory.high"]
E["Reclaim cannot keep up\nmemory.current approaches memory.max"]
F["memory.current exceeds memory.max\nCgroup OOM invoked"]
G["OOM victim killed\nMemory freed; cgroup resumes"]
A --> B --> C
C --> D --> A
C --> E --> F --> G --> A
This page follows that path step by step. See Memory Cgroups for how the accounting works, Tuning Containers for how to configure limits, and Running out of memory for the global OOM path that operates outside cgroups.
Stage 1: Normal Operation (below memory.high)
While memory.current stays below memory.high, the cgroup memory controller is effectively invisible to running processes. Allocations succeed at normal speed; kswapd runs at the system level for global pressure; no per-cgroup throttling occurs.
The kernel tracks per-cgroup memory usage via struct mem_cgroup in include/linux/memcontrol.h. Every page charged to a cgroup updates the per-cpu counters attached to that struct via mod_lruvec_state() in mm/memcontrol.c. The counters are periodically synced to the global mem_cgroup->vmstats array that memory.stat reads.
# Check current usage and the limits in effect
CGROUP=/sys/fs/cgroup/mycontainer
cat $CGROUP/memory.current # usage now (bytes)
cat $CGROUP/memory.high # soft limit
cat $CGROUP/memory.max # hard limit
# Normal state: current << high
The PSI file at $CGROUP/memory.pressure will show all zeros during this stage — no tasks are stalling on memory.
Stage 2: Approaching memory.high — Throttling Begins
When an allocation would push memory.current past memory.high, the kernel enters mem_cgroup_handle_over_high() in mm/memcontrol.c. This function is called from the task's return-to-userspace path (not from the charge function directly — the over_high check is deferred to avoid reclaiming on every allocation).
The throttling mechanism
mem_cgroup_handle_over_high() does two things:
1. Triggers synchronous reclaim on the current cgroup. It calls reclaim_high(), which calls try_to_free_mem_cgroup_pages() to reclaim pages from the cgroup's LRU lists. This is direct reclaim scoped to a single cgroup — it does not touch other cgroups.
2. Throttles the allocating process if reclaim did not bring usage back below memory.high. The process is put to sleep for a calculated number of jiffies, yielding the CPU. The sleep duration scales with how far over memory.high the cgroup is — a small overage causes a short sleep; a large overage causes a proportionally longer one (this scaling is a heuristic, not a fixed table).
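The quantity that drives the penalty, how far usage is over memory.high, is easy to compute from the two interface files. A minimal sketch (the byte values are illustrative; the kernel's actual jiffy calculation is a separate heuristic in mm/memcontrol.c and is not reproduced here):

```shell
#!/bin/sh
# Sketch: compute the percentage by which a cgroup is over memory.high.
# In real use the two arguments come from reading $CGROUP/memory.current
# and $CGROUP/memory.high.
overage_pct() {
    current=$1   # bytes currently charged to the cgroup
    high=$2      # memory.high in bytes
    if [ "$current" -le "$high" ]; then
        echo 0
    else
        # integer percentage over the high limit
        echo $(( (current - high) * 100 / high ))
    fi
}

overage_pct 1258291200 1073741824   # ~1.17 GiB against a 1 GiB high -> 17
```

A cgroup sitting at 17% over memory.high will see noticeably longer throttle sleeps than one at 1% over, which is the feedback that normally pulls usage back down.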
The throttling is per-process, not per-cgroup
Each process that allocates while the cgroup is over memory.high incurs its own sleep. A single-threaded application serializes its own allocations naturally. A massively multi-threaded application may have many threads sleeping simultaneously.
What you observe externally
From outside the container, the process simply appears slower — it spends wall-clock time sleeping inside the kernel instead of running. PSI begins to register:
cat $CGROUP/memory.pressure
# some avg10=8.50 avg60=2.10 avg300=0.40 total=128000
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
some goes above zero because at least one task is stalling on memory. full stays zero as long as there is at least one runnable task in the cgroup not blocked on memory. See PSI for how these are computed.
The memory.events counter high increments each time memory.high is crossed:
cat $CGROUP/memory.events
# low 0
# high 34 <- number of times memory.high was exceeded
# max 0
# oom 0
# oom_kill 0
Set memory.high as your first alert threshold
A rising high counter with no oom events means the container is exercising the throttle path regularly. This is the ideal place to add more memory or tune the application — before it gets worse.
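One way to act on this is to sample memory.events periodically and alert on the change in the high counter between samples. A sketch under the assumption that two snapshots have been captured as text (the here-documents stand in for successive reads of $CGROUP/memory.events, and the threshold of 5 crossings per interval is illustrative):

```shell
#!/bin/sh
# Sketch: alert when the 'high' counter in memory.events rises too fast.
# Field names match the cgroup v2 memory.events format shown above.
events_high() {
    # extract the 'high' counter from a memory.events snapshot on stdin
    awk '$1 == "high" { print $2 }'
}

before=$(events_high <<'EOF'
low 0
high 34
max 0
oom 0
oom_kill 0
EOF
)
after=$(events_high <<'EOF'
low 0
high 41
max 0
oom 0
oom_kill 0
EOF
)
delta=$(( after - before ))
[ "$delta" -gt 5 ] && echo "ALERT: $delta memory.high crossings this interval"
```

In production the two snapshots would be `cat $CGROUP/memory.events` at consecutive poll intervals; the point is to compare deltas, since the counter itself only ever increases.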
Stage 3: Sustained Throttling — Reclaim, PSI, and Feedback
If the workload continues to allocate faster than reclaim can free pages, memory.current remains near or above memory.high for an extended period. The system enters a self-balancing loop:
Allocation attempt
│
▼
mem_cgroup_handle_over_high()
│
├─► try_to_free_mem_cgroup_pages()
│ │
│ ├─► shrink_lruvec() — walk LRU, reclaim cold pages
│ └─► wakeup_flusher_threads() — write dirty pages back
│
└─► sleep (throttle if still over high)
try_to_free_mem_cgroup_pages() calls shrink_lruvec() in mm/vmscan.c which walks the cgroup's LRU lists in priority order: inactive file cache first (cheapest to reclaim), then active file cache, then — if no reclaimable file pages remain — anonymous pages (which may require swap I/O).
What gets reclaimed and in what order
| Page type | Reclaimable? | Notes |
|---|---|---|
| Inactive file-backed (page cache) | Yes, always | Cleanest reclaim — no I/O if not dirty |
| Dirty file-backed | Yes, after writeback | Requires flushing to disk first |
| Active file-backed | Yes, after LRU aging | Kernel tries to preserve hot pages |
| Anonymous (heap, stack) | Only if swap is available | Triggers swap I/O if swapped out |
| memory.min-protected pages | No | Hard floor, never reclaimed |
| memory.low-protected pages | Not unless critical | Best-effort floor |
| Shared memory (tmpfs, shmem) | Yes | Charged to cgroup that faulted it in |
| Kernel slab (kmem) | Partially | Reclaimable dentries and inodes via shrink_slab() |
PSI pressure builds
As reclaim consumes CPU time and swap I/O delays resume of throttled tasks, PSI pressure accumulates:
cat $CGROUP/memory.pressure
# some avg10=45.00 avg60=22.00 avg300=8.00 total=3200000
# full avg10=12.00 avg60=5.00 avg300=1.00 total=890000
A full value above zero means there were moments when every task in the cgroup was simultaneously blocked on memory — no useful work was happening at all. This is a serious signal.
PSI is the foundation of proactive memory management tools like Meta's oomd and systemd-oomd, which monitor memory.pressure and kill workloads before the kernel OOM killer fires. See PSI for how to set up a memory.pressure notification trigger.
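A minimal version of what such tools do is to parse the some avg10 figure out of memory.pressure and compare it against a threshold. The sample text below mirrors the output shown above; the 20% threshold is an illustrative assumption, not a recommendation:

```shell
#!/bin/sh
# Sketch: extract 'some avg10' from memory.pressure and compare it to a
# threshold. In real use the sample would be: cat $CGROUP/memory.pressure
psi_some_avg10() {
    # second field of the 'some' line is avg10=<value>
    awk '$1 == "some" { sub(/avg10=/, "", $2); print $2 }'
}

sample='some avg10=45.00 avg60=22.00 avg300=8.00 total=3200000
full avg10=12.00 avg60=5.00 avg300=1.00 total=890000'

avg10=$(printf '%s\n' "$sample" | psi_some_avg10)
# awk handles the floating-point comparison for us
over=$(awk -v v="$avg10" 'BEGIN { if (v > 20.0) print "yes"; else print "no" }')
echo "some avg10=$avg10 over-threshold=$over"
```

Tools like systemd-oomd apply the same idea with configurable thresholds and durations rather than a single point-in-time check.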
memory.reclaim for proactive relief
If you detect rising high counts or PSI pressure through monitoring, you can manually trigger proactive reclaim without waiting for the kernel to throttle processes:
# Reclaim 200MB from the cgroup (v5.19+, commit 94968384dde1)
echo 200M > /sys/fs/cgroup/mycontainer/memory.reclaim
This was introduced in commit 94968384dde1 ("memcg: introduce per-memcg reclaim interface") by Shakeel Butt (LWN article). It is a write-only interface that triggers try_to_free_mem_cgroup_pages() directly from userspace, useful for pre-emptive memory management in orchestrators.
Stage 4: Exceeding memory.max — Cgroup OOM
If reclaim from the throttle path cannot keep memory.current below memory.high, and usage continues to rise until it reaches memory.max, the kernel invokes a different path: the cgroup OOM killer.
Entry point: mem_cgroup_out_of_memory()
When an allocation fails after exhausting cgroup reclaim, mem_cgroup_oom() in mm/memcontrol.c is called, which leads to mem_cgroup_out_of_memory(). It:

1. Calls out_of_memory() in mm/oom_kill.c with oom_memcg set — this signals a cgroup-scoped OOM.
2. select_bad_process() is called with the cgroup constraint, so it only considers tasks within the target cgroup.
3. After the victim is selected, oom_kill_process() calls mem_cgroup_get_oom_group() to check whether memory.oom.group is set — if so, the entire cgroup subtree is killed, not just the victim.
How cgroup OOM differs from global OOM
| | Global OOM | Cgroup OOM |
|---|---|---|
| Trigger | System-wide allocation fails after direct reclaim | Cgroup memory.max exceeded after cgroup reclaim |
| Candidate pool | All processes on the system | Only processes in the affected cgroup (and descendants) |
| Score basis | totalpages = system RAM + swap | totalpages = cgroup memory.max |
| Kill prefix in dmesg | Out of memory: | Memory cgroup out of memory: |
| Constraint field | CONSTRAINT_NONE | CONSTRAINT_MEMCG |
| Memory dump | Full zone/node Mem-Info | Cgroup memory stats via mem_cgroup_print_oom_meminfo() |
| Other containers affected | Potentially | No — completely isolated |
The OOM kill sequence
flowchart TD
A["Allocation fails\nafter cgroup reclaim exhausted"]
B["mem_cgroup_oom()\nin mm/memcontrol.c"]
C["mem_cgroup_print_oom_meminfo()\nDumps cgroup memory.stat to dmesg"]
D["select_bad_process()\nScores tasks within cgroup only"]
E["oom_kill_process()\nSends SIGKILL to victim"]
F["__oom_kill_process()\nLogs kill line, delivers signal"]
G["Victim exits\nPages freed back to cgroup"]
A --> B --> C --> D --> E --> F --> G
What dmesg shows for a cgroup OOM
[ 891.112245] Memory cgroup out of memory: Killed process 12001 (myapp) \
total-vm:2097152kB, anon-rss:1048320kB, file-rss:512kB, \
shmem-rss:0kB, UID:1000 pgtables:2048kB oom_score_adj:0
Key differences from a global OOM log:
- The line begins with Memory cgroup out of memory: rather than Out of memory:
- The constraint=CONSTRAINT_MEMCG field in the kill line names the cgroup path
- The memory state section shows cgroup stats, not global zone stats
- The task table lists only processes inside the cgroup
See Reading an OOM log for a full annotated breakdown of these fields.
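When scanning logs programmatically, the prefix alone distinguishes the two OOM paths. A small sketch that classifies a kill line by that prefix (the sample line is illustrative):

```shell
#!/bin/sh
# Sketch: classify an OOM kill line from dmesg by its prefix.
# The cgroup pattern is checked first; capitalization keeps the two
# patterns from overlapping in any case.
classify_oom() {
    case "$1" in
        *"Memory cgroup out of memory:"*) echo cgroup ;;
        *"Out of memory:"*)               echo global ;;
        *)                                echo none ;;
    esac
}

line='[  891.112245] Memory cgroup out of memory: Killed process 12001 (myapp)'
classify_oom "$line"   # -> cgroup
```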
memory.oom.group: kill the whole container at once
By default, the OOM killer selects a single process within the cgroup. For multi-process containers (web servers with worker pools, sidecars, etc.), killing one process often leaves the rest in a broken state.
Commit 3d8b38eb81ca ("mm, oom: introduce memory.oom.group", Roman Gushchin, kernel v4.19) introduced memory.oom.group:
When set, mem_cgroup_get_oom_group() walks up the cgroup hierarchy to find the highest ancestor with oom_group set, then oom_cgroup_kill() sends SIGKILL to every task in that subtree (except those with oom_score_adj = -1000), and the oom_group_kill counter in memory.events is incremented.
oom.group and nested cgroups
When memory.oom.group is set on an ancestor, all tasks in all descendant cgroups are killed, not just the cgroup that triggered the OOM. Design your cgroup hierarchy accordingly if you have mixed-criticality workloads sharing a parent.
memory.events after an OOM
cat /sys/fs/cgroup/mycontainer/memory.events
# low 0
# high 152
# max 8 <- number of times memory.max was hit
# oom 3 <- OOM events (one per OOM invocation)
# oom_kill 3 <- individual processes killed
# oom_group_kill 0
max counts how many times usage reached memory.max — this includes attempts that triggered reclaim successfully before requiring an OOM kill. oom and oom_kill count only cases where the OOM killer actually ran.
Stage 5: Recovery After OOM Kill
When the victim process exits, its pages are freed back to the cgroup. The sequence:
1. The process receives SIGKILL. SIGKILL cannot be caught or ignored, so the kernel terminates the process directly.
2. exit_mm() is called, which calls mmput(), which drops the mm_struct reference count.
3. When the count reaches zero, __mmput() runs exit_mmap(), which calls unmap_vmas() to walk all VMAs and free_pgtables() to release page tables.
4. Each page is unmapped and freed via release_pages(), which calls mem_cgroup_uncharge() (which takes a struct folio * in current kernels) to subtract from the cgroup's counter.
5. memory.current drops. If it falls below memory.max, further allocations can succeed.
Will the container recover automatically?
It depends on what was killed and what the container runtime does:
Single-process container (PID 1 killed): The container runtime detects that PID 1 exited and restarts the container according to its restart policy (e.g., --restart=always in Docker, restartPolicy: Always in Kubernetes).
Multi-process container (worker killed, not PID 1): If memory.oom.group is not set, only the selected process dies. The container may continue running in a degraded state — this is the most dangerous case, because the OOM kill is silent from the perspective of container health checks unless you explicitly monitor memory.events.
memory.oom.group = 1: All processes die. The runtime detects the exit of PID 1 and performs a clean restart.
# Watch for OOM events in a running container (poll memory.events)
watch -n 5 "cat /sys/fs/cgroup/mycontainer/memory.events | grep oom"
# Or use inotify (not all kernels support it on cgroupfs)
# Most production setups poll memory.events at a fixed interval via their metrics agent
The memory.events oom_kill counter does not reset
memory.events counters are monotonically increasing since the cgroup was created. They do not reset when an OOM kill occurs or when the container restarts (as long as the cgroup is reused). Some orchestrators recreate the cgroup on restart, giving you a fresh baseline. Others reuse it. Check which behavior your runtime uses before building alerting on absolute event counts.
Alert on rate, not absolute count
Build alerts on oom_kill events per unit time (e.g., rate over 5 minutes) rather than the raw counter value. A container that had one OOM kill three weeks ago should not be permanently in alert state.
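A sketch of that rate calculation (the counter values and the 300-second poll interval are illustrative; in production they would come from two successive reads of memory.events):

```shell
#!/bin/sh
# Sketch: convert two oom_kill counter samples into an events-per-minute
# rate, so a stale absolute count from weeks ago does not keep an alert
# permanently firing.
prev=3        # oom_kill at the previous poll (sample value)
curr=5        # oom_kill now (sample value)
interval=300  # seconds between polls (sample value)

rate_per_min=$(awk -v d=$((curr - prev)) -v s="$interval" \
    'BEGIN { printf "%.2f\n", d * 60 / s }')
echo "$rate_per_min"   # -> 0.40
```

Alert when the rate is nonzero over the most recent window, not when the raw counter exceeds some absolute value.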
The Full Sequence at a Glance
memory.current (grows ───────────────────────────────────────────────►)
│ Normal operation │ Throttle zone │ OOM zone
│ (below memory.high) │ (above memory.high) │ (above memory.max)
│ │ │
│ Fast allocations │ - Reclaim runs │ - mem_cgroup_oom()
│ No PSI pressure │ - Process sleeps │ - select_bad_process()
│ memory.events unchanged │ - PSI 'some' rises │ - SIGKILL to victim
│ │ - high counter rises │ - oom/oom_kill++
│ │ │
├───────────────────────────────┼───────────────────────┼──────────────────────
memory.high memory.max
Practical Configuration
The gap between memory.high and memory.max is the throttle zone. A well-configured container reaches memory.high under load, gets throttled and reclaimed back, and never reaches memory.max. This design is explained in Tuning Containers.
CGROUP=/sys/fs/cgroup/mycontainer
# Profile first: observe memory.peak under realistic load (v5.19+)
cat $CGROUP/memory.peak # highest memory.current ever seen
# Set max at ~1.5x observed peak (heuristic: leaves headroom for spikes)
echo 1500M > $CGROUP/memory.max
# Set high at ~80% of max (heuristic: gives throttle zone before OOM)
echo 1200M > $CGROUP/memory.high
# Kill group on OOM for clean restart
echo 1 > $CGROUP/memory.oom.group
# Allow some swap so anonymous pages can be reclaimed under pressure
echo 512M > $CGROUP/memory.swap.max
The 80% and 1.5x figures are heuristics
There is no universally correct gap between memory.high and memory.max. The right value depends on your application's allocation burst size and how fast your reclaim path is. Observe memory.pressure and memory.events under load to calibrate.
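For completeness, the arithmetic behind the example above, with the 1.5x and 80% ratios kept as the explicitly heuristic inputs they are:

```shell
#!/bin/sh
# Sketch of the sizing heuristic: max = 1.5x observed peak, high = 80% of
# max. The ratios are the article's starting-point heuristics, not rules.
peak_mb=1000   # observed memory.peak under realistic load, in MB (sample)

max_mb=$(( peak_mb * 3 / 2 ))    # 1.5x peak
high_mb=$(( max_mb * 8 / 10 ))   # 80% of max

echo "memory.max=${max_mb}M memory.high=${high_mb}M"
# -> memory.max=1500M memory.high=1200M
```

Plugging in a 1000 MB peak reproduces the 1500M/1200M values used in the configuration example; after deploying, watch memory.pressure and the high counter under load and adjust the ratios from there.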
Key Kernel Functions (mm/memcontrol.c)
| Function | What it does |
|---|---|
| mem_cgroup_charge() | Called on every page allocation to update cgroup counters and check limits |
| mem_cgroup_handle_over_high() | Entry point for high-limit throttling; calls reclaim_high() then sleeps |
| reclaim_high() | Calls try_to_free_mem_cgroup_pages() to free pages from the cgroup |
| mem_cgroup_oom() | Entry point for max-limit OOM; selects victim and invokes out_of_memory() |
| mem_cgroup_get_oom_group() | Determines the target cgroup for OOM (respects memory.oom.group) |
| mem_cgroup_print_oom_meminfo() | Dumps cgroup memory stats to dmesg during OOM |
| mem_cgroup_uncharge() | Called on page free to subtract from cgroup counters |
Further Reading
- Memory Cgroups — How the accounting works and what is charged
- Tuning Containers — How to set memory.high, memory.max, and memory.min correctly
- Reading an OOM log — How to parse the dmesg output from a cgroup OOM kill
- Running out of memory — The global OOM path: watermarks, kswapd, direct reclaim
- PSI — How pressure stall information is computed and how to use it for alerting