Memory Cgroups (memcg)
Container memory limits and accounting
What Is memcg?
Memory cgroups (memcg) allow limiting and tracking memory usage for groups of processes. They're the foundation of container memory limits (Docker, Kubernetes, systemd).
Container A (limit: 1GB) Container B (limit: 2GB)
┌─────────────────────┐ ┌─────────────────────┐
│ Process 1 │ │ Process 3 │
│ Process 2 │ │ Process 4 │
│ [shared memory] │ │ [shared memory] │
└─────────────────────┘ └─────────────────────┘
│ │
└────────────┬───────────────┘
│
Memory Controller
(tracks & enforces)
cgroup v1 vs v2
Linux has two cgroup implementations:
| Feature | v1 | v2 |
|---|---|---|
| Hierarchy | Multiple (one per controller) | Single unified |
| Memory+IO coordination | Difficult | Integrated |
| Pressure notification | Limited | PSI (Pressure Stall Info) |
| Status on modern systems | Legacy | Default |
v2 is recommended for new deployments. This document focuses on v2.
# Check which version is mounted
mount | grep cgroup
# v1: /sys/fs/cgroup/memory, /sys/fs/cgroup/cpu, etc.
# v2: /sys/fs/cgroup (unified)
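A quicker check is the filesystem type mounted at /sys/fs/cgroup:
# Alternative: check the mounted filesystem type
stat -fc %T /sys/fs/cgroup/
# cgroup2fs -> v2; tmpfs -> v1 hierarchy mount point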
Basic Operations
Create a cgroup
# Enable the memory controller for children of the root cgroup
echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
# Create a new cgroup
mkdir /sys/fs/cgroup/mygroup
# Move a process into it
echo $PID > /sys/fs/cgroup/mygroup/cgroup.procs
Set Memory Limit
# Set 500MB limit (cgroup v2)
echo 500M > /sys/fs/cgroup/mygroup/memory.max
# Set soft limit (throttling and reclaim kick in above this)
echo 400M > /sys/fs/cgroup/mygroup/memory.high
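Limits can be read back at any time, and writing the literal string max removes one:
# Read the current limit ("max" means unlimited)
cat /sys/fs/cgroup/mygroup/memory.max
# Remove the hard limit
echo max > /sys/fs/cgroup/mygroup/memory.max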
View Usage
# Current memory usage
cat /sys/fs/cgroup/mygroup/memory.current
# Detailed statistics
cat /sys/fs/cgroup/mygroup/memory.stat
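memory.current reports bytes; numfmt (coreutils) makes it human-readable:
# Human-readable usage
numfmt --to=iec < /sys/fs/cgroup/mygroup/memory.current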
Memory Limits (v2)
| File | Description |
|---|---|
| memory.max | Hard limit. OOM kill if exceeded. |
| memory.high | Soft limit. Throttle and reclaim aggressively. |
| memory.low | Protection. Best-effort preservation under pressure. |
| memory.min | Hard protection. Never reclaim below this. |
memory.max
│
OOM kill ──────>│
│
memory.high
│
Aggressive ───────>│
reclaim │
│
memory.low
│
Protected ─────────>│ (best-effort)
│
memory.min
│
Cannot ───────────>│ (hard guarantee)
reclaim
Example: Container Limits
# Production container setup
echo 1G > /sys/fs/cgroup/container/memory.max # Hard limit
echo 800M > /sys/fs/cgroup/container/memory.high # Start throttling
echo 200M > /sys/fs/cgroup/container/memory.low # Protect this much
What's Accounted
memcg tracks:
| Type | Accounted | Notes |
|---|---|---|
| Anonymous pages | Yes | Heap, stack, mmap(MAP_ANONYMOUS) |
| File cache | Yes | Pages from file reads |
| Slab (kmem) | Yes (v2) | Kernel objects for this cgroup |
| Huge pages | Separate | Via hugetlb controller |
| Kernel stacks | Yes (v2) | Per-task kernel stacks |
Memory Statistics
cat /sys/fs/cgroup/mygroup/memory.stat
# Key fields:
# anon - Anonymous memory
# file - File cache
# slab - Kernel slab objects
# sock - Network socket buffers
# pgfault - Page faults
# pgmajfault - Major page faults (disk I/O)
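A quick way to pull a single field from memory.stat, for example anonymous memory:
# Extract one field (value is in bytes)
awk '$1 == "anon" { print $2 }' /sys/fs/cgroup/mygroup/memory.stat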
Per-cgroup Reclaim
When a cgroup approaches its limit, reclaim happens within that cgroup first.
Global memory OK
│
v
Cgroup A at memory.high
│
v
Reclaim from Cgroup A only
(other cgroups unaffected)
memory.reclaim (v5.19+)
Proactively trigger reclaim:
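# Ask the kernel to try to reclaim ~100MB from this cgroup (v5.19+)
echo 100M > /sys/fs/cgroup/mygroup/memory.reclaim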
Useful for:
- Pre-warming before a load spike
- Reducing memory before migration
- Testing reclaim behavior
OOM Handling
When a cgroup exceeds memory.max:
- Kernel tries reclaim within cgroup
- If insufficient, triggers cgroup OOM
- OOM killer selects process within cgroup
- Selected process is killed
# View OOM events
cat /sys/fs/cgroup/mygroup/memory.events
# Fields:
# oom - OOM events count
# oom_kill - Processes killed by OOM
# max - Times memory.max was hit
# high - Times memory.high was hit
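Cgroup OOM kills also show up in the kernel log; a quick way to spot recent ones:
# Kernel log lines for memcg OOM kills
dmesg | grep -i "memory cgroup out of memory"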
OOM Group Kill (v4.19+)
Commit: 3d8b38eb81ca ("mm, oom: introduce memory.oom.group") | LKML
Author: Roman Gushchin
Kill entire cgroup instead of single process:
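# Kill all tasks in the cgroup together when it OOMs
echo 1 > /sys/fs/cgroup/mygroup/memory.oom.group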
Useful for containers where killing one process leaves others broken.
Pressure Stall Information (PSI)
PSI shows how much time tasks spend waiting for memory:
cat /sys/fs/cgroup/mygroup/memory.pressure
# avg10=0.00 avg60=0.00 avg300=0.00 total=12345
# some: percentage of time at least one task was stalled
# full: percentage of time all tasks were stalled
Use Cases
- Autoscaling: Scale up when pressure increases
- Health checks: Detect memory-constrained containers
- Load balancing: Move workloads from pressured nodes
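A minimal polling sketch of the health-check idea above, assuming a cgroup at /sys/fs/cgroup/mygroup and an arbitrary 10% threshold:
# Warn when short-term memory pressure crosses a threshold
THRESHOLD=10.0
while sleep 5; do
    avg10=$(awk '/^some/ { sub("avg10=", "", $2); print $2 }' \
        /sys/fs/cgroup/mygroup/memory.pressure)
    # numeric comparison via awk (exits 0 when avg10 > threshold)
    awk -v a="$avg10" -v t="$THRESHOLD" 'BEGIN { exit !(a > t) }' &&
        echo "high memory pressure: some avg10=${avg10}%"
done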
Hierarchical Limits
Cgroups are hierarchical. Child limits can't exceed parent:
root (limit: 8GB)
├── container-a (limit: 2GB)
│ ├── app (limit: 1GB) <- effective: 1GB
│ └── sidecar (limit: 3GB) <- effective: 2GB (parent limit)
└── container-b (limit: 4GB)
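A hypothetical sketch of building part of the hierarchy above (names are illustrative). Controllers must be enabled at each level before children can set limits:
# Enable the memory controller for root's children, then build the tree
echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
sudo mkdir /sys/fs/cgroup/container-a
echo 2G | sudo tee /sys/fs/cgroup/container-a/memory.max
echo "+memory" | sudo tee /sys/fs/cgroup/container-a/cgroup.subtree_control
sudo mkdir /sys/fs/cgroup/container-a/app
echo 1G | sudo tee /sys/fs/cgroup/container-a/app/memory.max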
Swap Control
# Limit swap usage (v2)
echo 0 > /sys/fs/cgroup/mygroup/memory.swap.max # No swap
echo 500M > /sys/fs/cgroup/mygroup/memory.swap.max # 500MB swap
# View swap usage
cat /sys/fs/cgroup/mygroup/memory.swap.current
History
Memory Controller Introduction (v2.6.25, 2008)
Commit: 8cdea7c05454 ("Memory controller: cgroups setup")
Author: Balbir Singh
Note: Pre-2009 LKML archives on lore.kernel.org are sparse. The commit message documents the design.
Initial memory cgroup implementation for cgroup v1.
Unified Hierarchy (cgroup v2, v4.5, 2016)
The cgroup v2 unified hierarchy was marked non-experimental in v4.5, with the memory controller considered stable for production use. The memory controller was reworked significantly from v1, with cleaner semantics; pressure stall information (PSI) followed in v4.20.
Kernel Memory Accounting (v4.5+)
Kernel memory (slab, stacks) included in memory.current by default in v2.
memory.reclaim (v5.19, 2022)
Commit: 94968384dde1 ("memcg: introduce per-memcg reclaim interface") | LKML
Author: Shakeel Butt
Proactive reclaim interface.
Try It Yourself
Create a Memory-Limited Shell
# Create cgroup
sudo mkdir /sys/fs/cgroup/test
# Enable memory controller (if needed)
echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
# Set 100MB limit
echo 100M | sudo tee /sys/fs/cgroup/test/memory.max
# Move current shell into cgroup
echo $$ | sudo tee /sys/fs/cgroup/test/cgroup.procs
# Now try to allocate more than 100MB
python3 -c "x = 'a' * 150_000_000"  # ~150MB; should be OOM-killed (unless swap absorbs it)
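When finished, move the shell back out before removing the cgroup (a cgroup can only be removed while empty):
# Clean up: return the shell to the root cgroup, then remove the group
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/test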
Monitor a Container
# Find the container's cgroup path (it starts with a slash)
CGROUP=$(cut -d: -f3 /proc/<container-pid>/cgroup)
# Watch memory usage
watch -n 1 cat /sys/fs/cgroup${CGROUP}/memory.current
# View detailed stats
cat /sys/fs/cgroup/$CGROUP/memory.stat
Simulate Memory Pressure
# Create cgroup with low limit
sudo mkdir /sys/fs/cgroup/pressure-test
echo 50M | sudo tee /sys/fs/cgroup/pressure-test/memory.max
# Move the current shell into the cgroup
echo $$ | sudo tee /sys/fs/cgroup/pressure-test/cgroup.procs
# Run a memory-hungry process (requires the 'stress' utility)
stress --vm 1 --vm-bytes 100M
# Watch pressure
cat /sys/fs/cgroup/pressure-test/memory.pressure
Common Issues
Container OOM Despite Free System Memory
Container hit its memory.max limit.
Debug: Check memory.events for oom count.
Solutions:
- Increase limit
- Optimize application memory usage
- Allow swap by raising memory.swap.max
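The debug check above, as a one-liner (cgroup path is illustrative):
# Count OOM events; oom_kill > 0 means processes were killed
grep -E '^oom' /sys/fs/cgroup/mygroup/memory.events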
Slow Container Startup
Memory being reclaimed during startup.
Debug: Check memory.pressure
Solutions:
- Increase memory.high
- Pre-warm with memory.reclaim
- Check if limit is too low
Kernel Memory Growing Unbounded (v1)
In cgroup v1, kernel memory wasn't limited by default.
Solution: Use cgroup v2, or set memory.kmem.limit_in_bytes in v1.
References
Key Code
| File | Description |
|---|---|
| mm/memcontrol.c | Memory controller implementation |
| include/linux/memcontrol.h | memcg API |
Kernel Documentation
- Documentation/admin-guide/cgroup-v2.rst - Authoritative cgroup v2 docs
Related
- reclaim - How memory is reclaimed
- page-allocator - Global memory allocation
- glossary - cgroup, OOM definitions