systemd and Cgroup Integration
How systemd manages the cgroup hierarchy: slices, scopes, services, and delegation
Overview
systemd is the cgroup manager on nearly every modern Linux distribution. Since systemd 232 (2016), systemd assumes exclusive ownership of the cgroup v2 hierarchy: it creates sub-trees for every unit, writes resource limits to the appropriate interface files, and enforces the rule that only one userspace agent manages a given subtree. No other tool should write to /sys/fs/cgroup directly unless systemd has explicitly delegated that subtree.
This arrangement solves a problem that plagued cgroup v1: multiple managers (Docker, LXC, Kubernetes, systemd itself) competing to write the same cgroup files, often overwriting each other's settings.
Unit types and their cgroup mapping
systemd exposes three unit types that map to cgroups:
Service units (.service)
A long-running daemon. systemd creates a cgroup at system.slice/<name>.service/ and moves the daemon's processes into it at start time.
/sys/fs/cgroup/system.slice/nginx.service/
├── cgroup.procs ← contains nginx worker PIDs
├── cpu.max ← set from CPUQuota=
├── memory.max ← set from MemoryMax=
├── io.weight ← set from IOWeight=
└── pids.max ← set from TasksMax=
The cgroup is created when the service starts and removed, once all of its processes have exited, when it stops. If the service crashes and restarts, the same cgroup path is reused.
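Putting it together, a unit file carrying such limits might look like this (hypothetical myapp.service; property names are from systemd.resource-control(5)):

```ini
# /etc/systemd/system/myapp.service (hypothetical example)
[Unit]
Description=Example daemon with cgroup resource limits

[Service]
ExecStart=/usr/local/bin/myapp
MemoryMax=512M
CPUQuota=50%
TasksMax=100

[Install]
WantedBy=multi-user.target
```

After systemctl daemon-reload and systemctl start myapp.service, the corresponding interface files appear under /sys/fs/cgroup/system.slice/myapp.service/.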
Scope units (.scope)
A group of externally-started processes registered with systemd via D-Bus (org.freedesktop.systemd1.Manager.StartTransientUnit). Container runtimes use scopes to register containers they start themselves. The cgroup path is user.slice/user-<uid>.slice/<name>.scope/ for user scopes.
Unlike services, systemd does not start or stop processes in a scope — it only creates the cgroup and tracks it. The runtime that registered the scope is responsible for the processes.
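You can see which scope (or other unit) a process lives in by reading its /proc/&lt;pid&gt;/cgroup entry; on cgroup v2 it is a single line of the form "0::&lt;path&gt;". A small sketch using a sample line (on a live system, read /proc/self/cgroup instead):

```shell
# A cgroup v2 /proc/<pid>/cgroup entry has the single form "0::<path>".
# Stripping the "0::" prefix yields the process's cgroup path
# (sample line below; substitute the contents of /proc/self/cgroup):
line="0::/user.slice/user-1000.slice/session-1.scope"
scope_path="${line#0::}"
echo "$scope_path"
# /user.slice/user-1000.slice/session-1.scope
```

The path is relative to the cgroup2 mount point, so prefixing /sys/fs/cgroup gives the directory holding the interface files.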
Slice units (.slice)
Hierarchical grouping containers for services and scopes. A slice does not hold processes directly; it holds child slices, services, and scopes. Its cgroup provides a resource envelope shared by all children.
The default slices are:
- system.slice — all system services
- user.slice — all user sessions
- machine.slice — VMs and containers managed by systemd-machined
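Slice names encode their own nesting: every "-" in the name adds a hierarchy level, which is why user-1000.slice lives under user.slice. A sketch of that name-to-path mapping (the helper function is illustrative, not a systemd API):

```shell
# Map a slice unit name to its cgroup path: each "-" adds a level,
# per the naming rules in systemd.slice(5).
slice_path() {
  local name="${1%.slice}" prefix="" path="" part
  local IFS='-'
  for part in $name; do
    prefix="${prefix:+$prefix-}$part"       # accumulate the unit name so far
    path="${path:+$path/}$prefix.slice"     # each step is one directory level
  done
  echo "$path"
}
slice_path user-1000.slice
# user.slice/user-1000.slice
slice_path machine-a-b.slice
# machine.slice/machine-a.slice/machine-a-b.slice
```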
The default cgroup layout
/sys/fs/cgroup/
├── cgroup.controllers
├── cgroup.subtree_control
├── init.scope/                  ← PID 1 (systemd) itself
│   └── cgroup.procs             ← contains 1
├── system.slice/                ← system services
│   ├── cgroup.subtree_control
│   ├── nginx.service/
│   │   ├── cgroup.procs
│   │   ├── cpu.max
│   │   └── memory.max
│   ├── sshd.service/
│   │   └── cgroup.procs
│   └── containerd.service/
│       └── cgroup.procs         ← contains containerd (not containers)
├── user.slice/                  ← user sessions
│   └── user-1000.slice/
│       ├── session-1.scope/
│       │   └── cgroup.procs     ← login shell and its children
│       └── user@1000.service/   ← user systemd instance
│           └── cgroup.procs
└── machine.slice/               ← VMs / containers via machined
systemd ensures the system.slice and user.slice always exist and that cgroup.subtree_control is populated with the controllers listed in DefaultCPUAccounting=, DefaultMemoryAccounting=, etc. in /etc/systemd/system.conf.
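Those accounting defaults look like this in /etc/systemd/system.conf (an assumed excerpt; all of these settings default to per-distribution values and live in the [Manager] section):

```ini
# /etc/systemd/system.conf (assumed excerpt)
[Manager]
DefaultCPUAccounting=yes
DefaultMemoryAccounting=yes
DefaultIOAccounting=yes
DefaultTasksAccounting=yes
```

Enabling an accounting default causes systemd to turn on the corresponding controller in cgroup.subtree_control down the hierarchy.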
Resource limits in unit files
Unit file properties map directly to cgroup v2 interface files:
[Service]
# cpu.max = quota period (microseconds)
CPUQuota=50% # cpu.max = "50000 100000" (50% of one CPU, default 100 ms period)
# memory.max (hard limit; triggers OOM kill when exceeded)
MemoryMax=512M # memory.max = 536870912
# memory.high (soft limit; triggers reclaim before OOM)
MemoryHigh=400M # memory.high = 419430400
# memory.swap.max (limit swap usage)
MemorySwapMax=0 # memory.swap.max = 0 (no swap)
# io.weight (proportional IO scheduling weight, range 1-10000)
IOWeight=100 # io.weight = 100
# io.max (per-device hard bandwidth limit)
IOReadBandwidthMax=/dev/sda 50M # io.max = "8:0 rbps=52428800"
IOWriteBandwidthMax=/dev/sda 25M # io.max = "8:0 wbps=26214400"
# pids.max
TasksMax=100 # pids.max = 100
# cpuset.cpus (pin to specific CPUs)
AllowedCPUs=0-3 # cpuset.cpus = "0-3"
# cpuset.mems (pin to specific NUMA nodes)
AllowedMemoryNodes=0 # cpuset.mems = "0"
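The arithmetic behind two of these mappings can be sketched directly (assuming systemd's default CPUQuotaPeriodSec of 100 ms; both cpu.max fields are in microseconds, and systemd size suffixes are powers of 1024):

```shell
# CPUQuota=50% -> cpu.max, assuming the default 100 ms quota period:
quota_pct=50
period_us=100000
cpu_max="$(( quota_pct * period_us / 100 )) $period_us"
echo "$cpu_max"
# 50000 100000

# MemoryMax=512M -> memory.max in bytes (M = 1024 * 1024):
mem_max=$(( 512 * 1024 * 1024 ))
echo "$mem_max"
# 536870912
```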
Clients such as systemctl send these properties to PID 1 over D-Bus (via sd_bus_call_method()); systemd itself then writes the values to the cgroup interface files during unit startup and on live changes, using the internal cg_set_attribute() helper (in src/basic/cgroup-util.c of the systemd source).
Live property changes
systemctl set-property applies a resource change to a running unit without restarting it:
# Reduce nginx to 25% CPU quota
systemctl set-property nginx.service CPUQuota=25%
# Limit memory
systemctl set-property nginx.service MemoryMax=256M
# The change is:
# 1. Sent via D-Bus to PID 1 (systemd)
# 2. systemd writes the value to /sys/fs/cgroup/system.slice/nginx.service/cpu.max
# 3. Optionally persisted to /etc/systemd/system/nginx.service.d/50-CPUQuota.conf
# Verify the change took effect:
cat /sys/fs/cgroup/system.slice/nginx.service/cpu.max
# 25000 100000
The --runtime flag makes the change temporary: the drop-in is written under /run/systemd/system/ instead of /etc/systemd/system/, so it is lost on the next reboot.
Delegation
When Delegate=yes is set in a unit file, systemd marks the service's cgroup as delegated. systemd will still create the cgroup and set top-level limits, but it will not recurse into the subtree — it will not touch cgroup.subtree_control or any files below the delegation point. The service itself is responsible for managing its own sub-hierarchy.
This is used by:
- containerd / Docker — create per-container sub-cgroups under containerd.service/
- Kubernetes kubelet — creates per-pod cgroups under kubelet.service/ or a dedicated slice
- User systemd instances — user@1000.service is delegated so the per-user systemd can manage user.slice/user-1000.slice/
When Delegate=yes and User= is set, systemd uses chown(2) to hand the cgroup directory (and key files such as cgroup.procs and cgroup.subtree_control) to that UID, which allows the unprivileged process to write cgroup.subtree_control and create child cgroup subdirectories without root. There is no cgroup.delegate kernel interface file; delegation is purely a matter of filesystem ownership on the cgroup directory.
The correct pattern for a container runtime using delegation (note that before the runtime can enable controllers in cgroup.subtree_control for its children, cgroup v2's no-internal-process rule requires it to move its own processes into a leaf sub-cgroup):
# systemd has created: /sys/fs/cgroup/system.slice/containerd.service/
# Delegate=yes means containerd owns this subtree.
# containerd creates a cgroup for container abc123:
mkdir /sys/fs/cgroup/system.slice/containerd.service/abc123/
# Write limits BEFORE moving any process into the cgroup:
echo "512M" > /sys/fs/cgroup/system.slice/containerd.service/abc123/memory.max
echo "100" > /sys/fs/cgroup/system.slice/containerd.service/abc123/pids.max
# Then move the container init process:
echo $CONTAINER_PID > /sys/fs/cgroup/system.slice/containerd.service/abc123/cgroup.procs
Transient units
systemd-run creates a transient scope or service cgroup on the fly without writing a unit file:
# Run a command in a transient scope under myslice.slice with resource limits
systemd-run --scope \
--slice=myslice.slice \
--property=MemoryMax=256M \
--property=CPUQuota=50% \
-- stress --vm 2 --vm-bytes 200M
# Run a one-shot service (waits for completion)
systemd-run --service-type=oneshot \
--property=TasksMax=20 \
-- /usr/local/bin/batch-job.sh
# Check the generated cgroup:
systemd-cgls /myslice.slice
Transient units are registered via org.freedesktop.systemd1.Manager.StartTransientUnit over D-Bus. The unit and its cgroup are removed automatically when the process exits.
Observability
# Show the full cgroup tree annotated with systemd unit names:
systemd-cgls
# Example output:
# Control group /:
# -.slice
# ├─system.slice
# │ ├─nginx.service
# │ │ ├─1234 nginx: master process /usr/sbin/nginx
# │ │ └─1235 nginx: worker process
# │ └─sshd.service
# │   └─5678 sshd: accepting connections
# └─user.slice
#   └─user-1000.slice
#     └─session-1.scope
#       ├─9012 -bash
#       └─9013 systemd-cgls
# Live per-cgroup resource usage (like top for cgroups):
systemd-cgtop
# Show resource settings for a unit:
systemctl show nginx.service | grep -E 'CPU|Memory|IO|Tasks'
# CPUQuota=50%
# MemoryMax=536870912
# TasksMax=100
# Check raw cgroup interface files:
cat /sys/fs/cgroup/system.slice/nginx.service/cpu.stat
cat /sys/fs/cgroup/system.slice/nginx.service/memory.current
cat /sys/fs/cgroup/system.slice/nginx.service/io.stat
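These stat files are flat "key value" pairs, so they are easy to pick apart with standard tools. A sketch using sample cpu.stat contents (substitute the real file on a live system):

```shell
# Extract one counter from cpu.stat-style "key value" output
# (sample contents; on a real system, cat the file shown above):
cpu_stat='usage_usec 4512340
user_usec 3000000
system_usec 1512340'
usage=$(echo "$cpu_stat" | awk '$1 == "usage_usec" { print $2 }')
echo "$usage"
# 4512340
```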
How systemd enables controllers
On cgroup v2, systemd discovers the available controllers from the root cgroup.controllers file (/proc/cgroups is a v1-era interface) and uses cgroup.subtree_control to propagate them down the hierarchy. The DefaultCPUAccounting=yes / DefaultMemoryAccounting=yes / DefaultIOAccounting=yes knobs in /etc/systemd/system.conf control which forms of accounting, and therefore which controllers, systemd enables globally.
# See which controllers systemd has enabled:
cat /sys/fs/cgroup/cgroup.subtree_control
# cpuset cpu io memory hugetlb pids
# See controllers available to a service's cgroup:
cat /sys/fs/cgroup/system.slice/cgroup.subtree_control
If a controller is not listed in cgroup.subtree_control at the relevant level, interface files like memory.max won't exist in the child cgroup — a common source of confusion when controllers are disabled by default.
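The gap between "available" and "enabled" can be computed by diffing the two files. A sketch using sample contents (on a live system, read cgroup.controllers and cgroup.subtree_control instead of the hardcoded strings):

```shell
# Which controllers are available but not yet enabled for children?
available="cpuset cpu io memory hugetlb pids"   # cgroup.controllers (sample)
enabled="cpu memory pids"                       # cgroup.subtree_control (sample)
to_enable=""
for c in $available; do
  case " $enabled " in
    *" $c "*) ;;                                # already enabled
    *) to_enable="$to_enable +$c" ;;            # written as "+<name>" to enable
  esac
done
echo "${to_enable# }"
# +cpuset +io +hugetlb
```

Each "+name" token is what you (or systemd) would write to cgroup.subtree_control to make the controller's interface files appear in the child cgroups.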
Further reading
- Cgroup v2 Architecture — unified hierarchy, struct cgroup, subtree_control
- Resource Controllers — cpu, memory, io, pids interface files in detail
- Container Isolation — how Docker/containerd use delegation
- Cgroup War Stories — delegation race conditions and v1→v2 migration surprises
- man systemd.resource-control(5) — complete list of resource control properties
- man systemd-run(1) — transient unit creation
- src/core/cgroup.c in the systemd source — cgroup management implementation
- Documentation/admin-guide/cgroup-v2.rst — kernel cgroup v2 documentation