Power Capping and RAPL
Intel RAPL, the powercap framework, and constraining CPU power consumption
What is RAPL?
Running Average Power Limit (RAPL) is an Intel hardware feature introduced with Sandy Bridge (2011) that does two things:
- Measures energy consumption of major hardware domains using on-die counters
- Limits sustained power draw by throttling clock frequencies when a domain's rolling average exceeds a configured cap
RAPL is not a software power meter: the counters are maintained by the package's power management unit and have a resolution of one energy unit, typically about 15.3 µJ or 61 µJ depending on the processor model. It is the basis for perf stat -e power/energy-pkg/, powertop's package power display, and many data-center power management tools.
RAPL domains
Each power domain covers a different physical scope:
| Domain | Scope | Available since |
|---|---|---|
| package | Entire CPU socket (cores + uncore + LLC) | Sandy Bridge |
| core | CPU cores + L1/L2 caches only | Sandy Bridge |
| uncore | GPU (integrated graphics / ring bus) | Sandy Bridge |
| dram | DRAM power consumption | Sandy Bridge EP |
| psys | Platform-wide (SoC + VR losses) | Skylake |
On a dual-socket server you will see intel-rapl:0 and intel-rapl:1 for the two packages, each with child zones for their sub-domains.
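The parent/child relationship is visible in the zone names themselves. The following is a minimal sketch, assuming the zone directories follow the intel-rapl:<package>[:<subzone>] naming described above; list_zones is a hypothetical helper name, and its optional argument (an alternate root directory) exists only so the function can be tried against a test tree.

```shell
#!/bin/sh
# list_zones [ROOT]: print every powercap zone under ROOT, indented by depth.
# Hypothetical helper; ROOT defaults to the real sysfs location.
list_zones() {
    root=${1:-/sys/class/powercap}
    for dir in "$root"/*:*; do
        [ -r "$dir/name" ] || continue      # skip when the glob matched nothing
        zone=${dir##*/}
        # Each ':' beyond the first marks one more level of nesting.
        colons=$(( $(printf '%s' "$zone" | tr -cd ':' | wc -c) ))
        indent=""
        i=1
        while [ "$i" -lt "$colons" ]; do
            indent="$indent  "
            i=$((i + 1))
        done
        printf '%s%s (%s)\n' "$indent" "$zone" "$(cat "$dir/name")"
    done
}

list_zones "$@"
```

On a machine without RAPL zones the function prints nothing, since the glob matches no readable name file.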
The MSR interface
At the hardware level RAPL is exposed through Model-Specific Registers:
MSR_PKG_ENERGY_STATUS (0x611) — rolling energy counter, ~61 µJ units
MSR_PKG_POWER_LIMIT (0x610) — configures limit 1 (sustained) and limit 2 (burst)
MSR_PKG_POWER_INFO (0x614) — TDP, min power, max power (read-only)
MSR_PP0_ENERGY_STATUS (0x639) — core domain energy
MSR_PP1_ENERGY_STATUS (0x641) — uncore domain energy
MSR_DRAM_ENERGY_STATUS (0x619) — DRAM domain energy
MSR_PLATFORM_ENERGY_STATUS (0x64d) — psys domain energy (Skylake+)
The energy unit is encoded in MSR_RAPL_POWER_UNIT (0x606), bits [12:8], as an exponent ESU: one counter LSB equals 1/2^ESU joules. The power unit is encoded the same way in bits [3:0] of the same MSR (1/2^PU watts, typically 1/8 W).
Direct MSR access requires CAP_SYS_RAWIO and the msr kernel module. The powercap sysfs interface is the preferred path for userspace.
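To make the unit encoding concrete, here is a small sketch that decodes the fields of a raw MSR_RAPL_POWER_UNIT value. decode_rapl_units is a hypothetical helper name, and 0xA0E03 is only an example raw value (PU=3, ESU=14, TU=10); the time-unit field in bits [19:16] follows the same power-of-two encoding.

```shell
#!/bin/sh
# decode_rapl_units RAW: decode the unit fields of MSR_RAPL_POWER_UNIT (0x606).
# Hypothetical helper; RAW may be 0x-prefixed hex or decimal.
decode_rapl_units() {
    raw=$(( $1 ))
    pu=$((  raw         & 0xF  ))   # bits [3:0]:   power unit exponent
    esu=$(( (raw >> 8)  & 0x1F ))   # bits [12:8]:  energy unit exponent
    tu=$((  (raw >> 16) & 0xF  ))   # bits [19:16]: time unit exponent
    # Each unit is 1 / 2^exponent (watts, joules, seconds respectively).
    awk -v pu="$pu" -v esu="$esu" -v tu="$tu" 'BEGIN {
        printf "power_unit_w=%.6g energy_unit_uj=%.2f time_unit_us=%.1f\n",
               1 / 2^pu, 1e6 / 2^esu, 1e6 / 2^tu
    }'
}

decode_rapl_units 0xA0E03
# prints: power_unit_w=0.125 energy_unit_uj=61.04 time_unit_us=976.6
```

On a live system the raw value could come from `sudo rdmsr -p 0 0x606` (from msr-tools), which prints hex without the 0x prefix, so the prefix has to be added before passing the value in.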
The powercap framework
drivers/powercap/ provides a generic abstraction for power-capping hardware. The framework:
- Represents each hardware domain as a powercap zone (struct powercap_zone)
- Exposes zones as directories under /sys/class/powercap/
- Allows drivers for different hardware (Intel RAPL, ARM SCMI) to register zones without duplicating sysfs code
/* include/linux/powercap.h (abridged) */
struct powercap_zone {
    int id;                                 /* unique within the parent */
    char *name;                             /* e.g. "package-0", "core" */
    void *control_type_inst;                /* owning control type */
    const struct powercap_zone_ops *ops;    /* driver callbacks, see below */
    struct device dev;                      /* backs the sysfs directory */
    struct idr idr;                         /* id allocator for child zones */
    struct idr *parent_idr;
    /* ... */
};

struct powercap_zone_ops {
    int (*get_max_energy_range_uj)(struct powercap_zone *, u64 *);
    int (*get_energy_uj)(struct powercap_zone *, u64 *);
    int (*reset_energy_uj)(struct powercap_zone *);
    int (*get_power_uw)(struct powercap_zone *, u64 *);   /* current power, µW */
    int (*set_enable)(struct powercap_zone *, bool);
    int (*get_enable)(struct powercap_zone *, bool *);
    int (*release)(struct powercap_zone *);
};
A control type (e.g., intel-rapl) is registered first with powercap_register_control_type(), then each RAPL domain is registered as a zone with powercap_register_zone(). Child zones (core, dram, uncore) are registered with the package zone as parent.
The intel_rapl driver
drivers/powercap/intel_rapl_common.c is the driver that bridges RAPL MSRs to the powercap framework. It:
- Enumerates RAPL domains by probing MSRs for each CPU package
- Reads the energy/power unit from MSR_RAPL_POWER_UNIT
- Registers each domain as a powercap_zone
- Implements get_energy_uj by reading the appropriate energy MSR and scaling
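Two thin front ends sit on top of the common code:
- intel_rapl_msr.c: accesses RAPL via MSRs directly (bare metal, most common)
- the MMIO interface in the int340x processor thermal driver (drivers/thermal/intel/int340x_thermal/processor_thermal_rapl.c): accesses RAPL through memory-mapped registers and appears as a separate intel-rapl-mmio control type on recent client platforms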
On systems with the driver loaded, lsmod | grep rapl should show intel_rapl_common, intel_rapl_msr, and (on some platforms) rapl for the perf event backend.
sysfs interface
# List all powercap zones
ls /sys/class/powercap/
# intel-rapl intel-rapl:0 intel-rapl:0:0 intel-rapl:0:1 intel-rapl:0:2
# Zone names
cat /sys/class/powercap/intel-rapl:0/name # "package-0"
cat /sys/class/powercap/intel-rapl:0:0/name # "core"
cat /sys/class/powercap/intel-rapl:0:1/name # "uncore" or "dram"
# Read current accumulated energy (µJ; the counter wraps at max_energy_range_uj).
# Reading energy_uj requires root on recent kernels.
cat /sys/class/powercap/intel-rapl:0/energy_uj
# 4882348230
# Compute power: read energy_uj twice, 1 second apart
# (assumes the counter does not wrap between the two reads)
E1=$(cat /sys/class/powercap/intel-rapl:0/energy_uj)
sleep 1
E2=$(cat /sys/class/powercap/intel-rapl:0/energy_uj)
echo "Package power: $(( (E2 - E1) / 1000000 )) W"   # truncated to whole watts
# Read current long-term power limit (µW)
cat /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw
# 45000000 (= 45 W)
# Set a 35 W long-term power limit on package 0 (requires root)
echo 35000000 > /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw
# Enable power limiting for the zone via its "enabled" file (may already be on)
echo 1 > /sys/class/powercap/intel-rapl:0/enabled
# Read the time window over which the limit is averaged
cat /sys/class/powercap/intel-rapl:0/constraint_0_time_window_us
# 999424 (~1 second)
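Because energy_uj wraps around, a plain subtraction of two readings occasionally goes negative. The sketch below shows one way to handle a single wrap using the zone's max_energy_range_uj file; energy_delta_uj and avg_power_w are hypothetical helper names, and the approach assumes the counter wraps at most once between the two reads.

```shell
#!/bin/sh
# energy_delta_uj E1 E2 MAX: microjoules consumed between two readings,
# allowing for one wrap of the counter at MAX (max_energy_range_uj).
energy_delta_uj() {
    if [ "$2" -ge "$1" ]; then
        echo $(( $2 - $1 ))
    else
        echo $(( $2 + $3 - $1 ))            # counter wrapped once
    fi
}

# avg_power_w DELTA_UJ SECONDS: average power in watts, three decimals.
avg_power_w() {
    awk -v d="$1" -v s="$2" 'BEGIN { printf "%.3f\n", d / (s * 1e6) }'
}

zone=/sys/class/powercap/intel-rapl:0
if [ -r "$zone/energy_uj" ]; then           # usually needs root
    max=$(cat "$zone/max_energy_range_uj")
    e1=$(cat "$zone/energy_uj")
    sleep 1
    e2=$(cat "$zone/energy_uj")
    delta=$(energy_delta_uj "$e1" "$e2" "$max")
    echo "Package power: $(avg_power_w "$delta" 1) W"
fi
```

Keeping the arithmetic in small functions makes the wrap case easy to check by hand: readings of 900 then 50 with a range of 1000 give a delta of 150.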
Constraint 0 vs Constraint 1
RAPL supports two independent power constraints per zone:
| Constraint | Name | Typical time window | Behavior |
|---|---|---|---|
| constraint_0 | Long-term (PL1) | 1 s – 28 s | Sustained cap; hardware throttles if rolling average exceeds this |
| constraint_1 | Short-term (PL2) | 1 ms – 7 ms | Burst cap; allows exceeding PL1 briefly for responsiveness |
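PL2 is normally configured at or above PL1. A typical OEM configuration is PL1 = 15 W (thermal envelope) and PL2 = 25 W, with a PL1 time window of about 28 s. The package may draw up to PL2 while the running average over the PL1 window stays below PL1; once the average reaches PL1, the processor settles at PL1.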
# Short-term (burst) limit
cat /sys/class/powercap/intel-rapl:0/constraint_1_name
# "short_term"
cat /sys/class/powercap/intel-rapl:0/constraint_1_power_limit_uw
# 64000000 (64 W burst)
cat /sys/class/powercap/intel-rapl:0/constraint_1_time_window_us
# 2440 (~2.4 ms)
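The per-constraint files can be summarized in one pass. This is a minimal sketch, assuming each constraint exposes the _name, _power_limit_uw and _time_window_us files shown above; show_constraints is a hypothetical helper name, and a constraint whose limit or window cannot be read is reported as zero.

```shell
#!/bin/sh
# show_constraints ZONE_DIR: print each constraint's name, limit and window.
# Hypothetical helper built on the constraint_N_* files of one zone.
show_constraints() {
    for f in "$1"/constraint_*_name; do
        [ -r "$f" ] || continue             # skip when the glob matched nothing
        base=${f%_name}                     # .../constraint_N
        name=$(cat "$f")
        limit_uw=$(cat "${base}_power_limit_uw" 2>/dev/null)
        window_us=$(cat "${base}_time_window_us" 2>/dev/null)
        # Convert microwatts to watts and microseconds to milliseconds.
        awk -v n="$name" -v p="$limit_uw" -v t="$window_us" 'BEGIN {
            printf "%s: %.1f W over %.3f ms\n", n, p / 1e6, t / 1e3
        }'
    done
}

show_constraints "${1:-/sys/class/powercap/intel-rapl:0}"
```

With the example values from this section, the output would read "long_term: 45.0 W over 999.424 ms" followed by "short_term: 64.0 W over 2.440 ms".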
Measuring energy with perf
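The perf tool can read RAPL energy counters via the power PMU, which uses the same MSRs under the hood but integrates with the perf event infrastructure. The counters are package-wide, so the reported energy covers everything running on the socket during the measurement, not only the named workload: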
# Measure package energy for a workload
perf stat -e power/energy-pkg/ -- ./my_workload
# Performance counter stats for './my_workload':
# 24.35 Joules power/energy-pkg/
# 3.152370284 seconds time elapsed
# All available RAPL events
perf list | grep power
# power/energy-cores/ [Kernel PMU event]
# power/energy-gpu/ [Kernel PMU event]
# power/energy-pkg/ [Kernel PMU event]
# power/energy-ram/ [Kernel PMU event]
The perf backend is in arch/x86/events/rapl.c and registers a PMU named power that maps energy-pkg to MSR_PKG_ENERGY_STATUS, energy-cores to MSR_PP0_ENERGY_STATUS, and so on.
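perf reports energy in joules, so average power over the run is the energy divided by the elapsed time. As a worked example using the sample figures above (joules_to_watts is a hypothetical helper name):

```shell
#!/bin/sh
# joules_to_watts JOULES SECONDS: average power over the measured interval.
joules_to_watts() {
    awk -v j="$1" -v s="$2" 'BEGIN { printf "%.2f\n", j / s }'
}

# 24.35 J over 3.152370284 s, taken from the sample perf stat output.
joules_to_watts 24.35 3.152370284
# prints: 7.72
```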
ARM equivalent: SCMI power capping
On ARM platforms, SCMI (System Control and Management Interface) provides a comparable capability via the platform firmware. The driver drivers/powercap/arm_scmi_powercap.c registers SCMI power domains as powercap zones, so the same /sys/class/powercap/ interface works on ARM servers (e.g., Ampere Altra).
# On an ARM server with SCMI powercap
ls /sys/class/powercap/
# arm-scmi arm-scmi:0 ...
cat /sys/class/powercap/arm-scmi:0/name
# "package_0"
The SCMI protocol (defined in Arm DEN0056) runs over a shared-memory or mailbox transport between Linux and the SCP (System Control Processor). The SCMI commands POWERCAP_CAP_GET and POWERCAP_CAP_SET back the power-limit reads and writes exposed through the powercap zone.
Interaction with thermal management
RAPL power limits and the thermal framework are complementary, not overlapping:
Temperature-based: Thermal zone → trip point → cpufreq cooling → frequency drop
Power-based: RAPL PL1/PL2 → hardware throttles clocks directly
RAPL acts faster (hardware decision, microsecond granularity).
Thermal acts on sustained temperature (software, second granularity).
PROCHOT (thermal limit from temperature sensor) overrides both.
Setting a RAPL limit lower than TDP controls steady-state power without relying on temperature sensors. This is useful in power-capped environments (data centers with per-rack power limits) where thermal headroom is not the constraint.
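When scripting such a cap, it helps to sanity-check the requested value before writing it. The following is a sketch under stated assumptions: set_pl1 is a hypothetical helper, it takes whole watts, and it relies on the standard powercap files constraint_0_max_power_uw, constraint_0_power_limit_uw and the zone-level enabled file. A reported maximum of zero is treated as "unknown" and not enforced.

```shell
#!/bin/sh
# set_pl1 ZONE_DIR WATTS: write a long-term limit after a sanity check.
# Hypothetical helper; refuses values above constraint_0_max_power_uw
# when the platform reports a nonzero maximum. Returns 1 on refusal.
set_pl1() {
    zone=$1
    limit_uw=$(( $2 * 1000000 ))            # whole watts to microwatts
    max_uw=$(cat "$zone/constraint_0_max_power_uw" 2>/dev/null)
    max_uw=${max_uw:-0}
    if [ "$max_uw" -gt 0 ] && [ "$limit_uw" -gt "$max_uw" ]; then
        echo "refusing: $2 W exceeds maximum of $(( max_uw / 1000000 )) W" >&2
        return 1
    fi
    # Write the limit first, then make sure limiting is switched on.
    echo "$limit_uw" > "$zone/constraint_0_power_limit_uw" &&
        echo 1 > "$zone/enabled"
}

# Example (as root): set_pl1 /sys/class/powercap/intel-rapl:0 35
```

The function only defines the check; nothing is written until it is called explicitly, which keeps the script safe to source.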
Observability
# turbostat shows per-package RAPL data continuously
sudo turbostat --show PkgWatt,CorWatt,RAMWatt --interval 1
# Watch energy counters directly
sudo watch -n 1 'for z in /sys/class/powercap/intel-rapl:*; do
    printf "%s: %s µJ  " "$(cat "$z/name")" "$(cat "$z/energy_uj")"; done; echo'
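# Check whether a limit is currently throttling the CPU.
# powercap has no sysfs node for this; one indirect signal is the ratio of
# cpu-cycles to ref-cycles, measured system-wide while the workload runs:
sudo perf stat -a -e cpu-cycles,ref-cycles -- sleep 1
# If cpu-cycles is well below ref-cycles under load, cores are running below
# base frequency, which is consistent with (but not proof of) throttling.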
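The two cycle counts can be turned into an approximate running frequency. This is a rough sketch, assuming ref-cycles advance at a fixed reference rate equal to the processor's base frequency while a core is unhalted; effective_mhz is a hypothetical helper, and the base frequency has to be supplied by the caller.

```shell
#!/bin/sh
# effective_mhz CPU_CYCLES REF_CYCLES BASE_MHZ: approximate average running
# frequency, assuming ref-cycles tick at BASE_MHZ while the core is unhalted.
effective_mhz() {
    awk -v c="$1" -v r="$2" -v b="$3" 'BEGIN { printf "%.0f\n", b * c / r }'
}

# Example: half as many cpu-cycles as ref-cycles on a 2400 MHz base part.
effective_mhz 1500000000 3000000000 2400
# prints: 1200
```

A result well below the base frequency under sustained load is the pattern to look for; a lightly loaded system can show the same ratio simply because of normal frequency scaling.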
Further reading
- cpufreq and P-states — frequency scaling; RAPL throttling reduces frequency via the same HWP mechanism
- Thermal Management — temperature-based throttling; complements RAPL power limits
- Power Management War Stories — real RAPL misconfiguration incident
- drivers/powercap/intel_rapl_common.c: RAPL driver implementation
- arch/x86/events/rapl.c: perf RAPL PMU backend
- Intel Software Developer's Manual, Volume 3B, Chapter 15: RAPL MSR definitions