Hardware Node Settings¶
This page lists the known hardware and software configurations that users can apply on the compute nodes.
Warning
These features are still under development. So far they have been tested on
the iml-ia770 partition only. They may work on other partitions, but this is
largely untested. Feel free to give us some feedback if you are
interested in this ;-).
cpudev¶
cpudev is a tool designed specifically for Dalek. It helps configure the CPU
device: it lets you modify the CPU driver, its governor, its per-core
frequency, idle states, and so on. This is generally not possible on
traditional clusters, which makes Dalek quite unique in this respect.
You can easily add cpudev to your PATH by doing:
The cpudev modifications last for the lifetime of the SLURM job. Moreover, only
exclusive jobs (jobs that take a compute node exclusively) can use it. When a
job terminates, the original CPU parameters are restored to ensure
reproducibility of the experiments on the nodes. This means that cpudev should
be invoked at the beginning of each job.
Tips
A man page is available for this command, as well as reminders using the
--help option with any of the following subcommands.
Info
This utility is available to anyone who is in the cpudev group. It
provides the functionalities detailed in the Technical
Details section.
Usage¶
cpudev is a command-line tool for configuring CPU settings, including
frequency and idle states. It allows applying configurations via specific
commands or a YAML file. It is closely related to the CPU subsystems CPUFreq
and CPUIdle, which we recommend you be aware of before using this utility
(see the Technical Details section below).
cpudev commands can be chained; they are applied sequentially. Parameters set
by earlier commands are remembered by the following ones. You can,
for example, use the cpufreq command multiple times in the same call, allowing
you to specify different configurations for different subsets of CPUs.
Subcommands¶
- apply: Apply configurations from a YAML file.
- cpufreq: Set CPUFreq settings for specified CPU policies.
- cpuidle: Enable or disable specific CPU idle states for specified CPU sets.
- driver: Enable or disable the specified CPUFreq driver.
The options of each subcommand are listed below.
cpudev apply [OPTIONS]¶
--config PATH: YAML configuration file to apply settings from.
cpudev cpufreq [OPTIONS]¶
- -g, --governor TEXT: Set CPUFreq governor (e.g., performance, powersave).
- -p, --policies TEXT: Specify CPU policies to target (e.g., '0', '0 9', '0-9', '*').
- -f, --frequency INTEGER: Set fixed CPU frequency (in kHz). Note that this option will not work unless the governor is set to userspace and the driver to acpi. See cpufreq for more details.
- --config PATH: YAML configuration file.
cpudev cpuidle [OPTIONS]¶
- -c, --cpus TEXT: Specify CPU numbers to target (e.g., '0', '0 9', '0-9', '*').
- -s, --idle-states TEXT: Specify CPU idle states (e.g., '0', '0 9', '0-9', '*').
- --disable: Disable the specified CPU idle states.
- --enable: Enable the specified CPU idle states.
- --config PATH: YAML configuration file.
cpudev driver [OPTIONS]¶
- {intel_pstate|acpi}: Specify the CPUFreq driver to enable or disable.
- --config PATH: YAML configuration file.
CPUs and Policies Selection¶
The CPU numbers and policy selector options support a syntax similar to that of
the pdftk cat command. You can assemble a query by concatenating the following
blocks with spaces:
- k: Selects the item whose name is suffixed by k. For example, in the context of CPUs, "8" would select "cpu8".
- i-j: Selects a continuous range of values between i and j. Boundaries are included. Example: "0-5" is equivalent to "0 1 2 3 4 5".
- *: Selects every possible number depending on the context. If used for cpus, it will look at /sys/devices/system/cpu/cpu[0-9]+ and generate a range from it. If used for policies, it will do the same with /sys/devices/system/cpu/cpufreq/policy[0-9]+. Finally, if used for selecting idle_states, it will look at /sys/devices/system/cpu/cpu0/cpuidle/state[0-9]+. Note that for the latter case, it assumes every CPU has the same range of idle states.
The query is not checked against the system when it is expanded. Except for
"*", the number of items in a category (cpus, idle_states, etc.) is neither
known nor verified beforehand, so a query that targets a non-existent item
results in an error.
Example: "0-3 5 7-8" would select every item between 0 and 8 except
4 and 6.
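To make the selection syntax concrete, here is a minimal sketch of how such a query could be expanded into explicit indices. The function name expand_query is hypothetical and this is not cpudev's implementation; the "*" block is deliberately left out because it depends on the sysfs layout of the node.

```shell
# Hypothetical helper (not part of cpudev): expands a query such as
# "0-3 5 7-8" into the explicit list of indices it designates.
expand_query() {
    printf '%s\n' "$1" | awk '{
        for (n = 1; n <= NF; n++) {
            if ($n ~ /^[0-9]+-[0-9]+$/) {
                # i-j block: inclusive range
                split($n, b, "-")
                for (k = b[1] + 0; k <= b[2] + 0; k++)
                    out = out (out == "" ? "" : " ") k
            } else if ($n ~ /^[0-9]+$/) {
                # k block: single index
                out = out (out == "" ? "" : " ") $n
            } else {
                print "invalid block: " $n > "/dev/stderr"
                exit 1
            }
        }
        print out
    }'
}

expand_query "0-3 5 7-8"   # prints: 0 1 2 3 5 7 8
```

Like the tool described above, the sketch does not check whether the resulting indices exist on the machine.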
Configuration File¶
The YAML file should contain sections for cpufreq, cpuidle, and driver
parameters. Multiple list items are supported within these categories,
mirroring the way commands can be chained. This way, it is possible to
apply different settings to different CPU subsets.
The driver category supports only "intel_pstate" or "acpi" values. The
order is important as the file is processed sequentially. For example, you
should place the driver parameter at the top if you want to use governors from a
specific driver.
Example:
    driver: "acpi"
    cpufreq:
      - governor: userspace
        policies: "0-2"
        frequency: "2500000"
      - governor: performance
        policies: "4"
    cpuidle:
      - cpus: "0-9"
        idle_states: "1"
        disable: true
Examples¶
- Apply a configuration from a YAML file:
- Set the powersave governor for CPU policies 0 to 3:
- Disable CPU idle state 1 for CPU 0:
- Enable the intel_pstate driver:
- Show help for applying a configuration:
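As a rough sketch, the five examples above could be written as follows, using only the options listed in the Subcommands section. The file name config.yaml, the positional form of the driver argument and the run wrapper are assumptions made for illustration; check the --help output of each subcommand before relying on them.

```shell
# Sketch only. With DRY_RUN=1 (default) the commands are printed, not executed;
# set DRY_RUN=0 inside an exclusive job to actually run them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi
}

run cpudev apply --config config.yaml        # config.yaml: hypothetical file name
run cpudev cpufreq -g powersave -p '0-3'     # powersave governor, policies 0 to 3
run cpudev cpuidle -c '0' -s '1' --disable   # disable idle state 1 on CPU 0
run cpudev driver intel_pstate               # assumed form for enabling the driver
run cpudev apply --help                      # help for the apply subcommand
```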
Technical Details¶
Sources
About Frequency (APERF, MPERF, TSC)¶
How is it computed?¶
On x86¶
The files /proc/cpuinfo and
/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq report a CPU frequency
that the kernel computes as follows:
\(\text{BusyMHz} = \frac{\Delta_{\text{APERF}}}{\Delta_{\text{MPERF}}} \times \text{freq}_{\text{base}}\)
On x86 platforms, computing a core's frequency involves three registers (MSRs):
- APERF and MPERF: Individually meaningless, but their ratio (APERF/MPERF) provides a coefficient to multiply with a base frequency. According to the Intel Developer's manual, it is also used to compute the usage proportion of the cores (Volume 3B, 19.17, p.682 and Volume 3B, 16.2, p.500).
- TSC: A frequency-invariant counter.
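As a worked illustration of the BusyMHz formula, the sketch below applies it to two counter deltas. The function name busy_mhz and the sample values are hypothetical; the base frequency is expected in kHz and the result is printed in MHz.

```shell
# busy_mhz DELTA_APERF DELTA_MPERF BASE_KHZ
# Prints the effective frequency in MHz: (dAPERF / dMPERF) * freq_base.
busy_mhz() {
    awk -v a="$1" -v m="$2" -v base="$3" 'BEGIN {
        if (m == 0) { print "MPERF delta is zero" > "/dev/stderr"; exit 1 }
        printf "%.1f\n", a / m * base / 1000
    }'
}

# Made-up sample: APERF advanced 1.5 times faster than MPERF over the interval,
# with a base frequency of 2.4 GHz.
busy_mhz 1500000 1000000 2400000   # prints: 3600.0
```

A ratio above 1 means the core ran above the base frequency over the interval (turbo range); a ratio below 1 means it ran below it.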
According to the Linux kernel source code,
the base frequency (\(\text{freq}_{\text{base}}\)) is a fixed value stored in
the cpu_khz variable, which can be retrieved using BPF. This frequency is
calculated using the TSC and represents the frequency of the core at the maximum
non-turbo P-State. The following code snippet allows you to retrieve this
value:
CPU_KHZ_ADDR=$(sudo cat /proc/kallsyms | grep "D cpu_khz" | cut -f1 -d" ") && sudo bpftrace -e "BEGIN { \$cpu_khz_addr = 0x$CPU_KHZ_ADDR ; printf(\"cpu_khz: %d\", *\$cpu_khz_addr); exit();}"
Important
The \(\text{freq}_{\text{base}}\) value must not be confused with the nominal frequency given by chip manufacturers: these two values are different.
Linux Kernel source code comment
The scheduler wants to do frequency invariant accounting and needs a
\(<1\) ratio to account for the "current" frequency, corresponding to
\(\text{freq}_{\text{curr}} / \text{freq}_{\text{max}}\). Since the frequency
\(\text{freq}_{\text{curr}}\) on x86 is controlled by micro-controller and our
P-State setting is little more than a request/hint, we need to observe the
effective frequency "BusyMHz", i.e. the average frequency over a time
interval after discarding idle time. This is given by:
\(\text{BusyMHz} = \frac{\Delta_{\text{APERF}}}{\Delta_{\text{MPERF}}} \times \text{freq}_{\text{base}}\)
where \(\text{freq}_{\text{base}}\) is the max non-turbo P-State.
The \(\text{freq}_{\text{max}}\) term has to be set to a somewhat arbitrary
value, because we can't know which turbo states will be available at a given
point in time: it all depends on the thermal headroom of the entire package.
We set it to the turbo level with 4 cores active.
Benchmarks show that's a good compromise between the 1C turbo ratio
(\(\text{freq}_\text{curr} / \text{freq}_\text{max}\) would rarely reach 1) and
something close to \(\text{freq}_\text{base}\), which would ignore the entire
turbo range (a conspicuous part, making
\(\text{freq}_{\text{curr}} / \text{freq}_{\text{max}}\) always maxed
out).
How can I set it? (CPUFreq)¶
CPUFreq is a Linux kernel software interface that manages DVFS (Dynamic Voltage and Frequency Scaling), a hardware feature in modern chips designed to reduce power consumption, alongside techniques like clock-gating and power-gating.
The CPUFreq software stack includes:
| Component | Role |
|---|---|
| Core | Registers CPU cores within the software and assigns them policies. |
| Governor | Decides the optimal frequency based on system metrics. It holds the main algorithm to decide which entity needs to scale up or down its frequency. |
| Policy | Applies the governor's decisions across associated CPUs. |
| Driver | Interfaces with hardware to set and retrieve frequencies. |
Theoretically, any governor can work with any hardware, and different governors can manage distinct logical CPU subsets.
Each governor can have its own set of parameters we can change to influence its
choices (see the
schedutil
governor for example).
On Intel¶
Intel developed its own driver and governor (Intel P-State) for processors since the Sandy Bridge architecture. It supports Hardware-managed P-States, enabling automatic frequency adjustments. The driver operates in two modes:
| Mode | Description |
|---|---|
| Active | User-space frequency control is disabled; only "powersave" and "performance" governors are available. |
| Passive | Behaves like a standard CPUFreq driver, allowing Linux kernel governors. |
Tip
Intel also introduced EPB, a 16-level scale to prioritize performance or
power efficiency, used by Intel P-State for frequency arbitration. We can
modify this bias directly through sysfs. More information on
its dedicated page.
Unfortunately, it is not possible to set a specific frequency while this driver is enabled.
To achieve this, follow these steps:
- Deactivate the Intel P-State driver so that it hands control back to the vanilla driver.
- Select the core whose frequency you want to set (here, cpu0).
- Select the governor for this core (e.g. cpu0) that allows setting the frequency: userspace.
- Set the target frequency in the following files (here it is 1800000, which represents 1.8 GHz):
- You can then verify the effects of this process by executing the following command:
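The steps above can be sketched with the standard CPUFreq sysfs attributes. This is an illustration under assumptions, not necessarily what cpudev does internally: it assumes that another driver (such as acpi-cpufreq) takes over once intel_pstate is switched off, which some kernels or hardware configurations may refuse, and that scaling_setspeed is the file used to request the frequency. The CPU_ROOT variable and the pin_frequency name are hypothetical.

```shell
# CPU_ROOT is a hypothetical override so the function can be tried on a mock tree.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# pin_frequency CPU KHZ (run as root on the node)
pin_frequency() (
    dir="$CPU_ROOT/cpu$1/cpufreq"
    # Step 1: switch intel_pstate off when its status file is present.
    # Assumption: another CPUFreq driver is available to take over.
    if [ -w "$CPU_ROOT/intel_pstate/status" ]; then
        echo off > "$CPU_ROOT/intel_pstate/status" || exit 1
    fi
    # Steps 2-3: give the chosen core the userspace governor.
    echo userspace > "$dir/scaling_governor" || exit 1
    # Step 4: request the target frequency, in kHz.
    echo "$2" > "$dir/scaling_setspeed" || exit 1
    # Step 5: read back the frequency reported by the kernel.
    cat "$dir/scaling_cur_freq"
)

# Usage on a node: pin_frequency 0 1800000
```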
About Sleeping (CPUIdle)¶
CPUIdle is the Linux kernel software stack that manages processor idle states. Unlike CPUFreq, which adjusts frequency, CPUIdle allows a logical CPU to stop executing instructions and power down parts of its circuitry to save energy.
The CPUIdle stack shares similarities with CPUFreq: it uses the same governor and driver concepts, so that different policies can be applied depending on the platform or on the idle states that are defined.
A sleep state is defined by two key metrics:
- Target residency: Minimum time required in the state to save more energy than a lighter state (includes entry time).
- Exit latency: Maximum time a CPU needs, after being woken up from the state, before it starts executing its first instruction.
What a sleep state concretely entails is neither clear nor defined in the source code (it points to assembly labels coming from compiled drivers). It depends on the platform on which the states are defined.
Info
When no tasks are available, the kernel schedules a special idle (or
swapper) task, triggering the selected CPUIdle governor to enter a sleep
state. However, scheduler ticks (1–10 ms) prevent deep sleep states.
"Tickless" kernels address this by disabling periodic ticks and waking
only on external interrupts.
Playing with Sleep States¶
System Analysis¶
Commands to inspect CPUIdle configuration:
- List available governors:
- List available C-states:
- Check current configuration:
  - Current governor:
  - Enabled/disabled states for a CPU (0 = enabled, 1 = disabled):
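The inspection steps above can be sketched with the CPUIdle sysfs attributes. The function names and the CPU_ROOT override are hypothetical, and whether available_governors and current_governor are exposed depends on the kernel version and boot parameters.

```shell
# CPU_ROOT is a hypothetical override so the functions can be tried on a mock tree.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# Governors known to the kernel and the one currently in use.
cpuidle_governors() {
    echo "available: $(cat "$CPU_ROOT/cpuidle/available_governors")"
    echo "current: $(cat "$CPU_ROOT/cpuidle/current_governor")"
}

# cpuidle_states CPU: one line per idle state with its name and disable flag
# (disable=0 means enabled, disable=1 means disabled).
cpuidle_states() {
    for state in "$CPU_ROOT/cpu$1/cpuidle"/state[0-9]*; do
        [ -d "$state" ] || continue
        printf '%s %s disable=%s\n' "${state##*/}" \
            "$(cat "$state/name")" "$(cat "$state/disable")"
    done
}

# Usage on a node: cpuidle_governors; cpuidle_states 0
```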
Configuration¶
Commands to modify CPUIdle behavior:
- Change governor:
- Enable/Disable C-states:
  - Disable a state (e.g., state3 for cpu0):
  - Re-enable a state:
  - Disable all deep states (e.g., from C2):
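A minimal sketch of these configuration steps, again based on the CPUIdle sysfs attributes and meant to be run as root. The function names and the CPU_ROOT override are hypothetical. Note that the state index is the sysfs state number, which does not necessarily match the C-state name (state2 is not always C2), and that writing to current_governor may not be permitted on every kernel.

```shell
# CPU_ROOT is a hypothetical override so the functions can be tried on a mock tree.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# set_idle_governor NAME (e.g. menu)
set_idle_governor() {
    echo "$1" > "$CPU_ROOT/cpuidle/current_governor"
}

# set_idle_state CPU STATE FLAG: FLAG is 1 to disable the state, 0 to re-enable it.
set_idle_state() {
    echo "$3" > "$CPU_ROOT/cpu$1/cpuidle/state$2/disable"
}

# disable_deep_states CPU FIRST: disable every state whose index is >= FIRST.
disable_deep_states() {
    for state in "$CPU_ROOT/cpu$1/cpuidle"/state[0-9]*; do
        [ -d "$state" ] || continue
        idx="${state##*state}"
        if [ "$idx" -ge "$2" ]; then
            echo 1 > "$state/disable" || return 1
        fi
    done
}

# Usage on a node:
#   set_idle_state 0 3 1      # disable state3 for cpu0
#   set_idle_state 0 3 0      # re-enable it
#   disable_deep_states 0 2   # disable state2 and deeper on cpu0
```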
- Disable a state (e.g.,