Hardware Node Settings

This page lists the known hardware and software configurations that users can leverage on the compute nodes.

Warning

These features are still under development. So far they have only been tested on the iml-ia770 partition. They may work on other partitions, but this is largely untested. Feel free to send us feedback if you are interested ;-).

cpudev

cpudev is a tool designed specifically for Dalek. It helps configure the CPU: it can change the CPUFreq driver, its governor, the per-core frequency, the idle states, and so on. Note that this is generally not possible on traditional clusters, which makes Dalek quite unique in this respect.

You can easily add cpudev to your PATH by doing:

module load cpudev

The cpudev modifications last for the lifetime of the SLURM job. Moreover, only exclusive jobs (jobs that take a whole compute node for themselves) can use it. When a job terminates, the original CPU parameters are restored to ensure the reproducibility of experiments on the nodes. This means that cpudev must be invoked at the beginning of each job.
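
As an illustration of this workflow, a batch script might look like the sketch below. The job name, the chosen governor and the placeholder workload are assumptions made for the example; only `module load cpudev`, the exclusive-job requirement and the iml-ia770 partition come from this page.

```shell
#!/bin/bash
#SBATCH --job-name=cpudev-demo   # hypothetical job name
#SBATCH --partition=iml-ia770    # partition on which cpudev has been tested
#SBATCH --exclusive              # cpudev only works in exclusive jobs
#SBATCH --nodes=1

# Settings are reset when the job ends, so they are applied first thing.
setup_cpu() {
    module load cpudev
    # Example choice (assumption): performance governor on every policy.
    cpudev cpufreq --governor performance --policies "*"
}

# SLURM_JOB_ID is set by SLURM inside a job; outside of one, nothing is changed.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    setup_cpu
fi

# Placeholder for the actual workload (assumption).
echo "running workload on $(uname -n)"
```

Such a script would be submitted with `sbatch`, for example `sbatch job.sh` (hypothetical file name).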

Tips

A man page is available for this command, as well as reminders using the --help option with any of the following subcommands.

Info

This utility is available to anyone in the cpudev group. It packages the functionalities detailed in the Technical Details section.

Usage

cpudev [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

cpudev is a command-line tool for configuring CPU settings, including frequency and idle states. It allows applying configurations via specific commands or a YAML file. It is closely related to the CPU subsystems CPUFreq and CPUIdle, which we recommend you be aware of before using this utility (see the Technical Details section below).

cpudev commands can be chained; they are applied sequentially. Parameters set by earlier commands are remembered by the following ones. You can, for example, use the cpufreq command multiple times in the same call, which lets you specify different configurations for different subsets of CPUs.

Subcommands

  • apply: Apply configurations from a YAML file.
  • cpufreq: Set CPUFreq settings for specified CPU policies.
  • cpuidle: Enable or disable specific CPU idle states for specified CPU sets.
  • driver: Enable or disable the specified CPUFreq driver.

The options of each subcommand are listed below.

cpudev apply [OPTIONS]

  • --config PATH: YAML configuration file to apply settings from.

cpudev cpufreq [OPTIONS]

  • -g, --governor TEXT: Set CPUFreq governor (e.g., performance, powersave).
  • -p, --policies TEXT: Specify CPU policies to target (e.g., '0', '0 9', '0-9', '*').
  • -f, --frequency INTEGER: Set fixed CPU frequency (in kHz). Note that this option will not work unless the governor is set to userspace and the driver to acpi. See cpufreq for more details.
  • --config PATH: YAML configuration file.

cpudev cpuidle [OPTIONS]

  • -c, --cpus TEXT: Specify CPU numbers to target (e.g., '0', '0 9', '0-9', '*').
  • -s, --idle-states TEXT: Specify CPU idle states (e.g., '0', '0 9', '0-9', '*').
  • --disable: Disable the specified CPU idle states.
  • --enable: Enable the specified CPU idle states.
  • --config PATH: YAML configuration file.

cpudev driver [OPTIONS]

  • {intel_pstate|acpi}: Specify the CPUFreq driver to enable or disable.
  • --config PATH: YAML configuration file.

CPUs and Policies Selection

The CPU numbers and policy selector options support a syntax similar to that of the pdftk cat command. You can assemble a query by concatenating the following blocks with spaces:

  • k: Selects the item whose name is suffixed by k. For example, in the context of CPUs, "8" would select "cpu8".
  • i-j: Allows you to select a continuous range of values between i and j. Boundaries are included. Example: "0-5" is equivalent to "0 1 2 3 4 5".
  • *: Selects every possible number depending on the context. If used for cpus, it will look at /sys/devices/system/cpu/cpu[0-9]+ and generate a range from it. If used for policies, it will do the same with /sys/devices/system/cpu/cpufreq/policy[0-9]+. Finally, if used for selecting idle_states, it will look at /sys/devices/system/cpu/cpu0/cpuidle/state[0-9]+. Note that for the latter case, it assumes every CPU has the same range of idle states.

The query is not checked against the system before it is expanded. Except for "*", the number of items in a category (cpus, idle_states, etc.) is neither known nor verified, so a query that refers to a nonexistent item results in an error.

Example: "0-3 5 7-8" would select every item between 0 and 8 except 4 and 6.
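
To make the selector syntax concrete, here is a minimal sketch of how such a query could be expanded. It is an illustrative re-implementation, not cpudev's actual code: the function name `expand_selector` and its optional path-prefix argument (used to resolve "*") are hypothetical.

```shell
# Illustrative re-implementation of the selector syntax; NOT cpudev's code.
# expand_selector QUERY [PREFIX]
#   QUERY  : blocks separated by spaces: "k", "i-j" or "*"
#   PREFIX : path prefix used to resolve "*",
#            e.g. /sys/devices/system/cpu/cpu
expand_selector() {
    local query="$1" prefix="${2:-}" block out="" lo hi n path
    set -f                # keep "*" literal while splitting the query
    set -- $query
    set +f
    for block in "$@"; do
        case "$block" in
            "*")
                # Enumerate existing entries: ${prefix}0, ${prefix}1, ...
                for path in "$prefix"[0-9]*; do
                    [ -e "$path" ] || continue
                    out="$out ${path#"$prefix"}"
                done
                ;;
            *[!0-9-]* | -* | *- | *-*-*)
                echo "invalid selector block: $block" >&2
                return 1
                ;;
            *-*)
                # Inclusive range i-j
                lo="${block%-*}"; hi="${block#*-}"
                n="$lo"
                while [ "$n" -le "$hi" ]; do
                    out="$out $n"
                    n=$((n + 1))
                done
                ;;
            *)
                out="$out $block"
                ;;
        esac
    done
    echo "${out# }"
}

expand_selector "0-3 5 7-8"   # prints: 0 1 2 3 5 7 8
```

On a compute node, `expand_selector "*" /sys/devices/system/cpu/cpu` would list the CPU numbers found in sysfs (in glob order, so cpu10 sorts before cpu2).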

Configuration File

The YAML file should contain sections for the cpufreq, cpuidle, and driver parameters. Multiple list items are supported within these categories, mirroring the way commands can be chained. This makes it possible to apply different settings to different CPU subsets.

The driver category supports only the "intel_pstate" and "acpi" values. Order matters because the file is processed sequentially. For example, place the driver parameter at the top if you want to use governors provided by a specific driver.

Example:

driver: "acpi"
cpufreq:
  - governor: userspace
    policies: "0-2"
    frequency: "2500000"
  - governor: performance
    policies: "4"
cpuidle:
  - cpus: "0-9"
    idle_states: "1"
    disable: true
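
Since list items mirror chained commands, the YAML example above should correspond to a single chained call along the lines of the sketch below. The exact equivalence is an assumption based on the usage line, and the wrapper function `apply_example_settings` and the `CPUDEV` override variable are hypothetical names introduced here to allow a dry run.

```shell
# Chained counterpart of the YAML example (assumed equivalence).
# CPUDEV is a hypothetical override: set it to "echo" for a dry run.
apply_example_settings() {
    "${CPUDEV:-cpudev}" \
        driver acpi \
        cpufreq --governor userspace --policies "0-2" --frequency 2500000 \
        cpufreq --governor performance --policies "4" \
        cpuidle --cpus "0-9" --idle-states "1" --disable
}

# Dry run: only prints the arguments that would be passed to cpudev.
CPUDEV=echo apply_example_settings
```

Inside an exclusive job, calling `apply_example_settings` without the override would run cpudev itself.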

Examples

  • Apply a configuration from a YAML file:

    cpudev apply --config my_config.yaml
    

  • Set the powersave governor for CPU policies 0 to 3:

    cpudev cpufreq --governor powersave --policies "0-3"
    

  • Disable CPU idle state 1 for CPU 0:

    cpudev cpuidle --cpus "0" --idle-states "1" --disable
    

  • Enable the intel_pstate driver:

    cpudev driver intel_pstate
    

  • Show help for applying a configuration:

    cpudev apply --help
    

Technical Details

About Frequency (APERF, MPERF, TSC)

How is it computed?
On x86

The files /proc/cpuinfo and /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq report a CPU frequency computed with the following method.

The formula used by the kernel to compute the frequency is the following:

\(\text{BusyMHz} = \frac{\Delta_{\text{APERF}}}{\Delta_{\text{MPERF}}} \times \text{freq}_{\text{base}}\)

On x86 platforms, computing a core's frequency involves three registers (MSRs):

  • APERF and MPERF: Individually meaningless, but their ratio (APERF/MPERF) provides a coefficient to multiply with a base frequency. According to the Intel Software Developer's Manual, it is also used to compute the usage proportion of the cores (Volume 3B, 19.17, p.682 and Volume 3B, 16.2, p.500).
  • TSC: A frequency-invariant counter.

According to the Linux kernel source code, the base frequency (\(\text{freq}_{\text{base}}\)) is a fixed value stored in the cpu_khz variable, which can be retrieved using BPF. This frequency is calculated using the TSC and represents the frequency of the core at the maximum non-turbo P-State. The following code snippet allows you to retrieve this value:

CPU_KHZ_ADDR=$(sudo cat /proc/kallsyms | grep "D cpu_khz" | cut -f1 -d" ") && sudo bpftrace -e "BEGIN { \$cpu_khz_addr = 0x$CPU_KHZ_ADDR ; printf(\"cpu_khz: %d\", *\$cpu_khz_addr); exit();}"
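
Once the base frequency is known, the BusyMHz formula reduces to simple arithmetic on two counter deltas. The sketch below only performs that arithmetic: the function name `busy_mhz` is hypothetical, and the deltas in the example call are made-up values, not measurements.

```shell
# busy_mhz DELTA_APERF DELTA_MPERF BASE_KHZ
# Prints the effective frequency in MHz (integer division, rounds down).
# Intended for short sampling intervals: very large deltas could
# overflow 64-bit shell arithmetic.
busy_mhz() {
    local d_aperf="$1" d_mperf="$2" base_khz="$3"
    if [ "$d_mperf" -le 0 ]; then
        # Both counters only advance while the core is active.
        echo "MPERF did not advance during the interval" >&2
        return 1
    fi
    echo $(( d_aperf * base_khz / d_mperf / 1000 ))
}

# Made-up deltas: APERF advanced 1.5x faster than MPERF, base is 2.0 GHz.
busy_mhz 1500000 1000000 2000000   # prints: 3000
```

On a real node, the deltas would come from reading the IA32_APERF (0xE8) and IA32_MPERF (0xE7) MSRs twice on the same core, for instance with rdmsr from msr-tools; that sampling step is left out here.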

Important

The \(\text{freq}_{\text{base}}\) value must not be confused with the nominal frequency given by chip manufacturers: these two values are different.

Linux Kernel source code comment

The scheduler wants to do frequency invariant accounting and needs a \(<1\) ratio to account for the "current" frequency, corresponding to \(\text{freq}_{\text{curr}} / \text{freq}_{\text{max}}\). Since the frequency \(\text{freq}_{\text{curr}}\) on x86 is controlled by micro-controller and our P-State setting is little more than a request/hint, we need to observe the effective frequency "BusyMHz", i.e. the average frequency over a time interval after discarding idle time. This is given by:
\(\text{BusyMHz} = \frac{\Delta_{\text{APERF}}}{\Delta_{\text{MPERF}}} \times \text{freq}_{\text{base}}\) where \(\text{freq}_{\text{base}}\) is the max non-turbo P-State. The \(\text{freq}_{\text{max}}\) term has to be set to a somewhat arbitrary value, because we can't know which turbo states will be available at a given point in time: it all depends on the thermal headroom of the entire package. We set it to the turbo level with 4 cores active. Benchmarks show that's a good compromise between the 1C turbo ratio (\(\text{freq}_{\text{curr}} / \text{freq}_{\text{max}}\) would rarely reach 1) and something close to \(\text{freq}_{\text{base}}\), which would ignore the entire turbo range (a conspicuous part, making \(\text{freq}_{\text{curr}} / \text{freq}_{\text{max}}\) always maxed out).

How can I set it? (CPUFreq)

CPUFreq is a Linux kernel software interface that manages DVFS (dynamic voltage and frequency scaling), a hardware feature in modern chips designed to reduce power consumption, alongside techniques like clock-gating and power-gating.

The CPUFreq software stack includes:

| Component | Role |
|-----------|------|
| Core | Registers CPU cores within the software and assigns them policies. |
| Governor | Decides the optimal frequency based on system metrics. It holds the main algorithm to decide which entity needs to scale up or down its frequency. |
| Policy | Applies the governor's decisions across associated CPUs. |
| Driver | Interfaces with hardware to set and retrieve frequencies. |

Theoretically, any governor can work with any hardware, and different governors can manage distinct logical CPU subsets.

Each governor can have its own set of parameters we can change to influence its choices (see the schedutil governor for example).
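
To see how these components map onto a given node, the policies exposed in sysfs can be listed together with their driver, governor and CPUs. This is a minimal read-only sketch; the function name `show_policies` and its optional root-directory argument are hypothetical, while the sysfs file names are the standard CPUFreq ones.

```shell
# show_policies [ROOT]
# Prints one line per CPUFreq policy found under ROOT
# (default: /sys/devices/system/cpu/cpufreq).
show_policies() {
    local root="${1:-/sys/devices/system/cpu/cpufreq}" policy
    for policy in "$root"/policy[0-9]*; do
        [ -d "$policy" ] || continue
        printf '%s: driver=%s governor=%s cpus=%s\n' \
            "${policy##*/}" \
            "$(cat "$policy/scaling_driver")" \
            "$(cat "$policy/scaling_governor")" \
            "$(cat "$policy/affected_cpus")"
    done
}

show_policies
```

The governors that can be selected for a policy are listed in its scaling_available_governors file.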

On Intel

Intel developed its own driver and governor (Intel P-State) for processors starting with the Sandy Bridge architecture. It supports hardware-managed P-States (HWP), enabling automatic frequency adjustments. The driver operates in two modes:

| Mode | Description |
|------|-------------|
| Active | User-space frequency control is disabled; only "powersave" and "performance" governors are available. |
| Passive | Behaves like a standard CPUFreq driver, allowing Linux kernel governors. |

Tip

Intel also introduced EPB, a 16-level scale to prioritize performance or power efficiency, used by Intel P-State for frequency arbitration. We can modify this bias directly through sysfs. More information on its dedicated page.
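
As a small illustration, the current bias of each CPU can be read back from sysfs. The sketch assumes the standard kernel location of the attribute (cpuN/power/energy_perf_bias, present only on CPUs that support EPB); the function name `show_epb` and its optional root argument are hypothetical.

```shell
# show_epb [ROOT]
# Prints the EPB value of every CPU found under ROOT
# (default: /sys/devices/system/cpu).
# Values range from 0 (favor performance) to 15 (favor energy savings).
show_epb() {
    local root="${1:-/sys/devices/system/cpu}" f cpu
    for f in "$root"/cpu[0-9]*/power/energy_perf_bias; do
        [ -r "$f" ] || continue
        cpu="${f#"$root"/}"      # strip the root directory
        cpu="${cpu%%/*}"         # keep only the cpuN component
        echo "$cpu: $(cat "$f")"
    done
}

show_epb
```

Writing a value between 0 and 15 to the same file (as root) changes the bias for that CPU.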

Unfortunately, it is not possible to set a specific frequency with this driver enabled.

To work around this limitation, follow these steps:

  • Switch Intel P-State to passive mode, in which it behaves like a standard CPUFreq driver and accepts the generic kernel governors
    echo passive | sudo tee /sys/devices/system/cpu/intel_pstate/status
    
  • Select the core whose frequency we want to set; here it will be cpu0
  • Select the governor that lets us set the frequency for this core (e.g. cpu0): userspace
    echo "userspace" | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    
  • Set the frequency we are targeting in the following files (here it is 1800000, which represents 1.8 GHz):
    echo 1800000 | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
    echo 1800000 | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
    
  • You can then verify the effect by executing the following command (stripping the "cpu" prefix also handles CPU numbers with several digits):
    CPU_ID="cpu0" && grep "cpu MHz" /proc/cpuinfo | sed "$((${CPU_ID#cpu} + 1))q;d"
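
The steps above can be wrapped in a small helper that also checks the requested frequency against the hardware limits before writing anything. This is a sketch under assumptions: the function name `pin_frequency` and its optional root argument are hypothetical, it must run as root (it uses plain redirections instead of sudo tee), and it assumes the driver already accepts the userspace governor (for example after the switch to passive mode).

```shell
# pin_frequency CPU_INDEX FREQ_KHZ [ROOT]
# Pins one CPU to a fixed frequency through the userspace governor.
# ROOT defaults to /sys/devices/system/cpu.
pin_frequency() {
    local cpu="$1" khz="$2" root="${3:-/sys/devices/system/cpu}"
    local dir="$root/cpu$cpu/cpufreq" min max
    # Hardware limits reported by the driver, in kHz.
    min=$(cat "$dir/cpuinfo_min_freq") || return 1
    max=$(cat "$dir/cpuinfo_max_freq") || return 1
    if [ "$khz" -lt "$min" ] || [ "$khz" -gt "$max" ]; then
        echo "frequency $khz kHz is outside [$min, $max] kHz" >&2
        return 1
    fi
    echo userspace > "$dir/scaling_governor"
    # Same two writes as in the manual steps: min and max set to the target.
    echo "$khz" > "$dir/scaling_min_freq"
    echo "$khz" > "$dir/scaling_max_freq"
}
```

For example, `pin_frequency 0 1800000` reproduces the manual steps for cpu0 at 1.8 GHz.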
    

About Sleeping (CPUIdle)

CPUIdle is the Linux kernel software stack that manages processor idle states. Unlike CPUFreq, which adjusts frequency, CPUIdle allows a logical CPU to stop executing instructions and power down parts of its circuitry to save energy.

The CPUIdle stack shares similarities with CPUFreq: it uses the same governor and driver concepts, which allows different policies to be applied depending on the platform or on the idle states defined.

A sleep state is defined by two key metrics:

  • Target residency: Minimum time required in the state to save more energy than a lighter state (includes entry time).
  • Exit latency: Maximum time the CPU needs, after asking the hardware to enter the state, to start executing the first instruction following a wake-up.

What a sleep state concretely entails is neither clear nor defined in the source code (it points to assembly labels coming from compiled drivers). It depends on the platform on which the states are defined.
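
Although the hardware behavior of a state is opaque, its two metrics are exposed in sysfs and can be listed per CPU. A minimal read-only sketch follows; the function name `list_idle_states` and its optional root argument are hypothetical, while the per-state files (name, latency, residency, disable) are the standard CPUIdle attributes, with times in microseconds.

```shell
# list_idle_states [CPU_INDEX] [ROOT]
# Prints one line per idle state of the given CPU (default: cpu0).
# ROOT defaults to /sys/devices/system/cpu.
list_idle_states() {
    local cpu="${1:-0}" root="${2:-/sys/devices/system/cpu}" s
    for s in "$root/cpu$cpu"/cpuidle/state[0-9]*; do
        [ -d "$s" ] || continue
        printf '%s name=%s exit_latency_us=%s target_residency_us=%s disabled=%s\n' \
            "${s##*/}" \
            "$(cat "$s/name")" \
            "$(cat "$s/latency")" \
            "$(cat "$s/residency")" \
            "$(cat "$s/disable")"
    done
}

list_idle_states 0
```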

Info

When no tasks are available, the kernel schedules a special idle (or swapper) task, triggering the selected CPUIdle governor to enter a sleep state. However, scheduler ticks (1–10 ms) prevent deep sleep states. "Tickless" kernels address this by disabling periodic ticks and waking only on external interrupts.

Playing with Sleep States
System Analysis

Commands to inspect CPUIdle configuration:

  • List available governors:

    cat /sys/devices/system/cpu/cpuidle/available_governors
    

  • List available C-states:

    cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
    

  • Check current configuration:

    • Current governor:
      cat /sys/devices/system/cpu/cpuidle/current_governor
      
    • Enabled/disabled states for a CPU:
      cat /sys/devices/system/cpu/cpu0/cpuidle/state*/disable
      
      0 = enabled, 1 = disabled.

Configuration

Commands to modify CPUIdle behavior:

  • Change governor:

    echo "teo" | sudo tee /sys/devices/system/cpu/cpuidle/current_governor
    

  • Enable/Disable C-states:

    • Disable a state (e.g., state3 for cpu0):
      echo "1" | sudo tee /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
      
    • Re-enable a state:
      echo "0" | sudo tee /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
      
    • Disable all deep states (e.g., state2 and deeper, for cpu0):
      for STATE in /sys/devices/system/cpu/cpu0/cpuidle/state[2-9]/disable; do
          echo "1" | sudo tee "$STATE"
      done
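
The loop above selects states by index on a single CPU. A variant that selects them by exit latency on every CPU is sketched below, since state indices do not necessarily map to the same C-states on all platforms. The function name `disable_slow_states` and its optional root argument are hypothetical, and it must run as root because it writes with plain redirections.

```shell
# disable_slow_states MAX_LATENCY_US [ROOT]
# Disables, on every CPU, the idle states whose exit latency (in
# microseconds) is strictly greater than MAX_LATENCY_US.
# ROOT defaults to /sys/devices/system/cpu.
disable_slow_states() {
    local max_us="$1" root="${2:-/sys/devices/system/cpu}" s
    for s in "$root"/cpu[0-9]*/cpuidle/state[0-9]*; do
        [ -r "$s/latency" ] || continue
        if [ "$(cat "$s/latency")" -gt "$max_us" ]; then
            echo 1 > "$s/disable"
        fi
    done
}
```

For example, `disable_slow_states 10` would leave only the states that wake up in 10 microseconds or less.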