
Hardware Node Settings

This page lists the known hardware and software configurations that users can adjust on the compute nodes. This level of control is generally not available on traditional clusters, which makes Dalek quite unique in this respect.

CPU Driver

Dalek compute nodes are configured to allow users to modify the CPU driver parameters. The only prerequisite is membership in the cpudev group. To be added to the cpudev group, please contact the system administrators.

To prevent unexpected behavior, the CPU configuration is automatically reset when a SLURM job terminates. For this reason, when working with CPU drivers, we strongly recommend running exclusive jobs (i.e., reserving the entire node).

To ensure system stability, the cpudev_backup script is executed at node boot time. This script creates a snapshot of the /sys/devices/system/cpu hierarchy and stores it in the /tmp/cpudev_sysfs_backup.txt file. When a job ends, the cpudev_restore script is automatically invoked to restore the original CPU configuration.

To modify CPU drivers (e.g., adjusting frequencies or idle behaviors), we strongly encourage users to use the cpupower command-line tool. Normally, cpupower requires root privileges. On Dalek, however, users belonging to the cpudev group are allowed to run it via sudo:

sudo cpupower [parameters]

Please refer to the official cpupower documentation for usage details:

Advanced users may also experiment by directly modifying files under /sys/devices/system/cpu. To enable this, a dedicated helper binary, cpudev_setperms, is installed on the nodes. This tool parses the /sys/devices/system/cpu hierarchy and, for each file:

  • changes the ownership from root:root to root:cpudev,
  • grants the cpudev group the same permissions as the root user.

This allows members of the cpudev group to modify the relevant sysfs entries directly. The tool must be run on an exclusively reserved node and can be invoked as follows:

sudo cpudev_setperms
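The second change applied to each file (giving the group the same permissions as the owner) is a small bit manipulation on the file mode. The following Python sketch illustrates it; the function name group_equals_owner is hypothetical and the sketch only computes the new mode, it does not touch any file.

```python
import stat


def group_equals_owner(mode: int) -> int:
    """Return `mode` with the group permission bits copied from the owner bits."""
    owner_bits = mode & stat.S_IRWXU          # rwx bits of the owner
    without_group = mode & ~stat.S_IRWXG      # clear the current group bits
    return without_group | (owner_bits >> 3)  # owner bits shifted into the group slot


# A sysfs attribute that is rw for root only (0644) becomes rw for the group too.
print(oct(group_equals_owner(0o644)))  # prints 0o664
```

Other bits (file type, permissions for others) are left unchanged.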

Warning

When modifying files directly under /sys/devices/system/cpu, additional files may be created (for example, when switching CPU drivers). In such cases, cpudev_setperms must be run again to update permissions on the newly created files.

Tips

powertop is installed on the nodes and can be run with sudo if you are in the cpudev group.

cpudev [Advanced]

Info

For standard CPU driver adjustments, we recommend using cpupower, as described in the previous section, rather than relying on cpudev.

cpudev is a tool designed specifically for Dalek to configure the CPU device. It lets you change the CPU driver, its governor, the frequency of each core, idle states, and so on.

You can easily add cpudev to your PATH by doing:

module load cpudev

cpudev modifications last only for the duration of the SLURM job. Moreover, only exclusive jobs (jobs that reserve an entire compute node) can use it. When a job terminates, the original CPU parameters are restored to ensure that experiments on the nodes remain reproducible. This means that cpudev must be invoked at the beginning of each job.

Tips

A man page is available for this command, as well as reminders using the --help option with any of the following subcommands.

Info

This utility is available to anyone in the cpudev group. It builds on the functionality detailed in the Technical Details section.

Usage

cpudev [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

cpudev is a command-line tool for configuring CPU settings, including frequency and idle states. It allows applying configurations via specific commands or a YAML file. It is closely related to the CPU subsystems CPUFreq and CPUIdle, which we recommend you be aware of before using this utility (see the Technical Details section below).

cpudev commands can be chained and are applied sequentially. Parameters set by earlier commands are remembered by later ones. You can, for example, use the cpufreq command multiple times in the same call to specify different configurations for different subsets of CPUs.

Subcommands

  • apply: Apply configurations from a YAML file.
  • cpufreq: Set CPUFreq settings for specified CPU policies.
  • cpuidle: Enable or disable specific CPU idle states for specified CPU sets.
  • driver: Enable or disable the specified CPUFreq driver.

Below are the options of each subcommand.

cpudev apply PATH
  • PATH: Path to a YAML configuration file to apply settings from.
cpudev cpufreq [OPTIONS]
  • -g, --governor TEXT: Set CPUFreq governor (e.g., performance, powersave).
  • -p, --policies TEXT: Specify CPU policies to target (e.g., '0', '0 9', '0-9', '*').
  • -f, --frequency INTEGER: Set fixed CPU frequency (in kHz). Note that this option will not work unless the governor is set to userspace and the driver to acpi. See cpufreq for more details.
  • --config PATH: YAML configuration file.
cpudev cpuidle [OPTIONS]
  • -c, --cpus TEXT: Specify CPU numbers to target (e.g., '0', '0 9', '0-9', '*').
  • -s, --idle-states TEXT: Specify CPU idle states (e.g., '0', '0 9', '0-9', '*').
  • --disable: Disable the specified CPU idle states.
  • --enable: Enable the specified CPU idle states.
  • --config PATH: YAML configuration file.
cpudev driver DRIVER_NAME
  • {intel_pstate|amd_pstate|acpi}: Specify the CPUFreq driver to enable or disable.
CPUs and Policies Selection

The CPU numbers and policy selector options support a syntax similar to that of the pdftk cat command. You can assemble a query by concatenating the following blocks with spaces:

  • k: Selects the item whose name is suffixed by k. For example, in the context of CPUs, "8" would select "cpu8".
  • i-j: Allows you to select a continuous range of values between i and j. Boundaries are included. Example: "0-5" is equivalent to "0 1 2 3 4 5".
  • *: Selects every possible number depending on the context. If used for cpus, it will look at /sys/devices/system/cpu/cpu[0-9]+ and generate a range from it. If used for policies, it will do the same with /sys/devices/system/cpu/cpufreq/policy[0-9]+. Finally, if used for selecting idle_states, it will look at /sys/devices/system/cpu/cpu0/cpuidle/state[0-9]+. Note that for the latter case, it assumes every CPU has the same range of idle states.

No assumptions or verifications are made about the query or the system when the query is expanded. Except for "*", the number of items in a category (cpus, idle_states, etc.) is neither known nor checked beforehand: an invalid query results in an error.

Example: "0-3 5 7-8" would select every item between 0 and 8 except 4 and 6.
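To make the selector syntax concrete, here is a minimal Python sketch of how such a query can be expanded. It is an illustration only: the function name parse_selector is hypothetical, and the "*" block is deliberately left out because it depends on the content of sysfs.

```python
def parse_selector(query: str) -> list:
    """Expand a selector such as "0-3 5 7-8" into a list of integers."""
    items = []
    for block in query.split():
        if "-" in block:
            # "i-j": continuous range, both boundaries included
            start, end = (int(bound) for bound in block.split("-"))
            items.extend(range(start, end + 1))
        else:
            # "k": a single item
            items.append(int(block))
    return items


print(parse_selector("0-3 5 7-8"))  # prints [0, 1, 2, 3, 5, 7, 8]
```

As in the description above, nothing checks that the selected items actually exist on the node.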

Configuration File

The YAML file should contain sections for cpufreq, cpuidle, and driver parameters. Multiple list items are supported within these categories, mirroring the way commands can be chained on the command line. This makes it possible to apply different settings to different CPU subsets.

The driver category supports only the "intel_pstate", "amd_pstate" or "acpi" values. The order is important as the file is processed sequentially. For example, you should place the driver parameter at the top if you want to use the governors of a specific driver.

Example:

driver: "acpi"
cpufreq:
  - governor: userspace
    policies: "0-2"
    frequency: "2500000"
  - governor: performance
    policies: "4"
cpuidle:
  - cpus: "0-9"
    idle_states: "1"
    disable: true
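
To show how the file relates to chained commands, the following Python sketch translates an already-parsed configuration into the command-line arguments documented above. This is not the cpudev implementation: the function name config_to_commands is hypothetical, the one-to-one mapping between YAML keys and options is an assumption, and so is the choice of emitting --enable when disable is absent or false. The dictionary stands in for the result of loading the YAML example.

```python
def config_to_commands(config: dict) -> list:
    """Translate a parsed configuration into lists of cpudev arguments (assumed mapping)."""
    commands = []
    for section, content in config.items():  # dict order mirrors the file order
        if section == "driver":
            commands.append(["driver", str(content)])
        elif section == "cpufreq":
            for entry in content:
                args = ["cpufreq"]
                for key in ("governor", "policies", "frequency"):
                    if key in entry:
                        args += [f"--{key}", str(entry[key])]
                commands.append(args)
        elif section == "cpuidle":
            for entry in content:
                args = ["cpuidle", "--cpus", str(entry["cpus"]),
                        "--idle-states", str(entry["idle_states"])]
                # Assumption: a missing or false "disable" key means "enable".
                args.append("--disable" if entry.get("disable") else "--enable")
                commands.append(args)
    return commands


example = {
    "driver": "acpi",
    "cpufreq": [
        {"governor": "userspace", "policies": "0-2", "frequency": "2500000"},
        {"governor": "performance", "policies": "4"},
    ],
    "cpuidle": [{"cpus": "0-9", "idle_states": "1", "disable": True}],
}

for command in config_to_commands(example):
    print("cpudev", *command)
```

Each list item becomes one command, processed in the order in which it appears in the file.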

Examples

  • Apply a configuration from a YAML file:

    cpudev apply my_config.yaml
    

  • Set the powersave governor for CPU policies 0 to 3:

    cpudev cpufreq --governor powersave --policies "0-3"
    

  • Disable CPU idle state 1 for CPU 0:

    cpudev cpuidle --cpus "0" --idle-states "1" --disable
    

  • Enable the intel_pstate driver:

    cpudev driver intel_pstate
    

  • Show help for applying a configuration:

    cpudev apply --help
    

Technical Details

About Frequency (APERF, MPERF, TSC)
How is it computed?
# On x86

The files /proc/cpuinfo and /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq use the following method to calculate the CPU frequency.

The kernel retrieves the frequency with the following formula:

\(\text{BusyMHz} = \frac{\Delta_{\text{APERF}}}{\Delta_{\text{MPERF}}} \times \text{freq}_{\text{base}}\)

On x86 platforms, computing a core's frequency involves three registers (MSRs):

  • APERF and MPERF: Individually meaningless, but their ratio (APERF/MPERF) provides a coefficient to multiply with a base frequency. According to the Intel Developer's manual, it is also used to compute the usage proportion of the cores (Volume 3B, 19.17, p.682 and Volume 3B, 16.2, p.500).
  • TSC: A frequency-invariant counter.
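
As a worked illustration of the BusyMHz formula given earlier in this section, the Python sketch below applies it to two samples of the counters. The function name busy_mhz and the counter values are made up for the example; reading the actual MSRs is out of scope here.

```python
def busy_mhz(aperf_start: int, aperf_end: int,
             mperf_start: int, mperf_end: int, base_khz: int) -> float:
    """Average frequency (in MHz) over an interval, idle time excluded."""
    delta_aperf = aperf_end - aperf_start
    delta_mperf = mperf_end - mperf_start
    if delta_mperf == 0:
        raise ValueError("MPERF did not advance over the interval")
    return delta_aperf / delta_mperf * base_khz / 1000.0


# Illustrative values: APERF advanced 1.5x faster than MPERF, base frequency 2.4 GHz.
print(busy_mhz(0, 3_000_000, 0, 2_000_000, 2_400_000))  # prints 3600.0
```

A ratio above 1 means the core ran above the base frequency (turbo range) while it was busy; a ratio below 1 means it ran below it.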

According to the Linux kernel source code, the base frequency (\(\text{freq}_{\text{base}}\)) is a fixed value stored in the cpu_khz variable, which can be retrieved using BPF. This frequency is calculated using the TSC and represents the frequency of the core at the maximum non-turbo P-State. The following command retrieves this value:

CPU_KHZ_ADDR=$(sudo cat /proc/kallsyms | grep "D cpu_khz" | cut -f1 -d" ") && sudo bpftrace -e "BEGIN { \$cpu_khz_addr = 0x$CPU_KHZ_ADDR ; printf(\"cpu_khz: %d\", *\$cpu_khz_addr); exit();}"

Important

The \(\text{freq}_{\text{base}}\) value must not be confused with the nominal frequency given by chip manufacturers: the two values are different.

Linux Kernel source code comment

The scheduler wants to do frequency invariant accounting and needs a \(<1\) ratio to account for the "current" frequency, corresponding to \(\text{freq}_{\text{curr}} / \text{freq}_{\text{max}}\). Since the frequency \(\text{freq}_{\text{curr}}\) on x86 is controlled by micro-controller and our P-State setting is little more than a request/hint, we need to observe the effective frequency "BusyMHz", i.e. the average frequency over a time interval after discarding idle time. This is given by:
\(\text{BusyMHz} = \frac{\Delta_{\text{APERF}}}{\Delta_{\text{MPERF}}} \times \text{freq}_{\text{base}}\)
where \(\text{freq}_{\text{base}}\) is the max non-turbo P-State. The \(\text{freq}_{\text{max}}\) term has to be set to a somewhat arbitrary value, because we can't know which turbo states will be available at a given point in time: it all depends on the thermal headroom of the entire package. We set it to the turbo level with 4 cores active. Benchmarks show that's a good compromise between the 1C turbo ratio (\(\text{freq}_\text{curr} / \text{freq}_\text{max}\) would rarely reach 1) and something close to \(\text{freq}_\text{base}\), which would ignore the entire turbo range (a conspicuous part, making \(\text{freq}_{\text{curr}} / \text{freq}_{\text{max}}\) always maxed out).

How can I set it? (CPUFreq)

CPUFreq is a Linux kernel software interface that manages DVFS (Dynamic Voltage and Frequency Scaling), a hardware feature in modern chips designed to reduce power consumption, alongside techniques like clock-gating and power-gating.

The CPUFreq software stack includes:

| Component | Role |
|-----------|------|
| Core | Registers CPU cores within the software and assigns them policies. |
| Governor | Decides the optimal frequency based on system metrics. It holds the main algorithm to decide which entity needs to scale up or down its frequency. |
| Policy | Applies the governor's decisions across associated CPUs. |
| Driver | Interfaces with hardware to set and retrieve frequencies. |

Theoretically, any governor can work with any hardware, and different governors can manage distinct logical CPU subsets.

Each governor can have its own set of parameters we can change to influence its choices (see the schedutil governor for example).

# On Intel

Intel developed its own driver and governor (Intel P-State) for processors from the Sandy Bridge architecture onward. It supports hardware-managed P-States (HWP), enabling automatic frequency adjustments. The driver operates in two modes:

| Mode | Description |
|------|-------------|
| Active | User-space frequency control is disabled; only "powersave" and "performance" governors are available. |
| Passive | Behaves like a standard CPUFreq driver, allowing Linux kernel governors. |

Tip

Intel also introduced EPB, a 16-level scale to prioritize performance or power efficiency, used by Intel P-State for frequency arbitration. We can modify this bias directly through sysfs. More information on its dedicated page.

Unfortunately, it is not possible to set a specific frequency while this driver is in active mode.

To do so, follow these steps:

  • Switch Intel P-State to passive mode, so that it behaves like a standard CPUFreq driver and accepts the generic governors:
    echo passive | sudo tee /sys/devices/system/cpu/intel_pstate/status
    
  • Select the core whose frequency you want to set (cpu0 here).
  • Select a governor for this core that allows setting the frequency, namely userspace:
    echo "userspace" | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    
  • Set the frequency we are targeting in the following files (here it is 1800000, which represents 1.8 GHz):
    echo 1800000 | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
    echo 1800000 | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
    
  • You can then verify the effects of this process by executing the following command:
    CPU_ID="cpu0" && grep "cpu MHz" /proc/cpuinfo | sed "$((${CPU_ID#cpu} + 1))q;d"
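
The same steps can be scripted. The Python sketch below writes the same values to the same sysfs files, in the same order as the shell commands above. The function name pin_frequency and its sysfs_root parameter are hypothetical (the parameter only exists so the function can be tried against a scratch directory); on a real node it must run with permission to write these files.

```python
from pathlib import Path


def pin_frequency(cpu: int, freq_khz: int,
                  sysfs_root: str = "/sys/devices/system/cpu") -> None:
    """Pin one CPU to a fixed frequency (in kHz) by following the steps above."""
    root = Path(sysfs_root)
    cpufreq = root / f"cpu{cpu}" / "cpufreq"

    # Step 1: put Intel P-State in passive mode.
    (root / "intel_pstate" / "status").write_text("passive\n")
    # Step 2: select the userspace governor for this CPU.
    (cpufreq / "scaling_governor").write_text("userspace\n")
    # Step 3: clamp the minimum and maximum frequencies to the same value.
    for name in ("scaling_min_freq", "scaling_max_freq"):
        (cpufreq / name).write_text(f"{freq_khz}\n")


# Usage on a node (requires write access to sysfs):
#   pin_frequency(0, 1_800_000)   # cpu0 at 1.8 GHz
```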
    
About Sleeping (CPUIdle)

CPUIdle is the Linux kernel software stack that manages processor idle states. Unlike CPUFreq, which adjusts frequency, CPUIdle allows a logical CPU to stop executing instructions and power down parts of its circuitry to save energy.

The CPUIdle stack shares similarities with CPUFreq: it uses the same governor and driver concepts, which allows different policies to be applied depending on the platform or on the idle states defined.

A sleep state is defined by two key metrics:

  • Target residency: Minimum time the hardware must spend in the state, including the time needed to enter it, to save more energy than it would in a shallower state.
  • Exit latency: Maximum time a CPU takes, after asking the hardware to enter the state, to start executing its first instruction following a wakeup.
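
To illustrate how these two metrics are used, here is a deliberately simplified Python sketch of a state selection: pick the deepest state whose target residency fits in the predicted idle time and whose exit latency respects a latency limit. The names (IdleState, pick_state) and the numbers are made up for the example, and real governors such as menu or teo rely on more elaborate heuristics.

```python
from typing import NamedTuple, Sequence


class IdleState(NamedTuple):
    name: str
    target_residency_us: int
    exit_latency_us: int


def pick_state(states: Sequence[IdleState], predicted_idle_us: int,
               latency_limit_us: int) -> IdleState:
    """Return the deepest eligible state; `states` is ordered from shallowest to deepest."""
    chosen = states[0]  # the shallowest state is always acceptable
    for state in states:
        if (state.target_residency_us <= predicted_idle_us
                and state.exit_latency_us <= latency_limit_us):
            chosen = state
    return chosen


# Illustrative values only, not taken from a real platform.
STATES = [
    IdleState("POLL", 0, 0),
    IdleState("C1", 2, 2),
    IdleState("C1E", 20, 10),
    IdleState("C6", 600, 130),
]

print(pick_state(STATES, predicted_idle_us=100, latency_limit_us=1000).name)   # prints C1E
print(pick_state(STATES, predicted_idle_us=5000, latency_limit_us=1000).name)  # prints C6
print(pick_state(STATES, predicted_idle_us=5000, latency_limit_us=50).name)    # prints C1E
```

Disabling a state, as described later on this page, simply removes it from the candidates.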

What a sleep state concretely entails is neither clear nor defined in the source code (which points to assembly labels coming from compiled drivers). It depends on the platform on which the states are defined.

Info

When no tasks are available, the kernel schedules a special idle (or swapper) task, triggering the selected CPUIdle governor to enter a sleep state. However, scheduler ticks (1–10 ms) prevent deep sleep states. "Tickless" kernels address this by disabling periodic ticks and waking only on external interrupts.

Playing with Sleep States
# System Analysis

Commands to inspect CPUIdle configuration:

  • List available governors:

    cat /sys/devices/system/cpu/cpuidle/available_governors
    

  • List available C-states:

    cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
    

  • Check current configuration:

    • Current governor:
      cat /sys/devices/system/cpu/cpuidle/current_governor
      
    • Enabled/disabled states for a CPU:
      cat /sys/devices/system/cpu/cpu0/cpuidle/state*/disable
      
      0 = enabled, 1 = disabled.
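
The inspection commands above can be combined into a small script. The Python sketch below reads the name and disable files of every idle state of one CPU. The function name read_idle_states and its sysfs_root parameter are hypothetical; the parameter only exists so the function can be tried on a scratch directory.

```python
from pathlib import Path


def read_idle_states(cpu: int = 0,
                     sysfs_root: str = "/sys/devices/system/cpu") -> list:
    """Return the idle states of one CPU with their name and enabled flag."""
    cpuidle = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    state_dirs = sorted(cpuidle.glob("state[0-9]*"),
                        key=lambda p: int(p.name[len("state"):]))
    states = []
    for state_dir in state_dirs:
        states.append({
            "index": int(state_dir.name[len("state"):]),
            "name": (state_dir / "name").read_text().strip(),
            # In sysfs, "disable" holds 0 when the state is enabled, 1 otherwise.
            "enabled": (state_dir / "disable").read_text().strip() == "0",
        })
    return states


# Usage on a node:
#   for state in read_idle_states(cpu=0):
#       print(state)
```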
# Configuration

Commands to modify CPUIdle behavior:

  • Change governor:

    echo "teo" | sudo tee /sys/devices/system/cpu/cpuidle/current_governor
    

  • Enable/Disable C-states:

    • Disable a state (e.g., state3 for cpu0):
      echo "1" | sudo tee /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
      
    • Re-enable a state:
      echo "0" | sudo tee /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
      
    • Disable all deep states (e.g., from C2):
      for STATE in /sys/devices/system/cpu/cpu0/cpuidle/state[2-9]/disable; do
          echo "1" | sudo tee "$STATE";
      done
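
The shell loop above only covers cpu0 and states 2 to 9. As a sketch of how this could be generalized to every CPU and to any state index, the Python function below disables (or re-enables) all states from a given index onward. The function name set_states_from and its sysfs_root parameter are hypothetical, and on a real node the caller needs write access to the disable files.

```python
from pathlib import Path


def set_states_from(first_state: int, disable: bool,
                    sysfs_root: str = "/sys/devices/system/cpu") -> list:
    """Disable or enable every idle state with index >= first_state, on all CPUs."""
    value = "1\n" if disable else "0\n"
    changed = []
    pattern = "cpu[0-9]*/cpuidle/state[0-9]*/disable"
    for path in sorted(Path(sysfs_root).glob(pattern)):
        index = int(path.parent.name[len("state"):])
        if index >= first_state:
            path.write_text(value)
            changed.append(str(path))
    return changed


# Usage on a node:
#   set_states_from(2, disable=True)    # disable state2 and deeper everywhere
#   set_states_from(2, disable=False)   # re-enable them
```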
      

CPU Driver Source Codes

Backup Script

/usr/local/sbin/cpudev_backup
#!/bin/bash

# CPU sysfs path and save path
BASE_PATH="/sys/devices/system/cpu"
#BACKUP_FILE="/tmp/cpu_sysfs_backup_$(date +%Y%m%d_%H%M%S).txt"
BACKUP_FILE="/tmp/cpudev_sysfs_backup.txt"

# Files list to exclude
EXCLUDE_FILES=("uevent" "modalias" "subsystem" "device" "scaling_setspeed")

# Create the save file
echo "Path of sysfs CPU parameters save: $BACKUP_FILE"
echo "# sysfs CPU parameters - $(date)" > "$BACKUP_FILE"
echo "# Format: path=value" >> "$BACKUP_FILE"

# Build the command line with file exclusions
exclude_args=()
for file in "${EXCLUDE_FILES[@]}"; do
    exclude_args+=(! -name "$file")
done

# Browse files, exclude those in the list and check owner permissions
find "$BASE_PATH" -type f -perm -u=w "${exclude_args[@]}" 2>/dev/null | while read -r file; do
    value=$(cat "$file" 2>/dev/null)
    if [ $? -eq 0 ]; then
        echo "$file=$value" >> "$BACKUP_FILE"
    fi
done

echo "Save complete."

Restore Script

/usr/local/sbin/cpudev_restore
#!/bin/bash

# Put back the root group to prevent abusive uses later
echo "Restore root group in /sys/devices/system/cpu/ directory"
chgrp -R root /sys/devices/system/cpu/*

# Check that the save file is given, if not stop here
if [ $# -ne 1 ]; then
    exit 0
fi

BACKUP_FILE="$1"

if [ ! -f "$BACKUP_FILE" ]; then
    echo "Error: '$BACKUP_FILE' does not exist."
    exit 1
fi

# Define paths priority order
PRIORITY_PATHS=(
    "intel_pstate" # first those in this sub-folder
    "cpufreq"      # then those in cpufreq/
)

echo "Restore sysfs CPU parameters from $BACKUP_FILE"

# Read the backup file and sort the lines according to priority order
declare -A file_values
declare -A b_file_values # to check
declare -A i_file_values # for a new attempt at ignored files

while IFS= read -r line; do
    # ignore comments
    if [[ "$line" == \#* ]]; then
        continue
    fi
    file=$(echo "$line" | cut -d'=' -f1)
    value=$(echo "$line" | cut -d'=' -f2-)
    file_values["$file"]="$value"
    b_file_values["$file"]="$value"
done < "$BACKUP_FILE"

# Function to restore a file if necessary
restore_file() {
    local file="$1"
    local target_value="$2"
    if [ -n "$target_value" ]; then
        if [ -w "$file" ]; then
            current_value=$(cat "$file" 2>/dev/null)
            if [ "$current_value" != "$target_value" ]; then
                echo "$target_value" | sudo tee "$file" > /dev/null
                echo "Restored: $file = $target_value"
            else
                echo "Up to date: $file ($target_value)"
            fi
        else
            i_file_values["$file"]="$target_value"
            echo "Ignored (writing failed): $file"
        fi
    fi
}

# Restore according to priority order
for pattern in "${PRIORITY_PATHS[@]}"; do
    echo "--- Restoring files in $pattern/ ---"
    for file in "${!file_values[@]}"; do
        if [[ "$file" == *"$pattern"* ]]; then
            restore_file "$file" "${file_values[$file]}"
            unset file_values["$file"]  # to avoid to be processed two times
        fi
    done
done
echo "--- Restoring the remaining files  ---"
for file in "${!file_values[@]}"; do
    restore_file "$file" "${file_values[$file]}"
done

# New attempt for ignored files
echo "--- Restoring previously ignored files ---"
for file in "${!i_file_values[@]}"; do
    restore_file "$file" "${i_file_values[$file]}"
    unset i_file_values["$file"] # to avoid infinite looping
done

# Verify that the sysfs tree has been properly restored
for file in "${!b_file_values[@]}"; do
    v=$(cat "$file" 2>/dev/null)
    t="${b_file_values[$file]}"
    if [[ "$v" != "$t" ]]; then
        echo "Wrong restoration of the sysfs tree. '$file' should have the value '$t', but it has the value '$v'."
        exit 2
    fi
done

echo "Restoration complete."
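
For readers less familiar with Bash associative arrays, the two core ideas of the restore script (parsing the path=value backup format and restoring intel_pstate entries first, then cpufreq entries, then the rest) can be sketched in Python as follows. This is an illustration, not a replacement for the script: the function names are hypothetical and, unlike the Bash version, the sketch keeps the input order inside each group.

```python
PRIORITY_PATTERNS = ("intel_pstate", "cpufreq")


def parse_backup(lines) -> dict:
    """Parse 'path=value' lines, skipping comments; the value may itself contain '='."""
    values = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#") or "=" not in line:
            continue
        path, _, value = line.partition("=")
        values[path] = value
    return values


def restore_order(paths) -> list:
    """Order paths so that priority patterns come first, in pattern order."""
    remaining = list(paths)
    ordered = []
    for pattern in PRIORITY_PATTERNS:
        ordered += [p for p in remaining if pattern in p]
        remaining = [p for p in remaining if pattern not in p]
    return ordered + remaining


backup = parse_backup([
    "# Format: path=value",
    "/sys/devices/system/cpu/cpu0/cpuidle/state1/disable=0",
    "/sys/devices/system/cpu/cpufreq/policy0/scaling_governor=powersave",
    "/sys/devices/system/cpu/intel_pstate/status=active",
])
for path in restore_order(backup):
    print(path, "<-", backup[path])
```

The driver-level files are written first because restoring them can change which files and values are available below cpufreq/.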

Set Permissions

/usr/local/sbin/cpudev_setperms
// g++ -std=c++17 -O2 -Wall -o cpudev_setperms cpudev_setperms.cpp

#include <iostream>
#include <string>
#include <vector>
#include <cctype>   // isspace
#include <cstdlib>  // getenv
#include <cstring>
#include <regex>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/stat.h>
#include <pwd.h>
#include <grp.h>
#include <ftw.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <sstream>
#include <syslog.h>

static const std::vector<std::string> g_local_users = {"administrator", "powerstate", "prober"};
static const std::string g_slurm_path = "/opt/slurm/bin/";
static const std::string g_cpudev_path = "/sys/devices/system/cpu/";
static const std::string g_cpudev_group = "cpudev";

// ================ Get current username =======================================
std::string get_sudo_invoker() {
    // First, check if the program was run via sudo
    const char* sudoUser = getenv("SUDO_USER");
    if (sudoUser && *sudoUser) {
        return std::string(sudoUser);
    }

    // Fallback: use the real UID of the process
    uid_t uid = getuid();
    struct passwd* pw = getpwuid(uid);
    if (pw && pw->pw_name) {
        return std::string(pw->pw_name);
    }

    // Final fallback
    return "unknown";
}

// ================ Run external binary safely and capture its stdout ==========
std::string run_program_capture(const std::string &prog, const std::vector<std::string> &args) {
    int pipefd[2];
    if (pipe(pipefd) == -1) {
        syslog(LOG_ERR, "pipe() failed: %s", strerror(errno));
        return "";
    }

    pid_t pid = fork();
    if (pid < 0) {
        syslog(LOG_ERR, "fork() failed: %s", strerror(errno));
        close(pipefd[0]); close(pipefd[1]);
        return "";
    }

    if (pid == 0) {
        // child
        dup2(pipefd[1], STDOUT_FILENO);
        close(pipefd[0]);
        close(pipefd[1]);

        std::vector<char*> argv;
        argv.reserve(args.size() + 2);
        argv.push_back(const_cast<char*>(prog.c_str()));
        for (const auto &a : args)
            argv.push_back(const_cast<char*>(a.c_str()));
        argv.push_back(nullptr);

        execv(prog.c_str(), argv.data());
        _exit(127);
    }

    // parent
    close(pipefd[1]);
    std::string out;
    char buf[512];
    ssize_t n;
    while ((n = read(pipefd[0], buf, sizeof(buf))) > 0) {
        out.append(buf, buf + n);
    }
    close(pipefd[0]);

    int status = 0;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status) && WEXITSTATUS(status) != 0) {
        syslog(LOG_WARNING, "Program %s exited with status %d", prog.c_str(), WEXITSTATUS(status));
    } else if (WIFSIGNALED(status)) {
        syslog(LOG_WARNING, "Program %s terminated by signal %d", prog.c_str(), WTERMSIG(status));
    }

    while (!out.empty() && (out.back() == '\n' || out.back() == '\r'))
        out.pop_back();
    return out;
}

// ================ get hostname (safe) ========================================
std::string get_hostname() {
    char host[256];
    if (gethostname(host, sizeof(host)) == 0) {
        return std::string(host);
    }
    return "";
}

// ================ get Slurm JobID (first token) ==============================
std::string get_slurm_job_id(const std::string &user) {
    std::string prog = g_slurm_path + "squeue";
    std::string host = get_hostname();
    if (host.empty()) return "";
    std::vector<std::string> args = {"--noheader", ("--nodelist=" + host), ("--user=" + user), "--Format=JobID"};
    std::string out = run_program_capture(prog, args);
    if (out.empty()) return "";

    size_t pos = out.find('\n');
    std::string firstline = (pos == std::string::npos) ? out : out.substr(0, pos);
    std::istringstream iss(firstline);
    std::string token;
    if (!(iss >> token)) return "";
    std::regex jobre("^\\d+$");
    if (std::regex_match(token, jobre)) return token;
    return "";
}

// ================ parse "OverSubscribe=" value from scontrol output ===========
std::string get_over_subscribe_flag(const std::string &job_id) {
    std::string prog = g_slurm_path + "scontrol";
    std::vector<std::string> args = {"show", "job", job_id};
    std::string out = run_program_capture(prog, args);
    if (out.empty()) return "";
    std::string key = "OverSubscribe=";
    size_t p = out.find(key);
    if (p == std::string::npos) return "";
    p += key.size();
    size_t q = p;
    while (q < out.size() && !isspace((unsigned char)out[q])) ++q;
    return out.substr(p, q - p);
}

// ================ Check if user is local (from hard-coded list) =============
bool is_local_user(const std::string &user) {
    for (auto &u : g_local_users) if (u == user) return true;
    return false;
}

// ================ Resolve group name to gid =================================
bool lookup_gid(const std::string &groupname, gid_t &out_gid) {
    struct group *g = getgrnam(groupname.c_str());
    if (!g) {
        syslog(LOG_ERR, "getgrnam('%s') failed", groupname.c_str());
        return false;
    }
    out_gid = g->gr_gid;
    return true;
}

// Global state for nftw callback
static gid_t g_cpudev_gid = (gid_t)-1;
static bool g_do_chgrp = true;
static bool g_do_chmod_g_eq_u = true;
static int g_change_errors = 0;

// ================ nftw callback ==============================================
int nftw_callback(const char *fpath, const struct stat *sb, int typeflag, struct FTW * /*ftwbuf*/) {
    (void)typeflag; // FTW_PHYS used so symlinks won't be followed
    // Change group preserving owner
    if (g_do_chgrp) {
        if (chown(fpath, sb->st_uid, g_cpudev_gid) != 0) {
            // log occasionally; avoid extremely noisy logs - increment counter and log sample
            ++g_change_errors;
            if (g_change_errors <= 5) {
                syslog(LOG_WARNING, "chown failed on %s: %s", fpath, strerror(errno));
            } else if (g_change_errors == 6) {
                syslog(LOG_WARNING, "Further chown failures suppressed (many)");
            }
        }
    }

    // Set group bits equal to owner bits (g = u)
    if (g_do_chmod_g_eq_u) {
        mode_t cur = sb->st_mode;
        mode_t ubits = (cur & S_IRWXU);
        mode_t new_mode = (cur & ~S_IRWXG) | ((ubits >> 3) & S_IRWXG);
        if ((cur & S_IRWXG) != (new_mode & S_IRWXG)) {
            if (chmod(fpath, new_mode) != 0) {
                ++g_change_errors;
                if (g_change_errors <= 5) {
                    syslog(LOG_WARNING, "chmod failed on %s: %s", fpath, strerror(errno));
                } else if (g_change_errors == 6) {
                    syslog(LOG_WARNING, "Further chmod failures suppressed (many)");
                }
            }
        }
    }
    return 0; // continue
}

// ================ Recursively operate on g_cpudev_path safely ===============
bool apply_cpudev_changes() {
    if (!lookup_gid(g_cpudev_group.c_str(), g_cpudev_gid)) {
        syslog(LOG_ERR, "Group '%s' not found", g_cpudev_group.c_str());
        return false;
    }
    g_change_errors = 0;
    // Use 20 file descriptors at the same time.
    // FTW_PHYS prevents following symlinks (equivalent to chmod -P)
    if (nftw(g_cpudev_path.c_str(), nftw_callback, 20, FTW_PHYS) != 0) {
        syslog(LOG_ERR, "nftw failed on %s: %s", g_cpudev_path.c_str(), strerror(errno));
        return false;
    }
    if (g_change_errors > 0) {
        syslog(LOG_WARNING, "Completed with %d change errors under %s", g_change_errors, g_cpudev_path.c_str());
        std::clog << "(WW) Completed with " << g_change_errors
                  << " change errors under " << g_cpudev_path << std::endl;
    } else {
        syslog(LOG_INFO, "Successfully updated ownership and permissions under %s", g_cpudev_path.c_str());
    }
    return (g_change_errors == 0);
}

// ================ main =======================================================
int main() {
    // open syslog
    openlog("cpudev_setperms", LOG_PID | LOG_CONS, LOG_DAEMON);
    syslog(LOG_INFO, "Program start");

    std::string user = get_sudo_invoker();
    syslog(LOG_INFO, "Invoked by user: %s", user.c_str());

    std::string job_id = get_slurm_job_id(user);
    if (!job_id.empty()) {
        syslog(LOG_INFO, "Found SLURM JobID %s for user %s", job_id.c_str(), user.c_str());
    } else {
        syslog(LOG_INFO, "No SLURM JobID found for user %s on this node", user.c_str());
    }

    std::string over_sub;
    if (!job_id.empty()) {
        over_sub = get_over_subscribe_flag(job_id);
        syslog(LOG_INFO, "OverSubscribe for job %s = '%s'", job_id.c_str(), over_sub.c_str());
    }

    bool local = is_local_user(user);
    if (local) syslog(LOG_INFO, "User %s is in local user list", user.c_str());

    if (over_sub == "NO" || local) {
        syslog(LOG_INFO, "Proceeding to change ownership/permissions for %s", g_cpudev_path.c_str());

        if (!apply_cpudev_changes()) {
            syslog(LOG_ERR, "Failed to update ownership/permissions under %s", g_cpudev_path.c_str());
            std::cerr << "(EE) Failed to update some ownership/permission entries under " << g_cpudev_path << "\n";
            closelog();
            return 3;
        }

        syslog(LOG_INFO, "Completed ownership/permission updates for user %s", user.c_str());
        std::cout << "(II) Permissions and ownership updated successfully for user: " << user << "\n";
        closelog();
        return 0;
    } else {
        syslog(LOG_WARNING, "Node is NOT exclusively allocated and user is not local: aborting for user %s", user.c_str());
        std::cerr << "(EE) This node is NOT exclusively allocated via SLURM, you cannot run this program.\n";
        closelog();
        return 4;
    }
}

SPANK Plugin to Invoke Restore Script

/mnt/nfs/software/slurm/lib/spank/cpudev.so
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <stdio.h>
#include <syslog.h>
#include <pwd.h>

#include <slurm/spank.h>

#define LOG

#ifdef LOG
#define OPENLOG(...) openlog(__VA_ARGS__)
#define SYSLOG(...) syslog(__VA_ARGS__)
#define CLOSELOG(...) closelog(__VA_ARGS__)
#else
#define OPENLOG(...)
#define SYSLOG(...)
#define CLOSELOG(...)
#endif

/*
 * All spank plugins must define this macro for the Slurm plugin loader.
 */
SPANK_PLUGIN(cpudev, 1);

int slurm_spank_task_exit(spank_t spank, int argc, char *argv[])
{
    spank_context_t calling_context = spank_context();

    OPENLOG("slurm_spank_cpudev", LOG_CONS | LOG_PID | LOG_NDELAY, LOG_DAEMON);
    SYSLOG(LOG_NOTICE, "task_exit - program started by SLURM, UID is %d", getuid());

    switch (calling_context)
    {
        case S_CTX_REMOTE: {
            char *nodename = getenv("SLURMD_NODENAME");
            if (nodename == NULL)
            {
                slurm_error("cpudev: getenv of \"SLURMD_NODENAME\" failed!");
                SYSLOG(LOG_ERR, "task_exit - getenv of \"SLURMD_NODENAME\" failed!");
            }
            else
            {
                const char* path_restore_script = "/usr/local/sbin/cpudev_restore";
                if (access(path_restore_script, F_OK) == 0)
                {
                    uid_t uid = (uid_t)-1;
                    /* spank_err_t rc = */ spank_get_item(spank, S_JOB_UID, &uid);
                    struct passwd *pws;
                    pws = getpwuid(uid);
                    // getpwuid() may return NULL: fall back to a placeholder name for the logs
                    const char *username = (pws != NULL) ? pws->pw_name : "unknown";

                    char proberctl_script[2048];
                    const char* path_backup = "/tmp/cpudev_sysfs_backup.txt";
                    int is_backup_file;
                    if ((is_backup_file = access(path_backup, F_OK)) == 0)
                    {
                        // changes the group to "root" and restore the CPU sysfs values
                        snprintf(proberctl_script, sizeof(proberctl_script), "%s %s", path_restore_script, path_backup);
                    }
                    else
                    {
                        // in this case the script just changes the group to "root"
                        snprintf(proberctl_script, sizeof(proberctl_script), "%s", path_restore_script);
                    }

                    if (system(proberctl_script))
                    {
                        slurm_error ("cpudev_restore: system command failed!");
                        SYSLOG(LOG_ERR, "task_exit - fail to run cpudev_restore script (user=%s,backup=%d)", username, is_backup_file);
                    }
                    else
                    {
                        SYSLOG(LOG_NOTICE, "task_exit - run cpudev_restore script (user=%s,backup=%d)", username, is_backup_file);
                    }
                }
                else
                {
                    SYSLOG(LOG_NOTICE, "task_exit - cpudev_restore script is not installed on this node");
                }
            }
            break;
        }
        default:
            break;
    }

    CLOSELOG();
    return 0;
}
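
The branching performed by the task-exit hook above can be summarized with a minimal Python sketch. It only rebuilds the command line that the plugin hands to system(); the function name build_restore_command and its parameters are hypothetical, introduced here for illustration, and are not part of the plugin.

```python
import os
import tempfile

def build_restore_command(restore_script, backup_path):
    """Return the command line the hook would run, or None if the script is missing.

    Hypothetical helper: it mirrors the branching of the C code above, nothing more.
    """
    if not os.path.exists(restore_script):
        # Same outcome as the C code: nothing is run when the script is not installed
        return None
    if os.path.exists(backup_path):
        # Backup available: reset the group and restore the saved sysfs values
        return f"{restore_script} {backup_path}"
    # No backup: the script only resets the group
    return restore_script

if __name__ == "__main__":
    # Temporary files stand in for the real script and backup paths
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "cpudev_restore")
        backup = os.path.join(tmp, "cpudev_sysfs_backup.txt")
        print(build_restore_command(script, backup))  # script not installed yet
        open(script, "w").close()
        print(build_restore_command(script, backup))  # script only
        open(backup, "w").close()
        print(build_restore_command(script, backup))  # script followed by backup
```

In the plugin itself the two paths are fixed to /usr/local/sbin/cpudev_restore and /tmp/cpudev_sysfs_backup.txt; the sketch takes them as arguments only so that it can be run without root access.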

cpudev

/usr/local/sbin/cpudev
#!/usr/bin/python3
import click
import re
import os
import glob
import yaml
import time
import subprocess

def call_cpudev_setperms():
    """
    As this script is intended to be used on the Dalek system, the permissions of the
    /sys/devices/system/cpu sysfs hierarchy must be updated after each "important"
    modification. This function serves this purpose.
    """
    try:
        subprocess.run(
            ["sudo", "cpudev_setperms"],
            capture_output=True,
            text=True,
            check=True
        )
    except subprocess.CalledProcessError as e:
        raise RuntimeError(
            f"The following command failed: {e.cmd}\n"
            f"return code: {e.returncode}\n"
            f"stdout:\n{e.stdout}\n"
            f"stderr:\n{e.stderr}"
        ) from e

def parse_range(range_str: str, subsystem: str = "cpufreq") -> list:
    """Parse a string like '0', '0 9', '0-9', or '*' into a list of integers."""
    if range_str == "*":
        if subsystem == "cpufreq":
            return list(range(len(glob.glob("/sys/devices/system/cpu/cpufreq/policy*"))))
        elif subsystem == "cpuidle":
            return list(range(len(glob.glob("/sys/devices/system/cpu/cpu[0-9]*"))))
        elif subsystem == "states":
            return list(range(len(glob.glob("/sys/devices/system/cpu/cpu0/cpuidle/state[0-9]*"))))
    items = []
    for part in range_str.split():
        if '-' in part:
            start, end = map(int, part.split('-'))
            items.extend(range(start, end + 1))
        else:
            items.append(int(part))
    return items

def load_config(config_file):
    """Load configuration from a YAML file."""
    try:
        with open(config_file, 'r') as f:
            return yaml.safe_load(f)
    except Exception as e:
        raise RuntimeError(
            f"When manipulating '{config_file}' file with 'r' mode"
        ) from e

def apply_cpufreq_config(cfg):
    """Apply cpufreq configuration."""
    governor = cfg['governor'] if 'governor' in cfg else None
    policies = cfg['policies'] if 'policies' in cfg else "0"
    frequency = cfg['frequency'] if 'frequency' in cfg else None

    # Checks that the intel_pstate/amd_pstate driver is not in "active" mode, else raises an error
    status_path_intel = "/sys/devices/system/cpu/intel_pstate/status"
    status_path_amd = "/sys/devices/system/cpu/amd_pstate/status"
    status = "inactive"
    if os.path.exists(status_path_intel):
        with open(status_path_intel, "r") as f:
            status = f.read()
    if os.path.exists(status_path_amd):
        with open(status_path_amd, "r") as f:
            status = f.read()
    # Exact comparison: a substring test would also match the "inactive" default
    if status.strip() == "active":
        raise RuntimeError("Driver intel_pstate or amd_pstate is active, impossible to apply the cpufreq config. Switch to acpi first, for instance with the 'driver' subcommand.")

    policy_list = parse_range(policies)
    for policy in policy_list:
        governor_path = f"/sys/devices/system/cpu/cpufreq/policy{policy}/scaling_governor"
        if governor and os.path.exists(governor_path):
            try:
                with open(governor_path, 'w') as f:
                    f.write(governor)
            except Exception as e:
                raise RuntimeError(
                    f"When manipulating '{governor_path}' file with 'w' mode"
                ) from e
            time.sleep(0.5)

        if frequency:
            freq_set_path = f"/sys/devices/system/cpu/cpufreq/policy{policy}/scaling_setspeed"
            freq_min_path = f"/sys/devices/system/cpu/cpufreq/policy{policy}/scaling_min_freq"
            freq_max_path = f"/sys/devices/system/cpu/cpufreq/policy{policy}/scaling_max_freq"
            if os.path.exists(freq_set_path):
                try:
                    with open(freq_set_path, 'w') as f:
                        f.write(str(frequency))
                except Exception as e:
                    raise RuntimeError(
                        f"When manipulating '{freq_set_path}' file with 'w' mode"
                    ) from e
            if os.path.exists(freq_min_path):
                try:
                    with open(freq_min_path, 'w') as f:
                        f.write(str(frequency))
                except Exception as e:
                    raise RuntimeError(
                        f"When manipulating '{freq_min_path}' file with 'w' mode"
                    ) from e
            if os.path.exists(freq_max_path):
                try:
                    with open(freq_max_path, 'w') as f:
                        f.write(str(frequency))
                except Exception as e:
                    raise RuntimeError(
                        f"When manipulating '{freq_max_path}' file with 'w' mode"
                    ) from e

def apply_driver_config(cfg):
    """Apply CPUFreq driver configuration."""
    driver = cfg['driver'] if 'driver' in cfg else "acpi"
    reg = re.compile("(intel|amd)_pstate")
    if reg.match(driver):
        status_path = f"/sys/devices/system/cpu/{driver}/status"
        if os.path.exists(status_path):
            try:
                with open(status_path, 'w') as f:
                    f.write("active")
            except Exception as e:
                raise RuntimeError(
                    f"When manipulating '{status_path}' file with 'w' mode"
                ) from e
    elif driver == "acpi":
        # Not sure it even exists on ARM
        status_path_intel = "/sys/devices/system/cpu/intel_pstate/status"
        status_path_amd = "/sys/devices/system/cpu/amd_pstate/status"
        if os.path.exists(status_path_intel):
            try:
                with open(status_path_intel, 'w') as f:
                    f.write("passive")
            except Exception as e:
                raise RuntimeError(
                    f"When manipulating '{status_path_intel}' file with 'w' mode"
                ) from e
        elif os.path.exists(status_path_amd):
            try:
                with open(status_path_amd, 'w') as f:
                    f.write("passive")
            except Exception as e:
                raise RuntimeError(
                    f"When manipulating '{status_path_amd}' file with 'w' mode"
                ) from e
    # Changing the driver creates new files
    call_cpudev_setperms()

def apply_cpuidle_config(cfg):
    """Apply cpuidle configuration."""
    policies = cfg['cpus'] if 'cpus' in cfg else "0"
    idle_states = cfg['idle_states'] if 'idle_states' in cfg else "0"
    disable = cfg['disable'] if 'disable' in cfg else False
    enable = cfg['enable'] if 'enable' in cfg else False

    policy_list = parse_range(policies, "cpuidle")
    state_list = parse_range(idle_states, "states")
    for policy in policy_list:
        for state in state_list:
            state_path = f"/sys/devices/system/cpu/cpu{policy}/cpuidle/state{state}/disable"
            if os.path.exists(state_path):
                try:
                    with open(state_path, 'w') as f:
                        if enable:
                            f.write("0")
                        elif disable:
                            f.write("1")
                except Exception as e:
                    raise RuntimeError(
                        f"When manipulating '{state_path}' file with 'w' mode"
                    ) from e

@click.group(chain=True)
@click.pass_context
def cli(ctx):
    """Command line interface for CPU settings."""
    ctx.ensure_object(dict)
    call_cpudev_setperms()

@cli.command()
@click.option('-g', '--governor', type=str, help="Set CPUFreq governor (e.g., performance, powersave)")
@click.option('-p', '--policies', type=str, default="0", help="Specify CPU policies to target (e.g., '0', '0 9', '0-9', '*')")
@click.option('-f', '--frequency', type=int, help="Set fixed CPU frequency (in kHz)")
@click.option('--config', type=click.Path(exists=True), help="YAML configuration file")
@click.pass_context
def cpufreq(ctx, governor, policies, frequency, config):
    """Set CPUFreq settings for the specified CPU policies. Currently supported: governor, frequency."""
    if config:
        cfg = load_config(config)
        specific_cfg = cfg['cpufreq'] if 'cpufreq' in cfg else []
        for item in specific_cfg:
            apply_cpufreq_config(item)
    else:
        apply_cpufreq_config({'governor': governor, 'policies': policies, 'frequency': frequency})

@cli.command()
@click.argument('driver', type=click.Choice(['intel_pstate', 'amd_pstate', 'acpi']))
@click.pass_context
def driver(ctx, driver):
    """Sets CPUFreq driver to the specified one."""
    apply_driver_config({'driver': driver})

@cli.command()
@click.option('-c', '--cpus', type=str, default="0", help="Specify CPU number to target (e.g., '0', '0 9', '0-9', '*')")
@click.option('-s', '--idle-states', type=str, default="0", help="Specify CPU idle states (e.g., '0', '0 9', '0-9', '*')")
@click.option('--disable', is_flag=True, help="Disable the specified CPU idle states")
@click.option('--enable', is_flag=True, help="Enable the specified CPU idle states")
@click.option('--config', type=click.Path(exists=True), help="YAML configuration file")
@click.pass_context
def cpuidle(ctx, cpus, idle_states, disable, enable, config):
    """Enable or disable specific CPU idle states for specified CPUs."""
    if config:
        cfg = load_config(config)
        specific_cfg = cfg['cpuidle'] if 'cpuidle' in cfg else []
        for item in specific_cfg:
            apply_cpuidle_config(item)
    else:
        apply_cpuidle_config({'cpus': cpus, 'idle_states': idle_states, 'disable': disable, 'enable': enable})

@cli.command()
@click.argument('config', type=click.Path(exists=True))
@click.pass_context
def apply(ctx, config):
    """Apply configurations from a YAML file."""
    if config:
        cfg = load_config(config)
        for command in cfg:
            if command == 'cpufreq':
                for item in cfg[command]:
                    apply_cpufreq_config(item)
            elif command == 'driver':
                apply_driver_config({"driver":cfg[command]})
            elif command == 'cpuidle':
                for item in cfg[command]:
                    apply_cpuidle_config(item)

if __name__ == "__main__":
    cli(obj={})
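
To illustrate the input that the apply subcommand walks through, the sketch below writes a configuration as the Python dictionary that yaml.safe_load would be expected to return for an equivalent YAML file. The key names come from the script above. The values, the describe helper, and expand_range (a copy of the non-wildcard path of parse_range) are hypothetical: they only report what would be applied and never touch sysfs.

```python
def expand_range(range_str):
    """Expand '0', '0 9' or '0-9' into a list of integers (no '*' support here)."""
    items = []
    for part in range_str.split():
        if '-' in part:
            start, end = map(int, part.split('-'))
            items.extend(range(start, end + 1))
        else:
            items.append(int(part))
    return items

# Example values only; the keys are the ones read by the script above
EXAMPLE_CONFIG = {
    "driver": "acpi",
    "cpufreq": [
        {"governor": "userspace", "policies": "0-3", "frequency": 2000000},
    ],
    "cpuidle": [
        {"cpus": "0 2", "idle_states": "1-2", "disable": True},
    ],
}

def describe(cfg):
    """List, in file order, the actions the 'apply' subcommand would dispatch."""
    actions = []
    for command in cfg:
        if command == "driver":
            actions.append(f"driver -> {cfg[command]}")
        elif command == "cpufreq":
            for item in cfg[command]:
                policies = expand_range(item.get("policies", "0"))
                actions.append(
                    f"cpufreq policies {policies}: "
                    f"governor={item.get('governor')}, frequency={item.get('frequency')}"
                )
        elif command == "cpuidle":
            for item in cfg[command]:
                cpus = expand_range(item.get("cpus", "0"))
                states = expand_range(item.get("idle_states", "0"))
                # Same precedence as the script: 'enable' wins over 'disable'
                if item.get("enable"):
                    mode = "enable"
                elif item.get("disable"):
                    mode = "disable"
                else:
                    mode = "unchanged"
                actions.append(f"cpuidle cpus {cpus} states {states}: {mode}")
    return actions

if __name__ == "__main__":
    for line in describe(EXAMPLE_CONFIG):
        print(line)
```

Because parse_range calls split() on its argument, range values need to be strings. In a YAML file, a bare 0 would load as an integer, so single indices would likely need quoting (for example "0").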