Hardware Node Settings
This page lists the known hardware and software configurations that users
can adjust on the compute nodes. This is generally not possible on
traditional clusters, which makes Dalek quite unique from this
point of view.
CPU Driver
Dalek compute nodes are configured to allow users to modify the CPU driver
parameters. The only prerequisite is membership in the cpudev group. To be
added to the cpudev group, please contact the system
administrators.
To prevent unexpected behavior, the CPU configuration is automatically reset
when a SLURM job terminates. For this reason, when working with CPU drivers, we
strongly recommend running exclusive jobs (i.e., reserving the entire
node).
To ensure system stability, the cpudev_backup
script is executed at node boot time. This script creates a snapshot of the
/sys/devices/system/cpu hierarchy and stores it in the
/tmp/cpudev_sysfs_backup.txt file. When a job ends, the
cpudev_restore script is automatically
invoked to restore the original CPU configuration.
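As an illustration of the snapshot format described above (one `path=value` line per sysfs file, with `#` comment lines), here is a minimal Python sketch that parses such a snapshot and reports which entries differ from the current values. The function names and the `read_current` callback are hypothetical and are not part of the Dalek tooling; the sketch also assumes each value fits on a single line.

```python
def parse_snapshot(text):
    """Parse 'path=value' lines into a dict, skipping blank and '#' lines."""
    values = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, since a value may itself contain '='.
        path, sep, value = line.partition("=")
        if sep:  # lines without '=' are ignored
            values[path] = value
    return values


def changed_entries(snapshot, read_current):
    """Return {path: (saved, current)} for every entry whose value differs.

    read_current is any callable mapping a path to its current value,
    for instance a function that reads the sysfs file.
    """
    diffs = {}
    for path, saved in snapshot.items():
        current = read_current(path)
        if current != saved:
            diffs[path] = (saved, current)
    return diffs
```

Passing `dict.get` of a hand-written dictionary as `read_current` is enough to try the comparison without touching sysfs.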
To modify CPU drivers (e.g., adjusting frequencies or idle behaviors), we
strongly encourage users to use the cpupower command-line tools.
Normally, cpupower requires sudo privileges. On Dalek, however, users belonging
to the cpudev group are allowed to run it via:
sudo cpupower [parameters]
Please refer to the official cpupower documentation for usage details:
Advanced users may also experiment by directly modifying files under
/sys/devices/system/cpu. To enable this, a dedicated helper binary,
cpudev_setperms, is installed on the nodes.
This tool parses the /sys/devices/system/cpu hierarchy and, for each file:
- changes the ownership from
root:root to root:cpudev,
- grants the
cpudev group the same permissions as the root user.
This allows members of the cpudev group to modify the relevant sysfs entries
directly. The tool must be run on an exclusively reserved node and can be
invoked as follows:
sudo cpudev_setperms
Warning
When modifying files directly under /sys/devices/system/cpu, additional
files may be created (for example, when switching CPU drivers). In such
cases, cpudev_setperms must be run again to update permissions on the
newly created files.
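The "same permissions as the root user" rule applied by cpudev_setperms amounts to copying the owner permission bits into the group bits. A minimal Python sketch of that bit manipulation follows; the function name is hypothetical, and the real tool is the C++ program listed at the end of this page.

```python
import stat


def group_equals_owner(mode):
    """Return mode with the group rwx bits replaced by the owner rwx bits."""
    owner_bits = mode & stat.S_IRWXU             # keep only the owner rwx bits
    # Clear the group bits, then shift the owner bits into the group position.
    return (mode & ~stat.S_IRWXG) | (owner_bits >> 3)
```

For example, a file with mode `0o644` (owner read/write) becomes `0o664`, so cpudev group members can write to it as well.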
Tips
powertop is installed on the nodes and can be run with sudo if you are
in the cpudev group.
cpudev [Advanced]
Info
For standard CPU driver adjustments, we recommend using cpupower, as
described in the previous section, rather than relying on cpudev.
cpudev is a tool designed specifically for Dalek to configure the CPU
device. It lets you change the CPU driver, its governor, the frequency of each
core, idle states, and so on.
You can easily add cpudev to your PATH by doing:
cpudev modifications last only as long as the SLURM job. Moreover, only
exclusive jobs (jobs that reserve an entire compute node) can use it. When a
job terminates, the original CPU parameters are restored to ensure
reproducibility of the experiments on the nodes. This means that cpudev must
be invoked at the beginning of each job.
Tips
A man page is available for this command, as well as reminders using the
--help option with any of the following subcommands.
Info
This utility is available to anyone in the cpudev group. It
wraps the functionality detailed in the Technical
Details section.
Usage
cpudev [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...
cpudev is a command-line tool for configuring CPU settings, including
frequency and idle states. It allows applying configurations via specific
commands or a YAML file. It is closely related to the CPU subsystems CPUFreq
and CPUIdle, which we recommend you be aware of before using this utility
(see the Technical Details section below).
cpudev uses commands that can be chained, applying them sequentially. The
parameters set by former commands will be memorized for the next ones. You can,
for example, use the cpufreq command multiple times in the same call, allowing
you to specify different configurations for different subsets of CPUs.
Subcommands
apply: Apply configurations from a YAML file.
cpufreq: Set CPUFreq settings for specified CPU policies.
cpuidle: Enable or disable specific CPU idle states for specified CPU
sets.
driver: Enable or disable the specified CPUFreq driver.
Below are the options of each subcommand.
cpudev apply PATH
PATH: Path to a YAML configuration file to apply settings from.
cpudev cpufreq [OPTIONS]
-g, --governor TEXT: Set CPUFreq governor (e.g., performance,
powersave).
-p, --policies TEXT: Specify CPU policies to target (e.g., '0', '0 9',
'0-9', '*').
-f, --frequency INTEGER: Set fixed CPU frequency (in kHz). Note that this
option will not work unless the governor is set to userspace and the driver
to acpi. See cpufreq for more details.
--config PATH: YAML configuration file.
cpudev cpuidle [OPTIONS]
-c, --cpus TEXT: Specify CPU numbers to target (e.g., '0', '0 9',
'0-9', '*').
-s, --idle-states TEXT: Specify CPU idle states (e.g., '0', '0 9',
'0-9', '*').
--disable: Disable the specified CPU idle states.
--enable: Enable the specified CPU idle states.
--config PATH: YAML configuration file.
cpudev driver DRIVER_NAME
{intel_pstate|amd_pstate|acpi}: Specify the CPUFreq driver to enable or
disable.
CPUs and Policies Selection
The CPU numbers and policy selector options support a syntax similar to that of
the pdftk cat command. You can assemble a query by concatenating the following
blocks with spaces:
k: Selects the item whose name is suffixed by k. For example, in the
context of CPUs, "8" would select "cpu8".
i-j: Allows you to select a continuous range of values between i and
j. Boundaries are included. Example: "0-5" is equivalent to
"0 1 2 3 4 5".
*: Selects every possible number depending on the context. If used for
cpus, it will look at /sys/devices/system/cpu/cpu[0-9]+ and generate a
range from it. If used for policies, it will do the same with
/sys/devices/system/cpu/cpufreq/policy[0-9]+. Finally, if used for selecting
idle_states, it will look at
/sys/devices/system/cpu/cpu0/cpuidle/state[0-9]+. Note that for the latter
case, it assumes every CPU has the same range of idle states.
No assumptions or verifications are made about the query and the system when
computing it. This means, except for "*", boundaries about the number of items
in a category (cpus, idle_states, etc.) are not known or verified: an error will
be output if an invalid query is made.
Example: "0-3 5 7-8" would select every item between 0 and 8 except
4 and 6.
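The expansion of such a query can be sketched in a few lines of Python. This is a simplified illustration along the lines of the parse_range function listed at the end of this page; it leaves out the "*" case, which needs access to sysfs, and the function name is hypothetical.

```python
def expand_selector(query):
    """Expand a query such as '0-3 5 7-8' into a list of integers."""
    items = []
    for block in query.split():              # blocks are separated by spaces
        if "-" in block:
            first, last = (int(x) for x in block.split("-"))
            items.extend(range(first, last + 1))   # boundaries are included
        else:
            items.append(int(block))
    return items
```

As in the description above, nothing is checked against the machine: `expand_selector("0-3 5 7-8")` simply yields 0, 1, 2, 3, 5, 7 and 8, whether or not those items exist.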
Configuration File
The YAML file should contain sections for cpufreq, cpuidle, and driver
parameters. Multiple list items are supported within these categories,
simulating the fact that commands can be chained. This way, it is possible to
apply different settings to different CPU subsets.
The driver category supports only "intel_pstate", "amd_pstate" or "acpi"
values. The order is important as the file is processed sequentially. For
example, you should place the driver parameter at the top if you want to use
governors from a specific one.
Example:
driver: "acpi"
cpufreq:
  - governor: userspace
    policies: "0-2"
    frequency: "2500000"
  - governor: performance
    policies: "4"
cpuidle:
  - cpus: "0-9"
    idle_states: "1"
    disable: true
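To make the link between the configuration file and chained commands concrete, the following Python sketch turns an already-loaded configuration (a plain dictionary) into the equivalent chained cpudev argument list. It only uses the subcommands and options documented above. The function name is hypothetical, and mapping `disable: false` to `--enable` is an assumption rather than documented behavior.

```python
def config_to_argv(config):
    """Build the chained cpudev argument list equivalent to a config dict."""
    argv = ["cpudev"]
    # Dictionaries keep insertion order, so sections are emitted in file order.
    for section, value in config.items():
        if section == "driver":
            argv += ["driver", str(value)]
        elif section == "cpufreq":
            for entry in value:
                argv.append("cpufreq")
                for key, flag in (("governor", "--governor"),
                                  ("policies", "--policies"),
                                  ("frequency", "--frequency")):
                    if key in entry:
                        argv += [flag, str(entry[key])]
        elif section == "cpuidle":
            for entry in value:
                argv.append("cpuidle")
                for key, flag in (("cpus", "--cpus"),
                                  ("idle_states", "--idle-states")):
                    if key in entry:
                        argv += [flag, str(entry[key])]
                if "disable" in entry:
                    # Assumption: 'disable: false' corresponds to --enable.
                    argv.append("--disable" if entry["disable"] else "--enable")
    return argv
```

Joining the result with `shlex.join` gives a command line that can be compared with what you would have typed by hand.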
Examples
-
Apply a configuration from a YAML file:
cpudev apply my_config.yaml
-
Set the powersave governor for CPU policies 0 to 3:
cpudev cpufreq --governor powersave --policies "0-3"
-
Disable CPU idle state 1 for CPU 0:
cpudev cpuidle --cpus "0" --idle-states "1" --disable
-
Enable the intel_pstate driver:
cpudev driver intel_pstate
-
Show help for applying a configuration:
cpudev apply --help
Technical Details
About Frequency (APERF, MPERF, TSC)
How is it computed?
# On x86
The frequency reported in /proc/cpuinfo and in
/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq is computed by the
kernel with the following formula:
\(\text{BusyMHz} = \frac{\Delta_{\text{APERF}}}{\Delta_{\text{MPERF}}} \times \text{freq}_{\text{base}}\)
On x86 platforms, computing a core's frequency involves three registers (MSRs):
- APERF and MPERF: Individually meaningless, but their ratio
(APERF/MPERF) provides a coefficient to multiply with a base frequency.
According to the Intel Developer's manual, it is also used to compute the
usage proportion of the cores (Volume 3B, 19.17, p.682 and Volume 3B, 16.2,
p.500).
- TSC: A frequency-invariant counter.
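As a worked illustration of the formula above, the Python sketch below computes the effective frequency from two samples of each counter. The function and parameter names are hypothetical, counter wraparound is ignored, and the result is expressed in the same unit as the base frequency passed in.

```python
def busy_frequency(aperf_start, aperf_end, mperf_start, mperf_end, freq_base):
    """Return (delta APERF / delta MPERF) * freq_base over one interval."""
    delta_aperf = aperf_end - aperf_start
    delta_mperf = mperf_end - mperf_start
    if delta_mperf <= 0:
        raise ValueError("MPERF did not advance over the interval")
    # Multiply before dividing to keep the intermediate value exact for ints.
    return delta_aperf * freq_base / delta_mperf
```

With a base frequency of 2400000 kHz, an APERF delta of 3000000 and an MPERF delta of 2000000, the ratio is 1.5 and the effective frequency is 3600000 kHz, i.e. the core ran in the turbo range.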
According to the Linux kernel source code,
the base frequency (\(\text{freq}_{\text{base}}\)) is a fixed value stored in
the cpu_khz variable, which can be retrieved using BPF. This frequency is
calculated using the TSC and represents the frequency of the core at the maximum
non-turbo P-State. The following code snippet allows you to retrieve this
value:
CPU_KHZ_ADDR=$(sudo cat /proc/kallsyms | grep "D cpu_khz" | cut -f1 -d" ") && sudo bpftrace -e "BEGIN { \$cpu_khz_addr = 0x$CPU_KHZ_ADDR ; printf(\"cpu_khz: %d\", *\$cpu_khz_addr); exit();}"
Important
The \(\text{freq}_{\text{base}}\) value must not be confused with the
nominal frequency given by chip manufacturers: these two values are
different.
Linux Kernel source code comment
The scheduler wants to do frequency invariant accounting and needs a
\(<1\) ratio to account for the "current" frequency, corresponding to
\(\text{freq}_{\text{curr}} / \text{freq}_{\text{max}}\). Since the frequency
\(\text{freq}_{\text{curr}}\) on x86 is controlled by micro-controller and our
P-State setting is little more than a request/hint, we need to observe the
effective frequency "BusyMHz", i.e. the average frequency over a time
interval after discarding idle time. This is given by:
\(\text{BusyMHz} = \frac{\Delta_{\text{APERF}}}{\Delta_{\text{MPERF}}} \times \text{freq}_{\text{base}}\)
where \(\text{freq}_{\text{base}}\) is the max non-turbo P-State.
The \(\text{freq}_{\text{max}}\) term has to be set to a somewhat arbitrary
value, because we can't know which turbo states will be available at a given
point in time: it all depends on the thermal headroom of the entire package.
We set it to the turbo level with 4 cores active.
Benchmarks show that's a good compromise between the 1C turbo ratio
(\(\text{freq}_\text{curr} / \text{freq}_\text{max}\) would rarely reach 1) and
something close to \(\text{freq}_\text{base}\), which would ignore the entire
turbo range (a conspicuous part, making
\(\text{freq}_{\text{curr}} / \text{freq}_{\text{max}}\) always maxed
out).
How can I set it? (CPUFreq)
CPUFreq is a Linux kernel software interface that manages DVFS (dynamic
voltage and frequency scaling), a hardware feature in modern chips designed to
reduce power consumption, alongside techniques like clock-gating and
power-gating.
The CPUFreq software stack includes:
| Component | Role |
|---|---|
| Core | Registers CPU cores within the software and assigns them policies. |
| Governor | Decides the optimal frequency based on system metrics. It holds the main algorithm to decide which entity needs to scale up or down its frequency. |
| Policy | Applies the governor's decisions across associated CPUs. |
| Driver | Interfaces with hardware to set and retrieve frequencies. |
Theoretically, any governor can work with any hardware, and different
governors can manage distinct logical CPU subsets.
Each governor can have its own set of parameters that can be changed to
influence its choices (see the schedutil governor for example).
# On Intel
Intel developed its own driver and governor (Intel P-State) for processors
since the Sandy Bridge architecture. It supports Hardware-managed
P-States, enabling automatic frequency adjustments. The driver operates in two
modes:
| Mode | Description |
|---|---|
| Active | User-space frequency control is disabled; only "powersave" and "performance" governors are available. |
| Passive | Behaves like a standard CPUFreq driver, allowing Linux kernel governors. |
Tip
Intel also introduced EPB, a 16-level scale to prioritize performance or
power efficiency, used by Intel P-State for frequency arbitration. We can
modify this bias directly through sysfs. More information on
its dedicated page.
Unfortunately, it is not possible to set a specific frequency while this driver
is in active mode.
To achieve this, follow these steps:
- Switch Intel P-State to passive mode, so that it hands control back to the
generic CPUFreq governors:
echo passive | sudo tee /sys/devices/system/cpu/intel_pstate/status
- Select the core whose frequency you want to set; here it is cpu0.
- Select, for this core (e.g. cpu0), the governor that allows the frequency to
be set manually: userspace
echo "userspace" | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
- Write the target frequency to the following files (here it is 1800000 kHz,
which represents 1.8 GHz):
echo 1800000 | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
echo 1800000 | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
- You can then verify the effects of this process by executing the following
command:
CPU_ID="cpu0" && cat /proc/cpuinfo | grep "cpu MHz" | sed "$((${CPU_ID: -1} + 1))q;d"
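The same procedure can be summarized as a short Python sketch that only builds the list of sysfs writes, so it can be inspected before anything is applied, and that extracts the reported frequency from the text of /proc/cpuinfo. The function names are hypothetical. Like the shell one-liner above, the parser assumes that the n-th "cpu MHz" line belongs to CPU n.

```python
import re

CPU_ROOT = "/sys/devices/system/cpu"


def pin_frequency_writes(cpu, khz):
    """Return the ordered (path, value) writes that pin one CPU to khz."""
    cpufreq = f"{CPU_ROOT}/cpu{cpu}/cpufreq"
    return [
        (f"{CPU_ROOT}/intel_pstate/status", "passive"),   # step 1
        (f"{cpufreq}/scaling_governor", "userspace"),     # step 3
        (f"{cpufreq}/scaling_min_freq", str(khz)),        # step 4
        (f"{cpufreq}/scaling_max_freq", str(khz)),
    ]


def cpu_mhz(cpuinfo_text, cpu):
    """Return the 'cpu MHz' value of the given CPU from /proc/cpuinfo text."""
    values = re.findall(r"^cpu MHz\s*:\s*([0-9.]+)", cpuinfo_text,
                        flags=re.MULTILINE)
    return float(values[cpu])
```

Unlike the `${CPU_ID: -1}` trick, which keeps only the last character of the CPU name, `cpu_mhz` takes an integer index and therefore also works for CPU numbers above 9.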
About Sleeping (CPUIdle)
CPUIdle is the Linux kernel software stack that manages processor idle states.
Unlike CPUFreq, which adjusts frequency, CPUIdle allows a logical CPU to
stop executing instructions and power down parts of its circuitry to save
energy.
The CPUIdle stack shares similarities with CPUFreq: it uses the same governor
and driver concepts, which allows different policies to be applied depending on
the platform or the idle states defined.
A sleep state is defined by two key metrics:
- Target residency: Minimum time the CPU must stay in the state (including the
time needed to enter it) to save more energy than a shallower state would.
- Exit latency: Maximum time between the CPU asking the hardware to enter the
state and the CPU executing its first instruction after waking up from it.
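To show how these two metrics interact, here is a deliberately simplified Python sketch of a state selection rule: pick the deepest state whose target residency fits the expected idle duration and whose exit latency fits a latency limit. This is not the algorithm of any actual kernel governor; the names and the numeric values in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IdleState:
    name: str
    target_residency_us: int
    exit_latency_us: int


def pick_state(states, predicted_idle_us, latency_limit_us):
    """Return the deepest state that satisfies both constraints.

    states must be ordered from shallowest to deepest; the first one is
    used as a fallback when no deeper state qualifies.
    """
    chosen = states[0]
    for state in states[1:]:
        if (state.target_residency_us <= predicted_idle_us
                and state.exit_latency_us <= latency_limit_us):
            chosen = state
    return chosen
```

With three made-up states, POLL (0 µs, 0 µs), C1 (2 µs, 2 µs) and C6 (600 µs, 133 µs), an expected idle period of 100 µs selects C1 because C6 would not stay idle long enough to pay off.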
What a sleep state concretely entails is neither clear nor defined in the
source code (it points to assembly labels coming from compiled drivers). It
depends on the platform on which the states are defined.
Info
When no tasks are available, the kernel schedules a special idle (or
swapper) task, triggering the selected CPUIdle governor to enter a sleep
state. However, scheduler ticks (1–10 ms) prevent deep sleep states.
"Tickless" kernels address this by disabling periodic ticks and waking
only on external interrupts.
Playing with Sleep States
# System Analysis
Commands to inspect CPUIdle configuration:
# Configuration
Commands to modify CPUIdle behavior:
-
Change governor:
echo "teo" | sudo tee /sys/devices/system/cpu/cpuidle/current_governor
-
Enable/Disable C-states:
- Disable a state (e.g., state3 for cpu0):
echo "1" | sudo tee /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
- Re-enable a state:
echo "0" | sudo tee /sys/devices/system/cpu/cpu0/cpuidle/state3/disable
- Disable all deep states (e.g., from C2):
for STATE in /sys/devices/system/cpu/cpu0/cpuidle/state[2-9]/disable; do
echo "1" | sudo tee $STATE;
done
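Before and after changing idle states, it can help to list what the kernel exposes for a CPU. The Python sketch below reads the name, latency, residency and disable attributes of each stateN directory. The function name is hypothetical, and the CPU directory is passed as a parameter so the function can also be tried on a copy of the tree.

```python
from pathlib import Path


def read_idle_states(cpu_dir):
    """Return one dict per idle state found under <cpu_dir>/cpuidle."""
    state_dirs = sorted(Path(cpu_dir, "cpuidle").glob("state[0-9]*"),
                        key=lambda p: int(p.name[len("state"):]))
    states = []
    for state_dir in state_dirs:
        def attr(name):
            # Each attribute is a small text file holding a single value.
            return (state_dir / name).read_text().strip()

        states.append({
            "index": int(state_dir.name[len("state"):]),
            "name": attr("name"),
            "latency_us": int(attr("latency")),
            "residency_us": int(attr("residency")),
            "disabled": attr("disable") == "1",
        })
    return states
```

On a node, `read_idle_states("/sys/devices/system/cpu/cpu0")` returns the states of cpu0 in index order, which makes it easy to check that a disable command had the intended effect.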
CPU Driver Source Codes
Backup Script
| /usr/local/sbin/cpudev_backup |
|---|
| #!/bin/bash
# CPU sysfs path and save path
BASE_PATH="/sys/devices/system/cpu"
#BACKUP_FILE="/tmp/cpu_sysfs_backup_$(date +%Y%m%d_%H%M%S).txt"
BACKUP_FILE="/tmp/cpudev_sysfs_backup.txt"
# Files list to exclude
EXCLUDE_FILES=("uevent" "modalias" "subsystem" "device" "scaling_setspeed")
# Create the save file
echo "Path of sysfs CPU parameters save: $BACKUP_FILE"
echo "# sysfs CPU parameters - $(date)" > "$BACKUP_FILE"
echo "# Format: path=value" >> "$BACKUP_FILE"
# Build the command line with file exclusions
exclude_args=()
for file in "${EXCLUDE_FILES[@]}"; do
exclude_args+=(! -name "$file")
done
# Browse files, exclude those in the list and check owner permissions
find "$BASE_PATH" -type f -perm -u=w "${exclude_args[@]}" 2>/dev/null | while read -r file; do
value=$(cat "$file" 2>/dev/null)
if [ $? -eq 0 ]; then
echo "$file=$value" >> "$BACKUP_FILE"
fi
done
echo "Save complete."
|
Restore Script
| /usr/local/sbin/cpudev_restore |
|---|
| #!/bin/bash
# Put back the root group to prevent abusive uses later
echo "Restore root group in /sys/devices/system/cpu/ directory"
chgrp -R root /sys/devices/system/cpu/*
# Check that the save file is given, if not stop here
if [ $# -ne 1 ]; then
exit 0
fi
BACKUP_FILE="$1"
if [ ! -f "$BACKUP_FILE" ]; then
echo "Error: '$BACKUP_FILE' does not exist."
exit 1
fi
# Define paths priority order
PRIORITY_PATHS=(
"intel_pstate" # first those in this sub-folder
"cpufreq" # then those in cpufreq/
)
echo "Restore sysfs CPU parameters from $BACKUP_FILE"
# Read the backup file and sort the lines according to priority order
declare -A file_values
declare -A b_file_values # to check
declare -A i_file_values # for a new attempt at ignored files
while IFS= read -r line; do
# ignore comments
if [[ "$line" == \#* ]]; then
continue
fi
file=$(echo "$line" | cut -d'=' -f1)
value=$(echo "$line" | cut -d'=' -f2-)
file_values["$file"]="$value"
b_file_values["$file"]="$value"
done < "$BACKUP_FILE"
# Function to restore a file if necessary
restore_file() {
local file="$1"
local target_value="$2"
if [ -n "$target_value" ]; then
if [ -w "$file" ]; then
current_value=$(cat "$file" 2>/dev/null)
if [ "$current_value" != "$target_value" ]; then
echo "$target_value" | sudo tee "$file" > /dev/null
echo "Restored: $file = $target_value"
else
echo "Up to date: $file ($target_value)"
fi
else
i_file_values["$file"]="$target_value"
echo "Ignored (writing failed): $file"
fi
fi
}
# Restore according to priority order
for pattern in "${PRIORITY_PATHS[@]}"; do
echo "--- Restoring files in $pattern/ ---"
for file in "${!file_values[@]}"; do
if [[ "$file" == *"$pattern"* ]]; then
restore_file "$file" "${file_values[$file]}"
unset file_values["$file"] # to avoid to be processed two times
fi
done
done
echo "--- Restoring the remaining files ---"
for file in "${!file_values[@]}"; do
restore_file "$file" "${file_values[$file]}"
done
# New attempt for ignored files
echo "--- Restoring previously ignored files ---"
for file in "${!i_file_values[@]}"; do
restore_file "$file" "${i_file_values[$file]}"
unset i_file_values["$file"] # to avoid infinite looping
done
# Verify that the sysfs tree has been properly restored
for file in "${!b_file_values[@]}"; do
v=$(cat "$file" 2>/dev/null)
t="${b_file_values[$file]}"
if [[ "$v" != "$t" ]]; then
echo "Wrong restoration of the sysfs tree. '$file' should have the value '$t', but it has the value '$v'."
exit 2
fi
done
echo "Restoration complete."
|
Set Permissions
| /usr/local/sbin/cpudev_setperms |
|---|
| // g++ -std=c++17 -O2 -Wall -o cpudev_setperms cpudev_setperms.cpp
#include <iostream>
#include <string>
#include <vector>
#include <cstring>
#include <cstdlib> // getenv
#include <cctype> // isspace
#include <regex>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/stat.h>
#include <pwd.h>
#include <grp.h>
#include <ftw.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <sstream>
#include <syslog.h>
static const std::vector<std::string> g_local_users = {"administrator", "powerstate", "prober"};
static const std::string g_slurm_path = "/opt/slurm/bin/";
static const std::string g_cpudev_path = "/sys/devices/system/cpu/";
static const std::string g_cpudev_group = "cpudev";
// ================ Get current username =======================================
std::string get_sudo_invoker() {
// First, check if the program was run via sudo
const char* sudoUser = getenv("SUDO_USER");
if (sudoUser && *sudoUser) {
return std::string(sudoUser);
}
// Fallback: use the real UID of the process
uid_t uid = getuid();
struct passwd* pw = getpwuid(uid);
if (pw && pw->pw_name) {
return std::string(pw->pw_name);
}
// Final fallback
return "unknown";
}
// ================ Run external binary safely and capture its stdout ==========
std::string run_program_capture(const std::string &prog, const std::vector<std::string> &args) {
int pipefd[2];
if (pipe(pipefd) == -1) {
syslog(LOG_ERR, "pipe() failed: %s", strerror(errno));
return "";
}
pid_t pid = fork();
if (pid < 0) {
syslog(LOG_ERR, "fork() failed: %s", strerror(errno));
close(pipefd[0]); close(pipefd[1]);
return "";
}
if (pid == 0) {
// child
dup2(pipefd[1], STDOUT_FILENO);
close(pipefd[0]);
close(pipefd[1]);
std::vector<char*> argv;
argv.reserve(args.size() + 2);
argv.push_back(const_cast<char*>(prog.c_str()));
for (const auto &a : args)
argv.push_back(const_cast<char*>(a.c_str()));
argv.push_back(nullptr);
execv(prog.c_str(), argv.data());
_exit(127);
}
// parent
close(pipefd[1]);
std::string out;
char buf[512];
ssize_t n;
while ((n = read(pipefd[0], buf, sizeof(buf))) > 0) {
out.append(buf, buf + n);
}
close(pipefd[0]);
int status = 0;
waitpid(pid, &status, 0);
if (WIFEXITED(status) && WEXITSTATUS(status) != 0) {
syslog(LOG_WARNING, "Program %s exited with status %d", prog.c_str(), WEXITSTATUS(status));
} else if (WIFSIGNALED(status)) {
syslog(LOG_WARNING, "Program %s terminated by signal %d", prog.c_str(), WTERMSIG(status));
}
while (!out.empty() && (out.back() == '\n' || out.back() == '\r'))
out.pop_back();
return out;
}
// ================ get hostname (safe) ========================================
std::string get_hostname() {
char host[256];
if (gethostname(host, sizeof(host)) == 0) {
return std::string(host);
}
return "";
}
// ================ get Slurm JobID (first token) ==============================
std::string get_slurm_job_id(const std::string &user) {
std::string prog = g_slurm_path + "squeue";
std::string host = get_hostname();
if (host.empty()) return "";
std::vector<std::string> args = {"--noheader", ("--nodelist=" + host), ("--user=" + user), "--Format=JobID"};
std::string out = run_program_capture(prog, args);
if (out.empty()) return "";
size_t pos = out.find('\n');
std::string firstline = (pos == std::string::npos) ? out : out.substr(0, pos);
std::istringstream iss(firstline);
std::string token;
if (!(iss >> token)) return "";
std::regex jobre("^\\d+$");
if (std::regex_match(token, jobre)) return token;
return "";
}
// ================ parse "OverSubscribe=" value from scontrol output ===========
std::string get_over_subscribe_flag(const std::string &job_id) {
std::string prog = g_slurm_path + "scontrol";
std::vector<std::string> args = {"show", "job", job_id};
std::string out = run_program_capture(prog, args);
if (out.empty()) return "";
std::string key = "OverSubscribe=";
size_t p = out.find(key);
if (p == std::string::npos) return "";
p += key.size();
size_t q = p;
while (q < out.size() && !isspace((unsigned char)out[q])) ++q;
return out.substr(p, q - p);
}
// ================ Check if user is local (from hard-coded list) =============
bool is_local_user(const std::string &user) {
for (auto &u : g_local_users) if (u == user) return true;
return false;
}
// ================ Resolve group name to gid =================================
bool lookup_gid(const std::string &groupname, gid_t &out_gid) {
struct group *g = getgrnam(groupname.c_str());
if (!g) {
syslog(LOG_ERR, "getgrnam('%s') failed", groupname.c_str());
return false;
}
out_gid = g->gr_gid;
return true;
}
// Global state for nftw callback
static gid_t g_cpudev_gid = (gid_t)-1;
static bool g_do_chgrp = true;
static bool g_do_chmod_g_eq_u = true;
static int g_change_errors = 0;
// ================ nftw callback ==============================================
int nftw_callback(const char *fpath, const struct stat *sb, int typeflag, struct FTW * /*ftwbuf*/) {
(void)typeflag; // FTW_PHYS used so symlinks won't be followed
// Change group preserving owner
if (g_do_chgrp) {
if (chown(fpath, sb->st_uid, g_cpudev_gid) != 0) {
// log occasionally; avoid extremely noisy logs - increment counter and log sample
++g_change_errors;
if (g_change_errors <= 5) {
syslog(LOG_WARNING, "chown failed on %s: %s", fpath, strerror(errno));
} else if (g_change_errors == 6) {
syslog(LOG_WARNING, "Further chown failures suppressed (many)");
}
}
}
// Set group bits equal to owner bits (g = u)
if (g_do_chmod_g_eq_u) {
mode_t cur = sb->st_mode;
mode_t ubits = (cur & S_IRWXU);
mode_t new_mode = (cur & ~S_IRWXG) | ((ubits >> 3) & S_IRWXG);
if ((cur & S_IRWXG) != (new_mode & S_IRWXG)) {
if (chmod(fpath, new_mode) != 0) {
++g_change_errors;
if (g_change_errors <= 5) {
syslog(LOG_WARNING, "chmod failed on %s: %s", fpath, strerror(errno));
} else if (g_change_errors == 6) {
syslog(LOG_WARNING, "Further chmod failures suppressed (many)");
}
}
}
}
return 0; // continue
}
// ================ Recursively operate on g_cpudev_path safely ===============
bool apply_cpudev_changes() {
if (!lookup_gid(g_cpudev_group.c_str(), g_cpudev_gid)) {
syslog(LOG_ERR, "Group '%s' not found", g_cpudev_group.c_str());
return false;
}
g_change_errors = 0;
// Use at most 20 file descriptors at the same time.
// FTW_PHYS prevents following symlinks (equivalent to chmod -P)
if (nftw(g_cpudev_path.c_str(), nftw_callback, 20, FTW_PHYS) != 0) {
syslog(LOG_ERR, "nftw failed on %s: %s", g_cpudev_path.c_str(), strerror(errno));
return false;
}
if (g_change_errors > 0) {
syslog(LOG_WARNING, "Completed with %d change errors under %s", g_change_errors, g_cpudev_path.c_str());
std::clog << "(WW) Completed with " << g_change_errors
<< " change errors under " << g_cpudev_path << std::endl;
} else {
syslog(LOG_INFO, "Successfully updated ownership and permissions under %s", g_cpudev_path.c_str());
}
return (g_change_errors == 0);
}
// ================ main =======================================================
int main() {
// open syslog
openlog("cpudev_setperms", LOG_PID | LOG_CONS, LOG_DAEMON);
syslog(LOG_INFO, "Program start");
std::string user = get_sudo_invoker();
syslog(LOG_INFO, "Invoked by user: %s", user.c_str());
std::string job_id = get_slurm_job_id(user);
if (!job_id.empty()) {
syslog(LOG_INFO, "Found SLURM JobID %s for user %s", job_id.c_str(), user.c_str());
} else {
syslog(LOG_INFO, "No SLURM JobID found for user %s on this node", user.c_str());
}
std::string over_sub;
if (!job_id.empty()) {
over_sub = get_over_subscribe_flag(job_id);
syslog(LOG_INFO, "OverSubscribe for job %s = '%s'", job_id.c_str(), over_sub.c_str());
}
bool local = is_local_user(user);
if (local) syslog(LOG_INFO, "User %s is in local user list", user.c_str());
if (over_sub == "NO" || local) {
syslog(LOG_INFO, "Proceeding to change ownership/permissions for %s", g_cpudev_path.c_str());
if (!apply_cpudev_changes()) {
syslog(LOG_ERR, "Failed to update ownership/permissions under %s", g_cpudev_path.c_str());
std::cerr << "(EE) Failed to update some ownership/permission entries under " << g_cpudev_path << "\n";
closelog();
return 3;
}
syslog(LOG_INFO, "Completed ownership/permission updates for user %s", user.c_str());
std::cout << "(II) Permissions and ownership updated successfully for user: " << user << "\n";
closelog();
return 0;
} else {
syslog(LOG_WARNING, "Node is NOT exclusively allocated and user is not local: aborting for user %s", user.c_str());
std::cerr << "(EE) This node is NOT exclusively allocated via SLURM, you cannot run this program.\n";
closelog();
return 4;
}
}
|
SPANK Plugin to Invoke Restore Script
| /mnt/nfs/software/slurm/lib/spank/cpudev.so |
|---|
| #include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <stdio.h>
#include <syslog.h>
#include <pwd.h>
#include <slurm/spank.h>
#define LOG
#ifdef LOG
#define OPENLOG(...) openlog(__VA_ARGS__)
#define SYSLOG(...) syslog(__VA_ARGS__)
#define CLOSELOG(...) closelog(__VA_ARGS__)
#else
#define OPENLOG(...)
#define SYSLOG(...)
#define CLOSELOG(...)
#endif
/*
* All spank plugins must define this macro for the Slurm plugin loader.
*/
SPANK_PLUGIN(cpudev, 1);
int slurm_spank_task_exit(spank_t spank, int argc, char *argv[])
{
spank_context_t calling_context = spank_context();
OPENLOG("slurm_spank_cpudev", LOG_CONS | LOG_PID | LOG_NDELAY, LOG_DAEMON);
SYSLOG(LOG_NOTICE, "task_exit - program started by SLURM, UID is %d", getuid());
switch (calling_context)
{
case S_CTX_REMOTE: {
char *nodename = getenv("SLURMD_NODENAME");
if (nodename == NULL)
{
slurm_error("cpudev: getenv of \"SLURMD_NODENAME\" failed!");
SYSLOG(LOG_ERR, "task_exit - getenv of \"SLURMD_NODENAME\" failed!");
}
else
{
const char* path_restore_script = "/usr/local/sbin/cpudev_restore";
if (access(path_restore_script, F_OK) == 0)
{
uid_t uid;
/* spank_err_t rc = */ spank_get_item(spank, S_JOB_UID, &uid);
struct passwd *pws;
pws = getpwuid(uid);
char proberctl_script[2048];
const char* path_backup = "/tmp/cpudev_sysfs_backup.txt";
int is_backup_file;
if ((is_backup_file = access(path_backup, F_OK)) == 0)
{
// changes the group to "root" and restore the CPU sysfs values
snprintf(proberctl_script, sizeof(proberctl_script), "%s %s", path_restore_script, path_backup);
}
else
{
// in this case the script just changes the group to "root"
snprintf(proberctl_script, sizeof(proberctl_script), "%s", path_restore_script);
}
if (system(proberctl_script))
{
slurm_error ("cpudev_restore: system command failed!");
SYSLOG(LOG_ERR, "task_exit - fail to run cpudev_restore script (user=%s,backup=%d)", pws->pw_name, is_backup_file);
}
else
{
SYSLOG(LOG_NOTICE, "task_exit - run cpudev_restore script (user=%s,backup=%d)", pws->pw_name, is_backup_file);
}
}
else
{
SYSLOG(LOG_NOTICE, "task_exit - cpudev_restore script is not installed on this node");
}
}
break;
}
default:
break;
}
CLOSELOG();
return 0;
}
|
cpudev
| /usr/local/sbin/cpudev |
|---|
| #!/usr/bin/python3
import click
import re
import os
import glob
import yaml
import time
import subprocess
def call_cpudev_setperms():
    """
    As this script is intended to be used on the Dalek system, each "important" modification
    made to the /sys/devices/system/cpu sysfs needs its rights to be updated. This function
    serves this purpose.
    """
    try:
        subprocess.run(
            ["sudo", "cpudev_setperms"],
            capture_output=True,
            text=True,
            check=True
        )
    except subprocess.CalledProcessError as e:
        raise RuntimeError(
            f"The following command failed: {e.cmd}\n"
            f"return code: {e.returncode}\n"
            f"stdout:\n{e.stdout}\n"
            f"stderr:\n{e.stderr}"
        ) from e
def parse_range(range_str: str, subsystem: str = "cpufreq") -> list:
    """Parse a string like '0', '0 9', '0-9', or '*' into a list of integers."""
    if range_str == "*":
        if subsystem == "cpufreq":
            return list(range(len(glob.glob("/sys/devices/system/cpu/cpufreq/policy*"))))
        elif subsystem == "cpuidle":
            return list(range(len(glob.glob("/sys/devices/system/cpu/cpu[0-9]*"))))
        elif subsystem == "states":
            return list(range(len(glob.glob("/sys/devices/system/cpu/cpu0/cpuidle/state[0-9]*"))))
    items = []
    for part in range_str.split():
        if '-' in part:
            start, end = map(int, part.split('-'))
            items.extend(range(start, end + 1))
        else:
            items.append(int(part))
    return items
def load_config(config_file):
    """Load configuration from a YAML file."""
    try:
        with open(config_file, 'r') as f:
            return yaml.safe_load(f)
    except Exception as e:
        raise RuntimeError(
            f"When manipulating '{config_file}' file with 'r' mode"
        ) from e
def apply_cpufreq_config(cfg):
    """Apply cpufreq configuration."""
    governor = cfg['governor'] if 'governor' in cfg else None
    policies = cfg['policies'] if 'policies' in cfg else "0"
    frequency = cfg['frequency'] if 'frequency' in cfg else None
    # Check that no P-State driver is in active mode, otherwise raise an error
    status_path_intel = "/sys/devices/system/cpu/intel_pstate/status"
    status_path_amd = "/sys/devices/system/cpu/amd_pstate/status"
    status = "inactive"
    if os.path.exists(status_path_intel):
        with open(status_path_intel, "r") as f:
            status = f.read()
    if os.path.exists(status_path_amd):
        with open(status_path_amd, "r") as f:
            status = f.read()
    # Compare the whole word: a substring test would also match "inactive"
    if status.strip() == "active":
        raise RuntimeError("Driver intel_pstate or amd_pstate is set, impossible to apply cpufreq config. Change it to acpi, you can use the 'driver' subcommand to do so.")
    policy_list = parse_range(policies)
    for policy in policy_list:
        governor_path = f"/sys/devices/system/cpu/cpufreq/policy{policy}/scaling_governor"
        if governor and os.path.exists(governor_path):
            try:
                with open(governor_path, 'w') as f:
                    f.write(governor)
            except Exception as e:
                raise RuntimeError(
                    f"When manipulating '{governor_path}' file with 'w' mode"
                ) from e
        time.sleep(0.5)
        if frequency:
            freq_set_path = f"/sys/devices/system/cpu/cpufreq/policy{policy}/scaling_setspeed"
            freq_min_path = f"/sys/devices/system/cpu/cpufreq/policy{policy}/scaling_min_freq"
            freq_max_path = f"/sys/devices/system/cpu/cpufreq/policy{policy}/scaling_max_freq"
            if os.path.exists(freq_set_path):
                try:
                    with open(freq_set_path, 'w') as f:
                        f.write(str(frequency))
                except Exception as e:
                    raise RuntimeError(
                        f"When manipulating '{freq_set_path}' file with 'w' mode"
                    ) from e
            if os.path.exists(freq_min_path):
                try:
                    with open(freq_min_path, 'w') as f:
                        f.write(str(frequency))
                except Exception as e:
                    raise RuntimeError(
                        f"When manipulating '{freq_min_path}' file with 'w' mode"
                    ) from e
            if os.path.exists(freq_max_path):
                try:
                    with open(freq_max_path, 'w') as f:
                        f.write(str(frequency))
                except Exception as e:
                    raise RuntimeError(
                        f"When manipulating '{freq_max_path}' file with 'w' mode"
                    ) from e


def apply_driver_config(cfg):
    """Apply CPUFreq driver configuration."""
    driver = cfg['driver'] if 'driver' in cfg else "acpi"
    reg = re.compile("(intel|amd)_pstate")
    if reg.match(driver):
        status_path = f"/sys/devices/system/cpu/{driver}/status"
        if os.path.exists(status_path):
            try:
                with open(status_path, 'w') as f:
                    # Switch the pstate driver to active mode
                    f.write("active")
            except Exception as e:
                raise RuntimeError(
                    f"When manipulating '{status_path}' file with 'w' mode"
                ) from e
    elif driver == "acpi":
        # Not sure it even exists on ARM
        status_path_intel = "/sys/devices/system/cpu/intel_pstate/status"
        status_path_amd = "/sys/devices/system/cpu/amd_pstate/status"
        if os.path.exists(status_path_intel):
            try:
                with open(status_path_intel, 'w') as f:
                    f.write("passive")
            except Exception as e:
                raise RuntimeError(
                    f"When manipulating '{status_path_intel}' file with 'w' mode"
                ) from e
        elif os.path.exists(status_path_amd):
            try:
                with open(status_path_amd, 'w') as f:
                    f.write("passive")
            except Exception as e:
                raise RuntimeError(
                    f"When manipulating '{status_path_amd}' file with 'w' mode"
                ) from e
    # Changing the driver creates new files
    call_cpudev_setperms()


def apply_cpuidle_config(cfg):
    """Apply cpuidle configuration."""
    policies = cfg['cpus'] if 'cpus' in cfg else "0"
    idle_states = cfg['idle_states'] if 'idle_states' in cfg else "0"
    disable = cfg['disable'] if 'disable' in cfg else False
    enable = cfg['enable'] if 'enable' in cfg else False
    policy_list = parse_range(policies, "cpuidle")
    state_list = parse_range(idle_states, "states")
    for policy in policy_list:
        for state in state_list:
            state_path = f"/sys/devices/system/cpu/cpu{policy}/cpuidle/state{state}/disable"
            if os.path.exists(state_path):
                try:
                    with open(state_path, 'w') as f:
                        if enable:
                            f.write("0")
                        elif disable:
                            f.write("1")
                except Exception as e:
                    raise RuntimeError(
                        f"When manipulating '{state_path}' file with 'w' mode"
                    ) from e


@click.group(chain=True)
@click.pass_context
def cli(ctx):
    """Command line interface for CPU settings."""
    ctx.ensure_object(dict)
    call_cpudev_setperms()


@cli.command()
@click.option('-g', '--governor', type=str, help="Set CPUFreq governor (e.g., performance, powersave)")
@click.option('-p', '--policies', type=str, default="0", help="Specify CPU policies to target (e.g., '0', '0 9', '0-9', '*')")
@click.option('-f', '--frequency', type=int, help="Set fixed CPU frequency (in kHz)")
@click.option('--config', type=click.Path(exists=True), help="YAML configuration file")
@click.pass_context
def cpufreq(ctx, governor, policies, frequency, config):
    """Set CPUFreq settings for specified CPU policies. Currently supported: governor, frequency."""
    if config:
        cfg = load_config(config)
        specific_cfg = cfg['cpufreq'] if 'cpufreq' in cfg else []
        for item in specific_cfg:
            apply_cpufreq_config(item)
    else:
        apply_cpufreq_config({'governor': governor, 'policies': policies, 'frequency': frequency})


@cli.command()
@click.argument('driver', type=click.Choice(['intel_pstate', 'amd_pstate', 'acpi']))
@click.pass_context
def driver(ctx, driver):
    """Set the CPUFreq driver to the specified one."""
    apply_driver_config({'driver': driver})


@cli.command()
@click.option('-c', '--cpus', type=str, default="0", help="Specify CPU number to target (e.g., '0', '0 9', '0-9', '*')")
@click.option('-s', '--idle-states', type=str, default="0", help="Specify CPU idle states (e.g., '0', '0 9', '0-9', '*')")
@click.option('--disable', is_flag=True, help="Disable the specified CPU idle states")
@click.option('--enable', is_flag=True, help="Enable the specified CPU idle states")
@click.option('--config', type=click.Path(exists=True), help="YAML configuration file")
@click.pass_context
def cpuidle(ctx, cpus, idle_states, disable, enable, config):
    """Enable or disable specific CPU idle states for specified CPUs."""
    if config:
        cfg = load_config(config)
        specific_cfg = cfg['cpuidle'] if 'cpuidle' in cfg else []
        for item in specific_cfg:
            apply_cpuidle_config(item)
    else:
        apply_cpuidle_config({'cpus': cpus, 'idle_states': idle_states, 'disable': disable, 'enable': enable})


@cli.command()
@click.argument('config', type=click.Path(exists=True))
@click.pass_context
def apply(ctx, config):
    """Apply configurations from a YAML file."""
    if config:
        cfg = load_config(config)
        for command in cfg:
            if command == 'cpufreq':
                for item in cfg[command]:
                    apply_cpufreq_config(item)
            elif command == 'driver':
                apply_driver_config({"driver": cfg[command]})
            elif command == 'cpuidle':
                for item in cfg[command]:
                    apply_cpuidle_config(item)


if __name__ == "__main__":
    cli(obj={})
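The apply subcommand above walks a YAML mapping whose keys are driver, cpufreq and cpuidle. As a rough illustration of that layout, the sketch below writes the same structure as a Python dictionary (the shape yaml.safe_load would return) and lists the sysfs files such a configuration would touch, without writing anything. The names EXAMPLE_CONFIG, expand_range and planned_writes are hypothetical and not part of the tool, the values are illustrative only, the '*' wildcard is not handled, and the driver key is deliberately ignored because its target file depends on which pstate directory exists on the node.

```python
# Hypothetical dry-run helper: none of these names exist in the tool itself.

# Same shape yaml.safe_load() would return for a file given to `apply`
# (illustrative values; frequencies are in kHz).
EXAMPLE_CONFIG = {
    "driver": "acpi",
    "cpufreq": [
        {"governor": "userspace", "policies": "0-1", "frequency": 2000000},
    ],
    "cpuidle": [
        {"cpus": "0 2", "idle_states": "1-2", "disable": True},
    ],
}

CPU_ROOT = "/sys/devices/system/cpu"


def expand_range(range_str):
    """Expand '0', '0 9' or '0-9' into a list of ints ('*' is not supported here)."""
    items = []
    for part in str(range_str).split():
        if "-" in part:
            start, end = map(int, part.split("-"))
            items.extend(range(start, end + 1))
        else:
            items.append(int(part))
    return items


def planned_writes(cfg):
    """Return the (path, value) pairs a config would write, without touching sysfs."""
    writes = []
    for item in cfg.get("cpufreq", []):
        for policy in expand_range(item.get("policies", "0")):
            base = f"{CPU_ROOT}/cpufreq/policy{policy}"
            if item.get("governor"):
                writes.append((f"{base}/scaling_governor", item["governor"]))
            if item.get("frequency"):
                # The script pins min, max and setspeed to the same value.
                for name in ("scaling_setspeed", "scaling_min_freq", "scaling_max_freq"):
                    writes.append((f"{base}/{name}", str(item["frequency"])))
    for item in cfg.get("cpuidle", []):
        # As in the script, `enable` takes precedence over `disable`.
        value = "0" if item.get("enable") else "1" if item.get("disable") else None
        if value is None:
            continue
        for cpu in expand_range(item.get("cpus", "0")):
            for state in expand_range(item.get("idle_states", "0")):
                writes.append((f"{CPU_ROOT}/cpu{cpu}/cpuidle/state{state}/disable", value))
    return writes


if __name__ == "__main__":
    for path, value in planned_writes(EXAMPLE_CONFIG):
        print(f"{value} -> {path}")
```

Previewing the writes this way can help check a range expression such as '0-1' or '0 2' before running the real command on an exclusively reserved node.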