Energy Measurement Platform¶
Introduction¶
This page details how to use the hardware platform designed to measure Dalek's energy consumption. The goal is to provide a precise, high-frequency power monitoring system that measures energy usage as close as possible to the power socket. More precisely, the measurements are performed between the DC output of the PSU and the node components, so the energy lost in the 220–240 V AC-to-DC conversion is not measured. As the Dalek nodes are equipped with 1000 W 80 PLUS Platinum PSUs, a 6-8% conversion loss is expected when a node is in use (20-60% load).
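Since NCM measures on the DC side, the AC-side ("at the wall") power can only be estimated. The following is a minimal sketch of that estimate, assuming the 6-8% loss is expressed as a fraction of the AC input power (that is, a PSU efficiency of 92-94%). The function name `estimate_wall_power` and the 300 W reading are hypothetical and only serve as an illustration.

```python
def estimate_wall_power(dc_power_w: float, conversion_loss: float) -> float:
    """Estimate the AC-side power from a DC-side power measurement.

    conversion_loss is the fraction of the AC input power lost in the PSU
    (assumption: 0.06 to 0.08 for the 6-8% loss mentioned above).
    """
    if not 0.0 <= conversion_loss < 1.0:
        raise ValueError("conversion_loss must be in [0, 1)")
    return dc_power_w / (1.0 - conversion_loss)


dc_power_w = 300.0  # hypothetical DC-side reading, in watts
low = estimate_wall_power(dc_power_w, 0.06)
high = estimate_wall_power(dc_power_w, 0.08)
print(f"AC-side estimate: {low:.1f} W to {high:.1f} W")
```

The result is a range rather than a single value, because the actual PSU efficiency depends on the load.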
While commercial solutions do exist, they often lack the flexibility required for research purposes and are not easily adaptable to custom experimental setups. Our proposed solution is a modular, open-source platform - both in hardware and software - tailored to researchers' needs. We also believe this platform can be reused in other contexts, unlocking new use cases and potentially being adapted to other clusters in the future. Additionally, since the platform focuses on socket-level measurements, it complements approaches based on MSRs, such as Intel's RAPL or Nvidia SMI.
The core design is based on separating the platform into two main components:
- a main board, responsible for aggregating the collected samples and communicating the measured data to the node, and
- probes, which measure voltage and current between the power supply and the compute node.
This architecture allows for designing various probe circuits depending on the power supply type used by each node, while maintaining standardized interfaces with the main board. Each compute node is equipped with one main board, and multiple probes can be connected to it.
We named the platform Node-Conso Modular, or NCM for short. The source code of the software and hardware is available here: https://gitlab.lip6.fr/bouyer/node-conso-modular.
Power Connectors¶
The compute nodes in Dalek are powered through several different connectors.
For instance, the az4-n4090 and az4-a7900 nodes have a connector for the
motherboard (the ATX 24-pin connector), a connector for the CPU (the EPS12V
connector) and connectors for the GPU (typically PCIe 8-pin connectors or
the more recent 12VHPWR 16-pin connector). The iml-ia770 nodes, on the other
hand, are built on top of a Minisforum AtomMan X7 Ti mini-PC and an external
GPU. The mini-PC is powered via a 19 V coaxial connector, while the eGPU dock
uses an ATX 24-pin connector (the same as those used for motherboards) and the
GPU itself (Intel Arc A770) is powered via PCIe 8-pin connectors. Finally, the
az5-a890m nodes (based on Minisforum EliteMini AI370 mini-PCs) can be powered
either via a 19 V coaxial connector or via a USB-C connector.
In fact, these connectors can sometimes be further split into several rails. Understanding them is important because they are what NCM measures. The following subsections describe the different connectors and their use cases.
ATX 24-pin - Motherboard Main Power Connector¶
The ATX 24-pin connector provides multiple rails described below. This connector can typically deliver up to 150–200 W.
- +3.3 V rail - 4 pins
    - for RAM (via onboard regulators)
    - for NVMe M.2 SSDs
    - for chipset / PCH
    - for PCIe logic
    - for some I/O controllers
- +5 V rail - 5 pins
    - for USB ports
    - for SATA logic power (control electronics) - there is no SATA component in Dalek's compute nodes
    - for legacy components - there is no legacy component in Dalek's compute nodes
    - note: high-current 5 V loads are mostly gone, but USB still depends on this rail
- +12 V rail - 2 pins
    - to feed onboard DC2DC converters
    - for PCIe slot power (partially, up to 75 W)
    - for the CPU fan and the NVMe M.2 SSD fan
    - note: the CPU and GPU do NOT rely on this connector alone - they also use the EPS12V and PCIe 8-pin connectors for high power
- +5 V standby (5VSB) - 1 pin
    - for power button logic
    - for WoL
    - for USB charging when the PC is off
    - note: always ON when the PSU is connected to AC
EPS12V - Motherboard Power Connector for the CPU¶
This connector is also called the "CPU power connector" or "ATX12V CPU connector".
Its only purpose is to supply dedicated 12 V power to the CPU via the
motherboard's VRMs. It is usually a 4-pin, 8-pin, or 4+4-pin connector. Modern
motherboards, like the ones in Dalek's az4-n4090 and az4-a7900 nodes,
use the 8-pin connector, which can deliver up to 200 W.
Modern CPUs primarily draw power from the EPS12V connector via the VRMs, not directly from the 24-pin connector. In other words, observing this connector should isolate the CPU power consumption very well.
The az4-n4090 and az4-a7900 nodes are equipped with Minisforum BD790i
motherboards, which use this connector. The TDP of the CPU soldered onto these
motherboards (AMD Ryzen 9 7945HX) is 75 W.
PCIe 8-pin - Classic GPU Power Connector¶
The PCIe 8-pin connector is the classic way for GPUs to get supplemental power from the PSU beyond what the motherboard slot provides. It is also called the "6+2 pin PCIe connector" (for PSU compatibility), and it provides additional 12 V power to GPUs that need more than the 75 W available from the PCIe slot. One 8-pin connector can deliver up to 150 W, which is why, depending on the GPU, several connectors can be used to power it. The following Dalek nodes use this connector:
- az4-a7900: AMD Radeon RX 7900 XTX GPU where 3 PCIe 8-pin connectors are used (max. 3x150 W + 75 W = 525 W).
- iml-ia770: Intel Arc A770 GPU where 2 PCIe 8-pin connectors are used (max. 2x150 W + 75 W = 375 W).
Note that the TDP of the AMD Radeon RX 7900 XTX GPU is 355 W and the TDP of the Intel Arc A770 GPU is about 225 W. Clearly, these GPUs will not use the maximum power they could draw from the PCIe 8-pin connectors.
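The power budgets quoted above follow a simple rule: 150 W per PCIe 8-pin connector plus 75 W from the PCIe slot. As a rough sketch (the function and variable names are illustrative, not part of NCM), the budget and the headroom over the TDP can be computed as follows:

```python
PCIE_SLOT_W = 75   # maximum power delivered through the PCIe slot
PCIE_8PIN_W = 150  # maximum power delivered per PCIe 8-pin connector


def max_gpu_power_w(n_8pin: int) -> int:
    """Maximum power available to a GPU fed by n_8pin PCIe 8-pin connectors."""
    return n_8pin * PCIE_8PIN_W + PCIE_SLOT_W


# node partition -> (number of PCIe 8-pin connectors, GPU TDP in watts)
gpus = {"az4-a7900": (3, 355), "iml-ia770": (2, 225)}

for node, (n_8pin, tdp_w) in gpus.items():
    budget_w = max_gpu_power_w(n_8pin)
    print(f"{node}: budget {budget_w} W, TDP {tdp_w} W, headroom {budget_w - tdp_w} W")
```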
12VHPWR - New GPU Power Connector¶
The 12VHPWR connector is the new standard for high-power GPUs, and it is a bit more complex than a traditional PCIe 8-pin connector. It is designed to deliver up to 600 W to modern GPUs through a single connector. Like the PCIe 8-pin connector, it is a 12 V connector, but it has 16 pins (12 for power and 4 for sense). On the PSU side, it typically comes as a new modular cable or as an adapter from 4 traditional PCIe 8-pin connectors (4x150 W = 600 W). What is new is the 4-pin sense interface, which tells the GPU the maximum power it can safely draw, up to 600 W.
This connector is used by the Nvidia RTX 4090 GPUs that can be found in the
az4-n4090 nodes.
Coaxial 19V - Mini-PC and SBC Connector¶
Many mini-PCs are powered via a 19 V coaxial connector. This is the case
for the iml-ia770 and az5-a890m nodes, which rely on Minisforum
AtomMan X7 Ti and EliteMini AI370 mini-PCs, respectively.
USB-C PD - Mini-PC and SBC Connector¶
Many current mini-PCs and SBCs can be powered via USB PD. USB PD 3.1,
through a USB Type-C connector, can deliver up to 240 W. The az5-a890m nodes,
for instance, support this protocol. The USB PD protocol is in charge of
negotiating the voltage and the current.
For the az5-a890m nodes, we preferred to use the legacy 19 V coaxial connector
because it is simpler and more stable for measurements.
NCM Probes¶
One of the innovative aspects of NCM is its adaptability. Compute nodes have many connectors and rails that need to be powered and whose energy consumption can be observed. To address this diversity, three (plus one) specific probes have been designed (each probe is a different electronic circuit):
- p-ATX: the probe to deal with the ATX 24-pin connector (and its rails) and the CPU EPS12V connector
- p-PCIe: the probe to deal with multiple PCIe 8-pin connectors or with the new 12VHPWR connector
- p-UPD+: the probe to deal with the USB PD connector or with the Coaxial 19 V connector
- p-Sen: a probe to get the current temperature and humidity in the cluster (WIP)
p-ATX¶
This probe measures the 4 rails of the ATX 24-pin connector and the CPU EPS12V connector independently, at approximately 1000 samples per second. p-ATX then sends these samples to the main board (via the I2C bus).
The following 5 channels are available:
- [p-ATX::c0] channel 0: +5 V rail (motherboard -> USB power)
- [p-ATX::c1] channel 1: +5 V 5VSB rail (motherboard -> power button, WoL, standby)
- [p-ATX::c2] channel 2: +3.3 V rail (motherboard -> RAM, SSD, chipset)
- [p-ATX::c3] channel 3: +12 V rail (motherboard -> DC2DC converters, PCIe)
- [p-ATX::c4] channel 4: +12 V EPS12V connector (CPU exclusively) (can be left unused, for instance for the iml-ia770 nodes)
p-PCIe¶
This probe measures up to 4 PCIe 8-pin connectors at the same time, or one 12VHPWR connector. p-PCIe always manages 4 channels: with a 12VHPWR connector, the measurement is split across the 4 channels even though there is only a single connector. For each channel, 1000 samples per second are measured and sent to the main board (via the I2C bus).
The following 4 channels are available:
- [p-PCIe::c0] channel 0: +12 V PCIe 8-pin connector or 1/4 of 12VHPWR (GPU)
- [p-PCIe::c1] channel 1: +12 V PCIe 8-pin connector or 2/4 of 12VHPWR (GPU)
- [p-PCIe::c2] channel 2: +12 V PCIe 8-pin connector or 3/4 of 12VHPWR (GPU)
- [p-PCIe::c3] channel 3: +12 V PCIe 8-pin connector or 4/4 of 12VHPWR (GPU)
p-UPD+¶
This probe measures the power from one USB-C connector or from one 19 V coaxial connector. It supports USB PD 3.1 (up to 240 W). p-UPD+ relies on a fast Texas Instruments INA228 digital power monitor. The INA228 is configured to take 4000 samples per second, and p-UPD+ writes 1000 averaged samples per second to the I2C bus of the main board.
As a consequence, p-UPD+ provides a single channel: p-UPD+::c0.
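Going from 4000 raw samples per second to 1000 averaged samples per second corresponds to one output value for every 4 raw samples. The exact averaging scheme used by p-UPD+ is not described here; the sketch below assumes a plain mean over non-overlapping groups of 4 consecutive samples, and the function name `block_average` is illustrative.

```python
def block_average(samples, factor=4):
    """Average consecutive, non-overlapping groups of `factor` samples.

    An incomplete trailing group is dropped.
    Assumption: this mimics a 4000 -> 1000 samples/s reduction (factor 4).
    """
    n_blocks = len(samples) // factor
    return [
        sum(samples[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]


raw = [1.0, 2.0, 3.0, 4.0, 10.0, 10.0, 12.0, 12.0]  # hypothetical raw readings
print(block_average(raw))
```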
p-Sen¶
Work in progress...
NCM Main Board¶
The previous sections described the connectors and the NCM probes. This section
details the main board. Each equipped node has one main board, which
communicates with the node through USB. For now, only one node per partition
is equipped with a main board: az4-n4090-1, az4-a7900-1, iml-ia770-1 and
az5-a890m-1. On the main board, each probe has a unique identifier (Probe ID)
and is chained to one of the two available I2C buses (chain 1 or chain 2).
The following subsections give the binding between the main board and the
probes for each partition.
az4-n4090¶
| I2C chain | Probe Name | Probe ID | Probe Channel | Comments |
|---|---|---|---|---|
| 1 | p-ATX | 0 | 0 | 5 V rail of the motherboard ATX 24-pin (USB power, used to power NCM main board and p-PCIe) |
| 1 | p-ATX | 0 | 1 | 5 V 5VSB rail of the motherboard ATX 24-pin (power button, WoL, standby, used to power NCM p-ATX) |
| 1 | p-ATX | 0 | 2 | 3.3 V rail of the motherboard ATX 24-pin (RAM, SSD, chipset) |
| 1 | p-ATX | 0 | 3 | 12 V rail of the motherboard ATX 24-pin (DC2DC converters, PCIe, fans) |
| 1 | p-ATX | 0 | 4 | 12 V CPU EPS12V connector |
| 2 | p-PCIe | 1 | 0 | 12 V GPU 12VHPWR connector (1/4) |
| 2 | p-PCIe | 1 | 1 | 12 V GPU 12VHPWR connector (2/4) |
| 2 | p-PCIe | 1 | 2 | 12 V GPU 12VHPWR connector (3/4) |
| 2 | p-PCIe | 1 | 3 | 12 V GPU 12VHPWR connector (4/4) |
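The table above can be turned into a lookup from (Probe ID, Probe Channel) to a component, which makes it possible to sum the instantaneous power (voltage times current) per component. This is a minimal sketch: the component labels and the function name are illustrative, and the readings are taken from the sample output shown in the tutorial at the end of this page.

```python
from collections import defaultdict

# (probe ID, channel) -> component, following the az4-n4090 table.
# The labels "motherboard", "cpu" and "gpu" are illustrative, not NCM names.
AZ4_N4090_COMPONENTS = {
    (0, 0): "motherboard",  # 5 V rail
    (0, 1): "motherboard",  # 5VSB rail
    (0, 2): "motherboard",  # 3.3 V rail
    (0, 3): "motherboard",  # 12 V rail (also feeds the PCIe slot)
    (0, 4): "cpu",          # EPS12V connector
    (1, 0): "gpu",          # 12VHPWR (1/4)
    (1, 1): "gpu",          # 12VHPWR (2/4)
    (1, 2): "gpu",          # 12VHPWR (3/4)
    (1, 3): "gpu",          # 12VHPWR (4/4)
}


def power_by_component(readings, components=AZ4_N4090_COMPONENTS):
    """Sum voltage * current per component.

    readings: iterable of (probe_id, channel, volts, amperes) tuples,
    with one reading per channel.
    """
    totals = defaultdict(float)
    for probe_id, channel, volts, amperes in readings:
        totals[components[(probe_id, channel)]] += volts * amperes
    return dict(totals)


readings = [
    (0, 4, 12.100, 3.4833),
    (1, 0, 12.092, 0.9871),
    (1, 1, 12.091, 0.9615),
    (1, 2, 12.093, 0.9374),
    (1, 3, 0.123, 0.0089),
]
totals = power_by_component(readings)
for name, watts in sorted(totals.items()):
    print(f"{name}: {watts:.2f} W")
```

Keep in mind that the "gpu" total only covers the dedicated GPU connector: the GPU can also draw up to 75 W from the PCIe slot, which is fed by the 12 V rail of the ATX 24-pin connector.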
az4-a7900¶
| I2C chain | Probe Name | Probe ID | Probe Channel | Comments |
|---|---|---|---|---|
| 1 | p-ATX | 0 | 0 | 5 V rail of the motherboard ATX 24-pin (USB power, used to power NCM main board and p-PCIe) |
| 1 | p-ATX | 0 | 1 | 5 V 5VSB rail of the motherboard ATX 24-pin (power button, WoL, standby, used to power NCM p-ATX) |
| 1 | p-ATX | 0 | 2 | 3.3 V rail of the motherboard ATX 24-pin (RAM, SSD, chipset) |
| 1 | p-ATX | 0 | 3 | 12 V rail of the motherboard ATX 24-pin (DC2DC converters, PCIe, fans) |
| 1 | p-ATX | 0 | 4 | 12 V CPU EPS12V connector |
| 2 | p-PCIe | 1 | 0 | 12 V GPU first PCIe 8-pin |
| 2 | p-PCIe | 1 | 1 | 12 V GPU second PCIe 8-pin |
| 2 | p-PCIe | 1 | 2 | 12 V GPU third PCIe 8-pin |
| 2 | p-PCIe | 1 | 3 | Unused channel |
iml-ia770¶
| I2C chain | Probe Name | Probe ID | Probe Channel | Comments |
|---|---|---|---|---|
| 1 | p-ATX | 0 | 0 | 5 V rail of the eGPU dock ATX 24-pin (probably used to power NCM p-PCIe, to be confirmed) |
| 1 | p-ATX | 0 | 1 | 5 V 5VSB rail of the eGPU dock ATX 24-pin (power button, standby, used to power NCM p-ATX) |
| 1 | p-ATX | 0 | 2 | 3.3 V rail of the eGPU dock ATX 24-pin (used by the GPU dock, probably for Oculink/USB 4 chipsets) |
| 1 | p-ATX | 0 | 3 | 12 V rail of the eGPU dock ATX 24-pin (DC2DC converters, PCIe) |
| 1 | p-ATX | 0 | 4 | Unused channel |
| 2 | p-PCIe | 1 | 0 | 12 V GPU first PCIe 8-pin |
| 2 | p-PCIe | 1 | 1 | 12 V GPU second PCIe 8-pin |
| 2 | p-PCIe | 1 | 2 | Unused channel |
| 2 | p-PCIe | 1 | 3 | Unused channel |
| 1 | p-UPD+ | 2 | 0 | 19 V coaxial of the mini-PC (used to power NCM main board and p-UPD+) |
az5-a890m¶
| I2C chain | Probe Name | Probe ID | Probe Channel | Comments |
|---|---|---|---|---|
| 1 | p-UPD+ | 0 | 0 | 19 V coaxial of the mini-PC (used to power NCM main board and p-UPD+) |
NCM Software¶
The previous sections described the hardware parts of NCM. This section
focuses on the software part, and more precisely on the node-conso executable,
which displays the measured samples to Dalek users.
Two versions of node-conso are currently installed, as modules, on Dalek:
- ncm/g8b77353: "old" version to use with az5-a890m-1 (the firmware of the main board will be updated soon).
- ncm/gdcc873f: up-to-date version for az4-n4090-1, az4-a7900-1 and iml-ia770-1.
The node-conso binary supports the following command line arguments:
- -P (1|2): turn on power on the I2C chain 1 or 2
- -p (1|2): turn off power on the I2C chain 1 or 2
- -M (1|2): start collecting measures on the I2C chain 1 or 2
- -m (1|2): stop collecting measures on the I2C chain 1 or 2
- -t (seconds): report measures. If seconds is not 0, the program exits after that time.
Tutorial on the az4-n4090-1 Node¶
- Load the NCM module to have node-conso in the PATH (on this node, the module is ncm/gdcc873f).
- Turn on I2C chains 1 and 2 (-P option).
  This is mandatory after each reboot, and it is harmless to do it again when the I2C chains are already turned on.
- Start the measurements on I2C chains 1 and 2 (-M option).
  This step does not output the samples; it only tells the main board to start the measurements on the chained probes (via the I2C bus). NCM has an internal memory that keeps the energy consumed since the last call to node-conso -M <1|2>.
- Output the samples on the terminal (-t option).
  With the -t 1 parameter, the samples are printed in real time for 1 second. To print the samples indefinitely, use -t 0, then send the interrupt signal (Ctrl+C) to stop the program. The command outputs something like:
    # col1 col2 col3 col4 col5 col6
    9953 0.0 0xff 5.061V 1.3907A 64.456J
    9953 0.1 0xff 5.053V -0.0457A 0.812J
    9953 0.2 0xff 3.353V 0.6719A 21.677J
    9953 0.3 0xff 12.070V 0.4519A 59.192J
    9955 0.4 0xff 12.100V 3.4833A 103.992J
    9955 1.0 0xff 12.092V 0.9871A 121.775J
    9955 1.1 0xff 12.091V 0.9615A 119.872J
    9955 1.2 0xff 12.093V 0.9374A 118.136J
    9955 1.3 0xff 0.123V 0.0089A 0.001J
    9957 0.0 0xff 5.061V 1.3839A 64.484J
    9957 0.1 0xff 5.049V -0.0337A 0.812J
    9957 0.2 0xff 3.353V 0.6734A 21.687J
    9957 0.3 0xff 12.094V 0.4774A 59.214J
    9958 1.0 0xff 12.100V 1.0343A 121.812J
    9958 1.1 0xff 12.103V 1.0010A 119.908J
    9958 1.2 0xff 12.107V 0.9759A 118.171J
    9958 1.3 0xff 0.119V -0.0237A 0.001J
    9960 0.4 0xff 12.080V 3.3875A 104.148J
    9961 1.0 0xff 12.113V 1.2181A 121.852J
    9961 1.1 0xff 12.133V 1.1303A 119.946J
    9961 1.2 0xff 12.147V 1.0778A 118.207J
    9961 1.3 0xff 0.121V 0.0168A 0.001J
    9962 0.0 0xff 5.053V 1.3432A 64.519J
    9962 0.1 0xff 5.040V 0.0251A 0.813J
    9962 0.2 0xff 3.346V 0.6991A 21.698J
    9962 0.3 0xff 12.053V 0.5464A 59.245J
    9964 0.4 0xff 12.035V 3.4698A 104.300J
    9964 1.0 0xff 12.109V 1.0451A 121.886J
    9964 1.1 0xff 12.113V 1.0247A 119.983J
    9964 1.2 0xff 12.119V 0.9759A 118.243J
    9964 1.3 0xff 0.124V -0.0355A 0.001J
    9966 0.0 0xff 5.060V 1.3688A 64.554J
    9966 0.1 0xff 5.055V -0.0043A 0.814J
    9966 0.2 0xff 3.351V 0.6727A 21.709J
    9966 0.3 0xff 12.104V 0.4504A 59.273J
    9967 1.0 0xff 12.100V 0.9910A 121.922J
    9967 1.1 0xff 12.098V 0.9812A 120.018J
    9967 1.2 0xff 12.104V 0.9591A 118.278J
    9967 1.3 0xff 0.124V 0.0158A 0.001J

  where each line corresponds to a sample and:
  - col1: a timestamp since the last call to node-conso -M <1|2>
  - col2: the first number is the Probe ID and the second one is the channel of the probe. For instance, on the az4-n4090-1 node, 0.4 means Probe ID 0 (thus p-ATX) and channel 4, which is the CPU EPS12V connector
  - col3: not documented yet
  - col4: the measured voltage in volts
  - col5: the measured current in amperes
  - col6: the energy consumed, in joules, since the last call to node-conso -M <1|2>
- Stop the measurements (-m option).
Et voilà, that's it! You are ready to enjoy NCM on Dalek, simple and smooth :-)!
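To post-process the samples, the output format described in the tutorial can be parsed with a few lines of Python. The sketch below is not part of NCM: the names `Sample` and `parse_line` are illustrative, the timestamp is assumed to be an integer (as in the sample output), and col3 is kept as raw text since it is not documented yet.

```python
from typing import NamedTuple, Optional


class Sample(NamedTuple):
    timestamp: int    # col1 (unit not documented here)
    probe_id: int     # col2, before the dot
    channel: int      # col2, after the dot
    col3: str         # col3, undocumented, kept as raw text
    voltage_v: float  # col4
    current_a: float  # col5
    energy_j: float   # col6

    @property
    def power_w(self) -> float:
        """Instantaneous power, computed as voltage times current."""
        return self.voltage_v * self.current_a


def parse_line(line: str) -> Optional[Sample]:
    """Parse one line of node-conso output; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    ts, ident, col3, volts, amperes, joules = line.split()
    probe_id, channel = ident.split(".")
    return Sample(
        int(ts), int(probe_id), int(channel), col3,
        float(volts.rstrip("V")),
        float(amperes.rstrip("A")),
        float(joules.rstrip("J")),
    )


sample = parse_line("9955 0.4 0xff 12.100V 3.4833A 103.992J")
print(sample.probe_id, sample.channel, f"{sample.power_w:.2f} W")
```

On the az4-n4090-1 node, this line comes from channel 4 of p-ATX, so the computed power is the one drawn through the CPU EPS12V connector at that instant.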