Node-Conso Modular

This page describes how to operate NCM, the hardware platform developed to measure Dalek’s energy consumption. Its purpose is to provide a precise, high-frequency power monitoring system that captures energy usage between the PSU’s DC output and the node components. Note that these measurements do not reflect total energy consumption, as the PSU itself also consumes power during the AC-to-DC conversion process.

Commercial solutions exist, but they often lack the flexibility required for research and are not easily adapted to custom experimental setups. Our solution is a modular platform, open source in both hardware and software, tailored to researchers' needs. We also believe this platform can be reused in other contexts, unlocking new use cases and potentially being adapted to other clusters in the future. Additionally, since the platform measures power at the connector level, it complements software-readable counters such as Intel's RAPL (exposed through MSRs) or NVIDIA SMI.

The core design is based on separating the platform into two main components:

  1. a mainboard, responsible for aggregating the collected samples and communicating the measured data to the node, and
  2. probes, which measure voltage and current between the power supply and the compute node.

This architecture makes it possible to design different probe circuits depending on the power supply type used by each node, while keeping standardized interfaces with the mainboard. Each compute node is equipped with one mainboard, and multiple probes can be connected to it.

We named the platform Node-Conso Modular, or NCM for short. The source code of the software and hardware is available here: https://gitlab.lip6.fr/bouyer/node-conso-modular.

NCM Probes

One of the innovative aspects of NCM is its adaptability. Compute nodes have many connectors and rails that need to be powered and where energy consumption can be observed. To address this diversity, three (plus one) specific probes have been designed (each probe is a different electronic circuit):

p-ATX

This probe measures the 4 rails of the ATX 24-pin connector and the CPU EPS12V connector independently, at approximately 1000 samples per second. p-ATX then sends these samples to the mainboard (via the I2C bus).

The following 5 channels are available:

  • [p-ATX::c0] channel 0: +5 V rail (motherboard -> USB power)
  • [p-ATX::c1] channel 1: +5 V 5VSB rail (motherboard -> power button, WoL, standby)
  • [p-ATX::c2] channel 2: +3.3 V rail (motherboard -> RAM, SSD, chipset)
  • [p-ATX::c3] channel 3: +12 V rail (motherboard -> DC2DC converters, PCIe)
  • [p-ATX::c4] channel 4: +12 V EPS12V connector (CPU exclusively) (can be left unused, for instance for the iml-ia770 nodes)

p-PCIe

This probe measures up to 4 PCIe 8-pin connectors, or one 12VHPWR connector, at the same time. p-PCIe always manages 4 channels: a 12VHPWR connector is a single connector, but its power is split across the 4 channels. For each channel, 1000 samples per second are measured and sent to the mainboard (via the I2C bus).

The following 4 channels are available:

  • [p-PCIe::c0] channel 0: +12 V, first PCIe 8-pin connector or 1/4 of the 12VHPWR connector (GPU)
  • [p-PCIe::c1] channel 1: +12 V, second PCIe 8-pin connector or 2/4 of the 12VHPWR connector (GPU)
  • [p-PCIe::c2] channel 2: +12 V, third PCIe 8-pin connector or 3/4 of the 12VHPWR connector (GPU)
  • [p-PCIe::c3] channel 3: +12 V, fourth PCIe 8-pin connector or 4/4 of the 12VHPWR connector (GPU)
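Because a GPU's supply is spread over several p-PCIe channels, its instantaneous power is the sum of voltage times current over the channels. The following is a minimal Python sketch of that arithmetic, not part of NCM: the function name `total_power_w` is hypothetical, and the four readings are copied from the sample output shown in the tutorial on this page (Probe ID 1, channels 0 to 3, same timestamp).

```python
def total_power_w(readings):
    """Sum instantaneous power in watts over (voltage_v, current_a) pairs."""
    return sum(voltage * current for voltage, current in readings)


# One (volts, amperes) reading per p-PCIe channel, c0 to c3.
pcie_readings = [
    (12.092, 0.9871),
    (12.091, 0.9615),
    (12.093, 0.9374),
    (0.123, 0.0089),
]

gpu_power_w = total_power_w(pcie_readings)
print(f"GPU connector power: {gpu_power_w:.2f} W")
```

The same sum applies to the five p-ATX channels when the power drawn through the motherboard connectors is of interest.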

p-small

This probe measures the power from one USB-C connector or from one 19 V coaxial connector. It supports USB PD 3.1 (up to 240 W). p-small relies on a fast Texas Instruments INA228 digital power monitor. The INA228 is configured to measure 4000 samples per second, and p-small writes 1000 averaged samples per second on the I2C bus of the mainboard.
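Going from 4000 measured samples to 1000 reported samples per second corresponds to averaging groups of 4 consecutive samples. The sketch below only illustrates that 4:1 ratio in Python. This page does not say whether the averaging is done by the INA228 itself or by the probe firmware, and the function name `block_average` is hypothetical.

```python
def block_average(samples, factor=4):
    """Average consecutive groups of `factor` samples.

    A trailing group with fewer than `factor` samples is dropped.
    """
    usable = len(samples) - len(samples) % factor
    return [
        sum(samples[start:start + factor]) / factor
        for start in range(0, usable, factor)
    ]


# Nine raw readings: two complete groups of four, one leftover sample.
raw = [1.0, 2.0, 3.0, 4.0, 10.0, 10.0, 10.0, 10.0, 7.0]
averaged = block_average(raw)
print(averaged)
```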

As a consequence, p-small provides a single channel: p-small::c0.

p-temp

Work in progress...

NCM Mainboard

The previous sections presented the connectors and the NCM probes. This section describes the mainboard. There is one mainboard per node, and it communicates with the node through USB. For now, we installed only one mainboard per partition. The following nodes are equipped: az4-n4090-[0-3], az4-a7900-[0-3], iml-ia770-1 and az5-a890m-1. On the mainboard, each probe has a unique identifier (Probe ID) and is chained to one of the two available I2C buses (chain 1 or chain 2). The following subsections give the binding between the mainboard and the probes for each partition.

az4-n4090

I2C chain | Probe Name | Probe ID | Probe Channel | Comments
--------- | ---------- | -------- | ------------- | --------
1 | p-ATX | 0 | 0 | 5 V rail of the motherboard ATX 24-pin (USB power, used to power NCM mainboard)
1 | p-ATX | 0 | 1 | 5 V 5VSB rail of the motherboard ATX 24-pin (power button, WoL, standby, used to power NCM p-ATX)
1 | p-ATX | 0 | 2 | 3.3 V rail of the motherboard ATX 24-pin (RAM, SSD, chipset)
1 | p-ATX | 0 | 3 | 12 V rail of the motherboard ATX 24-pin (DC2DC converters, PCIe, fans)
1 | p-ATX | 0 | 4 | 12 V CPU EPS12V connector
2 | p-PCIe | 1 | 0 | 12 V GPU 12VHPWR connector (1/4)
2 | p-PCIe | 1 | 1 | 12 V GPU 12VHPWR connector (2/4)
2 | p-PCIe | 1 | 2 | 12 V GPU 12VHPWR connector (3/4)
2 | p-PCIe | 1 | 3 | 12 V GPU 12VHPWR connector (4/4)

Warning

Be aware that the Probe IDs are sometimes swapped. This is easy to detect because p-ATX and p-PCIe do not have the same number of channels. We are currently working on this issue.
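One way to detect swapped Probe IDs is to count the distinct channels reported for each Probe ID and compare the count with the channel counts given in the probe descriptions (5 for p-ATX, 4 for p-PCIe, 1 for p-small). The Python sketch below assumes the input is the list of col2 values from the node-conso output described in the tutorial on this page (Probe ID, a dot, then the channel), and that unused channels are still reported. The names `CHANNELS_TO_PROBE` and `identify_probes` are hypothetical and not part of NCM.

```python
from collections import defaultdict

# Channel counts taken from the probe descriptions on this page.
CHANNELS_TO_PROBE = {5: "p-ATX", 4: "p-PCIe", 1: "p-small"}


def identify_probes(col2_values):
    """Guess the probe behind each Probe ID from how many channels it reports."""
    channels = defaultdict(set)
    for value in col2_values:
        probe_id, channel = value.split(".")
        channels[int(probe_id)].add(int(channel))
    return {
        probe_id: CHANNELS_TO_PROBE.get(len(seen), "unknown")
        for probe_id, seen in channels.items()
    }


# col2 values as they appear in the sample output for az4-n4090-1.
observed = ["0.0", "0.1", "0.2", "0.3", "0.4", "1.0", "1.1", "1.2", "1.3"]
probes = identify_probes(observed)
print(probes)
```

If Probe ID 0 comes back as p-PCIe and Probe ID 1 as p-ATX, the IDs are swapped relative to the table above.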

az4-a7900

I2C chain | Probe Name | Probe ID | Probe Channel | Comments
--------- | ---------- | -------- | ------------- | --------
1 | p-ATX | 0 | 0 | 5 V rail of the motherboard ATX 24-pin (USB power, used to power NCM mainboard)
1 | p-ATX | 0 | 1 | 5 V 5VSB rail of the motherboard ATX 24-pin (power button, WoL, standby, used to power NCM p-ATX)
1 | p-ATX | 0 | 2 | 3.3 V rail of the motherboard ATX 24-pin (RAM, SSD, chipset)
1 | p-ATX | 0 | 3 | 12 V rail of the motherboard ATX 24-pin (DC2DC converters, PCIe, fans)
1 | p-ATX | 0 | 4 | 12 V CPU EPS12V connector
2 | p-PCIe | 1 | 0 | 12 V GPU first PCIe 8-pin
2 | p-PCIe | 1 | 1 | 12 V GPU second PCIe 8-pin
2 | p-PCIe | 1 | 2 | 12 V GPU third PCIe 8-pin
2 | p-PCIe | 1 | 3 | Unused channel

Warning

Be aware that the Probe IDs are sometimes swapped. This is easy to detect because p-ATX and p-PCIe do not have the same number of channels. We are currently working on this issue.

iml-ia770

I2C chain | Probe Name | Probe ID | Probe Channel | Comments
--------- | ---------- | -------- | ------------- | --------
1 | p-ATX | 0 | 0 | 5 V rail of the eGPU dock ATX 24-pin (USB power, used to power NCM mainboard)
1 | p-ATX | 0 | 1 | 5 V 5VSB rail of the eGPU dock ATX 24-pin (power button, standby, used to power NCM p-ATX)
1 | p-ATX | 0 | 2 | 3.3 V rail of the eGPU dock ATX 24-pin (used by the GPU dock, probably for Oculink/USB 4 chipsets)
1 | p-ATX | 0 | 3 | 12 V rail of the eGPU dock ATX 24-pin (DC2DC converters, PCIe)
1 | p-ATX | 0 | 4 | Unused channel
2 | p-PCIe | 1 | 0 | 12 V GPU first PCIe 8-pin
2 | p-PCIe | 1 | 1 | 12 V GPU second PCIe 8-pin
2 | p-PCIe | 1 | 2 | Unused channel
2 | p-PCIe | 1 | 3 | Unused channel
1 | p-small | 2 | 0 | 19 V coaxial of the mini-PC (used to power NCM mainboard and p-small)

Warning

Be aware that the Probe IDs are sometimes swapped. This is easy to detect because p-ATX, p-PCIe and p-small do not have the same number of channels. We are currently working on this issue.

az5-a890m

I2C chain | Probe Name | Probe ID | Probe Channel | Comments
--------- | ---------- | -------- | ------------- | --------
2 | p-small | 0 | 0 | 19 V coaxial of the mini-PC (used to power NCM mainboard and p-small)

NCM Software

The previous sections described the hardware parts of NCM. This section focuses on the software, and more precisely on the node-conso executable, which displays the measured samples to Dalek users.

The node-conso binary supports the following command line arguments:

  • -P (1|2): turn on power on the I2C chain 1 or 2
  • -p (1|2): turn off power on the I2C chain 1 or 2
  • -M (1|2): start collecting measurements on the I2C chain 1 or 2
  • -m (1|2): stop collecting measurements on the I2C chain 1 or 2
  • -t (seconds): report the measurements. If seconds is not 0, the program exits after that many seconds.
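When node-conso is driven from a script, the flags above can be wrapped in a small helper. The Python sketch below is not part of NCM: the names `FLAGS`, `ncm_argv` and `run_ncm` are hypothetical, only the four documented flags -P, -p, -M and -m are used, and `run_ncm` assumes node-conso is in the PATH (for instance after loading the NCM module).

```python
import subprocess

# Documented flags: -P/-p power a chain on/off, -M/-m start/stop measuring.
FLAGS = {"power_on": "-P", "power_off": "-p", "start": "-M", "stop": "-m"}


def ncm_argv(action, chain):
    """Build the node-conso argument list for one action on one I2C chain."""
    if action not in FLAGS:
        raise ValueError(f"unknown action: {action}")
    if chain not in (1, 2):
        raise ValueError("I2C chain must be 1 or 2")
    return ["node-conso", FLAGS[action], str(chain)]


def run_ncm(action, chain):
    """Run node-conso for one action; raises if the command fails."""
    subprocess.run(ncm_argv(action, chain), check=True)


# Building the command does not require the hardware:
print(ncm_argv("power_on", 1))
```

On an equipped node, `run_ncm("power_on", 1)` followed by `run_ncm("start", 1)` would mirror the first steps of the tutorial for chain 1.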

Tutorial on the az4-n4090-1 Node

  1. Load the NCM module to have node-conso in the PATH:

    module load ncm/1.1.2
    

  2. Turn on I2C chains 1 and 2

    node-conso -P 1
    node-conso -P 2
    
    This must be done after each reboot. It is safe to run these commands again even if the I2C chains are already turned on.

  3. Start the measurements on I2C chains 1 and 2

    node-conso -M 1
    node-conso -M 2
    
    This step does not output the samples. It only tells the mainboard to start the measurements on the chained probes (via the I2C bus). NCM has an internal memory that keeps the energy consumed since the last call to node-conso -M <1|2>.

  4. Output the samples on the terminal

    node-conso -t 1
    
    This prints the samples in real time for 1 second (-t 1 parameter). To print the samples indefinitely, use -t 0, then send the interrupt signal (ctrl+c) to stop the program.

    The previous command will output something like:

    # col1 col2 col3    col4     col5     col6
      9953  0.0 0xff  5.061V  1.3907A  64.456J
      9953  0.1 0xff  5.053V -0.0457A   0.812J
      9953  0.2 0xff  3.353V  0.6719A  21.677J
      9953  0.3 0xff 12.070V  0.4519A  59.192J
      9955  0.4 0xff 12.100V  3.4833A 103.992J
      9955  1.0 0xff 12.092V  0.9871A 121.775J
      9955  1.1 0xff 12.091V  0.9615A 119.872J
      9955  1.2 0xff 12.093V  0.9374A 118.136J
      9955  1.3 0xff  0.123V  0.0089A   0.001J
      9957  0.0 0xff  5.061V  1.3839A  64.484J
      9957  0.1 0xff  5.049V -0.0337A   0.812J
      9957  0.2 0xff  3.353V  0.6734A  21.687J
      9957  0.3 0xff 12.094V  0.4774A  59.214J
      9958  1.0 0xff 12.100V  1.0343A 121.812J
      9958  1.1 0xff 12.103V  1.0010A 119.908J
      9958  1.2 0xff 12.107V  0.9759A 118.171J
      9958  1.3 0xff  0.119V -0.0237A   0.001J
      9960  0.4 0xff 12.080V  3.3875A 104.148J
      9961  1.0 0xff 12.113V  1.2181A 121.852J
      9961  1.1 0xff 12.133V  1.1303A 119.946J
      9961  1.2 0xff 12.147V  1.0778A 118.207J
      9961  1.3 0xff  0.121V  0.0168A   0.001J
      9962  0.0 0xff  5.053V  1.3432A  64.519J
      9962  0.1 0xff  5.040V  0.0251A   0.813J
      9962  0.2 0xff  3.346V  0.6991A  21.698J
      9962  0.3 0xff 12.053V  0.5464A  59.245J
      9964  0.4 0xff 12.035V  3.4698A 104.300J
      9964  1.0 0xff 12.109V  1.0451A 121.886J
      9964  1.1 0xff 12.113V  1.0247A 119.983J
      9964  1.2 0xff 12.119V  0.9759A 118.243J
      9964  1.3 0xff  0.124V -0.0355A   0.001J
      9966  0.0 0xff  5.060V  1.3688A  64.554J
      9966  0.1 0xff  5.055V -0.0043A   0.814J
      9966  0.2 0xff  3.351V  0.6727A  21.709J
      9966  0.3 0xff 12.104V  0.4504A  59.273J
      9967  1.0 0xff 12.100V  0.9910A 121.922J
      9967  1.1 0xff 12.098V  0.9812A 120.018J
      9967  1.2 0xff 12.104V  0.9591A 118.278J
      9967  1.3 0xff  0.124V  0.0158A   0.001J
    

    where each line corresponds to a sample and:

    • col1: a timestamp counted from the last call to node-conso -M <1|2>
    • col2: the first number is the Probe ID and the second one is the channel of the probe. For instance, on the az4-n4090-1 node, 0.4 means Probe ID 0 (thus p-ATX), channel 4, which is the CPU EPS12V connector
    • col3: not documented yet
    • col4: the measured voltage in volts
    • col5: the measured current in amperes
    • col6: the energy consumed in joules since the last call to node-conso -M <1|2>
  5. Stop the measurements

    node-conso -m 1
    node-conso -m 2
    

  6. Power off I2C chains 1 and 2

    node-conso -p 1
    node-conso -p 2
    

Et voilà, that's it!