

JUNE 10 – 13, 2025 | HAMBURG, GERMANY

### Analysis of Application Power Characteristics Using Performance Counters on A64FX

### <u>Ryoma Ohara<sup>1</sup></u> Keiji Yamamoto<sup>2</sup> Toshihiro Hanawa<sup>1</sup>

1 The University of Tokyo, Japan 2 R-CCS, RIKEN

1. Introduction
2. Evaluation of Microbenchmarks
3. Analysis of PMU Counter Values
4. Estimation of Optimal Power Knob Setting
5. Related Work
6. Conclusion & Future Work

## Introduction

### Rising Electricity Costs at Centers

- Further cost increases are not acceptable
- Energy savings of even a few percent
  - $\rightarrow$ Save hundreds of thousands of dollars in electricity costs




#### Specification of the A64FX[1]

# Fugaku

- Total: 158,976 nodes (1 × A64FX per node)
- TDP: 180 W
- Experimental Setup: <u>1–2 nodes</u>



Block diagram of the A64FX CPU

| Item                    | Description                                                      |
|-------------------------|------------------------------------------------------------------|
| Architecture            | Armv8.2-A SVE 512 bit                                            |
| Number of compute cores | 48 cores                                                         |
| CPU frequency           | 2.0 / 2.2 GHz                                                    |
| Theoretical Performance | Double Precision (2.2GHz) : 3.3792 TFLOPS                        |
| Memory                  | HBM2 32 GiB, 1024 GB/s                                           |
| Interconnect            | Tofu Interconnect D (28 Gbps $\times$ 2 lanes $\times$ 10 ports) |
| I/O                     | PCIe Gen3 $\times$ 16                                            |
| Technology              | 7nm FinFET                                                       |
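
The theoretical double-precision peak in the table can be reproduced from the other entries. The following is a minimal sketch, assuming two FMA-capable FPU pipelines per core (the default in the power knob table), eight 64-bit lanes per 512-bit SVE vector, and two floating-point operations per fused multiply-add. The function name and parameters are illustrative, not part of any Fujitsu tool.

```python
def peak_dp_flops(cores=48, freq_ghz=2.2, fpu_pipelines=2, sve_bits=512):
    """Theoretical double-precision peak in FLOP/s.

    Assumes every lane of every FPU pipeline retires one FMA per cycle.
    """
    lanes = sve_bits // 64                       # 64-bit doubles per SVE vector
    flops_per_cycle = fpu_pipelines * lanes * 2  # one FMA counts as 2 operations
    return cores * freq_ghz * 1e9 * flops_per_cycle


print(f"{peak_dp_flops() / 1e12:.4f} TFLOPS")              # 2.2 GHz -> 3.3792 TFLOPS
print(f"{peak_dp_flops(freq_ghz=2.0) / 1e12:.4f} TFLOPS")  # 2.0 GHz -> 3.0720 TFLOPS
```

Under this simple model, passing `fpu_pipelines=1` halves the result, which is the arithmetic consequence of running with a single FPU pipeline.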



Supercomputer "Fugaku" (RIKEN R-CCS)

## **Power Knob**

Mechanisms to improve energy efficiency on Fugaku









| Power Knob          | Frequency [GHz] | # of FPU pipelines | retention mode |
|---------------------|-----------------|--------------------|----------------|
| normal              | 2.0             | 2                  | off            |
| boost               | <u>2.2</u>      | 2                  | off            |
| eco                 | 2.0             | <u>1</u>           | off            |
| retention           | 2.0             | 2                  | <u>on</u>      |
| boost+eco           | <u>2.2</u>      | <u>1</u>           | off            |
| boost+retention     | <u>2.2</u>      | 2                  | <u>on</u>      |
| eco+retention       | 2.0             | <u>1</u>           | <u>on</u>      |
| boost+eco+retention | <u>2.2</u>      | <u>1</u>           | <u>on</u>      |
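
The eight rows of the table are the combinations of three independent on/off switches. The sketch below enumerates them; the function name and the tuple layout are assumptions made for illustration, and only the frequencies, pipeline counts, and retention states come from the table.

```python
from itertools import product


def knob_settings():
    """Map each power knob name to (frequency_ghz, fpu_pipelines, retention_on)."""
    settings = {}
    for boost, eco, retention in product([False, True], repeat=3):
        flags = [name for name, on in
                 (("boost", boost), ("eco", eco), ("retention", retention)) if on]
        knob = "+".join(flags) or "normal"   # no flag set means the normal setting
        settings[knob] = (2.2 if boost else 2.0, 1 if eco else 2, retention)
    return settings


for knob, (freq, pipes, ret) in knob_settings().items():
    print(f"{knob:20s} {freq} GHz  {pipes} FPU pipeline(s)  retention={'on' if ret else 'off'}")
```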

# **Energy Reduction of Fugaku**



# **Presentation Outline**

1. Evaluation of microbenchmarks

   Investigate the power characteristics for each benchmark.

2. Measurement of PMU counter values for each microbenchmark

   Investigate the correlation between PMU counter values and power knobs.

3. Estimation of optimal power knob

   Determine the optimal power knob based on metrics calculated from PMU counter values.


## Microbenchmarks



Note: dgemm, stream, fft, and ptrans are taken from [2], IS and EP from [3], and the OSU benchmarks from [4].

[2] HPC Challenge Benchmark. https://hpcchallenge.org/hpcc/, Online; accessed 19 December 2024.

[3] NAS Parallel Benchmarks. https://www.nas.nasa.gov/software/npb.html, Online; accessed 19 December 2024.

[4] osu-micro-benchmarks. https://github.com/forresti/osu-micro-benchmarks, Online; accessed 28 February 2025.




[5] Sandia power api. http://powerapi.sandia.gov/, Online; accessed 19 December 2024.



## Results – MPI benchmarks

- MPI communication between 2 adjacent nodes
- 4, 8, 16, 32, 48 processes per node

Energy = <u>Power</u> $\times$ <u>Exec. Time</u>

With little increase in exec. time, reducing power improves <u>energy efficiency</u>.
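
The relation between energy, power, and execution time can be sketched as follows. The function names are illustrative, and the numbers in the example are hypothetical placeholders, not measured values from this study.

```python
def energy_joules(avg_power_w, exec_time_s):
    """Energy = average power x execution time."""
    return avg_power_w * exec_time_s


def relative_energy(power_w, time_s, base_power_w, base_time_s):
    """Energy of one setting relative to a baseline setting (1.0 = no change)."""
    return energy_joules(power_w, time_s) / energy_joules(base_power_w, base_time_s)


# Hypothetical example: power drops by 20% while execution time grows by 2%.
ratio = relative_energy(power_w=80.0, time_s=10.2, base_power_w=100.0, base_time_s=10.0)
print(f"energy relative to baseline: {ratio:.3f}")
```

In this made-up case the small slowdown is outweighed by the power reduction, so the energy ratio stays below 1.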




## **Measurement of PMU counter**

- For each benchmark, PMU counter values were measured using the fapp command.
  - For example, PMU event numbers were specified in fapp as shown below.
  - `fapp -C -d ./pmu -Hevent_raw=0x001b,0x8010,0x8028,0x8034,0x8038,0x0105,0x8043,0x0108 ./dgemm`



|       | 0x001b   | 0x8010   | 0x8028   | 0x8034 | 0x8038 | 0x0105   | 0x8043 | 0x0108 |
|-------|----------|----------|----------|--------|--------|----------|--------|--------|
| dgemm | 3.20E+11 | 1.81E+11 | 1.81E+11 | 0      | 524    | 15165304 | 233419 | 209015 |

- Counter values were collected for all 183 events listed in A64FX PMU Events [6].
  - Measured with 8, 16, 32, 48 threads and 16, 32, 64, 96 processes.

- Each benchmark was repeated as needed to ensure 15–20 seconds of execution time per measurement.
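
Collecting all 183 events takes several runs if each run records eight events, as in the example command. The sketch below builds one command line per batch using only the options shown on this slide (`-C`, `-d`, `-Hevent_raw=`). The helper name, the batch size of eight, and the per-batch output directory suffix are assumptions; check the Fujitsu profiler documentation before relying on them.

```python
def fapp_commands(event_ids, binary, out_prefix="./pmu", events_per_run=8):
    """Build one fapp command line per batch of raw PMU event numbers."""
    commands = []
    for start in range(0, len(event_ids), events_per_run):
        batch = event_ids[start:start + events_per_run]
        raw = ",".join(f"0x{event:04x}" for event in batch)
        # Hypothetical convention: one output directory per batch.
        out_dir = f"{out_prefix}_{start // events_per_run:02d}"
        commands.append(" ".join(["fapp", "-C", "-d", out_dir, f"-Hevent_raw={raw}", binary]))
    return commands


# The eight event numbers from the example command on this slide.
events = [0x001B, 0x8010, 0x8028, 0x8034, 0x8038, 0x0105, 0x8043, 0x0108]
for command in fapp_commands(events, "./dgemm"):
    print(command)
```

With a list of 183 event numbers and eight events per run, this yields 23 command lines, the last one carrying the remaining seven events.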



### Counter values normalized by exec. time



| Event Name       | Description                      | Event Name          | Description                               |
|------------------|----------------------------------|---------------------|-------------------------------------------|
| FP_SPEC          | # of executed FP instructions    | STALL_FRONTEND      | Cycles with no issuable instr. (frontend) |
| L2D_CACHE_REFILL | L2 cache refill count            | BUS_READ_TOTAL_MEM  | CMG: Local memory read count              |
| L2D_CACHE_WB     | L2 cache write-back count        | BUS_WRITE_TOTAL_MEM | CMG: Local memory write count             |
| ST_SPEC          | # of executed store instructions | BUS_READ_TOTAL_TOFU | Reads from Tofu controller to CMG         |
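
Normalizing by execution time turns raw counts into per-second rates, so that runs of different length can be compared. A minimal sketch follows; the counter values are the dgemm values from the measurement table, while the 16-second run time is an assumed placeholder within the 15–20 second range mentioned earlier.

```python
def per_second(counters, exec_time_s):
    """Convert raw PMU counts into events per second."""
    if exec_time_s <= 0:
        raise ValueError("execution time must be positive")
    return {event: value / exec_time_s for event, value in counters.items()}


# dgemm counts from the measurement table; exec_time_s is an assumed value.
dgemm = {"0x001b": 3.20e11, "0x8010": 1.81e11, "0x0105": 15165304}
rates = per_second(dgemm, exec_time_s=16.0)
for event, rate in rates.items():
    print(f"{event}: {rate:.3e} events/s")
```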


# **Optimal power knob estimation**

Based on the above analysis, the most important PMU events for understanding an application's power characteristics are:

- Floating-point operations
- L2 cache refill/write-back
- Tofu accesses

We propose a method to compute key metrics from relevant PMU counter values and suggest the optimal power knob.

Metrics used for optimal knob estimation (based on Fujitsu Manual)

| Event Name           | Description                                            |
|----------------------|--------------------------------------------------------|
| INST_SPEC            | # of executed instructions                             |
| FP_SCALE_OPS_SPEC    | # of executed FP instructions                          |
| FP_FIXED_OPS_SPEC    | # of executed Advanced SIMD and scalar FP instructions |
| L2D_CACHE_REFILL     | L2 cache refill count                                  |
| L2D_SWAP_DM          | Demand access hits in prefetch-prepared buffer         |
| L2D_CACHE_MIBMCH_PRF | Prefetch hits in demand-allocated buffer               |
| LD_SPEC              | # of executed load instructions                        |
| ST_SPEC              | # of executed store instructions                       |
| BUS_READ_TOTAL_TOFU  | Reads from Tofu controller to CMG                      |
| BUS_WRITE_TOTAL_TOFU | Writes from Tofu controller to CMG                     |

$$\text{Floating-point operation rate} = \frac{\mathrm{FP\_SCALE} \times 512/128 + \mathrm{FP\_FIXED}}{\mathrm{INST\_SP} \times 2 \times 2 \times 8} \tag{1}$$

$$\text{L2 miss rate} = \frac{\mathrm{L2\_CA\_REF} - \mathrm{L2\_SW} - \mathrm{L2\_PRF}}{\mathrm{LD\_SP} + \mathrm{ST\_SP}} \tag{2}$$

$$\text{Tofu access rate} = \frac{\mathrm{BUS\_R\_TOFU} + \mathrm{BUS\_W\_TOFU}}{\mathrm{INST\_SP}} \tag{3}$$
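
Equations (1)–(3) can be sketched in code as below. The sketch assumes that the abbreviated symbols stand for the event names in the table above (for example, INST_SP for INST_SPEC, L2_SW for L2D_SWAP_DM, and L2_PRF for L2D_CACHE_MIBMCH_PRF). The function names are illustrative and the counter values in the example are hypothetical.

```python
def fp_operation_rate(c):
    """Eq. (1): SVE operations scaled by 512/128, plus fixed-width FP operations,
    over instructions times the 2 x 2 x 8 factor from the slide."""
    ops = c["FP_SCALE_OPS_SPEC"] * 512 / 128 + c["FP_FIXED_OPS_SPEC"]
    return ops / (c["INST_SPEC"] * 2 * 2 * 8)


def l2_miss_rate(c):
    """Eq. (2): L2 refills excluding the two prefetch-related events, per load/store."""
    misses = c["L2D_CACHE_REFILL"] - c["L2D_SWAP_DM"] - c["L2D_CACHE_MIBMCH_PRF"]
    return misses / (c["LD_SPEC"] + c["ST_SPEC"])


def tofu_access_rate(c):
    """Eq. (3): Tofu reads and writes per executed instruction."""
    return (c["BUS_READ_TOTAL_TOFU"] + c["BUS_WRITE_TOTAL_TOFU"]) / c["INST_SPEC"]


# Hypothetical counter values, for illustration only.
counters = {
    "INST_SPEC": 1000, "FP_SCALE_OPS_SPEC": 2000, "FP_FIXED_OPS_SPEC": 0,
    "L2D_CACHE_REFILL": 150, "L2D_SWAP_DM": 30, "L2D_CACHE_MIBMCH_PRF": 20,
    "LD_SPEC": 400, "ST_SPEC": 100,
    "BUS_READ_TOTAL_TOFU": 5, "BUS_WRITE_TOTAL_TOFU": 5,
}
print(fp_operation_rate(counters), l2_miss_rate(counters), tofu_access_rate(counters))
```

The denominator factor 2 × 2 × 8 in Eq. (1) presumably corresponds to two FPU pipelines, two operations per FMA, and eight double-precision lanes, so the metric reads as a fraction of peak floating-point throughput per instruction.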

### Metric calculation results






# **Optimal Power Knob**

- <u>Energy</u> consumption is normalized by the value under the normal setting at each thread/process count.
- The setting with the lowest energy consumption is defined as the optimal power knob setting.

For each benchmark, measurements were taken for 32 combinations = 4 thread/process patterns $\times$ 8 power knob settings.

| benchmark     | optimal power knob  |
|---------------|---------------------|
| dgemm         | boost               |
| EP            | boost+eco+retention |
| stream        | eco                 |
| fft           | eco+retention       |
| ptrans        | eco+retention       |
| IS            | eco+retention       |
| osu_mbw_mr    | eco+retention       |
| osu_allreduce | eco+retention       |
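
The selection rule (normalize by the normal setting, then take the lowest energy) can be sketched as follows for one benchmark at one thread/process count. The function name is illustrative and the energy values are hypothetical, not results from this study.

```python
def optimal_knob(energy_by_knob):
    """Return the knob with the lowest energy and all energies normalized to 'normal'."""
    base = energy_by_knob["normal"]
    normalized = {knob: energy / base for knob, energy in energy_by_knob.items()}
    best = min(normalized, key=normalized.get)
    return best, normalized


# Hypothetical energies in joules, for illustration only.
energies = {"normal": 1000.0, "boost": 1040.0, "eco": 900.0, "eco+retention": 880.0}
best, normalized = optimal_knob(energies)
print(best, f"{normalized[best]:.2f}")
```

Repeating this for each of the four thread/process patterns covers the 32 combinations measured per benchmark.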







## **Related Work**

|           | Papadimitriou et al. [7]                                                      | Kusaba et al. [8]                                    | Fan et al. [9]                                 | Our Study                                                                          |
|-----------|-------------------------------------------------------------------------------|------------------------------------------------------|------------------------------------------------|------------------------------------------------------------------------------------|
| Target    | <u>CPU</u>: X-Gene 2/3,<br>single node                                       | <u>CPU</u>: A64FX system                            | <u>GPU</u>: AMD MI100,<br>NVIDIA V100 systems | A64FX, 1–2 nodes                                                                   |
| Approach  | Dynamic DVFS & core<br>alloc. during monitoring                               | <u><b>Node reduction</b></u> under power constraints | Fine-grained freq.<br>prediction by ML         | App analysis for<br><u>dynamic power knob</u><br><u>adjustment</u>                 |
| Method    | Optimize settings based on L3 cache access rate                               | Reduce nodes based on power variation                | Static feature extraction at compile time.     | Determine the optimal<br>power knob <u>using only</u><br><u>PMU counter values</u> |
| Objective | Energy, ED <sup>2</sup> P<br>(ED <sup>2</sup> P=Energy × Delay <sup>2</sup> ) | Peak system perf.                                    | EDP, ED <sup>2</sup> P, etc.                   | <u>Energy</u>                                                                      |

[7] Papadimitriou, G., Chatzidimitriou, A., & Gizopoulos, D. (2019, February). Adaptive voltage/frequency scaling and core allocation for balanced energy and performance on multicore CPUs. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 133-146). IEEE.

[8] Tomoya Kusaba, Yusuke Awaki, Kohei Yoshida, Shinobu Miwa, Hayato Yamaki, Toshihiro Hanawa, and Hiroki Honda. Power-efficiency variation on A64FX supercomputers and its application to system operation. In 2024 IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops), pp. 55–65. IEEE, 2024.

[9] Fan, K., D'Antonio, M., Carpentieri, L., Cosenza, B., Ficarelli, F., & Cesarini, D. (2023, November). SYnergy: Fine-grained Energy-Efficient Heterogeneous Computing for Scalable Energy Saving. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 1-13).
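
The objectives in the comparison table weight execution time differently. The sketch below computes energy, EDP (energy × delay), and ED²P (energy × delay²) for two settings; the function name is illustrative and the power and time values are hypothetical.

```python
def energy_metrics(avg_power_w, delay_s):
    """Return Energy, EDP, and ED2P for one run."""
    energy = avg_power_w * delay_s
    return {"Energy": energy, "EDP": energy * delay_s, "ED2P": energy * delay_s ** 2}


# Hypothetical runs: the second uses less power but takes longer.
baseline = energy_metrics(avg_power_w=100.0, delay_s=10.0)
low_power = energy_metrics(avg_power_w=80.0, delay_s=12.0)
for name in ("Energy", "EDP", "ED2P"):
    print(f"{name}: baseline={baseline[name]:.0f}  low_power={low_power[name]:.0f}")
```

In this made-up case the low-power run wins on energy but loses on EDP and ED²P, which shows why the choice of objective changes which setting counts as optimal.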


# Conclusion

To improve energy efficiency on supercomputers, we conducted the following:

- <u>Analyzed correlations between PMU counter values, application power characteristics, and power knob settings for each benchmark.</u>
  - Demonstrated that using power knobs can reduce energy consumption by <u>up to 28.4% (48 threads) and 53.8% (8 threads)</u> compared to normal settings, highlighting their effectiveness per application.
  - Through PMU-based power analysis, <u>we revealed correlations among PMU values, app characteristics</u>, and power knob choices.
- Towards optimal power knob estimation:
  - Based on the analysis, we selected key PMU events to estimate app characteristics and proposed a method to choose the optimal power knob using only PMU counter values.

# **Future Work**

### **Practical Power Knob Estimation**

- Increase the number of target applications
- Include more complex, real-world applications
- Apply machine learning approaches

### **Dynamic Power Knob Adjustment**

- Use ML model to change power knobs dynamically during execution
- Build a system for real-time estimation and control

### **Portability**

- Explore applicability to other systems, especially GPUs




**Thank you for listening!**