[rocprof-compute] added f4f6 description and VALU FLOPS split for empirical peaks (#739)

* added f4f6 description and VALU FLOPS split

* changed peak ammolite vars to local

* reverted to dict peak initialization

* ruff check format

* updated VALU descriptions

* updated VALU descriptions

* Update parser.py

* Update parser.py

Added graceful NameError handling
Moved globals() update to init_metric_evaluator with ammolite__ vars and raw pmc_df

* update formatting
This commit is contained in:
jamessiddeley-amd
2025-08-27 12:46:41 -04:00
committed by GitHub
parent 5201efe050
commit f0955f5a83
9 changed files with 420 additions and 148 deletions
+33 -11
@@ -3,9 +3,17 @@ Panel Config:
id: 400
title: Roofline
metrics_description:
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F16 FLOPs
achievable on the specific accelerator. Note: this does not include any F16
operations from MFMA instructions.'
VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F32 FLOPs
achievable on the specific accelerator. Note: this does not include any F32
operations from MFMA instructions.'
VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F64 FLOPs
achievable on the specific accelerator. Note: this does not include any F64
operations from MFMA instructions.'
MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
executed per second. This does not include any 16-bit brain floating point operations
@@ -28,6 +36,11 @@ Panel Config:
per second. Note: this does not include any 64-bit floating point operations
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
executed per second. Note: this does not include any floating point operations
from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. It is supported
on AMD Instinct MI350 series (gfx950) and later only.'
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
per second. Note: this does not include any 8-bit integer operations from VALU
instructions. The peak empirically measured INT8 MFMA operations achievable
@@ -80,15 +93,24 @@ Panel Config:
unit: Unit
peak: Peak (Empirical)
metric:
VALU FLOPs:
value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9))
/ 1e9)
VALU FLOPs (F16):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
peak: $FP16Flops_empirical_peak
VALU FLOPs (F32):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32
+ (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $FP32Flops_empirical_peak
VALU FLOPs (F64):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64
+ (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $FP64Flops_empirical_peak
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
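For reference, the three per-precision `value` expressions above share one piece of arithmetic: ADD, MUL and TRANS instructions count as one operation per lane, FMA counts as two, and the sum is scaled by the wave size and divided by the kernel duration. Below is a minimal Python sketch of that arithmetic for a single dispatch. It assumes timestamps are in nanoseconds (which the `/ 1e9` suggests) and that the counters arrive as a plain dict. The function name and the dict layout are illustrative and not part of rocprof-compute, and the `AVG(...)` across dispatches is left out.

```python
def valu_gflops(counters, wave_size, start_ns, end_ns, precision):
    """Return VALU GFLOP/s for one precision ("F16", "F32" or "F64").

    `counters` is a hypothetical dict of SQ_INSTS_VALU_* values for one dispatch.
    """
    ops_per_lane = (
        counters[f"SQ_INSTS_VALU_ADD_{precision}"]
        + counters[f"SQ_INSTS_VALU_MUL_{precision}"]
        + 2 * counters[f"SQ_INSTS_VALU_FMA_{precision}"]  # FMA = multiply + add
        + counters[f"SQ_INSTS_VALU_TRANS_{precision}"]
    )
    seconds = (end_ns - start_ns) / 1e9  # assumed: timestamps in nanoseconds
    return wave_size * ops_per_lane / seconds / 1e9


sample = {
    "SQ_INSTS_VALU_ADD_F32": 10,
    "SQ_INSTS_VALU_MUL_F32": 20,
    "SQ_INSTS_VALU_FMA_F32": 30,
    "SQ_INSTS_VALU_TRANS_F32": 5,
}
f32_rate = valu_gflops(sample, wave_size=64, start_ns=0, end_ns=1000, precision="F32")
```

Splitting by precision lets each rate be compared against its own `$FP16Flops_empirical_peak`, `$FP32Flops_empirical_peak` or `$FP64Flops_empirical_peak` value instead of one combined theoretical peak.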
+33 -11
@@ -3,9 +3,17 @@ Panel Config:
id: 400
title: Roofline
metrics_description:
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F16 FLOPs
achievable on the specific accelerator. Note: this does not include any F16
operations from MFMA instructions.'
VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F32 FLOPs
achievable on the specific accelerator. Note: this does not include any F32
operations from MFMA instructions.'
VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F64 FLOPs
achievable on the specific accelerator. Note: this does not include any F64
operations from MFMA instructions.'
MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
executed per second. This does not include any 16-bit brain floating point operations
@@ -28,6 +36,11 @@ Panel Config:
per second. Note: this does not include any 64-bit floating point operations
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
executed per second. Note: this does not include any floating point operations
from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. It is supported
on AMD Instinct MI350 series (gfx950) and later only.'
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
per second. Note: this does not include any 8-bit integer operations from VALU
instructions. The peak empirically measured INT8 MFMA operations achievable
@@ -80,15 +93,24 @@ Panel Config:
unit: Unit
peak: Peak (Empirical)
metric:
VALU FLOPs:
value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9))
/ 1e9)
VALU FLOPs (F16):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
peak: $FP16Flops_empirical_peak
VALU FLOPs (F32):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32
+ (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $FP32Flops_empirical_peak
VALU FLOPs (F64):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64
+ (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $FP64Flops_empirical_peak
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
+33 -11
@@ -3,9 +3,17 @@ Panel Config:
id: 400
title: Roofline
metrics_description:
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F16 FLOPs
achievable on the specific accelerator. Note: this does not include any F16
operations from MFMA instructions.'
VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F32 FLOPs
achievable on the specific accelerator. Note: this does not include any F32
operations from MFMA instructions.'
VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F64 FLOPs
achievable on the specific accelerator. Note: this does not include any F64
operations from MFMA instructions.'
MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
executed per second. This does not include any 16-bit brain floating point operations
@@ -28,6 +36,11 @@ Panel Config:
per second. Note: this does not include any 64-bit floating point operations
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
executed per second. Note: this does not include any floating point operations
from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. It is supported
on AMD Instinct MI350 series (gfx950) and later only.'
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
per second. Note: this does not include any 8-bit integer operations from VALU
instructions. The peak empirically measured INT8 MFMA operations achievable
@@ -80,15 +93,24 @@ Panel Config:
unit: Unit
peak: Peak (Empirical)
metric:
VALU FLOPs:
value: AVG(($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) ) / ((End_Timestamp - Start_Timestamp) / 1e9))
/ 1e9)
VALU FLOPs (F16):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
peak: $FP16Flops_empirical_peak
VALU FLOPs (F32):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32
+ (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $FP32Flops_empirical_peak
VALU FLOPs (F64):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64
+ (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $FP64Flops_empirical_peak
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
+33 -11
@@ -3,9 +3,17 @@ Panel Config:
id: 400
title: Roofline
metrics_description:
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F16 FLOPs
achievable on the specific accelerator. Note: this does not include any F16
operations from MFMA instructions.'
VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F32 FLOPs
achievable on the specific accelerator. Note: this does not include any F32
operations from MFMA instructions.'
VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F64 FLOPs
achievable on the specific accelerator. Note: this does not include any F64
operations from MFMA instructions.'
MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
executed per second. This does not include any 16-bit brain floating point operations
@@ -28,6 +36,11 @@ Panel Config:
per second. Note: this does not include any 64-bit floating point operations
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
executed per second. Note: this does not include any floating point operations
from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. It is supported
on AMD Instinct MI350 series (gfx950) and later only.'
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
per second. Note: this does not include any 8-bit integer operations from VALU
instructions. The peak empirically measured INT8 MFMA operations achievable
@@ -80,15 +93,24 @@ Panel Config:
unit: Unit
peak: Peak (Empirical)
metric:
VALU FLOPs:
value: AVG(($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) ) / ((End_Timestamp - Start_Timestamp) / 1e9))
/ 1e9)
VALU FLOPs (F16):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
peak: $FP16Flops_empirical_peak
VALU FLOPs (F32):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32
+ (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $FP32Flops_empirical_peak
VALU FLOPs (F64):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64
+ (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $FP64Flops_empirical_peak
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
+33 -11
@@ -3,9 +3,17 @@ Panel Config:
id: 400
title: Roofline
metrics_description:
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F16 FLOPs
achievable on the specific accelerator. Note: this does not include any F16
operations from MFMA instructions.'
VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F32 FLOPs
achievable on the specific accelerator. Note: this does not include any F32
operations from MFMA instructions.'
VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F64 FLOPs
achievable on the specific accelerator. Note: this does not include any F64
operations from MFMA instructions.'
MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
executed per second. This does not include any 16-bit brain floating point operations
@@ -28,6 +36,11 @@ Panel Config:
per second. Note: this does not include any 64-bit floating point operations
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
executed per second. Note: this does not include any floating point operations
from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. It is supported
on AMD Instinct MI350 series (gfx950) and later only.'
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
per second. Note: this does not include any 8-bit integer operations from VALU
instructions. The peak empirically measured INT8 MFMA operations achievable
@@ -80,15 +93,24 @@ Panel Config:
unit: Unit
peak: Peak (Empirical)
metric:
VALU FLOPs:
value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9))
/ 1e9)
VALU FLOPs (F16):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
peak: $FP16Flops_empirical_peak
VALU FLOPs (F32):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32
+ (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $FP32Flops_empirical_peak
VALU FLOPs (F64):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64
+ (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $FP64Flops_empirical_peak
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
+33 -11
@@ -3,9 +3,17 @@ Panel Config:
id: 400
title: Roofline
metrics_description:
VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
This is also presented as a percent of the peak theoretical FLOPs achievable
on the specific accelerator. Note: this does not include any floating-point
VALU FLOPs (F16): 'The total 16-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F16 FLOPs
achievable on the specific accelerator. Note: this does not include any F16
operations from MFMA instructions.'
VALU FLOPs (F32): 'The total 32-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F32 FLOPs
achievable on the specific accelerator. Note: this does not include any F32
operations from MFMA instructions.'
VALU FLOPs (F64): 'The total 64-bit floating-point operations executed per second
on the VALU. This is presented with the value of the peak empirical F64 FLOPs
achievable on the specific accelerator. Note: this does not include any F64
operations from MFMA instructions.'
MFMA FLOPs (F8): The total number of 8-bit brain floating point MFMA operations
executed per second. This does not include any 16-bit brain floating point operations
@@ -28,6 +36,11 @@ Panel Config:
per second. Note: this does not include any 64-bit floating point operations
from VALU instructions. The peak empirically measured F64 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison.'
MFMA FLOPs (F6F4): 'The total number of 4-bit and 6-bit floating point MFMA operations
executed per second. Note: this does not include any floating point operations
from VALU instructions. The peak empirically measured F6F4 MFMA operations achievable
on the specific accelerator is displayed alongside for comparison. It is supported
on AMD Instinct MI350 series (gfx950) and later only.'
MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
per second. Note: this does not include any 8-bit integer operations from VALU
instructions. The peak empirically measured INT8 MFMA operations achievable
@@ -80,15 +93,24 @@ Panel Config:
unit: Unit
peak: Peak (Empirical)
metric:
VALU FLOPs:
value: AVG((($wave_size * ( (SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_ADD_F32
+ SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32)
+ (SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64)
+ SQ_INSTS_VALU_TRANS_F64) )) / ((End_Timestamp - Start_Timestamp) / 1e9))
/ 1e9)
VALU FLOPs (F16):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16
+ (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
peak: $FP16Flops_empirical_peak
VALU FLOPs (F32):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32
+ (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $FP32Flops_empirical_peak
VALU FLOPs (F64):
value: AVG((($wave_size * ( SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64
+ (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64 )) / ((End_Timestamp
- Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $FP64Flops_empirical_peak
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp)
/ 1e9)) / 1e9)
+80 -33
@@ -771,7 +771,7 @@ def build_metric_value_string(dfs, dfs_type, normal_unit, profiling_config):
def init_metric_evaluator(
raw_pmc_df: Union[pd.DataFrame, dict], ammolite_vars: dict
raw_pmc_df: Union[pd.DataFrame, dict], ammolite_vars: dict, empirical_peaks: dict
) -> None:
if isinstance(raw_pmc_df, dict):
raw_pmc_df_keys = set(raw_pmc_df.keys())
@@ -790,6 +790,7 @@ def init_metric_evaluator(
# The process-local globals are used for performance optimization.
globals().update(raw_pmc_df_items)
globals().update(ammolite_vars)
globals().update(empirical_peaks)
def run_metric_evaluator(row_expr: str) -> str:
@@ -821,6 +822,38 @@ def run_metric_evaluator(row_expr: str) -> str:
console_error("analysis", str(ae))
def create_empirical_peaks_dict(empirical_peaks_df):
"""Create empirical peaks dictionary"""
empirical_peaks = {}
if not empirical_peaks_df.empty:
peak_data_row = empirical_peaks_df.iloc[0]
for col in empirical_peaks_df.columns:
empirical_peaks[f"ammolite__{col}_empirical_peak"] = peak_data_row[col]
else:
peak_names = [
"FP16Flops",
"FP32Flops",
"FP64Flops",
"MFMAF64Flops",
"MFMAF32Flops",
"MFMAF16Flops",
"MFMABF16Flops",
"MFMAF8Flops",
"MFMAI8Ops",
"HBMBw",
"L2Bw",
"L1Bw",
"LDSBw",
"MFMA_FLOPs_F6F4",
]
# initialize peaks to 0
for peak_name in peak_names:
empirical_peaks[f"ammolite__{peak_name}_empirical_peak"] = 0
return empirical_peaks
@demarcate
def eval_metric(dfs, dfs_type, sys_info, empirical_peaks_df, raw_pmc_df, debug, config):
"""
@@ -927,32 +960,10 @@ def eval_metric(dfs, dfs_type, sys_info, empirical_peaks_df, raw_pmc_df, debug,
"wave_size is not available in sysinfo.csv, please provide the correct "
"value using --specs-correction"
)
if not empirical_peaks_df.empty:
peak_data_row = empirical_peaks_df.iloc[0]
for metric_name in empirical_peaks_df.columns:
var_name = f"ammolite__{metric_name}_empirical_peak"
locals()[var_name] = peak_data_row[metric_name]
else:
default_peaks = [
"MFMAF64Flops",
"MFMAF32Flops",
"MFMAF16Flops",
"MFMABF16Flops",
"MFMAF8Flops",
"MFMAI8Ops",
"HBMBw",
"L2Bw",
"L1Bw",
"LDSBw",
"MFMA_FLOPs_F6F4",
]
# set values to 0 if no no empirical peaks from roofline.csv are provided
for peak_name in default_peaks:
var_name = f"ammolite__{peak_name}_empirical_peak"
exec(f"{var_name} = 0", globals(), locals())
empirical_peaks = create_empirical_peaks_dict(empirical_peaks_df)
# TODO: fix all $normUnit in Unit column or title
# build and eval all derived build-in global variables
ammolite__build_in = {}
@@ -966,6 +977,8 @@ def eval_metric(dfs, dfs_type, sys_info, empirical_peaks_df, raw_pmc_df, debug,
ammolite__build_in[key] = eval(compile(s, "<string>", "eval"))
except TypeError:
ammolite__build_in[key] = None
except NameError:
ammolite__build_in[key] = None
except KeyError:
ammolite__build_in[key] = None
except AttributeError as ae:
@@ -1022,12 +1035,32 @@ def eval_metric(dfs, dfs_type, sys_info, empirical_peaks_df, raw_pmc_df, debug,
)
if matched_vars:
for v in matched_vars:
print(
"Var ",
v,
":",
eval(compile(v, "<string>", "eval")),
)
try:
value = eval(
compile(v, "<string>", "eval")
)
print("Var ", v, ":", value)
except NameError:
if "_empirical_peak" in v:
if v in empirical_peaks:
print(
"Var ",
v,
":",
empirical_peaks[v],
)
else:
print(
"Var ",
v,
": [empirical peak not found]", # noqa
)
else:
print(
"Var ",
v,
": [not available in main thread]", # noqa
)
matched_cols = re.findall(
r"raw_pmc_df\['\w+'\]\['\w+'\]", row[expr]
)
@@ -1063,6 +1096,21 @@ def eval_metric(dfs, dfs_type, sys_info, empirical_peaks_df, raw_pmc_df, debug,
eval(compile(row[expr], "<string>", "eval"))
)
print("~" * 40)
except NameError as ne:
if "empirical_peak" in str(ne):
console_warning(
"Skipping debug evaluation. Empirical peak variables " # noqa
"not available in main thread: {}".format( # noqa
str(ne)
)
)
else:
console_warning(
"Skipping debug evaluation. Variable not available: {}".format( # noqa
str(ne)
)
)
print("~" * 40)
except TypeError:
console_warning(
"Skipping entry. Encountered a missing "
@@ -1100,7 +1148,6 @@ def eval_metric(dfs, dfs_type, sys_info, empirical_peaks_df, raw_pmc_df, debug,
ammolite_vars = {
key: val for key, val in locals().items() if key.startswith("ammolite__")
}
# Empirically, 16 is about as much as we need.
processes = min(16, multiprocessing.cpu_count() // 2)
@@ -1108,7 +1155,7 @@ def eval_metric(dfs, dfs_type, sys_info, empirical_peaks_df, raw_pmc_df, debug,
with multiprocessing.Pool(
processes=processes,
initializer=init_metric_evaluator,
initargs=(raw_pmc_df, ammolite_vars),
initargs=(raw_pmc_df, ammolite_vars, empirical_peaks),
) as pool:
outs = pool.map(run_metric_evaluator, row_exprs)
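The parser change above moves the empirical peaks out of `locals()`/`exec` and into a plain dict that is handed to each worker through the pool initializer. The following is a simplified sketch of that pattern, not the actual parser code: `init_worker`, `evaluate`, `default_peaks` and `evaluate_all` are hypothetical names, and the real functions take more arguments and handle more exception types.

```python
import multiprocessing


def default_peaks(names):
    # Zero-initialised peaks, used when no measured peak data is available.
    return {f"ammolite__{name}_empirical_peak": 0 for name in names}


def init_worker(counters, peaks):
    # Runs once per worker process: publish the names into this module's
    # globals so that evaluate() can resolve them without per-task transfer.
    globals().update(counters)
    globals().update(peaks)


def evaluate(expr):
    # Evaluate one metric expression; an unknown name yields None
    # instead of aborting the whole run.
    try:
        return eval(compile(expr, "<string>", "eval"))
    except NameError:
        return None


def evaluate_all(exprs, counters, peaks, processes=2):
    # Each worker calls init_worker(counters, peaks) before taking tasks.
    with multiprocessing.Pool(
        processes=processes,
        initializer=init_worker,
        initargs=(counters, peaks),
    ) as pool:
        return pool.map(evaluate, exprs)
```

Passing a dict through `initargs` keeps the peak values explicit and picklable. Writing to `locals()` inside a function, as the removed code did, is not guaranteed to create usable local variables in Python.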
+6 -6
@@ -23,12 +23,12 @@ src/rocprof_compute_soc/analysis_configs/gfx940/0300_memory_chart.yaml: 249e9ae0
src/rocprof_compute_soc/analysis_configs/gfx941/0300_memory_chart.yaml: 249e9ae0445de0927827ec14d20f946a07d50d92fd56e1993bbe0c17eb65bd51
src/rocprof_compute_soc/analysis_configs/gfx942/0300_memory_chart.yaml: 249e9ae0445de0927827ec14d20f946a07d50d92fd56e1993bbe0c17eb65bd51
src/rocprof_compute_soc/analysis_configs/gfx950/0300_memory_chart.yaml: 643b31ffa43bc3613d6f90b0c23d95093d0d0aa5bc8e72d9a0fbc1b739a08b67
src/rocprof_compute_soc/analysis_configs/gfx908/0400_roofline.yaml: 6406ce67cd55064f0d2db2a3511c6536cc1625314ddb31366900fbf3c60ed523
src/rocprof_compute_soc/analysis_configs/gfx90a/0400_roofline.yaml: 100d555cf9e70b892e22f92ddd9c0a5d1f914d07077c4a8d35941e8ad62b5b30
src/rocprof_compute_soc/analysis_configs/gfx940/0400_roofline.yaml: f8bf66f43c9afede4fd1f17c279050cc27cc6fbc1cdb53a71ae8ceb0eb84dc37
src/rocprof_compute_soc/analysis_configs/gfx941/0400_roofline.yaml: 6fae04dcf4bcabe4a71f5d9eefc379a38d30cdf05fbb14e2c276e1c272fdb3f6
src/rocprof_compute_soc/analysis_configs/gfx942/0400_roofline.yaml: c8dfe7df24f94dfa229ffa2035b802c6833ce98f7710e0889bc5710f2167d4c0
src/rocprof_compute_soc/analysis_configs/gfx950/0400_roofline.yaml: 734fdfa818bfd8a87e01a0dd795c502a567c72158ca9b7bfe01e99451e8aa537
src/rocprof_compute_soc/analysis_configs/gfx908/0400_roofline.yaml: f346786a63056303a0c6137f82bcea0baa3bb5bf63bfbef792df531725f64d1d
src/rocprof_compute_soc/analysis_configs/gfx90a/0400_roofline.yaml: 92aced41d421faf00ed542da38f47ad22c2fbf1f0383c550e255ed9bab95b0d8
src/rocprof_compute_soc/analysis_configs/gfx940/0400_roofline.yaml: 867a31d84eeb5b5aa64606fcdb67732347acc43e88121fcc70af79169f44fd83
src/rocprof_compute_soc/analysis_configs/gfx941/0400_roofline.yaml: 3f7bef2487df1ece7302de3f413a14032ba35d44fcf4e5ddee5186e2fc223797
src/rocprof_compute_soc/analysis_configs/gfx942/0400_roofline.yaml: 58272f5d1136489255a7bf9c6ade720a0567b17ab58dc58ae796597ae4d73ce5
src/rocprof_compute_soc/analysis_configs/gfx950/0400_roofline.yaml: 4362a8120d70cc0e62abd4367a09207897bbc3be44805092dabaefda41803391
src/rocprof_compute_soc/analysis_configs/gfx908/0500_command_processor_cpc_cpf.yaml: da1c2997d42d66da2aa069caa741cf9eade124357c56e4290333de2f3e0412bb
src/rocprof_compute_soc/analysis_configs/gfx90a/0500_command_processor_cpc_cpf.yaml: da1c2997d42d66da2aa069caa741cf9eade124357c56e4290333de2f3e0412bb
src/rocprof_compute_soc/analysis_configs/gfx940/0500_command_processor_cpc_cpf.yaml: da1c2997d42d66da2aa069caa741cf9eade124357c56e4290333de2f3e0412bb
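Each entry in this file pairs a config path with a 64-character hex digest, which is the length of a SHA-256 hash. Assuming the digest is taken over the raw bytes of the YAML file (the diff does not show how the project computes it), a value of this shape could be produced as follows. `sha256_of_file` is an illustrative helper, not a function from the repository.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Under that assumption, any edit to a `0400_roofline.yaml` file changes its digest, which would explain why all six roofline entries are replaced in this commit.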
+136 -43
@@ -2802,14 +2802,24 @@ panels:
peak: Peak (Empirical)
metric:
gfx90a:
VALU FLOPs:
VALU FLOPs (F16):
value: AVG((($wave_size * (
(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16
)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
peak: $FP16Flops_empirical_peak
VALU FLOPs (F32):
value: AVG((($wave_size * (
SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32
)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $FP32Flops_empirical_peak
VALU FLOPs (F64):
value: AVG((($wave_size * (
SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64
)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
peak: $FP64Flops_empirical_peak
MFMA FLOPs (F64):
value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
unit: GFLOP/s
@@ -2855,14 +2865,24 @@ panels:
 unit: GB/s
 peak: $LDSBw_empirical_peak
 gfx908:
-VALU FLOPs:
+VALU FLOPs (F16):
 value: AVG((($wave_size * (
-(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
-(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
-(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16
 )) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
 unit: GFLOP/s
-peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+peak: $FP16Flops_empirical_peak
+VALU FLOPs (F32):
+value: AVG((($wave_size * (
+SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32
+)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+unit: GFLOP/s
+peak: $FP32Flops_empirical_peak
+VALU FLOPs (F64):
+value: AVG((($wave_size * (
+SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64
+)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+unit: GFLOP/s
+peak: $FP64Flops_empirical_peak
 MFMA FLOPs (F64):
 value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
 unit: GFLOP/s
@@ -2908,14 +2928,24 @@ panels:
 unit: GB/s
 peak: $LDSBw_empirical_peak
 gfx940:
-VALU FLOPs:
-value: AVG(($wave_size * (
-(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
-(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
-(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
-) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+VALU FLOPs (F16):
+value: AVG((($wave_size * (
+SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16
+)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
 unit: GFLOP/s
-peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+peak: $FP16Flops_empirical_peak
+VALU FLOPs (F32):
+value: AVG((($wave_size * (
+SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32
+)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+unit: GFLOP/s
+peak: $FP32Flops_empirical_peak
+VALU FLOPs (F64):
+value: AVG((($wave_size * (
+SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64
+)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+unit: GFLOP/s
+peak: $FP64Flops_empirical_peak
 MFMA FLOPs (F64):
 value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
 unit: GFLOP/s
@@ -2965,14 +2995,24 @@ panels:
 unit: GB/s
 peak: $LDSBw_empirical_peak
 gfx941:
-VALU FLOPs:
-value: AVG(($wave_size * (
-(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
-(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
-(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
-) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+VALU FLOPs (F16):
+value: AVG((($wave_size * (
+SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16
+)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
 unit: GFLOP/s
-peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+peak: $FP16Flops_empirical_peak
+VALU FLOPs (F32):
+value: AVG((($wave_size * (
+SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32
+)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+unit: GFLOP/s
+peak: $FP32Flops_empirical_peak
+VALU FLOPs (F64):
+value: AVG((($wave_size * (
+SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64
+)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+unit: GFLOP/s
+peak: $FP64Flops_empirical_peak
 MFMA FLOPs (F64):
 value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
 unit: GFLOP/s
@@ -3022,14 +3062,24 @@ panels:
 unit: GB/s
 peak: $LDSBw_empirical_peak
 gfx942:
-VALU FLOPs:
+VALU FLOPs (F16):
 value: AVG((($wave_size * (
-(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
-(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
-(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16
 )) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
 unit: GFLOP/s
-peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+peak: $FP16Flops_empirical_peak
+VALU FLOPs (F32):
+value: AVG((($wave_size * (
+SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32
+)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+unit: GFLOP/s
+peak: $FP32Flops_empirical_peak
+VALU FLOPs (F64):
+value: AVG((($wave_size * (
+SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64
+)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+unit: GFLOP/s
+peak: $FP64Flops_empirical_peak
 MFMA FLOPs (F64):
 value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
 unit: GFLOP/s
@@ -3079,14 +3129,24 @@ panels:
 unit: GB/s
 peak: $LDSBw_empirical_peak
 gfx950:
-VALU FLOPs:
+VALU FLOPs (F16):
 value: AVG((($wave_size * (
-(SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16) +
-(SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32) +
-(SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64)
+SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16 + (2 * SQ_INSTS_VALU_FMA_F16) + SQ_INSTS_VALU_TRANS_F16
 )) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
 unit: GFLOP/s
-peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
+peak: $FP16Flops_empirical_peak
+VALU FLOPs (F32):
+value: AVG((($wave_size * (
+SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32 + (2 * SQ_INSTS_VALU_FMA_F32) + SQ_INSTS_VALU_TRANS_F32
+)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+unit: GFLOP/s
+peak: $FP32Flops_empirical_peak
+VALU FLOPs (F64):
+value: AVG((($wave_size * (
+SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64 + (2 * SQ_INSTS_VALU_FMA_F64) + SQ_INSTS_VALU_TRANS_F64
+)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
+unit: GFLOP/s
+peak: $FP64Flops_empirical_peak
 MFMA FLOPs (F64):
 value: AVG((((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512)) / ((End_Timestamp - Start_Timestamp) / 1e9)) / 1e9)
 unit: GFLOP/s
@@ -3578,15 +3638,35 @@ panels:
 ) / 1e9
 unit: GFLOP/s
 metrics_description:
-VALU FLOPs:
-plain: 'The total floating-point operations executed per second on the VALU.
-This is also presented as a percent of the peak theoretical FLOPs achievable
-on the specific accelerator. Note: this does not include any floating-point
-operations from MFMA instructions.'
-rst: 'The total floating-point operations executed per second on the :ref:`VALU
-<desc-valu>`. This is also presented as a percent of the peak theoretical
-FLOPs achievable on the specific accelerator. Note: this does not include
-any floating-point operations from :ref:`MFMA <desc-mfma>` instructions.'
+VALU FLOPs (F16):
+plain: 'The total 16-bit floating-point operations executed per second on the VALU.
+This is presented with the value of the peak empirical F16 FLOPs achievable
+on the specific accelerator. Note: this does not include any F16 operations
+from MFMA instructions.'
+rst: 'The total 16-bit floating-point operations executed per second on the :ref:`VALU
+<desc-valu>`. This is presented with the value of the peak empirical F16 FLOPs achievable
+on the specific accelerator. Note: this does not include any F16 operations
+from :ref:`MFMA <desc-mfma>` instructions.'
 unit: GFLOPs
+VALU FLOPs (F32):
+plain: 'The total 32-bit floating-point operations executed per second on the VALU.
+This is presented with the value of the peak empirical F32 FLOPs achievable
+on the specific accelerator. Note: this does not include any F32 operations
+from MFMA instructions.'
+rst: 'The total 32-bit floating-point operations executed per second on the :ref:`VALU
+<desc-valu>`. This is presented with the value of the peak empirical F32 FLOPs achievable
+on the specific accelerator. Note: this does not include any F32 operations
+from :ref:`MFMA <desc-mfma>` instructions.'
+unit: GFLOPs
+VALU FLOPs (F64):
+plain: 'The total 64-bit floating-point operations executed per second on the VALU.
+This is presented with the value of the peak empirical F64 FLOPs achievable
+on the specific accelerator. Note: this does not include any F64 operations
+from MFMA instructions.'
+rst: 'The total 64-bit floating-point operations executed per second on the :ref:`VALU
+<desc-valu>`. This is presented with the value of the peak empirical F64 FLOPs achievable
+on the specific accelerator. Note: this does not include any F64 operations
+from :ref:`MFMA <desc-mfma>` instructions.'
+unit: GFLOPs
 MFMA FLOPs (F8):
 plain: The total number of 8-bit brain floating point MFMA operations executed
@@ -3646,6 +3726,19 @@ panels:
 measured F64 MFMA operations achievable on the specific accelerator is
 displayed alongside for comparison.'
 unit: GFLOPs
+MFMA FLOPs (F6F4):
+plain: 'The total number of 4-bit and 6-bit floating point MFMA operations executed
+per second. Note: this does not include any floating point operations from
+VALU instructions. The peak empirically measured F6F4 MFMA operations
+achievable on the specific accelerator is displayed alongside for comparison.
+It is supported on AMD Instinct MI350 series (gfx950) and later only.'
+rst: 'The total number of 4-bit and 6-bit floating point :ref:`MFMA <desc-mfma>`
+operations executed per second. Note: this does not include any floating point
+operations from :ref:`VALU <desc-valu>` instructions. The peak empirically
+measured F6F4 MFMA operations achievable on the specific accelerator is
+displayed alongside for comparison. It is supported on AMD Instinct MI350
+series (gfx950) and later only.'
+unit: GFLOPs
 MFMA IOPs (Int8):
 plain: 'The total number of 8-bit integer MFMA operations executed per second.
 Note: this does not include any 8-bit integer operations from VALU instructions.