diff --git a/README.md b/README.md index 239ee5f180..eac09a9d3e 100644 --- a/README.md +++ b/README.md @@ -304,59 +304,58 @@ Usage: Tool used to collect fine-grained hardware metrics. Provides ISA-level instruction hotspot analysis via hardware tracing. - Install plugin package. See Plugin Support section for installation - - Run the following to view the trace. Att-specific options must come right after the assembly file + - Run the following to view the trace. Att-specific options must come right after the assembly file. + - On ROCm 6.0, ATT enables automatic capture of the ISA during kernel execution, and does not require recompiling. It is recommeneded to leave at "auto". ```bash - rocprofv2 -i input.txt --plugin att --mode network + rocprofv2 -i input.txt --plugin att auto --mode csv + # This is the 5.7 version: + # rocprofv2 -i input.txt --plugin att --mode csv ``` - - app_assembly_file: - On ROCm 6.0, ATT enables automatic capture of the ISA during kernel execution, and does not require recompiling. It is recommeneded to leave at "auto". - app_relative_path Path for the running application - ATT plugin optional parameters - - --depth [n]: How many waves per slot to parse (maximum). - - --mpi [proc]: Parse with this many mpi processes, for greater analysis speed. Does not change results. Requires mpi4py. - - --att_kernel "filename": Kernel filename to use (instead of ATT asking which one to use). - - --trace_file "files": glob (wildcards allowed) of traces files to parse. Requires quotes for use with wildcards. - - --mode [network, file, csv, off (default)] - - network + - --att_kernel "filename": Kernel filename(s) (glob) to use. A CSV file (or UI folder) will be generated for each kernel.txt file. Default: all in current folder. + - --mode [csv, network, file, off (default)] + - off + Runs trace collection but not analysis, so it can be analyzed at a later time. Run rocprofv2 ATT with the same parameters (+ --mode csv), removing the application binary, to analyze previously generated traces. + - csv + Dumps the analyzed assembly into a CSV format, with the hitcount and total cycles cost. Recommended mode for most users. + - network (deprecated) Opens the server with the browser UI. att needs 2 ports available (e.g. 8000, 18000). There is an option (default: --ports "8000,18000") to change these. In case rocprofv2 is running on a different machine, use port forwarding `ssh -L 8000:localhost:8000 ` so the browser can be used locally. For docker, use --network=host --ipc=host -p8000:8000 -p18000:18000 - - file + - file (deprecated) Dumps the analyzed json files to disk for vieweing at a later time. Run python3 httpserver.py from within the generated ui/ folder to view the trace, similarly to network mode. The folder can be copied to another machine, and will run without rocm. - - csv - Dumps the analyzed assembly into a CSV format, with the hitcount and total cycles cost. - Use rocprofv2's -o option to specify output file name (default "att_output.csv"). - - off - Runs trace collection but not analysis, so it can be analyzed at a later time. Run rocprofv2 ATT [network, file] with the same parameters, removing the application binary, to analyze previously generated traces. - input.txt Required. Used to select specific compute units and other trace parameters. - For first time users, we recommend compiling and running vectorAdd with + For first time users, using the following input file: - ```string + ```bash + # vectoradd att: TARGET_CU=1 SE_MASK=0x1 - SIMD_MASK=0x3 + SIMD_SELECT=0x3 ``` - and histogram with - - ```string + ```bash + # histogram att: TARGET_CU=0 SE_MASK=0xFF - SIMD_MASK=0xF // 0xF for GFX9, SIMD_MASK=0 for Navi + SIMD_SELECT=0xF // 0xF for GFX9, SIMD_MASK=0 for Navi ``` Possible contents: - - att: TARGET_CU=1 //or some other CU [0,15] - WGP for Navi [0,8] + - att: TARGET_CU=1 // or some other CU [0,15] - WGP for Navi [0,8] - SE_MASK=0x1 // bitmask of shader engines. The fewer, the easier on the hardware. Default enables 1 out of 4 shader engines. - - SIMD_MASK=0xF // GFX9: bitmask of SIMDs. Navi: SIMD Index [0-3]. - - DISPATCH=ID,RN // collect trace only for the given dispatch_ID and MPI rank RN. RN ignored for single processes. Multiple lines with varying combinations of RN and ID can be added. + - SIMD_SELECT=0xF // GFX9: bitmask of SIMDs. Navi: SIMD Index [0-3]. Recommended 0xF for GFX9 and 0x0 for Navi. + - DISPATCH=ID // collect trace only for the given dispatch_ID. Multiple lines for can be added. + - DISPATCH=ID,RN // collect trace only for the given dispatch_ID and MPI rank RN. Multiple lines with varying combinations of RN and ID can be added. - KERNEL=kernname // Profile only kernels containing the string kernname (c++ mangled name). Multiple lines can be added. - - PERFCOUNTERS_COL_PERIOD=0x3 // Multiplier period for counter collection [0~31]. 0=fastest (usually once every 16 cycles). GFX9 only. Counters will be shown in a graph over time in the browser UI. - - PERFCOUNTER=counter_name // Add a SQ counter to be collected with ATT; period defined by PERFCOUNTERS_COL_PERIOD. GFX9 only. + - PERFCOUNTERS_CTRL=0x3 // Multiplier period for counter collection [0~31]. 0=fastest. GFX9 only. + - PERFCOUNTER_MASK=0xFFF // Bitmask for perfcounter collection. GFX9 only. + - PERFCOUNTER=counter_name // Add a SQ counter to be collected with ATT; period defined by PERFCOUNTERS_CTRL. GFX9 only. - BUFFER_SIZE=[size] // Sets size of the ATT buffer collection, per dispatch, in megabytes (shared among all shader engines). - ISA_CAPTURE_MODE=[0,1,2] // Set capture mode during kernel dispatch. - 0 = capture symbols only. @@ -366,23 +365,21 @@ Tool used to collect fine-grained hardware metrics. Provides ISA-level instructi - Example for vectoradd. ```bash - # -g add debugging symbols to the binary. Required only for tracking disassembly back to c++. - hipcc -g --save-temps vectoradd_hip.cpp -o vectoradd_hip.exe - # "auto" means to use the automatically captured ISA. "csv" dumps result to "my.csv". - rocprofv2 -i input.txt -o my.csv --plugin att auto --mode csv ./vectoradd_hip.exe + # -g adds debugging symbols to the binary. Required only for tracking disassembly back to c++. + hipcc -g vectoradd_hip.cpp -o vectoradd_hip.exe + # "auto" means to use the automatically captured ISA. "csv" dumps result to "vectoradd_float_v0.csv". + rocprofv2 -i input.txt --plugin att auto --mode csv ./vectoradd_hip.exe ``` - Instruction latencies will be in my.csv + Instruction latencies will be in vectoradd_float_v0.csv - Example with symbolic ISA (as in ROCm 5.7 or previous). ```bash hipcc -g --save-temps vectoradd_hip.cpp -o vectoradd_hip.exe # A custom ISA can be used such as vectoradd_hip-hip-amdgcn-amd-amdhsa-gfx1100.s # Special attention to the correct architecture for the ISA, such as "gfx1100" (navi31). - rocprofv2 -i input.txt --plugin att vectoradd_hip-hip-amdgcn-amd-amdhsa-gfx1100.s --mode network ./vectoradd_hip.exe + rocprofv2 -i input.txt --plugin att vectoradd_hip-hip-amdgcn-amd-amdhsa-gfx1100.s --mode csv ./vectoradd_hip.exe ``` - Then open the browser at `http://localhost:8000` - *** Note: For MPI or long running applications, we recommend to run collection, and later run the parser with already collected data: Run only collection: The assembly file is not used. Use mpirun [...] rocprofv2 [...] if needed. @@ -394,7 +391,13 @@ Tool used to collect fine-grained hardware metrics. Provides ISA-level instructi Remove the binary/application from the command line. ```bash # Only runs the parser on previously collected data. - rocprofv2 -i input.txt --plugin att auto --mode network + rocprofv2 -i input.txt --plugin att auto --mode csv + ``` + + Note 2: By default, ATT only collects a SINGLE kernel dispatch for the whole application, which is the first dispatch matching the given filters (DISPATCH= or KERNEL=). To collect multiple dispatches in a single application run, use: + + ```bash + export ROCPROFILER_MAX_ATT_PROFILES= ``` ***