* Implement AMDGPU driver info and GPU VRAM attributes in the system info
section of the analysis report.
* Backward compatibility for rocprofiler-sdk avail module path migration
* Fix roofline calculation where AI data points are N/A
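The commit does not say how the N/A case is handled. As a minimal sketch, assuming arithmetic intensity (AI) is FLOPs divided by bytes moved and that an N/A value shows up as a missing or zero operand, one way to guard the calculation is to skip such points rather than plot them. The function names and the tuple layout here are hypothetical, not the project's API.

```python
from typing import Optional


def arithmetic_intensity(flops: Optional[float],
                         bytes_moved: Optional[float]) -> Optional[float]:
    """Return FLOPs per byte, or None when the value is not computable."""
    # Treat a missing counter or a non-positive byte count as N/A.
    if flops is None or bytes_moved is None or bytes_moved <= 0:
        return None
    return flops / bytes_moved


def plottable_points(kernels):
    """Keep only (name, AI) pairs whose AI is defined.

    `kernels` is an iterable of (name, flops, bytes_moved) tuples
    (a hypothetical layout chosen for this example).
    """
    points = []
    for name, flops, bytes_moved in kernels:
        ai = arithmetic_intensity(flops, bytes_moved)
        if ai is not None:
            points.append((name, ai))
    return points
```

Returning None instead of raising keeps the N/A decision in one place, so callers only need to filter.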
* Add WIP workflow step to delete untagged images older than 1 week
* Formatting fix for rocprofiler-systems-ghcr.yml
* Move step to new workflow
* Remove needs parameter from cleanup-rocprofiler-images
* Remove expand-packages option
* Expand cleanup for every OS
* Revert spacing change to rocprofiler-systems-ghcr.yml
* Turn off dry-run to do an initial clean
* Switch dry-run to be only on PR
* Added comment about schedule
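The selection rule behind the cleanup steps above (untagged images older than 1 week, dry-run only on PRs) can be sketched as follows. This is an illustration of the logic only, not the workflow itself: the image dictionary layout and both function names are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def select_stale_untagged(images, now, max_age=timedelta(weeks=1)):
    """Return ids of images that have no tags and are older than max_age.

    Each image is a dict with "id", "tags" (list) and "created"
    (timezone-aware datetime); this layout is hypothetical.
    """
    cutoff = now - max_age
    return [img["id"] for img in images
            if not img["tags"] and img["created"] < cutoff]


def should_dry_run(event_name):
    """Dry-run on pull requests; delete for real on other triggers."""
    return event_name == "pull_request"
```

Keeping the age check and the dry-run decision separate makes it easy to run the selection on a PR and only log what would be deleted.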
* Refactor papi enumeration to fix a hang on Intel systems
- Add an exclude argument to available_events_info() so that
perf_event_uncore can be skipped; enumerating it caused a hang-like
stall on Intel systems with a large number of uncore events.
- During early initialization, enumerate available papi events only
when the user has specified papi events
- Move papi available event query for ROCPROFSYS_SAMPLING_OVERFLOW_EVENT
config setting to the avail component, to move the heavy logic outside
initialization.
- Make the category option of rocprof-sys-avail -H -c case-insensitive
- Provide a way to query the available overflow events that can be
specified for ROCPROFSYS_SAMPLING_OVERFLOW_EVENT, using the new command
option rocprof-sys-avail -H -c overflow
* Update projects/rocprofiler-systems/source/bin/rocprof-sys-avail/common.cpp
Co-authored-by: Milan Radosavljevic <milan.radosavljevic@amd.com>
* Update timemory submodule pointer
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
* Fix errors on compile
* Change 1: Optimization for the category matching lambda
Optimization changes.
* Modify the rocprof-sys-avail -c option for overflow
Overflow should not be displayed as a device in rocprof-sys-avail -H -c CPU.
Users can instead apply a regex to the summary, where overflow is appended to the description.
For example: rocprof-sys-avail -H -c CPU -d -r overflow
* Revert change to column width
---------
Signed-off-by: David Galiffi <David.Galiffi@amd.com>
Co-authored-by: Milan Radosavljevic <milan.radosavljevic@amd.com>
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
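The case-insensitive category match and the regex-on-description filter described in this change can be illustrated with a short sketch. This is not the rocprof-sys-avail implementation (which is C++); the entry layout, the function names, and the choice of a case-sensitive regex are all assumptions made for the example.

```python
import re


def matches_category(entry_categories, requested):
    """True if any entry category equals a requested one, ignoring case."""
    # Normalize the requested categories once instead of per comparison.
    wanted = {c.casefold() for c in requested}
    return any(c.casefold() in wanted for c in entry_categories)


def filter_entries(entries, categories, description_regex=None):
    """Return names of entries in the given categories.

    If description_regex is given, an entry must also have a description
    that the regex matches somewhere (re.search semantics).
    """
    pattern = re.compile(description_regex) if description_regex else None
    names = []
    for entry in entries:
        if not matches_category(entry["categories"], categories):
            continue
        if pattern and not pattern.search(entry["description"]):
            continue
        names.append(entry["name"])
    return names
```

With this shape, a query such as category "cpu" plus regex "overflow" selects only CPU entries whose description mentions overflow, which mirrors the `-c CPU -d -r overflow` usage in the message.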
- Redesigned buffer_storage with a flush_worker pattern for better thread management and resource cleanup
- Introduced type-safe abstractions through new components: cacheable.hpp, cache_type_traits.hpp, sample_processor.hpp, and type_registry.hpp
- Optimized type erasure implementation in sample processor to reduce runtime overhead
- Renamed rocpd_post_processing to rocpd_processor and restructured the processing pipeline
- Removed storage_parser.cpp and integrated functionality into header-based template implementation
- Enhanced cache_manager with improved processing workflow and better separation of concerns
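The flush_worker idea mentioned above, a dedicated thread that drains buffered data and is shut down deterministically, can be sketched in a few lines. This is a generic illustration of the pattern under stated assumptions, not the buffer_storage code: the class name, the sink callable, and the sentinel-based shutdown are choices made for this example.

```python
import queue
import threading


class FlushWorker:
    """Background thread that hands queued chunks to a sink, in order."""

    _STOP = object()  # sentinel that tells the worker to exit

    def __init__(self, sink):
        self._sink = sink
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, chunk):
        """Queue a chunk for flushing; returns immediately."""
        self._queue.put(chunk)

    def _run(self):
        while True:
            item = self._queue.get()
            if item is self._STOP:
                break
            self._sink(item)

    def close(self):
        """Flush everything queued so far, then join the worker thread."""
        self._queue.put(self._STOP)
        self._thread.join()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```

Because the sentinel is queued behind all pending chunks, close() returns only after every submitted chunk has reached the sink, which is the resource-cleanup property the pattern is meant to provide.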
* add relaxed_ordering option
Add an environment variable that controls whether the
IBV_ACCESS_RELAXED_ORDERING flag is set when registering memory with the
ibv_reg_mr* functions.
* missed a spot
[ROCm/rocshmem commit: 2ae2033648]
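As a rough sketch of the relaxed_ordering option, the access flags passed at memory registration can be built from an environment variable. The commit does not name the variable, so `EXAMPLE_RELAXED_ORDERING` below is a placeholder, and the base flag set is an assumption. The numeric flag values are written to mirror the libibverbs header as best understood here and should be checked against `verbs.h` before being relied on.

```python
import os

# Values intended to mirror infiniband/verbs.h; verify against the header.
IBV_ACCESS_LOCAL_WRITE = 1 << 0
IBV_ACCESS_REMOTE_WRITE = 1 << 1
IBV_ACCESS_REMOTE_READ = 1 << 2
IBV_ACCESS_RELAXED_ORDERING = 1 << 20


def mr_access_flags(env=os.environ, var="EXAMPLE_RELAXED_ORDERING"):
    """Return access flags, adding relaxed ordering if the variable is set.

    `var` is a hypothetical name; the real variable is not given in the
    commit message.
    """
    flags = (IBV_ACCESS_LOCAL_WRITE
             | IBV_ACCESS_REMOTE_WRITE
             | IBV_ACCESS_REMOTE_READ)
    if env.get(var, "0").strip().lower() in ("1", "true", "on", "yes"):
        flags |= IBV_ACCESS_RELAXED_ORDERING
    return flags
```

Computing the flags in one helper is one way to avoid the "missed a spot" situation, since every registration call site then picks up the option automatically.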
* Removing FP8 from peak VALU datatypes - PEAK_OPS_DATATYPES.
* Similar change for BF16.
* Roofline binaries from rocm-amdgpu-bench generated 10/22.
https://github.com/ROCm/rocm-amdgpu-bench/commit/2113ef1f5eada8a4a6e44e6d07fd6abac9b0a3f8
Bins include change that removes FP8 and BF16 peak VALU benchmarks.
Built and tested on rhel8, azl3, ubuntu22.04, sles15sp6.
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
* Re-committing the bins
Accidentally copied over bins from the wrong folder earlier; caught by Gowthami during testing.
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
* Updated changelog
* gersemi fix
* Changelog corrected.
* Changelog fix.
* Adding this to the 7.2.0 section to be picked up in an RC build.
* Moving changelog entry into the unreleased section - the team reconfirmed the cutoff date after I requested this change, so I am correcting the mistake in my earlier ask.
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
---------
Signed-off-by: Carrie Fallows <Carrie.Fallows@amd.com>
Co-authored-by: Carrie Fallows <Carrie.Fallows@amd.com>
Co-authored-by: JeniferC99 <150404595+JeniferC99@users.noreply.github.com>
* refactor: move duplicated path helpers into common/path.hpp
* update rocprof-sys-instrument to use shared path utility
* Add path::realpath(std::string[, std::string*]) helper function in common/path.hpp for binaries
* common: centralize remove_env implementation in environment.hpp
* remove unused includes from rocprof-sys binaries and argparse
* changing set to unordered_set wherever sorting is not required and additional cleanup
* review comment incorporated
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* copilot review for remove_env incorporated
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Remove testing of data types
As the collective is templated, we are just testing if sizeof(T) works
* Added single-threaded variants
* Applied thread puts optimization to barrier
* Apply single threaded optimization to alltoall
* This optimization only works on bnxt, so place a switch to protect it
* Handle the edge case where the thread count is smaller than the number of PEs
[ROCm/rocshmem commit: 1347d5d628]
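The commit does not describe how the case of fewer threads than PEs is handled. One common approach, shown here purely as an assumption-labeled sketch, is a strided assignment in which each thread loops over several PEs, so every PE is covered exactly once regardless of the thread count. The function name is hypothetical.

```python
def pes_for_thread(thread_id, num_threads, num_pes):
    """Return the PEs that one thread is responsible for.

    Thread t handles PEs t, t + num_threads, t + 2 * num_threads, ...
    so the assignment still covers all PEs when num_threads < num_pes.
    """
    return list(range(thread_id, num_pes, num_threads))
```

When there are at least as many threads as PEs, each of the first num_pes threads gets a single PE and the remaining threads get none, so the same loop serves both cases.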