* Use one side stream per process
* Handle multiple GPUs per process
* Reset stream when not found
* Address review comments
* Fix missing mutex initializer
[ROCm/rccl commit: 185e78a8f0]
* kfdtest: Replace pthread with std::thread
Modify the concurrent kfdtest to use std::thread
instead of pthread, and in the process modify KFDTestLaunch
to take a member function of the test instance
instead of a static function.
Convert KFDQMTest to pass in a member function for
multi-gpu kfdtest.
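The move from a static pthread entry point to a member function can be sketched as follows. This is only an illustration under assumptions: `MultiGpuTest`, `RunOnGpu`, and `LaunchPerGpu` are hypothetical names, not the actual kfdtest classes or the real `KFDTestLaunch` signature.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for a test fixture; not a real kfdtest class.
class MultiGpuTest {
public:
    explicit MultiGpuTest(int gpuCount) : results_(gpuCount, -1) {}

    // Per-GPU test body. Being a non-static member, it can use fixture
    // state directly instead of receiving it through a void* argument.
    void RunOnGpu(int gpuIndex) { results_[gpuIndex] = gpuIndex * 2; }

    const std::vector<int>& results() const { return results_; }

private:
    std::vector<int> results_;  // each thread writes only its own slot
};

// Hypothetical launcher: runs `body` on `test` once per GPU, one
// std::thread each, then joins all of them before returning.
template <typename Test>
void LaunchPerGpu(Test& test, void (Test::*body)(int), int gpuCount) {
    std::vector<std::thread> threads;
    threads.reserve(gpuCount);
    for (int i = 0; i < gpuCount; ++i) {
        // std::thread accepts a member-function pointer followed by the
        // object pointer and the remaining arguments.
        threads.emplace_back(body, &test, i);
    }
    for (auto& t : threads) {
        t.join();
    }
}
```

A call would look like `LaunchPerGpu(test, &MultiGpuTest::RunOnGpu, gpuCount);`. Compared with pthread, no `void*` casting or static trampoline function is needed.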
* kfdtest: Convert KFDPerfCountersTest to use std::thread
Convert KFDPerfCountersTest to use std::thread for
multi-gpu kfdtest.
* kfdtest: Convert KFDGraphicsInterop to use std::thread
Convert KFDGraphicsInterop to use std::thread for
multi-gpu kfdtest.
* kfdtest: Convert KFDGWSTest to use std::thread
Convert KFDGWSTest to use std::thread for
multi-gpu kfdtest.
* kfdtest: Convert KFDCWSRTest to use std::thread
Convert KFDCWSRTest to use std::thread for
multi-gpu kfdtest.
* kfdtest: Convert KFDEventTest to use std::thread
Convert KFDEventTest to use std::thread for
multi-gpu kfdtest.
* kfdtest: Convert KFDExceptionTest to use std::thread
Convert KFDExceptionTest to use std::thread for
multi-gpu kfdtest.
* kfdtest: Convert KFDLocalMemoryTest to use std::thread
Convert KFDLocalMemoryTest to use std::thread for
multi-gpu kfdtest.
* kfdtest: Convert KFDMemoryTest to use std::thread
Convert KFDMemoryTest to use std::thread for
multi-gpu kfdtest.
* kfdtest: Convert KFDSVMRangeTest to use std::thread
Convert KFDSVMRangeTest to use std::thread for
multi-gpu kfdtest.
* kfdtest: Convert KFDHWSTest to use std::thread
Convert KFDHWSTest to use std::thread for
multi-gpu kfdtest.
* kfdtest: Remove pthread multigpu test structure
Remove older multi-gpu test framework which
uses pthread.
Core dumps are not supported for gfx110x, but should be possible for
gfx115x. The current code disables core dumps completely for all gfx11xx
agents; relax this to allow gfx115x.
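A minimal sketch of the relaxed check, assuming the decision can be made from the agent's gfx target name. `CoreDumpAllowed` is a hypothetical helper, not the function touched by this change, and treating non-gfx11 targets as allowed is an assumption made only for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: decides whether core dumps stay enabled for an agent,
// based only on its gfx target name (e.g. "gfx1100", "gfx1151").
bool CoreDumpAllowed(const std::string& gfxName) {
    const bool isGfx11 = gfxName.compare(0, 5, "gfx11") == 0;
    const bool isGfx115 = gfxName.compare(0, 6, "gfx115") == 0;
    // Previously every gfx11xx agent was excluded; now gfx115x is let through.
    // Assumption: targets outside gfx11xx are unaffected by this check.
    return !isGfx11 || isGfx115;
}
```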
* clr: Use graph segment scheduling to process HIP Graphs
* Add a broader path to use packet capture for all topologies
* Refactor code
* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING to toggle between the new and
classic paths; enabled by default
* clr: Few fixes and improvements
* clr: Detect complex graphs to take classic path
* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING=2 to force segment scheduling
path
* clr: Fix a corner-case stack corruption
* clr: Track commands of segments instead of snapshots
* clr: Fix Batch dispatch logic
* Track fence_dirty_ flag for command of other streams
* Dependency resolution markers can now accommodate a dirty fence across
streams
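The path selection described in the bullets above can be sketched roughly as below. This is not the clr implementation: `GraphPath`, `SelectGraphPath`, and `ReadSegmentSchedulingMode` are hypothetical names, reading the flag through `std::getenv` is an assumption, and treating `0` as "classic path" is inferred from "toggle" rather than stated in the commit.

```cpp
#include <cassert>
#include <cstdlib>

enum class GraphPath { Classic, SegmentScheduling };

// Hypothetical selection logic:
//   mode 0 (assumed): always take the classic path
//   mode 1 (default): segment scheduling, but complex graphs fall back to classic
//   mode 2:           force segment scheduling even for complex graphs
GraphPath SelectGraphPath(int mode, bool graphIsComplex) {
    if (mode <= 0) return GraphPath::Classic;
    if (mode >= 2) return GraphPath::SegmentScheduling;
    return graphIsComplex ? GraphPath::Classic : GraphPath::SegmentScheduling;
}

// Hypothetical reader for the flag; an unset or empty variable means
// "enabled by default" (mode 1).
int ReadSegmentSchedulingMode() {
    const char* value = std::getenv("DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING");
    if (value == nullptr || *value == '\0') return 1;
    return std::atoi(value);
}
```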
---------
Co-authored-by: Ioannis Assiouras <Ioannis.Assiouras@amd.com>
Co-authored-by: Godavarthy Surya, Anusha <agodavar@amd.com>
* gda: add check for active interfaces when selecting the GDA backend
* fix __func__ macro in rocshmem_ctx_pe_quiet
* gda: switch to more generic RDMA NIC term in has_active_ib_interface
* gda: add active MLX5 and Pensando vendor ID checks for backend selection
[ROCm/rocshmem commit: 29000a5644]
## Motivation
The idea is to unify how and where we store our traces. The current implementation uses `trace_cache` for rocpd traces, but perfetto is inlined inside each module. This change gives us a single point in the code where we collect data, process it, and store it in the desired format. This lets us declutter the code further and have a single point of responsibility and a single point of failure.
## Technical Details
A new `processor` (perfetto_post_processing.cpp) is added to the `trace_cache`; its purpose is to use the cached data to populate perfetto tracks. The cache manager is responsible for holding the instance of this processor and managing its lifetime.
While working on this ticket, I also noticed that the program would SEGFAULT when ROCPROFSYS_ROCM_DOMAINS=roctx was set, even though the docs say this is supported. Fixed that.
Also noticed that the timemory push/pop in rocprofiler-sdk.cpp always used category::rocm_marker_api instead of CategoryT. Fixed that as well.
* Enable HOST ompvv runtime-instrumentation ctests
* Fix rocprofiler-systems-avail-regex-negation test failure
* Exclude problematic function from instrumentation
* Make push pop skip an env option for ctests
* Remove SKIP_PUSH_POP_CHECK from argument parse
---------
Co-authored-by: David Galiffi <David.Galiffi@amd.com>
- Added `binary-rewrite-cleanup` and `runtime-instrument-cleanup` tests that remove instrumented binaries and output directories using `cmake -E rm -rf`
- Implemented CMake test fixtures (`FIXTURES_SETUP` and `FIXTURES_CLEANUP`) to establish proper test ordering:
  - `binary-rewrite` sets up the `binary-rewrite-fixture`
  - `binary-rewrite-run` and validation tests require this fixture
  - `binary-rewrite-cleanup` performs cleanup for this fixture
  - Same pattern applied for `runtime-instrument`
- Extended `ROCPROFILER_SYSTEMS_ADD_PYTHON_TEST` to accept `FIXTURES_REQUIRED` parameter
- Updated validation tests to require appropriate cleanup fixtures based on test name pattern matching
- Added fixture requirements to Python code-coverage tests
Update the format script to use absolute path for clang-format-diff.py
instead of relative path. This ensures the script works correctly
regardless of the current working directory when executed.
- Change from './clang-format-diff.py' to '${root}/projects/rocr-runtime/clang-format-diff.py'
- Improves script reliability and portability
Signed-off-by: Honglei Huang <honghuan@amd.com>
* libhsakmt/virtio: change shmem size to 80
Some DGPU props have a lot of information,
so it is necessary to increase the size of shmem.
Signed-off-by: Honglei Huang <honghuan@amd.com>
* libhsakmt/virtio: use BO handle instead of pointer in memory registration
Change vhsakmt_map_to_gpu() return type from void* to vhsakmt_bo_handle
to properly handle buffer object information. This allows access to
both the host address and resource ID needed for memory registration.
Signed-off-by: Honglei Huang <honghuan@amd.com>
* libhsakmt/virtio: Improve memory mapping logic
- Update vhsakmt_mappable() to check NoAddress flag and require HostAccess
- Remove mappable checks in cpu_map/unmap to allow all BOs to be mapped
- Set BO flags properly in vhsakmt_alloc_memory and scratch memory creation
- Ensure scratch memory is correctly flagged for proper handling
Signed-off-by: Honglei Huang <honghuan@amd.com>
* libhsakmt/virtio: add no svm mode for libhsakmt virtio
Add a no-SVM mode to the libhsakmt virtio driver. In no-SVM mode, userptrs
must be managed by the UMD, so add an interval tree to track them.
New Features:
- Add augmented red-black tree based interval tree implementation
* Implement RB-tree insertion, deletion, and color balancing
* Provide interval query for fast overlapping range lookup
* Based on Linux kernel's augmented rbtree implementation
- Improve userptr memory management
* Use interval tree to efficiently track userptr memory regions
* Support finding registered memory within given address ranges
* Optimize memory mapping and unmapping performance
Signed-off-by: Honglei Huang <honghuan@amd.com>
---------
Signed-off-by: Honglei Huang <honghuan@amd.com>