* Implement `rocshmem_ptr` in IPC conduit
* tests: add functional test for `rocshmem_ptr`
- Add safety check for pointer access and condition check before printing results for `rocshmem_ptr` test
- Use `rocshmem_put` to store `rocshmem_ptr` availability for data validation
* Refactor `Barrier_all` and `Sync_all` to use default context
- Removed context-specific implementations of barrier_all and sync_all
- Added barrier_all and sync_all to the default context implementation
- Updated functional tests to use the default context for barrier_all and sync_all
* Update `Barrier_all` and `Sync_all` API usage in documentation
* Update `CHANGELOG`
---------
Co-authored-by: Yiltan <ytemucin@amd.com>
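
The safety check added to the `rocshmem_ptr` test can be pictured with a host-only sketch. Everything here is an assumption for illustration: `mock_ptr` is a hypothetical stand-in, not the rocSHMEM API, and it simply pretends that only even-numbered PEs are reachable by direct load/store. The point is the pattern: test the returned pointer for null before dereferencing it, and only validate results when the pointer was available.

```cpp
#include <cassert>

// Hypothetical setup: four PEs, each owning one int of "remote" memory.
constexpr int kNumPes = 4;
static int g_remote[kNumPes] = {0, 0, 0, 0};

// Hypothetical stand-in for a ptr-style query. Returns a directly usable
// address for PE `pe`, or nullptr when that PE is not reachable.
// In this mock, only even-numbered PEs are reachable.
int *mock_ptr(int *dest, int pe) {
  (void)dest;  // the mock ignores the symmetric address
  assert(pe >= 0 && pe < kNumPes);
  return (pe % 2 == 0) ? &g_remote[pe] : nullptr;
}

// Stores `value` through the returned pointer only when it is non-null.
// Returns true when the direct store happened; on false, a caller would
// fall back to a put-style transfer and skip pointer-based validation.
bool store_if_reachable(int *dest, int pe, int value) {
  int *remote = mock_ptr(dest, pe);
  if (remote == nullptr) {
    return false;
  }
  *remote = value;
  return true;
}
```

A test built this way prints or validates a result for a PE only when `store_if_reachable` returned true for it.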
* Add dlmalloc_strat allocator strategy
- Use mspace variant to ease encapsulation
- Make pow2bins and dlmalloc cmake selectable
* Add unit tester for dlmalloc; rework single_heap and pow2bins unit testers accordingly
- Add dlmalloc get_used/get_avail, and give every allocator strategy a get_used
- Rework memallocator unit tests: bin size is per strat, alignment is verified in singleheap
* Bugfix: dlmalloc exposed that the pingpong test wrote past the end of its allocation with `-w 32`
* Clean up iostream leakage and mixed usage of `cerr` and `fprintf(stderr, ...)`
---------
Signed-off-by: Aurelien Bouteiller <aurelien.bouteiller@amd.com>
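
As a rough illustration of the `get_used`/`get_avail` accounting mentioned above, here is a minimal sketch using a simple bump allocator. This is not the dlmalloc, pow2bins, or single_heap code: the class name `BumpStrategy`, the constant `kAllocFail`, and the offset-based interface are all hypothetical. It only shows how a unit test can check usage counters and alignment after each allocation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical failure marker returned when a request does not fit.
constexpr std::size_t kAllocFail = static_cast<std::size_t>(-1);

// Hypothetical strategy: hands out aligned offsets from a fixed-size heap
// and tracks how many bytes are in use.
class BumpStrategy {
 public:
  explicit BumpStrategy(std::size_t capacity, std::size_t alignment = 16)
      : capacity_(capacity), alignment_(alignment), used_(0) {
    // The rounding below requires a power-of-two alignment.
    assert(alignment_ != 0 && (alignment_ & (alignment_ - 1)) == 0);
  }

  // Returns an offset into the heap, or kAllocFail when out of space.
  std::size_t alloc(std::size_t bytes) {
    if (bytes > capacity_) return kAllocFail;  // also avoids overflow below
    std::size_t rounded = (bytes + alignment_ - 1) & ~(alignment_ - 1);
    if (rounded > capacity_ - used_) return kAllocFail;
    std::size_t offset = used_;
    used_ += rounded;
    return offset;
  }

  std::size_t get_used() const { return used_; }
  std::size_t get_avail() const { return capacity_ - used_; }

 private:
  std::size_t capacity_;
  std::size_t alignment_;
  std::size_t used_;
};
```

With counters like these, a test can assert that `get_used() + get_avail()` stays equal to the heap size and that every returned offset is a multiple of the alignment.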
Show and log what the functional test driver is running
* Log errors in the log file
* list all failed tests at the end
* pretty colors :x
* Print stderr when the test has failed
---------
Signed-off-by: Aurelien Bouteiller <aurelien.bouteiller@amd.com>
* Add thread, wavefront, and workgroup-level `barrier` APIs in IPC and RO conduits; remove collectives on default context
- Implemented `barrier` APIs for thread, wavefront, and workgroup scopes
- Added support in both IPC and RO conduits
- Added functional tests to cover all `barrier` APIs
- Removed collective operations on default context
* Add thread, wavefront, and workgroup-level `sync` APIs in IPC and RO conduits.
- Implemented `sync` APIs for thread, wavefront, and workgroup scopes
- Added support in both IPC and RO conduits
- Added functional tests to cover all `sync` APIs
* update naming convention for context-based `barrier` APIs
* Fix deadlock in `rocshmem_ctx_wg_barrier_all` API in IPC conduit by adding per-context pSync buffers and context IDs
- Added separate pSync buffers for each device context
- Resolved deadlock when invoking barrier API (`rocshmem_ctx_wg_barrier_all`) concurrently from multiple contexts
* Update barrier_all functional tests for multi-context support
* Add thread, wavefront, and workgroup-level barrier_all APIs in IPC and RO conduits
- Implemented barrier_all APIs at thread, wavefront, and workgroup granularity
- Added support in both IPC and RO conduits
- Updated functional tests to cover all `barrier_all` APIs
* Add thread, wavefront, and workgroup-level sync_all APIs in IPC and RO conduits
- Implemented sync_all APIs for thread, wavefront, and workgroup scopes
- Added support in both IPC and RO conduits
- Added functional tests to cover all `sync_all` APIs
* Update HIP version check for compatibility with versions >= 5.5
* Update memory allocator for context BlockHandle
- Replaced `HIPAllocator` with `HIPDefaultFinegrainedAllocator` for context `BlockHandle`.
* Update run commands for `rocshmem_g` and `rocshmem_p` functional tests
* add team-barrier implementation
Add a team-barrier API and implementation in the IPC and RO conduits.
Clean up some of the logic in the RO conduit to distinguish between
sync, sync_all, barrier, and barrier_all.
* add team_barrier_tests to functional tests
* Update primitive tests for multi-workgroup support
* Update workgroup primitive tests for multi-workgroup support
* Update wavefront primitive tests for multi-workgroup support
* Update team based primitive tests for multi-workgroup support
* Update RMA functional tests to capture timing after quiet call
- Modified RMA functional tests to record the time after a `quiet` call in thread, wavefront, and workgroup RMA calls.
* Improve error handling and memory management
- Replaced `cout` with `cerr` for improved error reporting.
- Ensured all allocated memory is freed when `rocshmem_malloc` fails.
* Update start time in primitive tests and latency calculations
- Modified primitive tests to capture the earliest start time.
- Updated latency calculations in functional tests.
* Remove `GetSwarmTester`
* Update start time in team primitive tests
* Invoke quiet call from a single thread within a block on a rocshmem context
- Remove hdp and ipc pointers from BlockHandle, align RO stats with RO contexts
- Add run commands for `rocshmem_g` and `rocshmem_p` API tests in driver.sh
- Allocate rocshmem API return buffers based on number of device contexts.
- Associate status flag address with blocking calls and remove threadId dependency
- Associated the status flag address with each blocking call request to notify the GPU thread.
- Removed dependency on threadId for determining the appropriate status flag index.
- Move status flag buffer allocation to backend.
- Initialize allocated memory to zero
* Implemented tiled version of put*_wave and get*_wave functions
* Maintain single kernel that supports both tiled and untiled versions
* Disable IPC in the default RO build script
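
The per-context pSync change listed earlier (the `rocshmem_ctx_wg_barrier_all` deadlock fix) can be sketched on the host as follows. This is an assumption-laden illustration, not the conduit code: `PSyncPool`, its layout, and the slice size are hypothetical. It shows the idea that each context ID indexes its own non-overlapping slice of synchronization words, so barriers running concurrently on different contexts never touch the same flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pool: one contiguous buffer of pSync words, split into
// equal-sized, non-overlapping slices indexed by context ID.
class PSyncPool {
 public:
  PSyncPool(std::size_t num_contexts, std::size_t words_per_context)
      : words_per_context_(words_per_context),
        storage_(num_contexts * words_per_context, 0L) {}

  // Returns the first word of the slice owned by context `ctx_id`.
  long *slice(std::size_t ctx_id) {
    assert(ctx_id * words_per_context_ < storage_.size());
    return storage_.data() + ctx_id * words_per_context_;
  }

  std::size_t words_per_context() const { return words_per_context_; }

 private:
  std::size_t words_per_context_;  // declared first: used to size storage_
  std::vector<long> storage_;      // zero-initialized pSync words
};
```

A barrier on context 0 that writes into `slice(0)` leaves `slice(1)` untouched, which is the property a shared pSync buffer cannot provide.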