* [SRC] Enable unroll=1 for gfx950
* Fix typo from rebase in generate.py
* Support for unroll=1 and gfx90a when building for all GPU targets
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* Fix collective trace
* Use nontemporal for st_global
* Fix previous commit
* Add HDP flush to data receive path
* Fix previous commit
* Control flushing by NCCL_NET_FORCE_FLUSH and RCCL_NET_HDP_FLUSH
* Introduce RCCL_NET_HDP_FLUSH and RCCL_NET_GDR_FLUSH
Both are on by default. Turn both off will skip all flush will likely
result in data error.
* Enable GDR copy by default
* Remove GDR flush env var because it is disabled by GDC flush
* Output kernel collective trace at comm destroy by default
* Limit kernel timeout messages to 100
* Use system relaxed atomic for loadInt
* Refine timeout messages and use atomic for setting offset from CPU
* Add kernel trace for barrier timeout
* Add backup barrier to avoid race in atomicAdd
* Use different counters for different warps
* Rework barrier implementation
* Fix for other GFX
* Use __hip_atomic_store and __hip_atomic_load
* Fix bug in previous commit
* Don't reset barrier values in running kernel
* Update trace format
* Fix typo
* Switch back to hip_atomic_fetch_add
* Use same barrier implementation for all GFX
* Remove extra threadfence
* Turn off HDP flush by default
Please use RCCL_NET_HDP_FLUSH=1 to switch on HDP flush
* Remove unnecessary changes from alterative barrier implementation
* Added back __threadfence_block
* Revert back to threadfence for gfx other than gfx94x
Add support for alternating rings, allow for cross-nic rings without
cross-rail communication.
Add support for user buffer registration for network send/recv.
Optimize aggregated operations to better utilize all channels.
Add flattening for BCM PCI gen5 switches.
Add support for inter-node NVLink communication
Add support for port fusion in NET/IB.
Add support for ReduceScatter and AllGather using Collnet.
Update net API to v8.
Fix hang during A2A connection.
Add local user buffer registration for NVLink SHARP.
Add tuning plugin support.
Increase net API to v7 to allow for device-side packet reordering;
remove support for v4 plugins.
Add support for RoCE ECE.
Add support for C2C links.
Better detect SHM allocation failures to avoid crash with Bus Error.
Fix missing thread unlocks in bootstrap (Fixes#936).
Disable network flush by default on H100.
Move device code from src/collectives/device to src/device.