diff --git a/projects/clr/CHANGELOG.md b/projects/clr/CHANGELOG.md
index cad78cc1ce..7a44af016c 100644
--- a/projects/clr/CHANGELOG.md
+++ b/projects/clr/CHANGELOG.md
@@ -2,6 +2,13 @@
 
 Full documentation for HIP is available at [rocm.docs.amd.com](https://rocm.docs.amd.com/projects/HIP/en/latest/index.html)
 
+## HIP 8.0 for ROCm 8.0
+
+### Added
+
+* New HIP APIs
+    - `hipKernelGetParamInfo` returns the offset and size of a kernel parameter
+
 ## HIP 7.2 for ROCm 7.2
 
 ### Added
@@ -17,15 +24,27 @@ Full documentation for HIP is available at [rocm.docs
     - `hipLibraryGetKernelCount` gets kernel count in library
     - `hipStreamCopyAttributes` copies attributes from source stream to destination stream
     - `hipOccupancyAvailableDynamicSMemPerBlock` returns dynamic shared memory available per block when launching numBlocks blocks on CU.
-    - `hipKernelGetParamInfo` returns the offset and size of a kernel parameter
-* New HIP flags
-    - `hipMemLocationTypeHost`, enables handling virtual memory management in host memory location, in addition to device memory.
-    - `hipHostRegisterIoMemory` is supported in `hipHostRegister`, used to register I/O memory with HIP runtime so it can be accessed by the GPU.
+* New HIP flags
+    - `hipMemLocationTypeHost`, enables handling virtual memory management in host memory location, in addition to device memory.
     - Support for flags in `hipGetProcAddress`, enables searching for the per-thread version symbols.
         - `HIP_GET_PROC_ADDRESS_DEFAULT`
         - `HIP_GET_PROC_ADDRESS_LEGACY_STREAM`
         - `HIP_GET_PROC_ADDRESS_PER_THREAD_DEFAULT_STREAM`
 
+### Resolved issues
+
+* Corrected the calculation of the value of maximum shared memory per multiprocessor, in HIP device properties.
+
+### Optimized
+
+* Graph node scaling:
+HIP runtime implements optimized doorbell ring mechanism for certain topologies of graph execution. It enables efficient batching of graph nodes. This enhancement provides better alignment with CUDA Graph optimizations.
+HIP also adds a new performance test for HIP graphs with programmable topologies to measure graph performance across different structures. The test evaluates graph instantiation time, first launch time, repeat launch times, and end-to-end execution for various graph topologies. The test implements comprehensive timing measurements including CPU overhead and device execution time.
+* Back memory set (`memset`) optimization:
+HIP runtime now implements a back memory set (memset) optimization to improve how `memset` nodes are processed during graph execution. This enhancement specifically handles varying number of AQL (Architected Queue Language) packets for `memset` graph node due to graph node set params for AQL batch submission approach.
+* Async handler performance improvement:
+HIP runtime has removed the lock contention in async handler enqueue path. This enhancement reduces runtime overhead and maximizes GPU throughput, for asynchronous kernel execution, especially in multi-threaded applications.
+
 ## HIP 7.1.1 for ROCm 7.1.1
 
 ### Added