During a process tear-down we wait on all signals before releasing them:
VirtualGPU::HwQueueTracker::~HwQueueTracker() {
for (auto& signal : signal_list_) {
CpuWaitForSignal(signal);
signal->release();
}
[...]
}
In the case where we exit the process after a GPU error that did not
cause an abort (ulimit -c == 0), waiting for the signal can be skipped.
With the device on the error state, no progress is made, and the signal
is probably never going to be modified again:
inline bool WaitForSignal(hsa_signal_t signal, bool active_wait = false, bool yield = false) {
[...]
if (HIP_SKIP_ABORT_ON_GPU_ERROR && amd::Device::IsGPUInError()) {
ClPrint(amd::LOG_ERROR, amd::LOG_SIG,
"Device not Stable, while waiting for Signal ="
"(0x%lx) for %d ns",
signal.handle, kTimeout4Secs);
return true;
}
[...]
}
However, after calling CpuWaitForSignal, when calling "release", we can
end-up on a signal dtor which also tries to wait on the signal. Because
the GPU is the error state, we never receive the signal, and hang the
process during tear down. This happens with the ProfilingSignal dtor:
ProfilingSignal::~ProfilingSignal() {
if (signal_.handle != 0) {
if (hsa_signal_load_relaxed(signal_) > 0) {
LogError("Runtime shouldn't destroy a signal that is still busy!");
if (hsa_signal_wait_scacquire(signal_, HSA_SIGNAL_CONDITION_LT, kInitSignalValueOne,
kUnlimitedWait, HSA_WAIT_STATE_BLOCKED) != 0) {
}
}
hsa_signal_destroy(signal_);
}
}
This dtor should check that the GPU is not in the error state before
trying to wait, which is what this patch implements.
Bug: SWDEV-555043
Bug: SWDEV-553435
Bug: SWDEV-553679
Bug: SWDEV-555119
ROCm Systems
Welcome to the ROCm Systems super-repo. This repository consolidates multiple ROCm systems projects into a single repository to streamline development, CI, and integration. The first set of projects focuses on requirements for building PyTorch.
Super-repo Status and CI Health
This table provides the current status of the migration of specific ROCm systems projects as well as a pointer to their current CI health.
Key:
- Completed: Fully migrated and integrated. This super-repo should be considered the source of truth for this project. The old repo may still be used for release activities.
- In Progress: Ongoing migration, tests, or integration. Please refrain from submitting new pull requests on the individual repo of the project, and develop on the super-repo.
- Pending: Not yet started or in the early planning stages. The individual repo should be considered the source of truth for this project.
Tentative migration schedule
| Component | Tentative Date |
|---|
*Remaining schedule to be determined.
TheRock CI Status
Note TheRock CI performs multi-component testing on top of builds leveraging TheRock build system.
Nomenclature
Project names have been standardized to match the casing and punctuation of released packages. This removes inconsistent camel-casing and underscores used in legacy repositories.
Structure
The repository is organized as follows:
projects/
amdsmi/
aqlprofile/
clr/
hip/
hipother/
hip-tests/
rccl/
rdc/
rocm-core
rocminfo/
rocmsmilib/
rocprofiler/
rocprofiler-compute/
rocprofiler-register/
rocprofiler-sdk/
rocprofiler-systems/
rocrruntime/
rocshmem/
roctracer/
- Each folder under
projects/corresponds to a ROCm systems project that was previously maintained in a standalone GitHub repository and released as distinct packages. - Each folder under
shared/contains code that existed in its own repository and is used as a dependency by multiple projects, but does not produce its own distinct packages in previous ROCm releases.
Goals
- Enable unified build and test workflows across ROCm libraries.
- Facilitate shared tooling, CI, and contributor experience.
- Improve integration, visibility, and collaboration across ROCm library teams.
Getting Started
To begin contributing or building, see the CONTRIBUTING.md guide. It includes setup instructions, sparse-checkout configuration, development workflow, and pull request guidelines.
License
This super-repo contains multiple subprojects, each of which retains the license under which it was originally published.
📁 Refer to the LICENSE, LICENSE.md, or LICENSE.txt file within each projects/ or shared/ directory for specific license terms.
📄 Refer to the header notice in individual files outside projects/ or shared/ folders for their specific license terms.
Note
: The root of this repository does not define a unified license across all components.
Questions or Feedback?
We're happy to help!