Doc update - describe debug techniques

Also tweak sample to remove unneeded HIP_KERNEL_NAME.
Comment update
This commit is contained in:
Ben Sander
2017-01-19 12:33:06 -06:00
والد 2ffc9f4e22
کامیت ca1cef4e06
3فایلهای تغییر یافته به همراه49 افزوده شده و 2 حذف شده
@@ -315,7 +315,7 @@ libhsa-runtime64.so.1->hsaKmtDeregisterMemory(0x7f7776d3e010, 0x7f7776d3e010, 0x
```
Some key information from the trace above.
- The trace snippet shows the execution of a hipMemcpy API, bracketed by the first and last message in the trace output. The messages show the thread id and API sequence number (`1.17`). ltrace output intermixes messages from all threads, so the HIP debug information can be useful to determine which threads are executing.
- Thy trace snippet shows the execution of a hipMemcpy API, bracketed by the first and last message in the trace output. The messages show the thread id and API sequence number (`1.17`). ltrace output intermixes messages from all threads, so the HIP debug information can be useful to determine which threads are executing.
- The code flows through HIP APIs into ROCr (HSA) APIs (hsa*) and into the thunk (hsaKmt*) calls.
- The HCC runtime is "libmcwamp_hsa.so" and the HSA/ROCr runtime is "libhsa-runtime64.so".
- In this particular case, the memory copy is for unpinned memory, and the selected copy algorithm is to pin the host memory "in-place" before performing the copy. The signaling APIs and calls to pin ("lock", "register") the memory are readily apparent in the trace output.
@@ -362,3 +362,49 @@ TargetAddress:0x5ec7e9000
...
-->0x5ec7e9000-0x5f7e28fff:: allocSeqNum:488 hostPointer:(nil) devicePointer:0x5ec7e9000 sizeBytes:191102976 isInDeviceMem:1 isAmManaged:1 appId:0 appAllocFlags:0 appPtr:(nil)
```
- Debugging GPUVM fault.
For example:
```
Memory access fault by GPU node-1 on address 0x5924000. Reason: Page not present or supervisor privilege.
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffdffb5700 (LWP 14893)]
0x00007ffff2057c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007ffff2057c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff205b028 in __GI_abort () at abort.c:89
#2 0x00007ffff6f960eb in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#3 0x00007ffff6f99ea5 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#4 0x00007ffff6f78107 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#5 0x00007ffff744f184 in start_thread (arg=0x7fffdffb5700) at pthread_create.c:312
#6 0x00007ffff211b37d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) info threads
Id Target Id Frame
4 Thread 0x7fffdd521700 (LWP 14895) "caffe" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
3 Thread 0x7fffddd22700 (LWP 14894) "caffe" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
* 2 Thread 0x7fffdffb5700 (LWP 14893) "caffe" 0x00007ffff2057c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
1 Thread 0x7ffff7fa6ac0 (LWP 14892) "caffe" 0x00007ffff6f934d5 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7fa6ac0 (LWP 14892))]
#0 0x00007ffff6f934d5 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
(gdb) bt
#0 0x00007ffff6f934d5 in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#1 0x00007ffff6f929ba in ?? () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#2 0x00007fffe080beca in HSADispatch::waitComplete() () from /opt/rocm/hcc/lib/libmcwamp_hsa.so
#3 0x00007fffe080415f in HSADispatch::dispatchKernelAsync(Kalmar::HSAQueue*, void const*, int, bool) () from /opt/rocm/hcc/lib/libmcwamp_hsa.so
#4 0x00007fffe080238e in Kalmar::HSAQueue::dispatch_hsa_kernel(hsa_kernel_dispatch_packet_s const*, void const*, unsigned long, hc::completion_future*) () from /opt/rocm/hcc/lib/libmcwamp_hsa.so
#5 0x00007ffff7bb7559 in hipModuleLaunchKernel () from /opt/rocm/hip/lib/libhip_hcc.so
#6 0x00007ffff2e6cd2c in mlopen::HIPOCKernel::run (this=0x7fffffffb5a8, args=0x7fffffffb2a8, size=80) at /root/MIOpen/src/hipoc/hipoc_kernel.cpp:15
...
```
Some general tips:
- The fault will be caught by the runtime but was actually generated by an asynchronous command running on the GPU. So, the GDB backtrace will show a path in the runtime, ie inside "GI_Raise" as shown in the example above.
- To determine the true location of the fault, force the kernels to execute synchronously by seeing the environment variables HCC_SERIALIZE_KERNEL=3 HCC_SERIALIZE_COPY=3. This will force HCC to wait for the kernel to finish executing before retuning. If the fault occurs during the execution of a kernel, you can see the code which launched the kernel inside the backtrace. A bit of guesswork is required to determine which thread is actually causing the issue - typically it will the thread which is waiting inside the libhsa-runtime64.so.
- VM faults inside kernels can be caused byi:
- incorrect code (ie a for loop which extends past array boundaries), i
- memory issues - kernel arguments which are invalid (null pointers, unregistered host pointers, bad pointers).
- synchronization issues
- compiler issues (incorrect code generation from the compiler)
- runtime issues
@@ -81,7 +81,7 @@ int main(int argc, char *argv[])
const unsigned threadsPerBlock = 256;
printf ("info: launch 'vector_square' kernel\n");
hipLaunchKernel(HIP_KERNEL_NAME(vector_square), dim3(blocks), dim3(threadsPerBlock), 0, 0, C_d, A_d, N);
hipLaunchKernel(vector_square, dim3(blocks), dim3(threadsPerBlock), 0, 0, C_d, A_d, N);
printf ("info: copy Device2Host\n");
CHECK ( hipMemcpy(C_h, C_d, Nbytes, hipMemcpyDeviceToHost));
+1
مشاهده پرونده
@@ -338,6 +338,7 @@ void ihipStream_t::locked_wait()
};
// Causes current stream to wait for specified event to complete:
// Note this does not require any kind of host serialization.
void ihipStream_t::locked_waitEvent(hipEvent_t event)
{
LockedAccessor_StreamCrit_t crit(_criticalData);