1. Create a set of mini numa interface.
In Linux, the interface is based on system call rather than libnuma.
In Windows, the interface can also work, but the policy class is dummy.
Different from Linux, Windows doesn't provide numactl tool or numa lib to setup numa policy, thus
the default policy is followed in Windows, that is, using the closest host numa node to allocate
pinned host memory in hipHostMalloc().
To get the closest host numa node of a GPU device, you need query the new attribute
hipDeviceAttributeHostNumaId. Then you can create a thread with CPU affinity on the numa node.
For example, reference the test in hip-tests/catch/perftests/memory/hipPerfHostNumaAllocWin.cc.
2. Remove pfnSetThreadGroupAffinity and pfnGetNumaNodeProcessorMaskEx as the functions have been exposed since Win7 and Win server 2008.
3. Other minor fixes.
This change ensures that shared memory objects (e.g., files in /dev/shm)
are unlinked once all related IPC events have been destroyed.
[ROCm/clr commit: dc34af61d7]
The "optimized" version of memcpy is outdated and
was used in win32 only.
Change-Id: I7f2e0e9051e37cec95438266824b5b0025c324c6
[ROCm/clr commit: 7448113cfc]
Set affinity to the closest node of the current GPU. This reduces
the latency to fetch kernel args since device would query the CPU cache
of core which did the dispatch. This behavior is controlled with
AMD_CPU_AFFINITY env var(disabled by default)
Change-Id: I65afba62cb818ea25a311b88d1c0dd5c51330292
[ROCm/clr commit: b192beea52]
Setting AMD_CPU_AFFINITY=1 will keep Async Handler thread within the
bounds set by numactl.
Change-Id: Id01b30df5127d65c29ac072bf74a04986b7128de
[ROCm/clr commit: cd21af757e]
Replace amd::Atomic with std::atomic. Remove make_atomic uses by
converting the variable to std::atomic and making sure the memory
order is relaxed when synchronizes-with is not needed.
Delete utils/atomic.hpp.
Change-Id: I0b36db8d604a8510ac6e36b32885fd16a1b8ccfa
[ROCm/clr commit: 5d4b6f74d3]