1. Add a minimal set of NUMA interfaces.
On Linux, the interfaces are built on system calls rather than libnuma.
On Windows, the interfaces also work, but the policy class is a dummy.
Unlike Linux, Windows does not provide a numactl tool or a NUMA library to set up a NUMA policy, so
the default policy applies on Windows: hipHostMalloc() allocates pinned host memory from the
closest host NUMA node.
To get the closest host NUMA node of a GPU device, query the new attribute
hipDeviceAttributeHostNumaId. You can then create a thread with CPU affinity on that NUMA node.
For an example, see the test in hip-tests/catch/perftests/memory/hipPerfHostNumaAllocWin.cc.
2. Remove pfnSetThreadGroupAffinity and pfnGetNumaNodeProcessorMaskEx, since both functions have been available since Windows 7 and Windows Server 2008 R2.
3. Other minor fixes.
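
As an illustration of item 1, the sketch below shows one way a NUMA query can go through a raw
Linux system call instead of libnuma. The helper name CurrentMemPolicyMode is hypothetical and is
not the interface added by this change; on non-Linux builds it is a dummy that reports failure.

```cpp
#include <cassert>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper, not the runtime's actual interface.
// Returns the calling thread's NUMA memory policy mode (0 is MPOL_DEFAULT),
// or -1 if the query is unavailable or not permitted.
int CurrentMemPolicyMode() {
#if defined(__linux__) && defined(SYS_get_mempolicy)
  int mode = -1;
  // With flags == 0 the node mask and address may be null; only the mode is written.
  long rc = syscall(SYS_get_mempolicy, &mode, static_cast<void*>(nullptr), 0UL,
                    static_cast<void*>(nullptr), 0UL);
  if (rc != 0) return -1;  // e.g. kernel without NUMA support, or a sandboxed process
  assert(mode >= 0);
  return mode;
#else
  return -1;  // dummy path for platforms without this system call
#endif
}
```

Calling the system call directly avoids a link-time dependency on libnuma; the cost is that the
caller has to handle a failure code itself.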
This change ensures that shared memory objects (e.g., files in /dev/shm)
are unlinked once all related IPC events have been destroyed.
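
A minimal sketch of the idea, assuming shared ownership is tracked with std::shared_ptr. The
ShmObject and IpcEvent types are hypothetical, and a plain file path stands in for the shared
memory object; the actual change may use a different mechanism such as shm_unlink.

```cpp
#include <cassert>
#include <fcntl.h>
#include <memory>
#include <string>
#include <unistd.h>
#include <utility>

// Hypothetical owner of one shared memory object (a file path stands in for it).
struct ShmObject {
  explicit ShmObject(std::string p) : path(std::move(p)) {
    fd = open(path.c_str(), O_CREAT | O_RDWR, 0600);
    assert(fd >= 0);
  }
  ~ShmObject() {
    close(fd);
    unlink(path.c_str());  // runs only when the last owner goes away
  }
  ShmObject(const ShmObject&) = delete;
  ShmObject& operator=(const ShmObject&) = delete;

  std::string path;
  int fd = -1;
};

// Hypothetical IPC event: every event sharing the object holds one reference.
struct IpcEvent {
  std::shared_ptr<ShmObject> shm;
};
```

The object stays on disk while any event still references it and is unlinked when the last
event is destroyed.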
[ROCm/clr commit: dc34af61d7]
The "optimized" version of memcpy is outdated and
was used only on Win32.
Change-Id: I7f2e0e9051e37cec95438266824b5b0025c324c6
[ROCm/clr commit: 7448113cfc]
- Clean up architecture detection by using Visual Studio macros; I
didn't list all possible ARM platforms (can be done later if desired)
- Fixed two incorrect uses of !defined(ATI_ARCH_ARM) to instead use
defined(ATI_ARCH_X86), as the guarded code is x86-specific
- Fixed one use of __ARM_ARCH_7A__ to use ATI_ARCH_ARM instead
This is an improvement to the fixes in the last patch for SWDEV-323669
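
The detection described above can be sketched roughly as follows. The _M_* macros are the ones
Visual Studio predefines; the GCC/Clang macros and the helper functions ArchName and
HasX86OnlyPath are assumptions added for illustration and are not taken from the patch.

```cpp
#include <cassert>

// Map compiler-provided macros to the ATI_ARCH_* macros (illustrative subset only).
#if defined(_M_IX86) || defined(_M_X64) || defined(__i386__) || defined(__x86_64__)
#define ATI_ARCH_X86 1
#elif defined(_M_ARM) || defined(_M_ARM64) || defined(__arm__) || defined(__aarch64__)
#define ATI_ARCH_ARM 1
#endif

// Hypothetical helper: reports which branch the detection above selected.
const char* ArchName() {
#if defined(ATI_ARCH_X86)
  return "x86";
#elif defined(ATI_ARCH_ARM)
  return "arm";
#else
  return "unknown";
#endif
}

// x86-specific code is guarded with defined(ATI_ARCH_X86), not !defined(ATI_ARCH_ARM),
// so it is also excluded on architectures that are neither x86 nor ARM.
bool HasX86OnlyPath() {
#if defined(ATI_ARCH_X86)
  return true;
#else
  return false;
#endif
}
```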
Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>
Change-Id: I8568167293c34ad5331902105877f3ab6e25acb3
[ROCm/clr commit: 00efdc1cd6]
- Fix a crash with AMD_CPU_AFFINITY=1, as numa_bitmask_alloc isn't the
right API to allocate the bitmask
- Do not set affinity for the ROCr thread. It worsens performance rather
than improving it.
- Fix regression from my previous change for event handler.
Change-Id: I3ea75adc2a6333f29752283eddd5b555e9b58cc5
[ROCm/clr commit: 802c2c8a9f]
Set affinity to the closest node of the current GPU. This reduces
the latency to fetch kernel args, since the device queries the CPU cache
of the core that did the dispatch. This behavior is controlled by the
AMD_CPU_AFFINITY environment variable (disabled by default).
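
A rough sketch of gating the pinning on the environment variable, assuming Linux and
sched_setaffinity. The helper names are hypothetical, the CPU set of the GPU's closest node is
supplied by the caller, and the exact way the runtime parses AMD_CPU_AFFINITY is an assumption.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <sched.h>

// Hypothetical helper: treats only the exact value "1" as enabled (assumption).
bool CpuAffinityEnabled() {
  const char* value = std::getenv("AMD_CPU_AFFINITY");
  return value != nullptr && std::strcmp(value, "1") == 0;
}

// Pins the calling thread to `cpus` (e.g. the cores of the GPU's closest NUMA node)
// only when the feature is enabled. Returns true if the affinity was applied.
bool MaybePinCurrentThread(const cpu_set_t& cpus) {
  if (!CpuAffinityEnabled()) return false;  // disabled by default
  assert(CPU_COUNT(&cpus) > 0);             // an empty set would be rejected
  return sched_setaffinity(0, sizeof(cpu_set_t), &cpus) == 0;
}
```

With the variable unset, the call is a no-op and the scheduler remains free to place the thread
on any allowed core.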
Change-Id: I65afba62cb818ea25a311b88d1c0dd5c51330292
[ROCm/clr commit: b192beea52]
Setting AMD_CPU_AFFINITY=1 keeps the Async Handler thread within the
bounds set by numactl.
Change-Id: Id01b30df5127d65c29ac072bf74a04986b7128de
[ROCm/clr commit: cd21af757e]
Setting AMD_CPU_AFFINITY=1 makes the runtime honor any core affinity that
the process may set. This is disabled by default, as it can prevent a
worker thread, or any other thread the runtime creates, from getting
scheduled, which hurts performance.
Change-Id: Ibe4cc95e7b99caee5ce750b7bf66e09e999cc9a3
[ROCm/clr commit: 1398719b0d]