Графік комітів

73 Коміти

Автор SHA1 Повідомлення Дата
David Yat Sin da2607024b rocr: Perform memcpy for small code-object loads
On large BAR systems, for small-sized code-objects, we get performance
using direct memcpy due to latencies when doing the blit-copy.
2025-05-22 18:39:19 -04:00
Aaron Liu 166b0fa45a rocr/dtif: add dtif environment variable
Using HSA_ENABLE_DTIF to control dtif/native thunk code path

Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Reviewed-by: David Yat Sin <David.YatSin@amd.com>
2025-05-13 16:44:31 -04:00
Tony Gutierrez 6e3c375bf1 rocr: Flags to alloc queue buf/struct in dev mem
This builds on a prior change that allowed for allocating
a user-mode queue's packet buffer in device memory to also
allocate the queue struct in device memory. This provides
additional latency benefits particularly for cases where
dispatches are performed from the GPU itself. Flags are
added to support the various use cases.
2025-04-23 15:53:29 -04:00
lyndonli c34a2798ce rocr: Remove redundant Refresh() call
The initial call to Refresh() in the constructor is
unnecessary as it's handled in Runtime::Load().

Signed-off-by: lyndonli <Lyndon.Li@amd.com>
2025-03-25 09:13:59 -04:00
David Yat Sin 02b38d0614 rocr: Put back scratch_backing_memory_byte_size
The scratch_backing_memory_byte_size is not used by CP, but it is
currently used by rocgdb. Putting the field back, but we need to find a
solution for alt_scratch_backing_memory_byte_size.

Also, completely disabling alternate scratch as we need some changes to
support debugger.
2025-03-06 16:23:38 -05:00
David Yat Sin 9a950ab788 rocr: Temporarily disable alternate scratch memory
Temporarily disable alternate scratch memory usage by default due to
some stability issues.
2025-03-03 09:27:29 -05:00
David Yat Sin aa2f98e6f9 rocr: Update for new async scratch reclaim
Updating ROCr code to match new handshake protocol with CP FW for
asynchronous scratch reclaim.
Increase previous limits when scratch reclaim feature is available.
2025-02-19 21:02:00 -05:00
Shweta Khatri 6361466baa rocr: Use view3dAs2dArray flag, for thick/3D swizzle modes.
Added HSA_IMAGE_ENABLE_3D_SWIZZLE_DEBUG environment flag to
enable/disable this. Default value is false (view3dAs2dArray = 1)
Enabling this flag will enable support for swizzles that do 3D
interleaving. Note that all features of 3D images are supported
with 2D swizzles,it's just that the access patterns are different
and therefore cache hit-rates may be better or worse, depending
on how it's used. Volumetric algorithms do better with 3D and apps
that tend to access a single slice at a time do better with 2D.

Change-Id: Id8574a6710fe4333a1ee331e5ce9195a81434198
2025-01-27 09:28:33 -05:00
David Yat Sin 7ea25ebb85 rocr: Add thread priority for AsyncEventHandler
Set priority to maximum for signal event handler and minimum for
exceptions event handler.

Change-Id: I1b982d3c2e4c880fafc073fe1a542d01692a6fdc
2025-01-24 10:08:12 -05:00
David Yat Sin 4ec730f1dc rocr: Add HSA_SIGNAL_WAIT_ABORT_TIMEOUT
Add support for abort timeout when hsa_signal_wait_relaxed is called and
signal does not clear within timeout.
timeout is in seconds

Change-Id: If1db5a8af33c82ddc4b48968c3d8eceb97d0ea6d
2024-11-13 21:57:02 -05:00
German Andryeyev 0fc7369ba5 rocr: Disable WaitAny() in AsyncEventsLoop()
- Add the new path to avoid WaitAny() calls  in AsyncEventsLoopp() with
HSA_WAIT_ANY_DEBUG key. The new path is selected by default.
The optimizaiton combines all logic of WaitAny() in a single processing loop
and avoids extra memory allocations or ref counting.  Also it won't spin
on the CPU if all events are busy.

Change-Id: I197ce60d0d023fbb672f700d6e87702686f1f55a
2024-10-25 14:37:02 -04:00
Jonathan Kim 32bb0764b7 rocr: Fix IPC DMA Buf fragment handling and enable for development
Discarding blocks for reallocation on IPC export for better memory
performance trigger memory violations with DMA BUF exports so bypass
this for now as application performance drops haven't been observed
with the bypass.

The raw fragment should be passed to the DMA Buf export call as well
since offsets will be implicitly applied in the Thunk/KFD for
export/import calls.

Also, use the agent information directly from the pointer
information so that the export call doesn't have to scan memory to find
this.  Pass the node ID in the handle so that the import call doesn't
have to make two thunk imports to fetch the node ID for GPU memory
imports.

Finally, allow the user to use DMA Buf IPC via
HSA_ENABLE_IPC_MODE_LEGACY=0 for developer testing as legacy mode will
be applied by default.

Change-Id: Ie8fe267f8768fa5df37126078406f7065f69ff4e
2024-09-27 14:40:42 -04:00
Saleel Kudchadker 3baaa6e9c0 rocr: Allocate AQL queue on device memory
- Use HSA_ALLOCATE_QUEUE_DEV_MEM=1 to create AQL queue in device
memory.
- Before writing AQL packet header to the queue use an SFENCE to ensure
that there is no reodering of the writes over PCIE

Change-Id: I5eacdc35108c4a1e245c75ae349b7495451aa60d
2024-09-05 17:48:02 -04:00
Jonathan Kim eb30a5bbc7 rocr: Memory copy based on recommended SDMA engines
Recommended SDMA engines for DMA copies are now exposed for better
GPU-GPU performance. ROCr can now select those DMA engines.

Also lock-in host-device copies to SDMA0 and device-host copies to
SDMA1 for better stability and performance.

Change-Id: Ideff2e13daf537104efecb8b837bd49ee5096cb5
2024-08-20 16:22:32 -04:00
Jonathan Kim ea646cf958 Disable DMABUF IPC iplementation
Current DMABUF implemenation is unstable.  Switch back to legacy
support for now.

Change-Id: I3be871f38c6524b0bcc9225bab61de4e57771efb
2024-08-12 13:14:14 -04:00
David Yat Sin 8d666dea01 PC Sampling: Allocate resources to retrieve data from trap handler
Allocate required device and host buffers to be able to interact with
the 2nd level trap handler.

Change-Id: If99de5aacf956ca57ecafc7b04b797be9c9decaa
2024-04-11 12:53:00 -04:00
David Yat Sin 0bc244e10a PC Sampling: Create PC Sampling interfaces
Create new interface group for PC Sampling

Change-Id: I59b4cfe9f8d1ae313dc28be1d2ed49f750d8212b
2024-04-11 12:52:23 -04:00
Jonathan R. Madsen 7ce263b0e4 Update rocprofiler-register support
- add rocprofiler-register to CPACK_DEBIAN_BINARY_PACKAGE_DEPENDS when found
- add rocprofiler-register to CPACK_RPM_BINARY_PACKAGE_REQUIRES when found
- remove report_tool_load_failures_explicit_
- add HSA_TOOLS_DISABLE_REGISTER flag
- add HSA_TOOLS_REPORT_REGISTER_FAILURE
- use HSA_TOOLS_REPORT_REGISTER_FAILURE instead of HSA_TOOLS_REPORT_LOAD_FAILURE
- changed rocprofiler-register message to not include the word "error"

Change-Id: Ib7fd7f14c42758a54c347874018281bb1b5477a6
2024-02-22 11:55:25 -05:00
Jonathan Kim 62f3f250ce Optimize and fix SDMA gang copies
Optimizations include:
- Greedy gang by placing gang leaders on first D2D sdma blit context
to avoid dead locking with other gang leaders and items.  Note that
this is fine since we can't avoid an oversubscription problem when
there is only 1 xGMI link anyways, so treat all xGMI links as a single
pipe for ganging.
- Non-leader gang items don't have to poll on dependency signals so this
opens up more non-blocking SDMA channels.
- unlock gang lock when gangs are not needed.
- Change gang factor lookup from vector pair to map and register all
gpus in gang factor lookup regardless of link type so that we can take
advantage of the O(logN) direct key/value lookup time.

Fixes include:
- HSA_PAGE_SIZE_4KB was an incorrect macro to use for gang size limit.
As a result, small copies ended up ganging and hitting latency limit.
Use hardcoded 4096 bytes instead.
- Cap auxillary gang factor to the number of non-XGMI SDMA engines.

Change-Id: Ic23fde131502906a807134a04599aa6d012e8cbb
2024-01-25 10:42:27 -05:00
Jonathan R. Madsen 8f0ea44c09 Suppress reporting no tools were found with rocprofiler-register
Change-Id: If853517d40e073202d12e2a6b16fb54be5529650
2024-01-17 01:01:19 -05:00
Jonathan Kim e20f41df62 Enable IPC DMA buf
Set HSA_ENABLE_IPC_MODE_LEGACY off (i.e. use DMA bufs implementation
by default).

Change-Id: I7b1c6cb7d19310adf6f0bfe060736f4adbf7adc2
2024-01-16 22:43:27 -05:00
Jonathan Kim 5dfebdbca9 Change IPC implementation to use DMA Bufs
As the KFD IPC IOCTLs will not be upstreamed, change runtime
implementation to use DMA bufs.

DMA buf fds will be passed over abstract unix domain sockets.
The exporter spins a thread that creates a socket server.
The importer connects to the server to fetch the fd.

libDRM will be required to do a manual import and GPU map for
memory that is not already imported and mapped.

For now, use the legacy IPC implementation by default as a
follow on patch will disable the HSA_ENABLE_IPC_MODE_LEGACY
environment variable.

Change-Id: Ifd8469e9adfc81f8a1ea78d6010fb10b515ba1b4
2024-01-16 22:43:00 -05:00
David Yat Sin a7a3358067 Implement alternate scratch
The alternate scratch memory is used for dispatches that have a low
number of waves but relatively large wave size.
This allows us to keep the tmpring_size.bits.WAVES field of the main
scratch to full occupancy.

Change-Id: I32d240fac4b7d38200d1eebc1b0fdc8a823920d3
2023-12-04 15:05:22 +00:00
David Yat Sin dca8f3a21d Implement async scratch reclaim
For devices where the CP FW supports asynchronous scratch reclaim, ROCr
is able to claw-back scratch memory that was assigned to an AQL queue.
With that ability, ROCr does not have to rely on using USO
(use-scratch-once) when assigning large amounts of memory to a queue.
If we reach a situation where we are running low on device memory, ROCr
will attempt to claw-back the scratch memory.

Change-Id: Iddf8ec84e37ab8b9fdc58bafbe2b61fe2acb6eb7
2023-12-04 15:05:22 +00:00
Jonathan Kim 81c64228e0 Increase SDMA copy size
SDMA4.4 and SDMA5.2+ has increased it's available copy size to 2^30 bytes
represented by exponent as bits set in the COUNT field of the
linear copy.

Also note that the full 2^22 byte limit is available from SDMA4 onwards
as it has corrected the 0x3fffe0 HW limitation from SDMA3.

As copy limit has increase, this can change system performance
so provide env var HSA_ENABLE_SDMA_COPY_SIZE_OVERRIDE=0 to fall
back to the original 0x3fffe0 limit for debugging purposes.

Change-Id: I0fb6e5378f68e5b8a00ff559271691a943ee06ee
2023-12-04 15:03:31 +00:00
Jonathan Kim 7df0167821 Enable D2D SDMA Ganging over xGMI
Use all available SDMA engines capped by xGMI bandwith for
all D2D copies within a hive.

By default, set the latency boundary copy size as 4KB and below.
Any copy size in within this boundary will not gang.

Avoid oversubscribing engines by not ganging on engines with
pending non-ganged work.

An enviroment variable HSA_ENABLE_SDMA_GANG has been provided
to override default ganging behaviour.

Change-Id: Iccde76aa1af1d47ea2a151789432c9db4f0ffa8d
2023-07-27 08:58:26 -04:00
David Yat Sin a397373cea Add HSA_ENABLE_PEER_SDMA env variable
Add support for HSA_ENABLE_PEER_SDMA env variable that can be used to
disable use of SDMA engines for device-to-device transfers. Note that
setting HSA_ENABLE_SDMA=0 will disable all SDMA transfers and override
HSA_ENABLE_PEER_SDMA values.

Change-Id: I737b3c2b2efcf3ff237f98bc748f49b8252ed24a
2023-05-18 00:10:20 +00:00
David Yat Sin a180c9ee78 Add env var to override SRAM ECC
Add HSA_ENABLE_SRAMECC environment variable that can be used to
override SRAM ECC mode reported by KFD

Change-Id: I2b95511820a2d3d146a76b03070659c0695b61fd
2023-04-27 16:16:05 -04:00
David Yat Sin 8ebf5f9c48 Adding scratch memory reservation
Some applications will keep trying to allocate device memory until the
allocation fails. This causes all device memory to be used up and we are
then unable to allocate scratch memory for dispatches. Reserve enough
memory for 1 small scratch allocation.

Change-Id: I968400d41540ba1aca8f28581f229693eec02225
2023-04-06 15:13:36 +00:00
Shweta Khatri 83a307c449 By default, disable mwaitx feature.
This can be enabled by setting HSA_ENABLE_MWAITX=1

Change-Id: I4be00892780beeb8b14c3c5f34aa10b158921bff
2023-03-15 19:57:25 -04:00
David Yat Sin cc48dfdbff Use mwaitx when busy-waiting signals
Use mwaitx instructions when busy waiting for signals to reduce CPU
energy usage.
This can be disabled by setting HSA_ENABLE_MWAITX=0

Change-Id: Ic207895a491b2bf6dacba47ef0921df3faad5b5a
2023-02-22 16:55:43 +00:00
David Yat Sin a4f898ad15 Add env variable to print image SRD contents
Add environment variable HSA_IMAGE_PRINT_SRD to print contents of SRD
registers for image functions

Change-Id: Ifb47a73dcfad8745ee7445e20de96e1021b80bd6
2023-01-13 11:01:04 -05:00
David Yat Sin df3fe8c2fb Add env variable to disable CPU affinity override
New environment variable HSA_OVERRIDE_CPU_AFFINITY_DEBUG to
enable/disable overriding CPU affinity.

Default value is enabled(1).

This is a temporary variable and may be removed in the future.

Change-Id: Id6a7c611730471ddc276ca333fde1e57046bf32a
2022-08-19 11:07:49 -04:00
Sean Keely 965df6eef7 Basic SVM profiler.
Mostly a demo at this point.  Logs SVM (aka HMM) info to
HSA_SVM_PROFILE if set.

Example: HSA_SVM_PROFILE=log.txt SomeApp

Change-Id: Ib6fd688f661a21b2c695f586b833be93662a15f4
2022-06-23 19:30:06 -05:00
Sean Keely 3ebe99f96d Add experimental option to force discovery of all copy agents.
Discards all user provided async copy agent info and relies on
pointer info discovery.

Change-Id: Ife3e708a49ffccbede4983ab47d5ed0032970857
2022-05-14 18:08:57 -05:00
Sean Keely 37942c982a Add HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT.
On gfx90a only a reduced number of CUs must be used for cooperative
dispatches due to CWSR and launcher interactions with asymetric
harvest.  We must use one fewer CUs per SE than the lowest count of
CUs on any SE.

Also adds env var HSA_COOP_CU_COUNT which enables the cooperative
CU count computation.  Set to 1 to enable the new computation.
This is an opt-in feature that will become enabled by default (opt-out)
in a future release.

Change-Id: Ifbb75ced3bbc15876eef44922c6a4f6fde8c4c28
2022-01-31 15:22:07 -05:00
Sean Keely 19c1e92b4c Remove io_link workarounds.
KFD topology has been corrected and the defaults used by this
workaround are no longer true for all chips.

Change-Id: I0242d8077e9666ed1cf0dc3985244258ae5c0924
2021-10-11 19:15:07 -05:00
Sean Keely a8c3ea82a4 Add debug option to skip setting the initial cu mask.
Adds debug variable HSA_CU_MASK_SKIP_INIT.

Change-Id: I5c742d1184a36fdef818bc50c3b780b859b68560
2021-09-16 23:43:49 -05:00
Sean Keely 2aa0795b33 Improve HSA_CU_MASK parsing efficiency.
Delay parsing until after GPU discovery.  Use the surfaced
GPU count and maximum phyiscal CU count to limit parsed bit masks.

This prevents pathological input such as
HSA_CU_MASK=0-8000000:0-8000000 from attempting to consume 7TiB.

Change-Id: I3773d2db3740c2023b0f6275d1818b69119b0495
2021-08-27 20:05:18 -04:00
Sean Keely 4455250be1 Add HSA_CU_MASK
New environment variable HSA_CU_MASK allows users to
specify a cu mask to every queue allocated from any
GPU.  hsa_amd_queue_cu_set_mask is restricted from
escaping this mask.

A new API hsa_amd_queue_cu_get_mask is added to query
the current cu mask.

Change-Id: I846c03a5faaca9b95067c31db84b59cc9fce2f03
2021-07-29 02:23:34 -05:00
Sean Keely 206e87d28b Support debugging hw exceptions.
Change-Id: I9780147294af2e9457fa54693580735452ee2ae6
2021-07-16 18:03:26 -05:00
Sean Keely 77046a1aaa Revert "Revert SVM and XNACK support."
This reverts commit 5bd153974d.

Conflicts:
	opensrc/hsa-runtime/core/util/flag.h

Change-Id: I16daf41588e6139126d66af54b0693de2e7e39f3
2021-04-21 14:49:43 -05:00
Sean Keely 243e29ba8e Remove emulator SRAMECC override controls.
Change-Id: Iea9e7870dbf517032f34cebec673c90226b96960
2021-04-02 02:11:05 -04:00
Sean Keely 5bd153974d Revert SVM and XNACK support.
KFD is not ready yet.

Change-Id: I61deb292ddb92185d33504c2115169888d56e211
2021-04-02 02:10:59 -04:00
Sean Keely 7333c77e22 Squash merge of cfreehil/amd-temp-gfx90a onto amd-staging.
Includes some workarounds and HMM.
Conflicts:
	opensrc/hsa-runtime/core/runtime/amd_topology.cpp
	opensrc/hsa-runtime/core/util/flag.h

Change-Id: I22976f07964a43dbb228a6231777dbd599112b8d
2021-04-02 02:10:15 -04:00
Sean Keely 45fbe5b192 Block ROCm 4.1+ running against 4.0 and prior kfd.
Sramecc is misreported in kfd 4.0 and prior.  To prevent possible
corruption due to d16 instructions, deny use of gfx906 with older
kfds and correct misreport for gfx908.  Denial of gfx906 may be
overridden by setting HSA_IGNORE_SRAMECC_MISREPORT=1.

Change-Id: I7d5c3a716fad01c348f8b88cd508cedbf914c989
2021-04-01 00:03:32 -04:00
Sean Keely b51f68b535 Style update for SDMA enable flag.
Updated to match xnack flag's style.

Change-Id: I6115c0b53660d789e698de1606a9388ae1789866
2020-11-20 15:06:02 -05:00
Sean Keely 2a0c6774fb Use SDMA for small copies in VRAM.
For small copies cache flush latency is larger than data transfer
latency in local VRAM.  Select SDMA for small copies.

Environment key HSA_FORCE_SDMA_SIZE is added for easy adjustment
of the small copy size.  This may be removed after tuning is done.

Change-Id: I733fa0ae01c616617c5de50e71226b51fd589ef2
2020-09-03 03:11:57 -05:00
Tony 91cb98dab6 Code object reader improvements
- Make code object reader use mmap when loading from a file on Linux.
- Support computing code object URI for memory either fro the loaded
  host executables, or from all mmapped files. Define the environment
  variable HSA_LOADER_ENABLE_MMAP_URI to non 0 to search the mmap
  files, otherwise only the loaded executables will be seatched.
- For mmap search, determine file size and ommit offset and size URI
  fragment when the code object is the whole file even when specifying
  a file size explicitly or specifying memory that has been mmaped.
- Always return a non-empty code object URI.
- When a code object reader is created, complete all fields to ensure
  it can be used in a multi-threaded manner using only const
  operations.
- Add missing exception handlers in the AMD vendor extentions.
- More rigorous checking for errors.

Change-Id: I07797b1dc60c5c64245142d77becf9f7c9643395
2020-06-25 12:18:50 -04:00
Ramesh Errabolu fa13208698 Add rocr namespace to core header and impl files
Change-Id: I1e1b33f9bba1078d049bc19797889988c3e43360
2020-06-19 22:34:21 -04:00