Commit Graph

2959 Commits

Author SHA1 Message Date
David Yat Sin ce0244ac03 Revert rocr: Only expose ext-fine-grain pool on xgmi-hive systems
This reverts commit 6dac90c89a.
2025-03-18 16:28:36 -04:00
jordans d4b85b6bf5 hsakmt: Initial Commit for the HSA KMT Model
The overarching goal is to provide an API that pre-silicon models can latch into for software bring-up.
2025-03-18 16:22:17 -04:00
David Yat Sin 6903a41b1d rocr: Workaround for SDMA POLL_REGMEM on gfx9.0
Poll the dependent signals twice on all gfx9.0 GPUs except gfx90a.
This is needed as a work-around for a rare issue where SDMA_POLL_REGMEM
may return before the memory is actually cleared.
2025-03-17 17:59:15 -04:00
Mallya, Ameya Keshava 5d254c6fb0 Added release trigger for further releases
Signed-off-by: Mallya, Ameya Keshava <AmeyaKeshava.Mallya@amd.com>
2025-03-14 13:52:00 -07:00
Stella Laurenzo c36ccaaf4b rocr: Search for libnuma with find_package before find_library.
This avoids a false dependence on a system library when not desired.
2025-03-14 08:16:13 -07:00
Hila, Nino 98a5ebc3f1 Update palamida.yml
Signed-off-by: Hila, Nino <Nino.Hila@amd.com>
2025-03-13 20:08:56 -04:00
Hila, Nino 0e2064e6a7 Create palamida.yml
Signed-off-by: Hila, Nino <Nino.Hila@amd.com>
2025-03-13 16:07:18 -04:00
Benjamin Welton d2a89a467b rocr: Reset event_age when signals move
Resets event_age when signals move. Prior to this PR, event_age
can become unaligned with hsa_event, causing hangs if the event_age
exceeds the true hsa_event age.
2025-03-13 11:32:16 -04:00
Emily Deng 42f79776cd kfdtest: Fix the childStatus is 0x7f error for KFDDBGTest.HitMemoryViolation
When the parent runs faster than the child and the child has not yet called the
second raise(SIGSTOP), the parent's "waitpid(childPid, &childStatus, 0)" returns
with childStatus set to 0x137f, which indicates a stop by SIGSTOP.

Signed-off-by: Emily Deng <Emily.Deng@amd.com>
2025-03-13 13:38:46 +08:00
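
The 0x137f status in the entry above can be decoded with the standard wait-status macros. The following is a minimal sketch, not code from the commit: the helper name stopped_by_sigstop is hypothetical, and the literal value assumes Linux, where SIGSTOP is 19 (0x13) on common architectures and a stopped child is reported with 0x7f in the low byte.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/wait.h>

/* Hypothetical helper (not from the commit): returns true when a status
 * value filled in by waitpid() says the child was stopped by SIGSTOP,
 * as opposed to having exited or been killed. */
static bool stopped_by_sigstop(int status) {
    return WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP;
}
```

A parent that expects a second stop could check this helper after each waitpid() call and treat any other status as the child having moved on.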
Emily Deng 91ef44d3ec kfdtest: Fix DeviceSnapshot return fail error for KFDDBGTest.HitMemoryViolation
When the child reaches the second raise(SIGSTOP) and the parent sends
PTRACE_CONT, the child exits. The parent then asserts at DeviceSnapshot
because kfd_ioctl cannot get the mm from the child pid.

Signed-off-by: Emily Deng <Emily.Deng@amd.com>
2025-03-13 13:38:46 +08:00
Apurv Mishra 85c4b0020a kfdtest: limit GFX VRAM allocation to 1/4 sys mem
Reduce the memory allocated for GFX VRAM, as the
KFD Evict test hit intermittent page faults,
which may be due to a larger GFX CS BO size.
2025-03-12 13:54:04 -04:00
Yiannis Papadopoulos c7936334cf rocr/aie: Changing variable names
2025-03-11 19:35:21 -04:00
Yiannis Papadopoulos fb33e2e724 rocr/aie: Handle non-HSA_STATUS_SUCCESS during VisitRegion
2025-03-11 19:35:21 -04:00
Apurv Mishra de8f8f076d kfdtest: add blacklist for RHEL9 system
Add tests to exclude when running kfdtest
on RHEL9 systems. Tested with Navi 31.

Signed-off-by: Apurv Mishra <apurv.mishra@amd.com>
2025-03-11 16:40:25 -04:00
Longlong Yao a254e35fd6 rocr: export pointer type for OnlyAddress
Signed-off-by: Longlong Yao <Longlong.Yao@amd.com>
2025-03-11 10:16:58 -04:00
Longlong Yao 5916467552 libhsakmt: set node_id to 0 for OnlyAddress
Signed-off-by: Longlong Yao <Longlong.Yao@amd.com>
2025-03-11 10:16:58 -04:00
Amber Lin fcf3f91379 kfdtest: Temporarily blacklist KFDNegativeTest
Blacklist KFDNegativeTest.BasicPipeReset from gfx950 until MEC can
support pipe reset on GC 9.5.0.

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
2025-03-10 10:37:19 -07:00
zichguan-amd 3415a500c7 Throw exception when runtime not initialized for hsa_amd_signal_wait_*
Signed-off-by: zichguan-amd <zichuan.guan@amd.com>
2025-03-07 15:17:10 -05:00
zichguan-amd e4d027191c rocr: Allow 0/NULL/invalid signal handles for wait operations to be no-op
Remove hard assertions for signal validation on hsa_amd_signal_wait_* operations, instead ignore 0/NULL/invalid signals in the dependency condition evaluation to align with HSA specs for barrier-AND and barrier-OR packets.

Signed-off-by: zichguan-amd <zichuan.guan@amd.com>
2025-03-07 15:17:10 -05:00
David Yat Sin 02b38d0614 rocr: Put back scratch_backing_memory_byte_size
The scratch_backing_memory_byte_size is not used by CP, but it is
currently used by rocgdb. Putting the field back, but we need to find a
solution for alt_scratch_backing_memory_byte_size.

Also, completely disabling alternate scratch as we need some changes to
support debugger.
2025-03-06 16:23:38 -05:00
Jonathan Kim c879fdefcf kfdtest: Add KFD SDMA queue reset testing
The KFD can reset individual SDMA queues, similar to compute queue reset.
Add a test.
2025-03-06 14:04:42 -05:00
Jonathan Kim ee890e7d2b kfdtest: Add KFD SDMA queue reset testing
The KFD can reset individual SDMA queues, similar to compute queue reset.
Add a test.
2025-03-06 14:04:42 -05:00
Jonathan Kim d047708317 kfdtest: Allow user to modify packet size for SDMA write packets
This is primarily used for debug and negative testing for SDMA queue
reset and shouldn't be used for normal run cases.
2025-03-06 14:04:42 -05:00
Jonathan Kim 9e57ce48e8 kfdtest: Add create SDMA queue by target engine
KFD supports SDMA queue creation by target engine.
Enable this for testing.
2025-03-06 14:04:42 -05:00
Jonathan Kim a957b24153 kfdtest: Add SDMA poll memory register packet support
The SDMA can wait by polling user memory. This is being added to
support per-SDMA queue reset testing.
2025-03-06 14:04:42 -05:00
Jonathan Kim e3d09e30dc hsakmt: Expose per-SDMA queue reset capabilities
Expose a new capabilities field that flags per-SDMA queue reset
support.
2025-03-06 14:04:42 -05:00
Su, Daniel 70b44c576c External CI: change trigger from amd-master to amd-mainline
Signed-off-by: Su, Daniel <Daniel.Su@amd.com>
2025-03-05 16:24:29 -05:00
David Yat Sin 6dac90c89a rocr: Only expose ext-fine-grain pool on xgmi-hive systems
We cannot guarantee system-scope coherency on systems with only PCIe
connections, so do not expose extended fine-grain memory pool on these
systems.
2025-03-05 10:41:38 -05:00
Lao, Darren 0cd46b6582 rocr: Change grid dimensions
Signed-off-by: Lao, Darren <Darren.Lao@amd.com>
2025-03-04 16:19:51 -05:00
David Yat Sin 4cb6a6d45d rocrtst: Disable RLIMIT for negative queue tests
The negative queue tests generate an exception which triggers a coredump
generation. Disable RLIMIT so that the coredumps are not generated for
these tests.
2025-03-04 10:29:34 -05:00
David Yat Sin d031af9eb5 rocr: Check RLIMIT_CORE before generating coredump
Check for RLIMIT_CORE before collecting data for coredump. If the
current limit is 0, then we can return early without spending time
collecting coredump data.
2025-03-04 10:29:34 -05:00
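
The early-return check described in the entry above can be sketched with the POSIX getrlimit() call. This is an illustration under stated assumptions, not the ROCr implementation: the helper name core_dumps_enabled is hypothetical, and treating a failed query as "enabled" is a choice made for this sketch only.

```c
#include <stdbool.h>
#include <sys/resource.h>

/* Hypothetical helper (not from the commit): returns false when the
 * soft RLIMIT_CORE limit is 0, meaning the kernel would not write a
 * core file and collecting dump data would be wasted work. */
static bool core_dumps_enabled(void) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        /* Assumption for this sketch: if the limit cannot be read,
         * behave as if core dumps are enabled. */
        return true;
    }
    return lim.rlim_cur != 0;
}
```

A caller would test this helper first and skip the data-collection path when it returns false.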
David Yat Sin 3944da1d76 rocr: Only set asan flag on GPU agents
2025-03-03 14:51:19 -05:00
David Yat Sin 9a950ab788 rocr: Temporarily disable alternate scratch memory
Temporarily disable alternate scratch memory usage by default due to
some stability issues.
2025-03-03 09:27:29 -05:00
David Belanger 3ceb131df5 kfdtest: Fix ExtendedCuMasking test case
Modify test case to support XL cards.

Change-Id: I6ad45a290d50a5238804ce7417bcdb33a3912872
Signed-off-by: David Belanger <david.belanger@amd.com>
2025-02-27 21:25:19 -05:00
Khatri, Shweta 0984a1f0fd rocr: GFX9, GFX10, GFX11: Use view3dAs2dArray flag, for thick/3D swizzle modes. (#58)
An HSA_IMAGE_ENABLE_3D_SWIZZLE_DEBUG environment flag already exists to
enable/disable this. The default value is false (view3dAs2dArray = 1).
Enabling this flag enables support for swizzles that do 3D
interleaving on GFX9, GFX10 and GFX11. By default, support for swizzles that
do 3D interleaving is disabled.
2025-02-26 09:38:17 -05:00
Tony Gutierrez d3a4dc9687 rocr: Remove KMT usage from AMD ext
Use the core Driver in AMD's HSA extension API to make it
agnostic to the underlying OS and kernel-mode driver.
2025-02-25 21:51:52 -05:00
James Zhu f8d8b8011f kfdtest: fix resource leakage
Resources allocated in SetUp/HsaNodeInfo::Init
need to be deleted in TearDown/HsaNodeInfo::Delete.

Signed-off-by: James Zhu <James.Zhu@amd.com>
2025-02-24 19:38:59 -05:00
Khatri, Shweta 322a794cf6 rocr: Adding support for Stochastic PC Sampling for gfx94x (#47)
Change-Id: Ide4c2e25b88f1f25ea4ce35a619b93963c0355ee
2025-02-22 00:13:08 -05:00
Tony Gutierrez a9f6bc8d0e rocr: Remove KMT usage from CPU agent
Use the core Driver object in the CPU agent to make it OS/driver
agnostic.

Implement the GetMemoryProperties() and GetCacheProperties() methods
for the KFD driver.
2025-02-21 10:00:38 -05:00
Cheruvally, Aravindan 20e6c87a09 Enable/Disable rocprofiler-register pkg dependency based on build type (#30)
Co-authored-by: Yat Sin, David <David.YatSin@amd.com>
2025-02-20 11:07:35 -05:00
David Yat Sin 107b48fb15 rocr: Add queries for async scratch reclaim
Add support for these 2 new queries:
- HSA_AMD_AGENT_INFO_SCRATCH_LIMIT_MAX
  Maximum amount of scratch memory allowed on this agent

- HSA_AMD_AGENT_INFO_SCRATCH_LIMIT_CURRENT
  Current limit for scratch memory on this agent
2025-02-19 21:02:00 -05:00
David Yat Sin aa2f98e6f9 rocr: Update for new async scratch reclaim
Updating ROCr code to match new handshake protocol with CP FW for
asynchronous scratch reclaim.
Increase previous limits when scratch reclaim feature is available.
2025-02-19 21:02:00 -05:00
David Yat Sin 2f8a9b28d0 rocr: Remove unused fields in amd_queue_t
scratch_wave64_lane_byte_size and alt_scratch_wave64_lane_byte_size are
not used by CP FW.
2025-02-19 21:02:00 -05:00
David Yat Sin 13c591d250 rocr: Remove gfx940 and gfx941 support
2025-02-19 12:16:24 -05:00
David Yat Sin 806ddfc8eb rocrtst: extend IPC test to support async_handler
2025-02-19 11:19:09 -05:00
David Yat Sin fa8be44df9 rocr: Allow IPC signals in hsa_amd_signal_async_handler
Allow IPC signals to be registered with hsa_amd_signal_async_handler.
This forces AsyncEventsLoop to switch to polling instead of interrupts.
2025-02-19 11:19:09 -05:00
Longlong Yao 26f001d3cb libhsakmt: allocate va in host path
Change-Id: I40a4395aca99ea8dfd8ff0ecde64eb2c3840d867
Signed-off-by: Longlong Yao <Longlong.Yao@amd.com>
2025-02-15 07:56:45 -05:00
Adel Johar b4f8b5c202 Docs: Update environment variables page
2025-02-14 10:15:20 -05:00
Harish Kasiviswanathan 2a64fa5e06 libhsakmt: gfx950: Add option to enable HIGH_PRECISION
Environment variable HSA_HIGH_PRECISION_MODE can be used to control MFMA
precision

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: Ib78dd9dd8867025e090a3cca96ab6db4f65dea12
2025-02-10 16:05:25 -05:00
Ranjith Ramakrishnan 3be9c49b63 CMake: Add package conflict for the deprecated package hsakmt
For Debian use cases, a package conflict is required to remove the
deprecated package during package upgrade. Also removed the duplicate
setting of package obsoletes in the RPM use case.
2025-02-07 11:57:32 -05:00