Query ROCr to see if we have the proper lower-level support for
cooperative groups -- GWS support through the firmware, driver,
thunk, and ROCr. ROCr does these checks for us, and presents a
query that allows us to see if GWS entries are available for use.
If so, then we have all the lower-level technologies needed, and
we should enable cooperative groups support for HIP.
The maxSharedMemoryPerMultiProcessor attribute is meant to describe
the number of bytes of shared memory (LDS space in AMD terminology)
in each SM (CU in AMD terminology). For instance, on AMD GPUs this
is often 64KB per CU, and some Nvidia GPUs it's 96KB per SM.
This shared memory is a different address space from the normal
global memory. However, the current HIP-HCC properties fill this
in with a size that matches the totalGlboalMem property. This gives
a drastically too-high calculation for the amount of LDS space that
each CU has -- tens of GBs vs. 10s of KBs.
This patch fixes this by pulling the maxSharedMemoryPerMultiProcessor
property from the HSA pool that describes how much workgroup-local
space is available on each CU. The HSA runtime eventually pulls
this from the topology information about LDSSizeInKB, defined as
"Size of Local Data Store in Kilobytes per SIMD".
Previously, this HSA query was used to fill in the value of the
sharedMemPerBlock property. On today's AMD GPUs, we know that
the amount of LDS avaialble to the workgroup is identical to the
amount of LDS space in the CU. However, in the future this may
differ. As such, this patch changes around the order and fills
in the "PerMultiProcessor" property from the HSA query (since
what's what the query is defined to return), and then separately
fills in the "PerBlock" property as we know it.
* Add missing texturePitchAlignment member to the hipDeviceProp_t struct.
* Add missing hipDeviceAttributeTexturePitchAlignment enumerator to the hipDeviceAttribute_t enum.
* Initialize texturePitchAlignment to 256. This works for gfx9+, but is technically overaligned in most cases for pre-gfx9.
* Add the texturePitchAlignment property to the NVCC path.
* add default visibility to most APIs in program_state
* remove unwanted C++ headers
* Add symbol visibility pragmas and compiler flags
* Add visibility attribute to APIs in channel_descriptor and hip_hcc
* remove unused headers
* simplify build flags with hcc
* add pragma visibility hidden to functional_grid_launch
* [CMake] add gfx908 back
* all thread local access now through single struct
* clean up old commented-out code, more use of GET_TLS()
* fewer calls to GET_TLS by passing tls as a funtion argument
* revert unnecessary change to printf
* fix failing tests due to TLS change
* fix merge conflicts in ihipOccupancyMaxActiveBlocksPerMultiprocessor
* Add Max Texture 1D,2D,3D device properties
* Corrected testcase to use enums defined in hipDeviceAttribute_t
* Added texture 1D,2D and 3D support for NVIDIA path
* [hip] implement the hipExtLaunchMultiKernelMultiDevice API
* add a guard to check the HCC version for acquire_locked_hsa_queue() API which was introdued in HCC for ROCm 2.5
* modified code based on the requested changes
* changes to lock all streams before launching kernels for each device and unlock them after the dispatches
* check each stream to be valid before starting to lock all the streams
* Revert "Revert "Use COMgr to read Kernel Args Metadata (#1006)""
This reverts commit a3d118eaa8.
* Revert "Use COMgr to read Kernel Args Metadata (#1006)"
This reverts commit 8a548bf40b.
* Revert "improve program state commentary"
This reverts commit 7aada87cbd.
* Revert "load program state once per agent"
This reverts commit c9117de8eb.
* start moving function_names() into the hip shared lib
* start moving code_object_blobs to a new "state" object
* Consolidate various program state related static objects into a
single program_state object
* minor clean up
* move more stuffs from functional_grid_launch into program_state
* debug make_kernarg
* moving lookup for kernargs size_align into program_state
* clean up old code for kernarg size and alignment
* update hip_module to use newer api in program_state
* Create public member functions for program_state
* move most program state functions into shared library
* Pass the data buffer size to load_executable
Otherwise, it can't figure what the data size is
just from the char* (since the data is not really a string)
* turning free functions in program state into members of program_state_impl
* change the free function globals() into a member of program_state_impl
* replace the static mutex used for populating globals
* moving associate_code_object_symbols_with_host_allocation into
program_state_impl
* move load_code_object_and_freeze_executable into program_state_impl
* moving executables and functions_names into program_state_impl
* moving kernels() into program_state_impl
* moving functions() into program_state_impl
* move get_kernargs into program_state_impl
* moving kernel_descriptor into program_state_impl
* moving kernargs_size_align calculation into program_state_impl
* Changing the handle to program_state_impl to a pointer
* moving program_state_impl into a separate inline source file
* fixing/cleaning up some header file includes
* moving member function for kernargs_size_align into program_state.cpp
* moving Kernel_descriptor into program_state.inl
* add a new class to manage agent globals
* moving all agent globals processing functions into agent_globals_impl
* load program state once per agent
re-merging PR991 against other program state changes
* fix per-agent program state member initialization
* cache executables based on elf name, isa, and agent.
This avoids program state reloading executables after a shared library is dlopened.
re-merging PR1057 against other program state changes
* protect executables cache by a global mutex
* return ref to executables cache
* adapt PR#981 Make hipModuleGetGlobal be in HIP runtime
* Return hipErrorInsufficientDriver status when CPU device not found - no exception thrown
* Return hipErrorInsufficientDriver status when CPU device not found
* In hipFree, if memory is associated with a device, synchronize that device's streams.
This changes the behavior from synchronizing the currently set TLS device.
* All devices sync in hipFree for _appId=-1 case.
* Revert "All devices sync in hipFree for _appId=-1 case."
This reverts commit 1efb34d6a8426661e45bc5f763422a1147aeac10.
* add HIP_SYNC_FREE env var
* reimplement HIP_INIT as a function, expose it as hip_impl::hip_init()
so that it could be called from hipLaunchKernelGGL and other inlined
HIP functions
* Don't call hip_init from ihipPreLaunchKernel