diff --git a/docs/allapi.rst b/docs/allapi.rst
deleted file mode 100644
index ca48fa77c6..0000000000
--- a/docs/allapi.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-=======
-All API
-=======
-
-.. doxygenindex::
-
-
diff --git a/docs/api-library.rst b/docs/api-library.rst
new file mode 100644
index 0000000000..b9458a6772
--- /dev/null
+++ b/docs/api-library.rst
@@ -0,0 +1,11 @@
+.. meta::
+  :description: RCCL is a stand-alone library that provides multi-GPU and multi-node collective communication primitives optimized for AMD GPUs
+  :keywords: RCCL, ROCm, library, API
+
+.. _api-library:
+
+=============
+API library
+=============
+
+.. doxygenindex::
diff --git a/docs/index.rst b/docs/index.rst
index aacc95593b..a9960b2019 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,11 +1,28 @@
-****
-RCCL
-****
+.. meta::
+  :description: RCCL is a stand-alone library that provides multi-GPU and multi-node collective communication primitives optimized for AMD GPUs
+  :keywords: RCCL, ROCm, library, API
 
-The ROCm Collective Communication Library (RCCL) is a stand-alone library which provides multi-GPU and multi-node collective communication primitives optimized for AMD GPUs.
+.. _index:
 
-RCCL (pronounced “Rickel”) implements routines such as all-reduce, all-gather, reduce, broadcast, reduce-scatter, gather, scatter, all-to-allv, and all-to-all as well as direct point-to-point (GPU-to-GPU) send and receive operations.
+===========================
+RCCL documentation
+===========================
 
-The provided collective communication routines are implemented using Ring and Tree algorithms. They are optimized to achieve high bandwidth and low latency by leveraging topology awareness, high-speed interconnects, RDMA based collectives. RCCL utilizes PCIe and xGMI high-speed interconnects for intra-node communication as well as InfiniBand, RoCE, and TCP/IP for inter-node communication.
+Welcome to the ROCm Collective Communication Library (RCCL) docs home page! To learn more, see :ref:`what-is-rccl`.
 
-RCCL supports an arbitrary number of GPUs installed in a single-node or multi-node platform. It can be easily integrated into either single- or multi-process (e.g., MPI) applications.
+Our documentation is structured as follows:
+
+
+.. grid:: 2
+  :gutter: 3
+
+  .. grid-item-card:: API reference
+
+    * :ref:`Library specification`
+    * :ref:`api-library`
+
+To contribute to the documentation refer to
+`Contributing to ROCm `_.
+
+Licensing information can be found on the
+`Licensing `_ page.
diff --git a/docs/api.rst b/docs/library-specification.rst
similarity index 84%
rename from docs/api.rst
rename to docs/library-specification.rst
index 0b4cdafed3..280c88ecb6 100644
--- a/docs/api.rst
+++ b/docs/library-specification.rst
@@ -1,14 +1,16 @@
-.. toctree::
-   :maxdepth: 4
-   :caption: Contents:
+.. meta::
+  :description: RCCL is a stand-alone library that provides multi-GPU and multi-node collective communication primitives optimized for AMD GPUs
+  :keywords: RCCL, ROCm, library, API
 
-===
-API
-===
+.. _library-specification:
 
-This section provides details of the library API
+============================
+RCCL library specification
+============================
 
-Communicator Functions
+This document provides details of the API library.
+
+Communicator functions
 ----------------------
 
 .. doxygenfunction:: ncclGetUniqueId
@@ -27,7 +29,7 @@ Communicator Functions
 
 .. doxygenfunction:: ncclCommUserRank
 
-Collective Communication Operations
+Collective communication operations
 -----------------------------------
 
 Collective communication operations must be called separately for each communicator in a communicator clique.
@@ -58,7 +60,7 @@ Since they may perform inter-CPU synchronization, each call has to be done from
 
 .. doxygenfunction:: ncclAllToAll
 
-Group Semantics
+Group semantics
 ---------------
 
 When managing multiple GPUs from a single thread, and since NCCL collective calls may perform inter-CPU synchronization, we need to "group" calls for
@@ -78,7 +80,7 @@ of ncclGroupStart/ncclGroupEnd.
 
 .. doxygenfunction:: ncclGroupEnd
 
-Library Functions
+Library functions
 -----------------
 
 .. doxygenfunction:: ncclGetVersion
@@ -108,7 +110,3 @@ This section provides all the enumerations used.
 .. doxygenenum:: ncclRedOp_t
 
 .. doxygenenum:: ncclDataType_t
-
-
-
-
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index 55cfb98019..ce1d262a08 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -1,10 +1,15 @@
 root: index
 subtrees:
 - entries:
-  - file: api
-  - file: allapi
-  - file: attributions
+  - file: what-is-rccl
 
+- caption: API reference
+  entries:
+  - file: library-specification
+    title: Library specification
+  - file: api-library
+
 - caption: About
   entries:
   - file: license
+  - file: attributions
diff --git a/docs/what-is-rccl.rst b/docs/what-is-rccl.rst
new file mode 100644
index 0000000000..110b4651c8
--- /dev/null
+++ b/docs/what-is-rccl.rst
@@ -0,0 +1,16 @@
+.. meta::
+  :description: RCCL is a stand-alone library that provides multi-GPU and multi-node collective communication primitives optimized for AMD GPUs
+  :keywords: RCCL, ROCm, library, API
+
+.. _what-is-rccl:
+
+=====================
+What is RCCL?
+=====================
+
+RCCL (pronounced “Rickel”) is a stand-alone library that provides multi-GPU and multi-node collective communication primitives optimized for AMD GPUs.
+It implements routines such as `all-reduce`, `all-gather`, `reduce`, `broadcast`, `reduce-scatter`, `gather`, `scatter`, `all-to-allv`, and `all-to-all` as well as direct point-to-point (GPU-to-GPU) send and receive operations.
+The provided collective communication routines are implemented using Ring and Tree algorithms. They are optimized to achieve high bandwidth and low latency by leveraging topology awareness, high-speed interconnects, and RDMA based collectives.
+
+RCCL utilizes PCIe and xGMI high-speed interconnects for intra-node communication as well as InfiniBand, RoCE, and TCP/IP for inter-node communication.
+It supports an arbitrary number of GPUs installed in a single-node or multi-node platform and can be easily integrated into single- or multi-process (e.g., MPI) applications.