* Added Python test runner to execute RCCL tests
* Disabled capture output to avoid hangs
* Add RCCL_TEST_MPI_HOSTFILE env var to get the hostfile
* Converted test_type to boolean gtest flag
* Removed unused return values
* Added custom rccl library usage
* Removed json output
* Updates to test_runner: added num_gpus field
* Address review comments
* Prepend env vars for single node, single process executions
* Added separate enums for exit and result codes
* Update configuration files
* Moved configurations to their own dir
* Address review comments
* Update tools/scripts/test_runner/README.md
Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>
---------
Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>
* create dir regardless of default or user-provided path if it doesn't exist
* Fix npkit_dump_dir in npkit_trace_generator.py
---------
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>
* Increased max stack size to 640
* Added new binary for executing unit tests
Added new unit tests for the argcheck.cc and alt_rsmi.cc files.
Modified how unit tests are executed so that they cover static methods:
a bash script converts static functions and variables to non-static
on the fly, restricted to the debug build type.
* added rccl version using rccl-tests
* Added function to get rccl version from rccl-tests
* removed whitespace
* Added rccl version
* Updated readme and fixed formatting
* removed debug prints
* Initial Script ready for review
* Added RCCL-tests and RCCL versions
* Added output folder and README
* Base format built
* Added ROCm version
* Added function to center titles and VRAM information
* Added HIP version
* Cleaned formatting
* Added UCX version and MPI version
* Added NUMA balancing
* Added rocminfo
* Removed notes
* Changed regex for Broadcom NIC
* Removed note by the ACS info
* Added Hostname to summary and details
* Print summary to terminal
* Added argparse
* Added flags and readme
* Added GPU ID
* fixed spelling
* renamed script again
* Added file descriptor and locked mem checks
* Added file descriptor and locked mem checks
* Removed extra spaces from summary table
* printing output file location
* Removed sudo in code and ACS flag
* Add another rome model and override
* Fix bug
* Fix typo
* Add ring
* Update ring
* Fix model matching
* Clean up
* Clean up
* Reverse rings for NCCL_RINGS input
* Only reverse NCCL_RINGS for ring graph
* Fix mapping issue when using NCCL_RINGS
* Add NCCL_RINGS_REMAP to handle inconsistent net names
* adding rocprof parser script
* adding the support for multiple json files
* adding pytorch profiler script
* remove filtering from pytorch log
* addressing the comments and adding the feature to parse all kernels
* completing the report for torch profiler
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
* Add 1H16P GPU model
* Implement NIC identification and remapping
* Revert "Sort IB devices based on device name (#413)"
This reverts commit 2d0ed8dff6.
* Fix permute and check order
* Correction on IB speed reporting
* Revert "Allow user to link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361)"
This reverts commit caf5c9992a.