UCX Debugging
InfiniBand
System Configuration
ibdev2netdev – check that at least one IB controller is configured for IPoIB:
user@mlnx:~$ ibdev2netdev
mlx5_0 port 1 ==> ib0 (Up)
mlx5_1 port 1 ==> ib1 (Up)
mlx5_2 port 1 ==> ib2 (Up)
mlx5_3 port 1 ==> ib3 (Up)
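To confirm that an interface actually has an IPoIB address configured (ib0 here is taken from the output above; interface names and addresses will differ per system), a standard ip query can be used:
user@mlnx:~$ ip addr show ib0
The interface should be UP and carry an inet address.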
ucx_info -d and ucx_info -p -u t are helpful commands for displaying what UCX understands about the underlying hardware. For example, we can check whether UCX has been built with RDMA support and whether it is available:
user@dgx:~$ ucx_info -d | grep -i rdma
# Memory domain: rdmacm
# Component: rdmacm
# Connection manager: rdmacm
user@dgx:~$ ucx_info -b | grep -i rdma
#define HAVE_DECL_RDMA_ESTABLISH 1
#define HAVE_DECL_RDMA_INIT_QP_ATTR 1
#define HAVE_RDMACM_QP_LESS 1
#define UCX_CONFIGURE_FLAGS "--disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/gpfs/fs1/user/miniconda3/envs/ucx-dev --with-sysroot --enable-cma --enable-mt --enable-numa --with-gnu-ld --with-rdmacm --with-verbs --with-cuda=/gpfs/fs1/SHARE/Utils/CUDA/10.2.89.0_440.33.01"
#define uct_MODULES ":cuda:ib:rdmacm:cma"
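A similar check (a sketch; the exact lines printed vary with the UCX build and installed drivers) can confirm that the CUDA transports are present, looking for entries such as cuda_copy and cuda_ipc, which the tests below rely on:
user@dgx:~$ ucx_info -d | grep -i cuda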
InfiniBand Performance
ucx_perftest should confirm InfiniBand bandwidth to be in the 10+ GB/s range:
CUDA_VISIBLE_DEVICES=0 UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc,cuda_copy ucx_perftest -t tag_bw -m cuda -s 100000000 -n 10 -p 9999 & \
CUDA_VISIBLE_DEVICES=1 UCX_NET_DEVICES=mlx5_1:1 UCX_TLS=rc,cuda_copy ucx_perftest `hostname` -t tag_bw -m cuda -s 100000000 -n 10 -p 9999
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
+------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: tag match bandwidth |
| Data layout: (automatic) |
| Send memory: cuda |
| Recv memory: cuda |
| Message size: 100000000 |
+------------------------------------------------------------------------------------------+
10 0.000 9104.800 9104.800 10474.41 10474.41 110 110
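The same test can be run across two hosts (a sketch; node0 and node1 are placeholder hostnames, and the device names should match each host's ibdev2netdev and nvidia-smi topo -m output). Start the server on one node, then point the client at it:
node0$ CUDA_VISIBLE_DEVICES=0 UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc,cuda_copy ucx_perftest -t tag_bw -m cuda -s 100000000 -n 10 -p 9999
node1$ CUDA_VISIBLE_DEVICES=0 UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc,cuda_copy ucx_perftest node0 -t tag_bw -m cuda -s 100000000 -n 10 -p 9999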
The -c option is NUMA dependent and sets the CPU affinity of the process for a particular GPU. CPU affinity information can be found with nvidia-smi topo -m:
user@mlnx:~$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity
GPU0 X NV1 NV1 NV2 NV2 SYS SYS SYS PIX PHB SYS SYS 0-19,40-59
GPU1 NV1 X NV2 NV1 SYS NV2 SYS SYS PIX PHB SYS SYS 0-19,40-59
GPU2 NV1 NV2 X NV2 SYS SYS NV1 SYS PHB PIX SYS SYS 0-19,40-59
GPU3 NV2 NV1 NV2 X SYS SYS SYS NV1 PHB PIX SYS SYS 0-19,40-59
GPU4 NV2 SYS SYS SYS X NV1 NV1 NV2 SYS SYS PIX PHB 20-39,60-79
GPU5 SYS NV2 SYS SYS NV1 X NV2 NV1 SYS SYS PIX PHB 20-39,60-79
GPU6 SYS SYS NV1 SYS NV1 NV2 X NV2 SYS SYS PHB PIX 20-39,60-79
GPU7 SYS SYS SYS NV1 NV2 NV1 NV2 X SYS SYS PHB PIX 20-39,60-79
mlx5_0 PIX PIX PHB PHB SYS SYS SYS SYS X PHB SYS SYS
mlx5_1 PHB PHB PIX PIX SYS SYS SYS SYS PHB X SYS SYS
mlx5_2 SYS SYS SYS SYS PIX PIX PHB PHB SYS SYS X PHB
mlx5_3 SYS SYS SYS SYS PHB PHB PIX PIX SYS SYS PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
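The CPU Affinity column can also be applied outside of ucx_perftest. As a sketch (assuming the DGX-1 topology above; my_workload.py is a hypothetical script), a process driving GPU0 can be pinned to GPU0's local cores with taskset:
user@mlnx:~$ CUDA_VISIBLE_DEVICES=0 taskset -c 0-19,40-59 python my_workload.py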
NVLink
System Configuration
The NVLink connectivity on the system above (DGX-1) is not homogeneous: some GPUs are connected by a single NVLink connection (NV1, e.g., GPUs 0 and 1), others by two NVLink connections (NV2, e.g., GPUs 1 and 2), and some are not connected at all via NVLink (SYS, e.g., GPUs 3 and 4).
NVLink Performance
ucx_perftest should confirm NVLink bandwidth to be in the 20+ GB/s range:
CUDA_VISIBLE_DEVICES=0 UCX_TLS=cuda_ipc,cuda_copy,tcp ucx_perftest -t tag_bw -m cuda -s 100000000 -n 10 -p 9999 -c 0 & \
CUDA_VISIBLE_DEVICES=1 UCX_TLS=cuda_ipc,cuda_copy,tcp ucx_perftest `hostname` -t tag_bw -m cuda -s 100000000 -n 10 -p 9999 -c 1
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
+------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: tag match bandwidth |
| Data layout: (automatic) |
| Send memory: cuda |
| Recv memory: cuda |
| Message size: 100000000 |
+------------------------------------------------------------------------------------------+
10 0.000 4163.694 4163.694 22904.52 22904.52 240 240
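Other GPU pairs can be tested by changing CUDA_VISIBLE_DEVICES, with -c chosen from the CPU Affinity column above. For example (a sketch assuming the DGX-1 topology shown), GPUs 4 and 5 are NVLink-connected and local to cores 20-39,60-79, so the two processes are pinned to cores 20 and 21:
CUDA_VISIBLE_DEVICES=4 UCX_TLS=cuda_ipc,cuda_copy,tcp ucx_perftest -t tag_bw -m cuda -s 100000000 -n 10 -p 9999 -c 20 & \
CUDA_VISIBLE_DEVICES=5 UCX_TLS=cuda_ipc,cuda_copy,tcp ucx_perftest `hostname` -t tag_bw -m cuda -s 100000000 -n 10 -p 9999 -c 21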
Experimental Debugging
A list of problems we have run into along the way while trying to understand performance issues with UCX/UCX-Py:
System-wide environment variable settings. For example, we saw a system with UCX_MEM_MMAP_HOOK_MODE set to none. Unsetting this environment variable resolved the problems: https://github.com/rapidsai/ucx-py/issues/616. One can quickly check system-wide UCX variables with env | grep ^UCX_.
sockcm_iface.c:257 Fatal: sockcm_listener: unable to create handler for new connection. This is an error we have seen when limits are placed on the number of open file descriptors, and it occurs when SOCKCM is used for establishing connections. Users have two choices for resolving this issue: increase the open files limit (check the ulimit configuration) or use RDMACM when establishing a connection (UCX_SOCKADDR_TLS_PRIORITY=rdmacm). Note that RDMACM is only available with InfiniBand devices; see the sketch below.
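For example (a sketch; the limit value is illustrative, raising the hard limit may require administrator changes, and my_ucx_py_app.py is a hypothetical script):
user@dgx:~$ ulimit -n                  # check the current soft limit on open files
user@dgx:~$ ulimit -n 65536            # raise it for the current shell
user@dgx:~$ UCX_SOCKADDR_TLS_PRIORITY=rdmacm python my_ucx_py_app.py   # or prefer RDMACM for connection establishment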