Monitoring Transports
Below is a list of commonly used tools and commands for monitoring InfiniBand and CUDA IPC/NVLink traffic:
InfiniBand
Monitor the InfiniBand transmit data counters. The values should increase rapidly while InfiniBand traffic is flowing:
watch -n 0.1 'cat /sys/class/infiniband/mlx5_*/ports/1/counters/port_xmit_data'
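The raw counter is easier to interpret as a rate. The following is a minimal sketch, not part of any existing tool, that samples the same sysfs files twice and prints an approximate transmit rate. It assumes that port_xmit_data is reported in units of 4 bytes, as described in the Linux kernel's InfiniBand sysfs documentation, and that the interval is a whole number of seconds. The function names and the IB_SYSFS_ROOT variable are hypothetical and exist only for this illustration.

```shell
# Sketch: estimate InfiniBand transmit throughput from the sysfs counters.
# IB_SYSFS_ROOT and all function names are illustrative assumptions.
IB_SYSFS_ROOT="${IB_SYSFS_ROOT:-/sys/class/infiniband}"

# Sum port_xmit_data over port 1 of every mlx5 device (unit: 4-byte words).
sum_xmit_words() {
    local total=0 f value
    for f in "$IB_SYSFS_ROOT"/mlx5_*/ports/1/counters/port_xmit_data; do
        [ -r "$f" ] || continue      # skip when the glob matches nothing
        read -r value < "$f"
        total=$(( total + value ))
    done
    echo "$total"
}

# Convert two counter samples taken <interval> seconds apart to bytes/s.
words_to_bytes_per_sec() {
    local before=$1 after=$2 interval=$3
    echo $(( (after - before) * 4 / interval ))
}

# Take two samples one interval apart and print the transmit rate.
report_xmit_rate() {
    local interval=${1:-1} before after
    before=$(sum_xmit_words)
    sleep "$interval"
    after=$(sum_xmit_words)
    echo "$(words_to_bytes_per_sec "$before" "$after" "$interval") bytes/s"
}
```

After sourcing these functions, running report_xmit_rate 1 prints the aggregate transmit rate over a one-second window. The sum covers all mlx5 devices, so a per-device breakdown would need a loop over the individual counter files.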
CUDA IPC/NVLink
Monitor traffic over all GPUs:
nvidia-smi nvlink -gt d
Monitor traffic over all GPUs on counter 0:
Note
nvidia-smi nvlink -g is now deprecated. On recent drivers, prefer nvidia-smi nvlink -gt d as shown above.
# set counter 0 to count bytes (b) for all traffic (z)
nvidia-smi nvlink -sc 0bz
watch -d 'nvidia-smi nvlink -g 0'
Monitor GPU statistics with DCGM (field ID 449 reports total NVLink bandwidth):
dcgmi dmon -e 449