The NCCL Inspector is available on demand for Soperator clusters deployed in Nebius AI Cloud. For details, contact support or your personal manager.
Enabling NCCL Inspector for jobs
On a Soperator cluster prepared with NCCL Inspector support, enable it in your Slurm job (for example, a batch job) by setting the NCCL_INSPECTOR_ENABLE environment variable for sbatch or srun. You don’t need to change the training application code.
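As a minimal sketch of this step, the commands below write a batch script that exports the variable before launching the training command. The value `1`, the job name, the resource counts, and `train.py` are assumptions for illustration and are not taken from this page. Check the value that your cluster expects.

```shell
# Write a minimal batch script to a file.
# The job name, node counts, and train.py are hypothetical placeholders.
cat > nccl_inspector_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=nccl-inspector-demo
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

# Assumption: a value of 1 turns the inspector on.
export NCCL_INSPECTOR_ENABLE=1

# By default, srun passes the exported environment to every task.
srun python train.py
EOF

# Check the script for syntax errors without running it.
bash -n nccl_inspector_job.sh

# On the cluster, submit it with:
#   sbatch nccl_inspector_job.sh
```

Exporting the variable inside the script keeps the setting next to the job definition, so the training code itself stays unchanged.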
To collect data for NCCL point-to-point (P2P) operations, the NCCL Inspector requires NCCL 2.30.3 or higher.
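The version requirement can be checked before submitting a job. The sketch below compares two version strings with `sort -V` (available in GNU coreutils). The function name `version_at_least` and the sample value of `installed_nccl` are hypothetical. One way to find the real version is to run a job with `NCCL_DEBUG=VERSION`, which makes NCCL print its version at initialization.

```shell
# Succeeds (exit code 0) when version $1 is greater than or equal to version $2.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical value: replace it with the NCCL version that your job loads.
installed_nccl="2.30.3"

if version_at_least "$installed_nccl" "2.30.3"; then
  echo "NCCL $installed_nccl meets the requirement for P2P data collection"
else
  echo "NCCL $installed_nccl is older than 2.30.3"
fi
```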
srun call, add the --snccliprecon-enabled=0 parameter to the srun command.
Accessing Grafana dashboards
The following NCCL Inspector dashboards are available in Grafana:
- NCCL Inspector Job Performance: primary per-job view.
- NCCL Inspector Metrics: metric-level overview.
- NCCL Inspector Raw Metrics: raw metrics view.
See also
- Monitoring metrics of Soperator clusters
- Enhancing Communication Observability of AI Workloads with NCCL Inspector in the NVIDIA Technical Blog
- NCCL Inspector in NCCL repository on GitHub
The Grafana Labs Marks are trademarks of Grafana Labs, and are used with Grafana Labs’ permission. We are not affiliated with, endorsed or sponsored by Grafana Labs or its affiliates.