Health management in Soperator
Built-in health checks
For all deployment types, Soperator runs built-in health checks on a schedule on each worker node with GPUs. Most of these checks are considered critical: if a worker node fails a critical check, Soperator marks it as requiring further action. Critical checks include, but are not limited to, the following:
- GPU checks:
  - AllReduce, an NCCL test (with and without InfiniBand™, outside and inside Docker containers)
  - CUDA samples, such as vectorAdd, simpleMultiGPU, deviceQuery and p2pBandwidthLatencyTest
  - NVIDIA® Data Center GPU Manager (DCGM) diagnostics
  - GPU stress test
- RAM checks: bandwidth and latency
For more details about the checks, see:
- Active Checks – Health and system checks framework: description of the checks’ architecture and implementation
- soperator-activechecks Helm chart:
  - values.yaml: list of checks
  - scripts/: scripts for each check
Node isolation
When a critical check fails, Soperator performs the extensive check procedure:
- Drains the node, waiting for running Slurm jobs to finish. The drain reason has the [node_problem] prefix.
- Moves the node into the suspicious reservation, preventing new jobs from being scheduled on it.
- Runs extensive checks on the node, which include hardware-level tests and re-runs of most critical checks.
Node replacement
If the extensive checks fail, Soperator drains the node. The drain reason now has the [hardware_problem] prefix. Soperator marks all worker nodes with this prefix as unhealthy Kubernetes nodes, which triggers automatic re-creation of the node.
If the node passes the extensive checks, Soperator removes it from the suspicious reservation, and jobs can run on the node again.
In Managed Service for Soperator and Pro Solution for Soperator, Compute may schedule maintenance for the underlying virtual machine (VM) of a worker node during the extensive check procedure. This typically indicates a hardware issue that Compute has already detected. In this case, Soperator immediately stops the checks, and then drains and re-creates the node.
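The drain reason prefixes make it possible to tell from Slurm which stage of the procedure a drained node is in. The following is a minimal sketch and not part of Soperator: the function names, the stage labels and the `node|reason` input format are assumptions made for illustration. Only the [node_problem] and [hardware_problem] prefixes come from the procedure described above.

```shell
#!/bin/bash
# Hypothetical helpers: map Slurm drain reasons to stages of the
# extensive check procedure. Stage labels are made up for this sketch.

# Print a stage label for a single drain reason string.
classify_drain_reason() {
  case "$1" in
    "[node_problem]"*)     echo "under-extensive-checks" ;;
    "[hardware_problem]"*) echo "awaiting-replacement" ;;
    *)                     echo "other" ;;
  esac
}

# Read "node|reason" lines from standard input and print
# "node<TAB>stage" for each non-empty line.
summarize_drain_reasons() {
  local node reason
  while IFS='|' read -r node reason; do
    [ -n "$node" ] || continue
    printf '%s\t%s\n' "$node" "$(classify_drain_reason "$reason")"
  done
}
```

On a node where Slurm client commands are available, the input could come from `sinfo -h -R -o '%N|%E' | summarize_drain_reasons`, where `%N` prints the node names and `%E` prints the reason.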
Custom health checks (Slurm prolog and epilog programs)
All Soperator deployment types support Slurm prolog and epilog programs for job steps. You can configure them by using the --task-prolog and --task-epilog parameters of srun, either in batch scripts or in direct srun calls. The prolog and epilog programs specified in --task-prolog and --task-epilog run on each worker node before and after the job step that is launched by the srun call. You can use them to run custom health checks on worker nodes.
For example, you can run nvidia-smi before and after the training step in your batch script (my_ml_job.sh) to check the GPU utilization and health.
For more details about prolog and epilog programs, see the Slurm documentation.
By default, Soperator doesn’t auto-heal worker nodes that fail custom health checks. To set up custom auto-healing in your Managed Soperator or Pro Solution for Soperator clusters, contact support or your personal manager.
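The nvidia-smi example above could be implemented with a small wrapper script that serves as both the prolog and the epilog. This is a hedged sketch, not the script from the Soperator documentation: the file name gpu_check.sh, the GPU_QUERY_CMD and GPU_CHECK_LOG variables, and the default log location are all assumptions.

```shell
#!/bin/bash
# gpu_check.sh: hypothetical helper for --task-prolog and --task-epilog.
# Assumed overrides: GPU_QUERY_CMD (command to run), GPU_CHECK_LOG (log path).

gpu_check() {
  local cmd="${GPU_QUERY_CMD:-nvidia-smi}"
  local log="${GPU_CHECK_LOG:-/tmp/gpu_check_${SLURM_JOB_ID:-nojob}.log}"
  local status=0

  # Slurm reads a task prolog's standard output for lines such as
  # "export NAME=value", so the report is appended to a log file instead.
  {
    echo "=== ${HOSTNAME:-unknown host} $(date -u +%Y-%m-%dT%H:%M:%SZ) ==="
    $cmd || status=$?
    echo "=== exit status: ${status} ==="
  } >>"$log" 2>&1

  return "$status"
}

# Run the check only inside a Slurm job, where SLURM_JOB_ID is set.
if [ -n "${SLURM_JOB_ID:-}" ]; then
  gpu_check
fi
```

In my_ml_job.sh, the training step could then be launched as `srun --task-prolog=/path/to/gpu_check.sh --task-epilog=/path/to/gpu_check.sh python train.py`, where the script path and train.py are placeholders. The script must be executable and reachable at the same path on every worker node.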
Upstream health checks in Managed Soperator
In Managed Soperator, worker nodes are Compute virtual machines that serve as nodes in a Managed Service for Kubernetes cluster. Both Compute and Managed Kubernetes run their own health checks on worker nodes with GPUs, and Managed Soperator uses these health checks to automatically heal worker nodes, in addition to the built-in health management system.
Compute
Compute continuously monitors VMs for hardware problems. When it detects such a problem on a VM, Compute issues a maintenance event for that VM. If a VM with a maintenance event is associated with a GPU worker node in a Managed Soperator cluster, Managed Soperator drains the node, waiting for running Slurm jobs to finish, and then re-creates the node.
Kubernetes
When Managed Service for Kubernetes signals a Kubernetes-specific maintenance condition that was not triggered by Compute, Managed Soperator drains the worker node, waiting for running Slurm jobs to finish, and then restarts the node. For more details about maintenance events and automatic recovery of nodes, see the Managed Kubernetes documentation.
InfiniBand and InfiniBand Trade Association are registered trademarks of the InfiniBand Trade Association.