Monitoring Jobs and Resources

Checking Slurm job status

Slurm’s squeue command can be used to keep track of jobs that have been submitted to the job queue.

squeue
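By default, squeue prints one row per job with columns for the job ID, partition, job name, user, state code (ST), elapsed time, node count, and the node list or pending reason. The listing below is only an illustration; the job IDs, partition, and node names are hypothetical.

 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    42   compute   my_job     jdoe  R       5:12      2 node[001-002]
    43   compute   my_job     jdoe PD       0:00      4 (Resources)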

You can use optional flags, such as --user and --partition, to filter results by the username or compute partition associated with each job.

squeue --user=USERNAME --partition=PARTITION
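The --states (or -t) flag is another standard squeue filter and can be combined with the options above, for example to watch only your own pending and running jobs. The partition name below is hypothetical.

squeue --user=$USER --partition=compute --states=PENDING,RUNNING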

Slurm jobs have a state code associated with them that changes during the lifespan of the job. Common codes are listed below; an example of inspecting a job's state directly follows the list.

  • CF | The job is in a configuring state. Typically this state is seen when autoscaling compute nodes are being provisioned to execute work.

  • PD | The job is in a pending state.

  • R | The job is in a running state.

  • CG | The job is in a completing state and the associated compute resources are being cleaned up.

  • (Resources) | Shown in the reason column alongside a pending job when there are insufficient resources available to schedule it at the moment.
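To check the state and pending reason of a single job directly, squeue's --jobs and --format options can be combined as sketched below. The %T specifier prints the job state and %r the pending reason; the job ID is hypothetical.

squeue --jobs=42 --format="%.10i %.10T %.20r"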

Checking Slurm compute node status

Slurm’s sinfo command can be used to keep track of the compute nodes and partitions available for executing workloads.

sinfo
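Default sinfo output groups nodes by partition and state, one row per group, with columns for the partition, availability, time limit, node count, state, and node list. The listing below is only an illustration; the partition and node names are hypothetical.

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      2   idle node[001-002]
compute*     up   infinite      1  alloc node003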

Compute nodes have a state code associated with them that changes during the lifespan of each node. A few common state codes are shown below; a more detailed list can be found in SchedMD’s documentation.

  • idle | The compute node is in an idle state and can receive work.

  • down | The compute node is in a down state and may need to be drained and returned to an idle state before it can accept work again. Downed nodes can also be symptomatic of other issues on your cluster, such as insufficient quota or improperly configured machine blocks.

  • mixed | A portion of the compute node’s resources has been allocated, but additional resources are still available for work.

  • allocated | The compute node is fully allocated.

Additionally, each state code may carry a modifier with the following meanings:

  • ~ | The compute node is in a “cloud” state and will need to be provisioned before receiving work.

  • # | The compute node is currently being provisioned (powering up).

  • % | The compute node is currently being deleted (powering down).
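To see each node’s state individually, including any modifier such as ~ or #, sinfo’s node-oriented output can be used. The format string below uses the standard %N (node name), %t (compact state, which includes the modifier), and %T (full state) specifiers; filtering on specific states with --states is also possible.

sinfo --Node --format="%.20N %.8t %.14T"

sinfo --Node --states=idle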