Submitting Jobs

Introduction

On the HPC, users submit all of their jobs through a workload manager called Slurm. This lets us schedule jobs and make full, efficient use of all of our nodes.

It provides three key functions:

  • Allocating exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.
  • Providing a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes.
  • Arbitrating contention for resources by managing a queue of pending work.

Partition

Within Slurm, nodes are divided into partitions based on their features; you choose one with the -p/--partition option, as in the example after the list below.

  • amd
    • Nodes: a001-a012
    • Cores: AMD, 128 cores
    • RAM: 500G
    • Max 720 hours or 30 days per job
  • gpu
    • gpu001
      • Cores: AMD, 128 cores
      • RAM: 500G
      • GPU: 4x Nvidia A100 (80GB)
      • Max 720 hours or 30 days per job
    • gpu002
      • Cores: Intel, 48 cores
      • RAM: 750G
      • GPU: 2x Nvidia V100 (16GB)
      • Max 720 hours or 30 days per job
      • NOTE: Does not have high-speed InfiniBand
  • short
    • Includes all nodes of the amd partition
    • Has the highest priority and a very low wait time when the HPC is busy
    • Max 2 hours per job
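
To target a particular partition, pass its name to srun or sbatch with -p/--partition. For example, a quick test run on the short partition (a sketch; ./my_test is a hypothetical executable):

srun -p short -n 1 --time 0-00:30:00 ./my_test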

Running Jobs

Running a job using srun

To use slurm srun, you need to specify the executable and its arguments that you want to run as a parallel job. You can also use various options to control how the job is executed, such as the number of tasks, the number of nodes, the number of CPUs per task, etc.

You can use the following syntax to run slurm srun:

srun [OPTIONS(0)... [executable(0) [args(0)...]]] [: [OPTIONS(N)...]] executable(N) [args(N)...]

You can specify one or more executables and their arguments, separated by colons, to run multiple programs as components of a co-scheduled heterogeneous job. For more details, see the Slurm documentation on heterogeneous jobs.

For example, to run a simple hello world program with 4 tasks on 2 nodes, requesting 1 day and 15 minutes, you can use:

srun -n 4 -N 2 --time 1-00:15:00 ./hello_world

To run two different programs with different options in a heterogeneous job, you can use:

srun -n 2 -N 1 ./program1 : -n 4 -N 2 ./program2

To run a simple program using 12 cores and 100G of memory:

srun -c 12 --mem=100G ./program

Options and flags for srun

Slurm srun has many options and flags that you can use to customize its behavior and output. Here are some of the most common ones:

  • -c, --cpus-per-task=ncpus: Request that ncpus be allocated per process. The default is one processor per process unless otherwise specified by environment variable SLURM_CPUS_PER_TASK.
  • --mem=<size>[units]: Specify the real memory requested per node. Default units are megabytes. Different units can be specified using the suffix [K|M|G|T].
  • -D, --chdir=directory: Change working directory before running the job.
  • -e, --error=err: File name for stderr. By default both stdout and stderr go into a file of the form slurm-%j.out where %j is replaced by the job ID.
  • -h, --help: Display help message and exit.
  • -J, --job-name=name: Specify a name for the job allocation.
  • -l, --label: Prepend task number to lines of stdout/err.
  • -n, --ntasks=n: Request that n tasks be invoked on allocated nodes. The default is one task per node unless otherwise specified by environment variable SLURM_NTASKS.
  • -N, --nodes=minnodes[-maxnodes]: Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes.
  • -o, --output=out: File name for stdout. By default both stdout and stderr go into a file of the form slurm-%j.out where %j is replaced by the job ID.
  • -p, --partition=partition_names: Request a specific partition for the resource allocation. If not specified, the default behavior is to allow the slurm controller to select the default partition as designated by system administrator.
  • -q, --quit-on-interrupt: If set, when srun receives an interrupt signal it will exit immediately with exit code 1 rather than just terminating any pending or running jobs.
  • -r, --relative=n: Run first n tasks on node 0 (relative node addressing).
  • -s, --share: Allow other jobs to share allocated nodes with this job.
  • -t, --time=day-hour:minute:second: Set a limit on the total run time of the job allocation. If the requested time limit exceeds the partition's time limit, the job will be left in a PENDING state until it can be started or cancelled by an administrator or user respectively.
  • -u, --unbuffered: Line buffered stdout/err (may have performance impact).
  • -v, --verbose: Increase verbosity of output.
  • -V, --version: Display version information and exit.

For a complete list of options and flags, please refer to the slurm srun manual page.
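
These options can be combined. For example, a sketch that runs a hypothetical ./program on the amd partition with 8 tasks across 2 nodes, 4 CPUs per task, 64G of memory per node, a 4-hour time limit, a job name, and a custom output file:

srun -p amd -J my_test -n 8 -N 2 -c 4 --mem=64G -t 0-04:00:00 -o my_test.out ./program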

Tips and tricks for srun

Here are some tips and tricks for using slurm srun effectively:

  • If you want to run an interactive shell session on a cluster node, you can use:
srun --pty bash -l
  • If you want to run an MPI program with slurm srun, you don't need to specify any MPI launcher such as mpirun or mpiexec, as slurm srun will automatically launch your program with MPI support. For example, to run an MPI hello world program with 4 tasks on 2 nodes, you can use:
srun -n 4 -N 2 ./mpi_hello_world
  • If you want to run an OpenMP program with slurm srun, you need to specify the number of threads per task with either the environment variable OMP_NUM_THREADS or the OpenMP API function omp_set_num_threads, and request a matching number of CPUs per task with -c. For example (a sketch assuming a hypothetical program ./openmp_hello that uses 8 threads):
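export OMP_NUM_THREADS=8       # assumed thread count for this sketch
srun -c 8 ./openmp_hello       # request 8 CPUs for one task; ./openmp_hello is hypothetical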

Running a job using sbatch

sbatch is very similar to srun, but instead of running a command interactively it submits a batch script to the queue.

Below is a sample script, named sbatch_script.sh, that runs an R script in a job. It requests a single node, 10 cores, 10G of memory, and a run time of 1 day and 15 minutes. An email will be sent to the user when the status of the job changes (start, failed, completed).

#!/bin/bash -l

#SBATCH --nodes=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=10G
#SBATCH --time=1-00:15:00     # 1 day and 15 minutes
#SBATCH --mail-user=useremail@address.com
#SBATCH --mail-type=ALL
#SBATCH --job-name="just_a_test"
#SBATCH -p amd # This is the default partition; you can use any of the following: amd, short, or gpu

# Print current date
date

# Load r
module load r

# Run the R script
Rscript script.R

# Print name of node
hostname

To submit the above script as a job:

sbatch sbatch_script.sh

For more details on how to use sbatch, run man sbatch or visit the Slurm sbatch documentation.
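
When you submit the script, sbatch prints the job ID. If you want to capture that ID in a shell variable (for example, to check on the job or cancel it later), the --parsable option makes sbatch print only the ID; a minimal sketch:

jobid=$(sbatch --parsable sbatch_script.sh)   # submit and capture the job ID
squeue -j "$jobid"                            # check the job's status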

Running a GPU job

Note that GPUs are only available in the gpu partition.

GPUs are a limited resource. Please do not request one if your job does not need it, and do not take up all of the memory on a node when using a single GPU, since other users will then be unable to use the remaining GPUs on that server.

Interactive job

To request a GPU in your job, use the following option:

  • --gres=gpu[:type]:N: Specify the number of generic resources (GPUs) required per node. For example, --gres=gpu:2, or, if you need a specific GPU type, --gres=gpu:a100:2.

Sample srun command requesting a single GPU of any type, running for 1 day and 15 minutes:

srun -p gpu --gres=gpu:1 --time 1-00:15:00 --pty bash -l

Please see Hardware for the available GPU counts and types.

Noninteractive job

Within your sbatch script, add another line:

#SBATCH --gres=gpu:1

Then submit your job using sbatch:

sbatch sbatch_script.sh

Or you can submit your sbatch script without editing it:

sbatch --gres=gpu:1 sbatch_script.sh
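
Once a GPU job starts, you can check which GPU(s) were assigned to it, for example from an interactive GPU session (a sketch; the exact output depends on the node and the cluster configuration):

nvidia-smi                     # list the GPU(s) visible to this job
echo $CUDA_VISIBLE_DEVICES     # GPU index(es) assigned by Slurm, if set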

How to cancel job(s)

To use slurm scancel, you need to know the job ID or the job step ID of the job or job step that you want to signal or cancel. You can find these IDs by using the squeue command or by looking at the output of the sbatch command when you submit a job.

You can then use the following syntax to run slurm scancel:

scancel [OPTIONS...] [job_id[_array_id][.step_id]] [job_id[_array_id][.step_id]]...

You can specify one or more job IDs or job step IDs, separated by spaces, to signal or cancel multiple jobs or job steps at once. You can also use job specification filters, such as --name, --user, --partition, etc., to select jobs based on certain criteria.

For example, to cancel all your pending jobs on a certain partition, you can use:

scancel --user=$USER --partition=partition_name --state=PENDING

To signal a specific job step with a signal number (e.g. 9 for SIGKILL), you can use:

scancel --signal=9 job_id.step_id

To cancel a specific element of a job array, you can use:

scancel job_id_array_id

To cancel all elements of a job array, you can use:

scancel job_id

Options and flags for scancel

Slurm scancel has many options and flags that you can use to customize its behavior and output. Here are some of the most common ones:

  • -b, --batch: Signal only the batch step (the shell script) of the job, but not any other steps. This is useful when the shell script has to trap the signal and take some application defined action.
  • -M, --clusters=string: Specify the cluster to issue commands to. Note that the SlurmDBD must be up for this option to work properly.
  • --ctld: Send the job signal request to the slurmctld daemon rather than directly to the slurmd daemons. This increases overhead, but offers better fault tolerance. This is the default behavior on architectures using front end nodes (e.g. Cray ALPS computers) or when the --clusters option is used.
  • -c, --cron: Confirm request to cancel a job submitted by scrontab. This option only has an effect when the "explicit_scancel" option is set in ScronParameters.
  • -f, --full: Signal both the batch step and its children processes. This is useful when you want to terminate all processes associated with a job.
  • -h, --help: Display help message and exit.
  • -i, --interactive: Prompt for confirmation before signaling any jobs.
  • -n, --name=name: Restrict the scancel operation to jobs with this name.
  • -p, --partition=partition_name: Restrict the scancel operation to jobs on this partition.
  • -q, --quiet: Disable warnings about invalid jobs or job steps specified on command line.
  • -s, --signal=signal_number: Specify which signal number (e.g. 9 for SIGKILL) or name (e.g. KILL) to send to selected jobs or job steps. The default signal is SIGINT (2).
  • -u, --user=user_name: Restrict the scancel operation to jobs owned by this user.
  • -v, --verbose: Increase verbosity of output.
  • -V, --version: Display version information and exit.

For a complete list of options and flags, please refer to the slurm scancel manual page.
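
These flags can be combined. For example, a sketch that asks for confirmation before cancelling each of your jobs named just_a_test on the amd partition:

scancel -i -u $USER -n just_a_test -p amd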

Tips and tricks for scancel

Here are some tips and tricks for using slurm scancel effectively:

  • If you want to cancel all your jobs on a cluster, you can use:
scancel -u $USER -M cluster_name
  • If you want to cancel all your jobs on all clusters, you can use:
scancel -u $USER -M all
  • If you want to cancel all your jobs that have been running for more than an hour (any running job whose elapsed TIME field contains an hours or days part), you can use:
squeue -u $USER -h -t R -o "%A %M" | awk -F'[-: ]' 'NF >= 4 {print $1}' | xargs scancel
  • If you want to cancel all your jobs that have a certain name (e.g. test), you can use:
scancel -u $USER -n test
  • If you want to cancel all your jobs that match a certain pattern in their name (e.g. test*), you can use:
squeue -u $USER -o "%A %j" | grep "test.*" | awk '{print $1}' | xargs scancel

How to query the status of your jobs

To see the status of your jobs, you can use the squeue command, which queries the current job queue and lists its contents. You can use various options to filter or sort the output, such as:

  • -a which lists all jobs
  • -t R which lists all running jobs
  • -t PD which lists all pending (non-running) jobs
  • -p amd which lists all jobs in the amd partition
  • -j [jobid] which lists only the job with ID [jobid]
  • --user=[userid] which lists the jobs submitted by [userid]
  • --start --user=[userid] which lists the jobs submitted by [userid], with their current start-time estimates (as available)

The output of squeue will show you information such as:

  • JOBID which is the unique identifier of your job
  • PARTITION which is the queue where your job is submitted
  • NAME which is the name of your job script or command
  • USER which is your username
  • ST which is the state of your job (see Table 1 for possible values)
  • TIME which is the time used by your job
  • NODES which is the number of nodes allocated to your job
  • NODELIST (REASON) which is the list of nodes assigned to your job and the reason for its state (if applicable)

Table 1: Job state codes

Job state   Description
PD          Pending
R           Running
CG          Completing
CD          Completed
F           Failed
TO          Timeout (job reached its time limit)
S           Suspended
ST          Stopped
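
For example, to list only your own jobs with these columns, you can pass a custom output format to squeue (a sketch using standard squeue format codes; the column widths are arbitrary):

squeue -u $USER -o "%.10A %.9P %.20j %.8u %.2t %.10M %.5D %R"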

How to attach to a running job

If you want to interact with a running job, you can use the srun command with the --jobid=<SLURM_JOBID> option and a shell command, such as:

srun --jobid=<SLURM_JOBID> --pty bash -l

This command will place your shell on the head node of the running job (job in an “R” state in squeue). From there you can run commands such as top, htop, or ps to examine the running processes. If the job has more than a single node, you can ssh from the head node to the other nodes in the job (See the "SLURM_JOB_NODELIST" environment variable or squeue output for the list of nodes assigned to a job). Exiting from the shell will exit the srun command and return your shell to the original login node session.
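
For example, to see which nodes a multi-node job occupies before attaching or ssh-ing to them (a sketch; replace <SLURM_JOBID> with your job's ID):

squeue -j <SLURM_JOBID> -o "%N"    # print the nodelist of the job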

How to inspect the output and error files of your jobs

While your batch job is running, you will be able to monitor the standard output and error files of your job. By default, Slurm writes standard output (stdout) and standard error (stderr) into a single file named slurm-<jobid>.out, where <jobid> is your job ID. This file is created in the directory from which you submitted your job. You can use commands such as cat, tail, or less to view the contents of this file.
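
For example, to follow the output of a running job live (replace <jobid> with your job ID):

tail -f slurm-<jobid>.out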

To separate the stderr from stdout, you can specify different names or locations for these files using slurm directives in your job script, such as:

#SBATCH --output=my_output.txt
#SBATCH --error=my_error.txt

or

#SBATCH --output=/full/path/to/directory/my_output.txt
#SBATCH --error=/full/path/to/directory/my_error.txt
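
Slurm also supports filename patterns in these directives, such as %j (job ID) and %x (job name), so each job writes to its own files; a minimal sketch:

#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err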

How to view the history and details of your past jobs

To view information about your past jobs, you can use the sacct command, which queries the accounting database for historical data. You can use various options to filter or sort the output, such as:

  • -b which shows a brief listing of past jobs
  • -l -j <jobid> which shows detailed historical information for the past job with ID <jobid>
  • -S <starttime> which shows information for jobs that started after <starttime>
  • -E <endtime> which shows information for jobs that ended before <endtime>
  • -u <userid> which shows information for jobs submitted by <userid>
  • --partition=<partition> which shows information for jobs submitted to <partition>

The output of sacct will show you information such as:

  • JobID which is the unique identifier of your job
  • JobName which is the name of your job script or command
  • User which is your username
  • Partition which is the queue where your job was submitted
  • State which is the final state of your job (see Table 1 for possible values)
  • Start which is when your job started
  • End which is when your job ended
  • Elapsed which is how long your job ran
  • CPUTimeRaw which is how much CPU time (in seconds) was used by all tasks in your job
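
You can also choose exactly which of these fields to display with --format. A minimal sketch that lists your jobs since an arbitrary example date:

sacct -u $USER -S 2024-01-01 --format=JobID,JobName,Partition,State,Start,End,Elapsed,CPUTimeRaw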

For more details on how to use sacct, run man sacct or visit the Slurm sacct documentation.

Tips and tricks

Before following the rest of this guide, make sure you are on the same server that is running your job, and have a terminal open if you are using OnDemand.

This can be opened on OnDemand by clicking Applications on the top left and then clicking Terminal Emulator.

Also note that you should run squeue --nodelist=$(hostname -s) to check whether anyone else has a job on the same server; the numbers in the sections below will not be accurate if another user's job is running on the same server.

Check current CPU usage

To check the current CPU usage of your running job, run htop.
The display refreshes automatically.
You will need to look at the line that says Load average.

  • The first number is the average load in the last 1 minute.
  • The second number is the average load in the last 5 minutes.
  • The third number is the average load in the last 15 minutes.

To quit, press q.

Alternatively, you can use the uptime command, which is much simpler, but it does not refresh.
If you want it to refresh every second, use watch -n1 uptime; press Ctrl+C to quit.
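
The load average is node-wide and is most meaningful relative to the total number of cores on the node; a load roughly equal to the number of cores you requested suggests your job is using them fully. A sketch to print the node's core count:

nproc --all    # total number of CPU cores on this node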

Check current memory usage

To check the current memory usage of your current job, run free -h.
Output example:

              total        used        free      shared  buff/cache   available
Mem:          503Gi       153Gi       274Gi        70Gi        75Gi       276Gi
Swap:          15Gi       1.1Gi        14Gi

To calculate an estimate of what your job is using, use the Mem row.
Add the used and buff/cache columns, and subtract 20G (the 20G is a rough estimate of what the system itself uses without any job running).
In the example the total current usage is 153 + 75 - 20 = 208G.
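
If you want this as a single number, a rough sketch of the same calculation using free -g (which prints plain integers in GiB; the 20G system overhead is the same estimate as above):

free -g | awk '/^Mem:/ {print $3 + $6 - 20 " GiB (approximate job usage)"}'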

Get average CPU and memory usage of a completed job

This command can be run from anywhere, but the job needs to have finished.
Run seff jobid, for example seff 1.
Look at CPU Efficiency and Memory Efficiency, which tell you how efficiently the job used the CPU and memory it requested.
Note that these numbers are averages over the run time of the job, not the peak CPU or memory usage.
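
If you need the peak memory of a finished job rather than the average, sacct can report it through the MaxRSS field; a minimal sketch (replace jobid with your job ID):

sacct -j jobid --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS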