The PBS Batch System

PBS on the Fates

The 25-word Explanation

Have executable/script need run. Type bsubuvic <executable>. Job runs, dumps result to directory from where job is run. Job directory must be in home directory.

Note: For arguments, do bsubuvic <myjob> <myarg1> <myarg2> ...

What is PBS?

PBS is the Portable Batch System, which can be used to create and submit batch jobs to a large number of cluster machines. A batch job is simply a shell script containing a set of commands you want to run on some set of execution machines. The script can contain (i) the characteristics (attributes) of the job and (ii) the resource requirements (such as memory, CPU time, etc.) that the job needs.

PBS also provides a special kind of batch job called an interactive-batch job, which is sometimes useful for debugging an application or for computational steering.

Creating a Sample PBS Job

The first line is standard for any shell script: it names the shell that executes the commands. The script then lists the resource requirements, the job attributes (if necessary) and the executable name. All PBS directives for resources and job attributes in a shell script start with #PBS. The executable can take arguments too.

Example of a PBS job script that runs an executable named subrun:

#! /bin/sh
#PBS -l walltime=2:00:00
#PBS -l mem=800mb
#PBS -l ncpus=1
#PBS -j oe
cd /homes/agarwal/release/workdir
./subrun

Note: This is one of the most confusing aspects of PBS. A PBS job script is basically a wrapper around another executable or script. You can avoid the whole hassle of writing job scripts by simply using the bsubuvic program.

In the above example, the Bourne shell (sh) is used, but you can select your favourite shell. Lines starting with #PBS are PBS directives. The -l option requests resources, and -j oe merges standard output (o) and standard error (e) into the same file. The resources requested here are walltime (a maximum of 2 hours of real time), mem (800 MB of memory) and ncpus (one CPU).

cd changes to the working directory /homes/agarwal/release/workdir, and ./subrun is the executable name.

Note: Most users do not need #PBS directives and can write a simple job script without them.

Submitting PBS Jobs from the Fates

The qsub command is used to submit jobs to batch queues. The -l and -j directives used in the above example are also options to qsub.

When a job is submitted, PBS returns a line like the following:

[wyvern@fate-1 wyvern]$ 20.calliope.beowulf.phys.uvic.ca

where 20 is the job id and calliope.beowulf.phys.uvic.ca is the PBS server name. The job id is needed for any action involving the job, such as checking, modifying or deleting it.
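If a script needs the numeric job id on its own, the identifier can be split at the first dot with standard shell parameter expansion. The following is a minimal sketch that uses the example identifier from above; the variable names are illustrative only and are not part of PBS.

```shell
#!/bin/sh
# Identifier as returned by PBS at submission time
# (example value taken from the text above).
job_identifier="20.calliope.beowulf.phys.uvic.ca"

# Everything before the first dot is the job id.
job_id="${job_identifier%%.*}"

# Everything after the first dot is the PBS server name.
pbs_server="${job_identifier#*.}"

echo "job id: $job_id"
echo "server: $pbs_server"
```

In practice the identifier would be captured from the submission command itself rather than typed in by hand.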

Note: By default, user jobs get placed on the queue named workq. There is another queue, short, which is meant for quick jobs. You can specify which queue to use with the -q option.

Important: Each submitted job actually submits two jobs: the real job and a 10-second 'sleeper' job. This extra job is designed to spread multiple job submissions across nodes, not just CPUs. For example, if you submit 6 jobs they would normally be assigned to 3 nodes, as each CPU on our dual-CPU nodes counts as a separate machine. The sleeper jobs 'occupy' the second CPU on each node for a short while, so the real jobs are spread across machines. This is done to make the most of the available bandwidth on each machine.

Checking the Status of PBS Jobs

The qstat command is for checking the PBS job status. If you want to display full or long information about the job whose id is 54419, use:

[wyvern@fate-1 wyvern]$ qstat -f 54419

This returns all the information available for that job:

Job Id: 54419.calliope.beowulf.phys.uvic.ca
    Job_Name = sleep-10
    Job_Owner = wyvern@fate-1.beowulf.phys.uvic.ca
    job_state = Q
    queue = workq
    server = calliope.beowulf.phys.uvic.ca
    Checkpoint = u
    ctime = Wed Nov 26 13:56:38 2003
    Error_Path = fate-1.beowulf.phys.uvic.ca:/homes/wyvern/sleep-10.e54419
    Hold_Types = n
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Wed Nov 26 13:56:38 2003
    Output_Path = fate-1.beowulf.phys.uvic.ca:/dev/null
    Priority = 0
    qtime = Wed Nov 26 13:56:38 2003
    Rerunable = True
    Resource_List.ncpus = 1
    Variable_List = PBS_O_HOME=/homes/wyvern,PBS_O_LANG=en_US,
        PBS_O_LOGNAME=wyvern,
        PBS_O_PATH=/net/cern/root-3.05.07/bin:/net/cernlib/pro/bin:/usr/kerber
        os/bin:/bin:/usr/bin:/usr/local/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/pb
        s/bin:/homes/wyvern/bin,PBS_O_MAIL=/var/spool/mail/wyvern,
        PBS_O_SHELL=/bin/bash,PBS_O_HOST=fate-1.beowulf.phys.uvic.ca,
        PBS_O_WORKDIR=/homes/wyvern,PBS_O_SYSTEM=Linux,PWD=/homes/wyvern,
        XAUTHORITY=/homes/wyvern/.xauthtfvhOb,
        SHLIB_PATH=/net/cern/root-3.05.07/lib:,
        HOSTNAME=fate-1.beowulf.phys.uvic.ca,
        LD_LIBRARY_PATH=/net/cern/root-3.05.07/lib:/net/cernlib/pro/lib:,
        mathlib=/net/cernlib/pro/lib/libmathlib.a,QTDIR=/usr/lib/qt3-gcc2.96,
        hardware=pclinux,LESSOPEN=|/usr/bin/lesspipe.sh %s,CERN=/net/cernlib,
        USER=wyvern,LS_COLORS=,kernlib=/net/cernlib/pro/lib/libkernlib.a,
        MAIL=/var/spool/mail/wyvern,INPUTRC=/etc/inputrc,
        packlib=/net/cernlib/pro/lib/libpacklib.a,LANG=en_US,
        DISPLAY=localhost:13.0,LOGNAME=wyvern,SHLVL=2,
        graflib=/net/cernlib/pro/lib/libgraflib.a,_=/usr/pbs/bin/qsub,
        HARDWARE=pclinux,SHELL=/bin/bash,HISTSIZE=1000,TERM=Eterm,
        HOME=/homes/wyvern,SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,
        ROOTSYS=/net/cern/root-3.05.07,
        PATH=/net/cern/root-3.05.07/bin:/net/cernlib/pro/bin:/usr/kerberos/bin
        :/bin:/usr/bin:/usr/local/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/pbs/bin:
        /homes/wyvern/bin,PBS_O_QUEUE=workq
    comment = Not Running: Queue job limit has been reached.
    etime = Wed Nov 26 13:56:38 2003

To obtain a quick list of all batch jobs running or waiting in the batch queue, use qstat with no options. Other useful options include -n, which shows which nodes (if any) were assigned to a job, and -a, which gives a bit more information about all current jobs.

Note: Jobs named sleep-10 are the aforementioned 'sleeper' jobs.
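The full listing is long, and often only one or two fields matter. The following is a minimal sketch of pulling single fields out of qstat -f style output with sed. To keep it self-contained, it reads a few lines copied from the listing above instead of calling qstat; the helper name get_field is hypothetical, and the sketch does not handle values that wrap over several lines (such as Variable_List).

```shell
#!/bin/sh
# A few lines copied from the qstat -f listing above, standing in for
# the output of a live call such as: qstat -f 54419
sample_output='Job Id: 54419.calliope.beowulf.phys.uvic.ca
    Job_Name = sleep-10
    job_state = Q
    queue = workq
    comment = Not Running: Queue job limit has been reached.'

# get_field NAME: read qstat -f style text on standard input and print
# the value of the first "NAME = value" line.
get_field() {
    sed -n "s/^[[:space:]]*$1 = //p" | head -n 1
}

job_state=$(printf '%s\n' "$sample_output" | get_field job_state)
queue=$(printf '%s\n' "$sample_output" | get_field queue)
comment=$(printf '%s\n' "$sample_output" | get_field comment)

echo "state=$job_state queue=$queue"
echo "comment=$comment"
```

Against a live system, the sample text would be replaced by piping the output of qstat -f into get_field.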

Deleting a Job from the Queue

The qdel command is used to delete a job from the queue. For example, to delete the job with job id 54418, use:

[wyvern@fate-1 wyvern]$ qdel 54418

Submission of Job Attributes Through Command Line

In the above example, the PBS resource directives are passed through the shell script. However, you can override these resource attributes by specifying them on the command line. Resources also need not be defined in the shell script at all; they can be given entirely on the command line. This is useful if you just want to run a single instance or a few instances of your job. Note that the resource list after -l is separated by commas with no spaces.

[wyvern@fate-1 wyvern]$ qsub -l walltime=2:00:00,mem=800mb,ncpus=1 -j oe mysubrun2

where mysubrun2 contains only the following three lines:

#! /bin/sh
cd /homes/agarwal/release/workdir
./subrun

Getting Standard Output and Standard Error

Whenever you submit a job with the -j oe option, both standard output and standard error are written to a single file named after the job script, with the job id as its extension (i.e. mysubrun2.20, where mysubrun2 is the job script name and 20 is the job id).

If the -j oe option is not used, then output and error files are written separately and their file names are mysubrun2.o20 (o means standard output) and mysubrun2.e20 (e means standard error). This is the default behaviour.
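The naming rule for the separate output files can be written down in a couple of lines of shell, which is handy when a follow-up script has to find them. This is only a sketch of the rule as described above; the script name and job id are the example values from the text, and the variable names are illustrative.

```shell
#!/bin/sh
# Example values from the text above.
script_name="mysubrun2"
job_id="20"

# Default names when -j oe is not used:
#   <script name>.o<job id> holds standard output
#   <script name>.e<job id> holds standard error
stdout_file="${script_name}.o${job_id}"
stderr_file="${script_name}.e${job_id}"

echo "standard output: $stdout_file"
echo "standard error:  $stderr_file"
```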

Note: These output files are written to the current working directory from which the job is submitted. If you want the output and error files to go somewhere else, you must define their paths:

[wyvern@fate-1 wyvern]$ qsub -o /homes/agarwal/mysubrun2.o20 -e /tmp/mysubrun2.e20 ./mysubrun2

You can request that both output and error messages go to the same file:

[wyvern@fate-1 wyvern]$ qsub -j oe -o /homes/agarwal/results.oe ./mysubrun2

Exporting Shell Environment Variables

Sometimes it is necessary to export the shell environment variables along with the job for its execution. If so, use the -V option of the qsub command, like so:

[wyvern@fate-1 wyvern]$ qsub -V -o /homes/agarwal/result.dat ./mysubrun2

The above job will execute on any muse node that is free at the time of submission. The default job name in the batch queue is the executable name.

Note: The -l option can be used to specify many different sorts of resources.

Changing the Job Name while the Job is in the Queue

Sometimes it is desirable to specify the job name in the batch queue. If so, use the -N option, like so:

[wyvern@fate-1 wyvern]$ qsub -V -l nodes=muse21 -N MCjob -o /homes/agarwal/result.dat ./mysubrun2

Here MCjob is the specified job name.

Available Queues on the Muse Cluster

At UVIC, we have set up 2 batch queues: workq, for unlimited execution time, and short, for short jobs with a maximum of 59 minutes of CPU time. The default queue is workq. If you do not specify a queue name, the job will run in workq.

Suppose you want to submit a script mysubrun2 in the batch queue short. The command to submit the job is:

[wyvern@fate-1 wyvern]$ qsub -q short mysubrun2

where -q is the option for defining the queue name. If you do not use the -q option, the job will go to workq.
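The choice between the two queues can be made by a small helper when many jobs are submitted from a script. This is a minimal sketch that assumes you can estimate the CPU time of a job in minutes; the function name pick_queue is hypothetical, and the 59-minute limit is the one quoted above.

```shell
#!/bin/sh
# pick_queue MINUTES: print the queue suited to a job expected to need
# MINUTES of CPU time, using the 59-minute limit of the short queue.
pick_queue() {
    if [ "$1" -le 59 ]; then
        echo short
    else
        echo workq
    fi
}

quick_queue=$(pick_queue 30)
long_queue=$(pick_queue 240)

echo "30-minute job  -> $quick_queue"
echo "240-minute job -> $long_queue"

# The result can then be passed to qsub, for example:
#   qsub -q "$(pick_queue 30)" mysubrun2
```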

Useful Batch Job Script: bsubuvic

Bob [Kowalewski] noticed that PBS has a limitation: qsub cannot submit a job script with arguments. For example, if you want to run the script myJob with arguments myArg1 and myArg2, i.e.:

qsub myJob myArg1 myArg2

it will not work. However, you can always write a wrapper script around the qsub command and its options to provide the features you need.

Bob has found such a script (modified by Jan) which overcomes the above PBS limitation. The script is in /usr/local/bin/bsubuvic and can run a job with arguments, such as:

bsubuvic myJob myArg1 myArg2

This script makes PBS job submission look like LSF job submission, which is what is mostly used at CERN.

The bsubuvic script has the following other advantages: (i) it makes the batch job execute in the directory from which it was submitted, (ii) it transfers environment variables, (iii) it keeps stdout and stderr on the batch node until the job ends, and (iv) it names the job after the binary/script, limited to 15 letters.
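To illustrate the wrapper idea, here is a minimal sketch of how a job script carrying arguments could be generated. It is not the actual bsubuvic script, whose contents are not shown here; the helper names make_job_script and job_name are hypothetical, and the simple quoting assumes that no argument contains a single quote.

```shell
#!/bin/sh
# Sketch of a qsub wrapper that accepts arguments. NOT the real bsubuvic.

# make_job_script CMD [ARGS...]: print a job script that changes to the
# directory the wrapper was called from and runs CMD with its arguments.
make_job_script() {
    printf '#! /bin/sh\n'
    printf "cd '%s'\n" "$(pwd)"
    # Write each word single-quoted so arguments with spaces survive.
    line=""
    for word in "$@"; do
        line="$line '$word'"
    done
    printf '%s\n' "${line# }"
}

# job_name CMD: base name of CMD, cut to 15 characters.
job_name() {
    basename "$1" | cut -c1-15
}

script_text=$(make_job_script ./myJob myArg1 myArg2)
name=$(job_name ./a_very_long_program_name)

printf '%s\n' "$script_text"
echo "job name: $name"

# On a machine with PBS, the generated script could be submitted by
# piping it to qsub, which reads the job script from standard input
# when no script file is given:
#   printf '%s\n' "$script_text" | qsub -V -N "$(job_name ./myJob)"
```

Here -V transfers the environment and -N sets the job name, matching points (ii) and (iv) above; the generated cd line corresponds to point (i).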

 
 