Overview of the cluster setup

For a practical quide to using the cluster, read the HOWTO.

Introduction

To use the cluster you must login to a submit host. At present the only submit host is login.compbio.dundee.ac.uk.

Do not simply ssh into the login box and launch processes from the command line. Instead, jobs should be launched via the scheduler (see the HOWTO for instructions), which ensures the load on the box is manageable. Running heavy computation on the login box risks creating high loads or causing memory starvation, either of which can make the box sluggish or even unusable. This is a surefire way to make yourself unpopular with other users of the cluster. Accordingly, you should learn to use the scheduler.

You should not run batch jobs on the login box at all unless the job needs more than 16b of RAM. Jobs requiring 16Gb or less should be run on the cluster nodes using the procedures described in the HOWTO.

Sourcing SGE Settings

Before you can use the SGE tools you need to 'source' the default parameters.

source /gridware/sge/default/common/settings.sh

To save you doing that for every new session, you can add it to your ~/.bashrc file.

Some advice about debugging

Working out what has gone wrong with a job can be made much easier by knowing the execution environment of the job on the node. Therefore, it is strongly recommended that you dump the values of the environment variables at the beginning of the job.

Scheduling software

Cluster jobs are managed using Grid Engine.

The sge_intro manpage is a useful starting point for understanding Grid Engine.

Architecture

The cluster consists of the following

  • 12 IB nodes with 8 cores, 16Gb of RAM and InfiniBand and Fibre Channel interconnect,
  • 42 FC nodes with 8 cores, 16Gb of RAM and Fibre Channel interconnect,
  • nanna with 8 cores and 32 Gb of RAM
  • ningal with 48 cores and 128 Gb of RAM
  • cuda-1 with 12 cores (Intel Xeon L5640), 12Gb of RAM and an NVIDIA Tesla m2050 GPU.

All of the cores are 64-bit and run CentOS Linux 5.2 or (in the case of ningal and cuda-1, CentOS 5.5).

For most jobs, the differences between the IB and FC nodes are irrelevant. However, MPI-enabled programs can take advantage of the Infiniband interconnects on the IB nodes to improve performance. Accordingly, MPI jobs should include the mpi queue in the list of queues they request; this ensures that the jobs will run on the IB nodes if slots are available on the nodes.

Queues

Introduction

  • Jobs are scheduled by submitting them to a specified queue.
  • Interactive queues allow you to run a shell on a cluster node using qrsh.
  • Jobs can be submitted only from submit hosts; at present the only submit host is login.compbio.dundee.ac.uk.
  • Each queue will have a group of machines (called nodes or execution hosts ) to which it submits jobs. Each node will have one or more slots per queue. The number of slots determines how many jobs from the same queue that a node can run simultaneously.
  • A node can accept jobs from more than one queue.
  • There may be a default queue on a submit host; again, ask your sysadmin for details. Queues can be specified explicitly using the qsub -q option (see the HOWTO below).
  • Some queues have time limits. Jobs whose run time exceeds the time limit of a queue will be killed by the scheduler. In the case of array jobs, the time limit is applied to individual tasks in the array, rather than the array job as whole.
  • If you are running a large calculation on login, you are strongly advised to run it by submitting it to the bigmem.q rather than running it directly in a shell. This controls the load on the box, which is also a login box for the cluster.

List of queues

Queue name Memory per node h_vmem limit Time limit Batch/Interactive Notes
devel.q 16Gb 14 Gb 8 hour I  
64bit-pri.q 16Gb - 24 hours B  
64bit.q 16Gb - None B  
bigint.q 32Gb 30G b 8 h I  
bigmem.q 32Gb - None B  
ningal.q 128 Gb 128 Gb 24 h BI  
mpi 16 Gb - None B See 'Parallel Environment', below
  • bigint.q and devel.q are interactive only; you can qrsh into them but not qsub batch jobs to them.
  • 64bit.q, 64bit-pri.q, bigmem.q and mpi are batch-only; you can qsub jobs to them but not qrsh into them.
  • h_vmem is the virtual memory limit. Your job will be killed if its virtual memory usage (i.e. physical memory + swap memory) exceeds this limit.

Queue states

The queue instance on each node can be in a number of possible states, indicated by flags in the qstat -f output. No flag means that the queue instance is running normally and accepting jobs. The possible flags are:

  • d - disabled. The queue is not accepting jobs but will continue to run existing jobs. Queues are disabled if the node is overloaded or by sysadmins for maintenance purposes.
  • a - load alarm. A queue load threshold has been exceeded. This means that the queue is not accepting new jobs because the node is under load or running out of memory. The queue will accept new jobs once the threshold violation is removed. Running jobs will be processed as normal.
  • A - suspend alarm. A queue suspend threshold has been exceeded. Jobs in the queue will be suspended because the node is under load or running out of memory. Jobs will be unsuspended once the threshold violation disappears.
  • E - error state. The queue is not accepting or running jobs. This indicates some kind of problem with the node and requires the attention of a sysadmin.
  • S - suspended. The queue has been suspended because a job is running in a superior queue on the same node (see the section on queue prioritisation).
  • u - uncontactable. The queue cannot be contacted. This indicates that the node is offline.

Queue prioritization

Jobs running on a given node are prioritised according to the following rules. Note that these rules affect only jobs running on the same node.

  • A job running in devel.q will suspend jobs running in 64bit.q. The suspended jobs will resume once devel.q becomes empty.
  • Jobs running in 64bit.q have lower priority than jobs in 64-pri.q.
  • If the load on the node is very high, the queues will be suspended until the load decreases. This means that the queue on this node will not accept more jobs, even if it has empty slots, and jobs running in the queue will be suspended.
  • If the load on the node is high enough or the node is running out of memory, the node will stop accepting jobs and running jobs will be suspended.

Time limits

  • Jobs whose run time exceeds the time limit of a queue will be killed by the scheduler. In the case of array jobs, the time limit is applied to individual tasks in the array. In other words, an array job as a whole can exceed the time limit but tasks in the array that exceed the time limit will be killed.
  • There is no time limit on the 64bit.q.
  • The time limit on the 64bit-pri.q queue is 24 hours.
  • The time limit on devel.q is 4 hours.

Memory limits

  • Each compute node has a fixed amount of the ram consumable. The IB nodes and FC nodes have 14000M each while the GE nodes have 2700M. nanna has 30000M of ram available.
  • The default ram requested for a job is 1750M. If your job needs more, you should increase the default using the qsub -l ram option, e.g.
qsub -l ram=4000M

will set the ram requested to 4000M.

  • A node will not accept a job if the amount of ram available on the node is less than the the amount specified by the job.
  • ram is the amount of memory actively in use memory on the node; this limit controls how jobs are scheduled but doesn't limit a running job's memory usage.
  • There are default resource requests defined for jobs (see below); the defaults can be changed using the qsub command's -l option, e.g.
qsub  -l ram=2000M

will set ram to 2000M, respectively.

Changing memory resources if a job isn't being submitted

If a job is queued up but isn't being submitted, it might be because the requested memory resources are too high. You can find out if this is the case by doing the following:

  • Run qstat -j JOBID (pipe the output into less as it's quite long).
  • The output will list each queue and why the job isn't running in the queue.
  • If this is owing to the job being too resource hungry, the entry for the queue will show the requested resources and the resources available.
  • In this case, try changing the requested memory resources by running qalter with similar arguments you gave to qsub to submit the job, but with reduced resource requests. N.B. This will not work for array jobs if some of the tasks in the array are already running. In this case, you will have to resubmit the job.

Practical memory limits for jobs on the cluster

Here are some practical guidelines for large memory jobs:

Job Description Qsub Arguments
Large Jpred Job -q bigmem.q -l qname=bigmem.q,ram=29G,mem_used=28G,h_vmem=28G

I/O and disk space

Space on the home directories is not limitless. If you're submitting a large array or a large number of single jobs, think carefully about how much space the files you generate will need. It is possible to fill up the disk very quickly because of the sheer number of CPUs available on the cluster. If in doubt, run a single test job first to check its storage requirements.

Avoid using your home directory for I/O. If possible, cluster jobs should make of each node's local disk and any network scratch space your group may have.

It is a good general policy to do as much I/O as possible on the local disks and copy the final results of cluster jobs from the local disks to GPFS (if you have lots of files to copy, tar them up and then copy the tarball).

Temporary directories on the local disks

Grid Engine provides a mechanism that makes it easy to use a temporary local directory for I/O. When a cluster job is run on a node, the scheduler creates a temporary directory for it which is deleted automatically after the job finishes. The path to the temporary directory is stored in the TMPDIR environment variable in the job's environment. This makes it easy to use the local disk to store temporary job files because you can write the files to $TMPDIR and, after the job has completed, the scheduler will delete the directory.

It is also possible to store persistent files such as databases on the local disk although this is dependent on space requirements (the sysadmins can advise on this).

Specifying local disk space requirements using qsub

You can specify how much space on the local disk you require when you submit a cluster job using the local_free resource. For example. to specify that your job will need at least 10Gb of space on the local disk, use:

qsub -l local_free=10G

Using the local disks is much faster than using the scratch space.

This is easily seen if you consider the following. The bandwidth available on the local disks is 125MB/s. The nodes can theoretically sustain up to 85MB/s. Therefore, no matter how fast the network is, the bandwidth to the local disk will be 50-100% greater than to GPFS. In practice the difference is even greater because the bandwidth available to the entire cluster to access GPFS is 400MB/s. In other words, the 400MB/s of available bandwidth is shared between the cluster nodes and the other machines that have access to GPFS.

Array jobs.

An array job is one in which the submitted command script is run multiple times. The individual instances of the job, known as tasks, are distinguished by the value of the SGE_TASK_ID environment variable. For example. if an array job of 10 tasks is run, SGE_TASK_ID will have a value of 1 in the first instance, 2 in the second instance and so on up to 10. Note that the task index has no relation to the numbering system used for the queues on the cluster nodes.

Use the qsub -t option to run an array job, e.g. -t 1-10 will run an array job of size 10. See also the qsub manpage.

The task index (the value of SGE_TASK_ID) appears in the ja-task-ID column of the output from qstat.

Each job submitted to the cluster requires a certain amount of resources. If you have a large number of jobs that are only differ from each other in a minor way, and it is possible to distinguish between them using SGE_TASK_ID, it is much more efficient in terms of resources to submit them as a single array job rather than as many individual jobs. See the attachment to Grid Engine for an example of running BLAST as an array job in which a set of sequences from an input file are used to query a database.

There is a limit to the size of an array job. This limit is the max_aj_tasks value in the output from qconf -sconf.

NOTE: the SGE_TASK_ID environment variable is set to 'undefined' for jobs not run as arrays.

Specifying separate log files for each task in an array job.

The default Grid Engine behaviour is to have one output and one error log for a job, even if it is an array job. Are you sure? It seems to default to jobname.[oe]jobid.taskid for array jobs when I run them? It is often much more useful for each task an array job to have its own individual error and output logs. This can be done by including the task id in the log file names. To do this, insert the $SGE_TASK_ID variable into the log filenames on the qsub command-line:

qsub -o job.'$SGE_TASK_ID'

Note that $TASK_ID must be in single quotes to stop the shell interpreting it as a shell. The tasks in this example will have log files named job.1, job.2 etc.

Useful qsub options

See the qsub manpage for more details.

  • -S: specifies the interpreter when running a script.
  • -b y: specifies that the command to be run can be either a script interpreter or an executable.
  • -cwd: run the job in the current working directory.
  • -e: specify the stderr log for a job.
  • -l: used to request resources.
  • -m: send e-mail to the user when certain things (e.g. suspension, system kill) happen to a job. Use in conjunction with -M, which is used to specify the e-mail address.
  • -o: specify the stdout log for a job.
  • -q: adds a queue the list of queues a job is submitted to.
  • -t: used for running array jobs.
  • -V: Specifies that all environment variables in the context in which qsub is run are exported into the context of the job. This option is very useful for ensuring that the environment the job runs in is the same as the one on the command line.

The sge_request file

To set commonly used qsub options more permanently a .sge_request file can be created. All the above settings can be specified, one on each line of the file, and they will be used for every qsub run in that context (unless the -clear option is used on the command-line).

There are three contexts:

  • <sge_root>/<cell>/common/sge_request – global defaults file
  • $HOME/.sge_request – user private defaults file
  • $cwd/.sge_request – local directory defaults file

The local directory defaults file has the highest precedence. See the sge_request manpage for more info.

Job status codes

These appear in column 5 of the default qstat output.

qw: The job is queued and will be submitted when a node becomes available.     
t: The job is being transferred to the cluster from the submit host.     
r: The job is running.     
E: The job is in the error state.     
Eqw: This error code indicates that a job has failed to start         
	  on a node. Common reasons for this include syntax errors          
	  in Perl scripts, failure to load Perl modules and attempts to          
	  access files or directories that are not visible on the node.     
s: The job is suspended.     
S: The job has been suspended because the queue it's running in has been suspended.       
T: The job is suspended because a load threshold such as available memory or the system        
	load has been exceeded.     
dr/dS/dT: A job that was in the r, S or T state has been flagged for deletion by ''qdel''.        
	If a job remains in the this state, try using ''qdel -f'' to delete it. If this doesn't work,        ask a sysadmin to delete the job.

Queue status codes

  • a: a load threshold is exceeded. This means that no further jobs will be scheduled to run in this queue.
  • A: a suspend threshold is exceeded. This means that jobs in the queue will be suspended until the threshold violation is removed.
  • d: the queue is disabled and no jobs will be scheduled to run in it.
  • s: suspended.

Environment variables

When a job is run on the cluster, the scheduler sets several variables in the job's environment. These variables include useful information about the job and its environment. These variables are listed towards the end of the qsub man page.

Parallel environments

Parallel environments allow a job to use more than one slot simultaneously. For example, this is required to run BLAST in multi-threaded mode. The -pe option is required which two arguments: the environment name and the number of slots required. Currently, there is only the mpi parallel environment and it can only be used on the mpi queue:

qsub -pe mpi 2 -q mpi -l ram=3G ...

Each of the ib and fc nodes have a maximum of eight slots, specifying more than that will guarantee the job job won't run! However, there is a maximum number of slots available to concurrent parallel jobs, which avoids the cluster being flooded. Not sure what the maximum no. of slots is, though?

Applications enabled with MPI can be run using mpiexec in a job script similar to this one:

#!/bin/sh #$ -cwd #$ -V /usr/lib64/openmpi/1.2.5-gcc/bin/mpiexec -np $NSLOTS COMMAND...

where COMMAND is the application you want to run.

NSLOTS is set automatically when a job is running in a parallel environment so there is no need to explicitly define its value.

Consumables

Consumables are allocated on a per slot basis. For example, suppose you run:

qsub -pe mpi 10 -l ram=2000M ...

Each individual serial task in the job will consume 2000M of the ram resource.

Scripting

qsub treats lines in a script that begin with #$ as special instructions analogous to the command-line options used with qsub. If a Perl script contains a line beginning with these two characters as a result of code being commented out, qsub will output mysterious error messages about unknown options. Please note that the character sequence #$ can be altered with qsub -C option e.g. if you would like qsub to recognise #: as the start of insruction then use something like this qsub -C #: in your qsub command. (source)

Be very careful with recursive submissions

Keep this in mind if you write job scripts that themselves submit jobs

It is possible to wreak havoc on the cluster by running a script that submits jobs that in turn submit jobs themselves that in turn… This can cause a snowball effect where the number of jobs queued increases faster than you can qdel them. There are two ways to stop this: either move your script somewhere else or chmod 000 the script's output directory so that writing fails.

Accounting

Job accounting data is stored in the Compbio postgres database. It is accessible either via the ARCo web application or by connecting directly to the Postgres database.

Access via ARCo

First, ask the sysadmin to add you to the list of authorized users for ARCo if you're not already on it.

Login to https://sge.cluster.lifesci.dundee.ac.uk:6789 and follow the instructions onscreen.

Direct access

To get access to it, connect to the arco database on postgres.compbio.dundee.ac.uk using username account and password saffron, e.g. using psql, do:

psql -h postgres.compbio.dundee.ac.uk -U account -d arco

The data is in the view view_accounting. For example, to get a list of jobs run by user www-jpred, do:

SELECT * FROM view_accounting WHERE username='www-jpred';

Getting basic accounting information

CPU time and memory usage for a given user can be obtained by running this query:

SELECT sum(cpu)/3600 AS "cpu h",sum(mem)/3600 AS "mem (Gb cpu h)" FROM view_accounting WHERE username='USER';

Similarly, a group's usage can be obtained using:

SELECT sum(cpu)/3600 AS "cpu h",sum(mem)/3600 AS "mem (Gb cpu h)" FROM view_accounting WHERE "group"='GROUP';

Total usage

The total usage over a given period of time can be determined using:

SELECT sum(cpu)/3600 AS "cpu h",sum(mem)/3600 AS "mem (Gb cpu h)" FROM view_accounting WHERE submission_time >= START AND submission_time <= END;

where START and END have the format YYYY-MM-DD.

Other resources

A script is available which returns simple stats from the view_accounting table. It can be found here:

/sw/local/bin/arco_stats.pl

Use the –man switch for up-to-date information on its functionality.

Grid Engine keeps a log of the jobs that have been submitted to the cluster. This can be accessed using the qacct command on sge.cluster.

Mark has written a nice web front-end to the ARCo server. Try that instead of the script if you prefer.

For high level statistics data please see here