
For a practical guide to using the cluster, read the HOWTO.
To use the cluster you must log in to a submit host. At present the only submit host is login.compbio.dundee.ac.uk.
Do not simply ssh into the login box and launch processes from the command line. Instead, jobs should be launched via the scheduler (see the HOWTO for instructions), which ensures the load on the box is manageable. Running heavy computation on the login box risks creating high loads or causing memory starvation, either of which can make the box sluggish or even unusable. This is a surefire way to make yourself unpopular with other users of the cluster. Accordingly, you should learn to use the scheduler.
You should not run batch jobs on the login box at all unless the job needs more than 16Gb of RAM. Jobs requiring 16Gb or less should be run on the cluster nodes using the procedures described in the HOWTO.
Before you can use the SGE tools you need to 'source' the default parameters.
source /gridware/sge/default/common/settings.sh
To save you doing that for every new session, you can add it to your ~/.bashrc file.
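A minimal sketch of what the ~/.bashrc addition might look like, assuming you want the same ~/.bashrc to keep working on machines where the settings file is not present. The variable name SGE_SETTINGS is only an illustrative choice.

```shell
# Load the Grid Engine settings only if the file is readable on this machine.
# SGE_SETTINGS is a hypothetical helper variable, not something Grid Engine defines.
SGE_SETTINGS=/gridware/sge/default/common/settings.sh
if [ -r "$SGE_SETTINGS" ]; then
    . "$SGE_SETTINGS"
fi
```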
Working out what has gone wrong with a job can be made much easier by knowing the execution environment of the job on the node. Therefore, it is strongly recommended that you dump the values of the environment variables at the beginning of the job.
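One way to dump the environment at the start of a job script is sketched below. The function name dump_env and the header text are illustrative assumptions; the only requirement is that the output of env ends up in one of the job's logs.

```shell
#!/bin/sh
# dump_env: print a sorted snapshot of the environment between two marker lines.
dump_env() {
    echo "=== job environment on $(uname -n) ==="
    env | sort
    echo "=== end of job environment ==="
}

# Send the snapshot to stderr so that it ends up in the job's error log.
dump_env >&2
```

Placing this before any real work means the snapshot is recorded even if the job fails early.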
Cluster jobs are managed using Grid Engine.
The sge_intro manpage is a useful starting point for understanding Grid Engine.
The cluster consists of the following
All of the cores are 64-bit and run CentOS Linux 5.2 (or, in the case of ningal and cuda-1, CentOS 5.5).
For most jobs, the differences between the IB and FC nodes are irrelevant. However, MPI-enabled programs can take advantage of the Infiniband interconnects on the IB nodes to improve performance. Accordingly, MPI jobs should include the mpi queue in the list of queues they request; this ensures that the jobs will run on the IB nodes if slots are available on the nodes.
| Queue name | Memory per node | h_vmem limit | Time limit | Batch/Interactive | Notes |
|---|---|---|---|---|---|
| devel.q | 16Gb | 14Gb | 8 hours | I | |
| 64bit-pri.q | 16Gb | - | 24 hours | B | |
| 64bit.q | 16Gb | - | None | B | |
| bigint.q | 32Gb | 30Gb | 8 hours | I | |
| bigmem.q | 32Gb | - | None | B | |
| ningal.q | 128Gb | 128Gb | 24 hours | BI | |
| mpi | 16Gb | - | None | B | See 'Parallel Environment', below |
The queue instance on each node can be in a number of possible states, indicated by flags in the qstat -f output. No flag means that the queue instance is running normally and accepting jobs. The possible flags are:
Jobs running on a given node are prioritised according to the following rules. Note that these rules affect only jobs running on the same node.
qsub -l ram=4000M
will set the ram requested to 4000M, and
qsub -l ram=2000M
will set it to 2000M.
If a job is queued up but isn't being submitted, it might be because the requested memory resources are too high. You can find out if this is the case by doing the following:
Here are some practical guidelines for large memory jobs:
| Job Description | Qsub Arguments |
|---|---|
| Large Jpred Job | -q bigmem.q -l qname=bigmem.q,ram=29G,mem_used=28G,h_vmem=28G |
Space on the home directories is not limitless. If you're submitting a large array or a large number of single jobs, think carefully about how much space the files you generate will need. It is possible to fill up the disk very quickly because of the sheer number of CPUs available on the cluster. If in doubt, run a single test job first to check its storage requirements.
Avoid using your home directory for I/O. If possible, cluster jobs should make use of each node's local disk and any network scratch space your group may have.
It is a good general policy to do as much I/O as possible on the local disks and copy the final results of cluster jobs from the local disks to GPFS (if you have lots of files to copy, tar them up and then copy the tarball).
Grid Engine provides a mechanism that makes it easy to use a temporary local directory for I/O. When a cluster job is run on a node, the scheduler creates a temporary directory for it which is deleted automatically after the job finishes. The path to the temporary directory is stored in the TMPDIR environment variable in the job's environment. This makes it easy to use the local disk to store temporary job files because you can write the files to $TMPDIR and, after the job has completed, the scheduler will delete the directory.
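The pattern described above (do the I/O on the local disk, then copy a single tarball to shared storage) can be sketched as follows. This is only an outline under stated assumptions: the function name run_in_scratch, the destination argument and the file names are hypothetical, and the echo line stands in for the real computation.

```shell
#!/bin/sh
# run_in_scratch DEST: work on the node's local disk, then copy one tarball to DEST.
# DEST and the file names below are hypothetical placeholders.
run_in_scratch() {
    dest=$1
    # Under Grid Engine, TMPDIR is the per-job directory. /tmp is only a fallback
    # so that the sketch also runs outside the scheduler.
    work=$(mktemp -d "${TMPDIR:-/tmp}/work.XXXXXX") || return 1
    mkdir "$work/out" || return 1

    # Stand-in for the real computation: write results to the local disk.
    echo "example result" > "$work/out/result.txt"

    # Bundle the results locally, then make a single copy to shared storage.
    tar -C "$work/out" -cf "$work/results.tar" . &&
        mkdir -p "$dest" &&
        cp "$work/results.tar" "$dest/"
    status=$?

    # The scheduler removes TMPDIR itself; this only tidies our own subdirectory.
    rm -rf "$work"
    return $status
}
```

A job script would call it with the final destination, for example a directory on GPFS belonging to your group.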
It is also possible to store persistent files such as databases on the local disk although this is dependent on space requirements (the sysadmins can advise on this).
You can specify how much space on the local disk you require when you submit a cluster job using the local_free resource. For example, to specify that your job will need at least 10Gb of space on the local disk, use:
qsub -l local_free=10G
The advantage of the local disk over GPFS is easily seen if you consider the following. The bandwidth available on the local disks is 125MB/s. The nodes can theoretically sustain up to 85MB/s. Therefore, no matter how fast the network is, the bandwidth to the local disk will be 50-100% greater than to GPFS. In practice the difference is even greater because the bandwidth available to the entire cluster to access GPFS is 400MB/s. In other words, the 400MB/s of available bandwidth is shared between the cluster nodes and the other machines that have access to GPFS.
An array job is one in which the submitted command script is run multiple times. The individual instances of the job, known as tasks, are distinguished by the value of the SGE_TASK_ID environment variable. For example, if an array job of 10 tasks is run, SGE_TASK_ID will have a value of 1 in the first instance, 2 in the second instance and so on up to 10. Note that the task index has no relation to the numbering system used for the queues on the cluster nodes.
Use the qsub -t option to run an array job, e.g. -t 1-10 will run an array job of size 10. See also the qsub manpage.
The task index (the value of SGE_TASK_ID) appears in the ja-task-ID column of the output from qstat.
Each job submitted to the cluster requires a certain amount of resources. If you have a large number of jobs that differ from each other only in a minor way, and it is possible to distinguish between them using SGE_TASK_ID, it is much more efficient in terms of resources to submit them as a single array job rather than as many individual jobs. See the attachment to Grid Engine for an example of running BLAST as an array job in which a set of sequences from an input file are used to query a database.
There is a limit to the size of an array job. This limit is the max_aj_tasks value in the output from qconf -sconf.
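A small helper for reading that limit is sketched below. The function name max_array_size is an illustrative assumption; it simply picks the max_aj_tasks line out of whatever is piped into it.

```shell
# max_array_size: read `qconf -sconf` output on stdin and print the max_aj_tasks value.
max_array_size() {
    awk '$1 == "max_aj_tasks" { print $2 }'
}

# Usage on a submit host:
#   qconf -sconf | max_array_size
```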
NOTE: the SGE_TASK_ID environment variable is set to 'undefined' for jobs not run as arrays.
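As an illustration of how a task might use SGE_TASK_ID to choose its own input, the sketch below selects one line from a list file. The function name pick_input, the list file and the program name in the usage comment are hypothetical; the fallback to line 1 when the variable is 'undefined' is a design choice for this example, not Grid Engine behaviour.

```shell
#!/bin/sh
#$ -cwd
#$ -t 1-10

# pick_input LIST: print the line of LIST whose number matches this task's index.
pick_input() {
    task=${SGE_TASK_ID:-undefined}
    # Outside an array job the variable is 'undefined'; fall back to the first line.
    if [ "$task" = "undefined" ]; then
        task=1
    fi
    sed -n "${task}p" "$1"
}

# Hypothetical usage inside the job script:
#   input=$(pick_input inputs.txt)
#   my_program "$input"
```

With a list of ten input files, one per line, the ten tasks each process a different file.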
The default Grid Engine behaviour is to have one output and one error log for a job, even if it is an array job. (Editorial query: are you sure? It seems to default to jobname.[oe]jobid.taskid for array jobs when I run them.) It is often much more useful for each task in an array job to have its own individual error and output logs. This can be done by including the task ID in the log file names. To do this, insert the $TASK_ID pseudo-variable into the log filenames on the qsub command line:
qsub -o job.'$TASK_ID'
Note that $TASK_ID must be in single quotes to stop the shell interpreting it as a shell variable. The tasks in this example will have log files named job.1, job.2 etc.
See the qsub manpage for more details.
To set commonly used qsub options more permanently a .sge_request file can be created. All the above settings can be specified, one on each line of the file, and they will be used for every qsub run in that context (unless the -clear option is used on the command-line).
There are three contexts:
The local directory defaults file has the highest precedence. See the sge_request manpage for more info.
These appear in column 5 of the default qstat output.
qw: The job is queued and will be submitted when a node becomes available.
t: The job is being transferred to the cluster from the submit host.
r: The job is running.
E: The job is in the error state.
Eqw: This error code indicates that a job has failed to start on a node. Common reasons for this include syntax errors in Perl scripts, failure to load Perl modules and attempts to access files or directories that are not visible on the node.
s: The job is suspended.
S: The job has been suspended because the queue it's running in has been suspended.
T: The job is suspended because a load threshold such as available memory or the system load has been exceeded.
dr/dS/dT: A job that was in the r, S or T state has been flagged for deletion by qdel. If a job remains in this state, try using qdel -f to delete it. If this doesn't work, ask a sysadmin to delete the job.
When a job is run on the cluster, the scheduler sets several variables in the job's environment. These variables include useful information about the job and its environment, and are listed towards the end of the qsub man page.
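Because the job state appears in column 5 of the default qstat output, jobs in a particular state can be picked out with a short filter. The sketch below is a rough helper under assumptions: the function name jobs_in_state is hypothetical, it assumes the output starts with two header lines, and it will misread rows whose job name contains spaces.

```shell
# jobs_in_state STATE: read default qstat output on stdin and print the IDs
# of jobs whose state column (column 5) equals STATE.
jobs_in_state() {
    awk -v state="$1" 'NR > 2 && $5 == state { print $1 }'
}

# Usage on a submit host:
#   qstat | jobs_in_state Eqw
```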
Parallel environments allow a job to use more than one slot simultaneously. For example, this is required to run BLAST in multi-threaded mode. The -pe option is required, which takes two arguments: the environment name and the number of slots required. Currently, there is only the mpi parallel environment and it can only be used on the mpi queue:
qsub -pe mpi 2 -q mpi -l ram=3G ...
Each of the ib and fc nodes has a maximum of eight slots, so specifying more than that will guarantee the job won't run! There is also a maximum number of slots available to concurrent parallel jobs, which avoids the cluster being flooded. (The value of this maximum has not been confirmed.)
Applications enabled with MPI can be run using mpiexec in a job script similar to this one:
#!/bin/sh
#$ -cwd
#$ -V
/usr/lib64/openmpi/1.2.5-gcc/bin/mpiexec -np $NSLOTS COMMAND...
where COMMAND is the application you want to run.
NSLOTS is set automatically when a job is running in a parallel environment so there is no need to explicitly define its value.
Consumables are allocated on a per slot basis. For example, suppose you run:
qsub -pe mpi 10 -l ram=2000M ...
Each individual serial task in the job will consume 2000M of the ram resource.
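Since the request is multiplied by the number of slots, the total for the example above is 10 × 2000M = 20000M. The helper below just makes that arithmetic explicit; the function name total_request_mb is an illustrative assumption.

```shell
# total_request_mb SLOTS PER_SLOT_MB: total of a per-slot consumable across a parallel job.
total_request_mb() {
    echo $(( $1 * $2 ))
}

total_request_mb 10 2000   # prints 20000
```

Checking the total before submitting helps avoid requesting more memory than the nodes can supply.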
qsub treats lines in a script that begin with #$ as special instructions analogous to the command-line options used with qsub. If a Perl script contains a line beginning with these two characters as a result of code being commented out, qsub will output mysterious error messages about unknown options. The #$ prefix can be changed with the qsub -C option: for example, to make qsub recognise #: as the start of an instruction, add -C '#:' to your qsub command (the quotes stop the shell treating # as the start of a comment).
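A quick way to check a script for lines that qsub would pick up as instructions, intended or not, is sketched below. The function name directive_lines is hypothetical, and the check assumes the default #$ prefix.

```shell
# directive_lines FILE: list, with line numbers, the lines of FILE that begin with #$.
directive_lines() {
    grep -n '^#\$' "$1"
}

# Usage:
#   directive_lines myjob.pl
```

Any line in the output that is really commented-out code can be fixed by putting a space between the # and the $.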
Keep this in mind if you write job scripts that themselves submit jobs
It is possible to wreak havoc on the cluster by running a script that submits jobs that in turn submit jobs themselves that in turn… This can cause a snowball effect where the number of jobs queued increases faster than you can qdel them. There are two ways to stop this: either move your script somewhere else or chmod 000 the script's output directory so that writing fails.
Job accounting data is stored in the Compbio postgres database. It is accessible either via the ARCo web application or by connecting directly to the Postgres database.
First, ask the sysadmin to add you to the list of authorized users for ARCo if you're not already on it.
Log in to https://sge.cluster.lifesci.dundee.ac.uk:6789 and follow the instructions onscreen.
To get access to it, connect to the arco database on postgres.compbio.dundee.ac.uk using username account and password saffron, e.g. using psql, do:
psql -h postgres.compbio.dundee.ac.uk -U account -d arco
The data is in the view view_accounting. For example, to get a list of jobs run by user www-jpred, do:
SELECT * FROM view_accounting WHERE username='www-jpred';
CPU time and memory usage for a given user can be obtained by running this query:
SELECT sum(cpu)/3600 AS "cpu h",sum(mem)/3600 AS "mem (Gb cpu h)" FROM view_accounting WHERE username='USER';
Similarly, a group's usage can be obtained using:
SELECT sum(cpu)/3600 AS "cpu h",sum(mem)/3600 AS "mem (Gb cpu h)" FROM view_accounting WHERE "group"='GROUP';
The total usage over a given period of time can be determined using:
SELECT sum(cpu)/3600 AS "cpu h",sum(mem)/3600 AS "mem (Gb cpu h)" FROM view_accounting WHERE submission_time >= 'START' AND submission_time <= 'END';
where START and END have the format YYYY-MM-DD and are quoted so that they are read as dates rather than as arithmetic.
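For repeated use, the period query can be generated by a small shell function that checks the date format and quotes the dates as SQL literals. This is a sketch only: the function name usage_query and the dates in the usage comment are hypothetical, and the format check does not validate that the date actually exists.

```shell
# usage_query START END: print the period-usage query with validated, quoted dates.
usage_query() {
    for d in "$1" "$2"; do
        case $d in
            [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
            *) echo "bad date: $d (expected YYYY-MM-DD)" >&2; return 1 ;;
        esac
    done
    printf '%s\n' "SELECT sum(cpu)/3600 AS \"cpu h\", sum(mem)/3600 AS \"mem (Gb cpu h)\" FROM view_accounting WHERE submission_time >= '$1' AND submission_time <= '$2';"
}

# Usage:
#   psql -h postgres.compbio.dundee.ac.uk -U account -d arco -c "$(usage_query 2010-01-01 2010-01-31)"
```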
A script is available which returns simple stats from the view_accounting table. It can be found here:
/sw/local/bin/arco_stats.pl
Use the --man switch for up-to-date information on its functionality.
Grid Engine keeps a log of the jobs that have been submitted to the cluster. This can be accessed using the qacct command on sge.cluster.
Mark has written a nice web front-end to the ARCo server. Try that instead of the script if you prefer.
For high level statistics data please see here