Your home directory (/home/username) on the cluster is independent of your normal /homes/gws account. This local home directory is not backed up, so please keep backups of important code and documents elsewhere. This directory is physically located on the node 'hail'.
It is possible to get access to your /homes/gws account. Email CSE Support asking them to export your home directory to the machine hail. Once you get confirmation from Support that the export has been done, try 'ls /homes/gws/username'. If this doesn't work, email CSE Support asking that the maps be updated.
Exporting home directories or any other NFS mounts to hail is considered a security risk by the folks who tend most of the department machines: if hail were ever compromised, an attacker would gain access to those NFS areas. Hail is kept up to date with patches to prevent such breaches.
Once your gws account has been exported, you may find it convenient to make a symbolic link in your local home directory that points to your normal homedir. You can do this by running the command 'ln -s /homes/gws/<username> gws'. This will create a symlink called 'gws' in your hail home directory. You can now access anything in your gws homedir by cd'ing into ~/gws.
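For example, assuming your username were jdoe (a placeholder), the full sequence would be:

    ln -s /homes/gws/jdoe gws     # create the symlink once
    cd ~/gws                      # now behaves like your gws homedir
    ls                            # should list your gws files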
Each compute node has /scratch. Those nodes with two disks also have
/scratch2. These areas are free for anyone to use. Please make a
subdirectory under /scratch to contain your work (e.g., /scratch/myname).
Any directories left untouched for 6 months are eligible for purging unless
arrangements are made. This space is not appropriate for general backups
of your data.
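A minimal example of claiming a scratch area, using the placeholder name from above:

    mkdir /scratch/myname
    cd /scratch/myname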
Use /tmp for truly temporary needs (during execution, small data sets);
use /scratch or /scratch2 for longer-term scratch storage (i.e., the length
of a project).
Your local (/home/username) homedir is also appropriate for some
longer-term data storage. Anything that needs to be backed up must be
stored either in /projects or your gws home directory. Nothing
on the hail cluster is backed up.
The most straightforward way to execute a job under Torque is with a command like:
    qsub -N MyJobName /full/path/to/prog

This will return text like "123.master", which is the job ID and the host that the job was submitted to. Note that you cannot pass any command line arguments into the program.
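For instance (the job name and path are the placeholders from above; the returned ID will differ):

    $ qsub -N MyJobName /full/path/to/prog
    123.master

You can then check on the job with 'qstat 123.master'.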
If you are getting an error 'cannot execute binary file', try submitting a job using a description file as described next.
Alternatively, you can write a job description file. The major advantage of a description file is that it allows you to run multiple executables as part of the same job. A description file that is equivalent to the one-liner above is:
    ## My comments follow double hashes
    ## command line flags for the qsub command are given
    ## in this file starting with '#PBS'
    #PBS -N MyJobName
    /full/path/to/prog
This is then queued by 'qsub MyJobDescFile'.
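Because the description file is run as a shell script, it can hold several commands, and those commands may take arguments. A two-step job might look like this (the paths and flag below are hypothetical):

    ## hypothetical two-step job: preprocess, then run
    #PBS -N TwoStepJob
    /full/path/to/preprocess --verbose
    /full/path/to/prog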
A job defaults to using only one processor on one node. To request two processors on one node, use the qsub flag '-l nodes=1:ppn=2'. For two nodes with one CPU per node, use '-l nodes=2:ppn=1'. ppn stands for 'Processors per Node'.
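For example, to run the earlier job on two processors of a single node:

    qsub -N MyJobName -l nodes=1:ppn=2 /full/path/to/prog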
The -l flag is also used to specify other properties of a node. Currently the following properties are defined:
You can request a specific node by using '-l hosts=nodename'.
Read the qsub manpage for other flags.
Logging into a compute node directly shouldn't be necessary, except in
unusual cases (i.e., debugging batch processes). Use 'rsh nodename',
where nodename is n01 through n36. rsh should be used only when working
on the cluster.
Use ssh/scp when communicating with non-hail nodes.
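A couple of illustrative commands (the node and host names are placeholders):

    rsh n07                       # log into compute node n07 from within the cluster
    scp results.tar.gz myhost:    # copy data off the cluster to a non-hail machine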
Last updated May 23, 2005.