Hints/Tips for the Unsat cluster

Want to see what the cluster is doing? http://unsat.cs.washington.edu/gmetad/

Three methods for executing jobs:

OpenPBS
gexec
MPI

OpenPBS

This is the most flexible of the available methods. To submit a job, follow these steps:

Write & debug your program

Write a shell script with a few extra lines:

### Job name (Or delete to use executable name)
#PBS -N MyPBSJob
### Send mail with status (job 'b'egins, 'e'nds, 'a'borts)
### or delete this to only get email on job abort
#PBS -m bae
### Number of nodes
### If you need both CPUs per job, use 'nodes=1:ppn=2'
### (ppn=processor per node)
### or delete to get one job per cpu, spread across nodes
#PBS -l nodes=1
/path/to/your_program_here -with -args

(See the qsub man page for other options)
Each #PBS line stands in for a qsub command-line flag, which simplifies the qsub call, i.e.

qsub -N MyPBSJob -m bae -l nodes=1 \
    /path/to/your_program_here -with -args

is an equivalent call.
Call your test at the end of the script.

Use 'qsub name_of_above_file' to submit your job for execution

Use the 'qstat' or 'xpbsmon' commands to see the current status of your job(s).
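The steps above can be sketched end to end as follows. This is a minimal sketch, not the cluster's official procedure: the file name myjob.pbs and the use of /bin/hostname as the program are placeholders chosen for illustration, and the submit step only runs where qsub is actually installed.

```shell
#!/bin/sh
# Sketch: write a small job script, submit it, and look up its status.
# "myjob.pbs" and /bin/hostname are placeholders, not names used by the cluster.
job_script=myjob.pbs

# The quoted 'EOF' keeps the shell from expanding anything inside the script.
cat > "$job_script" <<'EOF'
#!/bin/sh
#PBS -N MyPBSJob
#PBS -m bae
#PBS -l nodes=1
/bin/hostname
EOF

if command -v qsub >/dev/null 2>&1; then
    # qsub prints the identifier of the new job on standard output.
    jobid=$(qsub "$job_script")
    echo "submitted: $jobid"
    qstat "$jobid"    # status of this job only
else
    echo "qsub not found; wrote $job_script but did not submit it"
fi
```

Capturing the job identifier at submit time avoids having to pick your job out of the full qstat listing later.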

Advantages: Some measure of job control. Optional email notification of errors and completion. Ability to migrate jobs around nodes.
Disadvantages: Have to write a tiny wrapper script to start the process.

gexec

This is by far the simplest method to get a job running. To view the current status of nodes in the cluster, use gstat -l -1. This will give you a list of all nodes ordered by CPU load.
Use gexec -n NUMHOSTS command arg1 arg2 ... to execute command on NUMHOSTS hosts (e.g. 3) with the command-line arguments arg1, arg2.

Make sure your command is available in your PATH environment variable, or gexec will give you a 'Bad filename' error.
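A quick way to avoid the 'Bad filename' error is to check that the command resolves through PATH before handing it to gexec. The sketch below assumes a placeholder command (date) and a host count of 3, neither of which comes from the cluster configuration, and it only calls gstat and gexec where they are installed.

```shell
#!/bin/sh
# Placeholders for illustration: run "date" on 3 hosts.
nhosts=3
cmd=date

# Resolve the command through PATH first; an empty result means
# gexec would most likely fail with 'Bad filename'.
resolved=$(command -v "$cmd") || resolved=""

if [ -z "$resolved" ]; then
    echo "$cmd is not in PATH; fix PATH before calling gexec" >&2
elif command -v gexec >/dev/null 2>&1; then
    gstat -l -1                  # list nodes ordered by CPU load
    gexec -n "$nhosts" "$cmd"    # run the command on $nhosts hosts
else
    echo "gexec not found here; would run: gexec -n $nhosts $cmd"
fi
```

The check runs on the machine you launch from, so it assumes the remote nodes see the same PATH and filesystem layout.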

Advantages: Quick to use, some measure of control as to where jobs are executed
Disadvantages: No job control or checkpointing. No queueing behaviour provided.

Vendor docs: http://ganglia.sourceforge.net/docs/

MPI

The local implementation of MPI in use on the unsat cluster is MPICH. It is available at unsat:/usr/local/mpich. Docs, headers, and example programs are available.

MPICH home page: http://www-unix.mcs.anl.gov/mpi/mpich/
Manuals from the MPICH website: http://www-unix.mcs.anl.gov/mpi/mpich/docs.html (We are using the ch_p4 model)

To run your MPI-aware test on a set number of processors:

mpirun -np <num processors> your-program arg1 arg2

To use all available processors instead, replace '-np <num processors>' above with '-allcpus -machinefile /etc/machinefile'.

NOTE that this does not give you the benefits of OpenPBS. To integrate the two, simply call mpirun as the program submitted to qsub. You will most likely want to use 'nodes=#:ppn=#' to allocate the proper number of CPUs.
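A job script that combines the two might look like the sketch below. It relies on assumptions not stated on this page: that PBS sets the PBS_NODEFILE and PBS_O_WORKDIR environment variables inside the job (standard PBS behaviour, but unverified for this cluster), and that the node file holds one line per allocated CPU. The helper count_cpus, the job name MyMPIJob, and the 2x2 allocation are hypothetical.

```shell
#!/bin/sh
#PBS -N MyMPIJob
#PBS -l nodes=2:ppn=2

# Hypothetical helper: count the non-empty lines of a node file.
# Assumes PBS writes one line per allocated CPU.
count_cpus() {
    grep -c . "$1"
}

if [ -n "${PBS_NODEFILE:-}" ]; then
    # Inside a PBS job: start in the directory qsub was called from.
    cd "$PBS_O_WORKDIR" || exit 1
    mpirun -np "$(count_cpus "$PBS_NODEFILE")" \
           -machinefile "$PBS_NODEFILE" ./your-program arg1 arg2
else
    echo "PBS_NODEFILE is not set; submit this script with qsub" >&2
fi
```

Deriving -np from the node file keeps the process count in step with the 'nodes=#:ppn=#' request, so only the #PBS line needs editing when the allocation changes.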

Common Problems

qstat et al. list my job as running although all of its processes are dead
Use 'qsig -s 0 jobid' to get the manager to notice that the processes are gone.

screen is available on unsat. This provides the text-mode equivalent of terminal services. That is, it allows you to connect, start some process that is tied to the display, and disconnect without killing your processes. You can connect at a later time and be right back where you were.
Quick Start guide:

Use 'screen -R -D' to start screen (this reconnects you to an existing session, disconnecting any others if necessary)
Ctrl-A, C creates a new terminal window
Ctrl-A, W lists all existing windows
Ctrl-A, # where # is the number of an existing window switches you to that window, much like Alt-(F1-F6) on the Linux console
Ctrl-A, D disconnects your session
Ctrl-A, ? gives help

Last updated May 23, 2005.

Contact the UNSAT Administrator with any comments regarding this page, or any questions about topics covered on this page (usage policies, job submission, scheduling tools, etc.).
Contact CSE Support with questions about basic operation of the cluster (outages, system failures, accounts, etc.).