Want to see what the cluster is doing? http://unsat.cs.washington.edu/gmetad/

Three methods for executing jobs:

OpenPBS

This is the most flexible of the available methods. Submitting a job takes a few steps:
  • Use the 'qstat' or 'xpbsmon' commands to see the current status of your job(s).

    Advantages: Some measure of job control. Optional email notification of errors and completion. Ability to migrate jobs between nodes.
    Disadvantages: You have to write a tiny wrapper script to start your process.
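
The submission steps might look like the following sketch. The script name myjob.sh, the job name, and ./my_program are placeholders, and the resource request is only an example; adjust them for your own job.

```shell
# Write a tiny wrapper script (file name and job details are placeholders).
cat > myjob.sh <<'EOF'
#!/bin/sh
#PBS -N myjob
#PBS -l nodes=1:ppn=1
#PBS -m ae
# PBS starts the job in your home directory; return to where qsub was run.
cd "$PBS_O_WORKDIR" || exit 1
./my_program arg1 arg2
EOF

# Submit only if qsub exists on this machine; qsub prints the new job id.
if command -v qsub >/dev/null 2>&1; then
    qsub myjob.sh
    qstat
else
    echo "qsub not found; wrote myjob.sh only"
fi
```

The '-m ae' line asks for mail when the job aborts or ends, which corresponds to the optional email notification mentioned above.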

    gexec

    This is by far the simplest method to get a job running. To view the current status of nodes in the cluster, use gstat -l -1. This will give you a list of all nodes ordered by CPU load.
    Use gexec -n NUMHOSTS command arg1 arg2 ... to execute command on NUMHOSTS hosts (e.g. 3) with the command-line arguments arg1, arg2, and so on.

    Make sure your command can be found through your PATH environment variable, or gexec will give you a 'Bad filename' error.

    Advantages: Quick to use, some measure of control as to where jobs are executed
    Disadvantages: No job control or checkpointing. No queueing behaviour provided.
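
As a rough illustration of the gexec usage described above, the sketch below checks that a command resolves before handing it to gexec. The helper names in_path and run_on_hosts are made up for this example and are not part of gexec.

```shell
# Succeed if the named command resolves in this shell (hypothetical helper).
# This is a rough check: it also accepts shell builtins, which gexec cannot run.
in_path() {
    command -v "$1" >/dev/null 2>&1
}

# Run a command on the given number of hosts (hypothetical helper).
run_on_hosts() {
    numhosts=$1
    shift
    if ! in_path "$1"; then
        echo "$1 not found via PATH; gexec would report 'Bad filename'" >&2
        return 1
    fi
    if ! in_path gexec; then
        echo "gexec not found on this machine" >&2
        return 1
    fi
    gexec -n "$numhosts" "$@"
}
```

For example, 'run_on_hosts 3 hostname' would run hostname on three hosts.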

    Vendor docs: http://ganglia.sourceforge.net/docs/

    MPI

    The local implementation of MPI in use on the unsat cluster is MPICH. It is available at unsat:/usr/local/mpich. Docs, headers, and example programs are available.

    MPICH home page: http://www-unix.mcs.anl.gov/mpi/mpich/
    Manuals from the MPICH website: http://www-unix.mcs.anl.gov/mpi/mpich/docs.html (We are using the ch_p4 model)

    To run your MPI-aware test on a set number of processors, pass '-np <num processors>' to mpirun. If you want to use all available processors, use '-allcpus -machinefile /etc/machinefile' in place of '-np <num processors>'.
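
The two forms of the mpirun command line can be sketched with a small dry-run helper that only prints the command it would run. The function name mpirun_cmd is hypothetical, and ./cpi stands in for whatever MPI program you have built.

```shell
# Print the mpirun command line for a fixed processor count, or for all CPUs
# when the first argument is "all" (hypothetical dry-run helper).
mpirun_cmd() {
    nprocs=$1
    shift
    if [ "$nprocs" = "all" ]; then
        echo "mpirun -allcpus -machinefile /etc/machinefile $*"
    else
        echo "mpirun -np $nprocs $*"
    fi
}

mpirun_cmd 4 ./cpi      # prints: mpirun -np 4 ./cpi
mpirun_cmd all ./cpi    # prints: mpirun -allcpus -machinefile /etc/machinefile ./cpi
```

To launch the program for real, run the printed command on unsat.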

    NOTE that this does not give you the benefits of OpenPBS. To integrate the two, simply call mpirun as the program submitted to qsub. You will most likely want to use 'nodes=#:ppn=#' to allocate the proper number of CPUs.
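
A minimal sketch of that integration follows. It generates a wrapper script whose 'nodes=#:ppn=#' request matches the processor count passed to mpirun. The function name make_mpi_job, the file name mpijob.sh, and the program ./cpi are assumptions for illustration. Passing $PBS_NODEFILE as the machine file is a common convention with PBS, not something this page specifies.

```shell
# Print a PBS wrapper script that runs an MPI program on nodes*ppn CPUs
# (hypothetical helper).
make_mpi_job() {
    nodes=$1
    ppn=$2
    program=$3
    np=$((nodes * ppn))
    cat <<EOF
#!/bin/sh
#PBS -l nodes=${nodes}:ppn=${ppn}
cd "\$PBS_O_WORKDIR" || exit 1
# PBS lists the allocated CPUs, one per line, in \$PBS_NODEFILE.
mpirun -np ${np} -machinefile "\$PBS_NODEFILE" ${program}
EOF
}

# Two nodes with two processors each gives four MPI processes.
make_mpi_job 2 2 ./cpi > mpijob.sh
# Then submit with: qsub mpijob.sh
```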

    Common Problems

    1. qstat et al. list my job as running although all of its processes are dead
      Use 'qsig -s 0 jobid' to get the manager to notice that the processes are gone.

    screen is available on unsat. It provides the text-mode equivalent of terminal services: you can connect, start a process that is tied to the display, and disconnect without killing that process. You can reconnect at a later time and be right back where you were.
    Quick Start guide:
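
A typical screen session might go as in the sketch below. The session name mywork is a placeholder, and 'sleep 60' stands in for a real long-running command. For the sake of a non-interactive example, the session is started already detached.

```shell
SESSION=mywork   # hypothetical session name

if command -v screen >/dev/null 2>&1; then
    # Start a detached session running a long command.
    screen -dmS "$SESSION" sleep 60 || echo "could not start screen session"
    # List sessions; the new one should appear as <pid>.mywork
    screen -ls || true
    # Interactively you would now reattach with: screen -r mywork
    # and detach again with Ctrl-a d, leaving the process running.
    # Clean up the demo session.
    screen -S "$SESSION" -X quit || true
else
    echo "screen not found on this machine"
fi
```

In everyday use you would more likely start with 'screen -S mywork', run your program inside it, press Ctrl-a d to detach, and later run 'screen -r mywork' to pick up where you left off.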


    Last updated May 23, 2005.