Note: There are some changes to the user environment as of the September 18th, 2008 upgrade. They are not yet incorporated in this document. For a short summary of what's new, please check this page until we've updated this document.
This document is intended to answer common questions about how to use the Saguaro cluster. If this page doesn't address your questions, or if you need more help than this page provides, please send an email to support@hpchelp.asu.edu.
Saguaro is the 1060 processor, 285 node, main compute cluster for the Fulton High Performance Computing Initiative. Saguaro provides the resources needed for advanced research in computing, biology, physics, and a myriad of other topics. This document will give you the knowledge to make use of Saguaro.
To use Saguaro you will need to:
- Get an Account
- Prepare Your Jobs
- Run and Manage Your Jobs
This may seem a little overwhelming, so we have prepared the classic Hello World! example that steps you through the process. You will need an account before you can run the example.
Our FAQ may also be of interest.
1. Get an Account
The resources supplied by Saguaro are only available to account holders. Saguaro accounts are available to all undergraduate and graduate students, faculty, research staff, and select industrial and academic partners. Temporary accounts for educational use are available free of charge for coursework involving parallel computing. Instructors should contact Dan Stanzione to make the appropriate arrangements.
Saguaro account holders get:
- SSH access to Saguaro
- CPU Time on Saguaro
- Disk Space on Saguaro
Our Allocation Policies determine the amount and level of resources you get.
Fill out our account request form to get an account.
After you have an account, you can Access Saguaro and Learn Basic Linux Commands. Be sure to keep track of your CPU Hours and your Disk Usage. If you run out of either of these, your ability to use Saguaro will be greatly reduced.
If you are unfamiliar with Linux, please read about using Linux.
Access Saguaro
Saguaro is accessed using ssh, a secure system that provides interactive shell (command line) access, file transfers, and X-Windows tunneling. Linux, UNIX, and Apple OS X systems include an ssh client named 'ssh'. Windows users must install a client. Suggested clients are the ASU site-licensed client, PuTTY, or OpenSSH.
Saguaro has two (2) login nodes, saguaro1.fulton.asu.edu and saguaro2.fulton.asu.edu. It does not matter which login node you use; they both offer the same functionality and access to the same resources. If one login node is busy or is experiencing problems, please try the other node.
Shell Access is accomplished by connecting to one of the login nodes from your system with a command like ssh saguaro1.fulton.asu.edu or ssh saguaro2.fulton.asu.edu or using a graphical client.
File Transfers can be done using scp or sftp. scp works like the cp or copy commands, but allows you to copy files to a remote system. To copy a file to Saguaro using scp, use a command like scp filename username@saguaro1.fulton.asu.edu: replacing filename with the name of the file to copy and username with your username. Note the ':' at the end of the command; it tells scp that you are copying to a remote system. With this command, files are copied to your home directory on Saguaro. To copy to another directory, append the directory path after the ':'. For example, scp filename username@saguaro2.fulton.asu.edu:testdir copies the file to the testdir directory in your home directory. If the command were scp filename username@saguaro2.fulton.asu.edu:/testdir, the file would be copied to /testdir at the root of the filesystem on Saguaro. This is probably NOT what you want, so be careful. sftp is used like ftp, but over the secure ssh connection. Many people prefer graphical sftp clients for transferring files.
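The difference between the three destination forms can be seen side by side in a small dry-run sketch. The scp_cmd function below is a hypothetical helper written for this page; it only prints the command it would run, and the username and file names are placeholders.

```shell
#!/bin/bash
# Placeholders: substitute your own username; either login node works.
user="username"
host="saguaro2.fulton.asu.edu"

# scp_cmd FILE [REMOTE_PATH]
# Hypothetical helper: prints (does not run) the scp command for a copy.
scp_cmd() {
    printf 'scp %s %s@%s:%s\n' "$1" "$user" "$host" "${2:-}"
}

scp_cmd filename            # nothing after ':'  -> your home directory
scp_cmd filename testdir    # relative path      -> testdir inside your home directory
scp_cmd filename /testdir   # absolute path      -> /testdir at the filesystem root
```

Once the printed command looks right, it can be typed at the prompt as shown.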
Graphical Access with X-Windows can be done by using ssh to tunnel the X-Windows session back to your workstation. Linux and UNIX systems use X-Windows for their graphical environment. Apple OS X users will need to install and run the X11 program that comes with OS X. Windows users will need to install an X-Windows server like the ASU site-licensed one. Once the X-Windows server is running, run ssh -X username@saguaro1.fulton.asu.edu to connect, then run the application you need. If your workstation is properly configured, the application will then appear on your desktop. A good application to test with is xterm, which provides a shell in a window.
Track Your CPU Hours
To check the CPU hour allocation allotted to you, run mybalance -h. To see the allocation in seconds, run mybalance. The allocation will be listed by project.
CPU Hours are the number of CPUs for the job times the amount of time the job takes to run. All times are rounded to the nearest second. For example, if you have a job that takes 3 hours to run on 64 processors, you will have used 3*64=192 hours.
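As a rough illustration of this arithmetic, the following shell sketch computes the CPU hours for a job. The cpu_hours function is a hypothetical helper written for this example, not a command installed on Saguaro, and it assumes whole numbers of processors and hours.

```shell
#!/bin/bash
# cpu_hours PROCESSORS HOURS
# Hypothetical helper (not a Saguaro command): CPU hours = processors * hours.
cpu_hours() {
    echo $(( $1 * $2 ))
}

# The example from the text: 64 processors for 3 hours.
cpu_hours 64 3   # prints 192
```

Comparing this number with the output of mybalance -h before submitting shows whether the job fits in your allocation.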
When you submit a job you must specify the amount of walltime you expect the job to take. The system will then check to see if you have enough hours, then reserve that number of hours (192 hours in our example). When the run is completed, the actual amount of time used will be deducted and the reserve will be released. It is a good idea to leave a little extra time in your estimate. Note the word little, because any time that is reserved cannot be used on another job. For example, if you have 100 hours in your account and two jobs that have 100 CPU Hours estimated (number of processors * walltime), only the first job submitted will run, even if each job only uses 1 hour.
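The reservation rule can be modeled in a few lines of shell. This is a simplified sketch for illustration only, not how the scheduler is actually implemented; the try_start function and the balance and reserved variables are made up for this example.

```shell
#!/bin/bash
# Simplified model of the reservation rule; an illustration, not the real scheduler.
balance=100    # CPU hours in the account
reserved=0     # CPU hours currently held by jobs that have started

# try_start PROCESSORS WALLTIME_HOURS
# Reserves processors * walltime if that many unreserved hours remain.
try_start() {
    need=$(( $1 * $2 ))
    if [ $(( reserved + need )) -le "$balance" ]; then
        reserved=$(( reserved + need ))
        echo "started: reserved $need CPU hours"
    else
        echo "held: needs $need CPU hours, $(( balance - reserved )) unreserved"
    fi
}

try_start 10 10   # first job, 100 CPU hours estimated: started
try_start 10 10   # second job, same estimate: held until the first finishes
```

In this model the second job is held even if the first will only use 1 hour, which is why a tight walltime estimate matters.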
Jobs can run slightly over their walltime estimates before being killed.
If you have a job that is hard to estimate a walltime for (say a convergence problem), checkpoint the job so it can be restarted. As the job can be restarted from about where it left off, you can control how much time is reserved and not be worried about a particular sequence taking longer than expected.
If you are running code for a different project than your default, add #PBS -A projectname to your jobscript.
Track Your Disk Usage
Your account gives you access to a limited amount of persistent disk space in /home/yourusername (your home space) that can be accessed from all the nodes. This space should be used for your jobs and data. The amount of disk space you are allotted is called your quota.
The quota command tells you how much disk space you are using in the blocks column and how much you have left in the quota column. When you run out of space you will no longer be allowed to write to the disk and will need to remove some files before you can write again. Running out of quota can kill your jobs, so keep an eye on it.
To avoid killing jobs, we allow you to exceed your quota slightly for a short period of time. When that period is over, writes will not be allowed until you are under your quota again.
Get More Help
For more help on using our systems, read our FAQ, then if your question is not answered there, submit a support request or email us at support@hpchelp.asu.edu.
2. Prepare Your Jobs
After you have gained access to Saguaro, you will need to prepare your jobs to run on the cluster. Saguaro uses Torque (formerly known as PBS) to manage its resources. To make full use of Saguaro, you will need to know how to edit files, take advantage of available resources, and prepare job scripts for submitting your jobs to Torque. If you have not edited files on Linux before, read about editing files.
Take Advantage of Available Resources
So, you want to get your jobs up and running as quickly as possible, and you want them to run quickly too? Well, knowing a few things about how Saguaro is set up will help you in this process. We have already installed a variety of compilers, environments, libraries, and scientific applications to help you get started on the software side as quickly as possible. To get your job running quickly you should know about our network, queues, and storage options to make the best use of these resources.
Network
Saguaro only connects to the public network through the head node, but has two internal networks, one Ethernet and one Infiniband. Infiniband is a specialized high-speed, low-latency network well suited to High Performance Computing. Parallel computations should use Infiniband wherever possible.
By default MPI jobs run over Infiniband. Should you need to run over the slower Ethernet, please submit a service request.
Queues
To help run jobs as quickly as possible, Saguaro has four (4) queues to submit jobs to. Submitting jobs to the correct queue will speed up the time from submission of your job to the actual starting of your job. If your job does not match the parameters of the queue it is submitted to, it may not run.
- devel for development jobs (e.g., testing, short runs). Jobs are limited to 8 processors and 15 minutes total CPU time.
- normal for all other jobs
Additional memory is available through custom job scripts that leave certain processor cores idle to free up additional RAM, up to 16GB per processor. For assistance with this, please review the allocation policy and file a service request.
Queues are specified in your jobscript with the line #PBS -q queuename.
Job Scripts
To make the most effective use of Saguaro, user jobs need to share the system. Saguaro uses the Torque resource management system from Cluster Resources to distribute compute tasks evenly across the available compute nodes, allowing your jobs to run quickly and without conflict. Using Torque to run cluster applications replaces the difficult task of manually finding free compute nodes to run on.
Once you have a job you want to run, you will also need to prepare a script that tells the cluster scheduler which resources to use to run your job. For example, Torque has a pool of nodes with 6GB of memory per node. By default, Torque uses nodes that have 1GB of memory, but users can tell Torque to only use nodes with 6GB of RAM. Many cluster applications require multiple processors. By default Torque allocates 1 processor to a job, but users can tell Torque to use 2, 4, 8, or more processors to run a job. It is possible to submit a job directly with these parameters, but it is much more efficient to include this information in a file we will call a submission script, so that all you need to remember each time you submit the job is the name of the submission script.
A submission script is actually quite easy to create. The script is just a text file that means something special to Torque. The script consists of two parts: qsub options and the command to execute your program. Two example scripts follow, with an explanation of what they do.
The following examples start the program on the development queue (on 8 processors in the MPI example and on 1 processor in the non-MPI example), join the normal and error outputs together, append the output to Example.output, and set it to run for 12 hours. Note: Users are now required to submit walltime estimates to run jobs on Saguaro. Jobs substantially exceeding this estimate will be killed. Run man qsub for more submission options.
If you compiled your program with PGI or Intel compilers, or executed a "use" command, please see the Software page for information on submitting jobs that need the "use" command.
Example Job Script using MPI
#!/bin/bash
#PBS -l nodes=8
#PBS -q devel
#PBS -j oe
#PBS -o Example.output
#PBS -l walltime=12:00:00
cd $PBS_O_WORKDIR
mpiexec /home/user/mpiprogram
Example Job Script (non-MPI)
#!/bin/bash
#PBS -l nodes=1
#PBS -q devel
#PBS -j oe
#PBS -o Example.output
#PBS -l walltime=12:00:00
cd $PBS_O_WORKDIR
/home/user/yourprogram
The meanings of these commands are:
#!/bin/bash the shell to run the script under.
#PBS -l nodes=8 sets how many processors to run on
#PBS -q devel sets the queue to run in
#PBS -j oe Combine stdout and stderr into the same file
#PBS -o Example.output redirects the output from your job to Example.output
#PBS -l walltime=12:00:00 sets your walltime estimate to 12 hours. The time takes the form Days:Hours:Minutes:Seconds; leading fields may be omitted, as in this example, which gives only Hours:Minutes:Seconds.
cd $PBS_O_WORKDIR makes sure the job runs from the directory it was submitted from.
mpiexec ... or .../yourprogram is the program to run. If you need to use something, put the use command before this line.
Optionally you can specify another name for the job with #PBS -N jobname. The default for the job name is the name of the jobscript.
To learn more about the commands you can use to interact with Torque, the first place to look is the man (i.e., manual) page for that command via the command line. For instance, to learn more about qsub, run man qsub. Another information source is the Torque Documentation.
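If you submit many similar jobs, the script can be generated instead of edited by hand. The sketch below is one possible way to do that; make_jobscript is a hypothetical helper, not a tool provided on Saguaro, and it only uses the #PBS options explained above.

```shell
#!/bin/bash
# make_jobscript PROCESSORS QUEUE WALLTIME COMMAND...
# Hypothetical helper: prints a Torque job script built from its arguments.
make_jobscript() {
    procs=$1; queue=$2; walltime=$3
    shift 3
    # \$PBS_O_WORKDIR is escaped so it is expanded when the job runs, not now.
    cat <<EOF
#!/bin/bash
#PBS -l nodes=$procs
#PBS -q $queue
#PBS -j oe
#PBS -o Example.output
#PBS -l walltime=$walltime
cd \$PBS_O_WORKDIR
$*
EOF
}

# Reproduces the MPI example script from this section on standard output.
make_jobscript 8 devel 12:00:00 mpiexec /home/user/mpiprogram
```

Redirecting the output to a file, for example with > myjobscript.sh, gives a script that can be passed to qsub.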
3. Run and Manage Your Jobs
Now that your job is ready to run, you will need to:
- Submit your job
- Monitor (check on) your job
- And possibly cancel your job
Submit Jobs
Torque jobs are submitted using the qsub command. There are two ways to use qsub: directly and, as described in Preparing Jobs, using a submission script. The direct way is running qsub like qsub [options] command [arguments].
To submit a job, simply run: qsub myjobscript.sh
For an example to get started with, continue reading and play with our hello world example.
Monitor Jobs
To view the status of the queue and of your submitted jobs, run qstat -a. To look at a specific job, run checkjob jobid#
Canceling Jobs
To cancel a submitted job: qdel jobid#
Example: Hello World!
Traditionally Hello World is used as an introductory tutorial for learning new systems. The most basic version just writes "Hello World" on the screen. This example wants to show that each of the processors is actually doing something, so they will each say "Hello world". This example uses MPI and is written in C. I will first give the code listing to the program itself, then a submission script. Then I will give step by step instructions for compiling and running the program.
hello.c
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world. I am %d of %d.\n", rank, size);
    MPI_Finalize();
    return 0;
}
hello.sh
#!/bin/bash
#PBS -l nodes=4
#PBS -q devel
#PBS -j oe
#PBS -o hello_output
cd $PBS_O_WORKDIR
mpiexec /home/user/a.out
Now that you have the correct files, we can compile and run your new program. To compile the program run mpicc hello.c. This will create the executable a.out. As it is an MPI program it cannot be run by itself, but not to worry! Running it is as easy as qsub hello.sh!
Now that you have submitted it to the queue, you must wait for it to finish. To see the status of this "HelloWorld" job, run qstat -a.
After the job is complete, open the file hello_output with your preferred text editor to see the output of the program. We ran the "HelloWorld" job with 4 processors, so there should be 4 lines of output from this program, one from each processor used. If you want to play around more, try running it on more or fewer processors by modifying hello.sh, or have the processors print out different things.
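For larger runs, counting the lines by eye becomes tedious. The following sketch checks that every processor reported in; check_hello is a hypothetical helper written for this example, and the sample file stands in for a real hello_output so the sketch can be tried without submitting a job.

```shell
#!/bin/bash
# check_hello FILE EXPECTED
# Hypothetical helper: succeeds if FILE contains exactly EXPECTED greeting lines.
check_hello() {
    found=$(grep -c '^Hello, world\. I am [0-9]* of [0-9]*\.$' "$1" || true)
    [ "$found" -eq "$2" ]
}

# Stand-in for the hello_output of a 4-processor run.
# The ranks are deliberately out of order, as they often are in real output.
sample=$(mktemp)
printf 'Hello, world. I am %d of 4.\n' 2 0 3 1 > "$sample"

check_hello "$sample" 4 && echo "all 4 processors reported"
```

After a real run, the same check would be check_hello hello_output 4, with the count changed to match the nodes= value in hello.sh.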
This basic exercise is a good starting point for using Torque to execute parallel applications. Any questions or problems using Torque should be submitted as a Service Request.