This page last modified: Feb 02 2009
PBS setup configuration debugging
---------------------------------
Configuration suggestions and techniques for diagnosing problems with the Portable Batch System, PBS (e.g. OpenPBS, PBS Pro).

Table of contents
-----------------
Mount point suggestion
Example PBS qsub command (see "Example details" below)
Hardware and operating system
Problem description
How to diagnose "No permissions" and/or "cannot connect to host"
How to fix permissions/connect error
How to diagnose the problem of jobs hanging in the E state
Example details

Mount point suggestion
----------------------
PBS copies the executable it was told to run into a PBS spool directory on the worker node. This creates several problems. If your code assumes that libraries and config files exist in the same directory as the executable, then PBS will break your code. PBS supplies an environment variable, PBS_O_WORKDIR; however, I suggest that you create your own variable for your working and/or binary directory. If your mount points have different names on the head node and the cluster nodes (and there are good reasons for such differences to exist), then PBS_O_WORKDIR will be "wrong" on the worker nodes.

For example, I cd to /m0/mst3k/gss/bin on the head node and launch a job. It turns out that /bin/pwd reports the real path as /Volumes/xsadminHD1/m0/mst3k/gss/bin. However, the same cd /m0/mst3k/gss/bin on a worker node takes us to a different directory: /private/automount/m0/mst3k/gss/bin.

The workaround is to set your own private path environment variable and standardize its use throughout your code. Vela (my job queuing system) sets vela_path to /m0/mst3k/gss/bin, which is identical on both head and worker nodes. Ideally, you would set up your cluster with identical mount points on the head node and worker nodes. That could be an interesting challenge.

Example PBS qsub command (see "Example details" below)
-------------------------------------------------------
qsub -p -100 -c n -k eo -m n -r n -e /home/mst3k -o /home/mst3k

Hardware and operating system
-----------------------------
We found this problem on our Apple XServe cluster. The OS X software was reasonably up to date. We are running PBS Pro under an educational license. The problems described here appear to be configuration issues, not bugs. I'm guessing that these same issues apply to Linux clusters running OpenPBS as well as PBS Pro.

Problem description
-------------------
PBS requires that the head node can send email to users on the system, and PBS requires that worker nodes can rcp files to the head node. There are workarounds for both of these requirements. I also include some techniques for diagnosing problems with PBS. We had two main problems:

1) "No permission" error messages about PBS facilities being unavailable.
2) Jobs stayed in the E state for around 10 minutes.

If you get errors like the following from PBS when the system has active jobs, but not when the queue is empty (and has been empty long enough for all the offending processes to time out), then you may have an overloaded head node.

pbs_iff: cannot connect to host
No Permission.
qstat: cannot connect to server zeus (errno=15007)

This error was intermittent. Sometimes we could go several minutes with no error, and at other times PBS would be unavailable for several minutes. Apparently the cause was the head node being overloaded trying to send undeliverable email.
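A quick way to confirm that runaway mail delivery is the culprit is to look at the head node's load and at what is actually running. This is a minimal sketch rather than a transcript from our cluster; it assumes postfix is the mailer (as it was for us), and the process names you see may differ:

uptime                                                   # load averages well above 1 on the head node are suspect
ps aux | egrep -i 'postfix|bounce|smtpd' | grep -v egrep | wc -l   # count mail-related processes (names are a guess)
mailq | tail -1                                          # postfix: reports how many messages are stuck in the queue
ls -lh /var/log/mail.log                                 # on Linux: /var/log/maillog; an unusually large file is another clue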
With a properly running system, no matter how busy the worker nodes are, the load on the head node will be reasonable. This is typical proper load on our Apple XServe OS X dual G5 head node with 34 worker nodes all busy:

[mst3k@zeus mst3k]$ uptime
11:38 up 4 days, 3:48, 9 users, load averages: 0.93 1.01 0.98
[mst3k@zeus mst3k]$

In this case, the worker nodes were running BLAST searches, which are pretty much cpu and memory bound. A fairly large file (250MB) is initially read into memory, and output is small. We keep the 250MB file on local disk on each node so that we don't bog down the nfs server. The head node is not the nfs server for our cluster.

How to diagnose "No permissions" and/or "cannot connect to host"
----------------------------------------------------------------
Remember, this is a server load problem caused by too many email processes that can't complete because the email is undeliverable. In our case /var/log/mail.log (on Linux this file is called /var/log/maillog) was a very large file (1GB where it is normally 1MB), and it probably contained loads of error messages (I didn't see its contents). Server load as seen in "uptime" or "top" will be very high (greater than 2, or a very low idle cpu percentage).

top -o cpu
ps aux

You will see many processes related to email. These may be called "bounce", or contain email-related substrings. I remember seeing processes with "imap" in their names, and I seem to recall many processes owned by "postfix".

How to fix permissions/connect error
------------------------------------
From /etc/postfix/aliases:

# Put your local aliases here.
adm: /dev/null
ja4n: /dev/null
mst3k: /m1/mst3k.zeus.mail
jaw2d: /dev/null

You need to run the command newaliases after making changes to the aliases file. The file above gets postfix to put all of mst3k's email into the file /m1/mst3k.zeus.mail.

How to diagnose the problem of jobs hanging in the E state
-----------------------------------------------------------
If your jobs remain in the E state after completion, then you may have the rcp problem. You could get rcp working between the worker nodes and the head node, or you could get PBS to just leave the output files on the worker nodes. Since our worker nodes all have at least one nfs-mounted, shared file system, I chose to keep the files "local". In our case the local directory used is nfs mounted, so we don't need rcp. I suspect that nfs is more efficient than rcp. Besides, rcp and the related rsh utilities seem ancient and weird.

A sysadmin who has a .rhosts file did not experience this problem, so if you want to use rcp, you probably only need a good .rhosts file. (The .rhosts file either needs to be on an nfs share visible to each node, or it needs to be copied to each node.)

If your jobs are stuck in the E state, you'll see qstat output like this:

[mst3k@zeus mst3k]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
171266.zeus      pbs_run.pl       mst3k            00:10:26 E workq
171269.zeus      pbs_run.pl       mst3k            00:09:19 E workq
171274.zeus      pbs_run.pl       mst3k            00:10:34 E workq
171276.zeus      pbs_run.pl       mst3k            00:09:23 E workq
171277.zeus      pbs_run.pl       mst3k            00:09:37 E workq
171278.zeus      pbs_run.pl       mst3k            00:08:33 E workq

If you get qstat details for a job, you can discover which node it ran on. Then you can ssh to that node and look at the running processes. Typically, you'll see a pbs_rcp process that stays in the process list for a long time (10 minutes). Also, you will be unable to rcp or rsh from the worker nodes to the head node.
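One way to check that directly is to log into a worker node and try rsh and rcp back to the head node by hand. This is a rough sketch using our head node's name (zeus); substitute your own hostnames, and note that the test copy destination is an arbitrary path chosen for illustration:

ssh xs04                                # log into a worker node
time rsh zeus true                      # should return almost immediately; a long hang mirrors what pbs_rcp sees
rcp /etc/motd zeus:/tmp/rcp_test.$$     # arbitrary test copy back to the head node
cat ~/.rhosts                           # rcp/rsh back to the head node generally needs a usable .rhosts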
There will also be "operation timed out" messages in PBS's rcperr file(s) on the worker node.

qstat -f 171278.zeus

In the output of the qstat command, look for this line:

exec_host = xs04/0

ssh xs04
ps aux | grep rcp

If you see a pbs_rcp process, and the process stays in the list, this indicates a potential problem. Lastly, if PBS is logging problems in the rcperr file, that's a sure sign of trouble.

[mst3k@xs23 mst3k]$ ls -alt /var/spool/PBS/spool/rcperr*
total 64
-rw-r--r--   1 mst3k  wheel   26 13 Apr 10:14 rcperr.4709
drwxrwxrwt  18 root   wheel  612 13 Apr 10:10 .
[mst3k@xs23 mst3k]$ cat /var/spool/PBS/spool/rcperr.4709
zeus: Operation timed out
[mst3k@xs23 mst3k]$

Example details
---------------
The key to dealing with rcp issues is to stop PBS from trying to use rcp, or to fix the system configuration so that rcp works. There isn't a PBS setting to disable rcp, so you have to tell PBS to keep the error and output files on each host, and you have to give PBS a valid path for those files. As far as I can tell, if you do not do both things, PBS will still try to use rcp.

Keep both files on the execution host:
-k eo

Paths for stderr and stdout output:
-e stderr_path -o stdout_path

For example:
-k eo -e /home/mst3k -o /home/mst3k

The following options were a nice idea, but didn't help solve our PBS performance issues.

Don't checkpoint:
-c n

Don't send mail:
-m n

Not rerunnable (seems like less work for PBS):
-r n

If you are using nfs-mounted (or otherwise network-shared) home directories, and you are therefore not using rcp, then you will want to set -e and -o. This is my suggested command line:

/usr/pbs/bin/qsub -p -100 -c n -k eo -m n -r n -e /home/mst3k -o /home/mst3k

Use your preference for the -p priority. Negative numbers are lower priority; the default is zero.
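To tie the pieces together, here is a sketch of a submission wrapper in the spirit of the suggestions above. It is not our production script: the variable name GSS_PATH and the job script run_blast.sh are hypothetical stand-ins, and the paths simply reuse the examples from this page.

#!/bin/sh
# Sketch of a submission wrapper: set a private path variable that is spelled
# the same on the head node and the worker nodes (see "Mount point suggestion"),
# then submit with the options recommended above.

GSS_PATH=/m0/mst3k/gss/bin       # hypothetical name; identical string on head and worker nodes
export GSS_PATH

# -v passes the named environment variable through to the job, so the job
# script can cd "$GSS_PATH" instead of trusting PBS_O_WORKDIR.
/usr/pbs/bin/qsub -p -100 -c n -k eo -m n -r n \
    -e /home/mst3k -o /home/mst3k \
    -v GSS_PATH \
    "$GSS_PATH/run_blast.sh"

Inside the job script you would then use $GSS_PATH for anything that lives alongside the binary, which sidesteps the mount point differences described at the top of this page.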