UN*X-Cluster:DQS3

DQS is a method for distributing the batch workload among multiple unix-based machines.

Configuration

Available Resources (Alias Complexes)
Complex   Description                                     Requirement
cpu       CPU time (not real time) in minutes             mandatory
memmin    minimal amount of memory in MB                  optional for most queues; mandatory for large memory queues
memmax    maximal amount of memory in MB                  mandatory
sys       operating system; possible values: OSF1, LINUX  mandatory for LINUX queues
psr       processor; possible values: alpha_ev4           mandatory for queues on hosts with EV4 processors
priority  priority; possible values: low                  mandatory for low priority queues on slow computers
stable    stable; possible values: no                     mandatory for queues that can be disabled or killed for any reason
dualcpu   dualcpu; possible values: true                  mandatory for queues with two CPUs (for use of OpenMP etc.)

For a list of queues see below.

Hints for Using DQS

  • Make sure that your file $HOME/.login starts with the lines:
              if ($?JOB_NAME) then
                      exit 0
              endif
    
  • Avoid the option -cwd. Because we use the automounter amd, it can make the job change to the wrong directory. Instead, use the command cd in your job script, e.g. cd $HOME. (But never use a path that starts with /amd.)
  • Submit your jobs using the command qsub. Note that you have to specify some mandatory resources using the option -l. (See example)
  • You can get the status of all queues/jobs using the command qstat -f
  • Use qdel <job_id_list> to delete a job. Type qstat to get the job identification number.
  • Although memmin is an optional parameter for most queues, defining memmin makes your job more likely to be scheduled. E.g., if your job needs roughly 45 MB of (real) memory, you should define: memmin.le.45.and.memmax.ge.45.
  • If your job is going to run binaries which are only available for one system, do not forget to define sys. (e.g. sys.eq.OSF1 or sys.eq.LINUX)
  • To get more information about a specific queue use qconf -sq <queue_name>.
  • To get the status of all your currently running and queued jobs use qstat -u <my_name>.
  • If you know that your job does not need much computing power and does not have to finish quickly, use priority.eq.low.
  • If you do not care whether your jobs are suspended or killed for any reason without notice, you can use queues with stable.eq.no.

Small Examples

Copy the following script into a file and submit it by typing qsub <file> (replace <file> appropriately).

#!/bin/csh
#
# Simple DQS3 test script
#
# I want a queue where the cpu time limit is larger than 60 minutes and
# where I can use up to 5MB of memory.
#$ -l cpu.gt.60.and.memmax.gt.5
#
# Send me a mail at beginning/end/on suspension
#$ -m bes

hostname
date
sleep 60
   

In case you need a lot of memory, please note that you must specify the minimal required memory memmin. The complex memmin is mandatory for large memory queues so that small jobs are not scheduled to large memory queues (you can always specify memmin, since it usually makes no difference). Here is an example in which John F. Physicist wants to run his program bighummer, which needs about 150MB of memory, will take 10 hours to complete, and should run on an Alpha workstation with Digital UN*X (formerly OSF1):

#!/bin/csh
#
# DQS3 script to run program bighummer
#
#$ -l cpu.gt.600.and.memmin.le.150.and.memmax.ge.150.and.sys.eq.OSF1
#$ -m bes

cd /home/john/bighummer
./bighummer
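
A resource request like the one in this script can be assembled mechanically from the job's needs. The following is a minimal sketch in Bourne shell (not csh, which has no functions). The function name build_request and its argument order are made up for illustration and are not part of DQS. It converts hours to minutes for cpu and brackets the memory need with memmin and memmax, as in the example above.

```sh
#!/bin/sh
# Hypothetical helper (not part of DQS): print a resource request
# suitable for the qsub option -l.
#   $1  CPU time needed, in hours
#   $2  memory needed, in MB
#   $3  operating system (OSF1 or LINUX), optional
build_request() {
    hours=$1
    mem=$2
    sys=${3:-}
    cpu_min=$(( hours * 60 ))      # the cpu complex is given in minutes
    req="cpu.gt.${cpu_min}.and.memmin.le.${mem}.and.memmax.ge.${mem}"
    if [ -n "$sys" ]; then
        req="${req}.and.sys.eq.${sys}"
    fi
    echo "$req"
}

# The bighummer job: 10 hours, about 150MB, OSF1
build_request 10 150 OSF1
```

The printed string can be pasted after "#$ -l" in a job script. Further complexes such as priority or stable would have to be appended by hand.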
   

How do I find out how much memory my program needs?

To find out how much memory your program needs, please make a test run on any machine. While the program is running you can use top to get information about the amount of memory your program needs. You will get output like this:

  PID USERNAME PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
 7752 myname    62   15  902M  432M run   762:36 48.40% myprogram
 1288 another   62   15  102M   69M run    21.3H 48.30% cpu_eater
   

From the SIZE column you can see that your program (myprogram) needs 902MB of memory. The best way, however, is to compute the requirement from the memory allocations you make in your code!

Please check your memory requirements. If your job exceeds the memory limit of its queue, be aware that it can be killed without warning.
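
If you prefer a number you can paste into a resource request over reading the top display, something like the following Bourne shell sketch may help. It assumes a ps that accepts the options -o vsz= -o rss= -p and reports both values in KB, as the Linux ps does. Whether the OSF1 ps behaves the same way has not been checked here. The function names kb_to_mb and mem_of_pid are made up for illustration.

```sh
#!/bin/sh
# Round a size given in KB up to whole MB.
kb_to_mb() {
    echo $(( ($1 + 1023) / 1024 ))
}

# Print the virtual size (SIZE in top) and the resident size (RES in top)
# of the process with the given PID, both in MB.
# Assumption: ps supports -o vsz= -o rss= -p and reports KB.
mem_of_pid() {
    ps -o vsz= -o rss= -p "$1" | {
        read -r vsz rss
        echo "SIZE $(kb_to_mb "$vsz")MB  RES $(kb_to_mb "$rss")MB"
    }
}

# 923648 KB is exactly 902MB
kb_to_mb 923648
```

Calling mem_of_pid with the PID of your running test job (7752 in the top output above) prints both values. Rounding up is deliberate, so that the value used for memmax is never smaller than the measurement.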

Some rules

  • DQS only works if every group that wants to join also makes as many machines as possible available for DQS. If a group does not want to provide machines, we kindly ask its members not to use DQS!
  • If there is a queue on a machine, it makes little sense to run non-DQS jobs there, because the DQS jobs run with the maximum nice value. In this case either the queue should be deleted or the user should use DQS.
  • Whenever possible use local disk space for your I/O to avoid NFS traffic slowing down your program.
  • On all CIP computers the queues are disabled during the day because of several complaints.

Further Help

For more information about DQS see here.

List of Queues

Available Queues
Queue       Quantity  cpu[min]  memmin[MB]  memmax[MB]  Operating System  Priority  stable  dualcpu
apollo_1    1         2880      0001        0032        Osf1
axp4_1      1         2880      0001        0064        Osf1
beta_1      1         4320      0256        0768        Linux
bos2_1      1         2880      0001        0064        Osf1
bos2_2      1         0120      0001        0016        Osf1
bounty_1    1         2880      0001        0064        Linux
brahms_1    1         2880      0001        0064        Linux
carlina_1   1         2880      0256        0768        Linux
chip_1      1         2880      0001        0064        Linux
cyanus_1    1         2880      0256        0768        Linux
dale_1      1         2880      0001        0064        Linux
dasy_1      1         2880      0001        0064        Linux
edvpc5_1    1         2880      0001        0128        Linux                       no
egge_1      1         2880      0001        0064        Linux
fantasio_1  1         2880      0064        0256        Linux                       no
ficus_1     1         2880      0001        0128        Linux             low
g24_1       1         2880      0256        0768        Linux
g25_1       1         2880      0256        0768        Linux
g26_1       1         2880      0256        0768        Linux
g27_1       1         2880      0256        0768        Linux
g28_1       1         2880      0256        0768        Linux
george_1    1         2880      0001        0064        Linux
hayden_1    1         4320      0001        0384        Linux
hils_1      1         2880      0001        0064        Linux
ith_1       1         2880      0001        0064        Linux
jerry_1     1         2880      0001        0064        Linux
john_1      1         2880      0001        0064        Linux
kenny_1     1         2880      0001        0064        Linux
kleinert_1  1         2880      0001        0064        Osf1
kleinert_2  1         0120      0001        0032        Osf1
krabat_1    1         4320      0256        0768        Linux
lion_1      1         2880      0001        0064        Linux
mars_1      1         2880      0001        0064        Linux
mime_1      1         2880      0256        0768        Linux
mozart_1    1         2880      0001        0064        Linux
obelix_1    1         2880      0001        0064        Linux
paul_1      1         2880      0001        0064        Linux
ping_1      1         4320      0256        0768        Linux
pips_1      1         2880      0001        0064        Linux
pong_1      1         4320      0256        0768        Linux
ringo_1     1         2880      0001        0064        Linux
sigma_1     2         4320      0128        0256        Osf1
sigma_2     1         2880      0064        0128        Osf1
sigma_3     2         2880      0001        0064        Osf1
sigma_4     1         2880      0128        0256        Osf1
smart_1     1         4320      0256        0768        Linux                       no
spiron_1    1         2880      0001        0064        Linux
struppi_1   1         2880      0001        0064        Linux
suentel_1   1         2880      0001        0064        Linux
thisbe_1    2         2880      0001        0064        Linux
thisbe_2    1         2880      0001        0128        Linux                               true
tom_1       1         2880      0001        0064        Linux
warp9_1     1         2880      0001        0128        Linux                       no
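
To see in advance which queues a resource request could match, the list above can be filtered with awk. The following Bourne shell sketch assumes that DQS compares the requested values against the queue limits in the way the examples suggest (queue cpu limit greater than the request, queue memmin not above and queue memmax not below the needed memory). The function name match_queues is made up for illustration. Only the first six columns are looked at, so queues that additionally require priority, stable or dualcpu are not treated specially.

```sh
#!/bin/sh
# Hypothetical helper (not part of DQS): read queue rows on standard input
# (whitespace-separated, columns as in the list of queues) and print the
# names of the queues that would satisfy
#   cpu.gt.CPU .and. memmin.le.MEM .and. memmax.ge.MEM .and. sys.eq.SYS
#   $1  CPU minutes   $2  memory in MB   $3  operating system
match_queues() {
    awk -v cpu="$1" -v mem="$2" -v sys="$3" '
        # $3 = cpu[min], $4 = memmin[MB], $5 = memmax[MB], $6 = operating system
        ($3 + 0) >  (cpu + 0) &&
        ($4 + 0) <= (mem + 0) &&
        ($5 + 0) >= (mem + 0) &&
        toupper($6) == toupper(sys) { print $1 }
    '
}

# A few rows copied from the list of queues, checked against the
# bighummer request (more than 600 minutes, 150MB, OSF1).
match_queues 600 150 OSF1 <<'EOF'
beta_1 1 4320 0256 0768 Linux
fantasio_1 1 2880 0064 0256 Linux
hayden_1 1 4320 0001 0384 Linux
sigma_1 2 4320 0128 0256 Osf1
sigma_3 2 2880 0001 0064 Osf1
sigma_4 1 2880 0128 0256 Osf1
EOF
```

For these sample rows the sketch prints sigma_1 and sigma_4. The authoritative answer is always what qstat -f and qconf -sq <queue_name> report for the live system.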