How To - MC production job on mercury                                                                        Kenji Hamano

Last modified: Dec 21, 2004


A. Daily routine.
   1. Login and set up
     > ssh -l babarpro mercury.uvic.ca
     > [password]
     > cd relese/sp1443a_event
     > srtpath
     > [return] [return]
     > source setboot_con
   2. Do spcheck and qstat to check the status of jobs.
    It is convenient to use two windows: one for spcheck and the other for qstat.
     > spcheck
     > qstat
   3. Clean up killed jobs from the queue
     > qdel 354450
   4. Submit built runs and re-submit failed or killed runs
     > subbuilt.pl 5160709-5160725
     > subfail.pl 5160709-5161487
   5. Check the number of jobs on the hep_ queue to decide the number of jobs to submit.
     > qstat -q hep_
   6. Build new runs
     > spbuild --user babarpro 5161488-5162267
   7. Submit the runs (jobs)
     > spsub -y -t simu 5161488-5162267
   8. Check runs (jobs) by spcheck and qstat
   9. Do the merge and keep logs in the splog directory
     > spmerge --user babarpro -v --debug | tee splog/merge050123-1.log
   10. Check submitted merge_jobs by qstat
   11. Do the export and keep logs in the splog directory (the date stamps in the log names are sketched below)
     > spexport --user babarpro -v --debug -n 10 | tee splog/export050123-1.log
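     The date stamps in the merge and export log names (e.g. merge050123-1.log) appear to be
     YYMMDD plus a sequence number. Assuming a csh-style login shell, the stamp can be generated
     with the standard date command (a sketch; the trailing -1 still has to be chosen by hand):
      > set stamp = `date +%y%m%d`
      > spmerge --user babarpro -v --debug | tee splog/merge${stamp}-1.log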

B. Login and set up
   1. Login
      Log in to mercury using the babarpro account.
       > ssh -l babarpro mercury.uvic.ca
       > [password]
   2. Go to the release directory.
       > cd relese/sp1443a_event
      a. The release directory changes from time to time. A release is a version of the Babar software.
      b. Currently we are producing SP6 MC, and the release is 14.4.3a.
      c. We are going to move to SP7 soon, and the release will be something like 16.0.x.
   3. Set srtpath and the boot file for Babar jobs
       > srtpath
       > [return] [return]
       > source setboot_con
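   4. A convenience: the setup steps can be wrapped in a csh alias (a sketch, assuming a
      csh-style login shell and that the release directory sits under the home directory,
      as the cd above suggests; srtpath still asks for the two returns):
        > alias prodsetup 'cd ~/relese/sp1443a_event; srtpath; source setboot_con'
        > prodsetup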

C. Process of Babar MC production jobs
   1. Build run directories.
       > spbuild --user babarpro 5161488-5162267
     a. Run ranges are given by allocations: something like 5160708-5161487.
      b. If you are close to the end of the given allocation, you need to request a new allocation.
      c. The number of runs to build is decided by looking at the number of jobs (= runs) on the queue.
      d. Usually, mercury can finish 300-700 runs per day depending on the type of MC.
      e. The type of MC is decided by the people at SLAC, so when you run spbuild,
        it communicates with a database at SLAC and gets the correct settings files for each run.
      f. Run directories are built under the $BFROOT/prod/log/allruns directory.
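      g. To confirm how many run directories in a range were actually built, standard tools
        are enough (a sketch; all runs in the example range start with 516):
        > ls -d $BFROOT/prod/log/allruns/516* | wc -l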
   2. Submit jobs (= runs)
       > spsub -y -t simu 5161488-5162267
     a. You can check the status of jobs by "spcheck".
      b. If the job finished successfully, it will be marked as "good".
      c. The status of each run is in this file: $BFROOT/prod/log/allruns/[run number]/status.txt
      d. The log file is: $BFROOT/prod/log/allruns/[run number]/A14.4.3aV01x47F/simu[run number].log.gz
      e. If something is wrong, you can check this log file.
      f. The outputs of the simulation (simu) job are two root files in
        $BFROOT/prod/log/allruns/[run number]/A14.4.3aV01x47F
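      g. A quick way to spot problem runs across the whole allruns tree is to search the status
        files directly (a sketch; it assumes status.txt records the word "fail" for failed runs,
        matching what spcheck reports):
        > grep -l fail $BFROOT/prod/log/allruns/*/status.txt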
   3. Merge jobs.
      The next step is to merge the output root files into two bigger root files.
       > spmerge --user babarpro -v --debug | tee splog/merge050123-1.log
      a. It is good practice to keep logs in the splog directory.
      b. spmerge collects tens of runs into a merge collection like SP_003397_001969,
        submits a few merge_jobs (one merge_job per merge collection) to the express queue,
        and changes "good" jobs to "merging".
      c. These merge_jobs merge the small root files into two big root files.
      d. The merged root files and the logs of the merge_jobs are stored in the
        $BFROOT/prod/log/allruns/merge directory.
     e. You can check these files by
       > ls -alR $BFROOT/prod/log/allruns/merge
      f. If there are finished merge_jobs, spmerge checks the merge collections and marks them as
        "good" merged collections, then changes the status of the runs from "merging" to "merged".
      g. After that, spmerge starts sparchive.
      h. sparchive tars the merged run directories into a big tar file, something like:
        $BFROOT/prod/log/allruns/logs_SP6_0737.tar
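      i. To see what went into one of these tar files, or how much space the merge area is using,
        standard tools work (a sketch):
        > tar -tvf $BFROOT/prod/log/allruns/logs_SP6_0737.tar | less
        > du -sh $BFROOT/prod/log/allruns/merge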
   4. Export merged collections to SLAC
       > spexport --user babarpro -v --debug -n 10 | tee splog/export050123-1.log
      a. spexport checks the merged collections, and if there are good collections, spexport exports them to SLAC.
      b. spexport exports the two big root files of each collection.
      c. To see the progress of the export, follow these steps.
       > ssh -l bbrdist bbr-xfer06.slac.stanford.edu
       > cd /ddist1/import/evs/SP6/uvic2
       > ls
      d. spexport then updates the status of the transfer to "good" and the status of the collection to "exported".
     e. If there are exported collections, spexport removes those collections from the merge directory.
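      f. To get a rough idea of how far a transfer has progressed on the SLAC side, you can count
        or size the files in the import directory from the bbr-xfer06 session in step c (a sketch):
        > ls /ddist1/import/evs/SP6/uvic2 | wc -l
        > du -sh /ddist1/import/evs/SP6/uvic2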

D. Useful commands and scripts
   1. spcheck
      This script shows the status of runs in the $BFROOT/prod/log/allruns directory. The following is an example of the output:
        5134034 simu : A14.4.3aV01x48F - killed
                       A14.4.3aV02x48F - killed
        5134042 simu : A14.4.3aV01x48F - fail - signal 6 error
                       A14.4.3aV02x48F - good - 2000 events - 3.00 hrs - 5.41 s/ev
        5134060 simu : A14.4.3aV01x48F - fail - signal 6 error
                       A14.4.3aV02x48F - good - 2000 events - 2.97 hrs - 5.35 s/ev
        5134061 simu : A14.4.3aV01x48F - fail - signal 6 error
                       A14.4.3aV02x48F - good - 2000 events - 3.17 hrs - 5.71 s/ev
        5134137 simu : A14.4.3aV01x48F - fail - signal 6 error
                       A14.4.3aV02x48F - good - 2000 events - 3.03 hrs - 5.46 s/ev
        5134152 simu : A14.4.3aV01x48F - fail - signal 6 error
                       A14.4.3aV02x48F - good - 2000 events - 3.05 hrs - 5.49 s/ev
        5134419 simu : A14.4.3aV01x48F - good - 2000 events - 3.07 hrs - 5.53 s/ev
        5134420 simu : A14.4.3aV01x48F - good - 2000 events - 3.13 hrs - 5.64 s/ev
        5134421 simu : A14.4.3aV01x48F - good - 2000 events - 3.49 hrs - 6.29 s/ev
        5134422 simu : A14.4.3aV01x48F - good - 2000 events - 3.04 hrs - 5.47 s/ev
        5166551 simu : A14.4.3aV01x51F - merging - into SP_001235_025337
        5166552 simu : A14.4.3aV01x51F - merging - into SP_001235_025337
        5166553 simu : A14.4.3aV01x51F - merging - into SP_001235_025337
        5166554 simu : A14.4.3aV01x51F - merging - into SP_001235_025337
        5166555 simu : A14.4.3aV01x51F - merging - into SP_001235_025337
        5166556 simu : A14.4.3aV01x51F - merging - into SP_001235_025337
      When you build run directories, they are V01.
      V01 of runs 5134034, 5134042, 5134060, 5134061, 5134137 and 5134152 were either "killed" or "fail",
      so V02 of those runs were submitted. V02 of 5134034 was killed again, but the others became "good".
      Runs 5134419-5134422 succeeded on the first attempt; usually this is the case.
      Runs 5166551-5166556 are "merging" into one merge collection, SP_001235_025337.
      When you run spmerge again, these "merging" jobs are changed to "merged", tarred into a file, and removed from the allruns directory.
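      Since spcheck prints one status per line, a rough summary per run version can be pulled out
      with grep (a sketch based on the output format above; spcheck.txt is just a scratch file):
        > spcheck | tee spcheck.txt
        > grep -c ' - good - ' spcheck.txt
        > grep -c ' - fail - ' spcheck.txt
        > grep -c ' - merging - ' spcheck.txt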
   2. spbuild
      This script builds run directories in the allruns directory. By default it builds V01 of the runs. If you need V02:
        > spbuild --user babarpro -V02 5134060
      This builds V02 of run 5134060.
   3. spsub
      This script submits jobs (= runs). By default it submits V01 of the runs. If you need to submit V02:
        > spsub -y -t simu -V02 5134060
      This submits V02 of run 5134060.
      For killed or failed runs, the "subfail" script builds V02 runs and submits them automatically.
   4. qstat
      This command checks the status of the queue. The following is an example of the output:
       355280.mercury    simu5161562    babarpro    02:33:09 R hep_
       355281.mercury    simu5161563    babarpro    02:14:00 R hep_
       355282.mercury    simu5161564    babarpro    02:03:36 R hep_
       355283.mercury    simu5161565    babarpro    01:50:51 R hep_
       355284.mercury    simu5161566    babarpro    01:36:48 R hep_
       355285.mercury    simu5161567    babarpro    01:29:29 R hep_
       355286.mercury    simu5161568    babarpro    01:21:00 R hep_
       355287.mercury    simu5161569    babarpro    01:21:31 R hep_
       355288.mercury    simu5161570    babarpro    00:55:05 R hep_
       355289.mercury    simu5161571    babarpro                0 R hep_
       355290.mercury    simu5161572    babarpro                0 Q hep_
       355291.mercury    simu5161573    babarpro                0 Q hep_
       355292.mercury    simu5161574    babarpro                0 Q hep_
       355293.mercury    simu5161575    babarpro                0 Q hep_
       355294.mercury    simu5161576    babarpro                0 Q hep_
       355295.mercury    simu5161577    babarpro                0 Q hep_
       355296.mercury    simu5161578    babarpro                0 Q hep_
      The first column contains the job ID numbers and the second column the job names, which contain the run numbers.
      R means the job is running; Q means the job is waiting in the queue.
      hep_ is the name of the queue.
     To check the number of jobs running and waiting:
       > qstat -q hep_
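      The per-state counts can also be pulled from the plain listing with standard tools
      (a sketch based on the output format above, where the state letter is the fifth column):
        > qstat | grep babarpro | awk '{print $5}' | sort | uniq -c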
   5. qdel
      qdel deletes a job from the queue:
       > qdel [job ID]
   6. top
        To check how busy mercury is.
        You can also check the processes you are running.
        For example, while exporting you will see "bbftp" processes.
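        If you only want to see the export transfers without starting top, ps works too (a sketch):
        > ps -aux | grep bbftp | grep -v grep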

E. Troubleshooting
   1. When mercury was down and has recovered.
      Most likely, the lock server is down.
      You have to start the lock server before starting production jobs. To start it:
        > oolockserver
      To see if it is running:
        > ps -aux | grep ools
      The lock server controls the MC production jobs, so if it is not running, all production jobs will fail immediately.
      Merge and export do not depend on the lock server.
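      A one-shot check that starts the lock server only if it is not already running
      (a sketch, assuming a csh-style shell and that the process name contains "ools",
      as in the ps check above):
         > ps -aux | grep ools | grep -v grep || oolockserver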
   2. When a specific node is causing problems, these commands are useful:
     > psnodes -a
     > psnodes -l
     > qstat -n | grep [node number]
       Node numbers are something like c03b02
   3. When jobs fail and you want to know why.
      The log files are usually a good place to look:
       > less $BFROOT/prod/log/allruns/[run number]/A14.4.3aV01x48F/simu[run number].log.gz
     For V02 runs
       > less $BFROOT/prod/log/allruns/[run number]/A14.4.3aV02x48F/simu[run number].log.gz
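      Rather than paging through the whole log, you can search the compressed file for obvious
      error messages (a sketch; which strings to look for depends on the failure, "signal 6"
      being one that spcheck reports):
        > gunzip -c $BFROOT/prod/log/allruns/[run number]/A14.4.3aV01x48F/simu[run number].log.gz | grep -i error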
   4. When the status of a run is "done" and does not become "good"
      This sometimes happens because of unknown problems with mercury.
      Use this Perl script to fix the problem:
       > parse_spcheck_done.pl [run number]
   5. When the time is wrong and spcheck stops.
      This sometimes happens because of unknown problems with mercury.
      You have to modify the status file of the run:
       > emacs $BFROOT/prod/log/allruns/[run number]/status.txt
   6. When you have to delete many jobs on the queue.
      First make a list of job numbers. For example, to delete all the jobs waiting in the queue:
       > qstat | grep babarpro | grep 'Q ' | sed 's/.mercury//g' | awk '{print $1}' > list.txt
     Then delete those jobs
       > foreach n (`cat list.txt`)
       > qdel $n
       > end
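      The same thing can be done in one line with xargs (a sketch; check the pipeline output
      before feeding it to qdel):
        > qstat | grep babarpro | grep 'Q ' | sed 's/.mercury//g' | awk '{print $1}' | xargs -n 1 qdel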

F. Useful websites, etc.
   1. HyperNews "Simulation Production"
      To subscribe to this HN, go to this site.
   2. MC Production - Home Page
   3. SP6 - KiloEvents per week
   4. Simulation Production and Site Stats
   5. Contact persons:
      Ashock Agarwal: numod@uvvm.uvic.ca -- sys admin of Babar-related systems
     Kenji Hamano: khamano@uvic.ca
     Drew Leske: dleske@uvic.ca -- sys admin of mercury
     Dirk Hufnagel: hufnagel@mps.ohio-state.edu -- manager of SLAC simulation production