How To - MC production job on mercury
Kenji Hamano
Last modified: Dec 21, 2004
A. Daily routine.
1. Login and set up
> ssh -l babarpro mercury.uvic.ca
> [password]
> cd relese/sp1443a_event
> srtpath
> [return] [return]
> source setboot_con
2. Do spcheck and qstat to check the status of jobs.
It is convenient to use two windows: one for spcheck and the other for qstat.
> spcheck
> qstat
3. Clean up killed jobs from the queue
> qdel 354450
4. Submit built runs and re-submit failed or killed runs
> subbuilt.pl 5160709-5160725
> subfail.pl 5160709-5161487
5. Check the number of jobs on the hep_ queue to decide the number of jobs to submit.
> qstat -q hep_
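If you just want the counts, one-liners like these should work (a sketch, assuming the qstat output format shown in section D; the patterns ' R ' and ' Q ' match the status column):
> qstat | grep babarpro | grep ' R ' | wc -l
> qstat | grep babarpro | grep ' Q ' | wc -l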
6. Build new runs
> spbuild --user babarpro 5161488-5162267
7. Submit the runs (jobs)
> spsub -y -t simu 5161488-5162267
8. Check runs (jobs) by spcheck and qstat
9. Do merge and keep logs in splog directory
> spmerge --user babarpro -v --debug | tee splog/merge050123-1.log
10. Check submitted merge_jobs by qstat
11. Do export and keep logs in splog directory
> spexport --user babarpro -v --debug -n 10 | tee splog/export050123-1.log
B. Login and set up
1. Login
Log in to mercury using the babarpro account.
> ssh -l babarpro mercury.uvic.ca
> [password]
2. Go to the release directory.
> cd relese/sp1443a_event
a. The release directory changes from time to time. A release is a version of the BaBar software.
b. Currently we are producing SP6 MC, and the release is 14.4.3a.
c. We are going to move to SP7 soon, and the release will be something like 16.0.x.
3. Set srtpath and the boot file for BaBar jobs
> srtpath
> [return] [return]
> source setboot_con
C. Process of BaBar MC production jobs
1. Build run directories.
> spbuild --user babarpro 5161488-5162267
a. Run ranges are given by allocations: something like 5160708-5161487.
b. If you are close to the end of the given allocation, you need to request a new one.
c. The number of runs to build is decided by looking at the number of jobs (= runs) on the queue.
d. Usually, mercury can finish 300-700 runs per day depending on the type of MC.
e. The type of MC is decided by the people at SLAC: when you run spbuild,
it contacts a database at SLAC and gets the correct settings files for each run.
f. Run directories are built in the $BFROOT/prod/log/allruns directory (see the check after this list).
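To confirm that the run directories were created, you can list one of the new runs (a sketch; the run number is an example taken from the spbuild command above):
> ls -d $BFROOT/prod/log/allruns/5161488
> ls $BFROOT/prod/log/allruns/5161488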
2. Submit jobs (= runs)
> spsub -y -t simu 5161488-5162267
a. You can check the status of jobs with "spcheck".
b. If a job finishes successfully, it is marked "good".
c. The status of each run is in this file: $BFROOT/prod/log/allruns/[run number]/status.txt
d. The log file is: $BFROOT/prod/log/allruns/[run number]/A14.4.3aV01x47F/simu5161037.log.gz
e. If something is wrong, you can check this log file (see the sketch after this list).
f. The outputs of the simulation job are two root files in
$BFROOT/prod/log/allruns/[run number]/A14.4.3aV01x47F
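To inspect a single run by hand (a sketch using the locations from items c and d; zcat is assumed to be available for the gzipped log):
> cat $BFROOT/prod/log/allruns/[run number]/status.txt
> zcat $BFROOT/prod/log/allruns/[run number]/A14.4.3aV01x47F/simu[run number].log.gz | less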
3. Merge jobs.
The next step is to merge the output root files into two bigger root files.
> spmerge --user babarpro -v --debug | tee splog/merge050123-1.log
a. It is good practice to keep logs in the splog directory.
b. spmerge collects tens of runs into a merge collection like SP_003397_001969,
submits a few merge_jobs (one merge_job per merge collection) to the express queue,
and changes "good" jobs to "merging".
c. These merge_jobs merge the small root files into two big root files.
d. The merged root files and the logs of the merge_jobs are stored in the
$BFROOT/prod/log/allruns/merge directory (see the sketch after this list).
e. You can check these files with
> ls -alR $BFROOT/prod/log/allruns/merge
f. If there are finished merge_jobs, spmerge checks the merge collections, marks them "good" merged collections,
and changes the status of the runs from "merging" to "merged".
g. After that, spmerge starts sparchive.
h. sparchive tars the merged run directories into a big tar file, something like
$BFROOT/prod/log/allruns/logs_SP6_0737.tar
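To verify that a merge cycle finished, look for the merged collections and the archive tar files (a sketch; the tar file pattern follows the example in item h):
> ls -alR $BFROOT/prod/log/allruns/merge
> ls -l $BFROOT/prod/log/allruns/logs_SP6_*.tar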
4. Export merged collections to SLAC
> spexport --user babarpro -v --debug -n 10 | tee splog/export050123-1.log
a. spexport checks the merged collections, and if there are good collections, spexport exports them to SLAC.
b. spexport exports the two big root files of each collection.
c. To see the progress of the export, follow these steps (a quick count is sketched after this list):
> ssh -l bbrdist bbr-xfer06.slac.stanford.edu
> cd /ddist1/import/evs/SP6/uvic2
> ls
d. spexport then updates the status of the transfer to "good" and the status of the collection to "exported".
e. If there are exported collections, spexport removes those collections from the merge directory.
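For a quick count of how many files have arrived on the SLAC side (a sketch, run after logging in to bbr-xfer06 as in item c):
> ls /ddist1/import/evs/SP6/uvic2 | wc -l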
D. Useful commands and scripts
1. spcheck
This script shows the status of the runs in the $BFROOT/prod/log/allruns directory. The following is an example of the output:
5134034 simu : A14.4.3aV01x48F - killed
A14.4.3aV02x48F - killed
5134042 simu : A14.4.3aV01x48F - fail - signal 6 error
A14.4.3aV02x48F - good - 2000 events - 3.00 hrs - 5.41 s/ev
5134060 simu : A14.4.3aV01x48F - fail - signal 6 error
A14.4.3aV02x48F - good - 2000 events - 2.97 hrs - 5.35 s/ev
5134061 simu : A14.4.3aV01x48F - fail - signal 6 error
A14.4.3aV02x48F - good - 2000 events - 3.17 hrs - 5.71 s/ev
5134137 simu : A14.4.3aV01x48F - fail - signal 6 error
A14.4.3aV02x48F - good - 2000 events - 3.03 hrs - 5.46 s/ev
5134152 simu : A14.4.3aV01x48F - fail - signal 6 error
A14.4.3aV02x48F - good - 2000 events - 3.05 hrs - 5.49 s/ev
5134419 simu : A14.4.3aV01x48F - good - 2000 events - 3.07 hrs - 5.53 s/ev
5134420 simu : A14.4.3aV01x48F - good - 2000 events - 3.13 hrs - 5.64 s/ev
5134421 simu : A14.4.3aV01x48F - good - 2000 events - 3.49 hrs - 6.29 s/ev
5134422 simu : A14.4.3aV01x48F - good - 2000 events - 3.04 hrs - 5.47 s/ev
5166551 simu : A14.4.3aV01x51F - merging - into SP_001235_025337
5166552 simu : A14.4.3aV01x51F - merging - into SP_001235_025337
5166553 simu : A14.4.3aV01x51F - merging - into SP_001235_025337
5166554 simu : A14.4.3aV01x51F - merging - into SP_001235_025337
5166555 simu : A14.4.3aV01x51F - merging - into SP_001235_025337
5166556 simu : A14.4.3aV01x51F - merging - into SP_001235_025337
When you first build run directories, they are V01.
V01 of runs 5134034, 5134042, 5134060, 5134061, 5134137 and 5134152 were either "killed" or failed,
so V02 of those runs was submitted. V02 of 5134034 was killed again, but the others became "good".
Runs 5134419-5134422 succeeded on the first attempt. Usually this is the case.
Runs 5166551-5166556 are "merging" into one merge collection, SP_001235_025337.
When you run spmerge again, these "merging" runs are changed to "merged", tarred into a file, and removed from the allruns directory.
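To summarize a long spcheck listing, you can grep for the status words (a sketch based on the output format above; note that these count version lines, not runs):
> spcheck | grep -c ' good '
> spcheck | grep ' fail '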
2. spbuild
This script builds run directories in the allruns directory. If you do not specify a version, it builds V01 of the runs. If you need V02:
> spbuild --user babarpro -V02 5134060
This builds V02 of the run 5134060.
3. spsub
This script submits jobs (= runs). If you do not specify a version, it submits V01 of the runs. If you need to submit V02:
> spsub -y -t simu -V02 5134060
This submits V02 of the run 5134060.
For killed or failed runs, the "subfail.pl" script builds V02 runs and submits them automatically (see the sketch below).
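Putting items 2 and 3 together, recovering a failed run by hand looks like this (this is what subfail.pl does automatically):
> spbuild --user babarpro -V02 5134060
> spsub -y -t simu -V02 5134060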
4. qstat
This command checks the status of the queue. The following is an example of the output:
355280.mercury simu5161562 babarpro 02:33:09 R hep_
355281.mercury simu5161563 babarpro 02:14:00 R hep_
355282.mercury simu5161564 babarpro 02:03:36 R hep_
355283.mercury simu5161565 babarpro 01:50:51 R hep_
355284.mercury simu5161566 babarpro 01:36:48 R hep_
355285.mercury simu5161567 babarpro 01:29:29 R hep_
355286.mercury simu5161568 babarpro 01:21:00 R hep_
355287.mercury simu5161569 babarpro 01:21:31 R hep_
355288.mercury simu5161570 babarpro 00:55:05 R hep_
355289.mercury simu5161571 babarpro 0 R hep_
355290.mercury simu5161572 babarpro 0 Q hep_
355291.mercury simu5161573 babarpro 0 Q hep_
355292.mercury simu5161574 babarpro 0 Q hep_
355293.mercury simu5161575 babarpro 0 Q hep_
355294.mercury simu5161576 babarpro 0 Q hep_
355295.mercury simu5161577 babarpro 0 Q hep_
355296.mercury simu5161578 babarpro 0 Q hep_
The numbers in the first column are the job IDs, and the second column gives the run numbers.
R means the job is now running. Q means the job is waiting in the queue.
hep_ is the name of the queue.
To check the number of jobs running and waiting:
> qstat -q hep_
5. qdel
qdel deletes a job from the queue:
> qdel [job ID]
6. top
top shows how busy mercury is.
You can also check the processes you are running.
For example, while exporting you will see "bbftp" processes, as in the sketch below.
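A quick way to see them (a sketch, using the same ps options as in section E.1):
> ps -aux | grep bbftp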
E. Troubleshooting
1. When mercury has been down and recovered.
Most likely, the lock server is down.
You have to start the lock server before starting production jobs. To start it:
> oolockserver
To see if it is running:
> ps -aux | grep ools
The lock server controls the MC production jobs, so if it is not running, all production jobs will fail immediately.
Merge and export have nothing to do with the lock server.
2. When a specific node is causing problems, these commands are useful:
> psnodes -a
> psnodes -l
> qstat -n | grep [node number]
Node numbers are something like c03b02.
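For example, if node c03b02 is suspected, list the jobs on it and delete them (a sketch combining the commands above with qdel):
> qstat -n | grep c03b02
> qdel [job ID]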
3. When jobs fail and you want to know why.
The log files are usually a good place to look:
> less $BFROOT/prod/log/allruns/[run number]/A14.4.3aV01x48F/simu[run number].log.gz
For V02 runs
> less $BFROOT/prod/log/allruns/[run number]/A14.4.3aV02x48F/simu[run number].log.gz
4. When the status of a run is "done" and does not become "good"
This sometimes happens because of unknown problems on mercury.
Use this Perl script to fix the problem:
> parse_spcheck_done.pl [run number]
5. When the time stamp is wrong and spcheck stops.
This sometimes happens because of unknown problems on mercury.
You have to edit the status file of the run:
> emacs $BFROOT/prod/log/allruns/[run number]/status.txt
6. When you have to delete many jobs from the queue.
First make a list of the job numbers. For example, to delete all the jobs waiting in the queue:
> qstat | grep babarpro | grep 'Q ' | sed 's/.mercury//g' | awk '{print $1}' > list.txt
Then delete those jobs
> foreach n (`cat list.txt`)
> qdel $n
> end
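If you prefer to avoid the temporary file, xargs can do the same in one line (a sketch; standard xargs assumed):
> qstat | grep babarpro | grep 'Q ' | sed 's/.mercury//g' | awk '{print $1}' | xargs -n 1 qdel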
F. Useful websites, etc.
1. HyperNews "Simulation Production"
To subscribe to this HN, go to this site.
2. MC Production - Home Page
3. SP6 - KiloEvents per week
4. Simulation Production and Site Stats
5. Contact persons:
Ashok Agarwal: numod@uvvm.uvic.ca -- sys admin of BaBar-related systems
Kenji Hamano: khamano@uvic.ca
Drew Leske: dleske@uvic.ca -- sys admin of mercury
Dirk Hufnagel: hufnagel@mps.ohio-state.edu -- manager of SLAC simulation production