
Running Orca Jobs on Abe

The following page assumes you are familiar with the basic production policy layout as described on page Running Orca for DC3b. Please check that documentation for further information about the specifics of production policy files.

Orca uses the Condor software system to request nodes from Abe and to run pipelines on those nodes.

Everything described on this page must be run from lsst6.

Setting up Orca

$ source /lsst/DC3/stacks/default/loadLSST.sh
$ setup ctrl_orca

Running "setup orca" will also setup datarel and globus, which you will use below.

Running Orca

This must be run from lsst6, and you must have your X.509 security credentials installed and up to date. If you have not already done this, follow the instructions on creating an X.509 certificate to set them up.

First, run grid-proxy-init. The proxy is valid for twelve hours by default, so this needs to be repeated at least that often. (Note: a lifetime other than the default twelve hours can be specified with the -valid <h:m> option.)

$ grid-proxy-init
Your identity: /C=US/O=National Center for Supercomputing Applications/OU=People/CN=Joe User
Enter GRID pass phrase for this identity: <enter pass phrase here>
Creating proxy .......................................... Done
Your proxy is valid until: Tue Jun 22 20:05:05 2010
$
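
For example, to request a proxy that lasts a full day rather than the default twelve hours (the -valid value is given as hours:minutes), and then to check how much lifetime the current proxy has left using grid-proxy-info from the same Globus tools:

$ grid-proxy-init -valid 24:00
$ grid-proxy-info -timeleft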

You must also have your .lsst directory created with its permissions set to 700 (chmod 700), and have your db-auth.paf file in that directory with its permissions set to 600 (chmod 600).
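
A minimal sketch, assuming the .lsst directory lives in your home directory and db-auth.paf is currently sitting there as well:

$ mkdir -p ~/.lsst
$ chmod 700 ~/.lsst
$ mv ~/db-auth.paf ~/.lsst/
$ chmod 600 ~/.lsst/db-auth.paf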

Next, make a copy of the $DATAREL_DIR/pipeline directory in your own workspace, and edit the files as needed. Details of how the production files are structured are listed here, along with Condor-specific details for running on Abe.
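
For example (the ~/abe-production workspace path is just an illustration):

$ mkdir -p ~/abe-production
$ cp -r $DATAREL_DIR/pipeline ~/abe-production/
$ cd ~/abe-production/pipeline

Then edit production.paf and the other policy files in that copy as needed.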

Next, run Orca. The command-line options are the same as when you run on the LSST cluster.

$ orca.py -r $PWD -e ~/srp_stack.sh -V 10 -P 10 production.paf myrunid

At this point, you'll get slightly different output than you would from an Orca run on the LSST cluster. The VanillaCondorWorkflowConfigurator plugin and its associated classes print status messages as Condor jobs are placed, so you can keep track of where things are in the job-submission process.

You can use the Condor utilities to view the progress of jobs outside of Orca while a run is happening. The two main utilities for this are condor_status and condor_q.

The condor_status utility will show you the machines that the condor_glidein request obtained. Initially it will show all of the machines as idle. When Orca submits pipelines to be run, those machines change to a "claimed" status.

The condor_q utility will show the status of the condor_submit requests that orca makes.
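
If you prefer to watch this continuously rather than re-running the commands by hand, you can wrap either utility in watch; the thirty-second interval here is arbitrary:

$ watch -n 30 condor_q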

Shutting down

To stop the pipelines, use the shutprod.py command (in ctrl_orca) with a severity level and the runid to shut down.

$ shutprod.py 1 <runid>

This sends an event to orca, which coordinates sending messages to each workflow to shut things down.

You'll see messages similar to the following:

orca.manager DEBUG: DONE!
orca.manager: Shutting down production (urgency=1)
          orca.manager.config.workflow DEBUG: WorkflowManager:stopWorkflow
          orca.manager.config.workflow.monitor DEBUG: GenericPipelineWorkflowMonitor:stopWorkflow
          orca.manager.config.workflow.monitor DEBUG: GenericPipelineWorkflowMonitor:handleEventCalled
          orca.manager DEBUG: Everything shutdown - All finished

This should kill all the pipelines, as it does on the LSST cluster. However, it does not shut down the condor_glidein request. If you still have the nodes available, you can re-run the orca.py command with the "-g" option to start a new run without having to wait for a new glidein:

$ orca.py -g -r $PWD -e ~/srp_stack.sh -V 10 -P 10 production.paf mynewrunid

In order to relinquish the glidein request, you can do one of two things: use the killcondor.py command, or use the condor_rm command on that particular Condor job number.

The killcondor.py command to kill the condor_glidein request takes the following arguments:

$ killcondor.py -g production.paf runid

The "-g" option refers to the glidein. Use the production.paf and the runid you did for the orca.py command you issued earlier. The killcondor.py command will look in the workspace it created to run the production and will look for ".job" files it created. In this case it looks for the glidein.job file. It uses this to issue a condor_rm command for that job number.

To use condor_rm directly, you need to discover the number of the glidein job. Issue a condor_q command:

$ condor_q


-- Submitter: lsst6.ncsa.uiuc.edu : <141.142.15.103:40389> : lsst6.ncsa.uiuc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
 455.0   user             6/22 13:25   0+00:02:04 R  0   0.0  glidein_startup -d

1 jobs; 0 idle, 1 running, 0 held

The ID for this condor job is 455.0. Run the condor_rm command:

$ condor_rm 455.0
Job 455.0 marked for removal
$ 

And this will clear out the job.

$ condor_q


-- Submitter: lsst6.ncsa.uiuc.edu : <141.142.15.103:40389> : lsst6.ncsa.uiuc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held
$

In the event you cannot issue a shutprod.py command to Orca to shut down the pipelines, you can use the killcondor.py command to do this.

$ killcondor.py production.paf testrunid
killJob:  /home/user/orca_scratch/testrunid/associationWorkflow/joboffices_1/work/joboffices_1/joboffices_1.job
Job 464.0 marked for removal
killJob:  /home/user/orca_scratch/testrunid/associationWorkflow/isr_1/work/isr_1/isr_1.job
Job 465.0 marked for removal
killJob:  /home/user/orca_scratch/testrunid/associationWorkflow/isr_2/work/isr_2/isr_2.job
Job 466.0 marked for removal
killJob:  /home/user/orca_scratch/testrunid/associationWorkflow/isr_3/work/isr_3/isr_3.job
Job 467.0 marked for removal
killJob:  /home/user/orca_scratch/testrunid/associationWorkflow/isr_4/work/isr_4/isr_4.job
Job 468.0 marked for removal
killJob:  /home/user/orca_scratch/testrunid/associationWorkflow/ccdassembly_1/work/ccdassembly_1/ccdassembly_1.job
Job 469.0 marked for removal
killJob:  /home/user/orca_scratch/testrunid/associationWorkflow/ccdassembly_2/work/ccdassembly_2/ccdassembly_2.job
Job 470.0 marked for removal
$

This looks at all the Condor jobs submitted for production.paf under that runid and removes all of them. It does not kill the condor_glidein request, and leaves those nodes allocated; use the "-g" option, outlined above, to do that.

$ killcondor.py -g production.paf testrunid
killJob:  /home/user/orca_scratch/testrunid/associationWorkflow/glidein.job
Job 463.0 marked for removal
$

You should ALWAYS kill the pipelines before killing the glidein job. If you don't kill the jobs in this order, Condor can get confused about which machines are actually allocated as part of the glidein, making it appear that phantom machines are allocated. If you run condor_q and see that nothing is running, but condor_status shows machines that still appear to be allocated, you can clear this condition by turning Condor off:

$ condor off
Sent "Kill-All-Daemons" command to local master
$

Wait a moment or two, and then turn Condor back on:

$ condor on
Sent "Spawn-All-Daemons" command to local master

If you have a production file with multiple workflows, you can specify the workflow name on the command line:

$ killcondor.py -w associationWorkflow production.paf testrunid

If you want to kill a particular set of pipelines within a workflow, you also specify the pipeline name:

$ killcondor.py -w associationWorkflow -p isr production.paf testrunid

If you want to kill one particular instance of a pipeline, you also specify the number of that pipeline:

$ killcondor.py -w associationWorkflow -p isr -n 2 production.paf testrunid

Errors

If an error occurs in starting things on Condor and a Condor job goes away unexpectedly, Condor will send files back to your local scratch directory. (The "local scratch" is specified in the production file.) For example, if a setup script doesn't set up ctrl_provenance, the script that records provenance for the job office will fail and disappear from condor_q. If you look in the local scratch directory under your runid, you can search for the files Condor.err, Condor.out, and Condor.log, which can contain additional information about what happened.

For example, if local scratch is set to /home/user/orca_scratch/ and the runid is testrunid, you can run the following from within that area:

$ find . -print | grep Condor.err
./joboffices_1/work/joboffices_1/Condor.err
./isr_1/work/isr_1/Condor.err
./isr_2/work/isr_2/Condor.err
./isr_3/work/isr_3/Condor.err
./isr_4/work/isr_4/Condor.err
$

These files are created when Condor launches the pipelines and are empty at launch time. When the scripts that launch the pipelines complete, Condor transfers data back into these files. In this particular case, we're interested in joboffices_1/work/joboffices_1/Condor.err.
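
A quick way to see which pipelines actually reported something is to list only the Condor.err files that are non-empty, for example:

$ find . -name Condor.err ! -empty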

$ cat joboffices_1/work/joboffices_1/Condor.err
Not writing cache as your version of python's cPickle is too old
/cfs/projects/lsst/DC3/data/datarel-runs/testrunid/work/joboffices_1/launch_joboffices_1.sh: line 7: PipelineProvenanceRecorder.py: command not found

This indicates that the shell script launch_joboffices_1.sh couldn't find PipelineProvenanceRecorder.py. That script is part of ctrl_provenance, so it's likely that ctrl_provenance hasn't been set up in the setup script.
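
The likely fix here, assuming ctrl_provenance is installed in the stack, is to add a setup line for it to the stack setup script you passed to orca.py with -e (in the earlier example, ~/srp_stack.sh) and then resubmit:

$ echo 'setup ctrl_provenance' >> ~/srp_stack.sh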