wiki:DC3PipeHowto

How to Run the Complete Set of DC3a Pipelines

The ctrl_dc3pipe package provides a single script, launchDC3a.py, for running the DC3a pipelines.

Prerequisites

You need to load one of the software stacks to run the DC3a pipelines on the LSST cluster at NCSA. The current recommended stack is found under /lsst/DC3/stacks/default, which is a link pointing to a specific version. The /lsst/DC3/stacks directory contains other stacks you may use, primarily under the gcc412 and gcc433 subdirectories, which correspond to the version of the gcc compiler used to build them. Note that we try to keep the stack versions for the two compilers up-to-date and in sync with each other. For example, if /lsst/DC3/stacks/default points to /lsst/DC3/stacks/gcc412/24apr, you should find that the stack /lsst/DC3/stacks/gcc433/24apr contains the same set of packages.

Before you load a stack, make sure that your environment variables EUPS_PATH and LSST_PKGS are not set. LSST_DEVEL may be set, but only if you intentionally want to use private installations of certain packages; however, for default operation, LSST_DEVEL should not be set either. Set your environment to use this stack in the usual way:

    source /lsst/DC3/stacks/default/loadLSST.sh   # or source loadLSST.csh

The scripts needed, along with the environment for all the dependent packages, are loaded when you set up the ctrl_dc3pipe package:

    setup ctrl_dc3pipe

This will set up the default versions that are known to work. If you need to run with different versions, you usually have to set them up explicitly after setting up ctrl_dc3pipe. You will also need to use the -e option to the launchDC3a.py script; see below for details. You can review what has been set up with:

    eups list --setup
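
For example, to override a single dependency with a specific version after setting up ctrl_dc3pipe (the package name and version here are purely illustrative):

    setup afw 3.1.2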

Before you can run launchDC3a.py for the first time, there are a few things you need to set up.

SSH

SSH must be set up to allow password-less logins using ssh-agent. The typical way is to run ssh-agent on your local machine, load your private key into it using ssh-add, and then forward the agent connection wherever it is needed. You may need to specify the "-A" option to ssh or the "ForwardAgent yes" configuration option (in your ~/.ssh/config file) to perform this forwarding. You can tell whether agent forwarding succeeded by looking for the SSH_AUTH_SOCK environment variable on the remote host. Agent forwarding is not required on the cluster machines unless you want or need to ssh twice within the cluster (e.g. from lsst7 to lsst10 and then from lsst10 to lsst9). Since agent forwarding is not enabled by default on the LSST cluster, to handle this possibility you may want to create a ~/.ssh/config file (readable and writable by the owner only) that contains this:

Host *.ncsa.uiuc.edu
     ForwardAgent yes
Host lsst*
     ForwardAgent yes
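
For reference, the agent workflow described above might look like this on the command line (the hostname is just an example, and your key file may differ):

    eval `ssh-agent`                   # start an agent on your local machine
    ssh-add ~/.ssh/id_rsa              # load your private key (path may differ)
    ssh -A lsst7.ncsa.uiuc.edu         # log in with agent forwarding enabled
    echo $SSH_AUTH_SOCK                # then, on lsst7: non-empty if forwarding worked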

You must have logged into every host in the cluster from at least one other host in the cluster at least once, in order to initialize your ~/.ssh/known_hosts file. It is best to have logged into both the lsst{5,6,7,8,9,10} short names and the lsst{5,6,7,8,9,10}.ncsa.uiuc.edu fully-qualified names, although typically only the latter are used in the platform configuration files (via a default domain name).

.mpd.conf

A configuration file for MPI, .mpd.conf, must be installed into your home directory on every machine that you use in a pipeline. This file must contain a "secret word" for MPI to operate.

The NCSA LSST cluster now has shared home directories. Thus, to set this up, simply type:

    cp $CTRL_DC3PIPE_DIR/etc/mpd.conf $HOME/.mpd.conf

The file must have permissions set to 400.
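
You can set this with:

    chmod 400 $HOME/.mpd.conf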

Remote Environment

MPI uses ssh to launch the processes used by the pipeline on all the pipeline worker nodes. These processes are Python scripts; in particular, they require Python 2.5. Unfortunately, the default version of Python installed by the operating system, which is usually what is available when one runs a remote command via ssh, is version 2.4. You must edit your shell setup file (e.g. .cshrc, .profile, or .bashrc) so that Python 2.5 is in your path. The most straightforward way to do this is to load the LSST software stack environment within that file; that is, add the following line:

   source /lsst/DC3/stacks/default/loadLSST.sh   # for BASH users

This will put Python 2.5 into your path.
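
To check that the remote, non-interactive environment picks up the right Python, you can run a quick test from another cluster host (the hostname here is just an example):

    ssh lsst9.ncsa.uiuc.edu 'python -V'    # should report Python 2.5.x, not 2.4.x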

DB authentication

When you run the DC3a pipelines, the catalog results will be stored in the MySQL database running on lsst10. To do this, you will need to have a MySQL account set up for you in the database. If this has not been done yet, contact DavidGehrig (backup: JacekBecla or RayPlante). One of them can create an account with a default password which you can then change. To change your password, connect to the MySQL server using "mysql -u {user} -h lsst10.ncsa.uiuc.edu -p" and then execute this command: "SET PASSWORD = PASSWORD('newPassword');"

The pipelines will need to know your username and password in order to connect to the database. This is done by creating a file in your home directory called .lsst/db-auth.paf. Because this file contains your password, you will need to set its file permissions to prevent other users from reading it. Here's a recipe for setting up this file:

mkdir $HOME/.lsst
cp $CTRL_DC3PIPE_DIR/etc/db-auth.paf $HOME/.lsst
chmod -R go-rwx $HOME/.lsst

Next, you should edit $HOME/.lsst/db-auth.paf, replacing the username and password values.
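
To confirm that the account and password you entered actually work, you can test the connection directly from the command line (this does not read db-auth.paf; it just exercises the same credentials):

   setup mysqlclient                     # if the mysql client is not already in your path
   mysql -u your_username -h lsst10.ncsa.uiuc.edu -p -e "SELECT 1"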

Launching with launchDC3a

To run the default configuration of the DC3a Pipelines, you can simply run the launchDC3a.py script:

    launchDC3a.py -C D3 $CTRL_DC3PIPE_DIR/pipeline/dc3pipe.paf myrunid 

This will launch the pipelines, pause until they appear ready, and then send data events (from the D3 collection) to feed data into the pipelines. When it runs out of data to send events for, the script will exit.

If the pipelines are launched successfully, you will find all files related to the run saved under /lsst/DC3root/myrunid. In particular, /lsst/DC3root/myrunid/pipeline_name/work contains a copy of all of the policy files used to configure that pipeline, along with logging output from the master node. Pipeline.log contains messages logged by the master Pipeline process, and each SliceN.log file captures the log messages from a Slice worker process. Another log file, pipeline_name-runid.log, also contains log messages from the Pipeline process (the same as Pipeline.log); however, interspersed with them are any messages written to standard out or standard error by any of the pipeline processes (Pipeline and Slices).

To watch the progress as it happens, you can monitor the contents of these files (with, say, tail -f). There are two other ways to watch the messages: the watchLogs.py tool (see "Other Useful Tools" below) and the logging database (see "Reviewing the Logs" below).
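
For example, to follow the master Pipeline log of the IPSD pipeline for a run named myrunid:

    tail -f /lsst/DC3root/myrunid/IPSD/work/Pipeline.log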

Often, you want to tweak how the pipelines run. There are different ways to do this.

Controlling the Input Data

The easiest way to control the input data is with the -C (or --collections) option. The argument is a comma-separated list of dataset collections to process. The supported names are:

Name   Description                    Notes
D1     CFHT Legacy Survey D1 field    Not yet supported for astrometry_net_data cfhttemplate
D2     CFHT Legacy Survey D2 field
D3     CFHT Legacy Survey D3 field
D4     CFHT Legacy Survey D4 field    Not yet supported for astrometry_net_data cfhttemplate
Sim    Simulated LSST Data            Not yet supported

You can also specify particular images to process by providing one or more visit list files (as supported by the $CTRL_DC3PIPE_DIR/bin/eventFromFitsfileList.py script) on the command line after the runid. A visit list file is simply a list of directories, one per line. Each directory is a "visit" directory containing the pair of images that constitute a visit. (In particular, the directory will contain 2 subdirectories called 0 and 1.) An example of such a file can be found in the $CTRL_DC3PIPE_DIR/etc directory.
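
For example, assuming you have prepared a visit list file named myvisits.txt (the name is arbitrary) in the current directory:

    launchDC3a.py $CTRL_DC3PIPE_DIR/pipeline/dc3pipe.paf myrunid myvisits.txt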

Note that if you provide both visit list files and collections with -C, the visits in the files will be processed first. If you specify neither, then no data events will be sent to the pipelines; they will simply sit idle.

You can control the total number of visits sent to the pipeline with the -m (--max-visits) parameter. For example, to process 10 visits from the D3 collection, type:

    launchDC3a.py -C D3 -m 10 $CTRL_DC3PIPE_DIR/pipeline/dc3pipe.paf myrunid 

Altering the Policy Parameters

Often, you will want to change the configuration before you launch. The typical approach is to copy the parts of the policy repository that you will use (which is actually most of it) to a local directory, edit the files, and then launch.

The default policy repository is $CTRL_DC3PIPE_DIR/pipeline. For each defined pipeline, there is a top-level policy file named after the pipeline and (optionally) a subdirectory having the same name (without an extension) containing the stage policy files for the pipeline. DC3a is made up of 3 pipelines: IPSD (image processing and source detection), ap (association), and nightmops. Thus, to configure all three:

   cp $CTRL_DC3PIPE_DIR/pipeline/{IPSD,ap,nightmops}.paf .
   cp -r $CTRL_DC3PIPE_DIR/pipeline/{IPSD,ap,nightmops,datatypePolicy} .
   cp $CTRL_DC3PIPE_DIR/pipeline/lsstcluster_*.paf $CTRL_DC3PIPE_DIR/pipeline/dc3MetadataPolicy.paf .
   cp $CTRL_DC3PIPE_DIR/pipeline/dc3pipe.paf .

or for quick typing:

   cp -r $CTRL_DC3PIPE_DIR/pipeline/* .

First edit dc3pipe.paf; this specifies the pipelines that make up your DC3a production run. The most common thing to do in this file is to turn off different pipelines. A pipeline can be turned off by setting the pipelines.pipeline_name.launch parameter to "false".
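
For example, to keep the nightmops pipeline from being launched, the relevant portion of dc3pipe.paf would look something like this sketch (only the launch parameter is shown; everything else stays as found in the repository copy):

pipelines: {

   nightmops: {
      # prevent this pipeline from being launched
      launch: false

      # ... this pipeline's other parameters are left as found in the repository copy
   }

   # ... the IPSD and ap entries are left unchanged
}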

Another thing to do is change the machines, as well as the number of processes on each machine, that a pipeline will run on. This is done separately for each pipeline, usually in a separate platform file (lsstcluster_*.paf). The machines are listed within this file as the deploy.nodes parameters. Each machine is specified by its hostname followed by a colon and the number of processes to run on that host, e.g.:

deploy:  {

   # the class that should be used to deploy, launch, and monitor 
   # a pipeline on this platform
   #
   managerClass:  lsst.ctrl.orca.pipelines.SimplePipelineManager

   # the default domain to assume for nodes listed below if a domain is 
   # not specified in the name
   #   
   defaultDomain:  ncsa.uiuc.edu

   # the node names and the number of cores available to them
   # 
   nodes:  "lsst5:8"
   nodes:  "lsst6:8"
   nodes:  "lsst7:8"
   nodes:  "lsst8:8"
}

Keep in mind the following:

  • the first host listed will run the master or "Pipeline" process plus (if the number is > 1) some number of worker or "Slice" processes. The other hosts will run Slice processes.
  • to run N slices you need N+1 processes. The sum of the numbers appearing after the hosts is the total number of processes available; the example above provides 4 x 8 = 32 processes, i.e. one Pipeline process plus up to 31 Slices.
  • each pipeline must run on a completely independent set of machines. A single machine cannot host processes from more than one pipeline.

You can then edit any of the other policy files as needed. To launch, use the -r (--policyRepository) option to tell launchDC3a.py to look in the current directory for the "sub-policy" files that are referenced in the top-level policy file (dc3pipe.paf):

    launchDC3a.py -r . -C D3 dc3pipe.paf myrunid 

In this example, the policy repository directory is the current directory; all files referenced in dc3pipe.paf will be looked for there.

Note: the directory given to the -r option does not affect where the top-level policy file (dc3pipe.paf) is found.

Tweaking the Software Stack

If simply running setup ctrl_dc3pipe as you did above does not set up the proper versions of all the packages you need (that is, you had to run setup explicitly for some packages afterward), then you will need one other file for local editing: either setup.sh or setup.csh:

   cp $CTRL_DC3PIPE_DIR/etc/setup.csh .   # or setup.sh for bash users

This script is used to set up the proper DC3a environment on the master nodes from which each pipeline is launched. Add the necessary setup calls to get the correct package versions (a sketch is shown below the launch command), then use the -e option to launchDC3a.py:

   launchDC3a.py -r . -e setup.csh -m 5 -C D2 dc3pipe.paf myrunid 
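
The setup calls you add to setup.csh might look like this (the package names, versions, and paths are purely illustrative):

   setup afw 3.1.2                        # illustrative: pin one dependency to a specific version
   setup -r /home/you/devel/mypackage     # illustrative: use a locally built package instead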

Editor's Note: Due to a bug in ctrl_orca, it is critical that the pipelines be launched with the same environment that will be used to execute them; otherwise, the recorded provenance will not be correct. Until this is corrected, if you use a setup script, it is a good idea to source it interactively before you run launchDC3a.py.

Stopping/Killing the Pipelines

When processing is done, or if the pipeline fails with errors, there will likely be some processes still running that need to be stopped. To stop the pipelines, run killPipeline.py:

   killPipeline.py -p dc3pipe.paf -i myrunid

This will extract the head nodes from dc3pipe.paf and run killpipe.sh runid on each one.

If you launched the pipeline with your own policy repository (e.g. in the current directory), type:

   killPipeline.py -r . -p dc3pipe.paf -i myrunid

Other Useful Tools

There are some other tools you may find useful for running or debugging production runs:

  • watchLogs.py: "Tailing" a log file in the work directory while the pipeline is running works pretty well for getting a sense of where the pipeline is at any given moment; however, buffering issues prevent that file from being updated in anything close to real time. If you need a more real-time view, this tool is helpful: it prints selected log messages sent through the event broker.
  • testEventLogger.py: This tool, when used in concert with watchLogs.py, lets you test if the event broker is working by sending test log messages.
  • showEvents.py: This tool allows you to spy on any kind of event sent through the event broker.

All of these tools support the --help and -h options which print explanations of how to call these programs.

Reviewing the Results

When a production run is launched, a directory named after the Run ID is created under /lsst/DC3root (e.g. /lsst/DC3root/rlp1176). Under that run directory, you will find a set of directories representing each of the component pipelines (IPSD, ap, nightmops). Within each pipeline directory is a set of subdirectories used by the pipeline for filesystem I/O; they are (with what they contain):

  • input: read-only input data
  • output: output data
  • scr: scratch data
  • update: read-write data
  • work: the working directory from which the pipeline was launched

Also during the launch process, a database called user_DC3a_u_runid is created in the database server on lsst10, where user is the user that launched the production (as set in that user's $HOME/.lsst/db-auth.paf file) and runid is the Run ID. This database contains the output catalog data.

Reviewing the Logs

Copies of the log files produced from the run can be found in a pipeline's work directory (/lsst/DC3root/myrunid/pipeline_name/work). Pipeline.log contains the messages from the master Pipeline process, while the files of the form SliceN.log contain the messages from each Slice process. Another log file, pipeline_name-runid.log, also contains (briefer versions of) the log messages from the Pipeline process (the same as Pipeline.log); however, interspersed with them are any messages written to standard out or standard error by any of the pipeline processes (Pipeline and Slices) that were not captured by the standard logging system.

Messages that are recorded via the standard logging system get sent to the logging database. (This is done via the event system and a separate database-loading process that runs all the time.) The logging database is located in the database server on lsst10 with the name "logs". Sometimes, due to buffering issues, not all of the final messages get flushed to the files in the work directory before a pipeline is killed; the more definitive place to look for the messages is then the "logger" table of the "logs" database.
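
For example, from any cluster machine you can pull a few recent entries from that table (this is only a sketch; the column layout of the logger table is not described here, so start with a broad query):

   setup mysqlclient
   mysql -h lsst10 -u your_username -p logs -e "SELECT * FROM logger LIMIT 10"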

Browsing the Images

DS9 is available on the cluster for viewing images. To start:

   source /lsst/DC3/stacks/default/loadLSST.sh    # or .csh for tcsh, to load the stack
   setup ds9
   ds9

You can find the output images in /lsst/DC3root/myrunid/pipeline_name/output.

Browsing the Catalog

With knowledge of SQL, you can peruse the catalog by connecting to the database on lsst10. For example, for a run done by user rplante with a run ID of "rlp1176", you can (from any lsst cluster machine):

   source /lsst/DC3/stacks/default/loadLSST.sh    # or .csh for tcsh, to load the stack
   setup mysqlclient
   mysql -h lsst10 -u your_username -p rplante_DC3a_u_rlp1176

You need a database login account to connect; contact JacekBecla or RayPlante if you do not have one.
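
Once connected, listing the tables in the run database is a good first step before composing queries; at the mysql prompt:

   mysql> SHOW TABLES;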