Ticket #832 (closed defect: fixed)

Opened 10 years ago

Last modified 8 years ago

Pipelines terminate prematurely

Reported by: krughoff
Owned by: srp
Priority: normal
Milestone:
Component: ctrl_orca
Keywords:
Cc: daues
Blocked By:
Blocking:
Project: LSST
Version Number:
How to repeat:

The following should reproduce this behavior.

$> source appropriate LSST setup script
$> setup ctrl_dc3pipe
$> cd ~krughoff/policy-dir/Sim-nfs/
$> launchDC3a.py -L verb3 -r . -e setup.csh dc3pipe.paf [runid]
$> eventFromFitsfileList.py -b lsst4 -m 10 simdeep.list datatypePolicy/simDataTypePolicy.paf 1 1

Description

When running the SimDeep data through the pipelines, I noticed that they terminate early if the value given to the maximum visits flag (-m) is greater than 5. In every run I have tried, two visits are processed and then this message appears in the IPSD-runid.log file:

terminate called after throwing an instance of 'activemq::exceptions::ActiveMQException'

So for example:

$> setup ctrl_dc3pipe
$> cd $HOME/policy-dir/Sim-nfs/
$> launchDC3a.py -L verb3 -r . -e setup.csh dc3pipe.paf sk0527131
orca.pipelineMgr: launching IPSD on lsst5.ncsa.uiuc.edu
launchDC3: Waiting for pipelines to setup (this can take a while)...
$> eventFromFitsfileList.py -b lsst4 -m 10 simdeep.list datatypePolicy/simDataTypePolicy.paf 1 1

This results in the first two visits being processed and then nothing else happens except for the above message in the log.

If -m is set to 5, all 5 visits run, but if another batch of visits is sent with the eventFromFitsfileList.py program, only two get processed and the termination message is issued.

eups list:
activemq              5.2.0             Current
activemqcpp           2.2.6+1           Current Setup
activemqcpp           2.2.6+2   
afw                   3.3.15    
afw                   3.3.16            Current Setup
afwdata               svn7459   
afwdata               svn9256           Current Setup
ap                    3.1.1             Current Setup
apr                   1.3.3             Current Setup
astrometry_net        0.25              Current Setup
astrometry_net_data   usnob     
astrometry_net_data   cfhttemplate      Current Setup
auton                 1.0               Current Setup
base                  3.1               Current Setup
boost                 1.37.0            Current Setup
cat                   3.2       
cat                   3.3               Current Setup
cfitsio               3006.2            Current Setup
ctrl_dc3pipe          3.1.1     
ctrl_dc3pipe          3.3.1     
ctrl_dc3pipe          3.3.2             Current Setup
ctrl_events           3.6               Current Setup
ctrl_orca             3.5               Current Setup
daf_base              3.2.8     
daf_base              3.2.9             Current Setup
daf_data              3.2.3             Current Setup
daf_persistence       3.3.7             Current Setup
doxygen               1.5.7.1           Current Setup
doxygen               1.5.9     
ds9                   5.5               Current
eigen                 2.0.0             Current Setup
eups                  LOCAL:/lsst/DC3/stacks/gcc433/24apr/eups/1.1.1    Setup
expat                 2.0.1             Current
freetype              2.3.8             Current Setup
gcc                   4.3.3             Current Setup
gsl                   1.8               Current Setup
ip_diffim             3.3.8     
ip_diffim             3.3.9             Current Setup
ip_isr                3.3.8     
ip_isr                3.3.9             Current Setup
isrdata               svn8518           Current Setup
java                  1.6.0+12          Current
jdk                   1.6.0+12          Current
jython                2.2.1             Current
libpng                1.2.35            Current Setup
lsst                  1.0       
lsst                  1.0.1             Current Setup
lssteups              1.0               Current Setup
matplotlib            0.98.5.2          Current Setup
meas_algorithms       3.0.7     
meas_algorithms       3.0.8     
meas_algorithms       3.0.9             Current Setup
meas_astrom           3.0.5     
meas_astrom           3.0.6     
meas_astrom           3.0.7     
meas_astrom           3.0.8             Current Setup
meas_pipeline         3.0.5     
meas_pipeline         3.0.6             Current Setup
minuit                1.7.9             Current Setup
mops                  3.2.5             Current Setup
mpich2                1.0.5p4           Current Setup
mysqlclient           5.0.45+1          Current Setup
mysqlpython           1.2.2             Current Setup
numpy                 1.2.1             Current Setup
openssl               0.9.8j            Current
pex_exceptions        3.2.2             Current Setup
pex_harness           3.3.1     
pex_harness           3.3.2             Current Setup
pex_logging           3.3.3     
pex_logging           3.3.4             Current Setup
pex_policy            3.3.5             Current Setup
python                2.5.2             Current Setup
scons                 3.3               Current Setup
sconsDistrib          0.98.5            Current Setup
sconsUtils            3.3               Current Setup
sdqa                  3.0.3             Current Setup
security              3.2.2             Current Setup
ssd                   4                 Current Setup
subversion            1.5.5             Current Setup
swig                  1.3.36+2          Current Setup
tcltk                 8.5a4             Current Setup
utils                 3.4.3             Current Setup
wcslib                4.2+3             Current Setup
xpa                   2.1.7b2           Current Setup

Change History

comment:1 Changed 10 years ago by RayPlante

  • Owner changed from RayPlante to srp
  • Status changed from new to assigned

Steve, can you look into the activemq exception? I'm also looking for a reason why the pipelines running on the CFHT data appear to stop running after 14-16 visits. I have not ruled out that it could be something in the payload that is causing a failure. Thanks!

comment:2 Changed 10 years ago by srp

I re-ran this myself to look at the logging messages and the messages being sent. I didn't see anything that looked out of the ordinary.

There are any number of reasons that the exception could have occurred. I looked at the activemq broker's logs and in /proc and didn't see anything out of the ordinary there either.

I tried looking at the "triggerImageprocEvent0" topic, and that looked OK too. I'm not sure that was the topic that was being waited on, though. I doubt it was.

My suspicion is that there was an attempt to read from an EventReceiver that was closed, or to write to an EventTransmitter that was closed.

comment:3 Changed 10 years ago by ktl

  • Cc daues added

The main Pipeline is the one that appears to be dying while waiting on the last event. In between "Starting wait for event..." and "Ending wait for event..." is just one line of Python: inputParamPropertySetPtr = eventReceiver.receive(self.eventTimeout). The Slices are still alive.

These eventReceivers are never closed explicitly by the Pipeline; something else must be causing this.

Greg: It might help other debugging (though probably not this failure) if the Pipeline and Slices would log the contents of the event PropertySet.
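As a rough illustration of the logging suggested above, the wait could be wrapped so that the log records both the received event and any Python-level failure. This is only a sketch: FakeReceiver is a hypothetical stand-in (not the ctrl_events class), receive_logged is an invented helper name, and the only detail taken from this ticket is that the receiver has a receive(timeout) method.

```python
import logging

log = logging.getLogger("pipeline.events")


class FakeReceiver:
    """Hypothetical stand-in for an event receiver; not the ctrl_events class."""

    def __init__(self, events):
        self._events = list(events)

    def receive(self, timeout_ms):
        # Return the next queued event, or None to mimic a timeout.
        return self._events.pop(0) if self._events else None


def receive_logged(receiver, timeout_ms):
    """Wait for one event and log what arrived, or that nothing did."""
    log.info("Starting wait for event...")
    try:
        event = receiver.receive(timeout_ms)
    except Exception:
        # Record the failure with a traceback before re-raising, so the log
        # shows where the wait died instead of simply stopping.
        log.exception("receive() failed while waiting for event")
        raise
    log.info("Ending wait for event...")
    if event is None:
        log.warning("no event within %d ms", timeout_ms)
    else:
        log.info("event contents: %r", event)
    return event
```

Note that a wrapper like this only helps for errors that surface as Python exceptions. A C++ exception that escapes the binding and ends in "terminate called" would still kill the process before the except clause runs.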

comment:4 Changed 10 years ago by srp

This directory:

~krughoff/policy-dir/Sim-nfs/

no longer exists.

I'm going to try it with the directories that are there, in hopes I can recreate the problem with one of those.

comment:5 Changed 10 years ago by srp

I'm getting the error:

[krughoff@lsst5 SimWide-nfs]$ launchDC3a.py -L verb3 -r . -e setup_srp.csh dc3pipe.paf srp0602001
Traceback (most recent call last):
  File "/lsst/DC3/stacks/gcc433/24apr/Linux64/ctrl_dc3pipe/3.3.2/bin/launchDC3a.py", line 6, in <module>
    import lsst.pex.harness.run as run
  File "/lsst/DC3/stacks/gcc433/24apr/Linux64/base/3.1/python/lsstimport.py", line 53, in load_module
    return imp.load_module(fullname, fd, filename, desc)
  File "/lsst/DC3/stacks/gcc433/24apr/Linux64/pex_harness/3.3.2/python/lsst/pex/harness/__init__.py", line 1, in <module>
    from harnessLib import Pipeline, Slice
  File "/lsst/DC3/stacks/gcc433/24apr/Linux64/pex_harness/3.3.2/python/lsst/pex/harness/harnessLib.py", line 13, in <module>
    import _harnessLib
ImportError: libboost_system-gcc41-mt-1_37.so.1.37.0: cannot open shared object file: No such file or directory

comment:6 Changed 10 years ago by srp

Never mind that last one...user error.

comment:7 follow-up: ↓ 8 Changed 10 years ago by srp

I put a try/catch around the suspected area of code, re-ran and got:

decaf::net::SocketOutputStream::write - Broken pipe

FILE: decaf/net/SocketOutputStream.cpp, LINE: 100
FILE: decaf/io/BufferedOutputStream.cpp, LINE: 118
FILE: ./decaf/io/FilterOutputStream.h, LINE: 191
FILE: activemq/connector/openwire/OpenWireCommandWriter.cpp, LINE: 78
FILE: activemq/transport/IOTransport.cpp, LINE: 94
FILE: activemq/transport/filters/ResponseCorrelator.cpp, LINE: 62
FILE: activemq/connector/openwire/OpenWireFormatNegotiator.cpp, LINE: 78
FILE: activemq/connector/openwire/OpenWireConnector.cpp, LINE: 1497
FILE: activemq/connector/openwire/OpenWireConnector.cpp, LINE: 921
FILE: activemq/core/ActiveMQConsumer.cpp, LINE: 460
FILE: activemq/core/ActiveMQConsumer.cpp, LINE: 414
FILE: activemq/core/ActiveMQConsumer.cpp, LINE: 288

A broken pipe message like this happens if one side drops a connection, and the other side writes to it.
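The general mechanism can be reproduced outside the pipeline with a few lines of standard-library Python. This is a minimal sketch of the generic socket behavior only, not of the ActiveMQ client; write_after_peer_close is an illustrative name, and a local socket pair stands in for the broker connection.

```python
import socket


def write_after_peer_close():
    """Return the exception raised when writing to a socket whose peer closed."""
    ours, theirs = socket.socketpair()
    try:
        theirs.close()  # the other side drops the connection
        try:
            # Python ignores SIGPIPE by default, so the failed write
            # surfaces as an exception rather than killing the process.
            ours.sendall(b"message")
        except (BrokenPipeError, ConnectionResetError) as exc:
            return exc
        return None
    finally:
        ours.close()


if __name__ == "__main__":
    print(type(write_after_peer_close()).__name__)
```

On Linux the write typically fails with BrokenPipeError (EPIPE). Over a real TCP connection the first write after the peer closes may appear to succeed, with the error only reported on a later write.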

comment:8 in reply to: ↑ 7 Changed 10 years ago by krughoff

~krughoff/policy-dir/Sim-nfs/ became ~krughoff/policy-dir/SimDeep-nfs/

I managed to get all 90 visits of SimDeep to run by dividing the list into 90 separate one-visit files. I used eventFromFitsfileList.py to launch a single visit (with -m 1), then issued sleep 300 before sending the next event. Evidently, if one waits long enough between events that the pipeline never gets more than 5 visits behind, it all runs OK.
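The workaround amounts to a throttled sender, which might be sketched as follows. The names send_throttled and one_visit_batches are invented for illustration, and the send callable is a placeholder supplied by the caller: in the workflow described here it would write the single-visit list file and run eventFromFitsfileList.py with -m 1.

```python
import time


def one_visit_batches(visits):
    """Yield one-element batches so each event carries a single visit."""
    for visit in visits:
        yield [visit]


def send_throttled(visits, send, pause_s=300, sleep=time.sleep):
    """Send one visit at a time, pausing between sends.

    `send` is a caller-supplied callable (hypothetical here) that
    actually dispatches the batch.
    """
    sent = 0
    for batch in one_visit_batches(visits):
        if sent:
            sleep(pause_s)  # give the pipeline time to catch up
        send(batch)
        sent += 1
    return sent
```

The 300 second default mirrors the sleep 300 used above; a suitable pause would depend on how long a visit takes to process. Passing sleep as a parameter keeps the loop easy to exercise without real waiting.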

comment:9 Changed 10 years ago by srp

  • Component changed from unknown to ctrl_orca

comment:10 Changed 8 years ago by srp

  • Status changed from assigned to closed
  • Resolution set to fixed

These tools are not used anymore, and the event system has been revamped. I don't believe this is an error any longer.
