Ticket #1757 (closed defect: wontfix)

Opened 8 years ago

Last modified 8 years ago

The tagged weekly run failed with 2019 tracebacks.

Reported by: robyn
Owned by: robyn
Priority: critical
Milestone:
Component: meas_astrom
Keywords:
Cc: daues, srp, krughoff, rhl, cloomis, price, dstn
Blocked By:
Blocking:
Project: LSST
Version Number:
How to repeat:

not applicable

Description

The tagged weekly run failed due to a variety of error conditions. The folks with primary responsibility are being noted as 'CC:' bystanders on the Ticket.

There are memory errors, bad_alloc errors, RunTime Exceptions, LSST CPP Exceptions, and fitsio problems. For the full list of errors and pointers to the error summaries, refer to the entry for 2 Sep 2011 at: http://dev.lsstcorp.org/trac/wiki/DM/buildbot/Weekly_Production

The most numerous error (bad_alloc, 1844 errors) is in _mathLib.makeStatistics(*args), which is called by various SDQA metrics processing methods.

There are a goodly number of meas_astrom errors, too.

The Ticket is being provided to RHL so that he can distribute to the correct Apps developer. Steve P is added in case errors 7&8 are middleware based.

Change History

comment:1 Changed 8 years ago by DefaultCC Plugin

  • Cc cloomis, price added

comment:2 Changed 8 years ago by dstn

I would expect (7) and (8) are due to some low-level I/O error. I will add code to get those low-level error messages to bubble up.

comment:3 Changed 8 years ago by dstn

I suspect that 4(a) is due to I/O also, since there doesn't seem to be anything wrong with the logical flow, and nothing has changed recently:

harness.slice.WcsDeterminationStageParallel DEBUG: Exposure was taken with filter "r"
harness.slice.WcsDeterminationStageParallel DEBUG: Have catalogue magnitudes for filter: "r"
harness.slice.WcsDeterminationStageParallel WARNING: 0: lsst::pex::exceptions::RuntimeErrorException thrown at src/net/GlobalAstrometrySolution.cc:1264 in std::vector<double, std::allocator<double> > lsst::meas::astrom::net::getTagAlongFromIndex(index_t*, std::string, int*, int)
0: Message: No meta data called r found in index /lsst/DC3/data/astrometry_net_data/imsim-2011-08-01-0/index-110801004.fits
harness.slice.WcsDeterminationStageParallel WARNING: Attempting to access catalogue positions and fluxes
harness.slice.WcsDeterminationStageParallel WARNING: Catalogue version: /lsst/DC3/data/astrometry_net_data/imsim-2011-08-01-0
harness.slice.WcsDeterminationStageParallel WARNING: ID column: id
harness.slice.WcsDeterminationStageParallel WARNING: Requested filter: r
harness.slice.WcsDeterminationStageParallel WARNING: Available filters: ('y_err', 'z_err', 'i_err', 'r_err', 'g_err', 'u_err', 'parallax', 'mudec', 'mura', 'y', 'z', 'i', 'r', 'g', 'u', 'variable', 'starnotgal', 'id')

Notice that it's looking for the "r" magnitude column, and that column is indeed listed among the available columns. There is nothing wrong with that index file that I can see.
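The observation above can be checked mechanically. The following is a minimal sketch, not part of meas_astrom: it parses the "Available filters" log line quoted in this comment and tests whether the requested filter is listed. The helper names `available_filters` and `filter_is_listed` are hypothetical.

```python
import ast

# Log line copied from the WcsDeterminationStageParallel warning in this ticket.
LOG_LINE = (
    "harness.slice.WcsDeterminationStageParallel WARNING: Available filters: "
    "('y_err', 'z_err', 'i_err', 'r_err', 'g_err', 'u_err', 'parallax', "
    "'mudec', 'mura', 'y', 'z', 'i', 'r', 'g', 'u', 'variable', "
    "'starnotgal', 'id')"
)


def available_filters(line):
    """Return the tuple of column names printed after 'Available filters: '."""
    _, _, tail = line.partition("Available filters: ")
    # The tail is a Python tuple literal, so literal_eval can parse it safely.
    return ast.literal_eval(tail)


def filter_is_listed(line, requested):
    """True if the requested filter name is among the listed columns."""
    return requested in available_filters(line)


print(filter_is_listed(LOG_LINE, "r"))
```

If this prints True while the C++ layer still reports "No meta data called r", the failure is more likely below the logical flow (for example in file I/O) than in the column lookup itself.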

I would verify that it runs correctly in pipette or datarel SST, if only I could remember how one figures out which CCD was being run...

comment:4 follow-up: ↓ 5 Changed 8 years ago by robyn

  • Cc dstn added

Dustin: If you want to run it through pipette, try:

886258751 3,0 1,2 ------ 80 reads, 2 writes, 0 calexp persisted

which is record 305 in Block: "Processing log file ./work/PT1PipeA_2/Slice0.log" from ProcessedRecords.out. That record matches the LOOPNUM preceding this error in ./work/PT1PipeA_2/Slice0.log:

harness.slice.WcsDeterminationStageParallel WARNING: Requested filter: r
  RUNID: wp_tags_2011_0902_222518
  JobId: unknown
  DATE: 2011-09-04T00:25:57.573323
  workerid: -1
  TIMESTAMP: 1315095991573323000
  LEVEL: 10
  SLICEID: 0
  PIPELINE: main-ImSim

harness.slice.WcsDeterminationStageParallel WARNING: Available filters: ('y_err', 'z_err', 'i_err', 'r_err', 'g_err', 'u_err', 'parallax', 'mudec', 'mura', 'y', 'z', 'i', 'r', 'g', 'u', 'variable', 'starnotgal', 'id')
  RUNID: wp_tags_2011_0902_222518
  JobId: unknown
  DATE: 2011-09-04T00:25:57.576496
  workerid: -1
  TIMESTAMP: 1315095991576496000
  LEVEL: 10
  SLICEID: 0
  PIPELINE: main-ImSim

harness.slice.visit.stage.tryProcess FATAL: Traceback (most recent call last):
  File "/lsst/DC3/stacks/gcc443/15oct2010/Linux64/pex_harness/4.4.0.1/python/lsst/pex/harness/Slice.py", line 575, in tryProcess
    stageObject.applyProcess()
  File "/lsst/DC3/stacks/gcc443/15oct2010/Linux64/pex_harness/4.4.0.1/python/lsst/pex/harness/stage.py", line 353, in applyProcess
    self.process(clipboard)
  File "/lsst/DC3/stacks/gcc443/15oct2010/Linux64/meas_pipeline/4.4.0.1/python/lsst/meas/pipeline/wcsDeterminationStage.py", line 113, in process
    srcSet, solver=self.solver, log=self.log)
  File "/lsst/DC3/stacks/gcc443/15oct2010/Linux64/meas_astrom/4.4.0.1/python/lsst/meas/astrom/determineWcs.py", line 277, in determineWcs
    X = solver.getCatalogueForSolvedField(filterName, idName, margin)
  File "/lsst/DC3/stacks/gcc443/15oct2010/Linux64/meas_astrom/4.4.0.1/python/lsst/meas/astrom/net/netLib.py", line 747, in getCatalogueForSolvedField
    return _netLib.GlobalAstrometrySolution_getCatalogueForSolvedField(*args)
LsstCppException: 0: lsst::pex::exceptions::RuntimeErrorException thrown at src/net/GlobalAstrometrySolution.cc:1264 in std::vector<double, std::allocator<double> > lsst::meas::astrom::net::getTagAlongFromIndex(index_t*, std::string, int*, int)
0: Message: No meta data called r found in index /lsst/DC3/data/astrometry_net_data/imsim-2011-08-01-0/index-110801004.fits

comment:5 in reply to: ↑ 4 ; follow-up: ↓ 7 Changed 8 years ago by robyn

Replying to robyn:

I should add that I scanned upward from the error statement to find LOOPNUM in the Slice log.
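A minimal sketch of that upward scan follows. It assumes, since this ticket does not show the actual format, that the Slice log carries the loop number on a line of the form `LOOPNUM: <n>`; the regular expression, the function name `loopnum_above`, and the sample lines are all hypothetical.

```python
import re

# Assumed field format; the real Slice log layout is not shown in this ticket.
LOOPNUM_RE = re.compile(r"LOOPNUM:\s*(\d+)")


def loopnum_above(lines, error_text):
    """Find the first line containing error_text, then walk upward to the
    nearest LOOPNUM line. Returns the loop number, or None if not found."""
    for idx, line in enumerate(lines):
        if error_text in line:
            break
    else:
        return None  # the error text never appears
    for line in reversed(lines[:idx]):
        match = LOOPNUM_RE.search(line)
        if match:
            return int(match.group(1))
    return None


# Invented sample lines, for illustration only.
sample = [
    "  LOOPNUM: 304",
    "harness.slice DEBUG: something unrelated",
    "  LOOPNUM: 305",
    "harness.slice.WcsDeterminationStageParallel WARNING: Requested filter: r",
]
print(loopnum_above(sample, "Requested filter: r"))
```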

comment:6 Changed 8 years ago by dstn

I cannot reproduce this error.

gawk '{print "setup -j -v ", $1, $2}' /lsst3/weekly/datarel-runs/wp_tags_2011_0902_222518/config/weekly.tags  > s
. s
eups list -s > s2
diff -u s2 /lsst3/weekly/datarel-runs/wp_tags_2011_0902_222518/config/weekly.tags
# no difference; I've got the same environment set up.
/lsst/DC3/stacks/gcc443/15oct2010/Linux64/datarel/4.4.0.11/bin/sst/runImSim.py -v 886258751 -r 3,0 -s 1,2 -i /lsst3/weekly/datarel-runs/wp_tags_2011_0902_222518/input -o .
# runs to successful completion

I am closing (in my own mind) the meas_astrom parts of this ticket, chalking them up to undiagnosed I/O problems in the production environment.
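For readers less familiar with gawk, the first command above can be sketched in Python. This is an illustrative equivalent, not a replacement for the eups tooling: it assumes each line of weekly.tags starts with a product name and a version, as in the `astrometry_net_data imsim-2011-08-01-0 Current Setup` line quoted elsewhere in this ticket. The function name `setup_commands` is hypothetical.

```python
def setup_commands(tag_lines):
    """Build one 'setup -j -v <product> <version>' command per tags line."""
    commands = []
    for line in tag_lines:
        fields = line.split()
        if len(fields) < 2:
            # Skip blank or malformed lines (the gawk one-liner would
            # still print a bare 'setup -j -v' for these).
            continue
        commands.append("setup -j -v %s %s" % (fields[0], fields[1]))
    return commands


# One line in the format shown for config/weekly.tags in this ticket.
sample = ["astrometry_net_data imsim-2011-08-01-0 Current Setup"]
for command in setup_commands(sample):
    print(command)
```

Sourcing the generated commands and then diffing `eups list -s` against the tags file, as done above, confirms that the reproduced environment matches the failed run.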

comment:7 in reply to: ↑ 5 Changed 8 years ago by krughoff

Replying to robyn:

Replying to robyn:

I should add that I scanned upward from the error statement to find LOOPNUM in the Slice log.

I also could not repeat this. I mapped LOOPNUM 311 (which is one that failed with the std::bad_alloc exception according to work/PT1PipeA_2/Slice0.log) back to ProcessedRecords.out and found that the id is: visit:886258751 raft:4,2 sensor:2,0

lsst10> mkdir -p ~/Outputs/prov
lsst10> python /lsst/DC3/stacks/gcc443/15oct2010/Linux64/datarel/4.4.0.11/bin/prov/prov.py -d ~/Outputs/prov /lsst3/weekly/datarel-runs/wp_tags_2011_0902_222518 886258751 4,2 2,0

The above runs to completion with no errors or warnings.

Should I try running on another machine?
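The mapping from a ProcessedRecords.out record to a CCD id can be sketched as below. It assumes the first three whitespace-separated fields are visit, raft, and sensor, which is consistent with the `-v 886258751 -r 3,0 -s 1,2` arguments used earlier in this ticket for record 305. The names `CcdId` and `parse_record` are hypothetical.

```python
from collections import namedtuple

CcdId = namedtuple("CcdId", ["visit", "raft", "sensor"])


def parse_record(record):
    """Split a ProcessedRecords.out line into visit, raft and sensor.
    Assumes those are the first three whitespace-separated fields."""
    fields = record.split()
    return CcdId(visit=int(fields[0]), raft=fields[1], sensor=fields[2])


# Record 305 as quoted earlier in this ticket.
RECORD = "886258751 3,0 1,2 ------ 80 reads, 2 writes, 0 calexp persisted"
ccd = parse_record(RECORD)
print("visit:%d raft:%s sensor:%s" % ccd)
```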

comment:8 Changed 8 years ago by rhl

  • Owner changed from rhl to robyn
  • Status changed from new to assigned

I think that this is back to NCSA. We can't reproduce it. Simon says:

I can't reproduce the error. Do you have a suggestion as to further investigation? I've run chips from 4 different visits that all show the error in the logs, but all run to completion without error or warning using the weekly stack on lsst10. I've also run an example chip through pipette without error.

Maybe the next thing is to try the production environment? I'm passing it back to Robyn.

comment:9 Changed 8 years ago by robyn

  • Cc daues added

Just to close the loop on the differences between the ABE 4000 stack and the lsst cluster stack for the errant lsst cluster run: a few packages were present in one stack but not the other. I decided that they must not have been used. There was a single potential difference, but I'm not sure the package is used:

  • meas_multifitdata is 4.1.0.1 on ABE and 'trunk' on lsst. Can't tell if they're the same or not.
  • mysqlpython is 1.2.2+1 on ABE and 1.2.2 on lsst, but that should only indicate some special ld need on the platform, with both still using the same base code.
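A comparison like this can be sketched as a diff of two package-to-version maps. The two entries below are the ones listed in this comment; `example_pkg` is a hypothetical placeholder added only to show a matching package, and the function name `version_differences` is likewise an assumption.

```python
def version_differences(stack_a, stack_b):
    """Return {package: (version_in_a, version_in_b)} for every package whose
    version differs; None marks a package missing from one stack."""
    diffs = {}
    for pkg in sorted(set(stack_a) | set(stack_b)):
        a, b = stack_a.get(pkg), stack_b.get(pkg)
        if a != b:
            diffs[pkg] = (a, b)
    return diffs


abe = {
    "meas_multifitdata": "4.1.0.1",
    "mysqlpython": "1.2.2+1",
    "example_pkg": "1.0",  # hypothetical, not from the ticket
}
lsst = {
    "meas_multifitdata": "trunk",
    "mysqlpython": "1.2.2",
    "example_pkg": "1.0",  # hypothetical, not from the ticket
}

for pkg, (on_abe, on_lsst) in version_differences(abe, lsst).items():
    print("%s: ABE=%s lsst=%s" % (pkg, on_abe, on_lsst))
```

A string comparison like this cannot tell whether 'trunk' and 4.1.0.1 contain the same code; that still requires inspecting the package contents.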

Also, the lsst6 run's config/weekly.tags file shows the correct astrometry_net_data as installed:

  • astrometry_net_data imsim-2011-08-01-0 Current Setup

I will talk with Greg about the possibility of doing a run with the same dataset.

comment:10 Changed 8 years ago by robyn

  • Status changed from assigned to closed
  • Resolution set to wontfix

We decided that a rerun on the cluster would not be useful.

This ticket is being closed since we are moving towards 4.5 tags and a clean fresh 'Current Tagged' stack.
