wiki:DM/buildbot/Weekly_Production/June2011
Last modified on 07/18/2011 06:59:22 PM

DM/buildbot/Weekly_Production

June 2011 Runs

Detailed discussion of the Weekly Production Runs and Not-Runs

  • Arranged sequentially with Most Recent first.
  • Codeset: Trunk indicates that the trunk was used for the lsst stack. In the future, a tagged set may be specified.
  • Dataset: Full indicates the complete set of data specified in $SVN/DMS/datarel/pipeline/<date>_weekly.input. Short indicates a test run using a small subset of available image data.
  • To check on the status of an in-progress run, determine <date_time current run> from run details below, then:
    % cd /lsst3/weekly/datarel-runs/wp_trunk_<date_time current run> # eg 2011_0507_160801
    % /home/buildbot/slave/trunkVsTrunk_lsst/work/RunStatus.sh
    % /home/buildbot/slave/trunkVsTrunk_lsst/work/FindAllErrors.sh
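
Every run entry on this page derives its Setup, Output, Database and weekly.tags locations from the same <date_time> stamp. The following is a minimal sketch of that naming pattern, inferred only from the entries listed here. The function and class names (run_locations, RunLocations) are hypothetical and are not part of buildbot or datarel.

```python
from dataclasses import dataclass

# Roots copied from the run entries on this page.
OUTPUT_ROOT = "/lsst3/weekly/datarel-runs"
SETUP_ROOT = "/home/buildbot/slave/trunkVsTrunk_lsst/work/weeklyPR"


@dataclass(frozen=True)
class RunLocations:
    """Hypothetical container for the per-run locations."""
    setup: str
    output: str
    database: str
    tags_file: str


def run_locations(stamp, codeset="trunk"):
    """Build the locations for one run from its <date_time> stamp.

    codeset is "trunk" or "tags", matching the wp_trunk_* and wp_tags_*
    names used in the entries on this page.
    """
    if codeset not in ("trunk", "tags"):
        raise ValueError("codeset must be 'trunk' or 'tags'")
    run_name = f"wp_{codeset}_{stamp}"
    output = f"{OUTPUT_ROOT}/{run_name}"
    return RunLocations(
        setup=f"{SETUP_ROOT}/pipeline_{stamp}",
        output=output,
        database=f"buildbot_PT1_2_u_{run_name}",
        tags_file=f"{output}/config/weekly.tags",
    )


if __name__ == "__main__":
    print(run_locations("2011_0614_000759"))
```

Some entries deviate from the pattern (for example a _KILLED suffix, or a Setup directory overwritten by a later run), so the run details remain the authority.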
    

For a status overview of all daily buildbot runs see DM/buildbot/Daily_Status
Past Months' Daily_Status pages: May June

14 June 2011

0007

  • Why: Trunk Tuesday Run
  • Setup: /home/buildbot/slave/trunkVsTrunk_lsst/work/weeklyPR/pipeline_2011_0614_000759
  • Codeset: trunk; see /lsst3/weekly/datarel-runs/wp_trunk_2011_0614_000759/config/weekly.tags
  • Dataset: full
  • Output: /lsst3/weekly/datarel-runs/wp_trunk_2011_0614_000759
  • Database: buildbot_PT1_2_u_wp_trunk_2011_0614_000759
  • Status: ???
  • Resolution: Don't know; left on vacation before completion.

11 June 2011

1121

  • Why: Test Source Measurement fixes.
  • Setup: /home/buildbot/slave/trunkVsTrunk_lsst/work/weeklyPR/pipeline_2011_0611_112151
  • Codeset: trunk; see /lsst3/weekly/datarel-runs/wp_trunk_2011_0611_112151/config/weekly.tags
  • Dataset: full
  • Output: /lsst3/weekly/datarel-runs/wp_trunk_2011_0611_112151
  • Database: buildbot_PT1_2_u_wp_trunk_2011_0611_112151
  • Status: Completed. 500 out of 502 processed to completion.
    • 2 apparently hung in Source Measurement; refer to the final block of each log file indicated below:
      work/PT1PipeB_1/Slice0.log:   885724041 1,1 1,0 ------ 80 reads, 32 writes, 0 calexp persisted
      work/PT1PipeA_2/Slice0.log:   885335911 1,0 2,0 ------ 80 reads, 32 writes, 0 calexp persisted
      
  • Resolution:
    • Compliments to Jim who created a better status report tool which will get merged into datarel tomorrow or in 3 weeks. Until then, use:

/home/jbosch/runStatus.py -a -r <fullpath to run output root>

  • Back to Jim and Martin for more debugging. Consider asking GregD whether an orca run timeout might have been hit; the error log had no relevant error messages (only 3 warnings), and the run seems to have terminated without any errors.

9 June 2011

1331

  • Why: Test source association changes, test fix to eigen assertion failure, test misc meas_multifit fixes & improvements.
  • Setup: overwritten by following run; see actual work directory for provenance info.
  • Codeset: trunk; see /lsst3/weekly/datarel-runs/wp_trunk_2011_0609_133111/config/weekly.tags
  • Dataset: full
  • Output: /lsst3/weekly/datarel-runs/wp_trunk_2011_0609_133111
  • Database: buildbot_PT1_2_u_wp_trunk_2011_0609_133111
  • Status: Run partially completed.
    • Lost 3 pipeline slices due to assertion failure.
      • PT1PipeB_3/launch.log
        python: /lsst/DC3/stacks/gcc443/15oct2010/Linux64/external/boost/1.37.0+2/include/boost/shared_ptr.hpp:419:
         T* boost::shared_ptr< <template-parameter-1-1> >::operator->() const [with T = const lsst::afw::detection::Source]:
         Assertion `px != 0' failed.
        
      • PT1PipeA_1/launch.log
        python: /lsst/DC3/stacks/gcc443/15oct2010/Linux64/external/boost/1.37.0+2/include/boost/shared_ptr.hpp:419:
         T* boost::shared_ptr< <template-parameter-1-1> >::operator->() const [with T = const lsst::afw::detection::Source]:
         Assertion `px != 0' failed.
        
        
      • PT1PipeA_2/launch.log
        python: /lsst/DC3/stacks/gcc443/15oct2010/Linux64/external/boost/1.37.0+2/include/boost/shared_ptr.hpp:419:
         T* boost::shared_ptr< <template-parameter-1-1> >::operator->() const [with T = const lsst::afw::detection::Source]:
         Assertion `px != 0' failed.
        
    • Other 2 Pipelines may have hung during processing since their logs don't show activity after mid-afternoon on the 9th.
      PT1PipeA_1:  (Assert Failure occurred)
      total 393100
      -rw-rw-r-- 1 buildbot buildbot      3471 Jun  9 13:31 eups-env.txt
      -rw-rw-r-- 1 buildbot buildbot      1319 Jun 10 03:13 launch.log
      -rwx------ 1 buildbot buildbot      1470 Jun  9 13:31 launch_PT1PipeA_1.sh
      -rw-rw-r-- 1 buildbot buildbot        31 Jun  9 13:31 nodelist.paf
      -rw-rw-r-- 1 buildbot buildbot        22 Jun  9 13:31 nodelist.scr
      -rw-rw-r-- 1 buildbot buildbot 130911010 Jun 10 03:12 Pipeline.log
      -rw-rw-r-- 1 buildbot buildbot 271192729 Jun 10 03:12 Slice0.log
      
      PT1PipeA_2:   (Assert Failure occurred)
      total 86220
      -rw-rw-r-- 1 buildbot buildbot     3471 Jun  9 13:31 eups-env.txt
      -rw-rw-r-- 1 buildbot buildbot     1319 Jun  9 16:38 launch.log
      -rwx------ 1 buildbot buildbot     1470 Jun  9 13:31 launch_PT1PipeA_2.sh
      -rw-rw-r-- 1 buildbot buildbot       31 Jun  9 13:31 nodelist.paf
      -rw-rw-r-- 1 buildbot buildbot       22 Jun  9 13:31 nodelist.scr
      -rw-rw-r-- 1 buildbot buildbot 28672736 Jun  9 16:37 Pipeline.log
      -rw-rw-r-- 1 buildbot buildbot 59493681 Jun  9 16:37 Slice0.log
      
      PT1PipeB_1:
      total 72308
      -rw-rw-r-- 1 buildbot buildbot     3471 Jun  9 13:31 eups-env.txt
      -rw-rw-r-- 1 buildbot buildbot     1067 Jun  9 13:32 launch.log
      -rwx------ 1 buildbot buildbot     1470 Jun  9 13:31 launch_PT1PipeB_1.sh
      -rw-rw-r-- 1 buildbot buildbot       36 Jun  9 13:31 nodelist.paf
      -rw-rw-r-- 1 buildbot buildbot       27 Jun  9 13:31 nodelist.scr
      -rw-rw-r-- 1 buildbot buildbot 24042050 Jun  9 16:21 Pipeline.log
      -rw-rw-r-- 1 buildbot buildbot 49894068 Jun  9 16:21 Slice0.log
      
      PT1PipeB_2:
      total 30756
      -rw-rw-r-- 1 buildbot buildbot     3471 Jun  9 13:31 eups-env.txt
      -rw-rw-r-- 1 buildbot buildbot     1067 Jun  9 13:32 launch.log
      -rwx------ 1 buildbot buildbot     1470 Jun  9 13:31 launch_PT1PipeB_2.sh
      -rw-rw-r-- 1 buildbot buildbot       36 Jun  9 13:31 nodelist.paf
      -rw-rw-r-- 1 buildbot buildbot       27 Jun  9 13:31 nodelist.scr
      -rw-rw-r-- 1 buildbot buildbot 10207316 Jun  9 14:43 Pipeline.log
      -rw-rw-r-- 1 buildbot buildbot 21214181 Jun  9 14:43 Slice0.log
      
      PT1PipeB_3:   (Assert Failure occurred)
      total 80072
      -rw-rw-r-- 1 buildbot buildbot     3471 Jun  9 13:31 eups-env.txt
      -rw-rw-r-- 1 buildbot buildbot     1319 Jun  9 16:36 launch.log
      -rwx------ 1 buildbot buildbot     1470 Jun  9 13:31 launch_PT1PipeB_3.sh
      -rw-rw-r-- 1 buildbot buildbot       36 Jun  9 13:31 nodelist.paf
      -rw-rw-r-- 1 buildbot buildbot       27 Jun  9 13:31 nodelist.scr
      -rw-rw-r-- 1 buildbot buildbot 26623808 Jun  9 16:36 Pipeline.log
      -rw-rw-r-- 1 buildbot buildbot 55254983 Jun  9 16:36 Slice0.log
      
      PT1Pipe-joboffice:
      total 460
      drwxrwxr-x 2 buildbot buildbot  36864 Jun  9 13:32 dataAvailable
      -rw-rw-r-- 1 buildbot buildbot 376910 Jun 10 03:13 joboffice.log
      drwxrwxr-x 2 buildbot buildbot  20480 Jun 10 03:10 jobsAvailable
      drwxrwxr-x 2 buildbot buildbot  16384 Jun 10 03:10 jobsDone
      drwxrwxr-x 2 buildbot buildbot   4096 Jun 10 03:10 jobsInProgress
      drwxrwxr-x 2 buildbot buildbot   4096 Jun  9 13:32 jobsPossible
      drwxrwxr-x 2 buildbot buildbot   4096 Jun 10 03:10 pipelinesReady
      
  • The JobOffice log indicates that a stop event occurred, which is probably why the pipeline job terminated cleanly and then proceeded to ingest all the data into the DB.
    s COMMENT: job:done: jobDone on lsst5.ncsa.uiuc.edu finised successfully
    s LOG: PT1Pipe
    t DATE: 2011-06-10T08:10:09.250417
    L TIMESTAMP: 1307693443250417000
    i LEVEL: -2
    
    s COMMENT: job:ready: getAJob on lsst5.ncsa.uiuc.edu is ready
    s LOG: PT1Pipe
    t DATE: 2011-06-10T08:10:09.661186
    L TIMESTAMP: 1307693443661186000
    i LEVEL: -2
    
    s COMMENT: received stop event; shutting down JobOffice thread
    s LOG: PT1Pipe.stop
    t DATE: 2011-06-10T08:13:26.145520
    L TIMESTAMP: 1307693640145520000
    i LEVEL: -1
    
    s COMMENT: Stop requested; shutting down.
    s LOG: PT1Pipe
    t DATE: 2011-06-10T08:13:26.435958
    L TIMESTAMP: 1307693640435958000
    i LEVEL: 0
    
    s COMMENT: job office done.
    s LOG: PT1Pipe
    t DATE: 2011-06-10T08:13:26.436109
    L TIMESTAMP: 1307693640436109000
    i LEVEL: 0
    
  • Resolution: Many issues to address:
    • Assert issue still remains
    • Never-ending processes probably still remain
    • Analysts should still examine the output to determine if
      • Jim's improvements worked ok.
    • Serge should check that his ap changes installed properly
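
The suspicion above that two pipelines hung rests on log modification times that stopped advancing. A minimal sketch of that check follows, assuming the work directory holds one subdirectory per pipeline (PT1PipeA_1, PT1PipeB_2, ...) with a Slice0.log in each, as in the listings above. The function name stale_logs and the idle threshold are hypothetical.

```python
import os
import time


def stale_logs(work_dir, max_idle_seconds, now=None, log_name="Slice0.log"):
    """Return (pipeline_dir, idle_seconds) for logs idle longer than the limit.

    work_dir is assumed to contain one subdirectory per pipeline, each
    holding a log file called log_name.
    """
    now = time.time() if now is None else now
    stale = []
    for entry in sorted(os.listdir(work_dir)):
        log_path = os.path.join(work_dir, entry, log_name)
        if not os.path.isfile(log_path):
            continue  # e.g. PT1Pipe-joboffice has no Slice0.log
        idle = now - os.path.getmtime(log_path)
        if idle > max_idle_seconds:
            stale.append((entry, idle))
    return stale


if __name__ == "__main__":
    # Example: report pipelines whose slice log has been idle for over an hour.
    for name, idle in stale_logs(".", 3600):
        print(f"{name}: idle for {idle / 60:.0f} minutes")
```

A quiet log only suggests a hang; a slice that is busy in a long computation also stops writing, so a flagged pipeline still needs a look with gdb or the status scripts.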

8 June 2011

1845

  • Why: Test newly tagged version of thread-safe Citizen in daf_base.
  • Setup: /home/buildbot/slave/trunkVsTrunk_lsst/work/weeklyPR/pipeline_2011_0608_184515
  • Codeset: TAG with overlay for daf_base 4.4.0.1 and SHAPELET_MODEL_8 disabled in $DATAREL_DIR/PT1Pipe/SFM-sourceMeasure.paf; see /lsst3/weekly/datarel-runs/wp_tags_2011_0608_184515/config/weekly.tags
  • Dataset: full
  • Output: /lsst3/weekly/datarel-runs/wp_tags_2011_0608_184515
  • Database: buildbot_PT1_2_u_wp_tags_2011_0608_184515
  • Status: Failed when the filesystem ran out of space.
    • A scan of the error logs uncovered a MemoryException, which was ultimately traced to a failure of the environment setup shipped to and set up on each pipeline host. The failure prevented the experimental daf_base 4.4.0.1 from overlaying the Current tagged version.
    • When the filesystem ran out of space, the pipeline(s) did not fail gracefully; all processes hung. A trace of a single run_pipeline process and two of its threads follows:
      (gdb) info threads
        37 Thread 0x4267f940 (LWP 24350)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        36 Thread 0x43080940 (LWP 24351)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        35 Thread 0x43a81940 (LWP 24352)  0x0000003c3d00d91b in read () from /lib64/libpthread.so.0
        34 Thread 0x44482940 (LWP 24353)  0x0000003c3d00aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        33 Thread 0x44e83940 (LWP 24354)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        32 Thread 0x45884940 (LWP 24355)  0x0000003c3d00aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        31 Thread 0x414f5940 (LWP 24356)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        30 Thread 0x46285940 (LWP 24357)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        29 Thread 0x46c86940 (LWP 24358)  0x0000003c3d00d91b in read () from /lib64/libpthread.so.0
        28 Thread 0x47687940 (LWP 24359)  0x0000003c3d00aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        27 Thread 0x48088940 (LWP 24360)  0x0000003c3d00aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        26 Thread 0x48a89940 (LWP 24381)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        25 Thread 0x4948a940 (LWP 24382)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        24 Thread 0x49e8b940 (LWP 24383)  0x0000003c3d00d91b in read () from /lib64/libpthread.so.0
        23 Thread 0x4a88c940 (LWP 24384)  0x0000003c3d00aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        22 Thread 0x4b28d940 (LWP 24388)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        21 Thread 0x4bc8e940 (LWP 24389)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        20 Thread 0x4c68f940 (LWP 24390)  0x0000003c3d00d91b in read () from /lib64/libpthread.so.0
        19 Thread 0x4d090940 (LWP 24392)  0x0000003c3d00aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        18 Thread 0x4da91940 (LWP 24397)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        17 Thread 0x4e492940 (LWP 24398)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        16 Thread 0x4ee93940 (LWP 24399)  0x0000003c3d00d91b in read () from /lib64/libpthread.so.0
        15 Thread 0x4f894940 (LWP 24400)  0x0000003c3d00aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        14 Thread 0x50295940 (LWP 24405)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        13 Thread 0x50c96940 (LWP 24406)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        12 Thread 0x51697940 (LWP 24407)  0x0000003c3d00d91b in read () from /lib64/libpthread.so.0
        11 Thread 0x52098940 (LWP 24408)  0x0000003c3d00aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        10 Thread 0x52a99940 (LWP 24410)  0x0000003c3d00aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        9 Thread 0x5349a940 (LWP 24414)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        8 Thread 0x53e9b940 (LWP 24415)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        7 Thread 0x5489c940 (LWP 24416)  0x0000003c3d00d91b in read () from /lib64/libpthread.so.0
        6 Thread 0x5529d940 (LWP 24417)  0x0000003c3d00aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        5 Thread 0x55c9e940 (LWP 24418)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        4 Thread 0x5669f940 (LWP 24419)  0x0000003c3d00b150 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
        3 Thread 0x570a0940 (LWP 24420)  0x0000003c3d00d91b in read () from /lib64/libpthread.so.0
        2 Thread 0x57aa1940 (LWP 24421)  0x0000003c3d00aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
      * 1 Thread 0x2b14c7268ca0 (LWP 24348)  0x0000003c3c4cd722 in select () from /lib64/libc.so.6
      (gdb) 
      
      (gdb) where
      #0  0x0000003c3c4cd722 in select () from /lib64/libc.so.6
      #1  0x00002b14cf9a4579 in floatsleep (self=<value optimized out>, 
          args=<value optimized out>)
          at /lsst/DC3/stacks/gcc443/15oct2010/EupsBuildDir/Linux64/python-2.5.2/Python-2.5.2/Modules/timemodule.c:910
      #2  time_sleep (self=<value optimized out>, args=<value optimized out>)
          at /lsst/DC3/stacks/gcc443/15oct2010/EupsBuildDir/Linux64/python-2.5.2/Python-2.5.2/Modules/timemodule.c:206
      #3  0x00002b14c676b9f9 in call_function (f=0xebcb8f0, 
          throwflag=<value optimized out>) at Python/ceval.c:3573
      #4  PyEval_EvalFrameEx (f=0xebcb8f0, throwflag=<value optimized out>)
          at Python/ceval.c:2272
      #5  0x00002b14c676bc16 in fast_function (f=0xebc9a00, 
          throwflag=<value optimized out>) at Python/ceval.c:3659
      #6  call_function (f=0xebc9a00, throwflag=<value optimized out>)
          at Python/ceval.c:3594
      #7  PyEval_EvalFrameEx (f=0xebc9a00, throwflag=<value optimized out>)
          at Python/ceval.c:2272
      #8  0x00002b14c676bc16 in fast_function (f=0xe67e7a0, 
          throwflag=<value optimized out>) at Python/ceval.c:3659
      #9  call_function (f=0xe67e7a0, throwflag=<value optimized out>)
          at Python/ceval.c:3594
      #10 PyEval_EvalFrameEx (f=0xe67e7a0, throwflag=<value optimized out>)
          at Python/ceval.c:2272
      #11 0x00002b14c676ce11 in PyEval_EvalCodeEx (co=0x2b14ca904288, 
          globals=<value optimized out>, locals=<value optimized out>, args=0x6, 
          argcount=128158480, kws=0xe67c818, kwcount=0, defs=0x2b14c72d1a60, 
          defcount=4, closure=0x0) at Python/ceval.c:2836
      #12 0x00002b14c676b98f in fast_function (f=0xe67c650, 
          throwflag=<value optimized out>) at Python/ceval.c:3669
      #13 call_function (f=0xe67c650, throwflag=<value optimized out>)
          at Python/ceval.c:3594
      #14 PyEval_EvalFrameEx (f=0xe67c650, throwflag=<value optimized out>)
          at Python/ceval.c:2272
      #15 0x00002b14c676bc16 in fast_function (f=0xe1c78f0, 
          throwflag=<value optimized out>) at Python/ceval.c:3659
      #16 call_function (f=0xe1c78f0, throwflag=<value optimized out>)
          at Python/ceval.c:3594
      #17 PyEval_EvalFrameEx (f=0xe1c78f0, throwflag=<value optimized out>)
          at Python/ceval.c:2272
      #18 0x00002b14c676ce11 in PyEval_EvalCodeEx (co=0x2b14ca9043f0, 
          globals=<value optimized out>, locals=<value optimized out>, args=0x0, 
          argcount=128158480, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0)
          at Python/ceval.c:2836
      #19 0x00002b14c676cf32 in PyEval_EvalCode (co=0x0, globals=0x0, locals=0x0)
          at Python/ceval.c:494
      #20 0x00002b14c678c070 in run_mod (fp=0xe14f010, 
          filename=<value optimized out>, start=<value optimized out>, 
          globals=<value optimized out>, locals=0xe173220, closeit=1, 
          flags=0x7fff07a39500) at Python/pythonrun.c:1273
      #21 PyRun_FileExFlags (fp=0xe14f010, filename=<value optimized out>, 
          start=<value optimized out>, globals=<value optimized out>, 
          locals=0xe173220, closeit=1, flags=0x7fff07a39500)
          at Python/pythonrun.c:1259
      #22 0x00002b14c678c227 in PyRun_SimpleFileExFlags (fp=0xe14f010, 
          filename=0x7fff07a3a5a6 "/lsst/DC3/stacks/gcc443/15oct2010/Linux64/pex_harness/4.3.0.0/bin/runPipeline.py", closeit=1, flags=0x7fff07a39500)
          at Python/pythonrun.c:879
      #23 0x00002b14c6795521 in Py_Main (argc=7, argv=<value optimized out>)
          at Modules/main.c:523
      #24 0x0000003c3c41d994 in __libc_start_main () from /lib64/libc.so.6
      #25 0x0000000000400669 in _start ()
      (gdb) thread 3
      [Switching to thread 3 (Thread 0x570a0940 (LWP 24420))]#0  0x0000003c3d00d91b in read () from /lib64/libpthread.so.0
      (gdb) where
      #0  0x0000003c3d00d91b in read () from /lib64/libpthread.so.0
      #1  0x00002b14ce626549 in apr_socket_recv ()
         from /lsst/DC3/stacks/gcc443/15oct2010/Linux64/external/activemqcpp/3.1.2/lib/libactivemq-cpp.so.9
      #2  0x00002b14ce5ea1ae in decaf::net::SocketInputStream::read (
          this=0x2aaaacbe36c0, buffer=<value optimized out>, 
          offset=<value optimized out>, bufferSize=18446744073709551615)
          at decaf/net/SocketInputStream.cpp:179
      #3  0x00002b14ce5ce2eb in decaf::io::BufferedInputStream::bufferData (
          this=0x2aaaad1b3250) at decaf/io/BufferedInputStream.cpp:265
      #4  0x00002b14ce5ce990 in decaf::io::BufferedInputStream::read (
          this=0x2aaaad1b3250, targetBuffer=0x2aaaad1ba9da "", offset=0, 
          targetBufferSize=4) at decaf/io/BufferedInputStream.cpp:186
      #5  0x00002b14ce5d4e66 in readAllData (this=0x2aaaad1ba9b0)
          at ./decaf/io/DataInputStream.h:379
      #6  decaf::io::DataInputStream::readInt (this=0x2aaaad1ba9b0)
          at decaf/io/DataInputStream.cpp:164
      #7  0x00002b14ce453384 in activemq::wireformat::openwire::OpenWireFormat::unmarshal (this=0x2aaaacbc9f80, transport=0x1, dis=0xffffffffffffffff)
          at activemq/wireformat/openwire/OpenWireFormat.cpp:263
      #8  0x00002b14ce408965 in activemq::transport::IOTransport::run (
          this=0x2aaaad188690) at activemq/transport/IOTransport.cpp:233
      #9  0x00002b14ce5e6522 in decaf::lang::ThreadProperties::runCallback (
          properties=0x2aaaad19f350) at decaf/lang/Thread.cpp:133
      #10 0x00002b14ce5e418b in (anonymous namespace)::threadWorker (
          arg=0x2aaaad19f350) at decaf/lang/Thread.cpp:186
      #11 0x0000003c3d00673d in start_thread () from /lib64/libpthread.so.0
      #12 0x0000003c3c4d44bd in clone () from /lib64/libc.so.6
      (gdb) 
      
  • Resolution:
    • Fix the package overlay issue.
    • Should the 'No space on dev' failure have provoked a different ending rather than a permanent wait-hang?
    • 3% of Total Space cleared on /lsst3; can probably do more.
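
Since this run died when the filesystem filled up, a pre-flight check of free space before launching is one possible safeguard. The sketch below uses Python's standard shutil.disk_usage. The 10% threshold and the function names are assumptions for illustration, not an existing buildbot feature.

```python
import shutil


def free_fraction(path):
    """Return the fraction of the filesystem holding path that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def enough_space(path, min_free_fraction=0.10):
    """True if at least min_free_fraction of the filesystem is free.

    The 10% default is an illustrative assumption, not a measured requirement.
    """
    return free_fraction(path) >= min_free_fraction


if __name__ == "__main__":
    target = "."  # on the production host this would be the /lsst3 output area
    if not enough_space(target):
        raise SystemExit(f"Refusing to launch: less than 10% free on {target}")
    print(f"{free_fraction(target):.1%} free on {target}")
```

A check like this only guards the launch; it does not answer whether a pipeline that hits a full disk mid-run should abort instead of waiting forever.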

1349

  • Why: Test newly tagged version of thread-safe Citizen in daf_base.
  • Setup: /home/buildbot/slave/trunkVsTrunk_lsst/work/weeklyPR/pipeline TBD after run: _2011_0608_134930
  • Codeset: TAG with overlay of daf_base 4.4.0.1; see /lsst3/weekly/datarel-runs/wp_tags_2011_0608_134930/config/weekly.tags
  • Dataset: full
  • Output: /lsst3/weekly/datarel-runs/wp_tags_2011_0608_134930
  • Database: buildbot_PT1_2_u_wp_tags_2011_0608_134930
  • Status: Failed because I didn't disable an algorithm that is not available in the tagged suite of packages.
  • Resolution: Rerun with the algorithm disabled, as was done for the previous tag run last week.

7 June 2011

2210

  • Why: Test Galaxy Modeling
  • Setup: /home/buildbot/slave/trunkVsTrunk_lsst/work/weeklyPR/pipeline TBD after run: _2011_0607_221040
  • Codeset: trunk; see /lsst3/weekly/datarel-runs/wp_trunk_2011_0607_221040/config/weekly.tags
  • Dataset: full Tuesday dataset
  • Output: /lsst3/weekly/datarel-runs/wp_trunk_2011_0607_221040
  • Database: buildbot_PT1_2_u_wp_trunk_2011_0607_221040
  • Status: 501 out of 501 sensors processed. Problems:
    • No errors cited by automatic scanner (will be updated) but 2 'assertion' failures were found in the logs.
      • Run Status output: the job office counts need to be verified and refined. Data Available includes the 10 end-processing lines incorporated in the sensor list.
        /lsst3/weekly/datarel-runs/wp_trunk_2011_0607_221040
         Data Available: 512 ( minus 11 expected but extraneous counts)
         Jobs Available: 8
         Jobs Possible: 0
         Jobs In Progress: 5  (2 entries totally missing SRC output: V:886257211;R2,0;S:0,0  V:886257211;R2,0;S:1,1 )
         Jobs Done: 499       (same 2 entries are not in jobsDone directory)
        -----------------------------------------------------------------
         icSrc: 501
         psf: 501
         sdqaCcd: 1004
         src: 499              (missing V:886257211;R2,0;S:0,0  V:886257211;R2,0;S:1,1 )
         calexp: 501
         icMatch: 501
         postISR: 16064
         apCorr: 501
         csv-SourceAssoc: 129
         
         Science_Ccd_Exposure.csv: exists
         Science_Ccd_Exposure_Metadata.csv: exists
         Raw_Amp_To_Snap_Ccd_Exposure.csv: exists
         Snap_Ccd_To_Science_Ccd_Exposure.csv: exists
         sdqa_Rating_ForScienceAmpExposure.csv: exists
         sdqa_Rating_ForScienceCcdExposure.csv: exists
        
      • File: PT1PipeA_1/Slice0.log ends in midst of processing: {'raft': '2,0', 'sensor': '1,1', 'visit': 886257211}; PT1PipeA_1/Launch.log last line indicates:
        python: /lsst/DC3/stacks/gcc443/15oct2010/Linux64/external/eigen/2.0.15/include/Eigen/src/Core/CwiseNullaryOp.h:70: Eigen::CwiseNullaryOp<NullaryOp, MatrixType>::CwiseNullaryOp(int, int, const NullaryOp&) [with NullaryOp = Eigen::ei_scalar_constant_op<double>, MatrixType = lsst::ndarray::EigenView<double, 1, 1>]: Assertion `rows > 0 && (RowsAtCompileTime == Dynamic || RowsAtCompileTime == rows) && cols > 0 && (ColsAtCompileTime == Dynamic || ColsAtCompileTime == cols)' failed.
        
      • File PT1PipeB_1/Slice0.log ends in midst of processing: {'raft': '2,0', 'sensor': '0,0', 'visit': 886257211}. PT1PipeB_1/Launch.log last line indicates:
        python: /lsst/DC3/stacks/gcc443/15oct2010/Linux64/external/eigen/2.0.15/include/Eigen/src/Core/CwiseNullaryOp.h:70: Eigen::CwiseNullaryOp<NullaryOp, MatrixType>::CwiseNullaryOp(int, int, const NullaryOp&) [with NullaryOp = Eigen::ei_scalar_constant_op<double>, MatrixType = lsst::ndarray::EigenView<double, 1, 1>]: Assertion `rows > 0 && (RowsAtCompileTime == Dynamic || RowsAtCompileTime == rows) && cols > 0 && (ColsAtCompileTime == Dynamic || ColsAtCompileTime == cols)' failed
        
    • Once again, the /lsst3/weekly/datarel-runs/latest_trunk symlink was not established on run completion. Have not checked whether the DB view of latest was created successfully, although linkDb.log indicates that it was. Note: the DB view has not been renamed to take latest_tag or latest_trunk into consideration.
  • Resolution:
    • Analysts should check the new results.
    • Developers using eigen should check the assertion failures and determine if they can be wrapped in an exception handler.
    • Buildbot needs to solve the missing sym links problem; check the old-style DB view was properly constructed and update to *_latest_tag/*_latest_trunk views.
      • verified old-style DB view was properly constructed
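
The automatic scanner missed the two assertion failures in this run because they appear only as the last line of the launch logs. The sketch below shows one way such lines could be picked up. It assumes the glibc-style "Assertion `...' failed" wording seen above and one launch log per pipeline directory. The function name and directory layout are hypothetical, and this is not the code of FindAllErrors.sh.

```python
import os
import re

# Matches the glibc assert() wording shown in the launch logs above.
# DOTALL lets the pattern span a message that was wrapped across lines.
ASSERTION_RE = re.compile(r"Assertion `(?P<expr>.*?)' failed", re.DOTALL)


def find_assertion_failures(work_dir, log_name="launch.log"):
    """Return (pipeline_dir, asserted_expression) for each failure found.

    work_dir is assumed to hold one subdirectory per pipeline, each with a
    launch log called log_name.
    """
    hits = []
    for entry in sorted(os.listdir(work_dir)):
        path = os.path.join(work_dir, entry, log_name)
        if not os.path.isfile(path):
            continue
        with open(path, errors="replace") as handle:
            text = handle.read()
        for match in ASSERTION_RE.finditer(text):
            hits.append((entry, match.group("expr")))
    return hits


if __name__ == "__main__":
    for pipeline, expr in find_assertion_failures("."):
        print(f"{pipeline}: assertion failed: {expr}")
```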

1350

  • Why: Provide running but hung pipeline as debug fodder to gdb for developers
  • Setup: /home/buildbot/slave/trunkVsTrunk_lsst/work/weeklyPR/pipeline_2011_0607_135044
  • Codeset: trunk; see /lsst3/weekly/datarel-runs/wp_trunk_2011_0607_135044/config/weekly.tags
  • Dataset: full Tuesday dataset
  • Output: /lsst3/weekly/datarel-runs/wp_trunk_2011_0607_135044
  • Database: buildbot_PT1_2_u_wp_trunk_2011_0607_135044
  • Status: Hung almost immediately. Developers looked at the processes and then were done.
  • Resolution: Terminated hung pipeline. Developers will recompile with no optimization enabled and then retest using the prov.py restart process.

6 June 2011

2301

  • Why: Test Galaxy Modeling
  • Setup: /home/buildbot/slave/trunkVsTrunk_lsst/work/weeklyPR/pipeline_2011_0606_230142_KILLED
  • Codeset: trunk; see /lsst3/weekly/datarel-runs/wp_trunk_2011_0606_230142/config/weekly.tags
  • Dataset: full Tuesday dataset
  • Output: /lsst3/weekly/datarel-runs/wp_trunk_2011_0606_230142_KILLED
  • Database: TBD Deleted: buildbot_PT1_2_u_wp_trunk_2011_0606_230142
  • Status: Killed - hung in source measurement for first data record on each of 5 pipeline processes. Live trace:
    (gdb) thread 4
    [Switching to thread 4 (Thread 0x44d84940 (LWP 18973))]#0  compute_d (G=...,
        g0=<value optimized out>, CE=<value optimized out>,
        ce0=<value optimized out>, CI=<value optimized out>,
        ci0=<value optimized out>, x=...) at src/qp/QuadProg++.cc:494
    494         for (j = 0; j < n; j++)
    (gdb) where
    #0  compute_d (G=..., g0=<value optimized out>, CE=<value optimized out>,
        ce0=<value optimized out>, CI=<value optimized out>,
        ci0=<value optimized out>, x=...) at src/qp/QuadProg++.cc:494
    #1  QuadProgPP::solve_quadprog (G=..., g0=<value optimized out>,
        CE=<value optimized out>, ce0=<value optimized out>,
        CI=<value optimized out>, ci0=<value optimized out>, x=...)
        at src/qp/QuadProg++.cc:331
    #2  0x00002aaab8a67ffc in lsst::meas::multifit::QPSolver::solve (
        this=<value optimized out>, x=...) at src/qp.cc:97
    #3  0x00002aaab8a61fbd in lsst::meas::multifit::Evaluation::ConstrainedSolver::solve(lsst::ndarray::EigenView<double const, 2, 2> const&, lsst::ndarray::EigenView<double const, 2, 2> const&, lsst::ndarray::EigenView<double const, 1, 1> const&, lsst::ndarray::EigenView<double, 1, 1>) ()
       from /home/buildbot/buildbotSandbox/Linux64/meas_multifit/svn22230/lib/libmeas_multifit.so
    #4  0x00002aaab8a5a8e7 in lsst::meas::multifit::Evaluation::ensureCoefficients
        (this=0x44d81c80) at src/Evaluation.cc:219
    #5  0x00002aaab8a5b055 in lsst::meas::multifit::Evaluation::ensureModelVector (
        this=0x2aaac0a110d0) at src/Evaluation.cc:225
    #6  0x00002aaab8a5b2ed in lsst::meas::multifit::Evaluation::ensureResiduals (
        this=0x2aaac0a110d0) at src/Evaluation.cc:237
    #7  0x00002aaab8a5b6ed in lsst::meas::multifit::Evaluation::ensureObjectiveValue (this=0x2aaac0a110d0) at src/Evaluation.cc:265
    #8  0x00002aaab8a76abd in getObjectiveValue (this=0x44d82360,
        evaluator=<value optimized out>, nTestPoints=<value optimized out>)
        at include/lsst/meas/multifit/Evaluation.h:176
    #9  lsst::meas::multifit::BruteForceSourceOptimizer::solve (this=0x44d82360,
        evaluator=<value optimized out>, nTestPoints=<value optimized out>)
        at src/BruteForceSourceOptimizer.cc:56
    #10 0x00002aaab8aae380 in boost::shared_ptr<lsst::afw::detection::Photometry> lsst::meas::multifit::ShapeletModelPhotometry<8>::doMeasure<lsst::afw::image::Exposure<float, unsigned short, float> >(boost::shared_ptr<lsst::afw::image::Exposure<float, unsigned short, float> const>, boost::shared_ptr<lsst::afw::detection::Peak const>, boost::shared_ptr<lsst::afw::detection::Source const>) ()
       from /home/buildbot/buildbotSandbox/Linux64/meas_multifit/svn22230/lib/libmeas_multifit.so
    #11 0x00002aaab5968375 in lsst::afw::detection::MeasureQuantity<lsst::afw::detection::Photometry, lsst::afw::image::Exposure<float, unsigned short, float>, lsst::afw::detection::Peak>::measure (this=0x2aaac10a02e0, peak=)
        at /home/buildbot/buildbotSandbox/Linux64/afw/svn21941/include/lsst/afw/detection/Measurement.h:476
    #12 0x00002aaab5d1a193 in lsst::meas::algorithms::MeasureSources<lsst::afw::image::Exposure<float, unsigned short, float> >::apply(boost::shared_ptr<lsst::afw::detection::Source>, boost::shared_ptr<lsst::afw::detection::Footprint const>)
        ()
       from /home/buildbot/buildbotSandbox/Linux64/meas_algorithms/svn21974/lib/libmeas_algorithms.so
    #13 0x00002aaab590a993 in _wrap_MeasureSourcesF_apply__SWIG_0 (
        self=<value optimized out>, args=<value optimized out>)
        at python/lsst/meas/algorithms/algorithmsLib_wrap.cc:21086
    #14 _wrap_MeasureSourcesF_apply (self=<value optimized out>,
        args=<value optimized out>)
        at python/lsst/meas/algorithms/algorithmsLib_wrap.cc:21197
    
  • Resolution:
    • Developers preferred to recreate error on a local system. Production was aborted.

1 June 2011

1717

  • Why: Use new suite of 9 images taken from PT1.2 Production image dataset
  • Setup: /home/buildbot/slave/trunkVsTrunk_lsst/work/weeklyPR/pipeline_2011_0601_171743
  • Codeset: trunk; see /lsst3/weekly/datarel-runs/wp_trunk_2011_0601_171743/config/weekly.tags
  • Dataset: full, new set of 9 images taken from PT1.2 Production dataset
  • Output: /lsst3/weekly/datarel-runs/wp_trunk_2011_0601_171743
  • Database: buildbot_PT1_2_u_wp_trunk_2011_0601_171743
  • Status: Processed 501 out of 502 input sensor records.
    • Final processing counts:
       icSrc: 501
       psf: 501
       sdqaCcd: 1004
       src: 499
       calexp: 501
       icMatch: 501
       postISR: 16064   (16064/32 = 502 sensors input)
       apCorr: 501
       csv-SourceAssoc: 129
      
    • Memory Exception Error
      • Input data tangentially responsible for failure: {'snap': 1, 'raft': '0,2', 'sensor': '1,2', 'visit': 886236101}
      • Last persistable output prior to memory exception: Ending persisting sourceSet_persistable as src with keys {'raft': '0,2', 'sensor': '1,2', 'visit': 886236101}
      • Error Traceback
        harness.slice.visit.stage.tryProcess FATAL: Traceback (most recent call last):
          File "/home/buildbot/buildbotSandbox/Linux64/pex_harness/svn20013/python/lsst/pex/harness/Slice.py", line 563, in tryProcess
            stageObject.applyProcess()
          File "/home/buildbot/buildbotSandbox/Linux64/pex_harness/svn20013/python/lsst/pex/harness/stage.py", line 353, in applyProcess
            self.process(clipboard)
          File "/home/buildbot/buildbotSandbox/Linux64/ctrl_sched/svn20014/python/lsst/ctrl/sched/pipeline.py", line 511, in process
            self.tellJobDone(clipboard)
          File "/home/buildbot/buildbotSandbox/Linux64/ctrl_sched/svn20014/python/lsst/ctrl/sched/pipeline.py", line 498, in tellJobDone
            self.tellDataReady(clipboard)
          File "/home/buildbot/buildbotSandbox/Linux64/ctrl_sched/svn20014/python/lsst/ctrl/sched/pipeline.py", line 441, in tellDataReady
            possible = client.tellDataReady(possible, completed)
          File "/home/buildbot/buildbotSandbox/Linux64/ctrl_sched/svn20014/python/lsst/ctrl/sched/pipeline.py", line 207, in tellDataReady
            self.dataSender.createDatasetEvent(self.name, report, fullsuccess))
          File "/home/buildbot/buildbotSandbox/Linux64/ctrl_sched/svn20014/python/lsst/ctrl/sched/utils.py", line 123, in send
            self.trxr.publishEvent(event.create())
          File "/home/buildbot/slave/trunkVsTrunk_lsst/work/svn/ctrl_events_21011/python/lsst/ctrl/events/eventsLib.py", line 939, in publishEvent
            return _eventsLib.EventTransmitter_publishEvent(*args)
        LsstCppException: 0: lsst::pex::exceptions::MemoryException thrown at src/Citizen.cc:332 in long unsigned int lsst::daf::base::defaultCorruptionCallback(const lsst::daf::base::Citizen*)
        0: Message: Citizen "2116043: 0xbe5a528 lsst::daf::base::PropertySet" is corrupted
        
  • Resolution:
    • Time to merge that Ticket fixing the memory issue.
    • Time for Buildbot to figure out why 'latest_trunk' and 'latest_tags' are not being automatically set on job completion.
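
The note "(16064/32 = 502 sensors input)" in the final processing counts suggests a simple consistency check: derive the number of input sensors from the postISR count, then compare each per-sensor output against it. The sketch below is illustrative only. Treating icSrc, psf, src, calexp, icMatch and apCorr as one output per sensor is an assumption read off the counts above, and the function names are hypothetical.

```python
def sensors_input(post_isr_count, post_isr_per_sensor=32):
    """Derive the number of input sensors from the postISR count.

    The divisor 32 comes from the note "(16064/32 = 502 sensors input)".
    """
    sensors, remainder = divmod(post_isr_count, post_isr_per_sensor)
    if remainder:
        raise ValueError("postISR count is not a whole number of sensors")
    return sensors


def shortfalls(counts, expected, per_sensor_types):
    """Return {dataset: missing} for per-sensor datasets below expected."""
    return {
        name: expected - counts.get(name, 0)
        for name in per_sensor_types
        if counts.get(name, 0) < expected
    }


if __name__ == "__main__":
    # Final processing counts copied from the status block above.
    counts = {"icSrc": 501, "psf": 501, "src": 499, "calexp": 501,
              "icMatch": 501, "postISR": 16064, "apCorr": 501}
    expected = sensors_input(counts["postISR"])
    per_sensor = ["icSrc", "psf", "src", "calexp", "icMatch", "apCorr"]
    print(expected, shortfalls(counts, expected, per_sensor))
```

With these numbers every per-sensor dataset is at least one short of the 502 inputs, and src is three short, which matches the single sensor lost to the memory exception plus the lower src count reported above.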