Changes between Version 70 and Version 71 of FaultDetection


Timestamp:
07/29/2008 10:44:23 PM (11 years ago)
Author:
daues
Comment:

--

  • FaultDetection

    v70 v71  
    1111Long-lived processes that execute application codes and perform computation in the pipeline framework will be monitored by watchdog daemons that check on their activity and state. The watchdog daemons will listen for messages from the running processes (for example, through subscription to appropriate event topics or channels), and will also employ heartbeat monitoring to periodically check whether a process is still active. In addition to checking whether the process is alive, the heartbeat monitor will examine the state of the outputs and properties of the process to assess whether it is performing adequately or has started operating at an unacceptably low level. Processes that the watchdog daemon has determined 1) to have exited abnormally, 2) to have started operating slowly or inefficiently, 3) to be hanging, or 4) to be exhibiting runaway behavior that consumes exorbitant resources will be halted as needed and restarted by the watchdog. Repair or transitioning of parallel communicators will be an important part of the recovery of the system when failed processes are removed and/or restarted. Communicators might proceed with a subset of running processes after failed threads have been pared away (a case of degraded service or capability), or they may be fully reconstituted with fresh processes added to replace failed ones. The choice between these paths will depend on context: for real-time processing, communicators will not be reconstituted if the time required would cause the system to miss a deadline and fall unacceptably behind, whereas large-scale archive center reprocessing may operate in a more thorough manner and rebuild the communicators.
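The four failure conditions listed above could be checked along the following lines. This is only a minimal sketch, not the pipeline framework's actual interface: the `ProcessStatus` fields, the `assess` and `needs_restart` functions, and all threshold values are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    HEALTHY = "healthy"
    FINISHED = "exited normally"
    EXITED = "exited abnormally"
    SLOW = "operating slowly"
    HANGING = "hanging"
    RUNAWAY = "runaway resource use"


@dataclass
class ProcessStatus:
    """Snapshot a watchdog might assemble for one process (hypothetical fields)."""
    exit_code: Optional[int]   # None while the process is still running
    last_heartbeat: float      # time of the most recent heartbeat, in seconds
    items_per_second: float    # recent throughput reported by the process
    memory_mb: float           # current memory use


def assess(status, now, heartbeat_timeout=30.0, min_rate=1.0, max_memory_mb=4096.0):
    """Classify a process; the default thresholds are illustrative assumptions."""
    if status.exit_code is not None:
        # A nonzero exit code is treated here as an abnormal exit.
        return Verdict.FINISHED if status.exit_code == 0 else Verdict.EXITED
    if now - status.last_heartbeat > heartbeat_timeout:
        # No heartbeat within the timeout: assume the process is hanging.
        return Verdict.HANGING
    if status.memory_mb > max_memory_mb:
        return Verdict.RUNAWAY
    if status.items_per_second < min_rate:
        return Verdict.SLOW
    return Verdict.HEALTHY


def needs_restart(verdict):
    """True for the four conditions that call for a halt and restart."""
    return verdict in {Verdict.EXITED, Verdict.SLOW, Verdict.HANGING, Verdict.RUNAWAY}
```

A watchdog loop would call `assess` on each monitored process at every heartbeat interval and halt and restart those for which `needs_restart` returns true. Checking for a missed heartbeat before examining throughput is a design choice in this sketch: a process that has stopped reporting cannot supply trustworthy throughput figures.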
    1212 
    13 === Checksum Validation at each Stage and across Levels === 
     13=== Checksum Validation within a Data Access Framework === 
    1414 
    1515Detection of errors in the transfer of image files across networks and into archive spaces and file systems will be accomplished using checksum validation after each operation. The process of ensuring the integrity of the raw data will begin with the creation of a checksum at an early juncture, perhaps before the image arrives in the LSST DM system proper. Redundant data will be generated at a very early stage as well.
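Validation after a single transfer operation might look like the sketch below. The text does not name a checksum algorithm or a transfer mechanism, so SHA-256 and a local file copy are assumptions made here for illustration, and `file_checksum` and `transfer_with_validation` are hypothetical helper names.

```python
import hashlib
import shutil


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_with_validation(source, destination, expected_checksum):
    """Copy a file, then verify the copy against a checksum created earlier.

    The local copy stands in for a network or archive transfer (an assumption).
    """
    shutil.copyfile(source, destination)
    actual = file_checksum(destination)
    if actual != expected_checksum:
        raise IOError(
            "checksum mismatch for %s: expected %s, got %s"
            % (destination, expected_checksum, actual)
        )
    return actual
```

The expected checksum would be computed once, as early as possible, and carried along with the image so that every later operation can be checked against the same value.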