Changes between Initial Version and Version 1 of FaultToleranceDocumentUseCases


Timestamp:
07/31/2008 08:21:47 AM (11 years ago)
Author:
rlaher

     1= LSST DMS Fault-Tolerance Use Cases = 
     2 
     3== Introduction == 
     4 
     5The subsections below present use cases for faults in the LSST DMS.  These use cases
     6answer the question: what could happen
     7to cause a fault in the LSST DMS?  Another way to look at it is that a
     8fault-tolerance use case essentially comprises the "alternate course" of a basic  
     9use case. 
     10 
     11Of course, it is impossible to make a complete list of everything that could   
     12possibly go wrong.  However, listing the problems from past experience on prior  
     13ground-based astronomical pipeline-processing projects seems to be a good  
     14starting point for building up a set of credible fault-tolerance use cases for the  
     15LSST DMS.  Delving into the details will help ensure that the LSST DMS is
     16robust, since it is commonly held that well over half of a project's complexity
     17comes from handling alternate courses of action or, simply, faults.
     18 
     19These use cases are primarily formulated from the perspective of what could
     20go wrong to interfere with the transmission and/or preservation of the
     21raw data (images and metadata), and/or production of the processed data
     22products (nightly alerts and science catalog data).  Prevention of loss of raw
     23image data and metadata is an absolutely essential job of the
     24LSST DMS.
     25 
     26Another class of use cases of deep concern covers the generation of nightly alerts
     27and preservation of science catalog data.  These use cases are aimed at meeting LSST
     28science requirements; special attention to meeting functional requirements is
     29also critical.
     30 
     31It is assumed that the LSST DMS has subsystems in the summit observatory,  
     32base facility, archive center, and data access centers.  These subsystems 
     33are data-connected via a short-haul network between summit and base, 
     34and a long-haul network from base to archive center and data access centers.   
     35DMS subsystems include facilities, staff, hardware, software, database, and  
     36data.  Furthermore, it is assumed that acquisition of the raw data is outside  
     37of the purview of the DMS, but, once the raw data is acquired, it falls into the  
     38DMS domain. 
     39 
     40Any loss of raw images and their metadata is absolutely not allowed under
     41project requirements and, hence, any such loss discussed below in the context  
     42of use cases is really only temporary loss.  The data backup plan must therefore 
     43include highly reliable storage media, aggressive checksum validation, file  
     44storage cross-validation with database records, and geographically distributed  
     45redundant copies of the data.  Redundant copies of the data should also be  
     46validated.  Both the raw-image-data files and raw-image-metadata database  
     47tables must be backed up according to the plan. 
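
The checksum validation and file/database cross-validation described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the `records` dictionary is a hypothetical stand-in for the raw-image-metadata database table, SHA-256 is an assumed choice of checksum, and each entry in `copies` is assumed to be a directory holding one redundant copy of the data.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cross_validate(records, copies):
    """Compare database records against every stored copy of the data.

    records: dict mapping file name -> expected checksum (hypothetical
             stand-in for the raw-image-metadata database table).
    copies:  list of directories, each holding one copy of the data.
    Returns a list of (directory, file name, problem) tuples.
    """
    problems = []
    for directory in copies:
        directory = Path(directory)
        for name, expected in records.items():
            candidate = directory / name
            if not candidate.is_file():
                problems.append((str(directory), name, "missing"))
            elif sha256_of(candidate) != expected:
                problems.append((str(directory), name, "checksum mismatch"))
        # A file on disk with no database record is also reported.
        for found in directory.iterdir():
            if found.is_file() and found.name not in records:
                problems.append((str(directory), found.name, "no database record"))
    return problems
```

An empty result means every copy agrees with the records; any other result identifies which copy needs to be restored from a good one.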
     48 
     49Note that the terminology "SDQA fault" used below basically refers to mistakes made by the  
     50SDQA subsystem in 1) missing problems with the data, and 2) falsely identifying  
     51problems with the data.  Since no detection algorithm is perfect, it is expected that  
     52neither SDQA-detection completeness nor reliability will be 100%.  Nevertheless,  
     53the SDQA system will be tuned to achieve the best possible compromise between  
     54completeness and reliability for the LSST DMS overall. 
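
To make the two SDQA quantities concrete, the sketch below computes them from counts of correct and incorrect detections. It assumes the usual definitions, which this document does not spell out: completeness is the fraction of real data problems that SDQA detects, and reliability is the fraction of SDQA detections that are real problems. The function and argument names are illustrative only.

```python
def completeness_and_reliability(true_pos, false_neg, false_pos):
    """Return (completeness, reliability) as fractions.

    true_pos:  real problems that SDQA flagged
    false_neg: real problems that SDQA missed
    false_pos: problems that SDQA flagged but that were not real
    """
    all_real = true_pos + false_neg
    all_flagged = true_pos + false_pos
    # Guard against empty samples rather than dividing by zero.
    completeness = true_pos / all_real if all_real else float("nan")
    reliability = true_pos / all_flagged if all_flagged else float("nan")
    return completeness, reliability


# 90 real problems caught, 10 missed, 30 false alarms.
print(completeness_and_reliability(90, 10, 30))  # (0.9, 0.75)
```

Loosening a detection threshold typically raises completeness at the expense of reliability, which is the compromise that tuning must settle.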
     55 
     56Finally, any generic software fault could also apply to the SDQA software, and to 
     57fault-tolerance-related software, such as watchdog monitors, etc. 
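
As an illustration of the kind of watchdog monitor mentioned above, here is a minimal heartbeat-based sketch. The class, its method names, and the timeout policy are assumptions made for this example, not a description of any actual LSST DMS component. The clock is injectable so that the monitor itself can be tested, since it is subject to the same software faults as the code it watches.

```python
import time


class Watchdog:
    """Track heartbeats from named components and report stale ones."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable for testing
        self.last_seen = {}

    def heartbeat(self, name):
        """Record that a component reported in just now."""
        self.last_seen[name] = self.clock()

    def stale(self):
        """Return sorted names of components silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A supervising process would call `stale()` periodically and raise an operator alert or restart the listed components.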
     58 
     59== Prioritization of Faults == 
     60 
     61We classify LSST DMS faults in terms of their priority and, to cover all 
     62cases, designate three levels of priority.   
     63 
     64Priority-1 faults require the highest level of attention and resources, and are  
     65those that:  
     66 
     67     1. Delay transmission of the raw images and their metadata from the summit to the base facility 
     68     2. Prevent reliable storage of raw images and their metadata 
     69     3. Result in only temporary unavailability of raw images or their metadata, rather than complete loss.   
     70 
     71Indeed, the data will be irrecoverably lost unless the data backup plan is comprehensive, reliable and 
     72bullet-proof.   
     73 
     74Priority-2 faults require lower levels of attention and resources than priority-1 
     75faults, and are those associated with  
     76 
     77     1. Inability to meet the 60-s requirement for nightly alert generation 
     78     2. Loss of science catalog data  
     79 
     80One rationale for the content of the priority-2 level is that the nightly alerts and  
     81science catalogs, derived from the raw data, are the primary processed data  
     82products of the LSST DMS.  These derived data products can be recomputed from
     83the raw data, but only at some dollar cost, and recomputed alerts would fail to meet the 60-second
     84requirement.  A robust plan for minimizing possible faults that hinder meeting
     85the time constraint (for item 1) and reliably replacing lost science catalog data  
     86from redundant data backups (for item 2) is of paramount importance. 
     87 
     88Priority-3 faults require still lower levels of attention/resources than priority-1  
     89and priority-2 faults, and include things that can go wrong during the data  
     90release processing, which can lead to loss of processed images, their metadata,  
     91science catalog data, and other database metadata (until the associated raw data  
     92are reprocessed), especially in recent processing history, and temporary reduction  
     93in data-processing throughput. 
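
The three priority levels above can be captured in a small lookup structure. The sketch below is illustrative only: the effect labels are hypothetical names invented for this example, and the rule that a fault with several effects takes its most urgent priority is an assumption rather than something stated in this document.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Lower value means more attention and resources."""
    RAW_DATA = 1
    ALERTS_AND_CATALOGS = 2
    DATA_RELEASE = 3


# Hypothetical effect labels mapped to the priority levels defined above.
EFFECT_PRIORITY = {
    "raw_transmission_delayed": Priority.RAW_DATA,
    "raw_storage_unreliable": Priority.RAW_DATA,
    "raw_data_unavailable": Priority.RAW_DATA,
    "alert_deadline_missed": Priority.ALERTS_AND_CATALOGS,
    "science_catalog_lost": Priority.ALERTS_AND_CATALOGS,
    "processed_data_lost": Priority.DATA_RELEASE,
    "throughput_reduced": Priority.DATA_RELEASE,
}


def classify(effects):
    """Return the most urgent priority among a fault's effects (assumed rule)."""
    return min(EFFECT_PRIORITY[effect] for effect in effects)
```

For example, `classify(["throughput_reduced", "science_catalog_lost"])` yields `Priority.ALERTS_AND_CATALOGS`.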
     94 
     95 
     96 
     97== Use Cases for Priority-1 Faults ==      
     98 
     99The following are use cases that cover loss of raw image data and metadata. 
     100 
     101     1. Faults in temporary summit storage 
     102         * Raw image data are lost 
     103         * Raw image metadata are lost  
     104         * Raw image data/metadata associations are lost 
     105     2. Faults in temporary base storage 
     106         * Raw image data are lost 
     107         * Raw image metadata are lost  
     108         * Raw image data/metadata associations are lost 
     109     3. Faults in primary archive storage  
     110         * Raw image data are lost 
     111         * Raw image metadata are lost  
     112         * Raw image data/metadata associations are lost 
     113     4. Faults in redundant archive storage  
     114         * Raw image data are lost 
     115         * Raw image metadata are lost  
     116         * Raw image data/metadata associations are lost 
     117     5. Uncorrected errors in TCP network data transfer 
     118 
     119Metadata about the raw-image data include, but are not limited to, all copies of  
     120database records indicating where the primary and redundant copies are stored. 
     121 
     122Data loss includes data corruption, which effectively renders the data useless. 
     123 
     124Data corruption includes unrecoverable errors found by disk ECC 
     125and silent data corruption (not detected by disk ECC). 
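
Because the 16-bit TCP checksum cannot catch every transmission error, one common defense against uncorrected transfer errors is an end-to-end digest computed by the sender and checked by the receiver. The sketch below shows the idea; the framing layout (digest followed by payload) and the choice of SHA-256 are assumptions for illustration, not a specified LSST DMS protocol.

```python
import hashlib

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest before sending."""
    return hashlib.sha256(payload).digest() + payload


def unframe(message: bytes) -> bytes:
    """Verify the digest on receipt and return the payload.

    Raises ValueError if the message is truncated or corrupted, in which
    case the receiver should request retransmission.
    """
    if len(message) < DIGEST_SIZE:
        raise ValueError("message shorter than digest")
    digest, payload = message[:DIGEST_SIZE], message[DIGEST_SIZE:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("payload digest mismatch")
    return payload
```

The same digest can be stored alongside the file so that silent corruption on disk is detected later by recomputing and comparing it.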
     126 
     127== Use Cases for Priority-2 Faults == 
     128 
     129The following are use cases that cover faults that prevent nightly alert  
     130generation within 60 seconds and loss of science catalog data.
     131 
     132     1.   Nightly alerts are not generated because of facility fault 
     133     2.   Nightly alerts are not generated because of human fault 
     134     3.   Nightly alerts are not generated because of hardware fault 
     135     4.   Nightly alerts are not generated because of resource fault 
     136     5.   Nightly alerts are not generated because of software fault 
     137     6.   Nightly alerts are not generated because of database fault 
     138     7.   Nightly alerts are not generated because of data fault 
     139     8.   Nightly alerts are not generated because of SDQA fault 
     140     9.   Sources and/or objects are misidentified or inaccurate because of SDQA fault 
     141     10. Sources and/or objects database records are lost from primary storage 
     142     11. Sources and/or objects database records are lost from redundant storage 
     143 
     144Possible facility, human, hardware, resource, software, database and data faults  
     145are detailed separately below.  In some cases, the specific fault leading to  
     146processing failure can be classified in multiple categories.  Note that database 
     147faults are put in a separate category because of their special nature and the  
     148specialization required to address them. 
     149 
     150== Use Cases for Priority-3 Faults == 
     151 
     152The following are use cases that cover loss of processed image data, especially  
     153in recent processing history. 
     154 
     155     1. Data release processing fails because of facility fault 
     156     2. Data release processing fails because of human fault 
     157     3. Data release processing fails because of hardware fault 
     158     4. Data release processing fails because of resource fault
     159     5. Data release processing fails because of software fault 
     160     6. Data release processing fails because of database fault 
     161     7. Data release processing fails because of data fault 
     162     8. Data release processing fails because of SDQA fault 
     163     9. Sources and/or objects are misidentified or inaccurate because of SDQA fault
     164     10. Sources and/or objects database records are lost from primary storage 
     165     11. Sources and/or objects database records are lost from redundant storage 
     166 
     167Possible facility, human, hardware, resource, software, database and data faults  
     168are detailed separately below.  In some cases, the specific fault leading to  
     169processing failure can be classified in multiple categories.  Note that database 
     170faults are put in a separate category because of their special nature and the  
     171specialization required to address them. 
     172 
     173== Underlying Causes of Faults == 
     174 
     175=== Facility Faults === 
     176 
     177 * Natural disaster (fire, earthquake, flood, tornado, etc.) 
     178 * Man-made catastrophe (radioactive contamination, airline crash, poisonous gas, etc.)
     179 * Act of war (attack, siege, sabotage, etc.)
     180 * Security 
     181     * Computer firewall breach (hacker, virus, etc.) 
     182     * Unauthorized computer-room access 
     183 * System resets (e.g., checksum mismatches correlate with this) 
     184 * Air-conditioning malfunction 
     185 * Electrical fuse blown 
     186 
     187=== Human Faults === 
     188 
     189 * Staff problems (malicious intent, negligence, retention/turnover, labor strike, slowdown or sick-out, etc.)
     190 * Pipeline-operator procedural error
     191 * Specialist unavailability (e.g., DBA or MySQL expert during crisis)
     192 * Slow turnaround in fixing software bugs and delivering the fixes
     193 
     194=== Hardware Faults === 
     195 
     196 * Summit-base fiber link/interfaces failure ("short haul") 
     197 * Global data-transfer link/interfaces failure ("long haul")
     198 * CPU failure 
     199 * RAM failure 
     200 * Local disk failure 
     201 * Power supply failure 
     202 * Network switch failure 
     203 * Network disk problems 
     204     * Catastrophic failure (disk media, disk controller, etc.) 
     205     * Corrupted data
     206         * Latent sector errors caught by disk ECC 
     207         * Silent corruption (checksum mismatches) 
     208 * Hardware upgrade not compatible with software (portability issues, backward incompatibility, etc.)
     209 * Unsuccessful machine reboot 
     210 
     211=== Resource Faults === 
     212 
     213 * Power failure (blackout, brownout, etc.)
     214 * Insufficient disk space 
     215 * Disk performance degradation (can occur for disks > 90% full, fragmentation, etc.) 
     216 * Disk thrashing caused by insufficient memory
     217 * Disk/network speed mismatch (bandwidth, maximum number of reads/writes per second, etc.) 
     218 * Database resource faults 
     219    * Bandwidth limitations caused by resource over-allocation 
     220    * Too many database connections 
     221    * Performance degradation caused by  
     222       * Large tables filling up 
     223       * Too many queries running 
     224       * Large queries running 
     225       * Usage statistics not updated 
     226       * Insufficient table-space allocation 
     227       * Progressive index computation slowdown 
     228       * Transaction logging disk space filling up 
     229       * Transaction rollback taking too long 
     230       * Miscellaneous mistunings 
     231 * Insufficient disk-space allocation 
     232 * Network bandwidth limitation (sustained or peak specifications exceeded) 
     233 * Memory segment fault (stack size exceeded, insufficient heap allocation, misassignment of large-memory process to small-memory machine, etc.) 
     234 * OS limits exceeded (queue length for file locking, number of open files per process, etc.) 
     235 * Bottleneck migration (e.g., increase in processor throughput hammers database harder) 
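
Several of the resource faults listed above can be caught early by routine monitoring. The sketch below checks disk fill against the 90% level mentioned in the list, using `shutil.disk_usage` from the Python standard library. The function names and the warning format are assumptions for illustration; the measurement function is passed in as a parameter so the check can be exercised without a nearly full disk.

```python
import shutil


def disk_fill_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_disk(path, warn_fraction=0.90, usage_fn=disk_fill_fraction):
    """Return a warning string if the disk is fuller than warn_fraction, else None."""
    fraction = usage_fn(path)
    if fraction > warn_fraction:
        return f"WARN {path}: {fraction:.0%} full"
    return None
```

A cron job or monitoring daemon could run `check_disk` on each data volume and forward any warning to the pipeline operator.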
     236 
     237=== Software Faults === 
     238 
     239 * Software inadequacies and bugs flushed out by data-dependent processing 
     240 * Incorrect software version installed 
     241 * Incompatibility with operating system software
     242 * OS, library, database software, or third-party-software upgrade problem  
     243 * Cron job, client, or daemon inadvertently stopped
     244 * Environment misconfiguration or loss (binary executable or third-party software not in path, dynamic library not found, etc.) 
     245 * Processing failures due to algorithmic faults 
     246   * Division by zero 
     247   * No convergence of iterative algorithm 
     248   * Insufficient input data 
     249 * Processing failures related to files 
     250   * Can't open file 
     251   * File not found 
     252 * Processing failures related to sockets 
     253   * Port number not available 
     254   * Socket connection broken 
     255 * Processing failures related to database (also see section on database faults below) 
     256   * Can't connect to database 
     257   * Missing stored function 
     258 * Faults associated with user-contributed software 
     259 * Problems with user retrieving data from archive 
     260 * Problems reverting to previous build (incomplete provenance of software and builds) 
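
The algorithmic faults listed above (division by zero, no convergence of an iterative algorithm, insufficient input data) are easier to handle when the code reports each one explicitly. The following is a minimal sketch using an iterative sigma-clipped mean as the example algorithm; the algorithm choice, the exception names, and the default parameters are all assumptions made for illustration.

```python
import statistics


class ProcessingFault(Exception):
    """Base class for the hypothetical algorithmic faults below."""


class InsufficientData(ProcessingFault):
    pass


class NoConvergence(ProcessingFault):
    pass


def clipped_mean(values, n_sigma=3.0, min_points=3, max_iter=10):
    """Iteratively discard outliers beyond n_sigma and return the mean."""
    data = list(values)
    for _ in range(max_iter):
        if len(data) < min_points:
            raise InsufficientData(f"only {len(data)} points left")
        mean = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        if sigma == 0.0:
            # All values identical: return now to avoid dividing by zero.
            return mean
        kept = [v for v in data if abs(v - mean) / sigma <= n_sigma]
        if len(kept) == len(data):
            return mean  # nothing clipped, so the iteration has converged
        data = kept
    raise NoConvergence(f"no convergence after {max_iter} iterations")
```

A pipeline stage can catch `ProcessingFault` to log the specific cause and decide whether to skip the data, retry with other parameters, or stop.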
     261 
     262=== Database Faults === 
     263 
     264 * Database server goes down 
     265 * Database client software incompatible with database server software 
     266 * Bugs in upgraded versions of database server software 
     267 * Can't connect to database 
     268 * Can't set database role 
     269 * Can't execute query 
     270 * Can't execute stored function 
     271 * Missing stored function 
     272 * Queries take too long 
     273 * Table locking 
     274 * Transaction rollback error 
     275 * Transaction logging out of disk space 
     276 * Record(s) missing 
     277 * More than one record unexpectedly returned 
     278 * Inserting record with primary key violation or missing foreign key 
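
Two of the database faults listed above, missing records and more than one record unexpectedly returned, can be detected at the point of query. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in database; the `RecordFault` exception and the `fetch_exactly_one` helper are hypothetical names for this example.

```python
import sqlite3


class RecordFault(Exception):
    """Raised when a query does not return exactly one record."""


def fetch_exactly_one(conn, query, params=()):
    """Run a query that is expected to match exactly one record."""
    # Fetch at most two rows: enough to tell zero, one, and many apart.
    rows = conn.execute(query, params).fetchmany(2)
    if not rows:
        raise RecordFault("record missing")
    if len(rows) > 1:
        raise RecordFault("more than one record returned")
    return rows[0]
```

With `sqlite3`, an insert that violates a primary key raises `sqlite3.IntegrityError`, which calling code can catch and report in the same way.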
     279 
     280=== Data Faults === 
     281 
     282 * Uncorrected errors in TCP communications 
     283 * Missing or bad input data 
     284      * Bad images (missing, noisy data, or instrument-artifact-contaminated pixels; not enough good sources for sufficient astrometric and/or photometric calibration; etc.)
     285      * Missing/unavailable database data (e.g., PM and operations activities not synchronized)
     286      * Bad or wrong calibration data used in processing 
     287      * Unavailability of calibration images (missing observations, calibration-pipeline error, etc.) 
     288          * Use lower quality fallback calibration data (affects SDQA) 
     289          * Missing fallback calibration data 
     290      * Unavailability of configuration or policy data files 
     291 * Failure to flag dead, dying, or hot pixel-detectors in data mask 
     292 * Publicly released data are found to have problems after release
     293 
     294=== SDQA Faults === 
     295 
     296 * Incorrect or mistuned QA-metric threshold setting(s) for automatic SDQA 
     297 * Failure to do sufficient manual SDQA on a particular data set 
     298