Changes between Initial Version and Version 1 of FaultToleranceWorkshopRussLahersNotes

07/18/2008 02:06:22 PM
= Russ Laher's Notes from the LSST Fault Tolerance Workshop =

== Day 1: July 15, 2008 ==

FT = fault tolerance

There is no interface document for ingesting raw images.
 * scheduled vs. received raw image
 * checksum (MD5)
 * uncorrected errors in TCP communications are common
 * multiple copies of raw data
 * backup copy of raw data off-site
 * procedure for buying backup storage to put on the mountain
 * N days of backup storage on the mountain
 * higher-reliability hardware for raw-data storage
 * plan for replacing disk hardware and migrating data
 * scrubbing every year or six months (full-blown and statistical); track that this has been done for specific images; track the error rate; dedicated scrubbing vs. incorporating the scrubbing into the pipeline processing
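The MD5 idea above can be sketched in a few lines. This is a minimal illustration, not project code; the function names are mine:

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file in streaming fashion,
    so an arbitrarily large raw image fits in constant memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_copy(path, expected_md5):
    """True if a stored copy still matches the checksum recorded at
    ingest time; a mismatch flags silent corruption for re-transfer."""
    return md5_of_file(path) == expected_md5
```

The same routine serves both uses in the list: catching uncorrected TCP-transfer errors at ingest, and periodic scrubbing of copies already on disk.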
Define scope: infrastructure vs. middleware vs. applications.

The interface should be requirements for reliability. Off-the-shelf vs. highly reliable hardware. Requirements should specify frequency, duration, and extent -- not just a percentage.

Should shoot for high reliability to maximize science return and cost savings.
Need requirements for:
 * Servers
 * Pipeline machines
 * Disk storage
 * Network
Scheduled maintenance downtime. Won't be taking data 100% of the nights. Issue of when to do the scheduled maintenance.

Unscheduled downtime. Strive for low levels of this.

Algorithmic faults.

Worst-case catch-all for unclassified faults.

User-contributed software. Run on separate machines and sandbox with firewall isolation. Can't impact operations processing. Another option is a VM with virtual

Need a system to mark data that is found bad later, after it has been distributed to the public.
Guiding principles for fault tolerance:
 1. Reducing human intervention for common problems -- automated
 2. Traceability
 3. Real-time processing vs. reprocessing
 4. Address/fix problems so that failures are correlated

Detecting classes of faults:
 1. Resource failures fail or succeed when the pipeline run is repeated (e.g., network load).

Database transactions.
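Resource failures that succeed on repetition suggest a bounded-retry policy. A minimal sketch (the stage callable and the choice of retryable exceptions are illustrative assumptions):

```python
import time

def run_with_retries(stage, max_attempts=3, delay_s=0.0, transient=(OSError,)):
    """Re-run a pipeline stage that may fail transiently (e.g. under
    network load).  Only exception types listed in `transient` are
    retried; anything else is a real fault and propagates at once."""
    for attempt in range(1, max_attempts + 1):
        try:
            return stage()
        except transient:
            if attempt == max_attempts:
                raise  # exhausted: escalate to the FT layer above
            time.sleep(delay_s)
```

Distinguishing retryable from non-retryable exceptions is the key design choice; retrying an algorithmic fault would just repeat it.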
Issues surrounding faults:
 * Location
 * Repeatability
 * State

Failure can be in terms of getting done, but not getting done within the required time period.

Failures can be
 1. Loud, noisy, or run-away
 2. Silent
 3. Corrupting

Should we look at correlation of failures?

Faults of the FT system itself!
What is the role of SDQA results in overall fault tolerance, and what is the impact on middleware design?

High-level requirements related to FT. How should we interpret these requirements?
 * SDR
   * Very little, except for 60-s alerts and OTT1.
 * FDR
   * Storage: 0% data loss (raw data, metadata); 98% availability.
   * Communications: 0.1% alert-publication failure; 98% availability.
   * p. 21 TBD has FT implications.
   * p. 40 - There are pipeline requirements relevant to FT.
   * Lots of requirements about data release - p. 14.
   * Software licenses have to be kept (p. 19 is an impossible requirement).
   * p. 59-61 - Reliability requirements.
   * Can't finish four hours after the night's observations end, because the nightly pipelines have to be executed on all images first.
   * Scheduling of observations has to be fed back to, for example, calibration-pipeline execution and production of calibration images.
K.-T. proposed two documents as outcomes from this workshop:
 1. Overall - hardware and SDQA components
 2. Middleware-specific

What do we mean by FT?

What are the criteria for failures that we want to address?

Criteria such as something that causes data products to not be available to the public.

Strategies for meeting goals
FT methodology (different philosophical approaches):
 * master-driven system
 * peer-to-peer, independent fault checkers

Instead of one local disk per multi-core CPU (box), have a SAN clustered to, say, three CPUs.

Intrinsic failure rate for image processing (or portions of an image).

Hardware redundancy to reprocess an image segment (amplifier).

Reprocessing since the last checkpoint.

At what granularity is it practical to drop/lose a portion of the processed image (amplifier or smaller)?

Triply redundant processing done in a Monte Carlo fashion (or just one amplifier, the same amplifier).

Rendezvous of data (FWHM of PSF overlapping adjacent CCDs, ghost images, ...).

Understand the consequences of failure.
== Day 2: July 16, 2008 ==

Sample pipeline exception -- check that data is accessible to the pipeline:
 * Possibilities
   * template images must be at base and cached on disk
   * database data must be cached
   * calibration images must be available
   * policy files must be available
 * Detection strategy
   * Test-run the pipeline prior to commencement of processing
   * Check for file existence and retrieve from an alternate location, if necessary
   * Check whether the database query ran successfully
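The existence-check-with-fallback strategy above can be sketched as follows; the directory names and the copy-from-alternate behavior are illustrative assumptions, not a defined interface:

```python
import os
import shutil

def ensure_local(filename, cache_dir, alternates):
    """Pre-flight check for one required input (template image,
    calibration frame, policy file): confirm it is in the local cache,
    otherwise fetch it from the first alternate location that has it."""
    local = os.path.join(cache_dir, filename)
    if os.path.exists(local):
        return local
    for alt in alternates:
        candidate = os.path.join(alt, filename)
        if os.path.exists(candidate):
            shutil.copy(candidate, local)  # stage into the cache
            return local
    # Nothing found anywhere: fail loudly before processing starts.
    raise FileNotFoundError(f"{filename} not in cache or alternates")
```

Running such checks for every input before commencement of processing turns a mid-pipeline silent failure into an up-front, loud one.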
OCS says it is pointing one place, but it is really pointing somewhere else.

Possible fault unique to LSST, in which image data are not stored as FITS files: a mismatch between image and image metadata.

Mountain catalog storage strategies to maximize utility of available disk storage and give some fault tolerance:
 * store a small portion of the catalog of bright sources
 * store two copies of either the summer or winter sky

Use cases for SDQA:
 * Image metadata is missing, garbage, or inconsistent with image data
 * WCS may fail (may have limitations or need bootstrapping)

Common practices:
 * Watchdogs deployed on separate machines
 * Redundancy (hardware, database server, database replication)
Processor failure

Disk failure
  * Can't open/close file
  * RAID monitoring and continuous scrubbing (block-level checksumming)
  * Query/monitor ECC bad-block activity increases (limited value)
  * Silent corruption detected by pipeline-external checksum verification
  * Multi-level checksum verification of file data and memory data
Database problems
  * record(s) missing
  * more than one record unexpectedly returned
  * too many database connections
  * can't connect to database (permission problem, server down)
  * can't set database role
  * can't execute query (role missing grant)
  * table locking
  * queries take too long (database tuning or statistics need updating)
  * server down
  * inserting record with primary-key violation
  * not enough disk space allocated for large table (inefficiency)
  * transaction logging out of disk space
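Several of the faults in the list (primary-key violation, failed query) are naturally contained by explicit transactions, so a failed insert rolls back instead of leaving a half-updated table. A minimal sketch using sqlite3 as a stand-in for the real database (table and column names are invented):

```python
import sqlite3

def insert_record(conn, rec_id, value):
    """Insert inside a transaction.  On a primary-key violation the
    transaction is rolled back and we report failure rather than
    crashing the pipeline stage."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO results (id, value) VALUES (?, ?)",
                (rec_id, value))
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate primary key: log and move on
```

The point is the pattern, not the engine: every database touch should either complete or leave no trace, so a retry or a human can pick up from a consistent state.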
Corruption of communication between nodes
General FT

Testing and comparison RBT
Verification/testing (watchdogs)
Duplicating things in space and time (eliminate single points of failure)
Mechanisms for detecting failure
Detection vs. response mechanisms
Redundant execution of processes
Limit overwrites
Separate mutable vs. non-mutable data

Prevention of failure

Response to failure

Reconfigure system on the fly
CMSD cluster technology, separate from hardware, for communication, with replicatable master server (Anthony).

Double the capacity without checkpointing, or only an additional ~10% with checkpointing, is needed to meet the 60-s alert requirement (zero failures). Redo the affected CCD, not just the amplifier. Need extra boxes for a small number of failures a minute late.

High-speed SAN
Action item:

Spreadsheet the nightly-pipeline data volume and rate through a core. Need to size the required throughput to meet the 30 s. There will be an additional 30 s budgeted for source association, alert generation, and transfer down the mountain.

2 x 11.5 GB / 30 s = 767 MB/s   (internal memory bandwidth is not an issue)

reading AND writing
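The quoted rate follows directly from the figures in the note (the factor of 2 is the reading-AND-writing point):

```python
# Back-of-the-envelope check of the 767 MB/s figure: one 11.5 GB
# focal-plane readout must be both read and written within the 30-s
# half of the 60-s alert budget.
data_gb = 2 * 11.5            # GB moved per visit (read + write)
window_s = 30                 # seconds available
throughput_mb_s = data_gb * 1000 / window_s
print(round(throughput_mb_s))  # -> 767
```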
Define classes of failures, redundancy, hot spares, check-pointing, impact on the system.

What specifically needs to be monitored?

Maintenance throughout the mission:
   disk defragmentation
   disk replacement
   add transaction-log space
   add file-storage space
   database tuning
   database data verification
   database indexing

Engineering automated maintenance

Human monitoring component

Enumerate specifically every fault that needs to be handled.

Requirements document (or section in planning document)

Use-cases document (or section in planning document)

Number of personnel needed for LSST operations
Four major areas of LSST fault tolerance:
 1. Middleware
 2. Database
 3. Hardware
 4. Facility

SDQA FT is out of band (not defined to generate "exceptions" in the sense of this workshop, but, rather, "alerts").

Application-software exceptions cannot be handled automatically -- there must be human intervention to fix the problem. If something can be done automatically to fix the problem, the fix will be algorithmic and should be handled within the application layer (either in C++ code or Python script).

Applications developers must follow robust coding practices. We have to deal with software exceptions from the applications layer. Code-checker software. CCB policing. Coding guidelines. Regression testing.

Specific application exceptions can be subclassed from the middleware base class for catch-all application exceptions.
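The subclassing scheme above looks roughly like this in Python; the class and function names are hypothetical, since the actual middleware base class is not named in these notes:

```python
class MiddlewareError(Exception):
    """Hypothetical middleware catch-all that the harness
    knows how to log and report."""

class ApplicationError(MiddlewareError):
    """Catch-all for application-layer faults."""

class CalibrationMissingError(ApplicationError):
    """A specific application exception subclassed from the catch-all."""

def harness(stage):
    """One handler catches every subclass, however specific, so new
    application exceptions need no new middleware code."""
    try:
        stage()
    except MiddlewareError as exc:
        return f"logged: {type(exc).__name__}"
    return "ok"
```

This gives the middleware a single choke point for logging and response while still letting applications raise precise, self-describing errors.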
Detecting and validating dependencies of pipelines on specific

Some middleware exception handling relates to I/O:

File systems

Memory allocations
Database access

A middleware API for getting calibration files is needed.

Store subversion revision numbers of third-party software tarballs.

Application software does no I/O. Its input data are only read from the clipboard, and its output data are only written to the clipboard. The clipboard just holds pointers to objects.
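A minimal sketch of the clipboard idea, assuming only what the paragraph states (stages read and write object references, never files); the class and stage names are illustrative:

```python
class Clipboard:
    """Holds references to in-memory objects; all actual I/O is the
    harness's responsibility, never the application stage's."""
    def __init__(self):
        self._items = {}

    def put(self, key, obj):
        self._items[key] = obj   # stores a reference, not a copy

    def get(self, key):
        return self._items[key]

def bias_subtract_stage(clipboard):
    """An application stage: pure clipboard-in, clipboard-out."""
    image = clipboard.get("raw_image")
    bias = clipboard.get("bias_level")
    clipboard.put("corrected_image", [pixel - bias for pixel in image])
```

Because stages touch only the clipboard, the middleware can checkpoint, replay, or relocate a stage without the stage knowing where its data physically live.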
Variance in processing time for data-dependent data reduction.
Three products from this workshop:
 1. Near-term summary
 2. PDR presentation
 3. Operations plan (beyond PDR)

Design plan, but not a development/implementation plan.

Use DC3 to evaluate the feasibility of check-pointing?

Need to cost out clusters with SANs (Storage Area Network: a high-speed, special-purpose network that connects to storage devices).

Estimate how often a box will fail -- use industry data.
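The failure-rate estimate reduces to simple arithmetic once an MTBF is assumed; the numbers below (cluster size, 5-year MTBF) are placeholders, not industry data:

```python
# Rough sizing sketch: expected box failures per year for a cluster,
# used to size the pool of hot spares.  Both inputs are assumptions.
def expected_failures_per_year(n_boxes, mtbf_years):
    return n_boxes / mtbf_years

print(expected_failures_per_year(900, 5.0))  # 900 boxes, 5-yr MTBF -> 180.0
```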
Hardware includes rack power supplies, and can include rack-isolated cooling systems, switch, line card, disk storage, box (multi-core CPU, CPU cache, RAM, local disk).