= Russ Laher's Notes from the LSST Fault Tolerance Workshop =

== Day 1: July 15, 2008 ==

FT = fault tolerance

--------------------------------------------
There is no interface document for ingesting raw images.
  * scheduled vs. received raw image
  * checksum (MD5) -- see the verification sketch below
  * uncorrected errors in TCP communications are common
  * multiple copies of raw data
  * backup copy of raw data off-site
  * procedure for buying backup storage to put on the mountain
  * N days of backup storage on the mountain
  * higher-reliability hardware for raw-data storage
  * plan for replacing disk hardware and migrating data
  * scrubbing every year or six months (full-blown and statistical); track that this
    has been done for specific images; track the error rate; dedicated scrubbing vs.
    incorporating the scrubbing into the pipeline processing
--------------------------------------------
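
A minimal sketch of the MD5 verification mentioned in the list above, using Python's standard hashlib; the function names and the idea of comparing against a checksum recorded at the summit are illustrative assumptions, not an existing LSST interface.

{{{#!python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_raw_image(path, expected_md5):
    """Return True if the received file matches the checksum sent with it."""
    return md5_of_file(path) == expected_md5
}}}

The same routine could serve periodic scrubbing: recompute digests for a statistical sample of stored images and compare them with the values recorded at ingest.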

Define scope: Infrastructure vs. Middleware vs. Applications

Interfaces should specify requirements for reliability.  Off-the-shelf vs. highly reliable
hardware.  Requirements should specify frequency, duration, and extent -- not just percentage
availability.

Should shoot for high reliability to maximize science return and cost savings.

Need requirements for

 * Servers
 * Pipeline machines
 * Disk storage
 * Network

Scheduled maintenance downtime.  Won't be taking data 100% of the nights.  Issue
of when to do the scheduled maintenance.

Unscheduled downtime.  Strive for low levels of this.

Algorithmic faults.

Worst-case catch-all for unclassified faults.


User-contributed software.  Run on separate machines and sandbox with firewall
isolation.  Can't impact operations processing.  Another option is a VM with virtual
walls.


Need a system to mark data that is found to be bad after it has been distributed
to the public.


Guiding principles for fault tolerance:

 1. Reducing human intervention for common problems -- automated
 2. Traceability
 3. Real-time processing vs. reprocessing
 4. Address/fix problems so that failures are correlated

Detecting classes of faults:
 1. Resource failures may fail or succeed when the pipeline run is repeated (e.g., network load).

Database transactions.

Issues surrounding faults:
 * Location
 * Repeatability
 * State

A failure can also mean the work gets done, but not within the required time period.

Failures can be
 1. Loud, noisy, or run-away
 2. Silent
 3. Corrupting


Should we look at correlation of failures?


Faults of the FT system itself!!!

--------------------------------------------
What is the role of SDQA results in overall fault tolerance, and
what is the impact on middleware design?
--------------------------------------------

High-level requirements related to FT.  How should we interpret these requirements?

 * SDR
   * Very little, except for 60-s alerts and OTT1.

 * FDR
   * Storage: 0% data loss (raw data, metadata); 98% availability
   * Communications: 0.1% alert-publication failure; 98% availability
   * p. 21 TBD has FT implications
   * p. 40 - There are pipeline requirements relevant to FT.
   * Lots of requirements about data release - p. 14
   * Software licenses have to be kept (p. 19 is an impossible requirement).
   * p. 59-61 - Reliability requirements
   * Can't finish four hours after the night's observations end, because the nightly pipelines have to be executed on all images first.
   * Scheduling of observations has to be fed back to, for example, calibration-pipeline execution and production of calibration images.


K.-T. proposed two documents as outcomes from this workshop:

 1. Overall - hardware and SDQA components
 2. Middleware-specific

What do we mean by FT?

What are the criteria for failures that we want to address?

Criteria such as: something that causes data products to not be available to the public.


Strategies for meeting goals
Plans
Recommendations


FT methodology (different philosophical approaches):

 * master-driven system
 * peer-to-peer, independent fault checkers


Instead of one local disk per multi-core CPU (box), have a SAN clustered to, say,
three CPUs.

Intrinsic failure rate for image processing (or portions of an image).

Hardware redundancy to reprocess an image segment (amplifier).

Reprocessing since the last checkpoint.
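
A minimal sketch of checkpoint/restart at amplifier granularity, assuming per-stage state can simply be pickled to disk; the file name and state layout are hypothetical.

{{{#!python
import os
import pickle

CHECKPOINT = "stage.ckpt"   # hypothetical per-stage checkpoint file

def save_checkpoint(state):
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"next_amplifier": 0}        # no checkpoint yet: start from scratch

def run_stage(amplifiers, process):
    state = load_checkpoint()
    for i in range(state["next_amplifier"], len(amplifiers)):
        process(amplifiers[i])
        state["next_amplifier"] = i + 1
        save_checkpoint(state)          # on a crash, only work past here is redone
}}}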

At what granularity is it practical to drop/lose a portion of the processed image
(amplifier or smaller)?


Triply redundant processing done in a Monte Carlo fashion (or on just one amplifier, the same amplifier).
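
A sketch of what a majority vote over triply redundant runs could look like; how two results are judged equal (bitwise, or within a numerical tolerance) is an open assumption that would have to be defined per data product.

{{{#!python
def vote(results, equal=lambda a, b: a == b):
    """Return a result agreed on by at least two of the runs, else None."""
    for candidate in results:
        agreeing = [r for r in results if equal(r, candidate)]
        if len(agreeing) >= 2:
            return candidate
    return None   # no majority: flag this amplifier for reprocessing

def process_redundantly(amplifier, process):
    """Run the same processing three times and vote on the outcome."""
    return vote([process(amplifier) for _ in range(3)])
}}}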


Rendezvous of data (FWHM of PSF overlapping an adjacent CCD, ghost images, ...)

Understand the consequences of failure


== Day 2: July 16, 2008 ==

Sample pipeline exception -- check that data are accessible to the pipeline

 * Possibilities
   * template images must be at the base and cached on disk
   * database data must be cached
   * calibration images must be available
   * policy files must be available

 * Detection strategy (see the sketch below)
   * Test-run the pipeline prior to commencement of processing
   * Check for file existence and retrieve from an alternate location, if necessary
   * Check whether the database query ran successfully
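
A small sketch of the existence check with fallback to an alternate location; the paths and the staging step (a plain file copy) are illustrative assumptions.

{{{#!python
import os
import shutil

def ensure_available(local_path, alternate_path):
    """Return a usable path to a required input (template, calibration image,
    policy file), fetching it from the alternate location if the local copy
    is missing."""
    if os.path.exists(local_path):
        return local_path
    if os.path.exists(alternate_path):
        shutil.copy(alternate_path, local_path)   # stage a local copy
        return local_path
    raise RuntimeError("required input not available: %s" % local_path)
}}}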

OCS says it is pointing at one place, but it is really pointing somewhere else.

A possible fault unique to LSST, in which image data are not stored as FITS files: a
mismatch between image data and image metadata.

Mountain catalog storage strategies to maximize the utility of available disk storage
and give some fault tolerance:
 * store a small portion of the catalog of bright sources
 * store two copies of either the summer or the winter sky

Use case for SDQA:
 * Image metadata is missing, garbage, or inconsistent with image data (see the sketch below)
 * WCS may fail (may have limitations or need bootstrapping)
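
One inexpensive check SDQA could make is to compare the pixel-array dimensions against the recorded metadata; a rough sketch, with FITS-like metadata keys assumed purely for illustration.

{{{#!python
def metadata_consistent(pixels, metadata):
    """pixels: 2-D array-like; metadata: dict with NAXIS1/NAXIS2-style keys."""
    try:
        nrows, ncols = len(pixels), len(pixels[0])
        return (nrows == int(metadata["NAXIS2"])
                and ncols == int(metadata["NAXIS1"]))
    except (KeyError, ValueError, TypeError, IndexError):
        return False   # missing or garbage metadata counts as inconsistent
}}}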


Common practices:
 * Watchdogs deployed on separate machines (see the sketch below)
 * Redundancy (hardware, database servers, database replication)
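
One possible shape for such a watchdog, assuming pipeline nodes periodically touch heartbeat files on storage visible to a separate monitoring machine; the directory name and staleness threshold are made up for illustration.

{{{#!python
import os
import time

HEARTBEAT_DIR = "/monitor/heartbeats"   # hypothetical shared location
STALE_AFTER = 120                        # seconds without an update

def stale_nodes():
    """Return the nodes whose heartbeat file has gone stale."""
    now = time.time()
    stale = []
    for name in os.listdir(HEARTBEAT_DIR):
        age = now - os.path.getmtime(os.path.join(HEARTBEAT_DIR, name))
        if age > STALE_AFTER:
            stale.append(name)
    return stale
}}}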


Processor failure

Disk failure
  * Can't open/close a file
  * RAID monitoring and continuous scrubbing (block-level checksumming)
  * Query/monitor increases in ECC bad-block activity (limited value)
  * Silent corruption detected by pipeline-external checksum verification
  * Multi-level checksum verification of file data and memory data


Database problems (a defensive-handling sketch follows this list)
  * record(s) missing
  * more than one record unexpectedly returned
  * too many database connections
  * can't connect to the database (permission problem, server down)
  * can't set the database role
  * can't execute a query (role missing grant)
  * table locking
  * queries take too long (database tuning or statistics need updating)
  * server down
  * inserting a record with a primary-key violation
  * not enough disk space allocated for a large table (inefficiency)
  * transaction logging runs out of disk space
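
A sketch of defensive handling for a few of the problems listed above: retry transient connection failures and treat an unexpected row count as an error.  The connect callable, query text, and escalation path are placeholders, not a real LSST API.

{{{#!python
import time

def run_query_with_retry(connect, sql, expected_rows=1, retries=3, delay=5):
    """Run a query, retrying transient failures and checking the row count."""
    for attempt in range(retries):
        try:
            conn = connect()                  # may fail: server down, permissions
            try:
                cursor = conn.cursor()
                cursor.execute(sql)
                rows = cursor.fetchall()
            finally:
                conn.close()
            if len(rows) != expected_rows:    # missing or duplicate records
                raise RuntimeError("expected %d rows, got %d"
                                   % (expected_rows, len(rows)))
            return rows
        except Exception:
            if attempt == retries - 1:
                raise                         # give up; escalate to an operator
            time.sleep(delay)                 # possibly transient: wait and retry
}}}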


Corruption of communication between nodes


General FT

 * Testing and comparison (RBT)
 * Verification/testing (watchdogs)
 * Duplicating things in space and time (eliminate single points of failure)
 * Mechanisms for detecting failure
 * Detection vs. response mechanisms
 * Redundant execution of processes
 * Limit overwrites
 * Separate mutable vs. non-mutable data

Prevention of failure

Response to failure

Reconfigure the system on the fly


CMSD cluster technology, separate from hardware, for communication,
with a replicatable master server (Anthony)


Double the capacity without checkpointing, or a few additional tens of percent with
checkpointing, is needed to meet the 60-s alert requirement (zero failures).
Redo the affected CCD, not just the amplifier.  Extra boxes are needed for the small
number of failures that finish a minute late.

High-speed SAN

--------------------------------------------
Action item:

Spreadsheet the nightly-pipeline data volume and rate through a core.  Need to
size the required throughput to meet the 30 s.  There will be an additional 30 s
budgeted for source association, alert generation, and transfer down the
mountain.

2 x 11.5 GB / 30 s = 767 MB/s, where the factor of 2 accounts for reading AND writing
(internal memory bandwidth is not an issue).
--------------------------------------------
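
The same figure, as a quick back-of-the-envelope check (numbers taken from the action item above):

{{{#!python
data_per_visit_gb = 11.5     # GB to move within each 30-s window
window_s = 30.0
read_and_write = 2           # each byte is both read and written
throughput_mb_s = read_and_write * data_per_visit_gb * 1000 / window_s
print(throughput_mb_s)       # ~767 MB/s
}}}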


Define classes of failures, redundancy, hot spares, check-pointing, impact on the system.

What specifically needs to be monitored?

Maintenance throughout the mission:
 * disk defragmentation
 * disk replacement
 * add transaction-log space
 * add file-storage space
 * database tuning
 * database data verification
 * database indexing

Engineering automated maintenance

Human monitoring component

Enumerate specifically every fault that needs to be handled

Requirements document (or section in planning document)

Use Cases document (or section in planning document)

Number of personnel needed for LSST operations

Four major areas of LSST fault tolerance:
 1. Middleware
 2. Database
 3. Hardware
 4. Facility

SDQA FT is out of band (not defined to generate "exceptions" in the sense of this workshop, but, rather, "alerts").

Application software exceptions cannot be handled automatically -- there must
be human intervention to fix the problem.  If something can be done automatically
to fix the problem, the fix will be algorithmic and should be handled within the
application layer (either in C++ code or a Python script).

Application developers must write robust code.  We have to deal with software
exceptions from the applications layer.  Code-checker software.  CCB policing.
Coding guidelines.  Regression testing.

Specific application exceptions can be subclassed from the middleware base
class for catch-all application exceptions.
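
A minimal sketch of such subclassing in Python; MiddlewareException is a stand-in name, since the actual catch-all base class would be provided by the LSST middleware layer.

{{{#!python
class MiddlewareException(Exception):
    """Stand-in for the middleware's catch-all application exception."""

class CalibrationFileMissing(MiddlewareException):
    """Application-specific exception: a needed calibration file was not found."""
    def __init__(self, path):
        MiddlewareException.__init__(self,
                                     "calibration file missing: %s" % path)
        self.path = path
}}}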

Detecting and validating dependencies of pipelines on specific.

Some middleware exception handling relates to I/O:

 * File systems
 * Sockets
 * Memory allocations
 * Database access

A middleware API for getting calibration files is needed.


Store subversion revision numbers of third-party software tar balls.

Application software does no I/O.  Its input data are only read from the clipboard,
and its output data are only written to the clipboard.  The clipboard just holds pointers
to objects (see the sketch below).
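
An illustrative sketch of this clipboard pattern; the class and key names are assumptions, not the actual middleware API.

{{{#!python
class Clipboard(object):
    """Holds references to in-memory objects passed between pipeline stages."""
    def __init__(self):
        self._items = {}
    def put(self, key, obj):
        self._items[key] = obj            # store a reference, not a copy
    def get(self, key):
        return self._items[key]

def run_stage(clipboard):
    """A stage reads its inputs from, and writes its outputs to, the clipboard."""
    exposure = clipboard.get("rawExposure")             # input staged by middleware
    calibrated = [pixel * 1.0 for pixel in exposure]    # placeholder processing
    clipboard.put("calibratedExposure", calibrated)     # output for the next stage
}}}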

Variance in processing time for data-dependent data reduction.

Three products from this workshop:

 1. Near-term summary
 2. PDR presentation
 3. Operations plan (beyond PDR)

Design plan, but not a development/implementation plan

Use DC3 to evaluate the feasibility of check-pointing?

Need to cost out clusters with SANs (Storage Area Network: a high-speed,
special-purpose network that connects to storage devices).

Estimate how often a box will fail -- use industry data.

Hardware includes rack power supplies, and can include rack-isolated cooling systems,
switches, line cards, disk storage, and boxes (multi-core CPU, CPU cache, RAM, local disk).
     350