= Ray's Notes from the Fault Tolerance Workshop =

=== Discussion of [wiki:FaultToleranceInterfaces] ===

Discussion of scheduled vs. unscheduled outages:
  * the page mainly addresses unscheduled outages
  * scheduled outages are addressed by:
    * the telescope stream is not always on
    * redundancy can allow scheduled maintenance of parts of the system

 Infrastructure::
   develop requirements on the hardware

 Application Layer::
   specify the services and mechanisms available to the application layer

 Isolation of systems::

=== Failure Modes ([wiki:FaultToleranceUseCases]) ===

Characteristics of failures (a structured sketch follows this list):
  * repeatability
  * location
  * stateless vs. stateful
  * scale (one node vs. whole systems)
  * whether it causes slow execution
  * noisiness: silent, corrupting, or run-away
  * correlated failures
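
As one way to make these characteristics concrete, here is a minimal, hypothetical sketch of a record type for cataloging failure modes along these axes. The field names, categories, and the example entry are assumptions for illustration, not project definitions.

{{{
#!python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Noisiness(Enum):
    SILENT = "silent"          # no error is reported; output is wrong or missing
    CORRUPTING = "corrupting"  # bad data propagates downstream
    RUNAWAY = "runaway"        # the process consumes resources without finishing


@dataclass
class FailureMode:
    """One entry in a (hypothetical) catalog of failure modes."""
    name: str
    repeatable: bool           # does it recur under the same conditions?
    location: str              # e.g. "compute node", "SAN", "network", "pipeline stage"
    stateful: bool             # does recovery require restoring state?
    scale_nodes: int           # 1 for a single node; larger for system-wide failures
    slows_execution: bool      # shows up as slow execution rather than a crash
    noisiness: Noisiness
    correlated_with: List[str] = field(default_factory=list)  # related failure modes


# Purely illustrative example entry.
disk_timeout = FailureMode(
    name="disk read timeout",
    repeatable=False,
    location="compute node",
    stateful=False,
    scale_nodes=1,
    slows_execution=True,
    noisiness=Noisiness.SILENT,
    correlated_with=["SAN congestion"],
)
}}}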

What is the scope of this document?

What is the role of SDQA analysis in overall fault tolerance?  What is
the impact on middleware?

  Ease of implementation at each level: you prefer to push problem
  solving as low down as possible, where it is easiest.  At higher
  levels you may have more information, but it is more complicated to
  address there.

=== Requirements ===

Science requirements:  section 3.5, DP and management requirements

 OTT1::
   Not mentioned: how often must we meet the design/minimum spec? [[BR]]
   Gregory: use these two to define a heuristic (e.g. 1 sigma within the [[BR]]
   design spec, 3 sigma within the minimum).

 DMFR::
   Section 2: [[BR]]
    OTT1 drives the sizing model [[BR]]
    0% data loss (information associations are part of this) [[BR]]
    0.1% alert publication failure (1 minute or 2 minute goal?) [[BR]]
    what does 98% availability mean?

   Section 4: [[BR]]
   App Layer: [[BR]]
    reprocessing scopes the capacity [[BR]]
    24 hour downtime needs clarification [[BR]]
    detection list update latency is still TBD; what is a detection list? [[BR]]
    metadata summary updated every 6 months (goal/rationale?) [[BR]]
    most stringent is DM-APP-DP-AL-2: alerts 60 s after the 2nd image [[BR]]
    DQ report available within 4 hours of the end of the observing night [[BR]]
    performance report also within 4 hours: it has AC info, but AC processing is stretched over the day [[BR]]
    What is the catch-up capacity?  If there is a DQA-detected problem, how long do we have to correct it?

 Human Error::
   treat it like any other error [[BR]]
   best practices for human-based control/configuration


Know how to buy back extra capacity.

 Tapes::
   read-back processes; verify all data or a sampling (a sampled check is sketched below)
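
A minimal sketch of the sampled read-back idea, assuming checksums are recorded in a manifest at write time; the manifest format, paths, and sample fraction are placeholders.

{{{
#!python
import hashlib
import random
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large files are not held in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sample(manifest: Dict[str, str], sample_fraction: float = 0.05) -> List[str]:
    """Re-read a random sample of archived files and report checksum mismatches.

    `manifest` maps a file path to the checksum recorded when the data were written.
    Set sample_fraction to 1.0 to read back all data instead of a sampling.
    """
    paths = list(manifest)
    if not paths:
        return []
    n = max(1, int(len(paths) * sample_fraction))
    mismatched = []
    for p in random.sample(paths, n):
        if sha256_of(Path(p)) != manifest[p]:
            mismatched.append(p)
    return mismatched
}}}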

If we can meet the deadline (decision sketch below):
  * redo the entire visit
  * redo since the last checkpoint
  * full redundant option

If not:
  * redo a portion of the visit

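A hypothetical sketch of the decision implied by the two lists above: if the deadline can still be met, prefer restarting from the last checkpoint (or redoing the whole visit); otherwise fall back to redoing only a portion. The time estimates and thresholds are placeholders, not requirements.

{{{
#!python
def choose_recovery(time_remaining_s: float,
                    redo_visit_s: float,
                    redo_from_checkpoint_s: float,
                    redo_portion_s: float) -> str:
    """Pick a recovery action from the options in the notes above.

    All arguments are estimated durations in seconds.  A fully redundant
    configuration would avoid this choice entirely (the "full redundant option").
    """
    if redo_from_checkpoint_s <= time_remaining_s:
        return "redo since last checkpoint"
    if redo_visit_s <= time_remaining_s:
        return "redo entire visit"
    if redo_portion_s <= time_remaining_s:
        return "redo a portion of the visit"
    return "drop the visit; redo later with spare capacity"


# Example: 40 s left before the alert deadline.
print(choose_recovery(40.0, redo_visit_s=90.0,
                      redo_from_checkpoint_s=25.0, redo_portion_s=15.0))
}}}
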
Triply redundant processing of a few amplifiers?

Every day, re-run the same test pipeline on the production system to detect changes in the results (comparison sketch below).
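
One way this could be implemented, sketched under assumptions: drive the fixed test pipeline with a placeholder script and compare its outputs against a stored reference run by checksum. The script name and directories are hypothetical.

{{{
#!python
import hashlib
import subprocess
from pathlib import Path
from typing import List

REFERENCE_DIR = Path("/lsst/ref/test_pipeline")   # outputs of a known-good run (placeholder)
OUTPUT_DIR = Path("/lsst/daily/test_pipeline")    # today's outputs (placeholder)


def file_digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_and_compare() -> List[str]:
    """Run the daily test pipeline and list output files that differ from the reference."""
    # "run_test_pipeline.sh" is a hypothetical driver for the fixed test pipeline.
    subprocess.run(["run_test_pipeline.sh", str(OUTPUT_DIR)], check=True)
    changed = []
    for ref in REFERENCE_DIR.rglob("*"):
        if not ref.is_file():
            continue
        out = OUTPUT_DIR / ref.relative_to(REFERENCE_DIR)
        if not out.exists() or file_digest(out) != file_digest(ref):
            changed.append(str(ref.relative_to(REFERENCE_DIR)))
    return changed
}}}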

Goals for Wednesday:
  a working list of common practices.

Common Practices:
  * test runs prior to production runs
  * backups/duplication of data to enable a swap in response
  * mechanisms for detecting failure (heartbeats, checksums, ...); a heartbeat sketch follows this list
    * on-the-fly checksum checking during read-in
    * sampled redundant processing
    * independent and dependent heartbeats
  * rescheduling of failed processes; allow for automated re-configuration
  * redundant execution of processes, fail-over to backup processes (db)
  * prevention techniques:
    * choice of fault-tolerant hardware
  * limit the need for roll-backs:
    * limit over-writes
    * separate read-only vs. over-write data
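
To make the heartbeat bullet concrete, a minimal sketch of an independent heartbeat: workers periodically touch a timestamp file, and a separate monitor flags any worker whose heartbeat has gone stale. The shared directory and timeout are assumptions; a dependent heartbeat would instead be tied to actual work progress.

{{{
#!python
import time
from pathlib import Path
from typing import List

HEARTBEAT_DIR = Path("/shared/heartbeats")  # placeholder location visible to workers and monitor


def beat(worker_id: str) -> None:
    """Called by a worker on each iteration of its main loop."""
    HEARTBEAT_DIR.mkdir(parents=True, exist_ok=True)
    (HEARTBEAT_DIR / f"{worker_id}.hb").write_text(str(time.time()))


def stale_workers(timeout_s: float = 30.0) -> List[str]:
    """Run by an independent monitor process: list workers whose heartbeat is too old."""
    now = time.time()
    return [hb.stem for hb in HEARTBEAT_DIR.glob("*.hb")
            if now - float(hb.read_text()) > timeout_s]
}}}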

Recovery patterns (a check-pointing sketch follows this list):
  * doubling CPU capacity, allowing one half to fail
  * check-pointing to a SAN, with fail-over
  * drop portions, use spare capacity to redo & deliver late
  * drop the entire frame
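
A sketch of the check-pointing pattern under assumptions: after each pipeline stage, the accumulated state is written atomically to shared storage (the SAN in the notes), so a fail-over process can resume from the last completed stage rather than redoing the whole visit. The paths and JSON state format are placeholders.

{{{
#!python
import json
from pathlib import Path
from typing import Callable, Dict, List, Tuple

CHECKPOINT = Path("/san/pipeline/visit_checkpoint.json")  # placeholder path on shared storage


def save_checkpoint(stage: int, state: Dict) -> None:
    """Write the checkpoint atomically: write a temp file, then rename it into place."""
    tmp = CHECKPOINT.with_suffix(".tmp")
    tmp.write_text(json.dumps({"stage": stage, "state": state}))
    tmp.replace(CHECKPOINT)


def resume() -> Tuple[int, Dict]:
    """On start-up (or fail-over), pick up from the last completed stage, if any."""
    if CHECKPOINT.exists():
        saved = json.loads(CHECKPOINT.read_text())
        return saved["stage"], saved["state"]
    return 0, {}


def run_pipeline(stages: List[Callable[[Dict], Dict]]) -> Dict:
    """Run the stages in order, checkpointing after each one completes."""
    start, state = resume()
    for i, stage in enumerate(stages[start:], start=start):
        state = stage(state)            # each stage transforms the accumulated state
        save_checkpoint(i + 1, state)   # record that stage i finished
    return state
}}}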

Ramifications:
  * dropping portions: minimize the scale of correlated processing;
    full focal-plane correlations mean losing whole visits
  * how long does it take to reconfigure the system?

Maintenance practices:
  * operations manual
  * allow time for the development of maintenance scripts/documentation
  * any hardware health maintenance
  * documenting fault-tolerance patterns

DC3:
  * Can we evaluate the feasibility of check-pointing?
     131