Changes between Initial Version and Version 1 of FaultToleranceWorkshopJaceksNotes

07/17/2008 06:51:18 PM (11 years ago)



= Jacek's notes from the Fault Tolerance Workshop =
[wiki:FaultToleranceWorkshop Fault Tolerance Workshop]
For sure we can't afford high-end hardware
Design the alert system such that no scheduled outages
are required during data taking
Availability requirements: need to define acceptable frequency and extent of outages
 * e.g. is a 10 msec outage every 1 sec ok?
 * is a 3 day outage every 300 days ok?
(both examples give 99% availability)
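The availability figure is just one minus the fraction of time spent in outage, so very different outage patterns can share the same number. A minimal sketch of that arithmetic follows; the function name and the 10 ms per second pattern are illustrative assumptions, while the 3 days per 300 days pattern is taken from the notes.

```python
def availability(outage: float, period: float) -> float:
    """Fraction of time the system is up, given one outage of length
    `outage` in every window of length `period` (same units for both)."""
    if not 0 <= outage <= period:
        raise ValueError("outage must lie between 0 and period")
    return 1.0 - outage / period


# Two very different outage patterns with the same downtime ratio.
short_frequent = availability(outage=0.010, period=1.0)  # 10 ms in every second
long_rare = availability(outage=3.0, period=300.0)       # 3 days in every 300 days

print(f"{short_frequent:.2%}  {long_rare:.2%}")  # prints "99.00%  99.00%"
```

The single percentage hides exactly the distinction the requirement needs to make: how often outages happen and how long each one lasts.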
High reliability is not only good for science, it also saves budget
 * bad reliability requires experts on call 24x7 to fix problems, which is expensive
we should provide a "service level agreement" to the infrastructure people, see the example "sample requirements" on the FaultToleranceInterfaces page
need to pick something to start costing (will we use redundancy? retry? ...)
need to provide FT within the existing budget (challenge)
sandboxing is useful, but adds a lot of complexity
will users be bringing and executing their own code inside the pipeline?
 * unclear
 * will probably have clusters of machines for user code that is not part of production
certain things "have to run", some "would be nice to run"; the system should allow such a division
rerunning has two modes:
 * rerun to fix problems. Maybe we want to overwrite bad data in this case. Might keep multiple versions and flag bad versions (or trash them)
 * rerun to reprocess data with better code/calibration etc. This is a new version, can't overwrite
fault tolerance will take advantage of sophisticated provenance; ft is a customer of provenance
requirement: don't release data until provenance has been recorded
real time processing: is it better to stick to the deadline and skip an alert if we miss it, or is it better to deliver an alert about an interesting event late (e.g. 3 min after the deadline)?
 * look at the time distribution, look at the system capacity, see how much we can afford
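One way to "look at the time distribution" is to classify measured alert latencies against the deadline and a grace window. The sketch below is only an illustration: the function name and the sample latencies are made up, the 60 s deadline corresponds to the 1 min real time alert target, and the 180 s grace window to the "3 min after deadline" example.

```python
def deadline_report(latencies_s, deadline_s=60.0, grace_s=180.0):
    """Split alert latencies into on-time, late-but-within-grace, and dropped."""
    n = len(latencies_s)
    on_time = sum(1 for t in latencies_s if t <= deadline_s)
    late = sum(1 for t in latencies_s if deadline_s < t <= deadline_s + grace_s)
    dropped = n - on_time - late
    return {
        "on_time": on_time / n,
        "late_but_deliverable": late / n,
        "dropped": dropped / n,
    }


# Hypothetical per-alert latencies in seconds, not real measurements.
sample = [41, 52, 58, 63, 75, 59, 300, 47, 55, 61]
report = deadline_report(sample)
print(report)
```

With a strict "skip if late" policy everything outside `on_time` is lost; with late delivery only the `dropped` fraction is lost, at the price of capacity spent on late work.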
we really have 3 different deadlines:
 * 1 min (real time alerts)
 * 24 h (nightly redone at main archive)
 * 6 months (deep detection)
it is much easier to diagnose a problem if we do not fail over
it is probably fine if things fail for a moment from time to time if that results in building a much simpler system
need to try to build a firewall against mistakes that would make large portions of data unusable (e.g. an entire night)
 * does this task belong to middleware or app/sdqa?
 * app people will give us rules to apply, middleware needs to provide the code. We need to think about what middleware should support, e.g. in BaBar users could manually load calibration constants, which were then used during data processing. Mistake made: there was no way to check which constants were used --> there was a missing piece in the middleware code
human mistakes will dominate, especially at the beginning
== failure types ==
many of the things in FaultToleranceUseCases are really characteristics of failures, not types
aggregation of failures: every step takes 0.5 sec longer than average (these are not failures by themselves), but overall it looks like a failure because the total time goes above the threshold
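The aggregation case can be made concrete with a check that looks at both per-step limits and the end-to-end budget. This is a sketch only; the step count, the 4.0 s average, the 5.0 s per-step limit and the 42 s total budget are hypothetical numbers chosen so that the 0.5 sec drift from the notes trips the total but no single step.

```python
def check_pipeline(step_times, step_limit, total_limit):
    """Report steps that break their own limit and whether the sum breaks the budget."""
    slow_steps = [i for i, t in enumerate(step_times) if t > step_limit]
    total = sum(step_times)
    return {
        "slow_steps": slow_steps,
        "total": total,
        "total_exceeded": total > total_limit,
    }


nominal = [4.0] * 10                   # hypothetical: 10 steps at the average time
drifted = [t + 0.5 for t in nominal]   # every step 0.5 sec slower than average

ok = check_pipeline(nominal, step_limit=5.0, total_limit=42.0)
bad = check_pipeline(drifted, step_limit=5.0, total_limit=42.0)
print(ok["total_exceeded"], bad["slow_steps"], bad["total_exceeded"])
```

In the drifted run no step is flagged, yet the total is over budget, so monitoring only at the step level would miss it.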
correlations of failures, e.g. due to batches of the same hardware or the environment (e.g. high temperature in the room)
security plan will cover some of the issues
don't have to catch every failure at a fine-grained level
who is watching the watchers? (catching failures of
the fault tolerance system)
role of sdqa and impact on ft design?
 * pushing ft lower: easier (e.g. we can just buy a fault tolerant network),
 * pushing ft higher: complex and hard to implement, but it is the only way to implement complete fault tolerance; that means going towards application code
== requirements (Gregory) ==
 * srd.pdf (science requirements document)
 * dm functional requirements.pdf, docushare document-5438
any loss of information that prevents us from processing an image should be treated as losing the image
0.1% alert publication failure
98% availability
 * need to define requirements better, e.g. what about lots of very short failures vs one long failure...
"less than 24h downtime" requirement
 * doesn't make sense
 * e.g. it does not tell us how much diesel fuel we might want to buy to keep generators running, a power outage can be very long...
 * this requirement is derived from 60 sec real time alerts (based on one of the diagrams) - this is wrong
 * what about 23 h down, 5 min up, 23 h down, 5 min up, is this acceptable?
   --> raise this with Jeff
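To see why the "23 h down, 5 min up" pattern is a problem, it helps to compute what it does to availability. The sketch below assumes the requirement is read as "no single outage reaches 24 h"; that reading, and the variable names, are assumptions for illustration.

```python
down_min = 23 * 60   # each outage lasts 23 hours
up_min = 5           # followed by 5 minutes of uptime

longest_outage_h = down_min / 60
meets_24h_rule = longest_outage_h < 24          # every outage is shorter than 24 h
cycle_availability = up_min / (up_min + down_min)

print(meets_24h_rule, f"{cycle_availability:.2%}")  # prints "True 0.36%"
```

The pattern passes the downtime rule as read here while the system is up well under 1% of the time, so the rule needs a companion constraint on frequency or on overall availability.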
we think the plan is that DR1 will always be kept on disk
  --> ask Jeff whether this is captured somewhere in the official requirements document
DM-APP-DP-CA-7: does it imply sources in real time?!?
 --> ask Jeff
DM-APP-DP-CA-11 is confusing too
== architecture ==
for nightly, we expect one core per amp, so 3000 cores; expecting to have 16 cores per box, so ~200 boxes
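The box count is a ceiling division. A quick check of the numbers follows; reading "~200" as the minimum count rounded up with some headroom is an assumption, not something stated in the notes.

```python
import math

cores_needed = 3000   # one core per amp
cores_per_box = 16

min_boxes = math.ceil(cores_needed / cores_per_box)
headroom_at_200 = 200 - min_boxes

print(min_boxes, headroom_at_200)  # prints "188 12"
```

So 188 boxes is the bare minimum, and a round 200 would leave about a dozen boxes as spares.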
if we are going to meet the deadline, then when something fails we can:
 * redo the entire image, or
 * checkpoint and redo from the last checkpoint, or
 * process twice in parallel (full redundancy)
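The checkpoint option can be sketched as a stage runner that saves state after every stage and, on failure, repeats only the failed stage. Everything here is hypothetical: the `store` dict stands in for shared (non-local) checkpoint storage, and the stages and the simulated failure are toy examples.

```python
def run_with_checkpoints(stages, data, store, max_retries=2):
    """Run stages in order, checkpointing after each one into `store`.
    On failure, redo only the failed stage from the last checkpoint."""
    store.setdefault("next_stage", 0)
    store.setdefault("state", data)
    failures = 0
    while store["next_stage"] < len(stages):
        i = store["next_stage"]
        try:
            result = stages[i](store["state"])
        except RuntimeError:
            failures += 1
            if failures > max_retries:
                raise
            continue  # retry stage i from the checkpointed state
        store["state"] = result      # checkpoint: output of stage i
        store["next_stage"] = i + 1
    return store["state"]


calls = []  # records which stages actually executed


def stage_a(x):
    calls.append("a")
    return x + 1


def make_flaky_stage():
    seen = {"failed": False}

    def stage_b(x):
        calls.append("b")
        if not seen["failed"]:
            seen["failed"] = True
            raise RuntimeError("simulated node failure")
        return x * 10

    return stage_b


def stage_c(x):
    calls.append("c")
    return x - 3


result = run_with_checkpoints([stage_a, make_flaky_stage(), stage_c], 1, store={})
print(result, calls)  # prints "17 ['a', 'b', 'b', 'c']"
```

Stage `a` runs once even though `b` failed, which is the saving compared with redoing the entire image; the cost is the extra io for every checkpoint write.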
we should not checkpoint to local disk, so we need a SAN
 * a reasonable option: many small SAN clusters, say 2 disks per 8 machines
can we drop pieces of an image?
 --> ask application people
 if so, at what granularity? amp? ccd? raft?
   --> ask application people
 if we can drop parts, maybe we could just ignore failures of single machines?
are there any dependencies between different images? (ghost images)
it might be useful to have a node dedicated to reprocessing the very same image every day to uncover problems with new code
a full single copy of the templates is 225 TB
 --> check if we are planning to have enough storage at base camp
0.1 arc sec = 1.5 pixels = point source
maybe we should keep two copies of the catalog at base camp and update one copy with a one-day delay; if we mess up the latest copy, we can revert to the day-old copy
--> find the documents describing the hardware at base camp
 * monitor (watchdogs)
 * redundant execution / auto fail over / rescheduling
 * limit updates
 * keep immutable and mutable data separate
--> check xproof / xrootd
mapReduce worry: must checkpoint between the map and reduce stages (lots of io), but some data is read only (e.g. calibration); in the non-MR world we can keep such data in memory and avoid the IO
there are 16 amplifiers per ccd, so we can collocate all amps from a ccd on a single 16 core box
we need 80 Mbit (10 MB) of IO per image (80 bits per pixel in image x 1 Mpixel image per amp)
 * 80 because: image (32) plus variance (32) plus mask (16)
 * x 16 cores per machine
 * x 2-3 images we need per stage (raw, calibration, template)
 * have ~10 stages, some need less io
 * need to decide how much time from the 30 sec/visit we want to devote to io
 * also, if we have a 1 min requirement to deliver an alert, and 30 sec per visit, io is x2
 * bottleneck: getting data off the chip and out of the box
   * checkpointing would require: 11.5 GB
   * need ~1-2 sec per stage to checkpoint
 * RAM seems ok, 8 GB / box is sufficient to do image processing
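The multiplication chain in the list above can be worked through in bytes. 80 bits per pixel is 10 bytes, so a 1 Mpixel amp image comes to 80 Mbit, i.e. 10 MB. The sketch below only covers the per-amp, per-box and per-stage factors; taking 1 Mpixel as exactly 10^6 pixels and MB as 10^6 bytes are simplifying assumptions.

```python
BITS_PER_PIXEL = 32 + 32 + 16   # image + variance + mask
PIXELS_PER_AMP = 1_000_000      # 1 Mpixel per amp (assumed exactly 10**6)
CORES_PER_BOX = 16              # one amp per core

bytes_per_amp_image = BITS_PER_PIXEL * PIXELS_PER_AMP // 8
mb_per_amp_image = bytes_per_amp_image / 1e6
mb_per_box_image = mb_per_amp_image * CORES_PER_BOX

# 2-3 images (raw, calibration, template) are needed per stage.
mb_per_box_stage = {n: mb_per_box_image * n for n in (2, 3)}

print(mb_per_amp_image, mb_per_box_image, mb_per_box_stage)
```

That gives 10 MB per amp image, 160 MB per box for one image of each amp, and 320 to 480 MB per box per stage before multiplying by the number of stages that really need the full io.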
quad resolution in template images
 --> this is not captured in the storage estimates, follow up!
postage stamps coming out of calibrated images or difference images?
 * calibrated
at main archive we will save calibrated images, at base camp we only need to save postage stamps
options we have:
 * double the capacity (full redundancy)
 * checkpoint; it requires tens of % more hardware to handle the extra io, plus a high speed SAN
 * maybe it is ok to design a system that continues if one amp fails?
    --> need to talk to app people
how long will it take to reconfigure the system after a node failure?
 * if fault tolerant MPI does not allow us to do it quickly, might consider other approaches
 * xrootd looks like a good candidate to consider
we should capture requirements and representative use cases in the doc for pdr
need to consider catastrophic failures separately
 * example: database
 * in case of the database, we probably want to maintain 2 synchronized databases (redundant spare)
   * hardware for redundant db servers is already in the baseline for base camp
how to make sure components are fault tolerant?
 * we can define classes of fault tolerance and decide which class each component belongs to
 * or developers must deliver ft components
 * or we can have ft experts who consult/help
provenance worries that Gregory has:
 * if we are serious about provenance, we need to capture all dependencies, e.g. all shared libraries; we might also want to keep the contents of external libraries
   * example: an innocent library changed a runtime flag which switched the way 64-bit int is treated (80 bit representation instead of 64), which affected floating point calculations for the executable
 * that is hard, and pushes us towards sandboxing
 * also, how are we going to configure a machine and install software on it based on provenance info?
 * also, how do we know that code does not open some random files in random places?
--> put in requirements: all I/O to disk goes via middleware.
 * this simplifies provenance
 * and catching failures
 * but we need to be careful, because we don't want to introduce extra copying of data
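A rough sketch of what "all I/O to disk goes via middleware" could look like: a single entry point that records every read and write (with a content hash) and every failed access. The class name, record layout and file names are hypothetical and not part of any existing middleware.

```python
import hashlib
import os
import tempfile


class MiddlewareIO:
    """Hypothetical single entry point for disk I/O that logs every access."""

    def __init__(self):
        # Each entry: (operation, absolute path, sha256 hex digest or None)
        self.provenance = []

    def read(self, path):
        try:
            with open(path, "rb") as f:
                data = f.read()
        except OSError:
            # Failures are caught in one place and recorded before re-raising.
            self.provenance.append(("read_failed", os.path.abspath(path), None))
            raise
        self._record("read", path, data)
        return data

    def write(self, path, data):
        with open(path, "wb") as f:
            f.write(data)
        self._record("write", path, data)

    def _record(self, op, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.provenance.append((op, os.path.abspath(path), digest))


with tempfile.TemporaryDirectory() as tmp:
    mw = MiddlewareIO()
    calib_path = os.path.join(tmp, "calib.dat")
    mw.write(calib_path, b"calibration constants")
    payload = mw.read(calib_path)
    try:
        mw.read(os.path.join(tmp, "missing.dat"))
    except OSError:
        pass

ops = [entry[0] for entry in mw.provenance]
print(ops)  # prints "['write', 'read', 'read_failed']"
```

The sketch reads whole files into memory to hash them, which is exactly the kind of extra copying the last bullet warns about; a real layer would more likely hash while streaming or record metadata only.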
application code may decide to change the algorithm during execution; if that happens, it needs a way to report it for provenance
is remote disk similar in speed to local disk because networks improve?
 * yes, but local disks derandomize io
checkpointing: the development cost of implementing checkpointing is modest; we will implement it and make it configurable, so we can turn it on or off
new use cases:
 * we marked a piece of data as good; later, after it went to the community, we discover it is bad for some kinds of analysis
 * we run a program twice and each time we get a different result
 * someone finds something interesting in data with bad calibration in the nightly catalog
 * user retrieves an image
 * network problems to ncsa
 * we were told to prepare for position x, the real observed position is y
 * system log filled up (e.g. /var/log/messages)
 * aggregation of failures: every step takes 0.5 sec longer than average (these are not failures by themselves), but overall it looks like a failure because the total time goes above the threshold