wiki:InfrastructureWGMeetingD20100301

We will hold our regular bi-weekly Infrastructure WG telecon on Monday, March 1, at noon CT (10 AM PT).

Agenda

  • Existing Resource Usage Update (as of Feb 26)
    • TeraGrid Resources (Startup Allocation)
      • Service Units
        • Abe: Allocated: 30K SUs; Remaining: ~29.4K SUs
        • Lincoln (nVidia Tesla GPUs): Allocated: 30K SUs; Remaining 30K SUs
      • Disk Storage
        • Allocated: 5TB; Remaining: 5TB
      • Tape Storage
        • Allocated: 40TB; Remaining: 40TB
  • DC3b Infrastructure for the Performance Tests
    • http://dev.lsstcorp.org/trac/wiki/DC3bHardwareRequirements
    • LSST-11 DC3b Hardware
    • Compute
    • Database Disk
      • We're good on this. 15TB for database on lsst10 as of Feb 22, which covers the requirements for all of DC3b.
    • Tape Storage
      • Total (raw) tape needed: 203 TB for PT1, 288 TB for PT2, 449 TB for PT3
      • The tape gap is 0 for PT1, 88 TB for PT2, and 249 TB for PT3 (contingent on 200 TB raw from TeraGrid)
      • Pricing (see the cost check sketch below)
        • LTO-3: $62/TB (single copy) [$25/tape = 400 GB]; ~$19K for 300 TB
        • LTO-5: $31/TB (single copy) [$50/tape = 1.6 TB]; ~$9K for 300 TB [plus faster bandwidth than LTO-3]
        • Note 1: Working with the PI on possible purchase strategies, including the PI subsidizing tape usage and LSST subsidizing LTO-5 tapes, to avoid having to buy LTO-3 (older-technology) tapes.
      • Additional notes
        • Note 2: Mass storage will continue to exist post-2010 (ongoing talks with the NCSA PI); a new system is expected by Oct 1 (estimated)
        • Note 3: DC3b data loss from tape failures is estimated at 2-3% per year
      • Waiting on the TeraGrid allocation decision before buying any tapes
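      • Back-of-envelope check of the tape pricing above (a sketch only; the per-tape prices and the 300 TB single-copy size come from the pricing bullets, and actual purchase quantities depend on the TeraGrid decision):

        # Back-of-envelope tape cost check using the per-tape prices quoted above.
        import math

        def tape_cost(total_tb, price_per_tape, capacity_tb):
            """Single-copy cost of total_tb, rounding up to whole tapes."""
            tapes = math.ceil(total_tb / capacity_tb)
            return tapes * price_per_tape, price_per_tape / capacity_tb

        for name, price, cap in [("LTO-3", 25.0, 0.4), ("LTO-5", 50.0, 1.6)]:
            cost, per_tb = tape_cost(300.0, price, cap)
            print("%s: ~$%.0f/TB; 300 TB single copy ~ $%.1fK" % (name, per_tb, cost / 1000.0))
        # LTO-3: ~$62/TB, ~$18.8K; LTO-5: ~$31/TB, ~$9.4K -- matching the ~$19K / ~$9K figures above.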
  • DC3b User Access
    • DC3bUserAccess
    • Unique Identifier for Logical Set of Related Files
      • Discussion with RHL; pending feedback from him
      • DC3bUserAccess (this is the page under discussion, but it is not ready for review yet)
    • Bulk Upload Into Catalog
      • DC3bUserAccess
      • Assuming standard MySQL utilities; see the sketch below
      • What are the storage requirements?
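      • Minimal bulk-load sketch assuming the standard mysqlimport utility (the host, database, user, and file names are placeholders, not decided values; mysqlimport derives the target table name from the input file's basename):

        # Bulk catalog load via the standard mysqlimport client utility.
        # All connection parameters and file names below are placeholders.
        import subprocess

        def bulk_load(host, db, user, data_file, delimiter="\t"):
            cmd = [
                "mysqlimport",
                "--local",                                # read the data file client-side
                "--host=%s" % host,
                "--user=%s" % user,
                "--fields-terminated-by=%s" % delimiter,
                db,
                data_file,                                # e.g. Object.tsv loads table "Object"
            ]
            subprocess.check_call(cmd)

        # Hypothetical usage (credentials would come from ~/.my.cnf or a prompt):
        # bulk_load("lsst10", "dc3b", "lsst_user", "Object.tsv")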
    • Web Data Server update
    • Image Cutout Service update (K-T)
    • Sample Scripts
      • who?
    • Web Interface update (who?)
      • IPAC; interface to scripts; reuse existing portals
    • LSST-54 Connection speeds between lsst10 and the SAN storage. We need 300 MB/s. What are our options?
      • Do we really need 300 MB/s? (Jacek)
        • Currently lsst10 gets 150 MB/s; a scan of the Object table takes 5m, the Source table 1h2m, and ForcedSource 1h36m (dbDC3bHardware); see the projection sketch below
      • The adapter slots on lsst10 will not support an 8Gb HBA
        • In the process of getting price estimates for a new database server
      • A new server to support 300 MB/s will cost $4K (2 quad-core CPUs; 16GB RAM; Emulex dual-channel 4Gbps PCIe HBA)
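      • One way to frame the 300 MB/s question: infer effective table sizes from the scan times quoted above and project the scan times at higher bandwidth. This assumes the scans are purely I/O-bound, which is an assumption rather than a measurement:

        # Implied table sizes at 150 MB/s, and projected scan times at 300 MB/s,
        # assuming the scans are I/O-bound (sequential-read limited).
        CURRENT_MBPS = 150.0
        TARGET_MBPS = 300.0

        scan_minutes = {"Object": 5.0, "Source": 62.0, "ForcedSource": 96.0}

        for table, minutes in scan_minutes.items():
            size_gb = minutes * 60.0 * CURRENT_MBPS / 1024.0      # implied size at 150 MB/s
            projected = minutes * CURRENT_MBPS / TARGET_MBPS      # time if scan speed scales with bandwidth
            print("%-12s ~%4.0f GB; %5.1f min now -> %4.1f min at 300 MB/s"
                  % (table, size_gb, minutes, projected))
        # Object ~44 GB (5 -> 2.5 min), Source ~545 GB (62 -> 31 min), ForcedSource ~844 GB (96 -> 48 min)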
    • Database replication strategy (Jacek)
  • ImSim Data Management with iRODS (Arun)
  • Output Data Management with REDDnet
    • Conference call held on Feb 24
    • contacts established
    • 2x24TB (48TB) going to both NCSA and SLAC; depots exist at Caltech and elsewhere
    • Big focus on monitoring by the team at Vanderbilt
      • perfSONAR suite (Internet2; SNMP), BWCTL (iperf), MRTG (SNMP), Nagios, and custom tools
      • Monitors availability, throughput, latency, and general health; generates alerts
    • Single virtual directory structure; a sandbox for LSST has been created
    • L-Store
      • provides client view / interfaces (get/put/list/etc.)
      • defines the virtual directory structure
    • StorCore
      • partitions REDDnet space into logical volumes (think: LVM)
      • L-Store uses StorCore for resource discovery
    • Web interfaces for both StorCore and L-Store
    • Example code is available (contact Mike for a copy)
      • upload.sh file1 file2 dir1 dir2 remotefolder
      • download.sh remotefile1 remotefile2 localdestination
      • ls.sh remotefile or remotedirectory
      • mkdir.sh remotedir1
      • additional commands to "stage" files across depots
  • Update on LSST Database Performance Tests Using SSDs (Arun/Jacek?)
    • is a link available for more info? (the pdf?)
  • Update on Lawrence Livermore database scalability testing (DanielW)
    • Description: LLNL has provided a number of nodes (currently 25) as a testbed for our scalable query processing system. Testing over many nodes lets us see where our query parallelism model succeeds and fails, and helps us develop a prototype that can handle LSST database query needs. So far, use of this many-node cluster has uncovered scalability problems in job control, threading, messaging overhead, and queuing, which we have been addressing incrementally in each new iteration (three so far).
    • Status: developing and testing a new model, since tests in January showed bottlenecks beyond 4 nodes
  • Mass Storage Access Requirements
    • Do we need access to the MSS, either to or from any lsst* machine or ds33?
    • lsst10 for catalog backups / replication to REDDnet?
    • implications for server software installations/configurations and authentication
  • Directory Structure for Image Repositories
    • DavidG is drafting a proposal
  • DNS Names for DC3b Servers
    • Web Data Server
    • Schema Browser
    • Primary Catalog Database
    • iRODS Server
    • support email address(es)
    • current thoughts:
      • "data.lsst.org" for the web data server (the http interface to the image files) as well as the schema browser web app
      • "db.lsst.org" for the primary database server at ncsa (aka lsst10)
      • Email address: "dc-support at lsst.org" -- support at lsst.org *may* be too generic
  • New Shared Memory HPC Machine at NCSA (coming online soon)
    • SGI-based; details not yet available
    • can we leverage this? does shared memory open up any opportunities for us?
  • Skype
    • any interest?
    • percentage of webcam-enabled InfraWG-ers?

  • Cost Sheet Update
    • Baseline version is v45
    • Current version is now v74
    • Summary of Changes
      • LSST-94 Floorspace tab: Increase rack depth from 3.0 to 3.5 (+50 sq ft at both sites; +$19K/+$153K)
      • LSST-95 Floorspace tab: Add calculation for gross floorspace for the base site
    • Questions & Notes
      • Ramp-up: One thing in the cost sheet that I wonder about is our "ramp-up", i.e., we are currently planning to buy 1/3 of the hardware 3 years early, 2/3 two years early, etc. I wonder if 3 years early is a little too soon (see the rough sketch below).
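        • A rough way to quantify the ramp-up premium (sketch only; the 25%/yr price/performance improvement and the reading of the plan as 1/3 of the hardware per year are illustrative assumptions, not cost-sheet values):

          # Extra spend from buying hardware early vs. buying at need,
          # under an assumed price/performance improvement rate.
          IMPROVEMENT_PER_YEAR = 1.25   # assumed: the same capacity is ~25% cheaper each year

          # Assumed reading of "1/3 three years early, 2/3 two years early, etc.":
          # one third of the hardware bought 3, 2, and 1 years early, respectively.
          purchases = {3: 1.0 / 3.0, 2: 1.0 / 3.0, 1: 1.0 / 3.0}

          premium = sum(frac * (IMPROVEMENT_PER_YEAR ** yrs - 1.0)
                        for yrs, frac in purchases.items())
          print("Premium vs. buying at need: ~%.0f%%" % (100.0 * premium))
          # ~59% extra spend for the same delivered capacity under these assumptions.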
    • Upcoming Changes
      • Priority is updating the Power & Cooling estimates
        • LSST-10 Update Power & Cooling at Base Site (info already received from RonL)
        • LSST-47 Power Costs at BaseSite: Use Historical Data to Model Future Power Prices
        • LSST-36 Update Power & Cooling at ArchSite
        • LSST-36 P&C and Floorspace at PCF (rates, payment approach, green features of PCF)
      • LSST-78 Move the 3% CPU spare from document 2116 "CPU Sizing" to document 6284 "Cost Estimate"
      • LSST-79 Add tape library replacement to ArchAOS and BaseAOS
      • LSST-28 Optimal CPU Replacement Policy
      • LSST-14 Processor Sizing Update (Doc2116 LSST CPU Sizing)
      • LSST-37 Missing controller costs for disk
    • Next steps with cost sheet
      • Full review of each element of the cost sheet (the boxes of the mapping document)
        • More readable description of the formulas being used
        • Identification and documentation of assumptions
        • Identification and documentation of external data input
      • Serves two significant purposes
        • Allows for better internal reviews (validation of models and information used)
        • Provides justifications for external reviews
      • Results in an update to (or replacement of) Document-1684 and related documents ("Explanation of Cost Estimates")
  • InfraWG Ticket Update

Notes

Attendees: JeffK, K-T, JacekB, DanielW, DickS, MikeF

  • should hear back on our TG proposal very soon
    • proceed with resolving the tape gap after that
  • user access
    • bulk load - interface & storage questions
      • interface: assuming std mysql utilities
      • storage: assuming no significant amount of storage will be required
        • two data points: "DM not providing anything during DC3b" and "10% of (operational) system" -- we will strive for something in between
        • MikeF will propose a statement re user expectations for storage available
    • sample scripts
      • will defer this activity until a little later
    • web interface
      • IPAC responsibility -- Suzie is the contact person
    • lsst10/SAN 300MB/s
      • Jacek's action to define requirement; MikeF has cost estimate, in holding pattern
  • DB test with SSDs
    • pdf info ok to publish
  • DB scalability testing
    • hoping to get time on a 64 node cluster at SLAC
    • MikeF will get with Jacek regarding db software being planned for lsst10
  • consensus on the mailing list name: dc-support at lsst.org
  • Action items reflected in JIRA tickets.

Useful Links
