wiki:InfrastructureWGMeetingD20100607
Last modified on 06/08/2010 10:42:36 AM

We will be having our regular bi-weekly Infrastructure WG telecon on Monday, Jun 7, at 12:00 noon CT (10:00 AM PT).

Agenda

  • NFS
    • changes implemented on Saturday Jun 5.
    • report any issues to lsst-admin at ncsa.uiuc.edu
  • New Server now available: lsst11
    • 2 dual-core AMD Opteron CPUs at 2.2 GHz, 8 GB RAM, 500 GB RAID0 locally attached, nfs:/lsst
  • Status of disk storage
  • SAN update
    • September deadline
    • costs and options
  • ImSim Data Management with iRODS Update (Arun)
    • (this will be deferred to a separate meeting this week)
    • Sites: UWash, Purdue, SLAC, NCSA
    • ImSimDataManagement
    • Apr12: Expect data flowing by the end of this week (i.e. Apr16)
    • May 24: 144 focal planes on lsst2
  • Output Data Management with REDDnet Update (MikeF)
    • http://docs.google.com/View?id=dgvmjj2x_16f4mvfmd6
    • 2x24TB (48TB) going to both NCSA and SLAC; depots exist at Caltech and elsewhere
      • Update: Will be shipped by the end of next week (May 7)
    • big focus on monitoring by team at Vandy
      • perfSONAR suite (I2) (snmp), BWCTL (iperf), MRTG (snmp), Nagios, and custom
      • monitors availability, throughput, latency, general health, alerts
    • single virtual directory structure -- sandbox for lsst created
    • L-Store
      • provides client view / interfaces (get/put/list/etc.)
      • defines the virtual directory structure
    • StorCore
      • partitions REDDnet space into logical volumes (think: LVM)
      • L-Store uses StorCore for resource discovery
    • Web interfaces for both StorCore and L-Store
    • Example code available (contact Mike to get a copy); a usage sketch follows this sub-list
      • upload.sh file1 file2 dir1 dir2 remotefolder
      • download.sh remotefile1 remotefile2 localdestination
      • ls.sh remotefile or remotedirectory
      • mkdir.sh remotedir1
      • additional commands to "stage" files across depots
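      • Usage sketch (illustrative only): a minimal Python wrapper around the scripts above; the remote paths and file names are made-up placeholders, not a documented interface
        #!/usr/bin/env python
        # Illustrative only: drives the L-Store example scripts via subprocess.
        # The script names come from this page; the remote paths and file names
        # are placeholders.
        import subprocess

        def run(cmd):
            # Run a command, raising CalledProcessError if it exits non-zero.
            subprocess.check_call(cmd)

        run(["./mkdir.sh", "/lsst/sandbox/demo"])                             # create a remote folder
        run(["./upload.sh", "img1.fits", "img2.fits", "/lsst/sandbox/demo"])  # push two local files
        run(["./ls.sh", "/lsst/sandbox/demo"])                                # list the remote folder
        run(["./download.sh", "/lsst/sandbox/demo/img1.fits", "/tmp"])        # pull one file back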
    • Status
      • depots at NCSA installed (2 @ 24TB each), SLAC still waiting for arrival
    • Next Steps:
      • How-to doc for LSSTers
      • mss integration
  • GPFS into the baseline (replacing Lustre)
    • being proposed to TCT
  • GPFS as baseline parallel filesystem
    • review of pricing
  • Shifting priorities for InfraWG
    • La Serena (Jul19)
    • AHM (Aug)
    • PDR (Nov)
  • PDR documentation update
    • La Serena (Jul19)
    • AHM (Aug9)
    • PDR Ready (Nov10)
  • TeraGrid TRAC Allocation (Apr 1, 2010 to Mar 30, 2011)
    • Existing usage as of Jun7 (a worked tally of the SU figures follows this list)
      • Service Units
        • Abe: Allocated: 1509K SUs; Remaining 1508K SUs
        • Cobalt: Allocated: 1.2K SUs; Remaining -11.2K SUs???
      • Disk Storage (NCSA Lustre)
        • Allocated: 5TB; Remaining:
      • Disk Storage (GPFS-WAN)
        • Allocated: 20TB; Remaining:
      • Tape Storage
        • Allocated: 400TB; Remaining:
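    • Worked tally of the SU figures above (used = allocated - remaining); plain arithmetic on the numbers reported here, nothing more
      # used = allocated - remaining; the negative Cobalt balance simply means
      # usage has exceeded that machine's allocation.
      for machine, allocated, remaining in [("Abe", 1509e3, 1508e3),
                                            ("Cobalt", 1.2e3, -11.2e3)]:
          used = allocated - remaining
          print("%s: used %.0f of %.0f SUs" % (machine, used, allocated))
      # Abe: used 1000 of 1509000 SUs
      # Cobalt: used 12400 of 1200 SUs (i.e. over the allocation by 11200 SUs)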
  • InfraWG Ticket Update
  • DC3b User Access
    • DC3bUserAccess
    • Unique Identifier for Logical Set of Related Files
      • discussion w RHL -- pending feedback from him
      • DC3bUserAccess (this is what we're talking about, but don't look at it yet)
    • Bulk Upload Into Catalog
      • DC3bUserAccess
      • assuming standard mysql utilities (see the sketch after this sub-list)
      • assuming storage requirements are not significant
        • MikeF proposing stmt re user expectations for storage
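      • A minimal sketch of the bulk upload using standard MySQL utilities (LOAD DATA via the mysql client); the host, database, table, and CSV layout are placeholders, not the actual DC3b setup
        # Placeholders only: DB_HOST, DB_USER, dc3b_user_db, UserObject, and the
        # CSV layout are assumptions for illustration.
        import subprocess

        load_sql = ("LOAD DATA LOCAL INFILE 'user_objects.csv' "
                    "INTO TABLE UserObject "
                    "FIELDS TERMINATED BY ',' IGNORE 1 LINES")
        subprocess.check_call(["mysql", "--local-infile=1",
                               "-h", "DB_HOST", "-u", "DB_USER", "-p",
                               "dc3b_user_db", "-e", load_sql])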
    • Web Data Server update
    • Image Cutout Service update (K-T)
    • Sample Scripts
      • IPAC (Suzy)
    • Web Interface
      • IPAC owns this
      • IPAC (Suzy); interface to scripts; reuse existing portals
      • Briefing from the Gator discussion on March 23 (Jacek)
    • Database Server(s) at SLAC (Jacek)
      • 2 servers; qserv; 15-20TB
      • expected to be ready before May
      • Apr12 status: still on track for having the secondary database server(s) ready by May1
    • Database replication strategy (Jacek)
  • DC3b User Support
    • IPAC owns this
    • Separate item from above
      • User Access is about systems and software; User Support is about receiving questions/problems from human beings
    • Active discussion going on among SuzyD, DickS, MikeF
    • One line summary: Ticket system would be good, KB would be good, no labor resources available, planning on an email address, discussions continue
    • Support email address: dc-support at lsst.org
      • Scope: user support for the data challenges
      • support at lsst.org is too generic
      • all dc issues -- do not try to have user select "category" of issue
    • Bring in Ephibian?
      • both recommendations & implementation
  • Update on LSST Database Performance Tests Using SSDs (Arun/Jacek?)
    • LSST expects to manage some 50 billion (50×10^9) objects and 150 trillion (150×10^12) detections of these objects generated over the lifetime of the survey. This data will be managed through a database. The current baseline system consists of off-the-shelf open source database servers (MySQL) with custom code on top, all running in a shared-nothing MPP architecture.
    • To date, we have run numerous tests with MySQL to project the performance of the query load we expect to see on the production LSST system, including low volume, high volume, and super high volume queries (simple queries, full table scans, and correlations, respectively). Based on these tests, we estimated the hardware needed to support the expected load. All of these tests were done using spinning disks.
    • Having the opportunity to redo these tests with solid-state technology (solid state disks, or SSD) would allow us to understand potential savings and determine whether SSD could help us simplify the overall architecture of the system by approaching things in a “different” way than on spinning disk.
    • The tests we expect to run include the following (an illustrative query sketch follows this item):
      • Selecting a small amount of data from a very large table via clustered and non-clustered indexes (this is related to low volume queries).
      • Verifying whether we can achieve speed improvements for full table scans comparable to raw disk speed improvements seen when switching from spinning disk to SSD (this is related to high volume queries).
      • Testing an architecture that involves heavy use of indexes, including composite indexes, instead of full table scans for high volume queries.
      • Executing near-neighbor searches using indexes on subChunkId, without explicit subpartitioning.
    • We expect to run these tests using the USNO-B data set, which, including indexes and other overhead, fits in ~200 GB.
    • Status: waiting on accounts
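    • Illustrative queries for the four test cases above; a sketch only, using the MySQLdb client, with an assumed Object table and columns (objectId, ra, decl, mag, subChunkId) rather than the actual schema
      import MySQLdb

      conn = MySQLdb.connect(host="localhost", db="usnob_test", user="lsst")
      cur = conn.cursor()

      # 1. Low volume: a small selection through the clustered primary key.
      cur.execute("SELECT * FROM Object WHERE objectId BETWEEN %s AND %s",
                  (1000000, 1000100))

      # 2. High volume: a full table scan, to compare against the raw scan-speed
      #    gain seen when moving from spinning disk to SSD.
      cur.execute("SELECT COUNT(*), AVG(mag) FROM Object")

      # 3. Index-heavy alternative: a composite (decl, mag) index instead of a scan.
      cur.execute("SELECT objectId FROM Object WHERE decl BETWEEN -5 AND 5 AND mag < 20")

      # 4. Near neighbor via an index on subChunkId, without explicit subpartitioning.
      cur.execute("SELECT o1.objectId, o2.objectId "
                  "FROM Object o1 JOIN Object o2 USING (subChunkId) "
                  "WHERE o1.objectId < o2.objectId "
                  "AND ABS(o1.ra - o2.ra) < 0.0003 AND ABS(o1.decl - o2.decl) < 0.0003")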
  • Update on Lawrence Livermore database scalability testing (DanielW)
    • Description: LLNL has provided a number of nodes (currently 25) as a testbed for our scalable query processing system. Being able to test over many nodes allows us to understand where our query parallelism model succeeds and fails, and helps us develop a prototype that can handle LSST database query needs. So far, use of this many-node cluster has uncovered scalability problems in job control, threading, messaging overhead, and queuing, which we have been addressing incrementally in each new iteration (3 so far).
    • Status: developing and testing a new model since tests in Jan showed bottlenecks at >4 nodes
    • Hoping to get time on a 64 node cluster at SLAC
    • software will be installed on lsst10 after testing
    • [Jacek] New Resource: A 64-node cluster at slac (used to be for PetaCache tests), which we will be able to use for lsst related scalability tests (kind of permanently). Total of 128 CPUs, 1 TB of memory (16 GB per node), 2 TB of total local storage (34 GB per node).
  • Server Administration at NCSA
    • With the upcoming DC3b runs and the increased need for system reliability with the introduction of end-user access to our DC data, we're tightening up the processes and procedures related to the administration of the LSST servers at NCSA.
    • New email address: lsst-admin at ncsa.uiuc.edu
      • Scope: technical issues, questions, problems with the servers located at NCSA
    • Define Roles & Responsibilities
    • qserv to be installed on lsst10 after PT1
  • Using GPUs to Accelerate Database Queries (see the sketch at the end of this item)
    • [TimA] "I just ran across Accelerating SQL Database Operations on a GPU with CUDA, which is the first application I've seen of GPUs to SQL. I haven't read it carefully yet, but a quick skim suggests that they transfer tables of a few million rows into the GPU memory, transforming them into column form on the way. The model is that repeated SELECTs are done on these tables, so that the transfer time is unimportant. Optimistic, no doubt, but even including the transfer time they get speedups over 20X."
    • http://www.cs.virginia.edu/%7Eskadron/Papers/bakkum_sqlite_gpgpu10.pdf
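    • For reference, a toy CPU-side sketch (NumPy, not the paper's CUDA code) of the column-form plus repeated-SELECT idea described above; all data and column names are synthetic
      import numpy as np

      n = 1000000                                   # "a few million rows"
      rows = [(i, (i * 7) % 360, 15.0 + (i % 100) / 10.0) for i in range(n)]

      # One-time transform into column form (on a GPU this would be the
      # host-to-device copy that the paper amortizes over many queries).
      obj_id = np.array([r[0] for r in rows], dtype=np.int64)
      ra     = np.array([r[1] for r in rows], dtype=np.float32)
      mag    = np.array([r[2] for r in rows], dtype=np.float32)

      # Repeated SELECT-style filters against the resident columns, e.g.
      #   SELECT obj_id FROM t WHERE ra BETWEEN lo AND hi AND mag < 18
      for lo, hi in [(10, 20), (200, 210), (350, 355)]:
          mask = (ra >= lo) & (ra <= hi) & (mag < 18.0)
          matches = obj_id[mask]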

Notes

Attendees: RobynA, BillB, DanielW, RayP, KTL, JacekB, MikeF

  • Mike discussed the current situation and plans regarding SAN storage (in response to earlier queries by K-T)
  • Ray is in the process of removing obsolete data (to free up some space)
  • Mike will be sending a note proposing some directories that can be omitted from the Tivoli backups (to free up server resources)
  • (Jacek) REDDnet servers arrived at SLAC
  • (Jacek) The preparation of the SLAC secondary database server is progressing; Jacek will contact Mike when ready to start replicating
  • Switching the baseline from Lustre to GPFS; proposal to SAT or TCT; Mike sent a follow-up note to Robyn and Gregory

Useful Links
