wiki:ImSimDataManagement
Last modified on 03/23/2010 10:20:12 AM

Goals and Scope

Arun is setting up an iRODS infrastructure for the management of ImSim data for DC3b.

Four sites will compose the iRODS network: UWash, Purdue, SLAC, and NCSA.

The scope of data for iRODS is ImSim data only. See picture at right.

Site Contacts

  • NCSA: Mike Freemon
  • SLAC: Stuart Marshall (Primary), Garrett Jernigan (Secondary)
  • UWash: Andy Connolly
  • Purdue: John Peterson
  • Email list: lsst-imsimdata at lsstcorp dot org

Mar 10 Meeting

Agenda

  • definition of the data flow / replication rules
  • what are the resources needed at each site?
    • servers - how many? dedicated? specs?
    • storage - how many TBs at each site?
    • integration with mass storage - authentication?
  • details regarding catalog data format (mysqldump format?); how to import into primary database?
  • directory structures? need to define?
  • date/time of followup meeting?

Notes

  • Attendees: AndyC, JohnP, ArunJ, JacekB, K-T, GarrettJ, StuartM, MikeF
  • PT1 starts May 1
  • Reviewed and discussed in detail Arun's email of Feb 25 (which contained a number of technical and sizing questions)
  • File estimates (from my admittedly rough notes)
    • 24 million files (per amp) at 1MB each
    • 1.4 million files (per chip) at 4-5MB each
    • 7000 trim files at 250MB each
    • 7000 derived catalog files at 100MB each
  • Central database to be located on the lsst2 server
  • Arun does not need root on servers
  • Each participating server should have enough storage to buffer a few days' worth of data production (around 1TB) and should have as fast an external network connection as possible. Otherwise, there are no special CPU or memory requirements.
  • 1TB of storage should be sufficient for the lsst2 server (the iRODS server at NCSA)
  • New email list
  • Definition of directory structure happening via lsst-data list
  • Will continue via email, no conference call to be scheduled
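
The file estimates in the notes above can be turned into rough storage totals. The sketch below is only back-of-the-envelope arithmetic on those four figures. It assumes decimal units (1 TB = 1,000,000 MB), which the notes do not specify, and the names `ESTIMATES` and `totals_tb` are illustrative, not part of any project tooling.

```python
# Rough storage totals from the Mar 10 file estimates.
# Assumption: decimal units (1 TB = 1,000,000 MB); the notes do not say.
MB_PER_TB = 1_000_000

# (label, file count, low size in MB, high size in MB)
ESTIMATES = [
    ("per-amp files", 24_000_000, 1, 1),
    ("per-chip files", 1_400_000, 4, 5),
    ("trim files", 7_000, 250, 250),
    ("derived catalog files", 7_000, 100, 100),
]


def totals_tb(estimates):
    """Return (low, high) total size in TB across all file classes."""
    low = sum(count * lo for _, count, lo, _ in estimates) / MB_PER_TB
    high = sum(count * hi for _, count, _, hi in estimates) / MB_PER_TB
    return low, high


if __name__ == "__main__":
    for label, count, lo, hi in ESTIMATES:
        low_tb = count * lo / MB_PER_TB
        high_tb = count * hi / MB_PER_TB
        print(f"{label}: {low_tb:.2f} to {high_tb:.2f} TB")
    low, high = totals_tb(ESTIMATES)
    print(f"total: {low:.2f} to {high:.2f} TB")
```

Under these assumptions the per-amp files dominate, at about 24 TB out of a total of roughly 32 to 33.5 TB.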

Feb 23 Meeting

Agenda

  • discussion of the data flow / replication rules
  • what are the resources needed at each site?
    • servers - dedicated? specs?
    • storage - how many TBs at each site?
    • integration with mass storage - authentication?
  • contact info for each site?
  • details regarding catalog data format (mysqldump format?); how to import into primary database?
  • directory structures? need to define?
  • date/time of followup meeting?

Notes

  • Attendees: JacekB, AndyC, ArunJ, JohnP, K-T, GarrettJ, StuartM, MikeF
  • Discussed scope (see above)
  • ArunJ is leading the technical implementation of the iRODS infrastructure
  • Defined site contacts (see above)
  • Basic flow
    • Catalogs originate at UWash and are transferred to Purdue and SLAC (and others such as TeraGrid, Google)
    • Simulation runs at Purdue, SLAC, and elsewhere
    • Images transferred from Purdue, SLAC to NCSA
    • Catalog data transferred from UWash to NCSA
  • Common (centralized) iCAT database, or decentralized? Arun to define this.
  • iRODS needs to ensure integrity of the data
  • ArunJ will gather input and info (flows, capacity sizing, bandwidth sizing) and provide infrastructure requirements (number of servers, server specifications) to the site contacts
  • MikeF will raise the directory structure issue
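
The basic flow from the Feb 23 notes can be written down as a small list of transfers, which may help when the replication rules are defined. This is an illustrative sketch only: it covers just the four named sites (the "others such as teragrid, google" and "elsewhere" are left out), it is not iRODS rule syntax, and the names `FLOWS` and `inbound` are hypothetical.

```python
from collections import defaultdict

# (source, destination, data type) transfers from the Feb 23 "Basic flow" notes.
# Only the four named sites are included; other simulation sites are omitted.
FLOWS = [
    ("UWash", "Purdue", "catalogs"),
    ("UWash", "SLAC", "catalogs"),
    ("Purdue", "NCSA", "images"),
    ("SLAC", "NCSA", "images"),
    ("UWash", "NCSA", "catalogs"),
]


def inbound(flows):
    """Map each destination site to a sorted list of (source, data type) pairs."""
    result = defaultdict(list)
    for src, dst, kind in flows:
        result[dst].append((src, kind))
    return {site: sorted(pairs) for site, pairs in result.items()}


if __name__ == "__main__":
    for site, pairs in sorted(inbound(FLOWS).items()):
        for src, kind in pairs:
            print(f"{site} <- {src}: {kind}")
```

Grouping by destination shows that NCSA receives data from all three other sites, so its inbound bandwidth and storage are the ones that need sizing first.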