wiki:DC3APMMtgDataProducts
Last modified on 08/08/2009 12:19:50 AM

Data Products Breakout

Back to DC3bScopingMeeting

DC3b scoping decisions

Part of DC3b:

  • Database schema updates "as needed", including introducing FaintSource and WCSSource, reworking the schema for the Object, Source, DIASource, and exposure tables, and adding calibration-related tables, MOPS-related tables, and a schema for synthetic sources of data.
  • defining units in database schema
  • release and serve the DC3b data products (2 catalogs: cfhtl and sim). Scope of serving: production purposes (including photometric and astrometric calibration) as well as post-run analysis. The expectation is that most runs will be small (on the order of DC3a size), with only a few production-scale runs much larger than DC3a.
  • Scalable architecture. A prototype will be used to obtain numbers for the PDR. An initial version with the smallest feasible feature set will be developed for DC3b production and post-production analysis.

Conditionally part of DC3b (if we find extra resources):

  • graphical user interfaces
  • basic VO functionality such as a footprint service
  • photo-z. We don't have the manpower to do it ourselves, but it would be great if someone from the science collaborations would help. If we don't find the manpower, this will be deferred to DC4

Not part of DC3b:

  • Need to look at operational considerations: serving the released data products (images and catalogs) in a DAC mode, possibly also abroad (e.g. in the UK)

Tasks for Database Schema

  • Schema updates for all source-related tables and the Object table [JB 5% 9/15/09-1/31/10]
  • Schema updates for all exposure-related tables [JB 5% 9/15/09-1/31/10]
  • synthetic sources of data [JB 10%, 11/01/09-11/30/09]
  • defining units [JB <1% 10/01/09 - 1/31/10, Tim]
  • standardizing names [JB 20%, 10/01/09-10/31/09]

Tasks for Scalable Database

Due dates account for the expected load from other DC3b, PDR, and non-LSST tasks, and include time for reviews and testing. Note that the prototype is not expected to have the same level of test coverage as actual DC3b code.

  • demonstrate near neighbor query on 2-3 nodes (query broken into subqueries, multiple databases, subqueries run in parallel, data partitioned by hand, no cross-server joins, results streamed to client, using xrootd, minimal tools for partitioning). [mostly Daniel, now - 08/14/2009]
  • evaluate hadoopDB [JB, now - 9/16/09, 12/01/09-12/31/09]
  • test mysql scalability limits and decide on architecture (sub-partitioning vs static 1-level partitioning with large number of small tables) [JB, now - 8/31/09]
  • documentation for PDR, including description of detailed architecture [JB 40%, 9/01/09-9/16/09]
  • decide on the communication layer (xrootd vs gearman) [KTL/DW/JB < 3%, due 8/20]
  • run near neighbor query on lsst10, test performance (includes porting to linux) [10% JB, 9/20-9/30]
  • partitioning
    • research htm/healpix/stomp/dif [10% SM, now-8/31]
    • understand how to partition MovingObject table [KTL/JB < 3%, 9/15-9/30]
  • implement generic query partitioner [30% SM, 9/1 - 10/31]
  • complete tools for partitioning data [20% SM 10% JB, 9/1 - 10/31]
  • basic, generic query parser in place ("select from where", supports queries that select single row, select from multiple partitions, join with non-partitioned tables) [25% DW, 9/5/09 - 10/31/09]
  • Demonstrate joining Object table with non-partitioned tables (via shared volume) [10% DW 10/20/09-10/31/09]
  • dc3b syntax parser and aggregation [11/1 - 12/31 25% DW, 10% SM]
    • syntax support [DW]
    • collating query results [DW]
    • aggregation (sum, average, group by, sort by, ... ) [DW]
    • query parser: support aggregation [DW], basic geometry support (bounding box and circle) [SM]
  • task manager [1/1/10 - 1/31/10]
    • decide how to present results to users [JB 5% 1/5/10-1/10/10]
    • receiving and running multiple queries simultaneously [5% KTL 10% DW 1/1/10-1/31/10]
    • interactions with user (location of query results, query status/cost?) [5% KTL 10% DW 1/1/10-1/31/10]
  • data release tools [10% JB, 20% DW, 2/15/10 - 03/15/10]
    • implement admin tools for releasing data (reindexing, marking read only, etc)
    • implement monitoring of the system (for troubleshooting/debugging)
  • documentation [KTL 10%, DW 10%, JB 10%, 04/01/10 - 04/15/10]
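The near-neighbor demonstration in the task list above (one query broken into subqueries, data partitioned by hand across independent databases, subqueries run in parallel, no cross-server joins) can be sketched in miniature. This is an illustrative toy, not the DC3b prototype: the table and column names (Object, ra, decl), the hand partitioning by decl, and the flat-sky 0.1-degree distance cut are all assumptions, and in-memory SQLite stands in for multiple MySQL servers.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

objects = [  # (objectId, ra, decl) -- toy values, not real DC3b data
    (1, 10.00, -5.00), (2, 10.05, -5.02),
    (3, 200.00, 40.00), (4, 200.05, 40.05),
]

def make_partition(rows):
    """One 'node': an independent database holding one decl slice."""
    db = sqlite3.connect(":memory:", check_same_thread=False)
    db.execute("CREATE TABLE Object (objectId INT, ra REAL, decl REAL)")
    db.executemany("INSERT INTO Object VALUES (?,?,?)", rows)
    return db

# Data partitioned "by hand": decl < 0 on node 0, decl >= 0 on node 1.
nodes = [make_partition([r for r in objects if r[2] < 0]),
         make_partition([r for r in objects if r[2] >= 0])]

# The one near-neighbor query, rewritten as an identical subquery that
# each node runs against its own partition (no cross-server join).
SUBQUERY = """
    SELECT a.objectId, b.objectId
    FROM Object a JOIN Object b ON a.objectId < b.objectId
    WHERE (a.ra - b.ra)*(a.ra - b.ra)
        + (a.decl - b.decl)*(a.decl - b.decl) < 0.01
"""

with ThreadPoolExecutor() as pool:
    results = pool.map(lambda db: db.execute(SUBQUERY).fetchall(), nodes)
    pairs = sorted(p for part in results for p in part)
print(pairs)  # near-neighbor pairs found within each partition
```

One consequence of partitioning is visible even in this toy: each subquery only finds pairs inside its own partition, so a real partitioner must replicate a thin overlap ("halo") of rows along partition edges to catch cross-boundary neighbors; that is omitted here.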

Issues

  • Serving released DC3b data products: unclear how much effort we should put into providing support for science collaborations once we release DC3b catalogs. This includes writing documentation, dealing with accounts, maintaining the system, helping with federating level 3 data products.
  • Scalable architecture: use it in DC3b or not? If we don't use it, queries will be very slow. See action items below
  • Schema updates. Need to decide how to move forward: in particular, need to resolve how to combine what we already have with the needs driven by the application C++ classes.
  • Topics not discussed; discussion needed:
    • provenance: what is part of DC3b?
    • bad pixel mask
    • sky pixelization: not sure if/what is needed
    • schema for synthetic sources of data
    • improving AP performance - in DC3b?
    • images: presequencing - in DC3b?
    • postage stamps cut out from science exposures - in DC3b?
    • serving non-DR data products? (like WCSObjects)

Actions

  • Database schema related
    • decide how to move forward with major pending schema update [coord by Jacek, due 05/31]
    • defining units in database schema - coordinate input from various app people [Tim, due 07/15]
    • decide which representation is used in cases where there are multiple options (example: orbital elements). Requires following up with the application team and science collaborations. [Tim, due ~06/30?]
    • IAU - tentatively decided to remove from Object schema. Check with science collaborations [Tim, 06/30]
    • examine implications of object id creation scheme (start with 1) [Jacek, 09/30]
    • ask galaxy and weak lensing sciCollabs about representation of photo z probability distribution function [David Wittman, 06/15]
  • serving released DC3b data products
    • decide how much effort we are putting into support [Jeff/Tim?, 06/15]
    • document assumptions made related to performance one should expect when analysing released DC3b data products [Jacek 07/10]
  • scalable architecture
    • Post to Docushare two documents discussing input data. Roc's document covers mostly images, Tim's document covers mostly catalogs. [Tim and Roc, 05/26]
    • do detailed size analysis [Jacek, 6/10]
    • decide if we use scalable architecture in DC3b (would speed up post-DC3b queries) [Jacek, 6/15]
    • arrange hardware needed to run DC3b and serve DC3b data products (extra storage on lsst10?, more servers?). [Jacek, 02/27]
      • check if SLAC hardware can be reused [Gregory, 6/15]
      • check if SLAC hardware (Sun) can be shipped to NCSA and hosted there [Gregory + Mike Freeman, 6/30]
  • check with UK collaborators regarding UI / VO interfaces [Jacek, 5/31]
  • discuss the "topics not discussed" listed under "issues" above [Jacek, 6/10]

Further details

  • Releasing and serving data
    • Ultimately want to officially do data release (static set of tables)
      • sizes discussed at science plenary II
    • Maybe two releases: cfht and sim separate
    • Expect to put out a release, and continue working on a different 'experimental' release
      • want to pass a release to science collaborations in May 2010
      • Released DC3b data products should have dedicated hardware
      • "unspoken" assumption - NCSA
    • If DC3b is successful, we should expect complex, big queries
    • Level 3? Yes, people will try.
      • It is not our responsibility
      • we should help those who want to try it
  • Pluggable classifiers? No, it is beyond the DC3b scope
  • user interfaces?
    • plain SQL
    • fancy graphical UIs are nice, but not required
    • a simple VO-compliant footprint service would be very useful
      • users will want to know if the data covers a certain footprint
      • might be useful to implement a window function (how do I see what area of the sky this data touches?)
      • work with Ray, might involve UK people
  • different representations of the same thing
    • certain things can be represented in many ways
    • we will support one, plus conversion tools
    • for example, in the case of orbital elements, we expect to get help from solar system science, converge on a standard, and provide a tool to convert to other representations
      • possibly might need to carry full covariance to be able to convert
    • another example: shears vs. moments
    • Tim will follow up with science collaborators
  • It is inevitable we will need to add new columns to Object table
  • IAU (char[34] in Object table)
    • this does not work
    • the name would have to be updated if we change the astrometric solution
    • it would require linking objects across releases
      • We are not going to link objects between releases. Not practical.
    • proposing to drop this
      • need to keep Kirk in the loop, he proposed to add this
    • action item: submit proposal (drop IAU id) to science collab (Tim)
  • object ids:
    • Tim proposes sequential numbering (starts with 1)
    • need to examine implications (e.g., is it possible to build object ids when we run in parallel?)
  • multiple model support in Object table
    • fitting at least two models for each object
      • different positions for different models
        • Robert strongly disagrees, we need single position so we can compare
  • per Robert, schema should be driven by application C++ classes
  • need to write down what the app C++ classes will measure, and use that to drive the schema design
  • bottom up design, via emails on lsst-data and app telecon
  • Suzy worries about workload on Robert.
  • timescale? by the 1st/2nd week of June we should have the DC3b scope translated into a C++-driven schema [Robert to lead]
  • photoZ
    • in dc3b? No
    • 500 numbers per object to represent photo z probability distribution function
      • compressing is possible
      • action item: post to galaxy and weak lensing sciCollabs [David Wittman, expects to get the answers in ~2 weeks]
  • FaintSource schema
    • sky background could probably be computed
    • but we need fluxSigma
    • yes, it makes sense to split per filter
  • DR and nightly science exposure: any difference?
    • science exposures produced by DR processing (nightly and coadds) are the same
    • but science exposures produced by the nightly pipeline at base camp differ from the DR/coadd ones
  • Alert table
    • Are alerts connected with a particular data release?
    • likely not, so the table can be static
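On the open question above of whether object ids can be built when we run in parallel: one standard answer, sketched here as an assumption rather than a decided design, is block allocation. Each parallel loader atomically reserves a contiguous block of ids from a shared counter, so ids are unique, start at 1 as in Tim's proposal, and need no per-object coordination. BlockIdAllocator and the block size are illustrative names and values, not DC3b code.

```python
import threading

class BlockIdAllocator:
    """Hand out contiguous id blocks; ids start at 1 (Tim's proposal)."""
    def __init__(self, block_size=1000):
        self.next_start = 1
        self.block_size = block_size
        self.lock = threading.Lock()

    def reserve_block(self):
        """Atomically reserve the next [start, start+block_size) range."""
        with self.lock:
            start = self.next_start
            self.next_start += self.block_size
        return iter(range(start, start + self.block_size))

alloc = BlockIdAllocator(block_size=4)   # tiny block for the demo
assigned = []

def loader(n_objects):
    """A parallel loader: reserve one block, assign ids from it."""
    ids = alloc.reserve_block()
    for _ in range(n_objects):
        assigned.append(next(ids))       # list.append is atomic in CPython

threads = [threading.Thread(target=loader, args=(3,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()

# Two loaders, blocks [1..4] and [5..8], 3 ids used from each:
# all ids unique and starting at 1, but unused block tails leave gaps.
print(sorted(assigned))
```

The trade-off to examine: unused tails of reserved blocks leave gaps, so ids are unique and monotone within a loader but not dense, which matters if anything downstream assumes id equals row count.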