Last modified on 10/21/2010 05:05:09 PM

Organizing Data Products in DC3b

The aim of this document is to specify an organization of DC3b data products (input, output, intermediate, temporary, etc.) in terms of a directory structure.

While data access software will have minimal hardcoded assumptions about the location of data, a consistent directory structure is very helpful to real people: those creating configuration (policy) files, those preparing input data, and those reviewing the results. With a consistent structure, for many important kinds of access a person can determine a dataset's location from its most important metadata, without needing a special tool or database lookup.

A consistent structure does not imply a central location (i.e. a single directory) under which all data resides on a machine. Rather, parts of our total holdings will exist in a variety of locations and on a variety of machines, depending on the context. A consistent structure does mean that, given a subset of the total archive, the organization within that subset will match that of every other instance of that subset. The whole defined directory tree need not appear in any given instance of a subset.

Note: This document is initially being presented as a "strawman" proposal. Please enter your objections and suggestions into this page.

High-level Organization and General Pattern

In general, the characteristics that will be used to form the full path to a data file fall into three categories:

collection names
groups reflecting the data's overall use in DM activity. Examples include the LSST simulated data, the CFHT-LS data, etc.
data product types
the kind of data contained in the dataset from an astronomical point of view. Examples include raw images, difference images, psf-matched kernels, single-frame measurement source lists, etc.
data identifiers
names that uniquely identify some specific attribute of the data.

Datasets will be organized into hierarchical collections--reflected by the directory structure--according to these classes of characteristics. The full path (minus some assumed root directory) can be used as a unique, logical identifier for a dataset. Thus, when a portion of the archive data is moved to another location, one should preserve the logical identifier in one of the following ways:

  • preserving the entire directory structure (minus the assumed root directory); recommended.
  • collapsing a portion of the directory structure into a single directory where slashes (/) are replaced with dashes (-).
    • for example, collapsing upper directories into one provides a logically named data collection, e.g. CFHTLS-D3-raw-v39213-fg is a directory containing an entire focal plane exposure.
  • collapsing a trailing portion of the path--the filename and some of its immediate parent path--into a filename where slashes (/) are replaced with dashes (-).
    • for example, the web service delivering individual images with lower directories collapsed into the file name: v111392.3-fz-r32-c3-a07.fits or diffim-v707493-fi-c023.fits
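The two collapsing rules above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not part of any DM package: the function names collapse_leading and collapse_trailing are hypothetical, and the first sample path is made up for the demonstration.

```python
def _split(logical_path):
    """Split a logical identifier (path minus the assumed root) into components."""
    parts = [p for p in logical_path.split("/") if p]
    if not parts:
        raise ValueError("empty logical path")
    return parts


def collapse_leading(logical_path, levels):
    """Collapse the first `levels` components into a single directory name."""
    parts = _split(logical_path)
    if not 1 <= levels <= len(parts):
        raise ValueError("levels out of range")
    return "/".join(["-".join(parts[:levels])] + parts[levels:])


def collapse_trailing(logical_path, levels):
    """Collapse the last `levels` components (filename included) into one filename."""
    parts = _split(logical_path)
    if not 1 <= levels <= len(parts):
        raise ValueError("levels out of range")
    keep = parts[:len(parts) - levels]
    return "/".join(keep + ["-".join(parts[-levels:])])


# Hypothetical sample path, chosen to resemble the collapsed directory example.
print(collapse_leading("CFHTLS/D3/raw/v39213-fg/s0/c01-a0.fits", 4))
print(collapse_trailing("datarel/diffim/v707493-fi/c023.fits", 3))
```

Because identifier directories such as v707493-fi already contain dashes, a collapsed name cannot in general be expanded back into the full path without knowing the format for its data type.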

In the next section, we detail the specific form of the directory path for each collection and data type path known at this time. In general, the form of the path will reflect the following guidelines:

  • the first directory in the full path for input data collections will be "obs". The second directory will be the general collection name.
    • input data collections: obs/ImSim, obs/CFHTLS
  • the first directory in the full path for pipeline product collections will be the general collection name.
    • pipeline product collections: datarel, alertgen
  • the general collection name may be followed by one or more appropriate sub-collection names
    • Example: obs/CFHTLS/D4. This reflects the fact that the data from each deep field can be treated as an independent collection.
    • Example: obs/CFHTLS/calib. Calibration data get their own sub-collection because their association with the raw data they are applied to is not obvious, and they may be categorized by a different set of identifiers.
  • The collection levels are followed by a subdirectory identified by its data type name.
    • Examples: obs/ImSim/raw, obs/CFHTLS/D4/raw, datarel/cal, datarel/postisr, datarel/psfkernel, etc.
  • The remaining directory levels are named after the identifier values.
    • The directory name should start with one to three letters representing the type of identifier, followed by the value of the identifier.
      Known identifiers:
      • v: visit id
      • s: snapshot id (0 or 1)
      • R: raft id
      • S: sensor id
      • c: CCD id
      • C: channel id
      • a: amplifier id
      • tA: skytile id using the "A" tiling system
    • the most general identifiers--that is, the ones that, in general, sensibly aggregate more data--should appear first. The most specific identifier may be incorporated into the dataset filename.
    • Generally, the identifier directory levels should correspond to those identifiers that form a "primary key"--i.e. that are sufficient to uniquely identify the dataset of a given type. Other identifiers that are not part of the "primary key" but are otherwise useful for identifying the contents of the dataset (e.g. the filter id) may either be incorporated into the name of the highest directory where that identifier is applicable or into the filename.
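As a rough sketch of how these guidelines combine, the following Python fragment assembles a path from a collection name, a data type name, and a list of identifier levels. The prefix letters come from the "Known identifiers" list; the "f" prefix for the filter id is taken from the format strings later in this document. The dictionary keys (such as "visit" and "skytileA") and the function names are hypothetical, chosen only for illustration.

```python
# Prefix letters from the "Known identifiers" list. The "filter" entry is an
# assumption based on the "-f%(filterid)" suffix used in the format strings.
ID_PREFIXES = {
    "visit": "v", "snapshot": "s", "raft": "R", "sensor": "S",
    "ccd": "c", "channel": "C", "amplifier": "a", "skytileA": "tA",
    "filter": "f",
}


def level_name(ids):
    """Name one level from (kind, value) pairs, joined by dashes."""
    return "-".join(f"{ID_PREFIXES[kind]}{value}" for kind, value in ids)


def build_path(collection, dtype, levels, extension):
    """Join collection, data type, and identifier levels into a full path.

    `levels` is ordered from most general to most specific; the last
    level is folded into the filename, as the guidelines allow.
    """
    names = [level_name(ids) for ids in levels]
    if not names:
        raise ValueError("at least one identifier level is required")
    return "/".join([collection, dtype] + names[:-1] + [names[-1] + extension])


path = build_path(
    "obs/CFHTLS/D1", "raw",
    [[("visit", 707493), ("filter", "i")],
     [("snapshot", 0)],
     [("ccd", "023"), ("amplifier", "07")]],
    ".fits",
)
print(path)
```

Here the filter id, which is not part of the "primary key", rides along on the visit level, which is the highest directory where it applies.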

Individual Collection specifications

Input Data Collections

LSST Simulation Data: ImSim

  • general collection name: obs/ImSim
  • no sub-collections
  • data type names: raw, dark, flat, bias, fringe
  • unlike other dataset types, some ID information will be replicated in the filename

Raw Images: raw

  • format: obs/ImSim/raw/v%(visitid)-f%(filterid)/E%(snapid)/R%(raftid)/S%(sensorid)/imsim_%(visitid)_R%(raftid)_S%(sensorid)_C%(channelid)_E%(snapid).fits
    • Note the ImSim team's use of underscores, redundant identifiers, and a trailing snapid in the filename.
    • raftid -- raft name: "(x)(y)" (specified to data butler as "(x),(y)" e.g. "3,2")
    • sensorid -- CCD name within raft: "(x)(y)" (butler = "(x),(y)" e.g. "1,2")
    • channelid -- amplifier name within CCD: "(y)(x)" (butler = "(y),(x)" e.g. "0,4")
    • snapid -- "000" or "001" (butler = 0 or 1)
  • example: obs/ImSim/raw/v85751839-fr/E000/R23/S11/imsim_85751839_R23_S11_C00_E000.fits
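The mapping from butler-style identifiers to the ImSim raw path can be sketched as below. This is an illustrative helper only: the function name imsim_raw_path and its argument names are hypothetical, and it assumes the butler passes raft, sensor, and channel as comma-separated strings and the snap as an integer, as described in the notes above.

```python
def imsim_raw_path(visit, filter_id, snap, raft, sensor, channel):
    """Build the ImSim raw image path from butler-style identifiers.

    raft, sensor, channel: comma-separated strings such as "2,3"
    snap: integer 0 or 1
    """
    raftid = raft.replace(",", "")        # "2,3" -> "23"
    sensorid = sensor.replace(",", "")    # "1,1" -> "11"
    channelid = channel.replace(",", "")  # "0,0" -> "00"
    snapid = f"{snap:03d}"                # 0 -> "000"
    directory = (
        f"obs/ImSim/raw/v{visit}-f{filter_id}/E{snapid}/R{raftid}/S{sensorid}"
    )
    # The filename repeats the visit, raft, sensor, and snap identifiers.
    filename = f"imsim_{visit}_R{raftid}_S{sensorid}_C{channelid}_E{snapid}.fits"
    return f"{directory}/{filename}"


print(imsim_raw_path(85751839, "r", 0, "2,3", "1,1", "0,0"))
```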

Filter-independent calibration images: dark, bias, mask

  • format: obs/ImSim/%(dtype)/v%(visitid)/E%(snapid)/R%(raftid)/S%(sensorid)/imsim_%(number)_R%(raftid)_S%(sensorid)_C%(channelid)_E000.fits
    • dtype -- the data type name: dark, bias, or mask
    • Note that currently (and likely throughout DC3b), there will be only one set of calibration data for the entire collection; thus this degenerates to
      • obs/ImSim/bias/v0/E000/R%(raftid)/S%(sensorid)/imsim_0_R%(raftid)_S%(sensorid)_C%(channelid)_E000.fits, and
      • obs/ImSim/dark/v1/E000/R%(raftid)/S%(sensorid)/imsim_1_R%(raftid)_S%(sensorid)_C%(channelid)_E000.fits
  • example: obs/ImSim/dark/v1/E000/R23/S11/imsim_1_R23_S11_C00_E000.fits

Filter-dependent calibration images: flat, fringe

  • format: obs/ImSim/%(dtype)/v%(visitid)-f%(filterid)/E%(snapid)/R%(raftid)/S%(sensorid)/imsim_%(number)_R%(raftid)_S%(sensorid)_C%(channelid)_E000.fits
    • dtype -- the data type name: flat or fringe
    • as with the bias and dark images, this pattern will degenerate for DC3b into:
      • obs/ImSim/flat/v2/E000/R%(raftid)/S%(sensorid)/imsim_2_R%(raftid)_S%(sensorid)_C%(channelid)_E000.fits, and
  • example: obs/ImSim/flat/v2-fr/E000/R23/S11/imsim_2_R23_S11_C00_E000.fits

CFHT Legacy Survey

  • general collection name: obs/CFHTLS
  • eight sub-collections for target observations: D1, D2, D3, D4 (deep fields) and W1, W2, W3, W4 (wide fields)
  • one sub-collection for calibration data: calib

Raw Images: raw

  • format: obs/CFHTLS/%(field)/raw/v%(visitid)-f%(filterid)/s%(snapid)/c%(ccdid)-a%(ampid).fits
  • example: obs/CFHTLS/D1/raw/v707493-fi/s0/c023-a07.fits
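Since the path doubles as a logical identifier, the identifiers can also be recovered from it. The following sketch parses a CFHTLS raw image path with a regular expression. It assumes that identifier values contain no slashes and that the visit and CCD ids contain no dashes; the names CFHTLS_RAW and parse_cfhtls_raw are hypothetical.

```python
import re

# One named group per placeholder in the CFHTLS raw format string.
CFHTLS_RAW = re.compile(
    r"^obs/CFHTLS/(?P<field>[^/]+)/raw/"
    r"v(?P<visitid>[^/-]+)-f(?P<filterid>[^/]+)/"
    r"s(?P<snapid>[^/]+)/"
    r"c(?P<ccdid>[^/-]+)-a(?P<ampid>[^/.]+)\.fits$"
)


def parse_cfhtls_raw(logical_path):
    """Return the identifiers encoded in a CFHTLS raw path as strings."""
    match = CFHTLS_RAW.match(logical_path)
    if match is None:
        raise ValueError(f"not a CFHTLS raw path: {logical_path}")
    return match.groupdict()


ids = parse_cfhtls_raw("obs/CFHTLS/D1/raw/v707493-fi/s0/c023-a07.fits")
print(ids)
```

All values are returned as strings so that leading zeros (as in the CCD id "023") are preserved.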

Filter-independent calibration images: dark, bias, mask

  • format: obs/CFHTLS/calib/%(dtype)/v%(dateid)/c%(ccdid)-a%(ampid).fits
    • dtype -- the data type name: dark, bias, or mask
    • dateid -- the run identifier; this is not represented by an actual date but corresponds to a time range when data was acquired.
    • ccdid and ampid are collapsed into the filename within a single visit directory since there are only 2 amps per CCD.
  • example: obs/CFHTLS/calib/bias/v05Bm02/c03-a1.fits

Filter-dependent calibration images: flat, fringe

  • format: obs/CFHTLS/calib/%(dtype)/v%(dateid)-f%(filterid)/c%(ccdid)-a%(ampid).fits
    • dtype -- the data type name: flat or fringe
    • filterid -- "u", "g", "r", "i", "i2", "z", without trailing "MP" specification
  • example: obs/CFHTLS/calib/flat/v04Bm01-fz/c03-a1.fits
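The difference between the filter-independent and filter-dependent calibration formats can be captured in one helper, sketched here. The function name cfhtls_calib_path and its signature are hypothetical; the only logic is the choice of whether the filter id is appended to the visit directory.

```python
FILTER_INDEPENDENT = {"dark", "bias", "mask"}
FILTER_DEPENDENT = {"flat", "fringe"}


def cfhtls_calib_path(dtype, dateid, ccdid, ampid, filterid=None):
    """Build a CFHTLS calibration path for the given data type."""
    if dtype in FILTER_INDEPENDENT:
        if filterid is not None:
            raise ValueError(f"{dtype} images do not take a filter id")
        visit_dir = f"v{dateid}"
    elif dtype in FILTER_DEPENDENT:
        if filterid is None:
            raise ValueError(f"{dtype} images require a filter id")
        visit_dir = f"v{dateid}-f{filterid}"
    else:
        raise ValueError(f"unknown calibration data type: {dtype}")
    return f"obs/CFHTLS/calib/{dtype}/{visit_dir}/c{ccdid}-a{ampid}.fits"


print(cfhtls_calib_path("bias", "05Bm02", "03", "1"))
print(cfhtls_calib_path("flat", "04Bm01", "03", "1", filterid="z"))
```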

Data Release Product Collection (for DM-Only Run-Specific Output)

  • general collection name: datarel-runs
  • sub-collection name: %(run-id)

Data Release Product Collection (for Publicly Released Output)

Post-ISR image

  • format: datarel/postISR/v%(visitid)-f%(filterid)/s%(snapid)/R%(raftid)-S%(sensorid)-C%(channelid).fits
  • example: datarel/postISR/v707493-fi/s0/R23-S11-C07.fits

Post-ISR CCD image

  • format: datarel/postISRCCD/v%(visitid)-f%(filterid)/s%(snapid)/R%(raftid)-S%(sensorid).fits
  • example: datarel/postISRCCD/v707493-fi/s0/R23-S11.fits

Visit image

  • format: datarel/visitim/v%(visitid)-f%(filterid)/R%(raftid)-S%(sensorid).fits
  • example: datarel/visitim/v707493-fi/R23-S11.fits

Calibrated exposure

  • format: datarel/calexp/v%(visitid)-f%(filterid)/R%(raftid)-S%(sensorid).fits
  • example: datarel/calexp/v707493-fi/R23-S11.fits

SFM Source list

  • format: datarel/src/v%(visitid)-f%(filterid)/R%(raftid)-S%(sensorid).boost
  • example: datarel/src/v707493-fi/R23-S11.boost

PSF-matching kernel

  • format: datarel/PSFmatch/v%(visitid)-f%(filterid)/c%(ccdid).fits
  • example: datarel/PSFmatch/v707493-fi/c023.fits

Template Co-Add

  • format: datarel/tempcoadd/tA%(skytile).fits
  • skytile id format TBD
  • example: datarel/tempcoadd/someID.fits

Difference image

  • format: datarel/diffim/v%(visitid)-f%(filterid)/c%(ccdid).fits
  • example: datarel/diffim/v707493-fi/c023.fits

Difference source list

  • format: datarel/difflist/v%(visitid)-f%(filterid)/c%(ccdid).lst
  • example: datarel/difflist/v707493-fi/c023.lst

Associated Moving Source set

  • format: datarel/movsource/v%(visitid)-f%(filterid)/c%(ccdid).lst
  • example: datarel/movsource/v707493-fi/c023.lst

Masked exposure

  • format: datarel/maskexp/v%(visitid)-f%(filterid)/c%(ccdid).fits
  • example: datarel/maskexp/v707493-fi/c023.fits

Detection Co-add

  • format: datarel/detcoadd/tA%(skytile).fits
  • skytile id format TBD
  • example: datarel/detcoadd/someID.fits

Deep Detection List

  • format: datarel/detlist/tA%(skytile).lst
  • skytile id format TBD
  • example: datarel/detlist/someID.lst

Astrometric model

  • format: datarel/astromet/tA%(skytile).lst
  • skytile id format TBD
  • Note: this description assumes that a single file will be produced, rather than a separate one for each source, although that's not yet decided.

New object list

  • format: datarel/nobjlist/tA%(skytile).lst
  • skytile id format TBD
  • example: datarel/nobjlist/someID.lst

Forced DIASource list

  • format: datarel/fdia/v%(visitid)-f%(filterid)/c%(ccdid).lst
  • example: datarel/fdia/v707493-fi/c023.lst
  • Note: this description assumes that a single file will be produced, rather than a separate one for each source, although that's not yet decided.

Forced source list

  • format: datarel/fsource/tA%(skytile).lst
  • skytile id format TBD
  • example: datarel/fsource/someID.lst
  • Note: this description assumes that a single file will be produced, rather than a separate one for each source, although that's not yet decided.

Final objects (ObjectAssoc)

  • format: datarel/object/tA%(skytile).lst
  • skytile id format TBD
  • example: datarel/object/someID.lst
  • Note: this description assumes that a single file will be produced, rather than a separate one for each object, although that's not yet decided.

Final objects (PhotoCal)

  • format: datarel/pcalobj/tA%(skytile).lst
  • skytile id format TBD
  • example: datarel/pcalobj/someID.lst
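The datarel format strings listed in this section could be kept in a small registry and expanded on demand. The sketch below is one possible approach and not an existing DM interface: the names DATAREL_TEMPLATES, required_ids, and datarel_path are hypothetical, and only a few of the formats are copied in. Note that the placeholders are written as %(name) without a conversion character, so the sketch substitutes them with a regular expression rather than Python's % operator.

```python
import re

# A few format strings copied from this section, keyed by data type name.
DATAREL_TEMPLATES = {
    "postISR": "datarel/postISR/v%(visitid)-f%(filterid)/s%(snapid)/"
               "R%(raftid)-S%(sensorid)-C%(channelid).fits",
    "calexp": "datarel/calexp/v%(visitid)-f%(filterid)/R%(raftid)-S%(sensorid).fits",
    "diffim": "datarel/diffim/v%(visitid)-f%(filterid)/c%(ccdid).fits",
    "tempcoadd": "datarel/tempcoadd/tA%(skytile).fits",
}

PLACEHOLDER = re.compile(r"%\((\w+)\)")


def required_ids(dtype):
    """List the identifiers a data type's format needs, in order of appearance."""
    seen = []
    for name in PLACEHOLDER.findall(DATAREL_TEMPLATES[dtype]):
        if name not in seen:
            seen.append(name)
    return seen


def datarel_path(dtype, **ids):
    """Expand the format for `dtype`, raising KeyError on missing identifiers."""
    missing = [name for name in required_ids(dtype) if name not in ids]
    if missing:
        raise KeyError(f"missing identifiers for {dtype}: {missing}")
    return PLACEHOLDER.sub(lambda m: str(ids[m.group(1)]), DATAREL_TEMPLATES[dtype])


print(required_ids("diffim"))
print(datarel_path("diffim", visitid=707493, filterid="i", ccdid="023"))
```

Deriving the required identifiers from the format string keeps the list of identifiers and the path layout from drifting apart.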