wiki:Dc3abMetadataDiscussion
Last modified on 05/28/2009 04:20:28 PM

Notes on metadata breakout 2009-05-20

The agenda was:

  • Metadata and FITS header handling and member variables (still)
  • Update of database schema to reflect science
  • Provenance:

    o Processing history information (part of provenance or separate)?
    o Apps access to provenance

  • Coordinate system religious war (XY0/subimages, trimming, WCS, FITS)
  • Focal plane, CCD, and amplifier geometry "database" and C++ instantiation
  • CCD properties database (gain, defects, etc.) and C++ instantiation
  • Validity date ranges of CFHT defect lists
  • RADECSYS and EQUINOX need to be specified by astrometry.net data

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

  • Metadata and FITS header handling and member variables (still)
  1. We discussed the different things that we call "metadata":

    a. Information describing the inputs that is, at least historically, often provided as keyword/value metadata.

The canonical example is an image's Wcs.

    b. Information associated with the pipeline inputs that needs to be somehow associated with the generated outputs.

An example would be camera-specific keywords in an input FITS file. It is possible that there is some other way that this information is tracked, but if not, an approach would be to ingest it into keyword/value pairs and pass it unchanged through the pipeline. If this data is available directly, associating the outputs with the input information could be an aspect of provenance.

    c. Information about how the processing proceeded. This comes in two flavours:

i/ Status information at a finer granularity than a stage. For example, we currently set bitflags for every Source; the ip_diffim code would like to provide summary statistics for every Footprint used to determine a Kernel.

ii/ Processing status and summary statistics (often of interest to the QA process) at the granularity of a stage, and thus typically that of a segment or CCD, but possibly the entire focal plane.

There are also log messages, and anything that belongs in this second category could just be put out to the logs and harvested later; but this doesn't seem like a wise design.

The first of these categories, (a), consists of information that is expected to be present in order to process the data, and whose values must be interpreted in order for the processing to proceed. Again, Wcs provides an example: the fact that there's a keyword/value pair "CRPIX1/100" is of no interest; what matters is that we are able to build a class, image::Wcs, that can map pixel to world coordinates. Such objects are analogous to properties of the data, such as the PSF, that we measure as the pipeline proceeds.
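
As a minimal sketch of the distinction drawn above, the following C++ fragment interprets a handful of keyword/value pairs once and builds an object that maps pixel to world coordinates. The names SimpleLinearWcs, KeywordMap, requireKeyword, and makeWcs are hypothetical, and the purely linear model (CRPIX/CRVAL/CDELT, with no projection or rotation) is an assumption made for illustration; this is not the image::Wcs interface.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical, purely linear WCS: world = CRVAL + CDELT * (pixel - CRPIX).
// A real Wcs handles projections and rotation; this only illustrates
// interpreting the keywords once and then working with a typed object.
class SimpleLinearWcs {
public:
    SimpleLinearWcs(double crpix1, double crpix2, double crval1, double crval2,
                    double cdelt1, double cdelt2)
        : _crpix1(crpix1), _crpix2(crpix2), _crval1(crval1), _crval2(crval2),
          _cdelt1(cdelt1), _cdelt2(cdelt2) {}

    // Map a pixel position to world coordinates. The pixel position must use
    // the same indexing convention as the CRPIX values passed in.
    std::pair<double, double> pixelToWorld(double x, double y) const {
        return std::make_pair(_crval1 + _cdelt1 * (x - _crpix1),
                              _crval2 + _cdelt2 * (y - _crpix2));
    }

private:
    double _crpix1, _crpix2, _crval1, _crval2, _cdelt1, _cdelt2;
};

// Assumed stand-in for keyword/value metadata with numerical values only.
typedef std::map<std::string, double> KeywordMap;

// Look up a keyword that processing cannot proceed without.
inline double requireKeyword(const KeywordMap& md, const std::string& key) {
    KeywordMap::const_iterator it = md.find(key);
    if (it == md.end()) {
        throw std::runtime_error("missing keyword: " + key);
    }
    return it->second;
}

// Interpret the keywords; after this point callers use the object,
// not the individual keyword/value pairs.
inline SimpleLinearWcs makeWcs(const KeywordMap& md) {
    return SimpleLinearWcs(requireKeyword(md, "CRPIX1"), requireKeyword(md, "CRPIX2"),
                           requireKeyword(md, "CRVAL1"), requireKeyword(md, "CRVAL2"),
                           requireKeyword(md, "CDELT1"), requireKeyword(md, "CDELT2"));
}
```

Once makeWcs has run, nothing downstream needs to know that a "CRPIX1" keyword ever existed, which is the sense in which such inputs are not metadata for this discussion.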

We propose that such inputs not be considered as metadata for the purposes of this discussion.

If we are presented with information that we do not need to interpret, but merely pass along, we are agnostic as to the best implementation. It can be fitted into the name/value pairs we discuss below (though perhaps not well; e.g. FITS cards are richer, with ordering and comments).

This leaves us with the question of what to do with the information about how the processing proceeds. In DC3a, this is handled in three ways:

1/ By defining sdqaRatings

2/ By setting bits within objects such as Sources

3/ By writing log messages.

In order to come up with a rational proposal (or request to Jacek and K-T) we would like a clearer enumeration of what we are really planning to generate for DC3b. Each pipeline stage owner is tasked with providing RHL with a list of all the data that they currently produce for DC3a (and which they want archived; trace messages for pipeline debugging are out of this request's scope). We would also like a guess at what each pipeline stage for DC3b will produce.

Please specify the type of the information that you wish to persist; a database is not good at storing values that could be ints, strings, floats, or jpegs so we need a clear idea of what we're going to need. I hope that we'll be able to restrict ourselves to numerical (or NULL) values.
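
One way to picture the restriction to numerical (or NULL) values is a record with a single value type plus an explicit NULL marker, so that every entry maps onto the same database column. The names StageMetric, makeMetric, makeNullMetric, and countNonNull are hypothetical, as are the metric names in the usage note; this sketches the constraint and is not a proposed schema.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one summary statistic produced by a pipeline stage.
struct StageMetric {
    std::string stage;  // name of the stage that produced the value
    std::string name;   // name of the statistic
    bool isNull;        // true when no value could be measured
    double value;       // meaningful only when isNull is false
};

// Build a record that carries a numerical value.
inline StageMetric makeMetric(const std::string& stage, const std::string& name,
                              double value) {
    StageMetric m = {stage, name, false, value};
    return m;
}

// Build a record that stands for a database NULL.
inline StageMetric makeNullMetric(const std::string& stage, const std::string& name) {
    StageMetric m = {stage, name, true, 0.0};
    return m;
}

// Count the records that actually carry a value.
inline std::size_t countNonNull(const std::vector<StageMetric>& metrics) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < metrics.size(); ++i) {
        if (!metrics[i].isNull) {
            ++n;
        }
    }
    return n;
}
```

For example, a stage might emit makeMetric("ip_diffim", "kernelChi2", 1.25) alongside makeNullMetric("ip_diffim", "kernelCondition"); strings or images would need a different mechanism, which is the point of asking stage owners for types up front.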

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

  • Update of database schema to reflect science

This was thought to mean, "to reflect current C++ classes", and was thought to be non-contentious. We need to help Jacek decide what needs to be persisted. An example is Exposure; this will acquire information about the contained pixels in DC3b, and we need to ensure that the information is saved, or is available via appropriate joins.

Jacek: did we understand this item aright?

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

  • Provenance:

    o Processing history information (part of provenance or separate)?
    o Apps access to provenance

We did not discuss this directly, but it's related to the discussion of metadata above.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

  • Coordinate system religious war (XY0/subimages, trimming, WCS, FITS)

The issue here is that we often operate upon a subset of the data, and it is not always appropriate to use the native (i.e. 0-based C++ indexing) coordinate system. We decided to use:

  • The native coordinate system when addressing pixels
  • The native coordinate system when using a Wcs to convert a pixel position to a world coordinate system
  • A coordinate system relative to the 0-based origin of the largest regularly-gridded containing image (usually the CCD) for all other purposes (e.g. positions of objects; descriptions of bad columns)

A corollary of this is that all algorithms must be able to work with Images that have an origin other than (0, 0); the origin is available via the getXY0(), getX0(), and getY0() methods. Jim Bosch proposed that we capture these transformations via objects associated with the Images; we should think about the pros and cons of such an approach.
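
To make the convention concrete, here is a small sketch of a sub-image that carries its origin in the parent (e.g. CCD) frame. SubImage, Point, at, toParent, and toNative are hypothetical names chosen for illustration and are not the Image API; only the idea of an (X0, Y0) origin comes from the notes above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Simple integer pixel position.
struct Point {
    int x;
    int y;
};

// Hypothetical sub-image with an origin (x0, y0) expressed in the parent frame.
// Assumes width and height are positive.
class SubImage {
public:
    SubImage(int width, int height, int x0, int y0)
        : _width(width), _height(height), _x0(x0), _y0(y0),
          _pixels(static_cast<std::size_t>(width) * static_cast<std::size_t>(height), 0.0f) {}

    int getX0() const { return _x0; }
    int getY0() const { return _y0; }

    // Pixel access uses the native, 0-based coordinates of this sub-image.
    float& at(int col, int row) {
        if (col < 0 || col >= _width || row < 0 || row >= _height) {
            throw std::out_of_range("pixel index outside sub-image");
        }
        return _pixels[static_cast<std::size_t>(row) * _width + col];
    }

    // Positions of objects, bad columns, etc. are reported in the parent frame.
    Point toParent(int col, int row) const {
        Point p = {col + _x0, row + _y0};
        return p;
    }

    // Convert a parent-frame position back to native indices for pixel access.
    Point toNative(int parentX, int parentY) const {
        Point p = {parentX - _x0, parentY - _y0};
        return p;
    }

private:
    int _width, _height, _x0, _y0;
    std::vector<float> _pixels;
};
```

An algorithm written this way indexes pixels natively but reports positions through toParent, so a detection at native (3, 4) in a sub-image with origin (100, 200) is recorded at (103, 204).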

(P.S. We currently persist the (X0, Y0) values via a second WCS in the FITS header, WCSA.)

Andy Becker: we think that you were unhappy with this; please comment.

Dick Shaw, with his standards hat on, remarks: I think your impressions w.r.t. the relationship between the WCS reference pixel and the image array coordinate origin are basically correct. The reference is chapter 8 in http://fits.gsfc.nasa.gov/fits_standard.html.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

  • Focal plane, CCD, and amplifier geometry "database" and C++ instantiation
  • CCD properties database (gain, defects, etc.) and C++ instantiation
  • Validity date ranges of CFHT defect lists

The desire is to have a defined source of all knowledge about a given camera; the examples of classes of knowledge that we came up with were:

  • Logical location of segments within a CCD image (i.e. pixel offsets)
  • Geometry of the segments within a CCD (if, e.g. the right-hand set of amps are separated from the left by a pixel 50% wider than normal)
  • Location of a CCD within the focal plane
  • Electronic properties of CCDs (gain, read noise, non-linearity, bad pixel maps, ...)
  • Location of the appropriate calibration frames for a given exposure
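
The classes of knowledge listed above might be gathered into plain data structures along the following lines. Every name, field, and unit here (AmpSegment, CcdDescription, findSegment, millimetres for focal-plane positions) is an assumption made for illustration; the notes do not define a design, and non-linearity, bad pixel maps, and calibration-frame lookup are left out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one amplifier segment within a CCD image.
struct AmpSegment {
    int x0;            // pixel offset of the segment within the CCD image
    int y0;
    int width;         // segment size in pixels
    int height;
    double gain;       // assumed units: electrons per ADU
    double readNoise;  // assumed units: electrons
};

// Hypothetical description of one CCD and its place in the focal plane.
struct CcdDescription {
    std::string name;
    double focalPlaneX;  // assumed units: mm
    double focalPlaneY;
    std::vector<AmpSegment> segments;
};

// Return the index of the segment containing CCD pixel (x, y), or -1 if none.
inline int findSegment(const CcdDescription& ccd, int x, int y) {
    for (std::size_t i = 0; i < ccd.segments.size(); ++i) {
        const AmpSegment& s = ccd.segments[i];
        if (x >= s.x0 && x < s.x0 + s.width && y >= s.y0 && y < s.y0 + s.height) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

A lookup such as findSegment is the kind of query an API over this store would need to answer, for instance to pick the gain that applies to a given pixel.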

We discussed whether there were pre-existing de-facto standards for storing this sort of information and decided that there weren't; in some cases FITS conventions could be used to store the desired information, but not in especially natural ways. Dick Shaw later pointed me to section 2.3 in http://iraf.noao.edu/projects/ccdmosaic/imagedef/imagedef.html as something aligned with the mythical 4th FITS WCS paper about coordinate transforms that apply to pixels prior to the mapping from pixels->world.

We need to collaborate with the other parts of the project to come up with a sensible design for this store of information, and with an API to make use of it. Ideally, we'd share C++ classes, but this may not be practical. Where possible, we should of course follow pre-existing standards.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

  • RADECSYS and EQUINOX need to be specified by astrometry.net data

This needs to be addressed. Is a ticket filed?

Results/Decisions?