Last modified on 06/10/2009 03:04:17 PM

Use Cases for Science Data Quality Analysis

This page supplies a first cut at listing the use cases for SDQA. SDQA operations will have the following three "dimensions":

When (nightly, daily, periodically)

Where (Summit, Base, Archive, Data Centers)

How (automatically, manually)

We must keep these in mind as the use cases become more detailed but, for now, we list use cases without specifically calling out when, where or how they will be executed during operations.

Example Use Cases

  • Check basic integrity of image data -- e.g. missing or severely impacted pixels, rows, columns, segments. To be performed at the image level and cumulatively, through the night.
  • Compare extracted image-level metrics with specifications in the Science Requirements Document (SRD) -- e.g. delivered seeing (PSF size & shape), photometric and astrometric precision
  • Search for presence of artifacts -- e.g. ghosts, glints, stray light, cosmic rays, meteors, satellite trails, aircraft trails, noctilucent clouds
  • Check quality of difference imaging -- e.g. check robustness of kernel-building within each image "footprint"
  • Compare average background DC level with expectation
  • Check for structure in background -- e.g. gradients, ripples
  • Check robustness of source extractions as function of position in the image -- e.g. edge versus center and sparse field versus crowded field
  • Check robustness of source extractions against size (point, extended)
  • Check for unanticipated WCS solutions and for mismatches between rafts
  • Check for coincidence of image problems (such as those above) with out-of-range parameters in engineering data and in ancillary telescope information
  • Check catalog entries -- compare specification with uncertainties on positional associations of sources and with uncertainties on object properties derived from source associations, such as flux/magnitude, position and shape
  • Generate alerts by thresholding all of the above
  • Set flags and enter text comments at image, object and source levels
  • Search for image and catalog artifacts that become apparent only on long timescales
  • Perform astrophysical sanity checks on large samples of catalog entries
  • Generate statistics on data accountability -- e.g. lost images, lost nights, out-of-specification distribution of filters used
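
As a rough illustration of the thresholding item in the list above, the sketch below compares measured metrics against allowed ranges and emits one alert per violation. The metric names and limits are hypothetical placeholders, not values taken from the Science Requirements Document.

```python
def threshold_alerts(metrics, limits):
    """Return one alert string per metric that falls outside its range.

    metrics: {name: measured value}
    limits:  {name: (minimum or None, maximum or None)}  # hypothetical spec
    """
    alerts = []
    for name, value in metrics.items():
        if name not in limits:
            continue  # no specification available for this metric
        lo, hi = limits[name]
        if lo is not None and value < lo:
            alerts.append(f"{name}={value} below minimum {lo}")
        if hi is not None and value > hi:
            alerts.append(f"{name}={value} above maximum {hi}")
    return alerts


# Hypothetical per-image metrics and limits, for illustration only.
metrics = {"seeing_fwhm_arcsec": 1.4, "background_mean": 310.0, "n_saturated": 12}
limits = {"seeing_fwhm_arcsec": (None, 1.2), "background_mean": (100.0, 500.0)}
alerts = threshold_alerts(metrics, limits)
```

Metrics without an entry in the limits table are skipped rather than treated as failures, so new metrics can be recorded before a specification exists for them.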

Image Quality

Science image DQA requires measurements of the background level, cosmic ray statistics, characterization of the PSF, and photometry across the image. The emphasis is on self-contained measurements done within a given image, with minimal reference to the database or difference images.

Image quality DQA should be performed nightly, in order to optimize system performance and scheduling.

The science drivers for image quality are categorized as follows, together with their associated metrics:

Source Detection

  • System Response
    • atmospheric transparency
    • instrumental parameters: bias, gain, response, exposure time

The atmospheric transparency and stability of instrumental response need to be determined independently from separate calibration data, for example with a fish-eye lens to monitor cloud cover, a seeing monitor telescope, and routine dark and flat field observations.

  • Image Dynamic Range
    • depth
    • saturation

A grid of faint point source calibrators would allow an empirical estimate of the depth (in magnitudes) of any given image. A second set of very bright sources could be measured to determine the image dynamic range, linearity, and saturation.

  • Background Level
    • mean
    • gradient across image
    • higher order terms

The first step in background measurement is to identify and mask out all sources. The zeroth-order metric is the mean background level in unmasked regions. The background metric can be either fine-grained (measured at the pixel level) or chunked into zones. The fine-grained analysis would be useful for determining the distribution of background values, including its mean and standard deviation. A zonal analysis, for example splitting an image into nine zones, would be the most efficient way to characterize low-order background variations across an image. More sophisticated analysis, such as Fourier transforming the entire image to identify fringing or periodic phenomena, may be possible but is perhaps too time-consuming.
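
The zonal approach could look roughly like the following sketch, which splits an image into a 3 x 3 grid and takes the mean of the unmasked pixels in each zone. The image is represented as a plain list of rows and the mask as a matching grid of booleans; both representations, and the function name, are assumptions made for illustration.

```python
from statistics import mean

def zonal_background(image, mask, n=3):
    """Return an n x n grid of mean background levels over unmasked pixels.

    image: list of rows of pixel values
    mask:  same shape, True where a source was detected (pixel is ignored)
    """
    rows, cols = len(image), len(image[0])
    zones = [[None] * n for _ in range(n)]
    for zi in range(n):
        for zj in range(n):
            r0, r1 = zi * rows // n, (zi + 1) * rows // n
            c0, c1 = zj * cols // n, (zj + 1) * cols // n
            values = [image[r][c]
                      for r in range(r0, r1)
                      for c in range(c0, c1)
                      if not mask[r][c]]
            # A fully masked zone has no background estimate.
            zones[zi][zj] = mean(values) if values else None
    return zones


# Hypothetical 6x6 image: background of 100 with a left-to-right gradient.
image = [[100.0 + 2.0 * c for c in range(6)] for _ in range(6)]
mask = [[False] * 6 for _ in range(6)]
mask[0][0] = True  # pretend a source sits on this pixel

zones = zonal_background(image, mask)
gradient = zones[1][2] - zones[1][0]  # right-minus-left difference, middle row
```

Differences between zones give a cheap estimate of the gradient term, while the overall mean of the zone values approximates the zeroth-order metric.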

  • Stray Light

Characterization of stray light is difficult to do in an automatic fashion. It may be lumped in with background determination.

  • Cosmic Ray Rejection

Cosmic ray hits need to be identified automatically and affected pixels flagged in each image. The statistics of cosmic ray hits as a function of pattern and size would be useful for assessing the likelihood of false transient detection.

Source Photometry, Morphology, and Astrometry

  • Stellar PSF (seeing)
    • FWHM
    • ellipticity
    • PA

PSF measurement requires source detection, ellipse fitting, and classification of sources into point sources or extended sources. It is not necessary to measure all sources in every image to get an accurate distribution of PSF parameters in an image. Relying on the database to identify point source calibrators would eliminate the need to measure every object in every image for the purpose of SDQA. Sampling a subset of stellar sources will be sufficient to estimate the stellar PSF and its variation across an image. A reference grid of stellar probes could be chosen from the LSST database, vetted for problem sources and refined as the database builds up.
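
One simple stand-in for full ellipse fitting is to derive the three PSF metrics from flux-weighted second moments of a small cutout around each reference star. The sketch below does this for a single background-subtracted stamp. It assumes a Gaussian profile when converting to FWHM, uses the 1 - b/a convention for ellipticity, and reports the position angle in pixel coordinates (from the x axis toward the y axis), not on the sky. The function name and stamp representation are hypothetical.

```python
import math

def psf_shape_from_moments(stamp):
    """Estimate FWHM, ellipticity and PA from flux-weighted second moments.

    stamp: list of rows of background-subtracted pixel values for one star,
           with flux in more than one pixel.
    Returns (fwhm_pixels, ellipticity, pa_degrees).
    """
    total = sum(sum(row) for row in stamp)
    xc = sum(v * x for row in stamp for x, v in enumerate(row)) / total
    yc = sum(v * y for y, row in enumerate(stamp) for v in row) / total

    ixx = iyy = ixy = 0.0
    for y, row in enumerate(stamp):
        for x, v in enumerate(row):
            ixx += v * (x - xc) ** 2
            iyy += v * (y - yc) ** 2
            ixy += v * (x - xc) * (y - yc)
    ixx, iyy, ixy = ixx / total, iyy / total, ixy / total

    # Variances along the principal axes of the intensity distribution.
    half_sum = (ixx + iyy) / 2.0
    half_diff = math.hypot((ixx - iyy) / 2.0, ixy)
    a2, b2 = half_sum + half_diff, max(half_sum - half_diff, 0.0)

    ellipticity = 1.0 - math.sqrt(b2 / a2)  # 1 - b/a convention
    pa_degrees = math.degrees(0.5 * math.atan2(2.0 * ixy, ixx - iyy))
    # Assumption: Gaussian profile with sigma taken as sqrt(a * b).
    fwhm = 2.0 * math.sqrt(2.0 * math.log(2.0)) * (a2 * b2) ** 0.25
    return fwhm, ellipticity, pa_degrees


# Hypothetical stamp elongated along x, for illustration only.
stamp = [
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
]
fwhm, ellipticity, pa = psf_shape_from_moments(stamp)
```

Applying this to a sampled subset of stellar probes and summarizing the results per image region would give the distribution of PSF parameters without measuring every object.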

  • WCS

The quality of WCS determination is vital for several science goals, including source detection, moving object detection, difference imaging, stellar photometry, and measuring cosmological lensing effects. The same grid of stars used for PSF measurement could be used for assessing the WCS, if they have well determined positions and proper motions in the LSST database. The reference positions and proper motions will be refined as the survey progresses.


ISR

Currently implemented

  • Number of cosmic rays found
  • Number of saturated pixels found

TBD

  • Number of pixels clipped during linearization (should be zero, but some CFHT data violates that and may still be usable)
  • Number of pixels of cosmic rays found
  • Number of saturated objects found
  • Image statistics (useful metadata, but perhaps not SDQA). Note that we don't have a good way to put a distribution into SDQA. A mean value and uncertainty does not suffice to describe a complex distribution.

Image Subtraction

Single Image

  • Comparison of the Kernel sums for each Footprint
  • Difference image statistics within each Footprint used to derive the Kernel
  • Difference image statistics around other (secondary) Footprints *not* used to derive the Kernel
  • Total number of detected sources in the difference image
  • Number of positive-going vs. negative-going sources in the difference image
  • Amplitude of the spatially varying coefficients in the background/Kernel models
  • Number of masked pixels on the warped template (portion that overlaps the science image)
  • Number of masked pixels on difference image (perhaps in several categories, e.g. edge, saturated, other)
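
Several of the counts in the list above are straightforward to gather once the difference-image sources and mask are available. The sketch below is a minimal example under assumed inputs: a list of signed source fluxes and a list with one category label per masked pixel. Neither input format comes from the pipeline itself.

```python
from collections import Counter

def diffim_source_summary(source_fluxes, mask_categories):
    """Summarize difference-image detections and masked pixels.

    source_fluxes:   signed flux of each detected source
    mask_categories: one label per masked pixel, e.g. "edge", "saturated"
    """
    n_pos = sum(1 for f in source_fluxes if f > 0)
    n_neg = sum(1 for f in source_fluxes if f < 0)
    return {
        "n_sources": len(source_fluxes),
        "n_positive": n_pos,
        "n_negative": n_neg,
        # Undefined when there are no negative-going sources.
        "pos_neg_ratio": n_pos / n_neg if n_neg else None,
        "masked_by_category": Counter(mask_categories),
    }


# Hypothetical detections and mask labels, for illustration only.
fluxes = [120.0, -80.0, 45.0, 60.0, -30.0]
masked = ["edge"] * 4 + ["saturated"] * 2 + ["other"]
summary = diffim_source_summary(fluxes, masked)
```

A strongly lopsided ratio of positive-going to negative-going sources is one possible sign of a poor kernel or background match.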

Collection of Images

  • Comparison of the global Kernel sums as a function of time
  • Comparison of the global Kernel sums with the photometric zero points

Efficiency Tests

  • Inject positive-going and negative-going flux (with the footprint of the PSF) into each image before ip_diffim (or create a fork to process these images separately)
  • Place them on known stars, near (as well as at the very center of) galaxies, and then randomly across the image
  • Monitor the recovery of these known objects as a function of flux and location

  • Create simulated images with known variability
  • Monitor the recovery of these known objects as a function of flux and location
  • Calculate the false event rate (all sources detected that were not explicitly added to the data)
  • Compare to SRD on false event rate
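
The recovery and false-event bookkeeping described above might be sketched as follows. Injected sources are given as (id, flux) pairs and detections as a set of IDs, where a detection matched to an injected source carries that source's ID. This matching scheme, the function names, and the flux bins are all assumptions for illustration.

```python
def recovery_by_flux(injected, detected_ids, bin_edges):
    """Return (low, high, n_injected, recovered_fraction) per flux bin.

    injected:     list of (source_id, flux) for every injected source
    detected_ids: set of IDs of all detections
    bin_edges:    ascending flux bin edges; bins are [low, high)
    """
    results = []
    for low, high in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = [sid for sid, flux in injected if low <= flux < high]
        found = sum(1 for sid in in_bin if sid in detected_ids)
        fraction = found / len(in_bin) if in_bin else None
        results.append((low, high, len(in_bin), fraction))
    return results


def false_event_count(detected_ids, injected):
    """Count detections that do not correspond to any injected source."""
    injected_ids = {sid for sid, _ in injected}
    return len(set(detected_ids) - injected_ids)


# Hypothetical injected sources and detections, for illustration only.
injected = [("a", 10.0), ("b", 12.0), ("c", 55.0), ("d", 60.0), ("e", 70.0)]
detected = {"b", "c", "d", "e", "x1"}

efficiency = recovery_by_flux(injected, detected, [0.0, 50.0, 100.0])
n_false = false_event_count(detected, injected)
```

Counting every non-injected detection as false is only valid for fully simulated images; on real images the same count would also include genuine transients. The same binning can be repeated over location (on stars, near galaxies, random positions) by swapping flux for a position label.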

NightMOPS

  • Inject synthetic Solar System Object orbits known to be present in the images to be processed. Make sure that NightMOPS returns those positions to AP.

DayMOPS

Overall Strategy

  • Generate a catalog of orbits for synthetic Solar System Objects. Inject DiaSources generated from these synthetic orbits into the DiaSource and DiaSourceIDForTonight tables.

IntraNightLinking (tracklet creation)

For DiaSources to be linkable into tracklets, they need to be brighter than the LSST limiting magnitude and belong to the same night. For MOPS to be able to link DiaSources into a tracklet, there must be at least 2 DiaSources per tracklet (in the same night).

  • Check the tracklets created (mops_Tracklet table) and make sure that
    • All synthetic DiaSources that could have been linked into tracklets were indeed correctly linked.
    • All tracklets that were formed are made either entirely of non-synthetic DiaSources or entirely of (the correct) synthetic DiaSources (no mixed tracklets).
  • Compute the efficiency and accuracy figures.
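
The tracklet checks above could be sketched along these lines. Each DiaSource is a dict with hypothetical keys 'id', 'ssm_id' (the synthetic object it was generated from, or None for a real detection), 'night', and 'mag'; these are not the actual column names of the DiaSource or mops_Tracklet tables. Efficiency is taken here to mean the fraction of linkable synthetic groups recovered as exact tracklets, which is one possible definition.

```python
from collections import defaultdict

def linkable_groups(dia_sources, limiting_mag):
    """Return {(ssm_id, night): set of DiaSource ids} that should be linked."""
    groups = defaultdict(set)
    for s in dia_sources:
        # Brighter than the limit means a numerically smaller magnitude.
        if s["ssm_id"] is not None and s["mag"] < limiting_mag:
            groups[(s["ssm_id"], s["night"])].add(s["id"])
    # A tracklet needs at least 2 DiaSources in the same night.
    return {key: ids for key, ids in groups.items() if len(ids) >= 2}


def classify_tracklet(tracklet_ids, sources_by_id):
    """Return 'real', 'synthetic', or 'mixed' for one tracklet."""
    ssm_ids = {sources_by_id[i]["ssm_id"] for i in tracklet_ids}
    if ssm_ids == {None}:
        return "real"
    if None not in ssm_ids and len(ssm_ids) == 1:
        return "synthetic"
    # Real plus synthetic, or sources from different synthetic objects.
    return "mixed"


def linking_efficiency(expected, tracklets):
    """Fraction of expected synthetic groups found as exact tracklets."""
    found = {frozenset(t) for t in tracklets}
    hits = sum(1 for ids in expected.values() if frozenset(ids) in found)
    return hits / len(expected) if expected else None


# Hypothetical DiaSources and MOPS output, for illustration only.
sources = [
    {"id": 1, "ssm_id": "A", "night": 1, "mag": 22.0},
    {"id": 2, "ssm_id": "A", "night": 1, "mag": 23.0},
    {"id": 3, "ssm_id": "B", "night": 1, "mag": 22.0},
    {"id": 4, "ssm_id": "B", "night": 1, "mag": 25.5},  # too faint
    {"id": 5, "ssm_id": None, "night": 1, "mag": 21.0},
    {"id": 6, "ssm_id": None, "night": 1, "mag": 21.5},
]
by_id = {s["id"]: s for s in sources}
tracklets = [[1, 2], [5, 6], [3, 5]]

expected = linkable_groups(sources, limiting_mag=24.5)
kinds = [classify_tracklet(t, by_id) for t in tracklets]
efficiency = linking_efficiency(expected, tracklets)
```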

Attribution (augmenting existing MovingObjects by adding DiaSources from last night)

  • Check that
    • All synthetic MovingObjects present and detected in last night's images have been associated with the corresponding synthetic tracklets/DiaSources.
    • No synthetic MovingObject is associated with a non-synthetic tracklet/DiaSource.
    • No non-synthetic MovingObject is associated with a synthetic tracklet/DiaSource.
    • Each synthetic MovingObject is associated with its corresponding synthetic tracklets/DiaSources only.
  • Compute the efficiency and accuracy figures.

InterNightLinking (linking tracklets into new MovingObjects)

For MOPS to be able to link tracklets into new MovingObjects, there need to be at least 3 tracklets (each will have at least 2 DiaSources) belonging to >= 3 nights within 30 days.

  • Check that
    • All synthetic tracklets that could have been linked, have been.
    • All new synthetic MovingObjects are associated with their corresponding synthetic tracklets/DiaSources only.
    • No new non-synthetic MovingObject is associated with synthetic tracklets/DiaSources.
  • Compute the efficiency and accuracy figures.
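
To decide which synthetic objects "could have been linked", the eligibility rule stated above can be written as a small predicate. The sketch assumes nights are given as integer night numbers and reads "within 30 days" as a span of at most 30 between the first and last night; both are assumptions. Because every tracklet already contains at least 2 DiaSources, tracklets on 3 distinct nights also satisfy the 3-tracklet minimum.

```python
def can_link_across_nights(tracklet_nights, min_nights=3, window_days=30):
    """True if tracklets fall on >= min_nights distinct nights in one window.

    tracklet_nights: night number of each tracklet of one synthetic object
    """
    nights = sorted(set(tracklet_nights))
    for i in range(len(nights) - min_nights + 1):
        # Span between the first and last night of this candidate group.
        if nights[i + min_nights - 1] - nights[i] <= window_days:
            return True
    return False


# Hypothetical night numbers, for illustration only.
eligible = can_link_across_nights([1, 1, 5, 20])
too_few_nights = can_link_across_nights([1, 1, 1, 2])
too_spread_out = can_link_across_nights([1, 20, 45])
```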

OrbitManagement (merge MovingObjects created at different times but that have the same orbit)

  • Check that
    • All synthetic MovingObjects corresponding to the same entry in the synthetic Solar System Model are merged into one.
    • No two or more synthetic MovingObjects corresponding to the same entry in the synthetic Solar System Model remain in the database.
  • Compute the efficiency and accuracy figures.
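
A minimal sketch of the duplicate check above: group MovingObjects by the synthetic Solar System Model entry they correspond to and report every entry that still maps to more than one MovingObject. The (moving_object_id, ssm_id) pair format is a hypothetical simplification of the database rows.

```python
from collections import defaultdict

def unmerged_duplicates(moving_objects):
    """Return {ssm_id: [moving_object_ids]} for entries not merged into one.

    moving_objects: iterable of (moving_object_id, ssm_id) pairs, with
                    ssm_id None for non-synthetic objects
    """
    by_ssm = defaultdict(list)
    for mo_id, ssm_id in moving_objects:
        if ssm_id is not None:
            by_ssm[ssm_id].append(mo_id)
    return {ssm: ids for ssm, ids in by_ssm.items() if len(ids) > 1}


# Hypothetical rows, for illustration only.
rows = [(1, "A"), (2, "B"), (3, "A"), (4, None)]
duplicates = unmerged_duplicates(rows)
```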

Precovery (augmenting new MovingObjects by adding DiaSources from past nights)

  • Check that
    • All new synthetic MovingObjects have been associated with the corresponding synthetic tracklets/DiaSources from past nights.
    • No new synthetic MovingObject is associated with a non-synthetic tracklet/DiaSource.
    • No new non-synthetic MovingObject is associated with a synthetic tracklet/DiaSource.
    • Each new synthetic MovingObject is associated with its corresponding synthetic tracklets/DiaSources only.
  • Compute the efficiency and accuracy figures.

MOID (computing impact probabilities with Earth)

  • Check that
    • Impact probabilities for synthetic MovingObjects are consistent with known values.

Overall Orbit Quality

  • Compute the difference between synthetic MovingObject parameters and the corresponding entries in the synthetic Solar System Model catalog.

Multifit

The following metrics could be produced for sets of objects, on a per sky-tile basis. Although we list them here as SDQA metrics, it is possible they could be better recorded elsewhere.

  1. Distribution of goodness of fit
  2. Distribution of centroid shift between input, model from fitting to the coadd, and the model from fitting to full image stack.
  3. Distribution of the ratio of masked pixels to used pixels.
  4. Percentage of objects for which we fail to converge on a model
  5. Percentage of objects that were not fit due to exceeding a size threshold
  6. Percentage of objects for which a point source model is 'better' than its extended galaxy model

Likely the most useful metric is the fourth, in that it can be used to automatically determine whether this sky tile needs to be re-processed, either with better calibration information, or with a different algorithm.
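
As a sketch of how the fourth metric might drive an automatic decision, the function below computes the percentage of objects in a tile whose fit failed to converge and compares it with a cutoff. The 5% default is an arbitrary placeholder, not a project requirement.

```python
def tile_needs_reprocessing(converged_flags, max_failed_percent=5.0):
    """Return (failed_percent, reprocess) for one sky tile.

    converged_flags:    one boolean per object, True if the fit converged
    max_failed_percent: hypothetical cutoff for flagging the tile
    """
    if not converged_flags:
        return 0.0, False  # nothing was fit, so nothing to judge
    failed = sum(1 for ok in converged_flags if not ok)
    failed_percent = 100.0 * failed / len(converged_flags)
    return failed_percent, failed_percent > max_failed_percent


# Hypothetical tile with 20 objects, 2 of which failed to converge.
failed_percent, reprocess = tile_needs_reprocessing([True] * 18 + [False] * 2)
```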


Use Cases Originating from the Camera Team

Here are some thoughts from Camera land for DQA. We intend to monitor many camera items ourselves, such as:

  • SDS bias
  • SDS dark current
  • SDS image pixel value histogram
  • SDS CTE sanity check (2049th, 2050th rows vs nominal bias)
  • SDS read noise

We'd like the pipelines to provide pixel metrics that are the result of data correlations, such as:

  • DM PSF vs focal plane/raft position
  • TCS predicted optical PSF from WFS data
  • DM sky gradient

We would then use these to look for correlations with our internal metrics.

There is another that might best come from a calibration pipeline:

  • gain estimate

We are not sure where that one is best done.


Database Integrity and Consistency

We need to check database integrity and application-level data consistency in the database.

Database-level

Note that very likely we will not use standard database referential integrity checking techniques (like foreign keys) for two reasons:

  • performance
  • distribution (in many cases our tables will be spread across many physical tables, likely located in separate databases)

which means we will need to do the checking in our software.

Database-level integrity/consistency checking will include things like detecting dangling pointers, or missing entries.

Examples:

  • a source without an object
  • a source without an exposure
  • an object pointing to a non-existent source
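
Without foreign keys, checks like the three examples above reduce to set-membership tests in application code. The following sketch uses in-memory dicts as stand-ins for the Source, Object, and Exposure tables; the field names 'object_id' and 'exposure_id' and the table layout are hypothetical.

```python
def find_dangling(sources, objects, exposures):
    """Return a list of (problem, id) tuples for broken references.

    sources:   {source_id: {"object_id": ..., "exposure_id": ...}}
    objects:   {object_id: [source_ids the object points to]}
    exposures: set of existing exposure ids
    """
    problems = []
    for source_id, row in sources.items():
        if row["object_id"] not in objects:
            problems.append(("source without object", source_id))
        if row["exposure_id"] not in exposures:
            problems.append(("source without exposure", source_id))
    for object_id, source_ids in objects.items():
        for source_id in source_ids:
            if source_id not in sources:
                problems.append(("object points to missing source", object_id))
    return problems


# Hypothetical table contents, for illustration only.
sources = {
    1: {"object_id": 10, "exposure_id": 100},
    2: {"object_id": 11, "exposure_id": 100},  # object 11 does not exist
    3: {"object_id": 10, "exposure_id": 101},  # exposure 101 does not exist
}
objects = {10: [1, 3, 4]}  # source 4 does not exist
exposures = {100}

problems = find_dangling(sources, objects, exposures)
```

With tables spread across several databases, the same membership tests would have to run against ID sets gathered from each partition rather than a single in-memory dict.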

Application-level

In many cases an individual value will be valid on its own but invalid in the broader application-specific context. Some use cases that fall into this category:

  • consistency of denormalized values (denormalization is usually done for performance reasons)
    • Example: filterId in Source is different than filterId of its corresponding exposure
  • impossible combinations
    • Example: a source contains a valid pointer to both an object and a moving object
  • validity of values in the application context
    • Example 1: ra/dec for a source is outside of the region covered by its corresponding exposure.
    • Example 2: the ra/dec of an object and of all but one of its sources are close together, while the remaining source is far away
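
Some of the application-level examples above can be expressed as per-source checks. The sketch below covers the filterId mismatch, the object/moving-object conflict, and a source lying far from its object. Apart from filterId, the field names ('objectId', 'movingObjectId', 'ra', 'decl') and the 1 arcsecond separation limit are hypothetical. The exposure-footprint test of Example 1 is left out because it needs the exposure geometry.

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation via the haversine formula; inputs in degrees."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2.0) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def check_source(source, exposure, object_radec, max_sep_arcsec=1.0):
    """Return a list of consistency problems found for one source."""
    problems = []
    if source["filterId"] != exposure["filterId"]:
        problems.append("filterId differs from exposure")
    if source["objectId"] is not None and source["movingObjectId"] is not None:
        problems.append("points to both an object and a moving object")
    if object_radec is not None:
        sep = angular_sep_arcsec(source["ra"], source["decl"], *object_radec)
        if sep > max_sep_arcsec:
            problems.append("far from its object")
    return problems


# Hypothetical rows, for illustration only.
source = {"filterId": 2, "objectId": 7, "movingObjectId": None,
          "ra": 10.0, "decl": 0.01}
exposure = {"filterId": 3}
problems = check_source(source, exposure, object_radec=(10.0, 0.0))
one_arcsec = angular_sep_arcsec(10.0, 0.0, 10.0, 1.0 / 3600.0)
```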