wiki:ImageAccess
Last modified on 08/07/2009 01:24:37 AM

Image Access for Data Release Production

The Data Release Production in DC3b will require access to several different kinds of image data, ranging from the complete set of raw images to stacks of "postage stamps" pertaining to individual objects. This page attempts to describe the various types of images used by the Production and how the Production will access them.

Summary of Needed Development Tasks

(All times are design and coding only and assume 40% resource loading. Add 50% overhead for testing, reviews, etc.)

Tasks 7-10 (7 wks) are critical. Tasks 1-5 (5.5 wks) can be postponed, using fixed lists and cutout areas with existing software for writing and reading images for small runs.

  1. Determine raw and calibrated exposure segment images overlaying a given sky tile. (1 wk; KTL)
  2. Maintain LRU cache for raw segment images, retrieving desired images from tape if not present in cache. May be implemented using iRODS. (2 wks; KTL, perhaps with help from Arun; prerequisite for 4)
  3. Maintain LRU cache for Calibrated Science Exposure segments. (1 wk; KTL)
  4. Write Calibrated Science Exposure and difference image segments to disk and/or tape (for long-term preservation and analysis) and enable retrieval of same. (0.5 wk; partly done by 2; KTL, perhaps with help from Arun)
  5. Determine and retrieve cutout areas of segments overlaying a given sky tile. When the sky tile overlaps multiple segments, multiple cutouts are returned, since no stitching is done by the middleware. Same code is used to extract postage stamp stacks. (1 wk; KTL)
  6. Define sky pixelization for co-adds/templates. (? wks; Apps task, TA; prerequisite for 1, 5, 8, possibly 9)
  7. Write to co-add/template pixel store on a sky tile basis. This may include writing the number of segment images used to produce each co-add pixel, but it will not include the detailed segment image identifiers for each co-add pixel. (2 wks; KTL)
  8. Determine template pixels overlaying a given calibrated exposure segment, sky tile, or RA/dec bounding box. (1 wk; KTL)
  9. Retrieve template pixels. (1 wk; KTL)
  10. Determine if local disk can be used to increase performance for postage stamp generation. Requires writing code to write exposure stacks to local disk. (3 wks; KTL)

APIs

Please see ImageAccessApis for proposed APIs.

Stages

Algorithmic stages processing pixel data are assumed to be:

  • ISR
  • Image characterization
  • Co-add/template generation
  • Image differencing
  • Deep detection
  • Multifit measurement

These can be partitioned into four pieces with differing image access needs: ISR/Image Characterization, Co-Add/Template Generation, Image Differencing, and Deep Detection/Multifit Measurement.

Sky Tiles

We define a "sky tile" to be an area of the sky that can be processed through a given stage of the Data Release Production on a single node. The tile may be subdivided into a "central" region and a "border" region. The processing of the central region is the primary task of the node; the border region is available to ensure that there are no edge effects in the processing of the central region. Note that this definition does not rule out a zero-width border region if the stage's algorithm naturally has no edge effects or uses some form of inter-node communication to avoid them.
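The tile/central/border decomposition above can be sketched as a small data structure. This is a minimal illustration only, assuming a simple rectangular RA/dec parameterization; real tiles would be defined in terms of the sky pixelization discussed later on this page, and all names here are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    """Axis-aligned RA/dec bounding box in degrees (illustrative only)."""
    ra_min: float
    ra_max: float
    dec_min: float
    dec_max: float

    def contains(self, ra, dec):
        # half-open bounds so adjacent regions do not double-count points
        return (self.ra_min <= ra < self.ra_max
                and self.dec_min <= dec < self.dec_max)

@dataclass(frozen=True)
class SkyTile:
    """A sky tile: a central region plus a (possibly zero-width) border."""
    central: Region
    border_width: float  # degrees; 0.0 if the stage has no edge effects

    @property
    def full(self):
        """Central region expanded by the border on every side."""
        c, b = self.central, self.border_width
        return Region(c.ra_min - b, c.ra_max + b, c.dec_min - b, c.dec_max + b)

tile = SkyTile(Region(10.0, 10.5, -5.0, -4.5), border_width=0.25)
assert tile.full == Region(9.75, 10.75, -5.25, -4.25)
assert tile.central.contains(10.2, -4.7)
# a point in the border is processed, but is not part of the primary task
assert not tile.central.contains(10.2, -4.3) and tile.full.contains(10.2, -4.3)
```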

Tiles will likely overlap; even central regions may overlap, although it is probably desirable for them to be non-overlapping. Processing of a central region may require information from nearby central regions, although again it is desirable for this not to be the case.

There are two types of sky tiles used in the Data Release Production: a small one for co-add/template/postage stamp generation and a larger one for deep detection. The other stages have primary processing units other than tiles, although retrieval of images will implicitly be driven by the co-add and deep detection tiles. The co-add sky tile needs to be small enough that a stack of images can be held in memory; the detection sky tile can be substantially larger, as only one image plane need be held in memory at a time. A detection tile need not share boundaries with the co-add tiles that overlap it, but it would make things easier, so we assume this.

A DC3b detection sky tile on the LSST cluster (500 MB per node) is likely to be as large as 10 megapixels (100 MB with mask and variance). A DC3b co-add sky tile is likely to be only 100 kilopixels (1 MB with mask and variance), as a stack of depth up to 150 will need to be kept in memory to process it. Compare these with an amplifier segment (1 megapixel) and a CCD (8 to 16 megapixels, depending on whether CFHT or simulated LSST data is being processed).
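These figures can be sanity-checked with quick arithmetic, assuming 10 bytes per pixel (image, variance, and mask planes, as the 100 MB / 10 megapixel figures imply):

```python
BYTES_PER_PIXEL = 10          # image + variance + mask, as implied above

detection_tile_pixels = 10_000_000   # DC3b detection sky tile
coadd_tile_pixels = 100_000          # DC3b co-add sky tile
stack_depth = 150                    # maximum DC3b stack depth

detection_tile_mb = detection_tile_pixels * BYTES_PER_PIXEL / 1e6
coadd_stack_mb = coadd_tile_pixels * BYTES_PER_PIXEL * stack_depth / 1e6

assert detection_tile_mb == 100.0    # matches the 100 MB figure above
assert coadd_stack_mb == 150.0       # a full-depth co-add stack fits in 500 MB
```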

Pixelization

Many issues with co-add generation and use can be simplified if the entire sky is divided into non-overlapping pixels. These pixels need not be equal-area, although that may be desirable. The key characteristic is that every (geometric) point on the sky belongs to exactly one pixel. This pixelization scheme could be Q3C, HTM, HEALpix, or even square tangent plane projections with an algorithm to decide which plane to use at a boundary. We assume that one of these schemes is used, and that sky tiles and their central and border regions are defined in terms of the scheme.
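The key property, every geometric point belonging to exactly one pixel, can be illustrated with a deliberately crude equirectangular scheme using half-open bins. This is a stand-in for illustration only, not Q3C, HTM, or HEALpix:

```python
def sky_to_pixel(ra_deg, dec_deg, pix_size_deg=0.1):
    """Map a sky position to exactly one pixel index.

    Illustrative equirectangular scheme: half-open bins in RA and dec
    guarantee that every point lands in one and only one pixel (the
    real scheme would be Q3C, HTM, HEALpix, or tangent planes).
    """
    n_ra = int(round(360.0 / pix_size_deg))
    i = int((ra_deg % 360.0) / pix_size_deg)    # RA bin, 0 .. n_ra-1
    j = int((dec_deg + 90.0) / pix_size_deg)    # dec bin, 0 at the south pole
    return j * n_ra + i

# Points on either side of a pixel boundary land in different pixels,
# but each point lands in exactly one; the RA wrap creates no overlap.
assert sky_to_pixel(10.0999, 0.0) != sky_to_pixel(10.1001, 0.0)
assert sky_to_pixel(359.9999, 0.0) != sky_to_pixel(0.0001, 0.0)
```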

ISR and Image Characterization

The primary purpose of ISR and Image Characterization is to produce Calibrated Science Exposures for later stages.

ISR is assumed to work on units of one amplifier segment-sized raw image. Image characterization is assumed to work on the in-memory output of ISR. For production, it is possible that an entire focal plane will have to be processed at the same time (using inter-slice communication) for at least image characterization and possibly ISR. This requires that sufficient nodes are available to process the entire focal plane at once. For DC3b, we will likely use algorithms quite similar to those in DC3a, which only required that an entire CCD be processed at the same time. This reduces the number of nodes, and therefore images, required at one time.

Raw exposures will be provided as a set of files, one per segment, representing the CCD. These will be accessed by ISR from a shared disk subsystem that is a cached staging area from mass storage (tape), since the total set of raw exposures will be too large to keep on disk permanently. An LRU cache replacement strategy is expected to be adequate. The expected data access pattern is to retrieve all CCDs (across time) associated with one sky tile, then advance to a neighboring sky tile, perhaps using some sort of space-filling curve. This results in a minimum cache size for DC3b on the order of 16 MB * 150 deep * 3 (guesstimate at average overlap factor) = 7 GB. The size of a sky tile is likely to be about the same as a CCD, so the cache size will scale with the number of processing nodes. We will likely need a much larger cache, on the order of the calibrated exposure cache below, to minimize reloading. Note that, at any cache size, multiple reads from tape of some files will still be required.
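The LRU replacement strategy above can be sketched as follows. This is a minimal in-memory model: `fetch` is a hypothetical stand-in for the tape/iRODS retrieval, and capacity is counted in segments rather than bytes for simplicity:

```python
from collections import OrderedDict

class SegmentCache:
    """LRU cache for segment images staged from mass storage.

    `fetch(segment_id)` stands in for a tape/iRODS retrieval
    (hypothetical hook); `capacity` is in segments for simplicity.
    """
    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch
        self._cache = OrderedDict()
        self.tape_reads = 0

    def get(self, segment_id):
        if segment_id in self._cache:
            self._cache.move_to_end(segment_id)   # mark most recently used
            return self._cache[segment_id]
        self.tape_reads += 1                      # cache miss: go to "tape"
        image = self.fetch(segment_id)
        self._cache[segment_id] = image
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)       # evict least recently used
        return image

cache = SegmentCache(capacity=2, fetch=lambda sid: f"pixels-of-{sid}")
cache.get("seg-A"); cache.get("seg-B"); cache.get("seg-A")  # A most recent
cache.get("seg-C")                                          # evicts seg-B
cache.get("seg-A")                                          # still cached
assert cache.tape_reads == 3  # A, B, C each read from "tape" exactly once
```

Note that, as the text says, at any cache size some segments (here seg-B, if requested again) would still require multiple reads from tape.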

Calibrated Science Exposures, which include post-ISR pixels and metadata such as the WCS and PSF determined by image characterization, will be produced at amplifier segment granularity. Only those segments overlapping the sky tile of interest will be retained; the others will be discarded (or perhaps not produced in the first place). Calibrated Science Exposure files will be written to and accessed from a shared disk subsystem that is organized as a cache. An LRU cache replacement strategy is expected to be adequate. Segments that are already in the cache will obviously not be recalculated. The minimum size of this cache is 150 deep * 3 overlap * sky tile size in pixels * 10 bytes/pixel * processing nodes, or perhaps 1.5 TB for DC3b on the LSST cluster.

ISR/image characterization could work on a demand ("pull") basis, with misses in the Calibrated Science Exposure cache triggering reads into the raw image cache and subsequent processing. On the other hand, it is likely to be more efficient for the work of the ISR/image characterization pipeline to be synchronized with the other pipelines. Such synchronization is possible because the pattern of sky tiles to be processed is known in advance.
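The demand ("pull") alternative can be sketched as a store whose misses trigger the processing step. Here `run_isr` is a hypothetical stand-in for the ISR/image characterization pipeline, and eviction is omitted for brevity:

```python
class CalibratedExposureStore:
    """Pull-model cache: a miss triggers ISR + image characterization.

    `run_isr(segment_id)` is a hypothetical stand-in for the pipeline;
    results are memoized, so a segment already in the cache is
    obviously not recalculated.
    """
    def __init__(self, run_isr):
        self.run_isr = run_isr
        self._cache = {}
        self.pipeline_runs = 0

    def get(self, segment_id):
        if segment_id not in self._cache:        # miss: process on demand
            self.pipeline_runs += 1
            self._cache[segment_id] = self.run_isr(segment_id)
        return self._cache[segment_id]

store = CalibratedExposureStore(run_isr=lambda sid: ("calexp", sid))
store.get("v42-ccd3-amp1")
store.get("v42-ccd3-amp1")        # second request served from cache
assert store.pipeline_runs == 1
```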

Generated Calibrated Science Exposure segment images will likely need to be written to disk and/or tape (for larger runs) for long-term storage. While only the cache is needed for the purposes of the Data Release Production, post-run analysis will need access to these images.

Assumptions

  • Raw images are stored on tape.
  • ISR/image characterization algorithms will work on at most one amplifier segment per node.
  • Algorithms will work on at most one focal plane at a time, or on fewer segments if fewer nodes are available than there are amplifiers.
  • Algorithms will determine all needed image characterization information such as sky background and PSF and associate them with each Calibrated Science Exposure image.
  • All ISR/image characterization processing is deterministic and consistently repeatable.

Required Development

  • Determine raw segment images overlaying a given sky tile.
  • Maintain LRU cache for raw segment images, retrieving desired images from tape if not present in cache. May be implemented using iRODS.
  • Maintain LRU cache for Calibrated Science Exposure segments.
  • Write Calibrated Science Exposure segments to disk and/or tape (for long-term preservation).
  • Retrieve Calibrated Science Exposure segments from disk and/or tape (for analysis).

Co-Add/Template Generation

Each node performing co-add/template generation is assumed to work on one stack of rectangular cutouts from the segment-sized calibrated exposures, with each cutout representing the minimum set of pixels that includes the area that overlaps the co-add sky tile. Note that any stitching of these cutouts is expected to be done by the application, not by the image access system.
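The cutout selection described above amounts to a bounding-box intersection per segment, with one cutout returned per overlapping segment and no stitching. A minimal sketch, assuming segments and tiles share a common pixel coordinate system (an illustrative simplification):

```python
def cutouts_for_tile(tile_box, segments):
    """Return one cutout per segment overlapping the tile (no stitching).

    Boxes are (x0, y0, x1, y1), half-open, in a shared pixel system;
    `segments` maps a segment id to its box (illustrative layout).
    """
    tx0, ty0, tx1, ty1 = tile_box
    cutouts = {}
    for seg_id, (sx0, sy0, sx1, sy1) in segments.items():
        x0, y0 = max(tx0, sx0), max(ty0, sy0)    # intersection of the boxes
        x1, y1 = min(tx1, sx1), min(ty1, sy1)
        if x0 < x1 and y0 < y1:                  # non-empty overlap only
            cutouts[seg_id] = (x0, y0, x1, y1)
    return cutouts

segments = {"amp0": (0, 0, 100, 100), "amp1": (100, 0, 200, 100)}
tile = (80, 10, 120, 50)                         # straddles both segments
result = cutouts_for_tile(tile, segments)
# two cutouts come back; stitching them is the application's job
assert result == {"amp0": (80, 10, 100, 50), "amp1": (100, 10, 120, 50)}
```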

Stacks for DC3b are at most 150 deep, typically 50. Stacks in production may range up to tens of thousands of images for deep-drilling fields. Co-add stacks may need to include all filters, as co-adds may be polychromatic.

The co-add/template generation algorithm will produce one or more co-adds/templates in the pixelized space. These will be written to shared storage.

Co-add Generation and Outlier Rejection

Some classes of co-add generation are fully iterative; these do not need to have all images of a patch of sky available at once, and can generally receive them in any order. With this type of algorithm, we could likely produce a co-add for a reasonably large sky tile on a single node by feeding it all the necessary images sequentially. These algorithms do not allow for outlier rejection, however, which means they will rely more on good masking and will allow transient objects to get "averaged into" the co-add (it is not clear whether this is desirable or not from a difference imaging standpoint).
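A fully iterative co-add of this kind can be sketched as a running per-pixel mean: images arrive one at a time, in any order, and only the running co-add stays in memory. This is an illustrative stand-in, not the production algorithm:

```python
def running_mean_coadd(images):
    """Fully iterative co-add: images arrive one at a time, in any order.

    No outlier rejection, so transients get averaged in. Each "image"
    here is a flat list of pixel values (illustrative only).
    """
    coadd, n = None, 0
    for img in images:
        n += 1
        if coadd is None:
            coadd = list(img)
        else:
            # incremental mean update; only the running co-add is in memory
            coadd = [c + (p - c) / n for c, p in zip(coadd, img)]
    return coadd

stack = [[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]]
assert running_mean_coadd(stack) == [2.0, 2.0]
```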

Co-add generation algorithms which do allow outlier rejection will generally require all images of a patch of sky to be available simultaneously, meaning the co-add will have to be produced using small sky tiles.
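By contrast, an outlier-rejecting stack needs every epoch of a pixel available at once. A sketch using a single median/MAD clipping pass (one possible rejection scheme, chosen here for robustness in a short example; real pipelines iterate and use their own estimators):

```python
import statistics

def clipped_coadd(stack, n_sigma=3.0):
    """Outlier-rejecting co-add: needs the whole stack in memory at once.

    Per pixel, values far from the stack median (in robust-sigma units
    estimated from the median absolute deviation) are rejected before
    averaging. One clipping pass; production code would iterate.
    """
    out = []
    for i in range(len(stack[0])):
        column = [img[i] for img in stack]        # all epochs of one pixel
        med = statistics.median(column)
        mad = statistics.median(abs(v - med) for v in column)
        sigma = 1.4826 * mad                      # MAD -> Gaussian sigma
        kept = [v for v in column if abs(v - med) <= n_sigma * sigma]
        out.append(statistics.mean(kept))
    return out

# A cosmic-ray-like outlier (1000.0) is rejected from the second pixel,
# which a plain running mean could not do.
stack = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [1.0, 1000.0]]
assert clipped_coadd(stack) == [1.0, 2.0]
```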

Vertical Image Chunks

Good outlier rejection may not require having access to all of the images of a patch of sky, however. We should be able to use an outlier-rejection stacking technique to combine images in small groups (10-20 images?), and then combine these sub-co-adds together in an iterative fashion. With this, we probably could build a reasonably large sky tile on a few nodes hierarchically, without splitting the sky into small sky tiles.
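The grouped, hierarchical approach can be sketched as follows; `combine` is any stacking function (for example an outlier-rejecting one), with a plain per-pixel mean as a stand-in. Only one group plus the sub-co-adds need be in memory at a time:

```python
def hierarchical_coadd(images, chunk_size=10, combine=None):
    """Combine images in small groups, then combine the sub-co-adds.

    `combine` is any stacking function over a list of equal-size images;
    a plain per-pixel mean stands in here. Memory holds only one
    `chunk_size` group plus the sub-co-adds, never the full stack.
    """
    if combine is None:
        combine = lambda grp: [sum(col) / len(col) for col in zip(*grp)]
    sub_coadds = [
        combine(images[i:i + chunk_size])
        for i in range(0, len(images), chunk_size)
    ]
    return combine(sub_coadds)

stack = [[float(k)] for k in range(20)]          # 20 one-pixel "images"
# chunks of 10 -> sub-co-adds [4.5] and [14.5] -> final [9.5]
assert hierarchical_coadd(stack, chunk_size=10) == [9.5]
```

Note that a mean of sub-means equals the full mean only when the groups are equal-sized, and with outlier rejection the grouped result is only an approximation of the full-stack result, which is why the text flags temporal awareness of the chunking as a concern.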

If we want transient objects to be averaged into co-adds (or rejected from co-adds) in a consistent way, vertical chunking may have to be temporally-aware.

Assumptions

  • The core region of the co-add of a given sky tile can be deterministically and repeatably computed from the stack of calibrated exposures overlaying it.
  • The algorithm performs reasonably at the junctions of the core regions of sky tiles.
  • If sky tile core regions overlap, computing the same pixel multiple times has negligible effect on its value.

Required Development

  • Determine calibrated exposure segment images overlaying a given sky tile.
  • Determine and retrieve cutout areas of segments overlaying a given sky tile. When the sky tile overlaps multiple segments, multiple cutouts are returned, since no stitching is done by the middleware.
  • Define sky pixelization for co-adds/templates.
  • Write to co-add/template pixel store on a sky tile basis. This may include writing the number of segment images used to produce each co-add pixel, but it will not include the detailed segment image identifiers for each co-add pixel.

Image Differencing

Image differencing is assumed to work on one segment-sized Calibrated Science Exposure per node and a segment-sized cutout from a template image. The calibrated exposures will be taken from the cache from the previous section. The template will be taken from the shared template store.

Image differencing can use an arbitrary order of images, although a spatial order is perhaps desirable to allow re-use of template images.

Generated difference images will likely need to be written to disk and/or tape (for larger runs) for long-term storage. While only the DIASources from each difference image are needed for the purposes of the Data Release Production, post-run analysis will need access to the images themselves.

Required Development

  • Determine template pixels overlaying a given calibrated exposure segment, sky tile, or RA/dec bounding box.
  • Retrieve template pixels.
  • Write difference images to disk and/or tape (for long-term preservation).
  • Retrieve difference images from disk and/or tape (for analysis).

Deep Detection and Multifit Measurement

Postage Stamp Stacks

Multifit measurement is assumed to work on one stack of postage stamps (typically 50x50 pixels) per node. These postage stamps include the footprint of the detected object to be measured. Ideally, these stacks are exactly as deep as the calibrated image stacks they are extracted from, but for large objects we may have to subdivide the stack "vertically". While we hope to operate on smaller postage stamps using the entire stack (in each filter), for larger regions it will be necessary to split the postage stamp datacube into smaller chunks, multifit these separately, and combine them. This would work best as a sequential operation; we would want a single node to operate on one datacube chunk, and then the next chunk of the same postage stamp, etc.
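Extracting a postage stamp stack is, in the simplest case, the same cut applied to every layer of the calibrated image stack. A sketch using plain nested lists (numpy slicing in practice), where the footprint position would come from the detection catalog:

```python
def extract_stamp_stack(image_stack, x0, y0, size=50):
    """Extract a size x size postage stamp from every layer of a stack.

    Each layer is a 2-D list of pixel rows (numpy arrays in practice);
    (x0, y0) is the stamp's corner, taken from the object's footprint.
    """
    return [
        [row[x0:x0 + size] for row in layer[y0:y0 + size]]
        for layer in image_stack
    ]

# Two 100x100 layers with traceable pixel values; cut a stamp at (25, 30).
stack = [[[layer * 10000 + y * 100 + x for x in range(100)]
          for y in range(100)] for layer in range(2)]
stamps = extract_stamp_stack(stack, 25, 30)
assert len(stamps) == 2 and len(stamps[0]) == 50 and len(stamps[0][0]) == 50
assert stamps[0][0][0] == 30 * 100 + 25          # top-left pixel, layer 0
assert stamps[1][0][0] == 10000 + 30 * 100 + 25  # same pixel, layer 1
```

Subdividing the stack "vertically" for large objects would then mean calling this on slices of `image_stack` rather than the whole list.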

Processing

The ideal way to perform these stages is to read in a stack of (co-add) sky tile cutouts from the calibrated images; perform the co-add/template generation; write out the co-adds/templates for use by image differencing and others; perform deep detection on the detection co-add still in memory from the generation step (combined with its neighbors to form a detection sky tile); and then extract one postage stamp stack at a time from the calibrated image stack still in memory, performing multifit measurement on that. Unfortunately, it may be impossible to keep the calibrated image stack, the postage stamp stack, and multifit measurement's internal data structures in memory at the same time.

Sky Tile Overlap and Duplicate Removal

In order to be useful for deep detection, detection sky tiles will have to overlap at some level. Specifically, we need this overlap to be large enough that every object is completely contained by at least one tile (except super-large nearby galaxies, but we need to define what that cutoff is). Because the border region is fixed by the size of objects, it becomes a smaller fraction of a sky tile as the tile grows; this is the main motivation for making sky tiles large, at least for detection.

Detection of objects in the core region of a sky tile is by definition not problematic. The difficulty is identifying which objects are in the core and which are in the border. There is further difficulty if detection sky tiles overlap in their core regions. In these cases, it may be necessary to conservatively detect objects in areas of possible overlap and then remove the duplicates from the resulting catalog before further processing. Although the sky pixelization assures a consistent view of a given object from node to node, the surrounding context may vary depending on the sky tile geometry around the object. Hence, the characteristics of the detection may also vary. Accordingly, duplicate removal may be non-trivial, requiring a spatial association step, possibly with additional heuristics. Such a step may be required anyway, however, to match the detection catalog against the difference imaging catalog.
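The spatial association step for duplicate removal can be sketched as a greedy match within an angular tolerance. The 1 arcsec radius and the "core detection wins" preference are assumptions for illustration; a flat-sky small-angle distance stands in for a proper great-circle one:

```python
import math

def deduplicate(detections, match_radius_deg=1.0 / 3600):
    """Remove duplicate detections arising from overlapping tile borders.

    Two detections closer than `match_radius_deg` (1 arcsec here, an
    assumed tolerance) are treated as the same object; detections from
    a tile's core region are preferred over border detections.
    Flat-sky distance with a cos(dec) factor; real code would use
    great-circle separation and a spatial index.
    """
    kept = []
    for det in sorted(detections, key=lambda d: not d["in_core"]):
        dup = any(
            math.hypot((det["ra"] - k["ra"]) * math.cos(math.radians(det["dec"])),
                       det["dec"] - k["dec"]) < match_radius_deg
            for k in kept
        )
        if not dup:
            kept.append(det)
    return kept

dets = [
    {"ra": 10.00000, "dec": 0.0, "in_core": True,  "tile": "A"},
    {"ra": 10.00010, "dec": 0.0, "in_core": False, "tile": "B"},  # same object
    {"ra": 10.10000, "dec": 0.0, "in_core": True,  "tile": "B"},
]
unique = deduplicate(dets)
# the border duplicate from tile B is dropped; both survivors are core detections
assert [d["tile"] for d in unique] == ["A", "B"]
assert all(d["in_core"] for d in unique)
```

As the text notes, the two detections of one object may have genuinely different measured characteristics, so a production matcher would likely need additional heuristics beyond a pure distance cut.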

Proposed Methodology

Because of the memory and synchronization limitations mentioned in the previous two sections, it appears that the best method for generating the postage stamp stacks will be to retrieve each stamp from the corresponding segment exposure(s) on disk. If the multifit measurement sky tile is the same size as the co-add sky tile, this disk could be local to the node. If not, images will have to be taken from the shared cache.

Required Development

  • Determine if local disk can be used to increase performance for postage stamp generation.
  • If so, write exposure stacks to local disk.
  • Extract a postage stamp stack from an exposure stack.