Last modified on 01/08/2008 11:48:36 AM

Processing pixels with the database

LSST Database

In general, today's relational database management systems (RDBMSs) are not ideal for storing large volumes of unqueryable data such as images. The overhead of storing such objects in the database, rather than in a simpler filesystem, is usually too great.

On the other hand, it is desirable to provide the user with a single system view in which the database (containing metadata, Objects, Sources, provenance, etc.) and the images are unified. This document describes some of the challenges and possible solutions.


Manual model

A query is issued to the database to determine which images are needed. The resulting list of image pathnames and cutout coordinates is then used in a separate step to obtain the actual pixels for processing. This does not provide the desired single system view.
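The two separate steps can be sketched as follows. This is only an illustration: the `Image` table, its bounding-box columns, the file paths, and the `read_cutout` helper are all hypothetical, and an in-memory SQLite database stands in for the real metadata store.

```python
import sqlite3

# Hypothetical metadata table: one row per image with its sky bounding box.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE Image (path TEXT, ra_min REAL, ra_max REAL, "
    "dec_min REAL, dec_max REAL)"
)
db.executemany(
    "INSERT INTO Image VALUES (?, ?, ?, ?, ?)",
    [
        ("/archive/img_001.fits", 10.0, 10.5, -5.0, -4.5),
        ("/archive/img_002.fits", 10.4, 10.9, -5.0, -4.5),
        ("/archive/img_003.fits", 50.0, 50.5, 20.0, 20.5),
    ],
)

def select_images(ra, dec):
    """Step 1: ask the database which images cover the position."""
    rows = db.execute(
        "SELECT path FROM Image "
        "WHERE ? BETWEEN ra_min AND ra_max AND ? BETWEEN dec_min AND dec_max "
        "ORDER BY path",
        (ra, dec),
    )
    return [path for (path,) in rows]

def read_cutout(path, ra, dec, half_width):
    """Step 2: stand-in for a pixel reader that runs outside the database."""
    # A real reader would open the file and slice the pixel array here.
    return {"path": path, "center": (ra, dec), "half_width": half_width}

paths = select_images(10.45, -4.8)
cutouts = [read_cutout(p, 10.45, -4.8, 0.01) for p in paths]
```

The user has to carry the pathnames and cutout coordinates from the first step to the second by hand, which is exactly the seam that the single system view is meant to hide.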

SQL-based model

An SQL query is issued to the database; part of that query specifies operations on the image pixels. Behind the scenes, the database obtains the relevant image pathnames and performs the desired operations on the image data via some sort of external execution.
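One way to picture this model is a user-defined SQL function that the database calls for each selected row. The sketch below uses SQLite's `create_function` purely as an illustration; the `cutout_mean` function, the `Image` table, and the in-memory `PIXELS` dictionary (standing in for files read by an external process) are assumptions, not part of any LSST design.

```python
import sqlite3

# Hypothetical pixel store standing in for image files on disk.
PIXELS = {
    "/archive/img_001.fits": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    "/archive/img_002.fits": [[10, 10, 10], [10, 10, 10], [10, 10, 10]],
}

def cutout_mean(path, x0, y0, x1, y1):
    """Mean pixel value in the half-open box [x0, x1) x [y0, y1)."""
    rows = PIXELS[path][y0:y1]
    values = [v for row in rows for v in row[x0:x1]]
    return sum(values) / len(values)

db = sqlite3.connect(":memory:")
# Register the pixel operation so it can appear inside SQL text.
db.create_function("cutout_mean", 5, cutout_mean)
db.execute("CREATE TABLE Image (imageId INTEGER, path TEXT)")
db.executemany(
    "INSERT INTO Image VALUES (?, ?)",
    [(1, "/archive/img_001.fits"), (2, "/archive/img_002.fits")],
)

# The query both selects images and specifies the operation on their pixels.
results = db.execute(
    "SELECT imageId, cutout_mean(path, 0, 0, 2, 2) FROM Image ORDER BY imageId"
).fetchall()
```

Here the function runs in the same process as the database; in the model described above, the equivalent call would hand the pathname to some external execution mechanism.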

Pipeline-based model

A pipeline policy is developed that includes image selection predicates. These predicates are used by a pipeline stage to query the database and retrieve the appropriate images. Further pipeline stages operate on these images.
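A minimal sketch of this flow, assuming a policy expressed as a plain dictionary and stages expressed as ordinary functions. The predicate names (`filter`, `max_seeing`), the `Image` table, and the stage functions are hypothetical and do not reflect the actual pipeline framework.

```python
import sqlite3

# Hypothetical policy: image selection predicates kept as data, not code.
policy = {"filter": "r", "max_seeing": 1.0}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Image (path TEXT, filter TEXT, seeing REAL)")
db.executemany(
    "INSERT INTO Image VALUES (?, ?, ?)",
    [
        ("/archive/a.fits", "r", 0.8),
        ("/archive/b.fits", "r", 1.4),
        ("/archive/c.fits", "g", 0.7),
    ],
)

def select_stage(policy):
    """First stage: turn the policy predicates into a database query."""
    rows = db.execute(
        "SELECT path FROM Image WHERE filter = ? AND seeing <= ? ORDER BY path",
        (policy["filter"], policy["max_seeing"]),
    )
    return [path for (path,) in rows]

def load_stage(paths):
    """Second stage: stand-in for reading pixels from each selected file."""
    return [{"path": p, "pixels": None} for p in paths]

def count_stage(images):
    """Later stage: any operation on the retrieved images."""
    return len(images)

def run_pipeline(policy, stages):
    """Feed the output of each stage into the next one."""
    data = policy
    for stage in stages:
        data = stage(data)
    return data

n_images = run_pipeline(policy, [select_stage, load_stage, count_stage])
```

Only the first stage talks to the database; everything after it works on the images that stage retrieved.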

Bigtable model

The images are stored in (image-typed) columns of the database and appropriate operations are defined within the database's query language. Images are handled in the same manner as any other data type.
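As an illustration only, the sketch below stores tiny images in a BLOB column of an in-memory SQLite database and registers one image operation as an SQL function. The serialization format, the `Exposure` table, and the `image_max` function are assumptions made for the example; a real image-typed column would need a far richer type and operator set.

```python
import sqlite3
import struct

def pack_image(width, height, pixels):
    """Serialize an image as width, height, then unsigned 16-bit pixels."""
    return struct.pack(f"<II{width * height}H", width, height, *pixels)

def image_max(blob):
    """SQL-callable operation on an image-typed column value."""
    width, height = struct.unpack_from("<II", blob, 0)
    pixels = struct.unpack_from(f"<{width * height}H", blob, 8)
    return max(pixels)

db = sqlite3.connect(":memory:")
db.create_function("image_max", 1, image_max)
db.execute("CREATE TABLE Exposure (exposureId INTEGER, filter TEXT, pixels BLOB)")
db.executemany(
    "INSERT INTO Exposure VALUES (?, ?, ?)",
    [
        (1, "r", pack_image(2, 2, [5, 9, 3, 1])),
        (2, "r", pack_image(2, 2, [7, 2, 8, 4])),
        (3, "g", pack_image(2, 2, [50, 60, 70, 80])),
    ],
)

# The image column is used in a predicate like any other column.
bright = db.execute(
    "SELECT exposureId FROM Exposure "
    "WHERE filter = 'r' AND image_max(pixels) > 8"
).fetchall()
```

The query mixes an ordinary metadata predicate with a pixel-level predicate, which is the defining property of this model.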

Key Questions

The key questions to be answered are:

  • Which images need to be processed?
  • How should those images be combined?
  • What should be done with the result?

Ideally, all of these questions should be answered by combining operators in a single language. In practice, it is more likely that a declarative language and a procedural language will have to be combined to provide sufficient expressiveness.

Access Patterns

For deep detection, we expect images to be accessed primarily in a sequential fashion, with all the images covering a given portion of the sky processed before the images covering another part of the sky.

For more general queries, though, will random access to images be required? Perhaps only pixels around certain objects will be of interest.

Will operations in queries be performed on single images or multiple images? If multiple, will they be localized in space or could they potentially be widely separated?

Science Issues

While it is easy to extract a given exposure from the archive, it is much more difficult to generate an image covering a given region of the sky. Some of the issues include:

  • Dealing with overlapping images, including possible edge effects.
  • Non-square pixels that may not precisely align.
  • Differing epochs for pixels with different observational characteristics.
  • Non-rectangular intersections between images because of masks, etc.


If the above science issues can be resolved, it may be desirable to present the user with a more intuitive model in which the "best" value of any pixel in any part of the sky (in some reference frame) at a given time can be determined. One way of doing this might be to transpose the image data so that all the values of a given pixel over time are stored together. Unfortunately, this may be difficult. Not only is aligning pixels difficult, but we also have 300 Mpixels per square degree, or about 10 Tpixels overall. At 1000 exposures per pixel on average, this would be 1E16 values, each occupying at least 2 bytes, for a total of at least 20 PB. While this much data for the raw images can easily be stored on tape within the project budget, it is less clear that the transposed data can be stored on an easily queryable medium at reasonable cost.
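The size estimate can be checked by multiplying out the figures quoted above. The constant names below are introduced only for this calculation.

```python
PIXELS_PER_SQ_DEG = 300 * 10**6   # 300 Mpixels per square degree
TOTAL_PIXELS = 10 * 10**12        # about 10 Tpixels overall
EXPOSURES_PER_PIXEL = 1000        # average number of exposures per pixel
BYTES_PER_VALUE = 2               # lower bound on storage per value

# Sky area implied by the two pixel counts.
sky_area_sq_deg = TOTAL_PIXELS / PIXELS_PER_SQ_DEG

# One stored value per pixel per exposure in the transposed layout.
total_values = TOTAL_PIXELS * EXPOSURES_PER_PIXEL
total_bytes = total_values * BYTES_PER_VALUE

print(f"implied sky area: {sky_area_sq_deg:,.0f} square degrees")
print(f"values: {total_values:.0e}")
print(f"bytes:  {total_bytes:.0e} ({total_bytes // 10**15} PB)")
```

The implied sky area is roughly 33,000 square degrees, and the 2-byte lower bound puts the transposed data at no less than 20 PB before any indexing or replication overhead.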

Streaming Images

One possible solution might be to continuously stream the images from tape through an analysis pipeline, starting again from the first image after the last one is processed. Queries on the images would be expressed in some language and executed by modules plugged into the pipeline. Interesting pixels or whole images, perhaps those within a certain radius of objects of interest or those needed to do regional comparisons, could be cached by a module. New modules could be plugged in at any time; the time to query completion would be essentially constant (roughly one full pass through the archive), independent of load, query selectivity, or query access patterns.
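A toy version of this scheme, assuming a four-image archive and a hypothetical `BrightImageModule` query module, shows why completion time does not depend on when a module is plugged in: each module finishes after seeing every image exactly once.

```python
from itertools import cycle, islice

# Hypothetical archive: (image id, sky region, peak pixel value) in tape order.
ARCHIVE = [(1, "A", 12), (2, "A", 40), (3, "B", 7), (4, "B", 55)]

class BrightImageModule:
    """Example query module: caches ids of images above a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.cache = []
        self.seen = 0

    def process(self, image):
        image_id, _region, peak = image
        self.seen += 1
        if peak > self.threshold:
            self.cache.append(image_id)

    def done(self):
        # A query completes once it has seen every image exactly once.
        return self.seen >= len(ARCHIVE)

def stream(archive, modules_by_start, n_images):
    """Cycle through the archive; modules may be plugged in at any position."""
    active, finished = [], []
    for position, image in enumerate(islice(cycle(archive), n_images)):
        active.extend(modules_by_start.get(position, []))
        for module in active:
            module.process(image)
        finished.extend(m for m in active if m.done())
        active = [m for m in active if not m.done()]
    return finished

early = BrightImageModule(threshold=30)
late = BrightImageModule(threshold=10)
# "late" is plugged in mid-pass and wraps around to see the images it missed.
finished = stream(ARCHIVE, {0: [early], 2: [late]}, n_images=8)
```

Both modules process the same number of images regardless of start position or of how many images they end up caching, so adding more modules changes the work per image but not the length of a pass.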

The same queries could be run against a small section of sky kept on disk for testing purposes, enabling rapid turnaround in development.

The image processing results, which might be in the form of columns or rows to add to the Object and Source tables, would likely have to go into a private space that could still be linked with the main database for further queries.
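One way such a linked private space might look is sketched below using SQLite's `ATTACH DATABASE`, which lets a single connection join tables across two databases. The `scratch` schema name, the `ObjectFlux` table, and the column names are hypothetical.

```python
import sqlite3

# The main connection stands in for the shared database.
db = sqlite3.connect(":memory:")

# Attach a second, user-owned database as the private space.
db.execute("ATTACH DATABASE ':memory:' AS scratch")

db.execute("CREATE TABLE Object (objectId INTEGER PRIMARY KEY, ra REAL, decl REAL)")
db.executemany(
    "INSERT INTO Object VALUES (?, ?, ?)",
    [(1, 10.1, -4.9), (2, 10.6, -4.7), (3, 50.2, 20.1)],
)

# Image processing results go into the private space, keyed by objectId.
db.execute("CREATE TABLE scratch.ObjectFlux (objectId INTEGER, flux REAL)")
db.executemany(
    "INSERT INTO scratch.ObjectFlux VALUES (?, ?)",
    [(1, 120.5), (3, 87.0)],
)

# Further queries can join the private results back to the main tables.
joined = db.execute(
    "SELECT o.objectId, o.ra, p.flux "
    "FROM Object AS o JOIN scratch.ObjectFlux AS p ON o.objectId = p.objectId "
    "ORDER BY o.objectId"
).fetchall()
```

The main tables are never modified; the new column lives in the private table and is linked through the shared key.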