
Simplified Association Pipeline


Motivation

To date, the Association Pipeline algorithms have read into memory all of the information about Objects and historical DIASources that are potentially within the field of view. 99.96% of this data is irrelevant for the spatial cross-match, 95% is unused for post-match filtering and alert processing, and 97.4% is written back without change. Doing this much I/O on data that is never touched is wasteful. Eliminating it not only speeds up the Association Pipeline; it also simplifies the pipeline and its interfaces with the rest of the DC2 system.

Proposal

We continue to do the cross-match in application code. Instead of storing complete Objects (with photometric redshifts) in binary format in chunk files, we store only the critical information: Object identifier (long long), right ascension (double), and declination (double). This shrinks the chunk files so that at most 72MB of data needs to be read per FOV, which can be done in a couple of seconds on a single present-day machine. The cross-match itself should take less than 1 second on a single machine. Note that this is still at least 50 times faster than doing the same operation in the database, even with the reduced information.
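
For concreteness, here is a minimal sketch of what such a compact chunk record and its reader might look like; the struct, field, and function names are illustrative, not the actual DC2 code. At 24 bytes per Object, the worst-case 72MB per FOV corresponds to roughly 3 million Objects.

    #include <cstdio>
    #include <vector>

    // Compact per-Object record stored in a binary chunk file.
    // 8 + 8 + 8 = 24 bytes per Object on typical platforms, so a
    // worst-case FOV of roughly 3 million Objects is about 72MB.
    struct ChunkRecord {
        long long objectId;  // Object identifier
        double    ra;        // right ascension
        double    decl;      // declination
    };

    // Read one chunk file sequentially into memory; the cross-match
    // then works only on these in-memory records.
    bool readChunk(const char* path, std::vector<ChunkRecord>& out) {
        FILE* f = std::fopen(path, "rb");
        if (!f) {
            return false;
        }
        ChunkRecord rec;
        while (std::fread(&rec, sizeof(rec), 1, f) == 1) {
            out.push_back(rec);
        }
        std::fclose(f);
        return true;
    }

Writing and reading raw structs like this assumes the dump code and the reader agree on byte order and struct padding, which is a reasonable assumption when both run on the same cluster architecture.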

The binary chunk files total 144GB for the entire sky, considering only Objects that are bright enough to be detected in a single exposure. This amount can easily be dumped from the database before the night's observations, taking only a couple of hours on a single present-day machine, less if parallelized. (Even if we consider all Objects that will eventually be found in deep detections, this increases by only a factor of 8 and can be handled with parallelism.)
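
A sketch of the per-chunk dump, assuming a MySQL backend (suggested by the LOAD DATA INFILE ingest described below) and hypothetical Object table columns objectId, ra, decl, and chunkId; 144GB at 24 bytes per record corresponds to roughly 6 billion Objects over the whole sky.

    #include <mysql/mysql.h>
    #include <cstdio>
    #include <cstdlib>

    // Same 24-byte layout as in the reader sketch above.
    struct ChunkRecord {
        long long objectId;
        double    ra;
        double    decl;
    };

    // Dump the compact (objectId, ra, decl) triples for one chunk into a
    // binary chunk file. Runs once per chunk before the night's observing.
    bool dumpChunk(MYSQL* conn, int chunkId, const char* path) {
        char query[128];
        snprintf(query, sizeof(query),
                 "SELECT objectId, ra, decl FROM Object WHERE chunkId = %d",
                 chunkId);
        if (mysql_query(conn, query) != 0) {
            return false;
        }
        MYSQL_RES* res = mysql_use_result(conn);  // stream rows, don't buffer
        if (!res) {
            return false;
        }
        FILE* f = std::fopen(path, "wb");
        if (!f) {
            mysql_free_result(res);
            return false;
        }
        MYSQL_ROW row;
        while ((row = mysql_fetch_row(res)) != 0) {
            ChunkRecord rec;                      // assumes non-NULL columns
            rec.objectId = strtoll(row[0], 0, 10);
            rec.ra       = atof(row[1]);
            rec.decl     = atof(row[2]);
            std::fwrite(&rec, sizeof(rec), 1, f);
        }
        std::fclose(f);
        mysql_free_result(res);
        return true;
    }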

The result of the Association Pipeline will be three lists or catalogs: DIASources with associated Object identifiers (possibly more than one), DIASources with associated Moving Object identifiers (again, possibly more than one), and DIASources without associated Objects that are to be made into new Objects. These results would be written as TSV (tab-separated values) files for ingest into the database.
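
A minimal sketch of writing one of these catalogs as TSV, assuming a simple two-column (diaSourceId, objectId) layout; the actual DC2 columns may differ. A DIASource matched to several Objects simply produces several rows.

    #include <cstdio>
    #include <vector>

    // One cross-match result: a DIASource paired with one matched Object.
    struct Match {
        long long diaSourceId;
        long long objectId;
    };

    // Write matches as tab-separated values, one pair per line, in a form
    // LOAD DATA INFILE can ingest directly.
    bool writeMatchesTsv(const char* path, const std::vector<Match>& matches) {
        FILE* f = std::fopen(path, "w");
        if (!f) {
            return false;
        }
        for (size_t i = 0; i < matches.size(); ++i) {
            std::fprintf(f, "%lld\t%lld\n",
                         matches[i].diaSourceId, matches[i].objectId);
        }
        std::fclose(f);
        return true;
    }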

The first catalog, of DIASources with associated Objects, needs further processing to decide which Object is the primary one, update it (including its historical DIASources), and issue an alert if justified based on the Object and its historical DIASources. This may require reading the full Object and historical DIASource data. But this catalog is expected to contain at most 150K Object identifiers, so the amount of data that needs to be read is at most 555MB, and the amount of data to be written after update is at most 285MB. This is still much better than reading and writing 11.1GB of data, and it can be parallelized. This processing is also officially out of scope for DC2.

One objection is that this read of full Object data is not a simple sequential scan, as the matched Object identifiers will be scattered. This can be handled, however, by separating the Object catalog on disk into two sections: the known variable Objects and the rest. Almost all of the 150K Objects emerging from the cross-match will be found in the variable section, and, with that section clustered spatially, those Objects will be densely packed on disk. The relatively few unusual transients can be looked up rapidly in the main Object catalog.

In fact, the known variable Object data pertaining to the field of view of the exposure can be pre-loaded into memory from disk as soon as the field of view is known, so the bulk of the I/O is done outside the critical path. This leaves just a few random disk reads for known Objects that were previously constant but have now varied.

Other operations against the database may need to perform a single UNION on the variable Object catalog and the normal Object catalog (or the same for historical DIASources). This is still far preferable to the complex algorithms needed to determine and UNION chunk tables in the partitioning schemes we were considering previously.
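
For illustration, assuming the two sections are stored as tables named VarObject (the known variable Objects) and Object (the rest), a query over the whole Object set is just the per-table query UNIONed together; the ? placeholders stand for the FOV bounds.

    // Hypothetical table names: VarObject holds the known variable Objects,
    // Object holds the rest. A query over the whole Object set simply
    // UNIONs the two; no chunk-table bookkeeping is needed.
    const char* kFovObjectQuery =
        "SELECT objectId, ra, decl FROM VarObject "
        "  WHERE ra BETWEEN ? AND ? AND decl BETWEEN ? AND ? "
        "UNION ALL "
        "SELECT objectId, ra, decl FROM Object "
        "  WHERE ra BETWEEN ? AND ? AND decl BETWEEN ? AND ?";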

Effect on DC2

The cross-match algorithmic code, already written, need not change other than to drop the unneeded Object data fields and the historical DIASources. The chunk reading code needs to drop this excess data as well. A version of the pre-nightly-processing database dump code has been written, although some control code is needed to iterate over chunks. The AP output code, which was being written as part of handling chunks in shared memory, is now vastly simplified: it writes much less data, does not need to worry about overlapping chunks in consecutive fields of view, and persists the data in simple TSV form. Database ingest code will simply be a series of LOAD DATA INFILE statements.
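
For example, one such statement, issued here through the MySQL C API with placeholder file and table names:

    #include <mysql/mysql.h>

    // Ingest one TSV catalog produced by the AP. MySQL's defaults already
    // expect tab-separated fields and newline-terminated rows, so no extra
    // format clauses are needed.
    bool loadMatches(MYSQL* conn) {
        return mysql_query(conn,
            "LOAD DATA INFILE '/tmp/ap_matches.tsv' "
            "INTO TABLE DiaSourceToObjectMatch (diaSourceId, objectId)") == 0;
    }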

The entire AP (not including the Object update and alert generation, which are out of DC2 scope) should now be fast enough that it can be run in a single Stage, and even on a single Slice, within a Pipeline, significantly simplifying the interface to the pipeline framework. We would no longer require complex event handling to deal with the three input events; instead, we can wait until all three events have arrived and only then begin processing.

Conclusion

By avoiding reading useless data and clustering the remaining data appropriately, we should be able to substantially speed up and simplify the Association Pipeline for DC2. While this is a late change, it involves mostly removing portions of things already written and simplifies the portions not yet written, especially the previously complex interactions with the pipeline framework.