
Bosch's Refactoring Thoughts: Butler/Mapper

Comments are welcome, but please add them in italics, prefixed with your initials or username; insert them directly after the text you want to comment on.

Defining Datasets

Currently, when a new Task class is created or an existing Task is modified to produce new outputs, the new output data products need to be added to all camera mappers. This is already annoying, and as the number of cameras increases it will clearly become unsustainable.

Instead, I think new datasets should be defined by the Tasks (probably via class attributes), using a set of standardized template strings defined by the camera-specific mappers. Mappers could also choose to override specific dataset paths, but this should be unusual except perhaps for the raw data.
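As a rough sketch of how this might look (the attribute and template names here are hypothetical, not existing stack APIs), a Task could declare its outputs as class attributes that reference standardized template names, and each camera mapper would supply only the templates:

    class CoaddTask:
        """Hypothetical Task that declares its own output datasets."""
        # Maps each dataset name to the Python type that holds it and the
        # name of a standardized path template the camera mapper must provide.
        outputDatasets = {
            "deepCoadd": dict(python="lsst.afw.image.ExposureF",
                              template="coaddExposure"),
            "deepCoadd_src": dict(python="lsst.afw.table.SourceCatalog",
                                  template="coaddCatalog"),
        }

    class ExampleCameraMapper:
        """Hypothetical mapper: supplies only the standardized templates."""
        templates = {
            "coaddExposure": "coadds/%(tract)d/%(patch)s/%(dataset)s.fits",
            "coaddCatalog": "coadds/%(tract)d/%(patch)s/%(dataset)s_cat.fits",
        }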

Iteration and Completion for Coadd

Coadd data IDs are the same for all cameras, due to our reliance on the skymap package. I believe it's safe to rely on this assumption in the mapper and butler, which should allow us to support iteration and completion for coadd data IDs of the sort the registry currently provides for raw data IDs.

Of course, many data IDs will be formed by combining skymap keys with raw data IDs (e.g. "visit"+"tract" for coaddTempExp), and we will want iteration and completion for these as well.
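A minimal, self-contained illustration of what "completion" would mean for such combined IDs, with a hard-coded list standing in for a registry query (nothing here is an existing butler API):

    KNOWN_IDS = [  # stand-in for a registry query of existing coaddTempExps
        {"visit": 1234, "tract": 0},
        {"visit": 1234, "tract": 1},
        {"visit": 5678, "tract": 1},
    ]

    def complete(partial):
        """Yield every known data ID consistent with the partial data ID."""
        for dataId in KNOWN_IDS:
            if all(dataId.get(k) == v for k, v in partial.items()):
                yield dataId

    for dataId in complete({"visit": 1234}):
        print(dataId)  # {'visit': 1234, 'tract': 0}, then {'visit': 1234, 'tract': 1}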

Data ID Flexibility and Camera Hierarchies

The current flexibility of data keys can sometimes be problematic for algorithms that rely on structure in the data that is considered to be camera-specific by the butler. The coaddition and background matching algorithms center on a concept of "visit", and currently have to twist themselves in knots to define which combination of data ID keys is shared by a visit, because "visit" isn't a camera-generic concept in our current system. Photometric and astrometric self-calibration requires even more knowledge about the structure of a data ID, making it virtually impossible to implement in a camera-generic way in the current stack.

For raw data IDs, we should start by having a mapper subclass that assumes a simple data ID with numeric "visit" and "sensor" keys. We'd want a system of aliases that would allow camera-specific forms of the data ID to be mapped to this and to allow those aliases to be used in mapper paths, but within algorithmic code we'd be able to use the "visit+sensor" form with no loss of generality. Similarly, we should have all cameras support a "filter" key, with alternate terms like "band" allowed as an alias. This mapper subclass would work (as an intermediate base class, of course) for every camera I can think of aside from SDSS, which would require a custom subclass because of the drift-scan approach and the fact that multiple bands are observed simultaneously. Pictorially, I'm imagining a mapper hierarchy that looks like this:

           Mapper
          /      \
SdssMapper      SimpleCameraMapper
              /        |          \
       LsstMapper  SubaruMapper  CfhtMapper (etc)
                    /       \
             HscMapper     SuprimecamMapper

As before, whenever possible, algorithms would rely only on the requirements in Mapper; when necessary (e.g. ISR, coaddition, ubercal), algorithms would be written assuming a SimpleCameraMapper, with the subtask-reconfiguration approach used to support others such as SDSS.
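A sketch of the alias idea in the "visit"+"sensor" scheme, with all class and method names invented for illustration:

    class SimpleCameraMapper:
        """Hypothetical intermediate base: numeric "visit"/"sensor" keys."""
        aliases = {"band": "filter"}  # generic aliases shared by all cameras

        def normalizeDataId(self, dataId):
            """Return a data ID that uses only the generic key names."""
            return {self.aliases.get(k, k): v for k, v in dataId.items()}

    class ExampleSubaruMapper(SimpleCameraMapper):
        # This camera calls its sensor key "ccd"; alias it to the generic form.
        aliases = dict(SimpleCameraMapper.aliases, ccd="sensor")

    mapper = ExampleSubaruMapper()
    print(mapper.normalizeDataId({"visit": 1000, "ccd": 5, "band": "r"}))
    # -> {'visit': 1000, 'sensor': 5, 'filter': 'r'}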

The Role of DataRef

I'd like to see DataRef treated as simply the combination of a data ID with a Butler, and something we'd use in place of data ID dictionaries virtually everywhere; as soon as possible, we'd convert all data ID dictionaries to DataRefs. Most of our routines would take DataRefs, not data ID dictionaries, and if they do accept data ID dictionaries, they'd simply use these to create or expand DataRefs.

I think this is mostly the role DataRef already plays, but it's clunky in some places (e.g. for coadd data IDs, because of the reliance on the registry), and its history as a later addition is apparent in others.

I think it could also be nice to provide a set-like API for splitting and combining various aspects of a data ID. I've always been a bit disappointed that Python dictionaries don't have more of this themselves, and there are places in our code (e.g. coaddHelpers.py in pipe_tasks) that would clearly benefit from it.
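For instance (a sketch only; DataIdLike is not an existing class), splitting a data ID into its shared per-visit part and its per-sensor remainder might look like:

    class DataIdLike(dict):
        """Data ID dictionary with set-like splitting and combining."""
        def __and__(self, other):
            # Keys and values common to both IDs (e.g. the per-visit part).
            return DataIdLike(self.items() & other.items())

        def __sub__(self, other):
            # The part of this ID not shared with the other.
            return DataIdLike((k, v) for k, v in self.items()
                              if other.get(k) != v)

    a = DataIdLike(visit=1234, sensor=3, tract=0)
    b = DataIdLike(visit=1234, sensor=7, tract=0)
    print(a & b)  # {'visit': 1234, 'tract': 0} -- shared by the visit
    print(a - b)  # {'sensor': 3} -- what distinguishes this sensor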

Numeric Data ID Mappings

We currently support converting a small set of data IDs into unique, fully numeric IDs, and provide information about how many bits these IDs require so they can be combined to form larger IDs (e.g. CCD exposure IDs are part of source IDs). Many aspects of this feature need to be improved:

  • We need a public API for the reverse mappings, and a way to define these mappings such that either the forward or reverse mapping is defined from the other. RHL: The API should support being passed a numpy array of IDs, returning a list of fields (or, better, a dict).
  • We should do a complete audit of what numeric IDs we might want, and ensure these can be made available for all cameras without individual cameras having to implement any more of them than necessary.
  • We need a better correspondence between the names and IDs used in cameraGeom and the names and IDs used by the mapper. All camera-specific names and IDs should be defined in only one place and used by both cameraGeom and the mapper.

It'd be particularly convenient if the mappings between numeric and dictionary data IDs were available directly from a DataRef object.
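A self-contained sketch of defining the forward and reverse mappings from a single bit-field layout, with the reverse accepting a numpy array and returning a dict of arrays as RHL suggests (the field layout here is invented for illustration):

    import numpy as np

    FIELDS = [("visit", 16), ("sensor", 8)]  # (name, bits), most-significant first

    def packId(dataId):
        """Forward mapping: dictionary data ID -> unique numeric ID."""
        result = 0
        for name, bits in FIELDS:
            value = dataId[name]
            assert 0 <= value < (1 << bits)
            result = (result << bits) | value
        return result

    def unpackIds(numericIds):
        """Reverse mapping, derived from the same FIELDS definition; accepts
        a numpy array of IDs and returns a dict of arrays."""
        numericIds = np.asarray(numericIds)
        out = {}
        for name, bits in reversed(FIELDS):
            out[name] = numericIds & ((1 << bits) - 1)
            numericIds = numericIds >> bits
        return out

    ccdExposureId = packId({"visit": 1234, "sensor": 42})
    print(unpackIds(np.array([ccdExposureId])))
    # -> {'sensor': array([42]), 'visit': array([1234])}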

Aggregate Datasets and Caching

Some of our datasets, particularly exposures, have many subcomponents, such as WCSs, PSFs, and even bounding boxes. So far, we've implemented this by considering these to be "pieces" of an exposure to be extracted without loading the full exposure. The current pipeline flow is starting to demonstrate some of the shortcomings of this approach, however:

  • The WCS of an exposure can also be logically considered to be part of other datasets, such as the source table derived from that exposure. (The same is true of the PSF, photometric calibration, background, etc., but I'll use WCS as my example here).
  • Ubercal creates a new WCS to be associated with each exposure, and while these may be stored separately on disk, we'd like to be able to create a new dataset that would combine the calexp pixel data with the ubercal WCS etc. (See also Dataset Flavors and Versioning, below).
  • Some complex serialized objects, such as CoaddPsf, themselves contain many other objects, in this case the Psfs and Wcss of the exposures that went into a coadd. Considering that the CoaddPsf of a neighboring patch will contain many of the same Psfs and Wcss, it's highly desirable to store each of these only once within a data repo and, moreover, to be able to use an existing in-memory object instead of re-reading it from disk and creating a duplicate.

I think fully supporting the last point would require making the butler for a particular data repository a singleton that maintains weak references to all previously-loaded datasets. It may also require providing a C++ interface for a limited subset of butler functionality (as complex C++ objects may need to query for a previously-loaded instance of a subcomponent in their own deserialization routines). For this to work, it may be necessary to have a mapping from dictionary to numeric ID (and back) for 'all' datasets, not just those for which it is deemed useful for public API reasons, as dictionary-like data IDs are difficult to deal with in C++ and may not work well as keys for a cache of previously-loaded datasets. I believe the requirements on this sort of dict-numeric mapping are somewhat different from the dict-numeric mapping discussed previously, however, and it may be necessary to treat them differently. I'll also discuss this issue on my persistence framework page.
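A minimal sketch of the per-repository singleton-with-weak-cache idea (all names are hypothetical, and the persistence layer is stubbed out):

    import weakref

    class CachingButler:
        """One instance per repository, weakly caching loaded datasets."""
        _instances = {}

        def __new__(cls, repo):
            if repo not in cls._instances:
                inst = super().__new__(cls)
                inst._cache = weakref.WeakValueDictionary()
                cls._instances[repo] = inst
            return cls._instances[repo]

        def get(self, datasetType, numericId):
            key = (datasetType, numericId)  # numeric IDs make clean cache keys
            obj = self._cache.get(key)
            if obj is None:  # not loaded yet, or already garbage-collected
                obj = self._readFromDisk(datasetType, numericId)
                self._cache[key] = obj
            return obj

        def _readFromDisk(self, datasetType, numericId):
            class Dataset:  # stand-in for the real persistence layer
                pass
            obj = Dataset()
            obj.label = (datasetType, numericId)
            return obj

    b = CachingButler("/path/to/repo")
    psf = b.get("psf", 42)
    assert b.get("psf", 42) is psf  # a second get() reuses the in-memory object
    assert CachingButler("/path/to/repo") is b  # one butler per repository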

On the question of how to store aggregates, while it's somewhat convenient to package related things up into the same file as we've been doing with exposures and multi-extension FITS files, I'm personally willing to jettison some of that as long as we have very easy ways to repackage them for export. I would guess that the easiest way to implement this change, however, would be to continue to store most aggregate datasets as we do now, but allow some new datasets to be built from components stored in files more closely associated with another dataset. For example, a "calexp" FITS file could remain largely the same, holding the three image planes as well as additional HDUs for other components, while an "ubercalexp" would be created on-the-fly by loading the image planes from the "calexp" and attaching an updated WCS and Calib loaded from another file or files.
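A sketch of defining such an on-the-fly composite (the definition format and all names are invented; fetch stands in for a butler.get call on a fixed data ID):

    from types import SimpleNamespace

    # Hypothetical composite definition: component -> (source dataset, attribute).
    COMPOSITES = {
        "ubercalexp": {
            "pixels": ("calexp", "maskedImage"),  # reuse the calexp pixel data
            "wcs": ("ubercal_wcs", None),         # updated WCS from ubercal
            "calib": ("ubercal_calib", None),     # updated photometric calib
        },
    }

    def getComposite(fetch, name):
        """Assemble a composite dataset from separately-stored components."""
        parts = {}
        for component, (datasetType, attr) in COMPOSITES[name].items():
            obj = fetch(datasetType)
            parts[component] = getattr(obj, attr) if attr is not None else obj
        return parts

    # Demo with stand-in objects in place of real butler reads:
    store = {
        "calexp": SimpleNamespace(maskedImage="<calexp pixels>"),
        "ubercal_wcs": "<updated wcs>",
        "ubercal_calib": "<updated calib>",
    }
    print(getComposite(store.__getitem__, "ubercalexp"))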

Dataset Flavors and Versioning

Many of our datasets are similar to others. We have many varieties of coadd, which may differ only in the config parameters used to create them, but which nevertheless must be able to coexist within the same data repository. Similarly, postISRCCD, while largely unused today, can in some respects be considered a stand-in for a calexp, and in the above section an "ubercalexp" dataset that combines the calexp with ubercal results was introduced. For datasets that are in some sense substitutable (I'm being intentionally vague about what precisely that means), we should perhaps use the same dataset name and add some sort of "flavor" and/or "version" data ID key.

This is largely an issue relating to how datasets are defined (which, as I argue above, should be a Task responsibility), not how the Butler or Mapper classes are designed, but we should ensure that those designs permit the use of this sort of data ID key, and perhaps they should include some features to support this sort of usage. The ability to provide default values (e.g. "flavor=vanilla") for certain data ID keys may be helpful, for instance, or support for placeholder values (e.g. "version=latest"). Clearly this idea needs more thought than I've given it, but I think we need something like this to deal with our proliferation of datasets due to coadd flavors and ubercal in a sane way.
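As a sketch of the default-value and placeholder ideas (nothing here exists in the current butler; the names are illustrative):

    DATA_ID_DEFAULTS = {  # hypothetical per-dataset defaults
        "deepCoadd": {"flavor": "vanilla", "version": "latest"},
    }

    def expandDataId(datasetType, dataId, availableVersions):
        """Apply default keys, then resolve placeholders like version="latest"."""
        full = dict(DATA_ID_DEFAULTS.get(datasetType, {}), **dataId)
        if full.get("version") == "latest":
            full["version"] = max(availableVersions)
        return full

    print(expandDataId("deepCoadd", {"tract": 0, "patch": "1,1"},
                       availableVersions=[1, 2, 3]))
    # -> {'flavor': 'vanilla', 'version': 3, 'tract': 0, 'patch': '1,1'}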

Dataset Agnostic Access to Outputs (RHL)

I have quite a lot of code to e.g. plot 3-colour diagrams, and it runs on the results of SFM and coadds. Doing this required a hack, as the code thinks about a "src" but that's not what the coadd outputs are called. I'd like the butler to allow this, either via a set of aliases (similar to the table slots) or via a sort of "using" directive (e.g. using "coadds"; s = butler.get("src", ...)).

The solution may be the same as for Jim's "Dataset Flavors and Versioning", but the use case is different.
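One possible shape for the alias approach, as a sketch (AliasingButler and defineAlias are invented names, not an existing API):

    class AliasingButler:
        """Wraps a butler, translating dataset-type aliases before lookup."""
        def __init__(self, butler):
            self._butler = butler
            self._aliases = {}

        def defineAlias(self, alias, datasetType):
            self._aliases[alias] = datasetType

        def get(self, datasetType, *args, **kwargs):
            actual = self._aliases.get(datasetType, datasetType)
            return self._butler.get(actual, *args, **kwargs)

    # Usage: point the generic "src" at the coadd outputs, then run the same
    # plotting code unchanged (realButler and dataId assumed to exist):
    #   butler = AliasingButler(realButler)
    #   butler.defineAlias("src", "deepCoadd_src")
    #   s = butler.get("src", dataId)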

Lazy/Proxy Loading

I'd like to see this feature removed, or at least disabled by default. The comments around its implementation imply it was intended for use with the Clipboard, and in my own use I always pass immediate=True; I can think of no use case in the current task system in which a lazy proxy is valuable.

Replace PAF Files

Everyone wants this, and it's clearly time. As noted above, I believe much of the content that currently goes into a mapper policy file should instead go into Task definitions, as these should be responsible for defining their own outputs. My first choice for what remains would be to put it in the Mapper class Python code as well, probably via class attributes. I do not advocate using pex_config for this, as it is not configuration information and doesn't need the complex history management and override system that pex_config provides; simple data structures, in the form of dictionaries and lists, would be better. If the amount of content does get sufficiently large that it starts to make the code in Mapper files harder to read, then I'd advocate adopting a standard text-based file format for hierarchical data (JSON and YAML come to mind).
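For what stays in the mapper, a sketch of the plain-class-attribute alternative to PAF (the layout and keys are illustrative only):

    class ExampleCameraMapper:
        """Hypothetical mapper defined with simple data structures, not PAF."""
        # Plain dicts: no pex_config history tracking or override machinery.
        templates = {
            "raw": "raw/%(visit)d/%(sensor)03d.fits",
            "calexp": "calexp/%(visit)d/%(sensor)03d.fits",
        }
        datasetTypes = {
            "raw": dict(python="lsst.afw.image.DecoratedImageU",
                        storage="FitsStorage"),
            "calexp": dict(python="lsst.afw.image.ExposureF",
                           storage="FitsStorage"),
        }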

Versioning for cameraGeom

The information in cameraGeom's PAF files needs to be versioned, probably in a very similar way to how detrend files are handled now. I'll talk about this more on my cameraGeom page, but it's worth mentioning here as this will likely require Butler/Mapper support as well.

Avoid Closures

The current CameraMapper relies heavily on "closures" (local functions that capture state from enclosing variables). I find these make the code hard to follow and debug, and I'd much prefer another way of implementing overrides.
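To illustrate the distinction (both versions are schematic, not the actual CameraMapper code): the closure captures its state invisibly, while an equivalent small class carries the same state as an inspectable attribute:

    def applyDefaults(item, datasetType):  # stub for illustration
        return "%s standardized as %s" % (item, datasetType)

    # Closure style: datasetType is captured from the enclosing scope,
    # which is hard to see when debugging the returned function.
    def makeStandardizer(datasetType):
        def standardize(item, dataId):
            return applyDefaults(item, datasetType)
        return standardize

    # Explicit style: the same state lives on the object, visible in a debugger.
    class Standardizer:
        def __init__(self, datasetType):
            self.datasetType = datasetType
        def __call__(self, item, dataId):
            return applyDefaults(item, self.datasetType)

    print(makeStandardizer("calexp")("img", {}))  # same behavior...
    print(Standardizer("calexp")("img", {}))      # ...but explicit state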

Checksums

From Paul Price: we'd like the butler to maintain a checksum of all the files it has loaded. I'm not quite sure what his use case is, or how it would play out for aggregate datasets: it's not clear whether we want checksums of files or of datasets when the two aren't 1-to-1. (Paul, feel free to edit this section however you'd like.)

RHL: I think the desire is to be able to say, "Tell me all the inputs that you used to reduce this piece of data, with their checksums". Then do the same on some P'ton machine and diff the results. I'm hoping that KT's provenance will handle this for us (including the diffs)

Audit for Unused Legacy Code

If the refactoring is done as an edit of existing code rather than by starting from scratch, there are a lot of things in the Butler and Mapper that look outdated and should probably be removed or replaced. Some things that caught my eye (though I may simply not have understood the purpose of some of them):

  • It seems like many Mapper methods (keys(), validate(), others?) aren't very useful because they make the (outdated) assumption that all data IDs have the same keys, and instead end up operating on a merged list of all keys for all data IDs.
  • The docs for CameraMapper describe some assumptions about the layout of cameras and their associated data IDs. My impression is that those assumptions have since been relaxed - is that correct?
  • What is "default level" and "default sublevel" functionality used for? Is it only valid for raw-like (i.e. camera-describing) data IDs?
  • What is the "skypolicy" in CameraMapper for? Are we using the skytiles in the registry to support any kind of spatial lookup on the raw data? If so, how do we address the fact that raw images are not required to have initial WCSs, and that processed versions of those images may go through a succession of improved WCSs?
