Last modified 6 years ago Last modified on 01/07/2013 11:47:42 PM

Review for Table-Based Persistence Framework

(Note: I've now used up all unordered two-item combinations of the words "Table", "Persistence", and "Exposure" - so I must be done submitting Winter2013 design reviews?)

This is yet another design review spun off from wiki:Winter2013/ExposurePersistenceReview, for a new semi-general-purpose persistence framework I have added that converts arbitrary objects to a sequence of afw::table::BaseCatalogs. This is already fully implemented on #2477, which also includes implementations of this persistence framework for the DoubleGaussianPsf, Wcs, and TanWcs classes. It will be used in #2470 and #2482 to save these objects and others to the end of an Exposure FITS MEF.

The interface is primarily in three header files, which I won't re-summarize here.

It's also worth reading the overview in the Doxygen page describing the framework (sorry, don't have a link to it in processed form, since it's on a branch):

The goals for this framework are:

  • Store arbitrary objects in FITS files (or at least arbitrary Psf and Wcs objects).
  • Don't be too opaque (and ideally, be self-describing).
  • Be reasonably efficient when storing many objects with a small number of types.

I've also wondered whether this could be the start of a new path towards saving arbitrary objects to SQL databases (if anyone still wants to do that), via a mapper from afw::table to SQL (something I think Serge has partially built), but that was not something I was designing towards.

Some obvious downsides of the current plan:

  • There's lot of per-class overhead in terms of boilerplate, which makes it a pain to implement persistence for class hierarchies that have a lot of small classes and composition (as I'm discovering with Kernel). I've hard a hard time finding a way to reduce this.
  • Storing individual objects or a collection of distinct objects is not very efficient, because we need at least one HDU per schema.

One requirement that I didn't appreciate in the earlier reviews on this metatopic was the need to save objects in a way that preserves pointer relationships - this is what necessitated the "archive" classes, which track which objects have been saved by mapping pointers to saved integer IDs. This is the same general approach used in most off-the-shelf persistence frameworks, including boost::serialization and pickle, and it was probably a mistake not to start with it. This helps with making many-object persistence efficient when many of those objects are actually the same (which is very likely when we have CoaddPsfs and multiple ExposureCatalogs stored in the same file), but it's even more important that reconstituted objects have the same pointer relationships they had before persistence, in case those relationships are important to the object's behavior.

An obvious question is whether we really want our own, in-house persistence framework. This seemed to be a requirement if we wanted to put things in FITS files; I looked briefly at what it would take to implement a boost::serialization-style archive class that actually wrote to FITS, and it seemed quite a bit more difficult than cooking up my own and leveraging all the FITS-writing code I've already written for afw::table. I also hope that while this new persistence framework isn't exactly simple, it's still orders of magnitude less complex than boost::serialization, and considerably less opaque either that or pickle.

One smaller question is whether it's desirable to have Persistable[Facade] add readFits and writeFits methods to all its subclasses. These are just a convenience for saving individual objects, which we usually won't do in production code (instead we'd save multiple objects in a single archive), and they do pollute interfaces of derived classes a bit. On the other hand, having to construct an archive in order to save anything might be very annoying to developers who just want to be able to dump an object to disk easily in order to inspect it. A compromise might be to put free functions that read and write individual Persistables to FITS in the afw::table::io namespace.

Related Issues

These aren't really part of the same review, but they're somewhat related and I wanted to solicit opinions from many of the same people:

  • The ticket I'll be submitting after #2477 is #2470, which adds all these things to Exposure and also removes support for reading and writing multi-file MaskedImages? and Exposures. That old file format was only being used in unit tests, but it was being used extensively there, because that was the format for almost all of the files in afwdata. I've now converted afwdata (on a ticket branch), but rolling it out could be tricky, as the new afwdata will not work with the old afw (though it could if we were willing to double the size of afwdata).
  • Much of the new persistence and Exposure code I've recently written could be simplified or would work better if Psf and Wcs were either immutable or possibly copy-on-write. Frankly, if that were already the case, it would have also made my job easier on some of these recent tickets. I'd like to open tickets now to get that done eventually, but I wanted to get some sign-off from others that it's a good approach before proceeding.

Review Meeting Minutes

Present: Jim, Russell, Perry, K-T

Doc problems:

  • Mentions Facade<base> instead of Facade<derived> in places
  • Mentions Factory instead of Facade in places

Why not use boost::serialization?

  • Would have been a lot more work now
  • Would need to be able to build FITS table schemas on the fly
  • Need to provide more information for table mapping
  • Advantage would be that object relationships and shared objects are handled (true here too, but a lot of work), and implementing persistence is much more concise.

Intrusiveness into existing classes

  • Didn't mind being intrusive
  • Insist on inheritance from Persistable
  • Should inherit from PersistableFacade<Derived>
  • PersistableFacade works in Python but only if you call the right class
    • Need to know in advance what type to read
    • Could try to build infrastructure to read generic Psf as right class
    • But is tricky to do, since has to be at Python level
    • Need to be able to import that Python module

Add readFits() and writeFits() to all Persistables?

  • Free readFits() function with Python class and filename?
    • Benefits would only be removing Facade and the *Fits() methods
  • Keep as is with instance method to write and class method to read

Exposure and Catalogs are not Persistable; Persistable is for things things that have a more complicated mapping to FITS.

Wcs can be written to FITS headers or binary tables: confusion about method names?

  • writeFits() is only for binary tables
  • header interface goes through PropertySet
  • not worried about confusion

Should we get rid of boost::serialization?

  • Remove from .h files at least
  • Would require explicit instantiation of archives
  • PropertySet/List do not have alternate persistence now
    • Would need to get rid of daf::base::Persistables in PropertySets
    • Choose another data format (JSON?)
  • Maybe keep boost::serialization only for PropertySet

Formatter idea is dead?

  • No, could persist Catalogs to database and have a query produce a Catalog
  • Difficulty is how to match up schemas
    • Better if schema is fixed
  • But existing Formatters can go away
    • Butler needs to directly call readFits() and writeFits()

Enough classes would use this that a framework is useful

  • Needed Psf and Wcs to use the same framework because CoaddPsf has both.
  • Kernels and Functions basically already done
    • Except SeparableKernel, functions used for warping.
  • Photometric ubercal results, background estimation would also be Persistable

Sidetracked: should Psfs (and "DiffimMatcher") and Kernels be more separate?

  • Kernels are just things to convolve with
  • Able to get a Kernel from the former, but they are not subclasses
  • Should they construct Kernels on demand rather than hold them, too?

How do Coord, Box, etc. get persisted?

  • Only as parts of bigger things
  • Generally, non-polymorphic classes are not Persistable

Why in C++?

  • Need to get at data members
  • Implement helper in C++ to allow rest to be in Python? Not much of a win

Index table:

  • One record for each schema for an object
  • Each HDU corresponds to a schema

Through archive interface, can retrieve objects by ID (assigned and returned when object is added to OutputArchive)

To write multiple objects, create an archive, add objects to it, then write

  • Most cases will use archive via Exposure
  • But tests can use single object persistence

Multiple versions of ancillary data (headers, masks, ExposureInfo)

  • Perhaps save Exposure and additional ExposureInfos in separate files
  • Could also have multiple versions in same file (as more HDUs)
    • Primary HDU says which HDU contains index table

Butler interface?

  • Just hide within Exposure
  • But can integrate if we want Butler to be able to access these files
  • Like catalogs; can just use FitsCatalogStorage
  • If we add readFits() to Exposure, could align interfaces

Removal of multi-file Exposures

  • If files are separate, read them individually and assemble
  • Will move headers to Primary HDU and use INHERIT
  • Will remove HDU argument from MaskedImage and Exposure
  • Requires update to afwdata that we don't want to make backwards-compatible.

What about DecoratedImage

  • Still there, didn't get in Jim's way.
  • Better: instead of DecoratedImage, use output argument to get metadata
  • Then assemble both into new Exposure
  • But still need a way to write an Exposure with only one image plane

Psf and Wcs should be immutable or copy-on-write

  • Need the Kernel separation to make this happen
  • Shifting Wcs for XY0 could be an issue
  • Should generally strive for value-like semantics
    • But Python works nicely with shared pointers
    • Python great with immutable objects