Last modified on 10/20/2013 08:55:28 PM

Bosch's Refactoring Thoughts: Persistence Framework

Despite the fact that I wrote it, I'm pretty unhappy with the state of the FITS/table-based persistence framework that has been slowly replacing the daf_persistence framework over the past year. I know there were a lot of complaints about the daf_persistence approach too, and a lot of hope that the new table-based approach would completely replace it soon. I'm not familiar enough with the daf_persistence approach to know whether it's worth trying to revive and fix it, so here I'll focus on the problems with the table-based framework and my own thoughts on how persistence would work in an ideal world. I'll leave it to someone else to determine whether the best way forward is to fix one of the two existing frameworks or to start over yet again.

This page is mostly about the API for implementing persistence for a class or class hierarchy; it's not about the public interface for using the persistence framework to save and load objects. That public interface is essentially the Butler/Mapper, which I discuss on another page. It's important to note that I don't think the API discussed here will affect the Butler much (with one notable exception: see Pointer Identity, below); this is really a lower-level API seen by a much smaller number of developers (I think I'm the only person who has implemented table persistence for any classes so far). As such, it might not be something we need to prioritize in order to bring new developers on board, but I think it's important to see it as a major task we'll need to take on at some point.

Boilerplate and Non-Tabular Data

The #1 problem with the table persistence framework is the huge amount of boilerplate involved in implementing persistence for even an extremely simple class. One has to define three simple one-line methods and a rather large write method, write a factory class to reconstruct the object, register that factory class, and usually create an auxiliary helper class that manages the persistence schema and its keys. For each line of real content, there are typically 5-10 lines of boilerplate. This is largely because the table persistence framework demands that objects be reduced to a set of normalized tables, each with a schema that must be defined in advance. This is not only unpleasant to code for a simple class with just a few data members; it's also a waste of space, as each class requires at least one FITS binary table HDU even if that table has only a couple of columns and a single row.
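To make the cost concrete, here's a schematic Python rendering of the pattern. The real framework is C++, and every name below (Point, PointFactory, FACTORY_REGISTRY, the method names) is an illustrative stand-in, not the actual afw::table::io API:

```python
# Schematic illustration of the per-class boilerplate; all names are
# hypothetical stand-ins for the C++ afw::table::io machinery.

FACTORY_REGISTRY = {}  # stands in for the singleton factory registry


class PersistenceHelper:
    """Auxiliary helper class managing the persistence schema and keys."""
    def __init__(self):
        self.x_key = "x"  # stands in for an afw::table Key
        self.y_key = "y"


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    # --- the three simple one-line methods ---
    def is_persistable(self):
        return True

    def get_persistence_name(self):
        return "Point"

    def get_python_module(self):
        return __name__

    # --- the "rather large" write method: reduce to a one-row table ---
    def write(self, catalogs):
        helper = PersistenceHelper()
        catalogs.append([{helper.x_key: self.x, helper.y_key: self.y}])


class PointFactory:
    """Factory class needed just to reconstruct the object on read."""
    def read(self, catalogs):
        helper = PersistenceHelper()
        (record,) = catalogs[0]
        return Point(record[helper.x_key], record[helper.y_key])


# --- and finally, registration at module load time ---
FACTORY_REGISTRY["Point"] = PointFactory()
```

Round-tripping a two-member class thus costs a helper class, four methods, a factory, and a registry entry - roughly the 5-10x boilerplate ratio described above.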

A solution to this problem would have to come in two steps:

  • We'd have to relax the constraint that all objects reduce to tables, and instead be able to save and load objects one value at a time in more of a "stream" or "tuple" sense. In C++, I think the stream approach makes more sense (while C++11 brings better tuples, that would still involve massive amounts of template instantiation, I think, and it would still end up streaming under the hood). It's worth noting that this is essentially what Boost.Serialization does. This doesn't have to break the idea of saving to FITS binary tables (1), as FITS tables do have a heap section that could be used for exactly this sort of data (this is what RHL did for FITS persistence in SDSS, I believe). Unfortunately, the afw::table interface to FITS does not support heap data, and adding it could be a significant amount of work.
  • We'd need to add some syntactic sugar to the persistence framework to allow individual values to be saved to the heap easily. We'd probably also want some syntactic sugar to make the current reduce-to-tables approach more concise; I don't think we want to remove it entirely, as many of our objects do reduce well to tables (or, more often, a table and a few extra values).
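As a sketch of what the relaxed, stream-style API might look like - hypothetical names throughout; nothing like this exists in afw::table::io today - a simple class could then persist itself in a couple of short methods instead of a schema, write method, and factory:

```python
import struct


class OutputStream:
    """Hypothetical stream-style writer: values are appended one at a
    time to a flat binary 'heap', in the spirit of Boost.Serialization."""
    def __init__(self):
        self.heap = bytearray()

    def write_double(self, value):
        self.heap += struct.pack("<d", value)

    def write_int(self, value):
        self.heap += struct.pack("<q", value)


class InputStream:
    """Reads values back in the same order they were written."""
    def __init__(self, heap):
        self.heap, self.pos = heap, 0

    def _read(self, fmt):
        (value,) = struct.unpack_from(fmt, self.heap, self.pos)
        self.pos += struct.calcsize(fmt)
        return value

    def read_double(self):
        return self._read("<d")

    def read_int(self):
        return self._read("<q")


class Wavelength:
    """A simple class whose persistence is now just two short methods."""
    def __init__(self, nm):
        self.nm = nm

    def write(self, stream):
        stream.write_double(self.nm)

    @classmethod
    def read(cls, stream):
        return cls(stream.read_double())
```

In C++ the same shape would be `archive << nm` on write and `archive >> nm` on read, with the bytes landing in a FITS heap section rather than a bytearray.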

Python/C++ Issues

Both of our current persistence frameworks require C++ implementations - there's no way to use them to store pure-Python objects. The butler has support for pickling, of course, which provides most of what we need, but the default pickle file formats are less than ideal in the same sense that the Boost ones were: it's difficult (in practice, impossible) to describe their data layout in a way that would allow external code (e.g. IDL) to read it (2). More importantly, our frameworks don't provide a way to save compound objects implemented in both languages: if a pure-Python object holds a C++ object as an attribute, we have to implement pickle support for the C++ object separately from its C++ persistence before we can pickle the pure-Python object. This discourages writing pure-Python classes, since implementing persistence for a composite class is less work if that class is written in C++.

One possible solution to these problems would be to implement *all* persistence in Python: even for C++ classes, we'd write the persistence code in Python, using attributes and constructors made available via Swig. To do this, we'd have to expose many more attributes (and possibly some new constructors) than are currently available, which might be undesirable when those shouldn't otherwise be public but are needed for persistence. There might also be performance issues with requiring all persistence code to be written in Python.

I believe a better alternative would be a C++ framework that provides a Python API via Swig (in addition to the C++ API) for implementing persistence purely in Python. Ideally, that Python API would mimic the pickle API - i.e., it would look for __getstate__, __setstate__, __reduce__, etc. In addition, we'd provide Swig macros that would add a __reduce__ method to each persistable C++ class. This would allow both C++ and Python classes to be persisted using the same framework, and it would also allow these classes to be pickled (which, even if we don't use it internally, is likely something external users would find useful).
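A minimal Python sketch of that idea - the save/load functions and the Angle class here are hypothetical, but the hook they rely on is the standard pickle protocol:

```python
import importlib


def save(obj):
    """Persist via the standard pickle hook __reduce__, which works for
    pure-Python classes out of the box and could be added to Swig-wrapped
    C++ classes by a macro. The module name is recorded so load() can
    import it before looking up the factory."""
    factory, args = obj.__reduce__()
    return (factory.__module__, factory.__qualname__, args)


def load(record):
    module_name, name, args = record
    module = importlib.import_module(module_name)  # ensure module is loaded
    return getattr(module, name)(*args)


class Angle:
    """Stands in for a Swig-wrapped C++ class given __reduce__ by a macro."""
    def __init__(self, radians):
        self.radians = radians

    def __reduce__(self):
        return (Angle, (self.radians,))
```

Because __reduce__ is the real pickle hook, the same class is automatically picklable by the standard library as well, with no separate pickle implementation.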

A Reminder: Necessary Features

This section is just a list of features I think a persistence framework needs to have. Most of these are already present in the afw::table::io persistence framework, and I believe they're at least mostly present in daf_persistence. But I didn't fully appreciate all of these when I first started afw::table::io, and I might have done a better job of including them if I'd thought about them from the beginning.

  • Registration and dynamic loading. All C++ persistence requires some sort of singleton registry of factories, with registration entries added at library load time (i.e. on Python module import). It is the responsibility of the persistence framework to ensure that the necessary modules are loaded before inspecting the singleton registry. When loading an afw::image::Exposure, for instance, there is no guarantee that the attached Psf's module will have been imported, especially if that Psf is provided by a meas_extensions module. This means the persistence framework must record the module name of each object as well as its class name, and be able to import that module before attempting to locate and call the object's unpersistence factory.
  • Pointer identity. Persistence must be able to track individual pointers and ensure that pointer relationships (including those of shared_ptrs) are preserved after round-tripping. In practice, this means the persistence framework must record the address of each pointer it saves, and use this information to never save the same object twice. When retrieving objects, it must keep a reference (possibly a weak reference) to each loaded object to ensure that repeated requests for the same pointer always return the same object. While afw::table::io does support pointer tracking within a single compound object (i.e. a single Exposure), we can't go beyond that without integrating the pointer tracking into the Butler itself. I think this is highly desirable, and it's the one area where I think the persistence framework does impact the Butler design (see the Butler/Mapper page for more discussion).
  • Versioning. I think complete versioning of all persistables is simply too much work to be valuable, at least for a rapidly-changing software stack like ours. That will change in the future, though, and it's clear that a persistence framework needs to at least store and check versions so it can generate sane error messages when the wrong version of the code is used to unpersist objects. Beyond that, it's not really the framework's responsibility, as individual objects need to be able to handle their own persistence schema evolution if they need to support older versions. The framework just needs to be able to tell them what version they're trying to load, and let them take it from there. There's currently no support for versioning within afw::table::io, though I think something basic would be easy to add.
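The pointer-identity requirement in particular is easy to mis-state, so here's a toy Python model of both halves of it (OutputArchive, InputArchive, and Psf are hypothetical names, not the real afw::table::io classes):

```python
import weakref


class OutputArchive:
    """Sketch of pointer tracking on write: each distinct object is
    saved exactly once and afterwards referred to by its archive id."""
    def __init__(self):
        self._ids = {}     # id(obj) -> archive id (the saved "address")
        self.records = []  # archive id -> saved state
        self._alive = []   # a real implementation must keep memoized
                           # objects alive so their ids aren't reused

    def put(self, obj):
        key = id(obj)
        if key in self._ids:            # never save the same object twice
            return self._ids[key]
        archive_id = len(self.records)
        self._ids[key] = archive_id
        self._alive.append(obj)
        self.records.append(obj.__getstate__())
        return archive_id


class InputArchive:
    """Sketch of pointer tracking on read: repeated requests for the
    same archive id return the same object, held via weak references
    so the archive itself doesn't keep everything alive."""
    def __init__(self, records, factory):
        self._records, self._factory = records, factory
        self._cache = weakref.WeakValueDictionary()

    def get(self, archive_id):
        obj = self._cache.get(archive_id)
        if obj is None:
            obj = self._factory(self._records[archive_id])
            self._cache[archive_id] = obj
        return obj


class Psf:
    """Toy persistable standing in for a shared_ptr-held C++ object."""
    def __init__(self, fwhm):
        self.fwhm = fwhm

    def __getstate__(self):
        return {"fwhm": self.fwhm}
```

Integrating this into the Butler would mean the two archives live as long as the Butler does, so identity is preserved across separate get/put calls, not just within one compound object.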

Straw-Man Proposal

I think we should consider writing a new persistence framework, once again using Boost.Serialization, but with a multi-file, afw::table-based custom archive class. We'd use Boost.Serialization's stream approach to append heap data (better than writing our own syntactic sugar for this), and add new methods for saving tabular data that would only be available in our custom archive class. There'd be no expectation that our objects could be saved using the Boost-provided archive classes, but this would also save us a lot of template instantiation: we'd only have to instantiate the custom table-based archive. To implement the heap part of the table-based archive, we could either implement it directly in the archive class or add heap support to afw::table objects (perhaps via a subclass of afw::table::BaseTable, rather than by adding it to the base class).
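In miniature, the proposed archive would carry both kinds of data side by side. A toy Python model (TableArchive and its methods are hypothetical; the real thing would be a C++ Boost.Serialization archive subclass):

```python
class TableArchive:
    """Toy model of the proposed custom archive: objects may append
    whole catalogs (the existing reduce-to-tables path) and/or stream
    individual odds and ends to a shared heap, all within one archive."""
    def __init__(self):
        self.catalogs = []  # tabular data (afw::table catalogs in C++)
        self.heap = []      # streamed values (a FITS heap section in C++)

    def append_catalog(self, rows):
        """The table path: save a list of records, returning its index."""
        self.catalogs.append(rows)
        return len(self.catalogs) - 1

    def stream(self, value):
        """The stream path, in the spirit of Boost's operator<<."""
        self.heap.append(value)
        return len(self.heap) - 1
```

An object that is "a table and a few extra values" would call append_catalog once and stream a handful of times, instead of padding out a second, nearly empty binary table.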

This custom Boost.Serialization archive class (or perhaps a wrapper around it) would be held by the Butler - instead of creating a new archive instance when persisting each object, we'd keep the same one alive for the duration of the Butler object's lifetime, and the archive would represent the entire data repository, not just a single file. I haven't looked at the details of how Boost.Serialization's pointer tracking works, so I'm not sure whether we'd be able to use it to do everything we need. It might actually be best if Boost.Serialization left pointer tracking to the archive, so that we *had* to implement it ourselves: we'd likely need a pointer-tracking system that can handle things that aren't saved via Boost.Serialization at all. I'm not sure about that, though - if the Butler's aggregate dataset features are provided via a pickle-like "transform this object into a tuple of its constituent objects" interface (I'm being purposefully vague here), and we similarly provide a pickle-like API for implementing persistence for Python objects (which would then use Boost.Serialization under the hood), we may ultimately be able to handle everything with Boost.Serialization.

That said, I'm not at all certain that Boost.Serialization archive class inheritance would give us all of the flexibility we need; I really haven't looked at that at all recently. I think it's the first option we should investigate, but I think we could also build a new serialization framework without it, and we should strongly consider doing that if Boost.Serialization doesn't give us what we need in terms of pointer tracking.

Footnotes

(1) The ability to persist to FITS tables is considered valuable by many apps developers, including myself, but perhaps it shouldn't be accepted blindly as a requirement.

(2) I think it's fair to say that RHL thinks this was sufficiently useful in SDSS that it should be a requirement for LSST's persistence. I'm not entirely convinced myself, but I think it's a worthwhile goal if it's not too difficult.