wiki:db/obsolete/persistenceDesign
Last modified on 08/20/2007 11:16:19 AM

Persistence Design

This page documents the current thinking about the design of the persistence framework for LsstData instances, including images, astronomical objects, sources, difference sources, and all of their metadata.

(Add a pointer to EA here)

Goals

  • Enable applications to easily persist instances.
  • Enable collections of instances to be persisted.
  • Enable different forms of persistence to be selected by Policy. This includes persisting to different file types, persisting to the database using direct INSERTs, and persisting to the database using delayed ingest via TSV files.
  • Separate logical descriptions of instances used by applications from physical locations.
  • Minimize overhead for Persistable instances.
  • Stages should be reusable in other Pipelines by changing Policy, not code.
    • Implies that persistence is handled by Pipeline, not Stage.
    • Implies that we cannot assume all items on a Clipboard need be persisted.

Components

Persistable
Base class for all types that can be passed to the persistence framework.
Persistence
Class with a single instance per Slice that orchestrates persistence (or retrieval) and ensures atomicity for a single Persistable instance.
Policy
Key/value pairs that configure other classes. See PolicyDesign.
Storage
Class with a subclass for each different kind of storage. Different file formats are considered different storage types. Each Storage subclass is primarily configured by a StorageLocation. It also has class-specific methods for manipulating the persistent representation of a Persistable.
StorageLocation
URL-like reference to a particular physical location within storage for the persisted version of a Persistable.
LogicalLocation
Composite location reference composed of:
  • Pipeline execution identifier (opaque string). This might be a FOV id; it might be used as a directory name or an ingest batch identifier.
  • Class identifier (type name) for the instance to be persisted.
  • Instance identifier (name) selecting a particular usage of the Persistable class.
  • Slice identifier (number) selecting a particular element of a Persistable collection.
Formatter
Class with one subclass per Persistable type that understands the Persistable's structure and content, the mapping from LogicalLocations to StorageLocations, and how to perform this mapping for each Storage subclass.
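As an illustration of the LogicalLocation components and of the kind of mapping a Formatter performs, here is a minimal sketch. The field names, the toFileStorageLocation helper, the "file://" scheme, and the directory layout are all assumptions made for this example; the design above does not specify them.

```cpp
#include <sstream>
#include <string>

// Hypothetical layout of the four LogicalLocation components listed above.
struct LogicalLocation {
    std::string runId;         // Pipeline execution identifier (opaque string)
    std::string className;     // type name of the Persistable
    std::string instanceName;  // particular usage of the class, e.g. "raw"
    int sliceId;               // element of a Persistable collection
};

// Assumed mapping for a file-based Storage: one path component per field.
// The URL scheme and directory layout are illustrative only.
std::string toFileStorageLocation(LogicalLocation const& loc,
                                  std::string const& root) {
    std::ostringstream url;
    url << "file://" << root << '/' << loc.runId << '/' << loc.className
        << '/' << loc.instanceName << '/' << loc.sliceId;
    return url.str();
}
```

A database Storage would presumably map the same four components to something quite different (for example an ingest batch and a table name), which is why the mapping lives in the Formatter rather than in the LogicalLocation itself.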

Algorithms

Persisting

  • Persistence looks up Policy for class and instance, getting appropriate Storages.
  • Persistence calls Formatter with Storages.
  • Formatter converts LogicalLocation to StorageLocation for each Storage.
  • Formatter passes appropriate information to each Storage for persistence at the StorageLocation.
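The persisting steps above can be sketched roughly as follows. This is not the framework's implementation: Storage, Formatter, and the Policy lookup are reduced to hypothetical stand-ins (StoragePolicy, the free function persist, and the "kind:run/class/instance/slice" location format are invented here) so that the control flow is visible.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

struct LogicalLocation { std::string runId, className, instanceName; int sliceId; };
struct Persistable { virtual ~Persistable() {} };

// Stand-in Storage that only records which StorageLocations were written.
struct Storage {
    std::string kind;                  // hypothetical storage type name
    std::vector<std::string> written;  // locations written so far
};

// Stand-in Formatter: converts the LogicalLocation for each Storage, then
// passes the data to that Storage (here, it just records the location).
struct Formatter {
    std::string toStorageLocation(LogicalLocation const& loc,
                                  Storage const& s) const {
        return s.kind + ":" + loc.runId + "/" + loc.className + "/" +
               loc.instanceName + "/" + std::to_string(loc.sliceId);
    }
    void write(Persistable const&, std::vector<Storage*> const& storages,
               LogicalLocation const& loc) const {
        for (Storage* s : storages) {
            s->written.push_back(toStorageLocation(loc, *s));
        }
    }
};

// Policy reduced to a map from (class, instance) to an ordered Storage list.
typedef std::map<std::pair<std::string, std::string>,
                 std::vector<Storage*> > StoragePolicy;

void persist(StoragePolicy const& policy, Formatter const& formatter,
             Persistable const& p, LogicalLocation const& loc) {
    // Look up Policy for class and instance, getting the Storages.
    StoragePolicy::const_iterator it =
        policy.find(std::make_pair(loc.className, loc.instanceName));
    if (it == policy.end()) return;  // assumed: no Policy entry means no-op
    // Call the Formatter with the Storages.
    formatter.write(p, it->second, loc);
}
```

Retrieval would follow the same shape, with the Formatter reading from each Storage instead of writing to it.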

Retrieval

  • Persistence looks up Policy for class and instance, getting appropriate Storages.
  • Persistence calls Formatter with Storages.
  • Formatter converts LogicalLocation to StorageLocation for each Storage.
  • Formatter retrieves information from each Storage at the StorageLocation.

Public Interfaces

Persistence

class Persistence {
public:
    Persistence(boost::shared_ptr<Policy> policy, Formatter* formatter);
    virtual ~Persistence(void);

    // Persist one instance at the given logical location.
    virtual void persist(Persistable const& persistable,
                         LogicalLocation const& location);
    // Retrieve the instance stored at the given logical location.
    virtual boost::shared_ptr<Persistable> retrieve(LogicalLocation const& location);
};

Persistable

class Persistable {
public:
    Persistable(void);
    virtual ~Persistable(void);
    // Persist this instance at the given logical location.
    virtual void persist(LogicalLocation const& location);
};

Formatter

class Formatter {
public:
    // Virtual destructor so subclasses can be destroyed through a Formatter*.
    virtual ~Formatter(void) {}

    virtual void write(Persistable const* persistable, Storage* storage,
                       LogicalLocation const& location,
                       bool topLevel = true) = 0;
    virtual Persistable* read(Storage* storage,
                              LogicalLocation const& location) = 0;
    virtual void update(Persistable* persistable, Storage* storage,
                        LogicalLocation const& location) = 0;

    // The following are for boost::serialization.
    virtual void delegateSerialize(boost::archive::text_oarchive& ar,
                                   unsigned int const version,
                                   Persistable* persistable) = 0;
    virtual void delegateSerialize(boost::archive::text_iarchive& ar,
                                   unsigned int const version,
                                   Persistable* persistable) = 0;
};

Pipeline Integration

Each Slice has one instance of the Persistence class, one of each Formatter subclass, and one of each Storage subclass.

Pipeline/Stage Input

The Pipeline Policy or the Stage Policy indicates the input data that needs to be retrieved. This is in the form of an "inputDataList" item. Each element of this list contains a class name and an instance name. Examples might be "Exposure/raw" or "DiaSourceCatalog/nightly" or "MopsPredictions/nightly". Note that vectors of instances must be wrapped in some type of container/catalog object. If the inputDataList is specified for the Pipeline, it is treated as if it belongs to the first Stage in the Pipeline.
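To make the shape of a data list element concrete, the sketch below splits an element into its class name and instance name. It assumes the "Class/instance" spelling used in the examples above is the actual encoding, which the design does not confirm; parseDataListItem is a hypothetical helper.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Splits a data list element such as "Exposure/raw" into (class, instance).
// The '/' separator is inferred from the examples in the text.
std::pair<std::string, std::string> parseDataListItem(std::string const& item) {
    std::string::size_type slash = item.find('/');
    // Reject elements with no separator, an empty class, or an empty instance.
    if (slash == std::string::npos || slash == 0 || slash + 1 == item.size()) {
        throw std::invalid_argument("expected \"Class/instance\": " + item);
    }
    return std::make_pair(item.substr(0, slash), item.substr(slash + 1));
}
```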

Note that the previous paragraph conflicts directly with CreateStageImplementation, which suggests that the only input data available to a Pipeline -- or even a Stage, which must be a mistake -- is a DataProperty. The instance names used here also conflict somewhat with PipelineFramework, since they are intended to describe the type and state of data, not the role of data. Instead of "primaryImage", which could be raw or calibrated, we explicitly say we want a "raw" image. Otherwise, we would be persisting or retrieving two different instances to or from the same location.

Each Slice of the Pipeline uses the retrieve() method of the (singleton) Persistence instance to create a new Persistable instance for each element of the inputDataList, using the Pipeline execution id and the slice number as part of the LogicalLocation. The retrieve() method looks up the class name and instance name in the Persistence Policy and determines the correct list of Storages to use. It calls the Formatter subclass with this list and the LogicalLocation. The Formatter translates the LogicalLocation into the appropriate StorageLocation for each Storage, performs the necessary retrieval operations, and fills in the empty instance. This instance is placed on the Clipboard under a label composed of the class name and instance name.

Note that use of a Storage is expected to be single threaded within a Slice, although each Slice will have its own copy of a Storage.

After all instances on the inputDataList have been retrieved and placed on the Clipboard, the Stage is called.
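The input side described above might look like the following for a single Slice. This is a sketch under several assumptions: the Clipboard is modeled as a plain map, the label is spelled "Class/instance", std::shared_ptr stands in for boost::shared_ptr so the example is self-contained, and the retrieve argument is a stand-in for Persistence::retrieve(). The names loadInputs, Clipboard, and DataListItem are hypothetical.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Persistable { virtual ~Persistable() {} };
struct LogicalLocation { std::string runId, className, instanceName; int sliceId; };

typedef std::map<std::string, std::shared_ptr<Persistable> > Clipboard;
typedef std::pair<std::string, std::string> DataListItem;  // (class, instance)

// Fills the Clipboard for one Slice from the inputDataList.
// `retrieve` is any callable taking a LogicalLocation and returning a
// shared_ptr<Persistable>; it stands in for Persistence::retrieve().
template <typename RetrieveFn>
void loadInputs(std::vector<DataListItem> const& inputDataList,
                std::string const& runId, int sliceId,
                RetrieveFn retrieve, Clipboard& clipboard) {
    for (DataListItem const& item : inputDataList) {
        // The execution id and slice number complete the LogicalLocation.
        LogicalLocation loc{runId, item.first, item.second, sliceId};
        // Label composed of the class name and instance name (format assumed).
        clipboard[item.first + "/" + item.second] = retrieve(loc);
    }
}
```

The caller would invoke the Stage only after loadInputs returns, so that every listed instance is already on the Clipboard.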

Pipeline/Stage Output

Any Stage of the Pipeline can place instances on the Clipboard under a label composed of the class name and instance name.

The Pipeline Policy or the Stage Policy indicates the output data that needs to be persisted. This is in the form of an "outputDataList" item, with each element of the list containing a class name and an instance name. If the outputDataList is specified for the Pipeline, it is treated as if it belongs to the last Stage in the Pipeline.

Each Slice of the Pipeline, after each Stage, looks for an outputDataList. If it is present, the Slice finds each instance described by the list and calls Persistence::persist() on each one with a LogicalLocation composed of the Pipeline execution id, the class name, the instance name, and the slice number. This method looks up the class name and instance name in the Persistence Policy and determines the correct list of Storages to use. It calls the Formatter subclass with this list and the LogicalLocation. The Formatter translates the LogicalLocation into the appropriate StorageLocation for each Storage and performs the necessary persistence operations.
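A rough sketch of the output side for one Slice follows. It assumes the Clipboard is a plain map keyed by "Class/instance" labels, uses std::shared_ptr in place of boost::shared_ptr to stay self-contained, and takes a persist callable as a stand-in for Persistence::persist(). Skipping a listed item that is not on the Clipboard is a choice made for this example; the design does not say what should happen in that case. The names persistOutputs, Clipboard, and DataListItem are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Persistable { virtual ~Persistable() {} };
struct LogicalLocation { std::string runId, className, instanceName; int sliceId; };

typedef std::map<std::string, std::shared_ptr<Persistable> > Clipboard;
typedef std::pair<std::string, std::string> DataListItem;  // (class, instance)

// Persists only the Clipboard items named in the outputDataList and
// returns how many were persisted. Other Clipboard items are left alone.
template <typename PersistFn>
std::size_t persistOutputs(std::vector<DataListItem> const& outputDataList,
                           Clipboard const& clipboard,
                           std::string const& runId, int sliceId,
                           PersistFn persist) {
    std::size_t persisted = 0;
    for (DataListItem const& item : outputDataList) {
        Clipboard::const_iterator it =
            clipboard.find(item.first + "/" + item.second);
        // Assumed behavior: listed items that are absent or null are skipped.
        if (it == clipboard.end() || !it->second) continue;
        LogicalLocation loc{runId, item.first, item.second, sliceId};
        persist(*it->second, loc);
        ++persisted;
    }
    return persisted;
}
```

Because the loop is driven by the outputDataList rather than by the Clipboard contents, scratch items a Stage leaves on the Clipboard are never persisted unless Policy names them.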

Alternatives

If classes can identify themselves, or the class name is otherwise known on the Clipboard, the outputDataList may need to contain only instance names (i.e. Clipboard labels). The inputDataList still needs to contain class names so that the appropriate data type can be instantiated.
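The reason the inputDataList needs class names can be seen in a small factory sketch: on retrieval there is no existing object to ask for its type, so something must map the name to a constructor. The registry below is hypothetical, as are the empty Exposure and DiaSourceCatalog stand-in types; std::shared_ptr is used in place of boost::shared_ptr to keep the example self-contained.

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

struct Persistable { virtual ~Persistable() {} };
struct Exposure : Persistable {};          // stand-in types, bodies omitted
struct DiaSourceCatalog : Persistable {};

typedef std::shared_ptr<Persistable> (*Creator)();

// Creates an empty instance of T for a Formatter to fill in.
template <typename T>
std::shared_ptr<Persistable> create() { return std::make_shared<T>(); }

// Hypothetical registry mapping class names to creator functions.
std::map<std::string, Creator> const& registry() {
    static std::map<std::string, Creator> const r = {
        {"Exposure", &create<Exposure>},
        {"DiaSourceCatalog", &create<DiaSourceCatalog>},
    };
    return r;
}

// Instantiates the type named in an inputDataList element.
std::shared_ptr<Persistable> instantiate(std::string const& className) {
    std::map<std::string, Creator>::const_iterator it =
        registry().find(className);
    if (it == registry().end()) {
        throw std::out_of_range("unknown class: " + className);
    }
    return it->second();
}
```

On output no such lookup is needed, since the instance already exists on the Clipboard and could report its own class name.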

Policies

Persistence

  • For each class/instance pair, an ordered list of Storages to use.

Pipeline/Stage

  • inputDataList of data to retrieve with class name and instance name.
  • outputDataList of data to persist with class name and instance name.

Formatter

  • The Formatter subclasses may be configured by Policies if desired, but no generic support is provided.