Persistence Redesign
The persistence framework used in DC2 and DC3a needs enhancement for DC3b.
Note that persistence and retrieval need not be fully symmetric; some requirements pertain to retrieval that do not apply to persistence and vice versa. In particular, while we generally want to persist a complete object at some location, we may want to do two kinds of retrieval:
- Retrieve a complete object given the location where it was persisted.
- Retrieve an object (or collection of objects) described by some information. This object or collection may never have been persisted as a single unit -- it may be a portion of a larger complete object or it may be composed of multiple separately-persisted complete objects.
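The two kinds of retrieval can be contrasted with a small sketch. Everything here is hypothetical and for illustration only: `ExampleStore`, its method names, the in-memory dictionary, and the `visit`/`ccd` metadata keys are assumptions, not part of the framework. It shows only the case where a described collection is assembled from several separately persisted objects, not the case of retrieving a portion of a larger object.

```python
class ExampleStore(object):
    """Hypothetical in-memory store mapping location -> (metadata, object)."""

    def __init__(self):
        self._items = {}

    def persist(self, location, obj, **metadata):
        # A complete object is always persisted at one location.
        self._items[location] = (metadata, obj)

    def retrieveAt(self, location):
        """Kind 1: the complete object persisted at a known location."""
        return self._items[location][1]

    def retrieveMatching(self, **criteria):
        """Kind 2: every object whose metadata matches the description."""
        return [obj for metadata, obj in self._items.values()
                if all(metadata.get(k) == v for k, v in criteria.items())]


store = ExampleStore()
store.persist("visit1/ccd0", "exposure-1-0", visit=1, ccd=0)
store.persist("visit1/ccd1", "exposure-1-1", visit=1, ccd=1)
store.persist("visit2/ccd0", "exposure-2-0", visit=2, ccd=0)

single = store.retrieveAt("visit1/ccd1")      # by location
visitOne = store.retrieveMatching(visit=1)    # by description
```

The collection returned for `visit=1` was never persisted as a single unit; it is composed at retrieval time.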
Summary of Critical Tasks
(All time estimates cover design and coding only. Add 50% overhead for testing, reviews, etc.)
- Upgrade OutputStage to register outputs. (2 wks)
- Implement simple, non-Policy retrieval, including use of pipeline output registry. (3 wks)
- Add a layer to map science identifiers to collections of data.
Implementation Plan
- Define a repository for registration of outputs. This can be a database or a manifest file. A tool will be provided to extract a manifest file from a database or import a manifest file into a database. Each entry in the repository will contain:
  - Provenance identifier. This will typically be a run identifier but could be something else in different environments such as debugging, testing, command-line usage, etc. It distinguishes one instantiation of a pipeline from another.
  - Pipeline name. This identifies the pipeline being executed.
  - Event identifier. This identifies the particular pipeline triggering event.
  - Item name. This identifies the clipboard item.
  - Python type.
  - C++ type.
  - Storage type.
  - Path for logical location.
  - Additional data, if any, used to specify location.
  - Formatter policy, if any.
- Create a new Python class, IoManager, to manage registrations in the repository while saving or loading complete objects. This class uses SimpleIoManager internally.
    class IoManager(object):
        def __init__(self, repository, collectionId, policy)
        def save(self, item, itemId, storage, path, additionalData=None, policy=None)
        def load(self, itemId)
- Redefine OutputStage in terms of IoManager.
- Redefine InputStage in terms of IoManager and SimpleIoManager.
- Ensure that there is an Orca-substitutable relative logical location as well as an "absolute logical" location in the repository that can be relocated via storage policy or repository mapper.
- Support persisting and retrieving STL vectors of persistable items. Throw an exception if something other than a vector is supplied. No attempt will be made to preserve pointer aliases.
- Create a series of example uses.
  - Testing context.
  - Debugging pipelines and analyzing pipeline results.
  - Production execution environment.
  - Simple pipeline execution environment. (3rd party users)
  - Input cases:
    - CFHT data
    - Outputs from other pipelines
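The repository entry fields and the IoManager interface in the plan above can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned implementation: `RegistryEntry`, `ManifestRepository`, the one-JSON-object-per-line manifest format, and the `"PickleStorage"` storage name are hypothetical, `pickle` stands in for SimpleIoManager and the real storage types, and `collectionId` is assumed to be a (provenance identifier, pipeline name, event identifier) tuple with `itemId` as the item name.

```python
import json
import os
import pickle
import tempfile
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RegistryEntry:
    """One repository entry; fields follow the list in the plan."""
    provenanceId: str
    pipelineName: str
    eventId: str
    itemName: str
    pythonType: str
    cppType: Optional[str]
    storageType: str
    path: str
    additionalData: Optional[dict] = None
    formatterPolicy: Optional[dict] = None


class ManifestRepository(object):
    """Hypothetical manifest-file repository: one JSON object per line."""

    def __init__(self, manifestPath):
        self.manifestPath = manifestPath

    def register(self, entry):
        with open(self.manifestPath, "a") as f:
            f.write(json.dumps(asdict(entry)) + "\n")

    def find(self, provenanceId, pipelineName, eventId, itemName):
        key = (provenanceId, pipelineName, eventId, itemName)
        found = None
        with open(self.manifestPath) as f:
            for line in f:
                entry = RegistryEntry(**json.loads(line))
                if (entry.provenanceId, entry.pipelineName,
                        entry.eventId, entry.itemName) == key:
                    found = entry  # the latest registration wins
        if found is None:
            raise KeyError(key)
        return found


class IoManager(object):
    def __init__(self, repository, collectionId, policy):
        self.repository = repository
        # Assumption: (provenanceId, pipelineName, eventId)
        self.collectionId = collectionId
        self.policy = policy

    def save(self, item, itemId, storage, path, additionalData=None, policy=None):
        if storage != "PickleStorage":
            raise ValueError("this sketch only supports PickleStorage")
        with open(path, "wb") as f:
            pickle.dump(item, f)  # stand-in for SimpleIoManager
        provenanceId, pipelineName, eventId = self.collectionId
        self.repository.register(RegistryEntry(
            provenanceId, pipelineName, eventId, itemId,
            pythonType=type(item).__name__, cppType=None,
            storageType=storage, path=path,
            additionalData=additionalData, formatterPolicy=policy))

    def load(self, itemId):
        # The registry, not the caller, supplies the storage location.
        entry = self.repository.find(*self.collectionId, itemId)
        with open(entry.path, "rb") as f:
            return pickle.load(f)


workDir = tempfile.mkdtemp()
repo = ManifestRepository(os.path.join(workDir, "manifest.jsonl"))
io = IoManager(repo, ("run42", "examplePipeline", "event7"), policy=None)
io.save({"flux": 1.5}, "exampleItem", "PickleStorage",
        os.path.join(workDir, "exampleItem.pickle"))
restored = io.load("exampleItem")
```

Because `load` takes only an item identifier, a caller that knows the collection can retrieve an output without knowing where or how it was stored.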
Application Class Independence
Goal: Allow application classes to be independent of LSST persistence.
Tasks:
- Remove or minimize need to declare friend classes. (2 wks)
- Remove intrusiveness of Boost persistence. (4 wks)
- Allow application classes to have legacy file-based persistence/retrieval (e.g. writeFits(), filename constructors), but not for direct use in pipelines. (0 wks)
Python-Based LSST Persistence
Goal: Mix in LSST persistence/retrieval at the Python level.
There are three persistence environments:
- C++/debugging. This will be handled by the legacy file-based persistence.
- Non-pipeline Python. This includes simple data processors, tests, bug reproduction scripts, and manual usage. A simple but flexible, non-Policy level of LSST persistence is required.
- Pipeline Python. This will continue to be handled by an OutputStage. It will still be Policy/Orca-configurable including changes to the storage destination and type. All pipeline outputs will be registered for later retrieval, either via a database, a flat file, or a standardized filesystem layout. (This may be slightly tricky for sets of objects, such as Sources, that are merged into a single database table.)
The same three environments exist for retrieval, but there are two sources of retrieved data (non-pipeline data and pipeline outputs):
- C++/debugging. This will be handled by the legacy file-based retrieval/construction.
- Non-pipeline Python.
  - For retrieval of non-pipeline data (such as CFHT or Sim files or existing databases), a simple but flexible, non-Policy means of retrieving these items is required.
  - For retrieval of pipeline outputs, the registry will be used to locate and retrieve the data.
- Pipeline Python. This will continue to be handled by an InputStage. It will still be Policy/Orca-configurable including changes to the storage source and type.
Two possibilities come to mind for introducing persistence/retrieval to the underlying application classes:
- A factory class with self-registered formatters per class and storage type.
- Injection of methods into the application Python class.
In either case, the application class will have to provide sufficient levels of access via accessors, constructors, and descriptive metadata for persistence/retrieval to function, but this need not be LSST-specific. This level of access may need to be the equivalent of a C++ "friend" declaration.
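The two options can be compared in a short sketch. All names here are hypothetical: `ExampleSource` is a stand-in application class, and `registerFormatter`, `getFormatter`, `injectPersistence`, and the `"JsonStorage"` storage type are assumptions made for illustration, not existing framework APIs. The application class exposes only ordinary accessors and a constructor, as the paragraph above requires.

```python
import json

_formatters = {}  # (application class, storage type) -> formatter class


def registerFormatter(cls, storageType):
    """Class decorator: self-register a formatter for (cls, storageType)."""
    def decorator(formatterClass):
        _formatters[(cls, storageType)] = formatterClass
        return formatterClass
    return decorator


def getFormatter(cls, storageType):
    """Factory lookup: return a new formatter instance for the pair."""
    return _formatters[(cls, storageType)]()


class ExampleSource(object):
    """Stand-in application class; it knows nothing about persistence."""

    def __init__(self, sourceId, flux):
        self._sourceId = sourceId
        self._flux = flux

    def getSourceId(self):
        return self._sourceId

    def getFlux(self):
        return self._flux


@registerFormatter(ExampleSource, "JsonStorage")
class ExampleSourceJsonFormatter(object):
    def write(self, obj):
        return json.dumps({"sourceId": obj.getSourceId(), "flux": obj.getFlux()})

    def read(self, text):
        fields = json.loads(text)
        return ExampleSource(fields["sourceId"], fields["flux"])


# Option 1: callers go through the factory.
text = getFormatter(ExampleSource, "JsonStorage").write(ExampleSource(1, 2.5))


# Option 2: inject persist/retrieve methods into the Python class.
def injectPersistence(cls, storageType):
    def persist(self):
        return getFormatter(cls, storageType).write(self)

    def retrieve(text):
        return getFormatter(cls, storageType).read(text)

    cls.persist = persist
    cls.retrieve = staticmethod(retrieve)
    return cls


injectPersistence(ExampleSource, "JsonStorage")
copy = ExampleSource.retrieve(ExampleSource(3, 4.0).persist())
```

In this sketch the injected methods simply delegate to the factory, so the two options need not be mutually exclusive; whether that holds for the real design is an open question for the investigation task below.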
Tasks:
- Investigate the most efficient and usable means of implementing Python persistence. (4 wks)
- Define required application class interface. (1 wk)
- Upgrade OutputStage to register outputs. (2 wks)
- Implement simple, non-Policy persistence. (2 wks)
- Implement simple, non-Policy retrieval, including use of pipeline output registry. (3 wks)
- Upgrade InputStage as necessary. (1 wk)
Generic ORM
Goal: Enhance the object-relational mapping capabilities of the framework for persisting to and retrieving from a database.
In DC2/3a, object attributes could be mapped to one table. In DC3b, mapping of object attributes to columns in one or more tables will be necessary. This mapping should be specified through a means simpler than the current C++ code. The ORM will have to perform automatic persistence and retrieval of entire collections of objects given code to persist only one object. It will be kept as simple as possible and as high-performance as possible; in particular, single-object queries and in-place updates will be avoided.
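As a rough illustration of these requirements, the sketch below maps the attributes of one class to columns in two tables and persists and retrieves a whole collection with batched statements. It is only a sketch under stated assumptions: the `MAPPING` dictionary format, the table and column names, `ExampleSource`, and the use of Python's standard `sqlite3` module are all hypothetical choices for illustration, and the retrieval join is written by hand rather than generated from the mapping.

```python
import sqlite3

# Hypothetical declarative mapping: table -> {column: attribute name}.
MAPPING = {
    "SourcePosition": {"sourceId": "sourceId", "ra": "ra", "decl": "decl"},
    "SourceFlux": {"sourceId": "sourceId", "psfFlux": "psfFlux"},
}


class ExampleSource(object):
    """Stand-in application class with plain attributes."""

    def __init__(self, sourceId, ra, decl, psfFlux):
        self.sourceId = sourceId
        self.ra = ra
        self.decl = decl
        self.psfFlux = psfFlux


def createTables(conn):
    for table, columns in MAPPING.items():
        conn.execute("CREATE TABLE %s (%s)" % (table, ", ".join(columns)))


def persistCollection(conn, objects):
    """Write a whole collection with one batched INSERT per table."""
    for table, columns in MAPPING.items():
        sql = "INSERT INTO %s (%s) VALUES (%s)" % (
            table, ", ".join(columns), ", ".join("?" for _ in columns))
        rows = [tuple(getattr(obj, attr) for attr in columns.values())
                for obj in objects]
        conn.executemany(sql, rows)
    conn.commit()


def retrieveCollection(conn):
    """Read the collection back with a single join, not per-object queries."""
    cursor = conn.execute(
        "SELECT p.sourceId, p.ra, p.decl, f.psfFlux "
        "FROM SourcePosition p JOIN SourceFlux f ON p.sourceId = f.sourceId "
        "ORDER BY p.sourceId")
    return [ExampleSource(*row) for row in cursor]


conn = sqlite3.connect(":memory:")
createTables(conn)
persistCollection(conn, [ExampleSource(1, 10.0, -5.0, 100.0),
                         ExampleSource(2, 11.0, -6.0, 200.0)])
sources = retrieveCollection(conn)
```

The per-object knowledge lives only in the mapping; the collection functions never issue a query for a single object and never update rows in place.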
Tasks:
- Implement mapping of attributes to columns in one or more tables. (4 wks)
- Implement collection persistence and retrieval. (4 wks)
Attachments
- EA logical model - persistence - cluttered.jpg (119.0 KB, added by robyn): Unmodified Reverse Engineered Logical Diagram
- EA logical model - persistence decluttered.jpg (136.5 KB, added by robyn): Decluttered Reverse Engineered Logical Diagram (minus ptrs, typedefs, structs, enums, iters)
- Persist Data from Pipeline.jpg (15.4 KB, added by robyn): Usecase model
- Persist Persistable object.jpg (33.6 KB, added by robyn): Usecase model
- Retrieve Persistable object.jpg (32.8 KB, added by robyn): Usecase model
- Execute persistence.jpg (29.5 KB, added by robyn): Usecase model
