
Provenance

In LSST, we define Provenance as an execution trace: in other words, information about how we came up with the data. Provenance will be used primarily:

  • to QA the data, for example to detect which parts of the data were affected by a faulty algorithm or a bad node
  • as a recipe for regenerating intermediate results.

Provenance is part of the metadata, which is described separately; see Metadata.

The use cases that drive the design of provenance can be found here.

Specifically, Provenance is responsible for capturing configuration of:

  • software (which pipelines were run on given data, what processing stages were part of each pipeline, what versions of algorithms were used, how processing was parallelized, and on which computing node each piece was executed, etc.), and
  • hardware (the configuration of the focal plane, each raft, CCD, amplifier, filter, etc.).

This is easier said than done! Given the size of the data, provenance information cannot be captured per object; instead, objects with the same provenance must be grouped. That in turn is non-trivial because of the way objects will be built and updated:

  • Any given object will be processed by several pipelines. Each pipeline consists of many stages. Each stage may involve executing multiple algorithms and may be run on a different processing node.
  • Each pipeline may run at a different time, so in practice different parts of any given object may be updated a few months apart.
  • It is anticipated that we will need to fix bad data (as long as it is unreleased) by reprocessing it (e.g. using a fixed algorithm), so in practice different parts of the same object might have many different processing histories.
  • Finally, there is a large number of dependencies; here is just a small example (the tip of the iceberg, really): object data depends on the corresponding difference sources, which depend on the corresponding difference image, which depends on the template image used, which depends on the objects used to generate that template image, and so on...

Also, provenance should be flexible enough to allow new "things" to be added easily.

For the above reasons, simple solutions like timestamping every object will not work.

In practice, we are planning to implement Provenance based on time ranges, with 1-second granularity. Every configurable piece of information will be tracked through two tables: one that keeps definitions and another that keeps configurations. For example, each filter (u, g, r, i, z, y) will be represented as one row in the prv_Filter table (there will be exactly 6 definitions), and each configuration of each filter will be represented as one row in the prv_cnf_Filter table (initially there will be 6 configurations, one per filter, but if we later break one filter and introduce a new one, a new configuration will need to be inserted). Each configuration will have a validity time assigned (the time period during which it was valid).
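To make the definition/configuration split concrete, here is a minimal SQL sketch of the filter pair described above; the column names, types, and the configuration payload are illustrative assumptions, not the actual LSST schema:

    -- Definition table: one row per filter; these rows essentially never change.
    CREATE TABLE prv_Filter (
        filterId   SMALLINT NOT NULL PRIMARY KEY,
        filterName CHAR(1)  NOT NULL      -- 'u', 'g', 'r', 'i', 'z' or 'y'
    );

    -- Configuration table: one row per configuration of each filter,
    -- with the validity period attached (1-second granularity).
    CREATE TABLE prv_cnf_Filter (
        cnfFilterId   INTEGER   NOT NULL PRIMARY KEY,
        filterId      SMALLINT  NOT NULL REFERENCES prv_Filter(filterId),
        validityBegin TIMESTAMP NOT NULL,  -- when this configuration became valid
        validityEnd   TIMESTAMP NOT NULL   -- when it stopped being valid
    );

Replacing a broken filter then amounts to closing the validity period of its current prv_cnf_Filter row and inserting a new one; the 6 prv_Filter definition rows stay fixed.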

The heart of Provenance is the prv_ProcHistory table. It has two responsibilities:

  • assign a unique processing history id (procHistoryId) each time anything in the LSST configuration changes. An integer is sufficient even at 1-second granularity.
  • bind each unique procHistoryId to the set of "stages" for which that procHistoryId was valid. (A "stage" is a part of a pipeline. Stages are the smallest chunks that will be executed atomically; that is, if part of a stage fails, the whole stage will be rolled back and re-run.)

Notice that procHistoryId is not associated with a time range, but with a set of stages, each having its own configuration and its own time range.
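A minimal SQL sketch of this structure is shown below; the prv_ProcHistory_to_Stage table name and all columns other than procHistoryId are assumptions made for illustration only:

    -- One row per procHistoryId; a new row is inserted whenever anything
    -- in the LSST configuration changes.
    CREATE TABLE prv_ProcHistory (
        procHistoryId BIGINT NOT NULL PRIMARY KEY
    );

    -- Hypothetical binding table: the set of stages for which a given
    -- procHistoryId was valid.
    CREATE TABLE prv_ProcHistory_to_Stage (
        procHistoryId BIGINT  NOT NULL REFERENCES prv_ProcHistory(procHistoryId),
        stageId       INTEGER NOT NULL,
        PRIMARY KEY (procHistoryId, stageId)
    );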

Provenance will also track which columns of each table are updated by each stage. (We require that each column be updated by a single stage only; for that reason some columns, e.g. flags, had to be split into multiple columns.) A sketch of one possible mapping table follows.
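One possible (hypothetical) way to record that mapping is a small table with a uniqueness constraint enforcing the one-updating-stage-per-column rule; the table and column names are assumptions:

    -- Hypothetical mapping of table columns to the single stage that updates them.
    CREATE TABLE prv_StageToUpdatableColumn (
        stageId    INTEGER     NOT NULL,
        tableName  VARCHAR(64) NOT NULL,
        columnName VARCHAR(64) NOT NULL,
        UNIQUE (tableName, columnName)   -- each column may be updated by one stage only
    );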

Every Object, Source, Exposure, etc. that needs to know the exact LSST configuration in effect while it was processed will need to keep a procHistoryId. Note that a single Object/Source/Exposure/etc. might be generated or updated by different pipelines, each of which may run at a different time, on different hardware, possibly even at a different site. Having a procHistoryId will allow it to find which pipelines and stages were executed for it, at what time, and on which processing nodes. It will then be possible to correlate the time periods when each stage was executed with the configurations that were valid at those times.
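In query form, that correlation could look roughly like the sketch below: starting from an Object's procHistoryId, find the stages recorded for it, then join against a configuration table on overlapping validity periods. The prv_cnf_Stage table and its stageBegin/stageEnd columns are assumptions; only procHistoryId comes from the design above.

    -- Hypothetical query: which filter configurations were valid while the
    -- stages bound to a given object's procHistoryId were executing?
    SELECT s.stageId,
           c.cnfFilterId
    FROM   prv_ProcHistory_to_Stage AS ps
    JOIN   prv_cnf_Stage            AS s ON s.stageId = ps.stageId
    JOIN   prv_cnf_Filter           AS c ON c.validityBegin <= s.stageEnd
                                        AND c.validityEnd   >= s.stageBegin
    WHERE  ps.procHistoryId = 12345;    -- the procHistoryId stored with the object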

This approach has two important pros:

  • Flexibility
    • It decouples configurable elements from Objects, Sources, Exposures, etc.: they are coupled only very loosely, through time ranges. In particular, new configurable items can be added to or removed from the system at any time with no schema evolution and no updates to existing data.
  • Space-efficiency
    • It is clear that things will be changing relatively frequently, so minimizing the storage required to track these changes is essential. In our case, a new configuration will only require storage for that configuration, plus a few bytes for a new procHistoryId and its procHistoryId --> stageId associations.
    • Each Object/Source/Exposure only needs to store one procHistoryId to get access to hundreds of different configurations.

...and one con:

  • A group of Objects/Sources/Exposures sharing the same procHistoryId needs to be processed together by every stage that touches that group. If needed, we should be able to come up with an algorithm to split one procHistoryId into multiple procHistoryIds to avoid having to process the whole group together. As of now, this requirement imposed by Provenance is not regarded as a problem.

We expect we will need to tune the current approach to provide efficient query access to Provenance. In particular, the implementation relies on range queries, which may need special optimizations or special indexes.
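As a first, hedged step, a composite index over the validity columns of each configuration table (shown here against the hypothetical prv_cnf_Filter sketch above) is one plausible optimization; interval-specific structures or time-based partitioning may still be needed:

    -- Simple composite index on the validity period of a configuration table.
    CREATE INDEX idx_cnf_Filter_validity
        ON prv_cnf_Filter (validityBegin, validityEnd);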

Those interested in Provenance should study the Database Schema Description, in particular the Provenance section.

Here is an example that should further clarify the Provenance architecture.

Thoughts on provenance from data loading

Data coming from DRP will arrive with provenance, represented by a procHistoryId. The Data Loading System will need to add its own provenance, capturing how the data is loaded (on which machines, etc.). That will require creating a new procHistoryId and "binding" it to the procHistoryId received from DRP.
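A sketch of one way such binding could be recorded is below; the prv_ProcHistory_to_ProcHistory link table and the literal ids are purely illustrative:

    -- Hypothetical link table tying the loading-time procHistoryId to the
    -- procHistoryId that arrived with the DRP data.
    CREATE TABLE prv_ProcHistory_to_ProcHistory (
        childProcHistoryId  BIGINT NOT NULL,  -- assigned by the Data Loading System
        parentProcHistoryId BIGINT NOT NULL   -- received from DRP
    );

    INSERT INTO prv_ProcHistory (procHistoryId) VALUES (20001);        -- new loading-step id
    INSERT INTO prv_ProcHistory_to_ProcHistory VALUES (20001, 10001);  -- bind it to the DRP id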


An observation made during a discussion related to the Association Pipeline and nightly processing (June 2007): note that we will be frequently updating values in the Object table, and it will be very tricky to recompute things on demand based on Provenance. We will essentially need to carefully replay the entire chain of updates, because Provenance will not keep data values before and after each update. To make things worse, we will be deleting old DIASources at the Base Camp, so if we do not capture what gets deleted, we will not be able to reproduce anything. Luckily, this is a post-DC2 issue.


We are planning to start by persisting the following provenance information for DC2 runs (a possible table layout is sketched after the list):

  • Software versions (from eups list -s)
  • OS information (from uname -a)
  • Policy data
  • Run data (runId, exposure data, any other input event data)
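One possible (hypothetical) layout for the run-level items is a single table keyed by runId; the table and column names are assumptions:

    -- Hypothetical per-run provenance record for DC2.
    CREATE TABLE prv_Run (
        runId            VARCHAR(64)  NOT NULL PRIMARY KEY,
        softwareVersions TEXT,          -- captured output of "eups list -s"
        osInfo           VARCHAR(255),  -- captured output of "uname -a"
        inputEventData   TEXT           -- exposure data and other input event data
    );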

Persisting policy data is a little difficult.

  • The policy keys form a tree. This tree could be flattened to a string key using a separator; this is compatible with C++ and Python retrieval methods.
  • The policy data items are of varying types, including integers, reals, strings, booleans, sub-policies, and arrays of the above. These could be stored as strings using the syntax contained in the policy file, except for sub-policies, which would be flattened as mentioned. Doing so would potentially limit queries: detectionParameter = '1.35' is not the same as detectionParameter = 1.35.
  • The policy data should not be duplicated in the database if it hasn't changed. We'll calculate a hash of the policy file and will not store its contents if the hash is unchanged (see the table sketch after this list).
  • Policies will be persisted as they are loaded so that we don't have to know ahead of time which keys contain policy filenames. If the new policy inclusion mechanism were used throughout, we'd only need to worry about the top-level policy and could perhaps handle this differently.
  • Persisting policies should be done only in the master slice of each pipeline, but every slice instantiates policies.
  • There's at least one case (associate/pipeline.py:_massagePolicy()) where a policy is modified. Any such code should be fixed.
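A hedged sketch of how flattened policy keys and file hashes might be stored follows; the table names, the hash choice, and the separator convention shown are all assumptions:

    -- Hypothetical storage: one row per policy file (deduplicated by content
    -- hash) and one row per flattened key.
    CREATE TABLE prv_PolicyFile (
        policyFileId INTEGER      NOT NULL PRIMARY KEY,
        pathName     VARCHAR(255) NOT NULL,
        hashValue    CHAR(32)     NOT NULL UNIQUE  -- e.g. MD5 of the file contents
    );

    CREATE TABLE prv_PolicyKey (
        policyFileId INTEGER      NOT NULL REFERENCES prv_PolicyFile(policyFileId),
        keyName      VARCHAR(255) NOT NULL,  -- tree path flattened with a separator, e.g. 'detection.threshold'
        keyType      VARCHAR(16)  NOT NULL,  -- 'int', 'double', 'string', 'bool', 'policy', ...
        value        TEXT                    -- stored as a string, using the policy-file syntax
    );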

Related diagrams of the SQL tables and their relations are below:

provenance diagram

Hardware:

provenance hardware diagram

Software:

provenance software diagram
