Last modified on 01/10/2014 03:01:30 PM

Butler Redesign for Winter 2014

See DataButler for original design and usage.


Requirements

All those in the original DataButler design, as well as these:

  • Reread registry as little as possible
  • Retrieve calibration images without visit info
  • Write composite datasets
    • Overwrite pieces of composites?
  • Read composite datasets
    • Read pieces of composites
  • Enable passing options to butler persistence/retrieval
  • Provide write-once-check-equal operation for provenance data
  • Specify new per-repository dataset types on the fly
  • Better checks for existence/non-existence of datasets and associated errors
  • Read from (but not write to) databases
  • Be able to use multiple butlers with different repositories but same dataset types
  • Support ad hoc glob-based repositories (obs_file)
  • Enable overrides for dataset retrieval
  • Enable overrides for registry lookups
  • Make data ids and dataRefs hashable, even though Python dicts are not
  • Make data ids and dataRefs comparable
  • Improve numeric id generation and make it reversible
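The hashability and comparability requirements above could be met by freezing a data id's key/value pairs; a minimal sketch, with the class name and details invented for illustration:

```python
# Hypothetical sketch: a data id made hashable and comparable by freezing
# its key/value pairs. A plain dict is unhashable; a sorted tuple of its
# items is not, so it can serve as the hash and comparison basis.

class DataId:
    """Immutable data id usable as a dict key or set member."""

    def __init__(self, **keys):
        self._items = tuple(sorted(keys.items()))

    def __hash__(self):
        return hash(self._items)

    def __eq__(self, other):
        return isinstance(other, DataId) and self._items == other._items

    def __ne__(self, other):
        return not self.__eq__(other)
```

With this, `DataId(visit=1, ccd=2)` and `DataId(ccd=2, visit=1)` compare and hash equal, so data ids can index dicts and deduplicate in sets.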

Design Outline

Butler execution stages

  1. Determine keys for dataset type (Mapper)
    • Extend keys based on available data id
    • Includes registry if present
    • Includes looking in filesystem
  2. Map keys to storage, path, parameters (Mapper)
  3. Use pluggable storage to retrieve/store object (Butler)

Butler interface remains similar

  • No more ButlerFactory
  • No more hierarchy of Butlers; ButlerDataRef and ButlerSubset instead
  • Keys no longer have hierarchy
    • Dataset types are more independent
    • Could inherit from common base if desired
  • queryMetadata() -> getValues() for specified keys
    • Lists only datasets that exist by default; can override
    • Can be used to determine numeric ids for a data id, or vice versa
  • ButlerDataRef based on particular dataset type (remembered)
    • Stores extended keys
    • When applied to other dataset types, keys may be extended further
  • Pluggable storages
    • Storages defined in Python
    • Must be set up (typically by an obs_ package)
    • May be imported
    • Composite objects stored in denormalized form with copies
      • On reading, objects are reused instead of copying
        • Repository name passed to storage
    • Special storage type for shared provenance data
    • No more proxy on read
  • Butler-stored dataset type aliases
    • Use @src in code
    • Define alias using Butler.defineAlias("@src", "coadd_src")
    • May become required to enable introspection of input and output datasets
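The alias mechanism above could look like the following sketch; `defineAlias` and the `@src` convention come from the design, while the class and resolution details are invented:

```python
# Hypothetical sketch of Butler-stored dataset type aliases.

class AliasingButler:
    def __init__(self):
        self._aliases = {}

    def defineAlias(self, alias, dataset_type):
        # Aliases are distinguished from concrete types by a leading "@"
        if not alias.startswith("@"):
            raise ValueError("aliases must begin with '@'")
        self._aliases[alias] = dataset_type

    def _resolve(self, dataset_type):
        # Code refers to "@src"; the alias table supplies the concrete type
        return self._aliases.get(dataset_type, dataset_type)

    def get(self, dataset_type, data_id):
        # Stand-in for retrieval: return what would be looked up
        return (self._resolve(dataset_type), dict(data_id))
```

Because aliases are resolved inside the Butler, a task written against `@src` can be pointed at `src` or `coadd_src` without code changes, which is also what would let tooling introspect a task's input and output dataset types.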

Mapper changes

  • Configuration stored in sqlite3 database in repository
    • Separate from registry
    • Allows dynamic creation of dataset types
      • Subtype of existing with different template
      • Composite of existing dataset types
    • Handles locking for multiple tasks accessing repository
      • Provides place to store provenance information
    • Mappings returned as storage/path/parameters
      • Could be defined in configuration as URL templates; simplifies specification
      • Database queries have DB location in path, query in params (read-only)
    • No more closures
    • Base class provides standard methods for looking up keys
    • Base class provides standard template substitution methods
    • Dataset types can refer to components of composite (read-only)
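A minimal sketch of the per-repository configuration database described above; the schema and function names are invented, and real dataset-type rows would carry more than a storage and a template:

```python
# Hypothetical sketch: dataset-type configuration in a per-repository
# sqlite3 database, so new types (e.g. a subtype of an existing type with
# a different template) can be defined on the fly.

import sqlite3

def open_config(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS dataset_types (
                      name TEXT PRIMARY KEY,
                      storage TEXT,
                      template TEXT)""")
    return db

def define_dataset_type(db, name, storage, template):
    # The connection context manager wraps the insert in a transaction,
    # which is also where locking across multiple tasks would hook in.
    with db:
        db.execute("INSERT OR REPLACE INTO dataset_types VALUES (?, ?, ?)",
                   (name, storage, template))

def map_dataset(db, name, keys):
    # Mapping result in the storage/path/parameters form
    storage, template = db.execute(
        "SELECT storage, template FROM dataset_types WHERE name = ?",
        (name,)).fetchone()
    return storage, template % keys, {}
```

A subtype is then just another row with a different template, and a database-query dataset type would put the DB location in the path column and the query in the parameters.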

Task List

  • Replace PAF files with super-template system (2 weeks)
  • Implement database query datasets (1 week)
  • Improve dataRefs and existence/non-existence (1 week)
  • Implement write-once-check-equal (0.5 week)
  • Port obs_file functionality (2 weeks)
  • Improve numeric id generation (0.5 week)
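Reversible numeric id generation (the last task above) amounts to packing the data id keys into disjoint bit fields so the id can be decoded again; a sketch with invented field widths:

```python
# Hypothetical sketch of reversible numeric id generation by bit packing.

CCD_BITS = 8  # invented width: up to 256 ccds per visit

def compute_id(visit, ccd):
    # Pack (visit, ccd) into a single integer
    if not 0 <= ccd < (1 << CCD_BITS):
        raise ValueError("ccd out of range")
    return (visit << CCD_BITS) | ccd

def decompose_id(numeric_id):
    # Exact inverse of compute_id
    return {"visit": numeric_id >> CCD_BITS,
            "ccd": numeric_id & ((1 << CCD_BITS) - 1)}
```

Because each key occupies its own bit field, `decompose_id(compute_id(v, c))` recovers the original data id, which is what "reversible" requires.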