wiki:Winter2014/Design/AfwTable
Last modified on 01/21/2014 01:04:43 PM

Design for W14 afw::table Upgrades

The W14 designs for the Measurement Framework overhaul, the new Data Analysis Toolkit, and several long-standing tickets have brought to light a need for changes in the afw::table library, particularly in the Schema class. This page is an attempt to gather those requirements in one place and produce a design that satisfies all of them.

Most of these requirements are driven by the Data Analysis Toolkit work, but the overall workload will decrease if they are addressed in parallel with the measurement framework work. In particular, we'd like to be able to use the new Schema APIs and field name conventions when reimplementing most measurement algorithms in the new framework.

Overall, there are three themes here:

  • We are addressing a number of long-standing complaints about afw::table (see ticket numbers below)
  • We are implementing limited support for some relational database operations (better sorting, calculated fields, limited joins), as needed to support analysis toolkit operations on files. We will focus on operations that are either extremely easy to implement and represent a significant convenience to the user (e.g. calculated fields), or for which reliance on third-party database tools does not provide a potential solution (e.g. spatial matching).
  • We are making afw::table easier to map to a true relational database, so we can use such a database to perform more complex queries and retrieve the results as afw::table objects, and to ensure that the analysis toolkit code is insulated from whether the data source is a remote SQL database or a file.

Requirements

  • Improve field names for measurement output catalogs (see also wiki:Winter2013/TableFieldNameReview)
    • Make it possible to trace catalog fields to the code that produces them
    • Address long-standing problems with capitalization and special characters in field names (#2231, #2232, #3023)
    • Make field names and types easier to map directly to SQL database names and types
  • Improve or replace the slot mechanism in afw::table to make it more flexible (#2351)
  • Make it easier to map afw::table Catalogs to relational database tables
    • Make high-level compound field types (e.g. Points, Moments) available without making them part of the Schema definition
    • Restrict field name characters to those allowed by relational databases
  • We need to support multi-joins between tables, both via integer ID equality and spatial matching. The join API should support both:
    • SQL-style joins, in which a new table is created with columns from all input tables
    • nested iteration joins, in which we simply provide an object that allows for nested iteration over groups of matched records.
  • Do not implement too much relational database support in-house that would be better left for third-party tools.

Design

Replace Boost.Variant

I think we can do away with the horrible Boost.Variant and MPL usage in afw::table::Schema, by replacing it with a relatively simple combination of type-erasure and traits. The details of this change are quite technical, and I don't want to go into them here - the bottom line is that this will not change the public interface of afw::table substantially, but it will make adding new field types significantly easier and the code easier to understand overall, and it will also remove the current hard limit of 20 total field types.

Aliases

Aliases will be a new feature in afw::table::Schema, intended as a more general backend for the "slots" mechanism in SourceTable. Aliases will be based purely on field names; an alias is simply a string that is replaced by some other string when constructing a field name. An alias need not stand for a complete field name, but it may only replace the beginning of one (not the middle or the end). For instance, if you had a table with fields "gaussian_centroid_x" and "gaussian_centroid_y", you could have an alias "gauss_c->gaussian_centroid", but not an alias "x->centroid_x". A field may have multiple aliases, but the true field name will not be just another alias; it has special roles. A Schema object will contain a dictionary of aliases, and will include aliases in searches when Keys are retrieved using field names, so that aliases can be used anywhere field names currently are. In addition, we will provide an AliasMap class (a thin wrapper around map<string,string>) that will allow a set of aliases to be managed independently of schemas. By constructing an AliasMap with all common aliases, we will be able to restore aliases when unpersisting catalogs from data formats that do not support aliases (e.g. SQL tables, or queries on those SQL tables).
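The prefix-substitution rule described above can be sketched in a few lines of Python. This is only an illustration of the lookup logic, not the proposed C++ implementation: the method names ("set", "apply"), the "find_key" helper, the use of integers in place of Keys, and the choice to prefer the longest matching alias and to apply aliases only once are all assumptions made for the example.

```python
class AliasMap:
    """Hypothetical stand-in for the proposed AliasMap (alias prefix -> target prefix)."""

    def __init__(self, aliases=None):
        self._aliases = dict(aliases or {})

    def set(self, alias, target):
        self._aliases[alias] = target

    def apply(self, name):
        """Substitute the longest alias matching the start of `name`, if any.

        Assumption: aliases are applied once and are not chained.
        """
        matches = [a for a in self._aliases if name.startswith(a)]
        if not matches:
            return name
        best = max(matches, key=len)
        return self._aliases[best] + name[len(best):]


def find_key(fields, aliases, name):
    """Try `name` as a true field name first, then retry after alias substitution."""
    if name in fields:
        return fields[name]
    return fields[aliases.apply(name)]  # raises KeyError if still unknown


# Example from the text; an integer offset stands in for a real Key.
fields = {"gaussian_centroid_x": 0, "gaussian_centroid_y": 1}
aliases = AliasMap({"gauss_c": "gaussian_centroid"})
key_x = find_key(fields, aliases, "gauss_c_x")            # resolved through the alias
key_y = find_key(fields, aliases, "gaussian_centroid_y")  # true field name
```

Because substitution only happens at the start of the name, an alias such as "x->centroid_x" has no way to match the tail of "gaussian_centroid_x" in this sketch, which mirrors the restriction in the text.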

Once aliases are available, the current slot getters on SourceRecord would simply refer to certain predefined alias names. The Keys corresponding to these names would continue to be cached by the Table (as an optimization), but the responsibility for defining and persisting the slots would reside with the Schema object. Additional standard aliases could then be defined dynamically, though these would not have getters.

FunctorKeys

We can support computed fields easily by allowing Records to accept functors with a certain signature in place of a "Key", and to return the result of calling the functor with the Record as its argument (FunctorKeys will typically hold true Keys). This will be used to support composite fields in the future (e.g. Points, Moments, Covariances). It will also allow us to compute magnitudes on-the-fly from flux fields, Coord fields from image coordinates, and alternate ellipse parameters (e.g. radii, ellipticities) from sets of "[xx, yy, xy]" fields.

We will provide convenience functions to create these computed field functors from common field name patterns (see below), but we will not store these fields within Schemas. This will allow us to deprecate and ultimately remove support for compound fields within schemas, making them much easier to map to relational database tables.
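As a rough illustration of the FunctorKey idea, the Python sketch below lets a record accept either a plain key or a callable. All class and function names here ("Record", "PointFunctorKey", "MagnitudeFunctorKey", "point_key_from_prefix") are hypothetical, field names stand in for true Keys, and the zero point of 27.0 is an arbitrary value chosen for the example.

```python
import math


class Record:
    """Hypothetical record: stored values looked up by name, or computed by a functor."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, key):
        # A functor is called with the record itself; anything else is a stored field.
        if callable(key):
            return key(self)
        return self._values[key]


class PointFunctorKey:
    """Builds an (x, y) pair from two stored fields."""

    def __init__(self, x_key, y_key):
        self.x_key = x_key
        self.y_key = y_key

    def __call__(self, record):
        return (record.get(self.x_key), record.get(self.y_key))


class MagnitudeFunctorKey:
    """Computes a magnitude on the fly from a flux field and an assumed zero point."""

    def __init__(self, flux_key, zero_point):
        self.flux_key = flux_key
        self.zero_point = zero_point

    def __call__(self, record):
        return self.zero_point - 2.5 * math.log10(record.get(self.flux_key))


def point_key_from_prefix(prefix):
    """Convenience constructor based on the "<prefix>_x" / "<prefix>_y" naming pattern."""
    return PointFunctorKey(prefix + "_x", prefix + "_y")


record = Record({
    "gaussian_centroid_x": 10.5,
    "gaussian_centroid_y": 20.25,
    "base_PsfFlux_flux": 100.0,
})
centroid = record.get(point_key_from_prefix("gaussian_centroid"))
magnitude = record.get(MagnitudeFunctorKey("base_PsfFlux_flux", zero_point=27.0))
```

Nothing about the compound value is stored in the schema in this sketch; only the scalar "x", "y", and "flux" fields exist as columns, which is the property that makes the table easy to map to a relational database.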

Naming Requirements and Conventions

NOTE: I am aware of the naming conventions for the database schema, but have made little effort to adhere to them here; I think this is a different use case, in which the goals of the naming conventions are in some respects quite different (though I have changed the conventions for error columns to adhere to the database naming conventions, as there's no reason for them to differ assuming we add FunctorKeys). In particular, we really want to be able to trace field names to algorithm classes, and we want to be able to construct FunctorKeys that make use of multiple fields using common naming patterns. I do feel that these conventions are worse aesthetically compared to what we have now in afw::table (and they're probably a bit longer on average), but they'll end up being less confusing and more functional.

  • We will cease translating periods to underscores in field names during FITS table I/O.
  • All true field names must contain only upper and lowercase letters, numbers, and underscores. They must start with an uppercase or lowercase letter, and capitalization should be used only for readability, not for uniqueness.
    • This will ensure that the true field names are valid Python identifiers and valid SQL field names, removing the need for mangling/demangling when converting to either.
    • This will be a strong convention for fields created by LSST code, but it will not be enforced by afw::table, to allow FITS binary tables from other sources to be read.
  • Fields created by measurement algorithms should have the following properties:
    • All fields created by a single algorithm should have a common prefix that matches the template "$PACKAGE_$CLASS_". "$CLASS" is the name of the algorithm class. "$PACKAGE" is the name of the package where the algorithm is defined, with the following modifications for brevity: if the package name starts with "meas_", that should be dropped, and if it ends with a word that is part of the class name, that should be dropped (the goal is simply to make it obvious which package an algorithm is defined in, without producing extremely long field names). For example:
      • An algorithm class named "PsfFlux" defined in the "meas_base" package would use the prefix "base_PsfFlux_"
      • An algorithm class called "KronFlux" defined in "meas_ext_kron" would use the prefix "ext_KronFlux_"
    • Algorithms that provide a flux measurement should have at least two fields, with the following suffixes: "_flux" and "_fluxSigma". This should be the case even if the algorithm name ends with the word "Flux" (e.g. "base_PsfFlux_flux", "base_PsfFlux_fluxSigma").
    • Algorithms that provide a centroid or other position measurement should have "x" and "y" fields, as well as "xSigma", "ySigma", and optionally "x_y_Cov" fields.
    • Algorithms that measure an ellipse should have "xx", "yy", and "xy" fields, regardless of whether the values were measured as moments, or using some other parameters, as well as "xxSigma", "yySigma", "xySigma", and optionally "xx_yy_Cov", "xx_xy_Cov", etc.
    • A Flag field indicating whether the algorithm's results should be trusted in most cases should be present as simply "flag" (e.g. "base_PsfFlux_flag"). Often this will simply be an OR of the more detailed flags, but detailed flags that do not generally indicate untrustworthy results should not affect this overall failure flag.
    • Flags providing more detailed information should be formed by adding an underscore-joined suffix to the main flag field name (e.g. "base_PsfFlux_flag_badThingHappened").
  • Angle, Point, Moment, Coord, Array, and Covariance fields should not be added directly to tables as fields; these should be considered deprecated as true field types. Aliases to these types will be supported, however, and we will provide convenience methods to allow the appropriate subfields of a compound type to be added at the same time as an alias that can later be used to construct a "FunctorKey" for the compound type. (In other words, you still won't have to explicitly add "x" and "y" fields separately; there will be functions to create both simultaneously.)
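The character rules and the "$PACKAGE_$CLASS_" prefix rule above can be checked mechanically. The Python sketch below is one possible reading of them; the function names are hypothetical, and two details are assumptions: "a word that is part of the class name" is interpreted as a case-insensitive substring test, and the last package word is never dropped if it is the only one left.

```python
import re

# Letters, digits, and underscores only, starting with a letter.
_FIELD_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_field_name(name):
    """True if `name` follows the proposed character rules for true field names."""
    return _FIELD_NAME.fullmatch(name) is not None


def case_collisions(names):
    """Return pairs of names that differ only by capitalization."""
    seen = {}
    clashes = []
    for name in names:
        folded = name.lower()
        if folded in seen and seen[folded] != name:
            clashes.append((seen[folded], name))
        seen.setdefault(folded, name)
    return clashes


def algorithm_prefix(package, class_name):
    """Build the "$PACKAGE_$CLASS_" prefix with the brevity rules applied."""
    words = package.split("_")
    if len(words) > 1 and words[0] == "meas":
        words = words[1:]  # drop a leading "meas_"
    if len(words) > 1 and words[-1].lower() in class_name.lower():
        words = words[:-1]  # drop a trailing word repeated in the class name
    return "_".join(words) + "_" + class_name + "_"


psf_prefix = algorithm_prefix("meas_base", "PsfFlux")
kron_prefix = algorithm_prefix("meas_ext_kron", "KronFlux")
flux_fields = [psf_prefix + "flux", psf_prefix + "fluxSigma", psf_prefix + "flag"]
```

With these definitions, the two examples from the list produce "base_PsfFlux_" and "ext_KronFlux_", and a name containing a period (such as "centroid.x") is rejected by "is_valid_field_name".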

Joins and Spatial Matches

  • A new JoinCatalog class that explicitly supports heterogeneous Records (i.e. multiple Schemas) and contains group information. When iterating over a JoinCatalog, one iterates over the groups first, then over each record in that group.

  • A convenience function for flattening JoinCatalogs into regular catalogs (with one group per Record) with a new schema that contains all fields from the joined catalog schemas with prefixes.
  • A convenience function for creating a JoinCatalog from several catalogs by joining on IDs
  • A multi-catalog spatial match function, returning a JoinCatalog
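To make the two join styles concrete, here is a small Python sketch of an ID-equality join that produces groups for nested iteration, plus a flattening step that produces SQL-style rows. Catalogs are modeled as lists of dicts, and "join_on_id" and "flatten" are hypothetical names. The flattening policy (take the first record from each catalog in a group, and omit fields for catalogs with no match) is an assumption; the text does not specify how missing or multiple matches are handled.

```python
from collections import defaultdict


def join_on_id(catalogs, id_field="id"):
    """Group records from several catalogs by a shared integer ID.

    Each group is a list with one entry per input catalog; each entry is the
    list of records from that catalog carrying the group's ID.
    """
    groups = defaultdict(lambda: [[] for _ in catalogs])
    for index, catalog in enumerate(catalogs):
        for record in catalog:
            groups[record[id_field]][index].append(record)
    return [groups[key] for key in sorted(groups)]


def flatten(groups, prefixes):
    """Produce one flat row per group, prefixing field names by source catalog."""
    rows = []
    for group in groups:
        row = {}
        for prefix, records in zip(prefixes, group):
            if records:  # assumption: unmatched catalogs contribute no fields
                for name, value in records[0].items():
                    row[prefix + name] = value
        rows.append(row)
    return rows


ref = [{"id": 1, "flux": 10.0}, {"id": 2, "flux": 20.0}]
src = [{"id": 2, "flux": 21.5}, {"id": 3, "flux": 5.0}]

groups = join_on_id([ref, src])          # nested-iteration view
rows = flatten(groups, ["ref_", "src_"])  # SQL-style view

# Nested iteration: groups first, then the records within each group.
matched_ids = [
    record["id"]
    for group in groups
    for records in group
    for record in records
]
```

A spatial match could return the same group structure, with group membership decided by position rather than by ID equality.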

Miscellaneous

  • Catalog sorting needs to support stable sorting and arbitrary comparison functors.
  • SortedCatalogs should limit their reliance on ID ordering in implementing searches on ID (perhaps we'd have a flag that indicates whether the Catalog is sorted by ID, and only use binary search implementations when it is).
  • SourceCatalog should provide iterators over deblended children (given the parent) and the ability to return the parent given a child. These operations will require the full catalog and will be performed lazily; we will not be adding references to the Record objects themselves.
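The following Python sketch illustrates the three points above together: a stable sort that accepts an arbitrary comparison functor, an ID search that only uses binary search when a "sorted by ID" flag is set, and lazy parent/child lookups that need the whole catalog. The "Catalog" class, its method names, and the "id" and "parent" field names are hypothetical, and the sketch does not try to invalidate the flag when records are modified after sorting.

```python
import bisect
from functools import cmp_to_key


class Catalog:
    """Hypothetical catalog of dict records with "id" and "parent" fields."""

    def __init__(self, records):
        self.records = list(records)
        self.sorted_by_id = False
        self._ids = []

    def sort(self, compare=None):
        """Stable sort; with no comparison functor, sort by ID and set the flag."""
        if compare is None:
            self.records.sort(key=lambda r: r["id"])
            self._ids = [r["id"] for r in self.records]
            self.sorted_by_id = True
        else:
            # list.sort is stable, so ties keep their previous relative order.
            self.records.sort(key=cmp_to_key(compare))
            self.sorted_by_id = False

    def find(self, record_id):
        """Binary search when sorted by ID, linear scan otherwise."""
        if self.sorted_by_id:
            i = bisect.bisect_left(self._ids, record_id)
            if i < len(self._ids) and self._ids[i] == record_id:
                return self.records[i]
            return None
        for record in self.records:
            if record["id"] == record_id:
                return record
        return None

    def children(self, parent_id):
        """Lazily iterate over records whose parent is `parent_id`."""
        return (r for r in self.records if r["parent"] == parent_id)

    def parent_of(self, child):
        """Look up the parent record through the catalog, not through the record."""
        return self.find(child["parent"])


catalog = Catalog([
    {"id": 3, "parent": 1},
    {"id": 1, "parent": 0},
    {"id": 2, "parent": 1},
])
found_unsorted = catalog.find(2)   # linear scan
catalog.sort()                     # by ID; enables binary search
child_ids = [r["id"] for r in catalog.children(1)]
parent = catalog.parent_of(catalog.find(3))
```

Records hold only a parent ID in this sketch, so the parent and child relationships are resolved through the catalog on demand, in line with the intent not to add references to the Record objects themselves.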