Last modified on 11/12/2013 09:31:30 AM

Design for W14 Debug & Analysis Toolkit Work

This page is a work-in-progress of gathering and organizing the requirements and suggestions from Andy Becker's brainstorming page, as we begin to turn them into a design. Please feel free to modify this page (but please indicate who is speaking) to add more requirements and suggestions, or send them (or links to them) directly to me and I can try to incorporate them.

NOTE: quite a bit of discussion on this topic has happened on the LSST-data mailing list (look for the [Analysis Toolkit] subject prefix):

http://listserv.lsstcorp.org/mailman/private/lsst-data/2013-November/subject.html

Scope

Jim thinks these aspects of the scope are fairly well-defined, but please push back if you disagree. There are other aspects of the scope that are not well-defined; see below.

  • We intend to provide plotting and display tools that can be used from the command-line, after data processing is complete, or from debug blocks during command-line processing (we also intend to explore replacing the current mechanisms for enabling live display-based debugging in pipeline code).
  • All plots and displays must be interactive at a level that allows "drilling down" to further investigate problems, indicating that we must have multiple ways of viewing the data (e.g. an annotated image display and a scatter plot of measured quantities) that are somehow linked (so an outlier in the scatter plot can be immediately viewed in the annotated image display). For the most part, we will provide a Python library interface that can be used easily from the command-line, with mouse actions used only to select points, pan, or zoom, but we will design classes in such a way that a true GUI could be implemented on top of these in the future.
  • We will focus on the particular data products associated with LSST pipeline outputs, and our high-level components will be focused on these data products and their relationships. We will not build a general-purpose tool for investigating arbitrary astronomical data with arbitrary connections between data tables, though some of our low-level primitives may be useful for these purposes. I strongly believe we need to be wary of over-generalization as a form of scope creep here.
  • This is not a rewrite of PipeQA; we will not be building a web front-end or focusing on quick-look "tests" for data or data reduction quality. I believe that if we do certain things well, we will be able to reuse some components in a future PipeQA rewrite, but we will not consider that a primary goal.

Collected Requirements

Readers, please add more...(we'll worry about priorities later)

  • Better primitives for image display (i.e. replace afw.display, even if DS9 is still the backend)
    • Should abstract out what tool is actually used to display the images
    • More details on DS9 pros/cons on Andy's page
  • Better mechanisms and primitives for display-based live pipeline debugging (i.e. lsstDebug)
    • Tests need to be more uniform
    • The framework for adding them needs to encourage documentation
    • Consider switching to something like pex_config.
    • (PAP) Command-line activation of particular debug features
  • Certain common plots, based on FITS files for small datasets and potentially databases for larger datasets. Should be particularly easy for common data units (i.e. CCDs, rafts, visits, patches).
    • 3-color diagrams
    • color-magnitude diagrams
    • repeatability of fluxes
      • TODO: need to collect example plots
    • repeatability of colors
      • TODO: need to collect example plots
    • matching, comparison with reference catalogs (not just astrometry.net catalogs!)
  • Selectable overlays for image display tools
    • source measurements (points/ellipses)
    • mask planes
    • footprints
    • coadd input image boundaries
    • (PAP) Match lists (including accounting for distortion)
    • (PAP) Scaling circles by magnitude?
  • Automatically-aligned opaque layers for image display tools
    • deblend children
    • background
    • constituent images of coadd
    • color composite images
    • template and science image for difference images
    • various states of
  • Interactivity between plots and image display tools
    • Zoom/pan image based on selection in scatter plots
    • Highlight scatter plot points based on selection in image display tools
    • Highlight scatter plot points or sources in image display based on histogram bin selection
    • Display lines in histograms corresponding to selections in scatter plots or image displays
  • Easy selection and drill-down between related things
    • Inspect sources (inc. forced) given an object
    • Inspect child properties and deblends given a parent (and vice versa)
    • Inspect PSF model at positions of sources, arbitrary positions on an image
  • Support for loading data from both Butler-based files and remote SQL databases, including common joins and calibration (i.e. transform fluxes to magnitudes)
    • (PAP) Is the SQL database support something better off putting in the butler?
  • Plotting primitives for simultaneously viewing many slices through high-dimensional data (prototypes present in meas_multifit)
  • Plotting/display primitives for viewing data-model-residuals for individual objects
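The calibration step mentioned in the data-loading requirement (transforming fluxes to magnitudes) might look something like the sketch below. The function name `flux_to_mag` and the zero-point flux argument `flux0` are assumptions made for illustration and are not part of any existing pipeline API; non-positive fluxes are simply mapped to NaN here.

```python
import numpy as np


def flux_to_mag(flux, flux0):
    """Convert fluxes to magnitudes using m = -2.5 * log10(flux / flux0).

    flux0 (a hypothetical parameter name) is the flux of a zero-magnitude
    source. Non-positive fluxes have no defined magnitude and become NaN.
    """
    flux = np.asarray(flux, dtype=float)
    mag = np.full(flux.shape, np.nan)
    good = flux > 0
    mag[good] = -2.5 * np.log10(flux[good] / flux0)
    return mag


# Illustrative values only.
mags = flux_to_mag([1000.0, 10.0, -3.0], flux0=1000.0)
```

Keeping this as a vectorized function on plain arrays means it could be applied equally to data read from files or returned by a database query.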

Major Outstanding Design Questions

This is the section for design questions Jim already knows he doesn't know how to answer ("known unknowns"); comments and recommendations much appreciated.

Data/Query Abstraction

The data we wish to analyze may have originated in FITS files produced by the pipeline or in queries to a remote MySQL database, or may even represent an intermediate output from a live pipeline task. Clearly we need to isolate the display code from these, but there are many ways to design this abstraction layer.

In addition, we need to be able to filter, sort, and (to a limited extent) join tables to provide the sort of interactivity and "drill-down" functionality required. Some data sources (e.g. SQL databases) provide a tremendous amount of query functionality that we could leverage in certain kinds of data abstraction layers, while others provide limited query functionality (e.g. NumPy boolean indexing) that may nevertheless meet our requirements. Implementing a SQL interface for non-SQL data sources seems extremely challenging, but we will clearly need some way to express moderately complex queries (e.g. a WHERE clause with multiple AND and OR clauses representing different cuts), and it seems somewhat foolish to devise a new, non-SQL way of representing such operations. On the other hand, when we already have a NumPy-based table in hand, it seems equally foolish to round-trip the data through a SQL database just to perform a simple filter that could be done with a boolean indexing operation.

Another major concern is data volume - delegating queries on large amounts of data to a (possibly remote) SQL database is clearly desirable, but round-tripping through a SQL database for a small amount of data that trivially fits in memory is not.
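To make the comparison concrete, here is a minimal sketch of the same multi-clause cut expressed once with NumPy boolean indexing and once as a SQL WHERE clause against an in-memory SQLite table. The column names (`mag`, `stellarity`, `flagged`) and the values are invented for illustration; they are not taken from any actual pipeline schema.

```python
import sqlite3

import numpy as np

# Hypothetical catalog; field names and values are illustrative only.
catalog = np.array(
    [(1, 21.5, 0.90, 0), (2, 24.8, 0.10, 0), (3, 19.2, 0.95, 1), (4, 23.0, 0.20, 0)],
    dtype=[("id", "i8"), ("mag", "f8"), ("stellarity", "f8"), ("flagged", "i8")],
)

# NumPy form: (bright AND star-like) OR flagged.
mask = ((catalog["mag"] < 22.0) & (catalog["stellarity"] > 0.8)) | (catalog["flagged"] == 1)
numpy_ids = catalog["id"][mask].tolist()

# SQL form: the same cut, after round-tripping the rows into SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, mag REAL, stellarity REAL, flagged INTEGER)")
conn.executemany("INSERT INTO src VALUES (?, ?, ?, ?)", catalog.tolist())
sql_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM src "
        "WHERE (mag < 22.0 AND stellarity > 0.8) OR flagged = 1 ORDER BY id"
    )
]
conn.close()
```

Both forms select the same records; the difference is the cost of moving the data into the database first, which is the round trip the paragraph above is concerned about.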

Some possible designs are summarized below.

UPDATE: The first approach is the one chosen in the current design, with some hope that elements of the second may be added in the future (by using SQLite as the on-disk format for afw::table).

Strided-memory data with converters

This is likely the simplest approach to implement, but probably the most limited in terms of features and extensibility. All plotting operations would be based on a simple, strided-memory data structure (afw::table objects, or something based on NumPy arrays). We'd provide bidirectional converters for these objects and SQL data sources, but we'd generally leave it to the user to explicitly use them to get the data in the form needed by the analysis toolkit (though common data dumping scripts could be collected and reused). We'd implement our own simple filters, sorts, and joins, probably delegating most of the work to NumPy. Joins would likely only be available for certain predefined relationships (e.g. blended parent to children, Object to Source) that play a particular role in the analysis tools themselves. Complicated combinations of filters could be performed either in SQL before the data is converted to the strided-memory structure, or using NumPy logical operators and boolean indexing.
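As a rough sketch of what a predefined join delegated to NumPy could look like, the example below handles the blended-parent-to-children relationship on a structured array. The `parent` field convention (0 for top-level records) and the helper names `children_of` and `parents_of` are assumptions for illustration, not an existing afw::table interface.

```python
import numpy as np

# Hypothetical source table; 'parent' is assumed to be 0 for top-level records.
sources = np.array(
    [(10, 0, 5.0), (11, 10, 2.0), (12, 10, 3.0), (13, 0, 7.0), (14, 13, 7.0)],
    dtype=[("id", "i8"), ("parent", "i8"), ("flux", "f8")],
)


def children_of(table, parent_id):
    """Return the records whose 'parent' field matches parent_id."""
    return table[table["parent"] == parent_id]


def parents_of(table, children):
    """Return the parent records for a set of child records."""
    return table[np.isin(table["id"], children["parent"])]


kids = children_of(sources, 10)
# Filters compose with ordinary boolean indexing.
faint_kids = kids[kids["flux"] < 2.5]
parents = parents_of(sources, faint_kids)
```

Because each helper returns another structured array, the results can be passed directly to plotting code or filtered again without any conversion step.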

Embedded SQL with converters

In this scenario, we'd use something like SQLite to implement our core data object, and provide converters from FITS binary tables (etc.) to this form. The SQL database could be in-memory or on disk, and we would use direct SQL queries for any filters and joins. We'd only convert to NumPy arrays immediately before plotting. A major concern here could be computational performance (from the constant conversion between NumPy and SQL form), though we should do some experiments before making too many assumptions about that. I'm also worried about code complexity, especially if we find that we need to start caching the NumPy arrays, but it could end up being simpler than other options if it turns out that the high-level analysis tool requirements really do require a powerful query engine.
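A minimal sketch of this scenario, using Python's standard `sqlite3` module as the embedded database and converting to a NumPy structured array only at the last step. The table layout and the helper name `query_to_array` are hypothetical.

```python
import sqlite3

import numpy as np

# Hypothetical table layout, used on both the NumPy and SQL sides.
dtype = [("id", "i8"), ("x", "f8"), ("y", "f8")]
table = np.array([(1, 0.5, 1.5), (2, 2.5, 3.5), (3, 4.5, 0.5)], dtype=dtype)

# Converter in: load the rows into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, x REAL, y REAL)")
conn.executemany("INSERT INTO src VALUES (?, ?, ?)", table.tolist())


def query_to_array(connection, sql, dtype):
    """Run a query and convert the result rows to a structured array."""
    rows = connection.execute(sql).fetchall()
    return np.array(rows, dtype=dtype)


# All filtering happens in SQL; NumPy only sees the final selection.
selected = query_to_array(conn, "SELECT id, x, y FROM src WHERE x > 1.0 ORDER BY id", dtype)
conn.close()
```

Timing `query_to_array` on realistic table sizes would be one way to run the performance experiments suggested above.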

Polymorphic data objects

We can also imagine writing an opaque data interface that could have multiple backends for different data sources. This could include both of the above options, as well as delegating query operations to remote MySQL databases. This doesn't avoid the core query support problem, though: to support all the desired data sources, we're still stuck with either not having SQL support in the interface, or implementing a SQL parser for non-SQL sources. With this approach, though, we could base most display tools on a more restricted query interface that delegates to SQL in some implementations and boolean indexing in others, while allowing the full SQL interface to be available in some sense to users or plots that require a SQL-based backend. I'm nervous about trying to define this interface up-front, however, and worried that we'd end up either dropping flat-file support (as has happened in PipeQA) or slowly implementing a lot of database functionality piecemeal in the non-SQL backends as we find that we need more and more. I'm also concerned that trying to make the data object polymorphic will lead us to a fairly minimal interface for it, which can be a real pain in interactive analysis when you really want to have lots of operations available to you in advance, before you know you need them (which both SQL and NumPy do).
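The restricted-interface idea could be sketched roughly as follows: a small abstract query interface with one NumPy backend and one SQLite backend, where only the SQL backend exposes a full-SQL escape hatch. All class and method names here (`DataSource`, `select_range`, `sql`) are invented for this illustration and do not correspond to any existing design.

```python
import sqlite3
from abc import ABC, abstractmethod

import numpy as np


class DataSource(ABC):
    """Hypothetical restricted query interface shared by all backends."""

    @abstractmethod
    def select_range(self, column, low, high):
        """Return sorted ids of rows with low <= column < high."""


class NumpySource(DataSource):
    """Backend that answers queries with boolean indexing."""

    def __init__(self, array):
        self._array = array

    def select_range(self, column, low, high):
        values = self._array[column]
        mask = (values >= low) & (values < high)
        return sorted(self._array["id"][mask].tolist())


class SqliteSource(DataSource):
    """Backend that delegates queries to SQL."""

    def __init__(self, connection, table):
        self._conn = connection
        self._table = table

    def select_range(self, column, low, high):
        # Column names cannot be bound as parameters, so validate them first.
        known = {row[1] for row in self._conn.execute(f"PRAGMA table_info({self._table})")}
        if column not in known:
            raise KeyError(column)
        sql = f"SELECT id FROM {self._table} WHERE {column} >= ? AND {column} < ? ORDER BY id"
        return [row[0] for row in self._conn.execute(sql, (low, high))]

    def sql(self, query, params=()):
        """Full SQL access, available only on SQL-based backends."""
        return self._conn.execute(query, params).fetchall()


rows = [(1, 20.0), (2, 22.5), (3, 25.0)]
array = np.array(rows, dtype=[("id", "i8"), ("mag", "f8")])
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, mag REAL)")
conn.executemany("INSERT INTO src VALUES (?, ?)", rows)

backends = [NumpySource(array), SqliteSource(conn, "src")]
results = [backend.select_range("mag", 21.0, 26.0) for backend in backends]
```

Even this toy version shows the concern raised above: every operation a display tool needs has to be added to the shared interface and implemented once per backend.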

Third-Party Tools

UPDATE: The current design will continue to use DS9 for image display, and will provide some connectivity to third-party tools via SAMP. Glue and particularly Ginga will be considered again in the future.

There are many third-party tools for scientific visualization we should consider adopting as part of this work, both within the astronomy communities and the general Python data analysis communities. Using one or more of these would allow us to leverage a lot of existing work, but my sense overall is that the "fit" between these tools and our goals here is not generally great. Choosing to adopt any of these tools would have a major impact on what we want to do with this task, so making this decision is something we should do very early in the planning process.

Here are a few that have been mentioned so far, along with my own preliminary thoughts on them and my attempts to summarize those of others (readers, please feel free to comment here or add more suggestions!):

  • ASCOT, TOPCAT: general-purpose visualization tools useful for exploration of data. ASCOT is web-based (good for collaborative work), and widget based (good extensibility...if we were web developers). TOPCAT is Java-based and is perhaps a bit more mature. Both are already useful for general GUI-based data exploration and quick plotting. I find it hard to imagine how we could leverage these to provide a lot of the functionality we need as part of our analysis toolkit, simply because of the language barriers and the sense that these are designed more as standalone applications rather than libraries. We should consider encouraging use of these tools with our pipeline for more general use cases our own tools will not cover, and that may include some work to provide a minimal level of interoperability (e.g. scripts to make it easier to load our data into these programs).
    • (PAP) TOPCAT uses SAMP to intercommunicate with ds9 and other SAMP-enabled clients. Having our code speak SAMP would enable us to leverage the full capabilities of these other tools.
  • glue: like ASCOT and TOPCAT, a general-purpose visualization tool, but written in Python and based on NumPy, Matplotlib, etc. It seems somewhat immature, and the API documentation is clearly lacking, but there is a great deal more potential for interoperability than ASCOT or TOPCAT and glue was very clearly written with astronomy in mind. Definitely worth a closer look, but I don't yet have a clear picture of how this could really be used to reduce the work we have to do to build something that meets our requirements.
  • ginga: an astronomical image viewer written in Python, and as such a leading candidate to replace DS9 in our display tools. Main concerns are whether it lacks any important DS9 features, and whether the need for a more heavyweight GUI toolkit (GTK or Qt is required) is acceptable.
    • (PAP) It looks like several plugins (which we would have to write) would be required in order to get ginga close to ds9's functionality.

Supporting Remote Analysis

UPDATE: Having learned a bit more about how it would work, Jim is no longer quite as intimidated by remote operation. Which protocol(s) will be used is not yet part of the design.

Our image display tools already support a certain degree of remote operation, which I believe is regarded as a requirement at some level. Our matplotlib-based plotting tools do not support any kind of remote operations, however, and this presents a big problem for providing better interoperability between the image display and data plotting components. Adding a client-server architecture with this sort of communication capability seems like a huge addition to the already substantial amount of development work needed, but I believe adopting a client-server architecture is something that generally works much better if it is done at the beginning of a design, rather than added on at the end.

We could imagine addressing this by adopting the Polymorphic data objects option for the data abstraction layer question, and implementing a data class that retrieves data from a remote system while doing as many calculations there as possible, or even by adding this sort of feature to the Butler. That still seems like quite a challenge, however, and my first inclination is to not attempt to add any sort of remote operation to the analysis toolkit, with the hope that other options (e.g. remote filesystems that behave better than sshfs) will provide this support in the future (and in the meantime, we can get by with vnc and/or X forwarding).

(RHL). What do you mean by remote operations? If X were fast enough then I think it'd be sufficient. If it isn't, then we need the ability to run a viewer locally and command it remotely but isn't this a protocol question (SAMP? XMLRPC? XPA?) rather than a design question?

  • (JFB): I think what we do need is just the ability to run a viewer locally and command it remotely, and that's what I'm trying to do now. The choice of protocol may affect our ability to talk to third-party tools, but I don't think it will affect the design much. I think it does impact the design somewhat (you have to have some system of commands and a local viewer app of some sort), but this section was written before I had that idea in my head (hence my sentence at the top that "I'm no longer intimidated by this").

Design Proposal

Data Access: Butler and afw::table Enhancements

  • All plotting and display tools will rely on in-memory afw::table Catalog objects as inputs. Assumptions about the source of these objects will be limited and always abstracted away. For instance, a display tool based on a table of CoaddSources should not assume that those sources came from either a deepCoadd_src FITS file on disk or a MySQL database query, but it may be aware that there is a table of ForcedSources associated with each of the records in the CoaddSource table, and provide an overrideable way to load the associated ForcedSources on request.
  • K-T has noted that the updated Butler design will include support for running database queries through the Butler; we will thus rely on the Butler to abstract access to both FITS files and SQL query results in a consistent manner (which we assume will be Catalogs or objects trivially convertible to Catalogs).
  • We will rely on Mapper features to mangle/demangle record IDs.
    • (PAP) Should the Mapper (as opposed to the butler) be externally visible in order to allow this? I still like the idea of the Detector object knowing about the mangling.
    • (JFB) Perhaps; I'm not particularly concerned about that distinction for the purposes of this document, and I think it's probably better discussed on the cameraGeom page.
  • Some afw::table improvements will be done to support the analysis toolkit work. I've created a new page for this which combines the afw::table work from this page, the measurement overhaul page, and a number of long-standing tickets.

Image Display

  • We will continue to use DS9 as our primary display tool.
  • We will create a new mid-level interface for interacting with display tools that does not assume DS9 is the frontend. The new interface will be stateful, not command-oriented; rather than "sending" an image to the display tool, we will construct an object that represents the displayed image, with mutators that affect the display. This will likely be somewhat clunky as long as DS9 is our primary display tool, and in particular we will have to trust the user not to manipulate objects (such as DS9 frames and regions) that our interface expects to maintain full control over, but it will provide quite a bit more power and it should be possible to avoid the clunkiness when we one day replace DS9. This interface will involve five key classes:
    • Opaque ImageLayers which may be reordered (only the top layer is visible at any given time). Each layer may have a unique scale and colormap (which we will leave to the display tool gui) but will be locked to a single coordinate system. Not all ImageLayers must have the same size; some layers may just represent postage stamps that can be registered with a particular location. In DS9, each layer will be a separate frame, and we will use the "lock" feature to keep them aligned. In the post-DS9 future, image layers may be allowed to have an alpha channel. ImageLayers may be toggled out of DS9's frames without being deleted entirely, to keep the list of frames manageable.
    • ImageLayers may be placed into named LayerGroups (each layer may be in exactly one group). Groups may be toggled and moved together, and LayerGroup may be subclassed to provide additional features.
    • Transparent, single-color Overlays (of the sort currently used to display Masks). Overlays may be associated with a single ImageLayer or with a LayerGroup.
    • GeometrySets, which may be sets of points, ellipses, or polygons, all of which may be annotated with visible text or invisible metadata (different geometries will likely be subclasses of GeometrySet). Like Overlays, these may be associated with a single ImageLayer or a LayerGroup. In the DS9 implementation, we will prevent the user from manipulating these objects using the DS9 gui, aside from selection, and we will support querying DS9 for the list of currently-selected geometric objects (and the metadata those objects were constructed with, e.g. source IDs).
    • A singleton DisplayManager object that holds all ImageLayer, Overlay, GeometrySet, and LayerGroup objects currently active, and allows them to be toggled and moved up and down. A DisplayManager may control multiple DS9 instances, and allow layers to be moved between different instances. Different instances may have different coordinate systems. All of the DS9-specific code will be in (mostly private) methods of the DisplayManager class; it is the job of the DisplayManager to interpret and display the application-agnostic data in the layer classes.
  • In addition to the mid-level interface, we will provide a LayerGroup subclass ExposureDisplay for displaying a combination of Exposure+SourceCatalog. This will create ImageLayers for each MaskedImage plane, as well as for the background model and the unsubtracted image, Overlays for the Mask planes and the SourceCatalog's Footprints, and ellipse GeometrySets from the SourceCatalog's centroid and shape (with options for using annotations to display other fields as text). ExposureDisplay will also support:
    • Retrieving source IDs from the currently-selected GeometrySet ellipses
    • Creating an ImageLayer or a set of postage-stamp ImageLayers containing PSF model images at various points on the image. Making one ImageLayer for each PSF image is an option so these can be displayed in a tight grid by tiling DS9's frames (which we'll provide more convenient control over when this feature is used).
    • Creating postage-stamp ImageLayers that represent deblended child sources by replacing neighboring objects with noise in the same manner as SourceMeasurementTask.
    • Creating model and residual postage stamp ImageLayers for measurement algorithms that implement a few (TBD) hooks.
  • We will further subclass ExposureDisplay to CoaddDisplay, which will contain all the layers of ExposureDisplay as well as:
    • polygon GeometrySets for the bounding boxes of the images that went into a coadd
    • additional ImageLayers for constituent images (with Mask-plane overlays), and optionally more ImageLayers for their variance planes
  • All of the above classes will have interfaces that are designed to be controlled easily from an interactive Python prompt, but will be designed in such a manner that a GUI layer could be built on them in the future.
  • We will continue to use XPA to control DS9, as the choice of messaging protocol here should be mostly invisible to the user (it should reside completely within the DisplayManager class) and it will be less work than switching to a new protocol.
  • Some elements of the existing low-level DS9-specific interface may be preserved and used to implement the high-level interface. Others may be deprecated, but nothing will be removed immediately.
  • In addition, we will identify the features we would need to add (possibly in plugin form) to ginga to make it a possibility in the future, and ensure these are either filed as upstream issues or planned for future LSST development work.
  • (KTL) This design seems to conflate at least the Model and View components of the standard MVC paradigm. Adhering to this proven pattern is more likely to provide flexibility and adaptability down the road.
    • (JFB) I don't really follow. My understanding of MVC is quite limited, but I'd consider the "model" here to simply be the Butler, Exposure, and SourceCatalog objects (and I don't think there's a need for a layer to abstract those away). I'd think of all the layer classes as the "view", and the ExposureDisplay and CoaddDisplay classes as the "controller". But maybe I'm trying to just map those concepts onto an existing design without really knowing how they should be used. In any case, I do think our ability to design an ideally abstract view interface is severely limited by the fact that we're trying to make use of a third-party display tool that doesn't provide exactly what we want, and doesn't give us a very good API with which to control it. And I think our need for a comprehensive model API is lessened by the fact that our data is read-only in this context.
  • (RHL) I'm not convinced that this abstraction is better than a simple command-based approach such as currently used to talk to ds9. It adds a lot of complexity (== work and design choices), and I don't see any clear benefits. I do not think that we want to make the image display smart -- the smarts should be in our code, and the viewer's a slave (albeit one that's allowed to send notes to its master such as "I was requested to ask you about the object near the cursor").
    • (JFB) Is it just the stateful vs. command-driven aspect that you're nervous about? I think in this design, the smarts are still in our code, and the only difference is whether our interface "remembers" previous commands. My concern with a command-driven interface is that you're stuck with only low-level abstractions because all the higher-level ones need to have state (i.e. you can't tell a command-based API to change the color of some of the regions you've displayed, because it's already forgotten that it displayed those regions for you with the previous command). That said, this is a lot more complex than our current DS9 interface, and it could be a good idea to try to pare it down.
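To illustrate the stateful-versus-command-driven distinction discussed above, here is a deliberately tiny sketch. The class names GeometrySet and DisplayManager come from the proposal, but every method, attribute, and command string below is an assumption made for illustration; `RecordingBackend` is a stand-in for a real viewer and does not talk to DS9, and the singleton and layer-ordering aspects are omitted.

```python
class RecordingBackend:
    """Stand-in for a real viewer: it just records the commands it is sent."""

    def __init__(self):
        self.commands = []

    def send(self, command):
        self.commands.append(command)


class GeometrySet:
    """A named set of points keyed by metadata (here, hypothetical source IDs)."""

    def __init__(self, name, points, color="green"):
        self.name = name
        self.points = dict(points)  # source id -> (x, y)
        self.color = color
        self._manager = None

    def set_color(self, color):
        # A mutator: the object remembers what it displayed, so it can redraw.
        self.color = color
        if self._manager is not None:
            self._manager.redraw(self)


class DisplayManager:
    """Holds the active objects and translates them into viewer commands."""

    def __init__(self, backend):
        self._backend = backend
        self._geometries = {}

    def add(self, geometry):
        self._geometries[geometry.name] = geometry
        geometry._manager = self
        self.redraw(geometry)

    def redraw(self, geometry):
        # All viewer-specific command formatting stays inside the manager.
        self._backend.send(f"clear {geometry.name}")
        for source_id, (x, y) in sorted(geometry.points.items()):
            self._backend.send(f"point {geometry.name} {source_id} {x} {y} {geometry.color}")


backend = RecordingBackend()
manager = DisplayManager(backend)
centroids = GeometrySet("centroids", {42: (10.5, 20.0), 43: (30.0, 5.5)})
manager.add(centroids)
centroids.set_color("red")  # re-sends the same regions with the new color
```

The `set_color` call is the kind of higher-level operation that a purely command-based API cannot offer without the caller re-supplying the regions, since the commands themselves carry no memory of what was drawn.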

Plotting

Remote Operation and Communication

  • The image display tool will be remotely operable in essentially the same way as the current DS9 tools; the user must forward XPA ports via SSH tunnels, and then use an interactive Python prompt on the remote machine to send commands to local DS9 instances.
  • The display tool will also accept and send select SAMP messages, for interoperability with third-party tools (and the plotting tool?). The tool will likely have to be put into a state that explicitly waits and listens for a particular message in order to receive it (this may be revisited if/when we consider adding a GUI to control the tools).
    • Any ImageLayer may be broadcast by the DisplayManager (image.load.fits)
    • An ExposureDisplay's SourceCatalog can be broadcast (table.load.fits)
    • An ExposureDisplay's currently selected sources (as represented by ellipse geometry regions) may be broadcast as a single row (table.highlight.row) or set of rows (table.select.rowList). These messages may also be received, and used to set the selection.
    • The position at the center of the current display may be broadcast (coord.pointAt.sky). This message may also be received, and used to pan to a coordinate.
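As a rough sketch of what the message payloads listed above could look like, the helpers below build plain dictionaries in the general SAMP message shape (an mtype plus a map of string-valued parameters). The mtype names are those given in the list; the parameter names reflect my understanding of the standard SAMP mtype definitions and should be checked against the specification, and the helper names, URL, and table ID are hypothetical. Nothing is actually sent: a SAMP hub and client library would be needed for that.

```python
def samp_message(mtype, params):
    """Build a SAMP-style message map; values are strings or lists of strings."""
    return {"samp.mtype": mtype, "samp.params": dict(params)}


def load_table(url, table_id):
    """Ask other clients to load a FITS table (e.g. a SourceCatalog)."""
    return samp_message("table.load.fits", {"url": url, "table-id": table_id})


def highlight_row(table_id, row):
    """Highlight a single row of a previously loaded table."""
    return samp_message("table.highlight.row", {"table-id": table_id, "row": str(row)})


def select_rows(table_id, rows):
    """Select a set of rows of a previously loaded table."""
    return samp_message(
        "table.select.rowList",
        {"table-id": table_id, "row-list": [str(r) for r in rows]},
    )


def point_at_sky(ra_deg, dec_deg):
    """Direct attention to a sky position given in decimal degrees."""
    return samp_message(
        "coord.pointAt.sky", {"ra": repr(float(ra_deg)), "dec": repr(float(dec_deg))}
    )


# Hypothetical URL and table ID, for illustration only.
load = load_table("file:///tmp/example_src.fits", "example-src")
selection = select_rows("example-src", [3, 17, 42])
pan = point_at_sky(150.25, 2.5)
```

The same dictionaries could in principle be parsed on receipt to set the selection or pan the display, which is the two-way behaviour described in the list above.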

General

  • Display and plotting tools will be placed in a new top-level package, "analysis". This will likely be pure-Python, but we do not rule out the possibility that some C++/Swig will be needed. Data access code may go in afw::table, daf_persistence/daf_butlerUtils, or analysis.