Ticket #362 (closed design issue: fixed)

Opened 11 years ago

Last modified 7 years ago

Determine criteria for matching Objects across releases (object id synchronization)

Reported by: smm Owned by: smm
Priority: normal Milestone:
Component: database Keywords: object, association, release, synchronization
Cc: smm, ktl Blocked By:
Blocking: Project: LSST
Version Number:
How to repeat:

not applicable

Description (last modified by smm)

See dbObjectIdSync for background. This corresponds to line items 77 and 78 on the DC3 schedule.

Brief summary of the problem: object ids need to be kept in sync between data releases so that (among other things) data releases can be more easily compared. The current plan to deal with this is to take the output of the deep detection pipeline and associate it to the previous Object catalog, obtaining a map of old object ids to new object ids. Then, objects output by deep detect that were in the previous release have their ids updated to reflect the previously published values.
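To make the plan above concrete, here is a minimal sketch of the two steps: a brute-force spatial match, followed by the id transfer. The function names, the catalog layout (a dict from id to (ra, dec) in degrees), and the match radius are assumptions made for illustration and are not part of any existing pipeline. The sketch also assumes that provisional ids from deep detection come from a range that cannot collide with previously published ids, and it makes no attempt to resolve the ambiguous cases.

```python
import math


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to guard against rounding just above 1 for antipodal points.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def nearest_match(new_catalog, old_catalog, radius_deg):
    """Map each new id to the nearest old id within radius_deg.

    Brute force, O(N*M); a real catalog would need a spatial index.
    Several new ids may map to the same old id.
    """
    new_to_old = {}
    for new_id, (nra, ndec) in new_catalog.items():
        best = None
        for old_id, (ora, odec) in old_catalog.items():
            sep = angular_sep_deg(nra, ndec, ora, odec)
            if sep <= radius_deg and (best is None or sep < best[0]):
                best = (sep, old_id)
        if best is not None:
            new_to_old[new_id] = best[1]
    return new_to_old


def published_ids(new_catalog, new_to_old):
    """Return provisional id -> id to publish (old id if matched, else unchanged)."""
    return {new_id: new_to_old.get(new_id, new_id) for new_id in new_catalog}
```

For example, with a 1 arcsecond radius, an object detected 0.5 arcseconds from a previously published object would inherit the old id, while an isolated new detection would keep its provisional id.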

The question then becomes: how do we associate an upcoming data release with the previous one? One can do the obvious spatial match between the two catalogs, but what further criteria should we employ? We (the database group) need help from science/applications people in answering this question.

Situations that need some thought are:

  • Should we do anything special for objects in the previous release that don't appear in the upcoming release?
  • If two objects in the upcoming release match the same object in the previous release, should the id transfer to the best match, or should both objects get new ids? How is the best match picked?
  • If an object in the upcoming release matches two or more objects in the previous release, how do we pick the actual match?
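As a starting point for discussion, here is a sketch of one possible policy for the last two situations: accept candidate pairs in order of increasing separation, and let each old and each new object take part in at most one match. This is only an illustration, not a proposed answer. The function name and the (separation, new_id, old_id) tuple layout are hypothetical, and a real criterion would probably weigh more than position.

```python
def one_to_one_matches(candidates):
    """Greedy one-to-one assignment from candidate pairs.

    candidates: iterable of (separation, new_id, old_id) tuples, one per
    pair of objects that passed the spatial match.
    Returns a dict new_id -> old_id in which no id appears twice.
    """
    new_to_old = {}
    used_old = set()
    # Closest pairs are considered first, so the "best match" here is
    # simply the smallest separation.
    for _sep, new_id, old_id in sorted(candidates):
        if new_id in new_to_old or old_id in used_old:
            continue  # one side of this pair is already matched
        new_to_old[new_id] = old_id
        used_old.add(old_id)
    return new_to_old
```

Under this policy, new objects that lose out keep a fresh id, and old objects left without a partner fall under the first situation in the list above.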

Attachments

JHUProbabilistic.pdf (5.8 KB) - added by jbecla 11 years ago.

Change History

comment:1 Changed 11 years ago by smm

  • Description modified

comment:2 Changed 11 years ago by roc

  • Status changed from new to assigned

comment:3 Changed 11 years ago by jbecla

  • Type changed from task to design issue

changed ticket type, "task" becomes obsolete

Changed 11 years ago by jbecla

comment:4 Changed 11 years ago by jbecla

  • Owner changed from roc to smm

The following has been transferred from the duplicate ticket #425. Per email exchange with Serge, reassigning to him.

Overview

I see two possible approaches here. The first is to systematically name objects so that ideally a given object automatically ends up with the same name in two different data releases. If it doesn't end up with the same name, for example due to the original object splitting into two in better seeing data, we want to ensure that its new names are lexicographically close to the old one.

The second approach is to not worry about how ids are created, but to create the two catalogs independently, then associate the two and construct a table that maps the old ids to the new.

And, of course, it might be advantageous to combine the two approaches.

Lexicographic approach

One approach here is to construct the object name from a spatial identifier that labels a box at some level in a hierarchical partitioning of ra/dec space. The level in the hierarchy, and thus the box size, will be chosen based on the error level of the reported position. If the label of a smaller box contains the label of the next larger containing box as a prefix, then it is easy to arrange a lexicographic search so that an object with one position in DR1 and a refined position in DR2 is found under names that are close to one another. Note that SDSS uses an approach like this, the Hierarchical Triangular Mesh (HTM, http://skyserver.org/htm). They use it for organizing their object catalogs, but do not carry it down to the level of individual objects. Taking it this last step is straightforward.
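To illustrate the prefix property, here is a minimal sketch that uses a plain quadtree over the (ra, dec) rectangle. This is a deliberate simplification and is not HTM, which subdivides spherical triangles. All function names and the rule for picking a level from the position error are assumptions made for the example.

```python
import bisect


def box_label(ra_deg, dec_deg, level):
    """Label of the quadtree box at `level` that contains the position.

    Each level appends one digit (0-3), so the label of a box always
    starts with the label of the larger box that contains it.
    """
    ra_lo, ra_hi = 0.0, 360.0
    dec_lo, dec_hi = -90.0, 90.0
    digits = []
    for _ in range(level):
        ra_mid = (ra_lo + ra_hi) / 2
        dec_mid = (dec_lo + dec_hi) / 2
        quadrant = 0
        if ra_deg >= ra_mid:
            quadrant += 1
            ra_lo = ra_mid
        else:
            ra_hi = ra_mid
        if dec_deg >= dec_mid:
            quadrant += 2
            dec_lo = dec_mid
        else:
            dec_hi = dec_mid
        digits.append(str(quadrant))
    return "".join(digits)


def level_for_error(error_deg, max_level=30):
    """Deepest level whose box height (180 / 2**level deg) is still >= error_deg."""
    level = 0
    while level < max_level and 180.0 / 2 ** (level + 1) >= error_deg:
        level += 1
    return level


def labels_with_prefix(sorted_labels, prefix):
    """All labels in a sorted list that start with `prefix`.

    Relies on the digits being 0-3, so prefix + "4" sorts after every
    label that extends the prefix.
    """
    lo = bisect.bisect_left(sorted_labels, prefix)
    hi = bisect.bisect_left(sorted_labels, prefix + "4")
    return sorted_labels[lo:hi]
```

One limitation worth keeping in mind: two positions that straddle a box boundary do not share a long prefix even when they are very close on the sky, so a lexicographic search would also have to look in neighbouring boxes.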

Parenthetically, a related (but crude) technique has a long history in astronomy, so that many objects have names like WD1625+093, indicating that this is a white dwarf near RA of 16 hrs 25 min, and Dec of +9.3 deg. One might well quarrel with the choice of units (!), and this basic idea needs to be extended to allow for arbitrary precision, but I think there is the germ of a good idea here.
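For reference, a name of this style could be generated as in the sketch below. The function name is hypothetical, and the format (hours and minutes of RA, then signed declination in tenths of a degree, both truncated rather than rounded) simply follows the description of the example above; actual catalogs differ in their conventions.

```python
def coordinate_name(prefix, ra_deg, dec_deg):
    """Build a name such as WD1625+093 from a position in degrees.

    RA is written as HHMM and declination as a sign followed by three
    digits in tenths of a degree. Both fields are truncated.
    """
    ra_hours = (ra_deg % 360.0) / 15.0
    total_minutes = int(ra_hours * 60)
    hh, mm = divmod(total_minutes, 60)
    sign = "-" if dec_deg < 0 else "+"
    dec_tenths = int(abs(dec_deg) * 10)
    return f"{prefix}{hh:02d}{mm:02d}{sign}{dec_tenths:03d}"
```

The fixed field widths are exactly what limits the precision of such names, which is why the idea would need extending before it could label individual LSST objects.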

Inter-DR Association approach

In this approach, object name generation is independent in each DR, and the name of the same object has no lexicographic locality between DRs. Instead, the association pipeline (AP) is run to build a mapping from (for example) DR1 names to DR2 names. To be completely general, it is probably necessary to associate every prior DR to the current one.

In passing, the paper "Probabilistic Cross-Identification of Astronomical Sources" by Budavari and Szalay (attachment:JHUProbabilistic.pdf) is useful in considering how to improve the performance of the AP by utilizing characteristics beyond spatial position.

Discussion

One important issue for either approach is how we handle objects whose position changes significantly during the survey.

First, note that the position is changing *within* a DR, not just between them. So, the association pipeline has to be able to handle this situation. Though the initialization phase needs to be worked out, once the object catalog has decent proper motion and parallax determinations for these objects, it is clear how to proceed. Before associating, one simply uses the motion parameters to predict positions at the current epoch.
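The prediction step could look like the sketch below. It is a first-order approximation only: it applies linear proper motion, ignores parallax and radial velocity, and breaks down near the poles. The function name, argument names, and units (milliarcseconds per year, epochs in years) are assumptions chosen for the example.

```python
import math

MAS_PER_DEG = 3.6e6  # milliarcseconds in one degree


def predict_position(ra_deg, dec_deg, pm_ra_cosdec_mas_yr, pm_dec_mas_yr,
                     ref_epoch_yr, target_epoch_yr):
    """Linearly propagate a position from ref_epoch_yr to target_epoch_yr.

    pm_ra_cosdec_mas_yr is the RA proper motion already multiplied by
    cos(dec), so it is divided by cos(dec) to get the change in RA.
    """
    dt = target_epoch_yr - ref_epoch_yr
    d_dec = pm_dec_mas_yr * dt / MAS_PER_DEG
    d_ra = (pm_ra_cosdec_mas_yr * dt / MAS_PER_DEG
            / math.cos(math.radians(dec_deg)))
    return (ra_deg + d_ra) % 360.0, dec_deg + d_dec
```

For instance, an object on the equator moving 3600 mas/yr in RA shifts by 0.01 degrees over ten years, far more than a typical match radius, so skipping this step would break the association for fast movers.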

In the lexicographic approach, to eliminate name ambiguity for high PM objects, we could have the rule that we always generate the name based on the position at a particular epoch, say the start date of the LSST survey.

Another case we have to deal with is a single object in DR1 that is resolved into multiple objects in DR2, due to the inclusion of observations in better seeing. The lexicographic approach would at least make it clear that the cluster of new objects is implicitly spatially associated with the original single object. We might conceivably want to do more, making the parent-child relationship explicit through the object names (e.g. adding -1, -2, etc.). This might argue for allowing an object to have multiple names. This would also open up the possibility that the LSST catalog could be explicitly associated with one or more external catalogs, and the names of the objects in those external catalogs included. Certainly all these relationships can be created whenever one wants through database operations, but they may be time consuming and inconvenient compared to looking for names. Note, by the way, that the Inter-DR association approach effectively generates multiple names for the same object.
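A sketch of the suffix idea, with hypothetical function names, is given below. It records the parent of each child in an explicit table rather than parsing the suffix back out of the name, because a trailing "-1" cannot be told apart from a negative declination field in a coordinate-style name.

```python
def child_names(parent_name, n_children):
    """Names for the pieces of a split object: parent-1, parent-2, ..."""
    return [f"{parent_name}-{i}" for i in range(1, n_children + 1)]


def record_split(parent_table, parent_name, n_children):
    """Register the children of a split object.

    parent_table maps child name -> parent name and is updated in place.
    Returns the list of new child names.
    """
    children = child_names(parent_name, n_children)
    for child in children:
        parent_table[child] = parent_name
    return children
```

The same table could hold other aliases, such as names from external catalogs, which is one way to give an object multiple names without overloading the name string itself.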

comment:5 Changed 11 years ago by ktl

  • Cc ktl added

comment:6 Changed 11 years ago by smm

Note: the "Produce Data Release" usecase in EA needs to be updated when a final procedure for associating objectIds between releases is decided on.

comment:7 Changed 11 years ago by jbecla

Discussed at DataAccWG telecon 11/19/08: the usecase in EA (Produce Data Release / Create New Data Release) mentions "importing" object ids from previous DR's Object table - this should be reworded.

comment:8 Changed 10 years ago by jbecla

  • Status changed from assigned to closed
  • Resolution set to fixed

Based on what I heard recently, including at the FRS meeting in Davis:

  • we won't be matching objects between data releases
  • each data release will start from an empty object catalog

comment:9 Changed 7 years ago by robyn

  • Milestone DC3 Design deleted

