Last modified 6 years ago Last modified on 02/01/2013 12:38:27 AM

Queries per Difficulty-level

Queries We Are Planning to Support

Query Type Expected Response Time Type Notes
Retrieve any type of information about a single object (identified by a given objectId), including full time series.

Example: SELECT * FROM Object JOIN Source USING (objectId) WHERE objectId = 293848594;
Few seconds Interactive
Retrieve any type of information about a group of objects in a small area of sky, including neighborhood-type queries.

Example: SELECT * FROM Object WHERE qserv_areaSpec_circle(1.0, 35.0, 5.0/60);
few seconds - 1 min depending on the area, whether information about individual detections is needed. Interactive Small is defined as a few arcmin.
Scan all objects and apply a filter on any number of attributes. Aggregating results (average, sum, max, min etc) is ok.

Example: SELECT COUNT(*), MAX(scisql_fluxToAbMag(rFluxGaussian)) FROM Object WHERE rNumObs >= 5;
~1 hour Shared scan (Object Catalog) Ok to run on the entire Object Catalog.
Finding objects near objects across large area.

Example:SELECT A.objectId, B.objectId FROM Object AS A, Object AS B WHERE A.varProb > 0.95 AND A.extendedness > 9500 AND scisql_angSep(A.ra_SG, A.decl_SG, B.ra_SG, B.decl_SG) < 2.0/60;
~1 hour Shared scan (Object Catalog) Assuming ~14 "high volume" queries per DR, each accessing full data set.
Analysing light curves across large area.

Example: SELECT O.objectId, myFunction(S.taiMidPoint, S.psfFlux) FROM Object AS O JOIN Source AS S USING (objectId) WHERE O.varProb > 0.75 GROUP BY O.objectId;
~1 day (24h) Shared scan (Object + Source Catalog, join needed) Assuming ~4 "high volume" queries every 16 hours, (so ~6 per day) per DR, each accessing full data set.
Analysing light curves of faint objects across large area.

Example: SELECT O.objectId, myFunction(V.taiMidPoint, FS.flux) FROM Object AS O JOIN ForcedSource AS FS ON (O.objectId = FS.objectId) JOIN Visit AS V ON (FS.visitId = V.visitId);
~1 week Shared scan (Object + ForcedSource Catalog, join needed) Assuming ~2 per week per DR, each accessing full data set.

Note related to all shared scan queries: these numbers are per data release served, and we are planning to serve up to 3 DRs: DR1, latest and latest-1. More queries can be handled if then don't access full data set (e.g. 20 queries each accessing all rows = 200 queries each accessing 10% of data)

Queries That Will Be Very Difficult or Impossible to Support due to Technological Reasons

Large distance

The system will be optimized for distance searches over relatively small distance (~1 arcmin). Somewhat larger distance can be easily implemented using the same technique (overlaps) at the expense of some extra disk storage (and I/O), however if search over much larger distance is needed (say 1 deg or more), we will need to implement cross-node data transfer (non-trivial!). Example queries:

  • Find objects far away from other objects (for a large number of objects).

Question: what is the largest distance we should plan to support for distance-based queries involving (a) small number of objects, (b) all objects on the sky?

If large distances are needed, is there a way to narrow down the problem? For example, perhaps it is needed only for certain types of objects, like large galaxies...

Sliding window queries

Queries that involve a sliding spatial window will almost certainly not be supported by the SQL parser. Example:

  • Find all 5 arcmin x 5 arcmin regions with an object density higher than rho.

Non-partitionable aggregates

Some aggregates, (for example median) can't be partitioned. Given our large tables will be partitioned, if such non-partitionable aggregates are needed, we would have to write custom, non-trivial, UDFs.

Queries That Will Be Very Difficult or Impossible to Support due to Limited Resources

Large size of the intermediate (or final) results

Officially, the baseline is sized for ~0.5 GB result set size per interactive query, and 6 GB per expensive (e.g. shared scan type) query.

Some queries might return small result, but intermediate result might be huge. For example:

  • Select all pairs of stars within 1 arc min of each other in the Milky Way region.

will generate billions (trillions?) of pairs, and limiting the result size to something small is easy (e.g. by asking for the top 1000 pairs with distance between them closest to "x arcsec"), but it does not make the query much cheaper.

Expensive or hidden computation

Joins typically involve non-trivial computation (example: search for neighbors). Even if the computation cost is not very high per row (e.g. joining object with source via objectId), the total cost of computation quickly adds up if billions of rows are involved. Queries such as:

  • Near neighbor query on the Source or ForcedSource table
  • Joining large tables between different LSST data releases
  • Time series analysis of every object
  • Cross-match with very large external catalog (e.g. LSST with SKA)
  • Any non-spatial join on the entire catalog (Object, Source, ForcedSource)
  • Join of Source with ForcedSource

are examples of expensive joins.

Note that queries of these types that operate on small subsets of the data will be supported; it's only when the entire dataset is processed that things become difficult (special arrangements may need to be made) or impossible.

Question: what is the max size of data set (in terms of % of full sky, or % of all objects) that individual users would find it useful to analyze time series for? Is ~10% (of full sky or of all object) enough?

Some simple-looking questions in English might turn into a large number of non trivial queries; for example:

  • correlate each pair of attributes for every object

involves ~(500 columns)2 combinations x ~40 billion objects = 10 peta-combinations, which would run for several days on hundreds of CPUs to complete.

Sorting of large tables or intermediate results is also expensive.

Features Under Consideration

We are currently not planning to implement these features. Question: are they needed?

Large objects

Current design is not well suited for dealing with objects that span many chunks (exception: objects that size is smaller than the built-in overlap (~1 arcmin).

It is unclear how to deal with objects that are very big (large galaxies), e.g. few deg in size - they will span multiple "chunks". Since there should be relatively few of these, however, it's probably best to store them in a separate table (perhaps with lists of covered HTM ids) that is replicated on every node.


We are not planning to provide any generic operators for sampling.

Extracting a random choice of N rows from a table, or, alternatively, a random N% of rows in a table. Desiring a _random_ selection is more difficult than merely limiting by a size of N rows or N% of rows (which is likely to be biased by the system in some way). Scientists may want a random sampling of N objects, or N objects matching a certain criterion.

Sampling is typically done to reduce the working size to something manageable (e.g., in human or computer terms). Sampling can be implemented tractably, but an efficient implementation that executes appreciably faster than scanning the whole table may be more difficult.