Ticket #2056 (assigned enhancement)

Opened 7 years ago

Last modified 4 years ago

add support for polygon-based searches in qserv

Reported by: jbecla Owned by: smm
Priority: normal Milestone: qserv-usr2013
Component: qserv Keywords:
Cc: danielw, smm, jbecla Blocked By: #1848
Blocking: #2533 Project: LSST
Version Number:
How to repeat:

not applicable

Description

Qserv should support the following type of queries:

-- Create a binary representation of the search polygon
SET @poly = scisql_s2CPolyToBin(359.9601, 2.5952,
                               0.0398, 2.5952,
                               0.0398, 2.6748,
                               359.9601, 2.6748);

-- Compute HTM ID ranges for the level 20 triangles overlapping
-- @poly. They will be stored in a temp table called scisql.Region
-- with two columns, htmMin and htmMax
CALL scisql.scisql_s2CPolyRegion(@poly, 20);

-- Select reference objects inside the polygon. The join against
-- the HTM ID range table populated above cuts down on the number of
-- SimRefObject rows that need to be tested against the polygon
SELECT refObjectId, isStar, ra, decl, rMag
FROM SimRefObject AS sro INNER JOIN
    scisql.Region AS r ON (sro.htmId20 BETWEEN r.htmMin AND r.htmMax)
WHERE scisql_s2PtInCPoly(ra, decl, @poly) = 1;

This 3-step process can be hidden inside qserv, without exposing it to the user. Ideally, we should be able to run it behind the scenes and avoid changing or extending the existing API between the user and the proxy.
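For illustration only, here is one way the API could stay unchanged: the user submits an ordinary single-statement query with the polygon test in the WHERE clause, and qserv expands it internally into the three statements above before dispatching work to the workers. This is a sketch of the idea, not a committed design; the rewriting rules are not specified in this ticket.

-- What the user might submit (no SET/CALL, no scisql.Region join):
SELECT refObjectId, isStar, ra, decl, rMag
FROM SimRefObject
WHERE scisql_s2PtInCPoly(ra, decl,
                         scisql_s2CPolyToBin(359.9601, 2.5952,
                                             0.0398, 2.5952,
                                             0.0398, 2.6748,
                                             359.9601, 2.6748)) = 1;

-- qserv would then generate the SET/CALL/join form from the description
-- and run it on each relevant chunk.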

Change History

comment:1 Changed 7 years ago by DefaultCC Plugin

  • Cc danielw added

comment:2 Changed 7 years ago by danielw

It should be a lot easier to expose a syntax like this to the user:

SELECT refObjectId, isStar, ra, decl, rMag
FROM SimRefObject 
WHERE qserv_areaspec_poly(359.9601, 2.5952,
                          0.0398, 2.5952,
                          0.0398, 2.6748,
                          359.9601, 2.6748) = 1;

Note: since the scheme relies on an htmId20 column, there is no real choice of ra/decl columns anyway; htmId20 is always used for the range join.

This is preferable because it is much harder to support all of the syntax the user would need to run the original 3-step approach directly. Those 3 steps can be generated programmatically inside qserv, whereas implementing new language syntax is considerably harder, especially if you want it to work in a distributed context.

Hiding it is probably a week's work (or less), once the scisql workings are understood. It could take a bit more time to discover the failure modes and to handle them reasonably.

comment:3 Changed 7 years ago by jbecla

  • Priority changed from normal to critical

comment:4 Changed 7 years ago by jbecla

  • Priority changed from critical to normal
  • Milestone set to S2012

comment:5 Changed 7 years ago by jbecla

  • Blocked By 1848 added

comment:6 Changed 7 years ago by jbecla

  • Milestone changed from S2012 to qserv-beta1

comment:7 Changed 7 years ago by jbecla

  • Cc smm added

As far as I can tell we use the same mysql connection for multiple queries:

  1. QueryRunnerManager will run multiple instances of QueryRunner - one per thread ("lauchThread(QueryRunner(a))")
  2. QueryRunner::operator()() calls _act()
  3. _act() calls _runTask
  4. _runTask calls SqlConnection::connectToDb
  5. SqlConnection::connectToDb recycles connections, i.e., it won't create a new connection if one is already open

This means we can't safely call scisql_s2CPolyRegion() or scisql_s2CircleRegion() through qserv, because the temporary table Region will get clobbered. We probably need to tweak sciSQL (scisql_s2CPolyRegion and scisql_s2CircleRegion) to allow the caller to specify the temp table name. Serge - would you agree?
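To make the concern concrete, here is a sketch of the interleaving that could occur when two tasks end up on the same recycled connection. The second polygon and the exact statement ordering are hypothetical; the point is only that both tasks read and write the same scisql.Region temp table.

-- Task A computes its HTM ranges
SET @polyA = scisql_s2CPolyToBin(359.9601, 2.5952, 0.0398, 2.5952,
                                 0.0398, 2.6748, 359.9601, 2.6748);
CALL scisql.scisql_s2CPolyRegion(@polyA, 20);  -- fills scisql.Region for A

-- Task B runs on the same connection before A's SELECT
SET @polyB = scisql_s2CPolyToBin(10.0, -5.0, 11.0, -5.0,
                                 11.0, -4.0, 10.0, -4.0);
CALL scisql.scisql_s2CPolyRegion(@polyB, 20);  -- truncates and refills the
                                               -- same scisql.Region table

-- Task A's SELECT now joins against ranges computed for polygon B
SELECT refObjectId, isStar, ra, decl, rMag
FROM SimRefObject AS sro INNER JOIN
    scisql.Region AS r ON (sro.htmId20 BETWEEN r.htmMin AND r.htmMax)
WHERE scisql_s2PtInCPoly(ra, decl, @polyA) = 1;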

comment:8 Changed 7 years ago by smm

Whether or not this is a problem depends on how a query as seen by a single worker node, let's call it Q, is split across tasks. If Q is executed using exactly one task, then we don't have a problem: scisql_s2CPolyRegion() truncates the scisql.Region temp table (if it exists) when called, and the contents of the temp table never need to survive beyond the scope of a single task (or be shared across connections).

On the other hand, if Q is split across multiple tasks even on a single worker (I couldn't tell from an admittedly short look at the code), then there are 2 options:

  • allow specification of a table name as you suggest (note that it can't be a temp table if it needs to be available from multiple connections)
  • make sure that each task initializes scisql.Region with the appropriate stored procedure call before using it.

The latter seems a lot simpler (no table cleanup, no task dependency graph to manage) and so gets my vote for now. Do you see reasons to favor the former option instead?
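A minimal sketch of the second option, assuming each task simply prepends the region computation to its own chunk query (table names and the polygon are taken from the description; nothing here is committed qserv behavior):

-- Per-task preamble, re-executed by every task before its SELECT
SET @poly = scisql_s2CPolyToBin(359.9601, 2.5952,
                                0.0398, 2.5952,
                                0.0398, 2.6748,
                                359.9601, 2.6748);
CALL scisql.scisql_s2CPolyRegion(@poly, 20);  -- (re)initializes scisql.Region

-- The task's own chunk query follows immediately on the same connection
SELECT refObjectId, isStar, ra, decl, rMag
FROM SimRefObject AS sro INNER JOIN
    scisql.Region AS r ON (sro.htmId20 BETWEEN r.htmMin AND r.htmMax)
WHERE scisql_s2PtInCPoly(ra, decl, @poly) = 1;

Since scisql_s2CPolyRegion() truncates and repopulates scisql.Region on every call, this only breaks if another task's CALL can slip in between a task's preamble and its SELECT.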

comment:9 Changed 7 years ago by jbecla

  • Owner changed from jbecla to smm
  • Status changed from new to assigned

Per discussion 2013/01/10: we should first find out whether the polygon-based approach still helps enough to justify implementing it (running queries through qserv will be much faster than on a single large table because we scan only the relevant chunk(s)). We agreed to test using a ~1M-row table and a polygon that covers 60-80% of a chunk.

It might also be worth checking whether maintaining an index on (ra, decl) helps; with such an index, a full index scan would be enough to determine which objects fall inside a given polygon.
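For the benchmark, a possible baseline is sketched below. The index name is made up, and whether the optimizer can satisfy this from the index alone depends on the selected columns (an index-only scan would require restricting the select list to ra/decl or widening the index):

-- Hypothetical index for the baseline test
ALTER TABLE SimRefObject ADD INDEX idx_ra_decl (ra, decl);

-- Baseline: no HTM range join, just the exact polygon test
SET @poly = scisql_s2CPolyToBin(359.9601, 2.5952,
                                0.0398, 2.5952,
                                0.0398, 2.6748,
                                359.9601, 2.6748);

SELECT refObjectId, isStar, ra, decl, rMag
FROM SimRefObject
WHERE scisql_s2PtInCPoly(ra, decl, @poly) = 1;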

As agreed, assigning to Serge.

comment:10 Changed 7 years ago by jbecla

  • Blocking 2533 added

comment:11 Changed 4 years ago by jbecla

  • Cc jbecla added