Version 2 (modified by jbecla, 7 years ago) (diff)


Inevitable database schema will change over time. This document discusses how we are planning to address it.

Data release catalog

Each data release is independent, therefore database schema can easily change between data releases; we do not anticipate any non-trivial challenges here.

Each published data release will be frozen, e.g. to guarantee result reproducibility changes that might impact query results will not be allowed. Allowed changes include changes such as adding new columns, changing precision (without loosing information, eg float --> double), or adding/removing/resorting indexes.

New columns: we are planning to keep few extra unused columns of each common type (for example, 3 FLOATs, 3 DOUBLEs, 3 INTs, 3 BIGINTs) for each large table, and use one of them when a new column is needed. (renaming a column is a trivial and instantaneous operation). Speed of filling new columns should be comparable to speed of a full table.

Updating columns. Speed of updating values should be comparable to speed of a single shared scan.

Deletes. If we have to delete a column, instead of deleting it, which is expensive as it changes shape of a table, we will add it to the pool of "extra, unused columns" by renaming it.

Up-to-date catalog

We are planning to maintain two copies: one live used by alert production pipeline, and one for users to query. Use the tricks described above, plus, we can take the user-facing copy offline and do the schema evolution. These catalogs are smaller, only Object table matters.

Notes about scalability

note that all large tables will be partitioned, building index for many small tables can be parallelized, plus, index for each table is small, fits in memory, fast.

Notes about administration

In terms of the process, we will be able to put the system in special mode (users can't run queries, admin can do the work). We can use shared-scan mechanism to do the upgrade in the most efficient fashion, we will have tools issue admin commands on all nodes (eg to verify schema for all table matches), and we will have centrally managed metadata about each table we manage where we could keep track of schema versions etc.