wiki:db/SchemaEvolution


Inevitably, the database schema will change over time. This document discusses how we are planning to address that.

Data release catalog

Each data release is independent, therefore the database schema can easily change between data releases; we do not anticipate any non-trivial challenges here.

Each published data release will be frozen, i.e., to guarantee result reproducibility, changes that might impact query results will not be allowed. Allowed changes include: adding new columns, adding/removing/re-sorting indexes, and changing precision without losing information (e.g. float --> double).
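A minimal sketch of the kinds of changes that would remain allowed, assuming a MySQL-style back end; the table name Source and column names here are purely illustrative:

    -- Add a new column (does not affect results of existing queries).
    ALTER TABLE Source ADD COLUMN calibFlux DOUBLE DEFAULT NULL;

    -- Add or remove an index (affects performance only, not results).
    ALTER TABLE Source ADD INDEX idx_psfFlux (psfFlux);
    ALTER TABLE Source DROP INDEX idx_psfFlux;

    -- Widen precision without losing information (FLOAT --> DOUBLE).
    ALTER TABLE Source MODIFY COLUMN psfFlux DOUBLE;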

New columns: we are planning to keep a few extra unused columns in each large table (for example, 5 doubles, 5 floats, 5 ints) and use these columns when a new column is needed. Renaming such a column should be trivial, as it does not change the shape of the table. The speed of filling these columns should be comparable to the speed of a full table scan.
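To illustrate the spare-column idea (table and column names are hypothetical), a large table could be created with a few unused columns of each type, and one of them later repurposed by renaming it when a new column is needed:

    -- Hypothetical large table created with a few spare, unused columns.
    CREATE TABLE Object (
        objectId     BIGINT NOT NULL PRIMARY KEY,
        ra           DOUBLE NOT NULL,
        decl         DOUBLE NOT NULL,
        -- ... real science columns ...
        spareDouble1 DOUBLE DEFAULT NULL,
        spareDouble2 DOUBLE DEFAULT NULL,
        spareFloat1  FLOAT  DEFAULT NULL,
        spareInt1    INT    DEFAULT NULL
    );

    -- Later, repurpose a spare column by renaming it; the type stays
    -- the same, so the shape of the table does not change.
    ALTER TABLE Object CHANGE COLUMN spareDouble1 refinedFlux DOUBLE DEFAULT NULL;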

Updating columns: the speed of updating values should be comparable to the speed of a single shared scan.
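A sketch of what filling a repurposed column might look like, continuing the hypothetical Object table above; a full-table UPDATE reads and writes each row once, so its cost is roughly that of one sequential pass, i.e. one shared scan:

    -- Fill the repurposed column for every row in a single pass over the
    -- table; the expression is purely illustrative.
    UPDATE Object
    SET refinedFlux = spareFloat1;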

Deletes: if we have to delete a column, we will likely rename it, or hide it via a non-materialized view.
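Both options, sketched with the same hypothetical names:

    -- Option 1: rename the column to mark it as deprecated instead of dropping it.
    ALTER TABLE Object CHANGE COLUMN spareInt1 deprecated_spareInt1 INT DEFAULT NULL;

    -- Option 2: expose the table through a regular (non-materialized) view
    -- that simply omits the column users should no longer see.
    CREATE VIEW Object_public AS
    SELECT objectId, ra, decl, refinedFlux
    FROM Object;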

Up-to-date catalog

We are planning to maintain two copies: one live copy for the alert production pipeline, and one for users to query. We will use the tricks described above; in addition, we can take the user-facing copy offline and perform the schema evolution there (see the sketch below). These catalogs are smaller; in practice only the Object table matters.
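One possible way to evolve the offline user-facing copy and then put it back into service, assuming a MySQL back end and hypothetical table names; RENAME TABLE swaps both tables in one atomic step:

    -- Evolve the schema on the offline, user-facing copy while the live copy
    -- continues to serve the alert production pipeline.
    ALTER TABLE Object_user_offline ADD COLUMN latestAlertId BIGINT DEFAULT NULL;

    -- When the work is done, swap the copies atomically.
    RENAME TABLE Object_user TO Object_user_old,
                 Object_user_offline TO Object_user;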

Notes about scalability

Note that all large tables will be partitioned. Building indexes for many small tables can be parallelized, and the index for each table is small enough to fit in memory, which makes rebuilding it fast.
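Because a large table is physically stored as many small per-partition (chunk) tables, an index change amounts to many small, independent statements that can run concurrently across nodes; a sketch with hypothetical chunk-table names:

    -- Each chunk table is small, so each index build is quick and fits in memory.
    -- These statements are independent and can be issued in parallel,
    -- one per chunk table, across all worker nodes.
    ALTER TABLE Object_chunk_0001 ADD INDEX idx_refinedFlux (refinedFlux);
    ALTER TABLE Object_chunk_0002 ADD INDEX idx_refinedFlux (refinedFlux);
    -- ... and so on for every chunk table on every node.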

Notes about administration

In terms of process, we will be able to put the system into a special mode in which users cannot run queries and the administrator can do the work. We can use the shared-scan mechanism to perform the upgrade in the most efficient fashion, we will have tools that issue admin commands on all nodes (e.g. to verify that the schema of every table matches), and we will have centrally managed metadata about each table we manage, where we can keep track of schema versions, etc.
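A minimal sketch of the kind of centrally managed metadata we could keep; the table layout and names are purely illustrative:

    -- Central registry of managed tables and their current schema versions.
    CREATE TABLE SchemaVersion (
        dbName        VARCHAR(128) NOT NULL,
        tableName     VARCHAR(128) NOT NULL,
        schemaVersion INT          NOT NULL,
        lastUpgraded  TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (dbName, tableName)
    );

    -- An admin tool can compare this registry against the schema actually
    -- reported by each node to verify that all copies match.
    SELECT tableName, schemaVersion FROM SchemaVersion WHERE dbName = 'DR1';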