Last modified on 11/24/2010 09:22:37 PM

ForcedSource Compression

LSST Database

The ForcedSource table is unique due to its size (the largest) and contents (very narrow, no FLOATs). For that reason we ran dedicated tests to understand how this table will compress.

Executive summary

Based on the tests we ran, we expect the ForcedSource table to compress to about 59% of its original size (data) and 77% (index). Data for some of the columns may be better correlated than in our tests, yielding a few percent better compression, so assuming ForcedSource will compress down to 60% of its original size seems fair.

How we ran these tests

The data was generated randomly, assuming the following:

  1. the ratio between different filters: u:g:r:i:z:y = 7:10:23:23:20:20.
  2. objectId in the range [1, 1730000000] (1.73 billion is the max number of objects in the last data release)
  3. exposureId in the range [1, 7000000] (7 million is ~ max number of FPA exposures in the last data release). Used only for the very first test run in March 2010. Later we used ccdExposureId.
  4. ccdExposureId in the range [1, 1400000000] (7 million is ~ max number of FPA exposures in the last data release, x ~200 CCDs). The very first test in March used exposureId instead.
  5. flux, fluxSigma, x, y - a sample of 100k values taken from the DC3b PT1.1 imsim_slac_prod run (psfFlux, psfFluxSigma, xAstrom, yAstrom). In the very first test we ran in March 2010, for flux we used a randomly generated number from the range [100, 10000] (per Andy Becker 4/9/2009)
  6. sky background in the range (data for u, r, y from Andy):
     filter    min      max
     u          20    1,000
     g          50    2,000
     r         100    5,000
     i         200   10,000
     z         500   20,000
     y       2,000  100,000
  7. psfLnLR in the range [0, 100] (0-100%)
  8. modelLSLnLR in the range [0, 100] (0-100%)

Note that we assumed 2-digit precision: for the INTEGER test we stored each number multiplied by 100, and for the FLOAT test we generated numbers with 2-digit precision.
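The assumptions above can be sketched in a few lines of Python. This is an illustration only, not the genForcedSource4compression.py script: the helper name make_row and the row layout are hypothetical, and flux is left out because the DC3b sample it was drawn from is not reproduced on this page.

```python
import random

# Filter mix u:g:r:i:z:y = 7:10:23:23:20:20 (assumption 1 above).
FILTER_WEIGHTS = {"u": 7, "g": 10, "r": 23, "i": 23, "z": 20, "y": 20}

# Sky background (min, max) per filter, from the table in assumption 6.
SKY_RANGE = {
    "u": (20, 1000),
    "g": (50, 2000),
    "r": (100, 5000),
    "i": (200, 10000),
    "z": (500, 20000),
    "y": (2000, 100000),
}

MAX_OBJECT_ID = 1_730_000_000        # assumption 2
MAX_CCD_EXPOSURE_ID = 1_400_000_000  # assumption 4


def make_row(rng, as_integer):
    """Return one simulated row (hypothetical helper, not the original script)."""
    band = rng.choices(list(FILTER_WEIGHTS),
                       weights=list(FILTER_WEIGHTS.values()))[0]
    lo, hi = SKY_RANGE[band]
    # FLOAT test: keep a value with 2-digit precision.
    sky = round(rng.uniform(lo, hi), 2)
    if as_integer:
        # INTEGER test: store the same precision as value * 100.
        sky = int(round(sky * 100))
    return {
        "objectId": rng.randint(1, MAX_OBJECT_ID),
        "ccdExposureId": rng.randint(1, MAX_CCD_EXPOSURE_ID),
        "filter": band,
        "sky": sky,
    }


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows_float = [make_row(rng, as_integer=False) for _ in range(1000)]
rows_int = [make_row(rng, as_integer=True) for _ in range(1000)]
print(rows_float[0])
print(rows_int[0])
```

The only difference between the two variants is how the 2-digit precision is stored, which is what the FLOAT and INTEGER tests compare.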

Compression was done with myisampack. Commands used:

  • myisampack -v ForcedSourceF.MYI
  • myisampack -v ForcedSourceI.MYI
  • myisamchk -rq ForcedSourceF.MYI
  • myisamchk -rq ForcedSourceI.MYI
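For repeated runs, the commands above could be wrapped in a small Python driver that also records file sizes before and after packing. This is a sketch under assumptions: the function names are hypothetical, the MySQL tools are assumed to be on PATH, and the script is assumed to run in the directory holding the .MYD and .MYI files.

```python
import os
import subprocess


def pack_commands(table_base):
    """Build the two commands listed above for one table, e.g. 'ForcedSourceF'."""
    index_file = table_base + ".MYI"
    return [
        ["myisampack", "-v", index_file],  # pack the table
        ["myisamchk", "-rq", index_file],  # rebuild the index after packing
    ]


def table_sizes(table_base):
    """Return (data_bytes, index_bytes) for the .MYD and .MYI files."""
    return (os.path.getsize(table_base + ".MYD"),
            os.path.getsize(table_base + ".MYI"))


def pack_table(table_base):
    """Run both commands; return the sizes before and after packing."""
    before = table_sizes(table_base)
    for cmd in pack_commands(table_base):
        subprocess.run(cmd, check=True)  # raises if a tool fails
    return before, table_sizes(table_base)


if __name__ == "__main__":
    for cmd in pack_commands("ForcedSourceF"):
        print(" ".join(cmd))
```

Calling pack_table("ForcedSourceF") would then give the original and compressed sizes in the form reported in the result tables.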

Discussion:

  • objectIds: won't be completely randomly scattered, but many clusters of sequential numbers are not likely (unlikely we will get better compression)
  • exposureIds: there will likely be many adjacent rows with the same value (will get better compression)
  • flux: will likely be randomly scattered (unlikely we will get better compression)
  • sky: many spatially collocated ForcedSources will likely have the same or similar values of sky background (will get better compression)
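The effect of adjacent rows sharing a value can be illustrated with a toy comparison. In this sketch zlib is used purely as a stand-in compressor (it is not the algorithm myisampack uses), and the run length of 100 identical ids is an arbitrary assumption, so only the direction of the result is meaningful, not the exact fractions.

```python
import random
import struct
import zlib

rng = random.Random(0)  # fixed seed so the sketch is repeatable
N = 100_000
MAX_ID = 1_400_000_000

# Case 1: an id drawn independently for every row (the random test data).
scattered = [rng.randint(1, MAX_ID) for _ in range(N)]

# Case 2: runs of 100 adjacent rows share the same id (assumed run length).
clustered = []
for _ in range(N // 100):
    clustered.extend([rng.randint(1, MAX_ID)] * 100)


def packed_fraction(values):
    """Compressed size divided by raw size for a column of 8-byte integers."""
    raw = struct.pack("<%dq" % len(values), *values)
    return len(zlib.compress(raw)) / len(raw)


print("scattered: %.1f%%" % (100 * packed_fraction(scattered)))
print("clustered: %.1f%%" % (100 * packed_fraction(clustered)))
```

The clustered column compresses to a much smaller fraction than the scattered one, which is the reason the real table may do somewhat better than the randomly generated test data.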

Nov 2010 Test (23 bytes)

Schema used: objectId, ccdExposureId, flux, fluxSigma, flag; see contrib/dbutils/trunk/genForcedSource4compression.py?rev=18069

Test was run with 10 million rows.

file                   original size  compressed size  compressed/original
ForcedSourceF.MYD          260000000        171791470               66.07%
ForcedSourceF.MYI          267629568        205685760               76.85%
ForcedSourceF (both)       527629568        377477230               71.54%
ForcedSourceI.MYD          240000000        142074950               59.20%
ForcedSourceI.MYI          267629568        205685760               76.85%
ForcedSourceI (both)       507629568        347760710               68.51%
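The percentages in this table are compressed size divided by original size, and each "(both)" row is the sum of the .MYD and .MYI files. A short sketch that recomputes them from the per-file byte counts (the function and variable names are hypothetical):

```python
# (original bytes, compressed bytes) per file, from the 23-byte test.
SIZES = {
    "ForcedSourceF": {"MYD": (260000000, 171791470),
                      "MYI": (267629568, 205685760)},
    "ForcedSourceI": {"MYD": (240000000, 142074950),
                      "MYI": (267629568, 205685760)},
}


def pct(compressed, original):
    """Compressed size as a percentage of the original size."""
    return round(100.0 * compressed / original, 2)


def combined_sizes(files):
    """Sum original and compressed sizes over data and index files."""
    total_orig = sum(orig for orig, _ in files.values())
    total_comp = sum(comp for _, comp in files.values())
    return total_orig, total_comp


def summarize(files):
    """Return the percentage for each file plus the combined 'both' figure."""
    result = {name: pct(comp, orig) for name, (orig, comp) in files.items()}
    total_orig, total_comp = combined_sizes(files)
    result["both"] = pct(total_comp, total_orig)
    return result


for table, files in SIZES.items():
    print(table, combined_sizes(files), summarize(files))
```

The same functions apply unchanged to the 33-byte and 44-byte tests by swapping in their byte counts.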

Nov 2010 Test (33 bytes)

Schema used: objectId, ccdExposureId, sky, flux, x, y, flag; see contrib/dbutils/trunk/genForcedSource4compression.py?rev=18061

Test was run with 10 million rows.

file                   original size  compressed size  compressed/original
ForcedSourceF.MYD          340000000        243857269               71.72%
ForcedSourceF.MYI          267668480        205685760               76.84%
ForcedSourceF (both)       607668480        449543029               73.98%
ForcedSourceI.MYD          340000000        204566447               60.17%
ForcedSourceI.MYI          267668480        205685760               76.84%
ForcedSourceI (both)       607668480        410252207               67.51%

Nov 2010 Test (44 bytes)

Schema used: objectId, ccdExposureId, sky, skySigma, flux, fluxSigma, psfLnL, modelLSLnL, flag; see contrib/dbutils/trunk/genForcedSource4compression.py?rev=18002

Test was run with 10 million rows.

file                   original size  compressed size  compressed/original
ForcedSourceF.MYD          450000000        318629595               70.81%
ForcedSourceF.MYI          267712512        205685760               76.83%
ForcedSourceF (both)       717712512        524315355               73.05%
ForcedSourceI.MYD          450000000        265218208               58.94%
ForcedSourceI.MYI          267712512        205685760               76.83%
ForcedSourceI (both)       717712512        470903968               65.61%

March 2009 Test (28 bytes)

Schema used: objectId, exposureId, flux, sky, fluxDia, skyDia; see contrib/dbutils/trunk/genForcedSource4compression.py?rev=17964

Test 1: we inserted 100 million rows of simulated data into a single table. In this case, the data file (MYD) compressed to 56% of the original size (2,899,999,913 --> 1,523,310,235) and the index file (MYI) compressed to 76% of its original size (2,193,760,256 --> 1,658,588,160). Altogether, the size of data + index shrank to 64% of the original size.

Test 2: we split 100 million rows into 6 different tables (one per filter). In this case, the sum of all 6 data files compressed to 53% of the original size, and index compression stayed at 76% (data + index: 62% of the original size).