Last modified 12 years ago Last modified on 06/21/2007 12:02:27 AM

Thoughts About all-in-database, computation-in-application or all-custom

There are 3 distinct approaches to consider:

  1. do all in database (database-centric approach)
  2. use database to store data and to query data, but push computation to application (application-centric approach)
  3. don't use database, store data in custom form and do computation in application (custom store)

Dump of thoughts in no particular order about various pros and cons:

  • moving data to application may be expensive, it is better to do computation in database
  • database server CPU is more expensive than client CPU, so offload database server and do computation in application. BTW, with commodity hardware and distributed processing model this is not so much true as it used to be in the past
  • stored procedures are not very portable between DBMSes, so push computation to application
  • if using custom store, it is harder to provide query interface (we have to implement it)
  • with custom store, we can implement whatever compression we want, eg. direct IO -> highly optimized real-time (de)compression library (like LZO see on binary Object/DIASource rows -> mapping to the final in-memory structure
  • with custom store, we are completely in control of threading (i.e. we can process using 1 thread per zone for much finer granularity in parallelization).
  • with custom store, we can write special purpose "mirroring" code. For instance, assuming a shared-nothing model where each of our cluster nodes has a disk array attached but a dumb FS, we could write the AP such that any partition is explicitly written to at least 2 or more nodes. Alternatively, one could push the responsibility for reliability to the FS layer (something like GoogleFS, or xrootd)
  • with custom store, we could do zero overhead vertical partitioning (split Object/DIASource position and id from the rest of the columns for better CPU cache behaviour during x-match)
  • custom store requires much more coding on our side, and then maintenance - some would argue this cost is prohibitively expensive....