wiki:db/MySQL/Issues
Last modified on 07/30/2014 03:21:20 PM

MySQL Related Issues

  • need a better proxy
    • we don't want to write results to a MySQL table; we want to stream results back to the client
    • partial/incomplete results
    • we want to be able to pass custom errors back
    • issue we run into: a result set can have duplicate column names, which is a problem when it goes through the proxy
  • producing MyISAM files

Old

  • need efficient index joins (20-30 indexes)
  • need to optimize multiple full table scans running simultaneously (data train algorithm). According to syncscan.pdf, Teradata, Red Brick, SQL Server, and DB2 support synchronized scans (data trains).
  • bad performance of DROP TABLE
  • bad performance of composite index on (zoneId, ra)
  • MyISAM uses 1K disk blocks (8K in the new-generation MyISAM) - investigate why this is not tunable
  • would be nice to integrate MySQL with MSS like HPSS
  • persisting vectors of values (useful for persisting LSST metadata)
  • What are the implications of having large number of tables per database?
  • loading compressed CSV files via LOAD DATA INFILE
  • What kind of things will cause performance to drop as a server stays up over extended periods of time?
  • How are rows transmitted over the network (ascii or binary)? Specifically, how does the size of row in a MEMORY or MyISAM table compare with the size of that same row when it traverses the network? Is there any support for transmitting compressed rows?
  • enterprise edition - worth looking at?

Hardware/software details for tests run below

  • SunFire V240
  • 2 UltraSPARC IIIi CPUs, 1503 MHz
  • 16 GB RAM
  • 2 Sun StoreEdge T3 arrays, 470GB each, configured in RAID 5, sustained sequential write speed (256KB blocks) 150 MB/sec, read: 146 MB/sec
  • OS: Sun Solaris sun4x_510
  • MySQL: version 5.0.27

Details on bad performance of DROP TABLE

Using an "out of the box" MySQL config:

Number of rows | Row size (bytes) | Time to DROP TABLE
3,044,468      | 102              | 1.36 sec
5,865,220      | 1702             | 11 min 10.24 sec

Note that for the ~10GB table, the maximum drop table time observed was 19min 40sec. The shortest time observed was 8min 31sec. TRUNCATE TABLE has the same performance characteristics as DROP TABLE in these tests.

vmstat revealed no evidence of memory pages being swapped to or from disk while these tables were being dropped.

Next, we tried prefixing the mysqld launch command with /usr/bin/ppgsz -o stack=512K,heap=4M. This configures ppgsz with a memory page size of 512KB for stack pages and 4MB for heap pages; ppgsz then launches the remainder of its command line as a child process (in this case, the standard mysqld launch command), which inherits these page sizes. Note that on Solaris the default for both heap and stack page sizes is 8KB - the theory was that TLB misses were slowing us down.

Number of rows | Row size (bytes) | Time to DROP TABLE
5,865,220      | 1702             | 7 min 1.92 sec

The situation is a bit better, but the main cause of the slowdown does not appear to be TLB misses. Finally, the read_buffer_size server variable was set to 524288 bytes, since a reading of the MEMORY engine source code suggests this influences the granularity of memory allocations for table data.

Number of rows | Row size (bytes) | Time to DROP TABLE
3,044,468      | 102              | 2.10 sec
5,865,220      | 1702             | 19.88 sec

Questions resulting from this are:

  • Why is read_buffer_size tied to the MEMORY allocation granularity? Is it still?
  • Is there any advice for tweaking this value, or should we just use trial and error?
  • Why is performance so poor with the smaller default value of read_buffer_size (131072 bytes)?
  • Should we use a different memory allocator from the default Solaris system allocator (Hoard, mtmalloc, or ...)?
  • Do we have to worry about performance dropping over time due to things like heap fragmentation?

Details on bad performance of composite index on (zoneId, ra)

This issue requires a bit of background - it arises when trying to implement the cross matching algorithm from the following paper:

  MSR-TR-2006-52
  The Zones Algorithm for Finding Points-Near-a-Point or Cross-Matching Spatial Datasets
  Gray, Jim; Nieto-Santisteban, Maria A.; Szalay, Alexander S.

See http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2006-52 for a detailed explanation of the crossmatching procedure.
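For background, the core of the zones approach can be sketched in a few lines. This is our own illustrative Python, not code from this page or the paper: the sky is cut into horizontal declination stripes ("zones"), and for each zone pair the RA search window (deltaRa) is widened toward the poles. ZONE_HEIGHT and the simplified alpha() formula below are assumptions for illustration; the paper gives a more precise expression for alpha.

```python
import math

# Illustrative zone height in degrees (an assumption, not a value
# from this page): each zone is a horizontal declination stripe.
ZONE_HEIGHT = 0.05

def zone_id(decl):
    """Map a declination in [-90, +90] degrees to its zone number.
    Zone 0 starts at decl = -90."""
    return int(math.floor((decl + 90.0) / ZONE_HEIGHT))

def alpha(theta, decl):
    """Approximate maximum RA difference (deltaRa, degrees) that can
    correspond to an angular separation of theta degrees at
    declination decl. The RA window widens as 1/cos(decl) toward the
    poles; this simple expansion is adequate away from the poles."""
    if abs(decl) + theta > 89.9:
        return 180.0  # essentially at the pole: search the whole zone
    return theta / math.cos(math.radians(abs(decl) + theta))
```

A crossmatch of a point against zone z then only needs to examine zones z-1..z+1 (for a match radius smaller than ZONE_HEIGHT), with RA restricted to +/- alpha(theta, decl), which is exactly the role of the ZoneZone table and its deltaRa column below.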

Below are schemas for the tables involved in crossmatch (columns not involved in the crossmatch are omitted for brevity).

CREATE TABLE `InMemoryObject` (
  `id` bigint(20) NOT NULL,
  `ra` decimal(8,5) NOT NULL,
  `decl` decimal(8,5) NOT NULL,
...
  `zoneId` int(11) NOT NULL default '0',
  `x` double NOT NULL default '0',
  `y` double NOT NULL default '0',
  `z` double NOT NULL default '0',

  PRIMARY KEY  (`id`),
  KEY `idx_zone_ra` USING BTREE (`zoneId`,`ra`)
) ENGINE=MEMORY 

For the tests, InMemoryObject had 3,044,468 rows of 102 bytes each.

CREATE TABLE `InMemoryDIASource` (
  `id` bigint(20) NOT NULL,
  `zoneId` int(11) NOT NULL,
  `ra` double NOT NULL,
  `decl` double NOT NULL,
  `x` double NOT NULL,
  `y` double NOT NULL,
  `z` double NOT NULL,
...
  PRIMARY KEY  (`id`),
  KEY `idx_zone_ra` USING BTREE (`zoneId`,`ra`)
) ENGINE=MEMORY

For the tests, InMemoryDIASource had 30,276 rows of 238 bytes each.

CREATE TABLE `LooseMatch` (
  `id1` bigint(20) NOT NULL,
  `id2` bigint(20) NOT NULL
) ENGINE=MEMORY

53,024 rows are generated as a result of the crossmatch queries under test.

CREATE TABLE `ZoneZone` (
  `zoneId` int(11) NOT NULL,
  `matchZoneId` int(11) NOT NULL,
  `deltaRa` double NOT NULL,
  PRIMARY KEY  USING BTREE (`zoneId`,`matchZoneId`)
) ENGINE=MEMORY

ZoneZone contains 32398 rows.
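As an aside on the constants used in the queries below: the (x, y, z) columns are unit vectors, so the queries compare squared Euclidean (chord) distance, which relates to angular separation theta by |u - v|^2 = 2 - 2*cos(theta) = 4*sin^2(theta/2). A quick check (our own calculation, not from the original page) shows the 2.1137067679466e-10 threshold corresponds to a match radius of roughly 0.000833 degrees (about 3 arcseconds), the same half-width used for the decl window:

```python
import math

# Squared chord distance between two unit vectors separated by theta:
# |u - v|^2 = 2 - 2*cos(theta) = 4*sin^2(theta/2).
theta_deg = 0.000833  # the decl half-width used in the queries
chord_sq = 4.0 * math.sin(math.radians(theta_deg) / 2.0) ** 2

# chord_sq comes out within ~0.01% of the query constant
# 2.1137067679466e-10, so that constant was presumably derived from
# a match radius of roughly 0.000833 degrees.
```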

Here are the variants of the core crossmatch query that were tried, along with explain output and (in some cases) timing.

1.)

INSERT INTO LooseMatch
    SELECT p.id, s.id FROM InMemoryDIASource AS p
    INNER JOIN ZoneZone AS zz FORCE INDEX (PRIMARY) ON p.zoneId = zz.zoneId
    INNER JOIN InMemoryObject AS s FORCE INDEX (idx_zone_ra) ON zz.matchZoneId = s.zoneId
    WHERE s.ra BETWEEN p.ra - zz.deltaRa AND p.ra + zz.deltaRa
    AND s.decl BETWEEN p.decl - 0.000833 AND p.decl + 0.000833
    AND POW(p.x - s.x, 2) + POW(p.y - s.y, 2) + POW(p.z - s.z, 2) < 2.1137067679466e-10 

After 10 hours, the query was ~40% finished (based on the number of matches found versus the expected total). The same query was tried with no index hints, but was killed after running for an hour without finishing. The EXPLAIN output for the SELECT (regardless of whether index hints are used) was:

id | select_type | table | type   | possible_keys | key     | key_len | ref                           | rows    | Extra
1  | SIMPLE      | s     | ALL    | idx_zone_ra   | NULL    | NULL    | NULL                          | 5865220 |
1  | SIMPLE      | p     | ALL    | idx_zone_ra   | NULL    | NULL    | NULL                          | 30276   | Using where
1  | SIMPLE      | zz    | eq_ref | PRIMARY       | PRIMARY | 8       | serge.p.zoneId,serge.s.zoneId | 1       | Using where

EXPLAIN EXTENDED + SHOW WARNINGS gives:

select `p`.`id` AS `id`,`s`.`id` AS `id`
  from `InMemoryDIASource` `p` join `ZoneZone` `zz` join `InMemoryObject` `s` 
  where ((`zz`.`zoneId` = `p`.`zoneId`) and 
         (`zz`.`matchZoneId` = `s`.`zoneId`) and 
         (`s`.`ra` between (`p`.`ra` - `zz`.`deltaRa`) and (`p`.`ra` + `zz`.`deltaRa`)) and
         (`s`.`decl` between (`p`.`decl` - 0.000833) and (`p`.`decl` + 0.000833)) and
         (((pow((`p`.`x` - `s`.`x`),2) + pow((`p`.`y` - `s`.`y`),2)) + pow((`p`.`z` - `s`.`z`),2)) < 2.1137067679466e-10))

2.)

INSERT INTO LooseMatch
    SELECT p.id, s.id FROM InMemoryDIASource AS p
    INNER JOIN ZoneZone AS zz FORCE INDEX (PRIMARY) ON p.zoneId = zz.zoneId
    INNER JOIN InMemoryObject AS s FORCE INDEX (idx_zone_ra) ON
        zz.matchZoneId = s.zoneId AND
        s.ra BETWEEN p.ra - zz.deltaRa AND p.ra + zz.deltaRa
    WHERE s.decl BETWEEN p.decl - 0.000833 AND p.decl + 0.000833
    AND POW(p.x - s.x, 2) + POW(p.y - s.y, 2) + POW(p.z - s.z, 2) < 2.1137067679466e-10 

EXPLAIN output is exactly the same as for 1.) (again regardless of index hints). This query was also killed after about an hour. EXPLAIN EXTENDED + SHOW WARNINGS outputs this:

select `p`.`id` AS `id`,`s`.`id` AS `id`
  from `InMemoryDIASource` `p` join `ZoneZone` `zz` join `InMemoryObject` `s`
  where ((`zz`.`zoneId` = `p`.`zoneId`) and 
         (`zz`.`matchZoneId` = `s`.`zoneId`) and
         (`s`.`decl` between (`p`.`decl` - 0.000833) and (`p`.`decl` + 0.000833)) and
         (((pow((`p`.`x` - `s`.`x`),2) + pow((`p`.`y` - `s`.`y`),2)) + pow((`p`.`z` - `s`.`z`),2)) < 2.1137067679466e-10) and 
         (`s`.`ra` between (`p`.`ra` - `zz`.`deltaRa`) and (`p`.`ra` + `zz`.`deltaRa`)))

3.)

INSERT INTO LooseMatch
    SELECT STRAIGHT_JOIN p.id, s.id FROM InMemoryDIASource AS p
    INNER JOIN ZoneZone AS zz FORCE INDEX (PRIMARY) ON p.zoneId = zz.zoneId
    INNER JOIN InMemoryObject AS s FORCE INDEX (idx_zone_ra) ON zz.matchZoneId = s.zoneId
    WHERE s.ra BETWEEN p.ra - zz.deltaRa AND p.ra + zz.deltaRa
    AND s.decl BETWEEN p.decl - 0.000833 AND p.decl + 0.000833
    AND POW(p.x - s.x, 2) + POW(p.y - s.y, 2) + POW(p.z - s.z, 2) < 2.1137067679466e-10

Total run time was 34 min 44.54 sec. EXPLAIN output was:

id | select_type | table | type | possible_keys | key         | key_len | ref                 | rows  | Extra
1  | SIMPLE      | p     | ALL  | idx_zone_ra   | NULL        | NULL    | NULL                | 30276 |
1  | SIMPLE      | zz    | ref  | PRIMARY       | PRIMARY     | 4       | serge.p.zoneId      | 323   |
1  | SIMPLE      | s     | ref  | idx_zone_ra   | idx_zone_ra | 4       | serge.zz.matchZoneId | 58652 | Using where

EXPLAIN EXTENDED + SHOW WARNINGS gave:

select straight_join `p`.`id` AS `id`,`s`.`id` AS `id` 
  from `InMemoryDIASource` `p` join `ZoneZone` `zz` join `InMemoryObject` `s` 
  where ((`zz`.`zoneId` = `p`.`zoneId`) and 
         (`s`.`zoneId` = `zz`.`matchZoneId`) and 
         (`s`.`ra` between (`p`.`ra` - `zz`.`deltaRa`) and (`p`.`ra` + `zz`.`deltaRa`)) and 
         (`s`.`decl` between (`p`.`decl` - 0.000833) and (`p`.`decl` + 0.000833)) and 
         (((pow((`p`.`x` - `s`.`x`),2) + pow((`p`.`y` - `s`.`y`),2)) + pow((`p`.`z` - `s`.`z`),2)) < 2.1137067679466e-10))

4.)

INSERT INTO LooseMatch
    SELECT STRAIGHT_JOIN p.id, s.id FROM InMemoryDIASource AS p
    INNER JOIN ZoneZone AS zz FORCE INDEX (PRIMARY) ON p.zoneId = zz.zoneId
    INNER JOIN InMemoryObject AS s FORCE INDEX (idx_zone_ra) ON
        zz.matchZoneId = s.zoneId AND
        s.ra BETWEEN p.ra - zz.deltaRa AND p.ra + zz.deltaRa
    WHERE s.decl BETWEEN p.decl - 0.000833 AND p.decl + 0.000833
    AND POW(p.x - s.x, 2) + POW(p.y - s.y, 2) + POW(p.z - s.z, 2) < 2.1137067679466e-10 

Total run time was 35 min 43.06 sec. EXPLAIN output was the same as for 3.), and EXPLAIN EXTENDED + SHOW WARNINGS gives:

select straight_join `p`.`id` AS `id`,`s`.`id` AS `id` 
  from `InMemoryDIASource` `p` join `ZoneZone` `zz` join `InMemoryObject` `s` 
  where ((`zz`.`zoneId` = `p`.`zoneId`) and 
         (`s`.`zoneId` = `zz`.`matchZoneId`) and 
         (`s`.`decl` between (`p`.`decl` - 0.000833) and (`p`.`decl` + 0.000833)) and 
         (((pow((`p`.`x` - `s`.`x`),2) + pow((`p`.`y` - `s`.`y`),2)) + pow((`p`.`z` - `s`.`z`),2)) < 2.1137067679466e-10) and
         (`s`.`ra` between (`p`.`ra` - `zz`.`deltaRa`) and (`p`.`ra` + `zz`.`deltaRa`)))

So it looks like MySQL does not take full advantage of the composite (zoneId, ra) index on InMemoryObject: the condition s.zoneId = zz.matchZoneId AND s.ra BETWEEN p.ra - zz.deltaRa AND p.ra + zz.deltaRa should map to a range lookup on the index, but it seems that only the equality s.zoneId = zz.matchZoneId is used to probe it (the key_len of 4 in the EXPLAIN output covers just the 4-byte zoneId column, not ra). Is that what's happening here? If so, is this something we can reasonably expect the optimizer to figure out in the future?
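For reference, here is the desired index access pattern in miniature: a hypothetical Python sketch (our own illustration, not MySQL internals) that treats a sorted list of (zoneId, ra) pairs as the composite BTREE index and uses both columns to bound each probe.

```python
import bisect

# A toy stand-in for a composite (zoneId, ra) BTREE index:
# a sorted list of (zoneId, ra, rowId) tuples.
index = sorted([
    (1, 10.0, 'a'), (1, 20.0, 'b'), (2, 5.0, 'c'),
    (2, 15.0, 'd'), (2, 25.0, 'e'), (3, 30.0, 'f'),
])

def range_lookup(zone_id, ra_min, ra_max):
    """Use BOTH index columns: seek to (zone_id, ra_min), then scan
    forward only until ra exceeds ra_max -- the access pattern the
    (zoneId, ra) index should support for the crossmatch query."""
    lo = bisect.bisect_left(index, (zone_id, ra_min))
    out = []
    for zid, ra, row in index[lo:]:
        if zid != zone_id or ra > ra_max:
            break
        out.append(row)
    return out
```

With only the zoneId equality used for the probe (which is what the EXPLAIN output suggests MySQL is doing), every row in the matching zone is scanned and the ra bounds are applied as a post-filter, which is far more work per outer row.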

Concerning large numbers of tables per database

So according to the current plan, we will spatially partition the major LSST tables required for nightly processing into 335,482 "chunks", with one physical MyISAM table per chunk (tables are further distributed across 516 databases to circumvent OS file system limitations). The number of chunks and databases is not final. LSST would be reading from a different set of roughly 200 tables every 37 seconds. From what I understand, we have no hope of keeping information about all these tables in the table cache - should we disable it (set its size to 0)? If we do that, every query will presumably have to go and read the .frm files for the MyISAM tables involved in it. How expensive is this (in terms of number of seeks)? How does a zero-sized table cache affect MEMORY tables? Specifically, MEMORY tables still have .frm files on disk, but are they ever actually read after the server is started? Also, is there any way to control whether or not tables are cached at the level of an individual database?

Can the MySQL folks comment on whether the following idea makes any sense and/or is feasible:

First of all, we can arrange for these partition tables to never be dropped (truncation would, I think, be enough for our purposes). So we create all 330,000 .frm files on disk (or if possible, just a single .frm "template" file, since all tables have the same schema) using MySQL. Before starting mysqld, we create an ext2 or ext3 volume in memory (this assumes Linux as the OS) and copy the disk-based .frm files to the RAM disk. We then symlink to that in-memory directory from the MySQL data directory; as a result, all partition table .frm files are on fast RAM disk once mysqld is started, hopefully mitigating the cost of turning off the table cache. The same strategy could be applied to the .MYI files, since none of these partition chunks have any indexes. I'm a bit confused as to why a .MYI file is created for tables with no indexes - are .MYI files for index-less tables ever read? Do they consume a file descriptor in the server process?
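The proposed layout can be sketched as follows. This is our own illustration with made-up paths and table names; a tmpfs/RAM-disk mount would normally play the role of the in-memory directory, but a temporary directory stands in here so the sketch runs without root privileges.

```python
import os
import shutil
import tempfile

# Temporary directories stand in for the RAM disk and the MySQL
# data directory (hypothetical paths, for illustration only).
ram_disk = tempfile.mkdtemp(prefix="frm_ramdisk_")
datadir = tempfile.mkdtemp(prefix="mysql_datadir_")

# 1. All chunk tables share one schema, so a single template .frm
#    (here just placeholder bytes) is replicated once per chunk
#    into the in-memory directory.
template = os.path.join(ram_disk, "chunk_template.frm")
with open(template, "wb") as f:
    f.write(b"...")  # stands in for a real .frm produced by MySQL

for i in range(3):  # 335,482 chunks in the real plan
    shutil.copy(template, os.path.join(ram_disk, "chunk_%06d.frm" % i))

# 2. Symlink the in-memory directory from the MySQL data directory,
#    so mysqld resolves .frm reads to RAM.
link = os.path.join(datadir, "chunks_db")
os.symlink(ram_disk, link)
```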

Related question: while doing xrootd+mysql tests, we noticed that mysqld tries to open the .TRG file many times. Is there a way to permanently disable this, e.g. per database?