8

We have an application which stores articles from different sources in a MySQL table and allows users to retrieve those articles ordered by date. Articles are always filtered by source, so for client SELECTs we always have

WHERE source_id IN (...,...) ORDER BY date DESC/ASC

We are using IN, because users have many subscriptions (some have thousands).

Here is the schema of the articles table:

CREATE TABLE `articles` (
  `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  `source_id` INTEGER(11) UNSIGNED NOT NULL,
  `date` DOUBLE(16,6) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `source_id_date` (`source_id`, `date`),
  KEY `date` (`date`)
)ENGINE=InnoDB
AUTO_INCREMENT=1
CHARACTER SET 'utf8' COLLATE 'utf8_general_ci'
COMMENT='';

We need the (date) index, because sometimes we are running background operations on this table without filtering by source. Users however cannot do this.

The table has around 1 Billion records (yes, we are considering sharding for future...). A typical query looks like this:

SELECT a.id, a.date, s.name
FROM articles a FORCE INDEX (source_id_date)
     JOIN sources s ON s.id = a.source_id
WHERE a.source_id IN (1,2,3,...)
ORDER BY a.date DESC
LIMIT 10

Why FORCE INDEX? Because it turned out MySQL sometimes chooses to use the (date) index for such queries (maybe because of it's smaller length?) and this results in scans of millions of records. If we remove the FORCE INDEX in production, our database server CPU cores gets maxed out in seconds (It's an OLTP applications and queries like the above are executed at rates around 2000 per second).

The issue with this approach is that some queries (we suspect it's somehow related to the number of source_ids in the IN clause) really runs faster with the date index. When we run EXPLAIN on those we see that the source_id_date index scans tens of millions of records, while the date index scans only some thousands. Usually it's the other way around, but we can't find a solid relation.

Ideally we wanted to find out why MySQL optimizer chooses the wrong index and remove the FORCE INDEX statement, but a way to predict when to force date index will also work for us.

Some clarifications:

The SELECT query above is a lot simplified for the purposes of this question. It has several JOINs to tables with around 100 Million rows each, joined the PK (articles_user_flags.id=article.id), which aggravates the problem when there are millions of rows to sort. Also some queries have additional where, e.g:

SELECT a.id, a.date, s.name
FROM articles a FORCE INDEX (source_id_date)
     JOIN sources s ON s.id = a.source_id
     LEFT JOIN articles_user_flags auf ON auf.article_id=a.id AND auf.user_id=1
WHERE a.source_id IN (1,2,3,...)
AND auf.starred=1
ORDER BY a.date DESC
LIMIT 10

This query lists only starred articles for the particular user (1).

The server is running MySQL version 5.5.32 (Percona) with XtraDB. Hardware is 2xE5-2620, 128GB RAM, 4HDDx1TB RAID10 with Battery backed controller. The problematic SELECTs are completely CPU bound.

my.cnf is as follows (removed some unrelated directives such as server-id, port, etc...):

transaction-isolation           = READ-COMMITTED
binlog_cache_size               = 256K
max_connections                 = 2500
max_user_connections            = 2000
back_log                        = 2048
thread_concurrency              = 12
max_allowed_packet              = 32M
sort_buffer_size                = 256K
read_buffer_size                = 128K
read_rnd_buffer_size            = 256K
join_buffer_size                = 8M
myisam_sort_buffer_size         = 8M
query_cache_limit               = 1M
query_cache_size                = 0
query_cache_type                = 0
key_buffer                      = 10M
table_cache                     = 10000
thread_stack                    = 256K
thread_cache_size               = 100
tmp_table_size                  = 256M
max_heap_table_size             = 4G
query_cache_min_res_unit        = 1K
slow-query-log                  = 1
slow-query-log-file             = /mysql_database/log/mysql-slow.log
long_query_time                 = 1
general_log                     = 0
general_log_file                = /mysql_database/log/mysql-general.log
log_error                       = /mysql_database/log/mysql.log
character-set-server            = utf8

innodb_flush_method             = O_DIRECT
innodb_flush_log_at_trx_commit  = 2
innodb_buffer_pool_size         = 105G
innodb_buffer_pool_instances    = 32
innodb_log_file_size            = 1G
innodb_log_buffer_size          = 16M
innodb_thread_concurrency       = 25
innodb_file_per_table           = 1

#percona specific
innodb_buffer_pool_restore_at_startup           = 60

As requested, here are some EXPLAINs of the problematic queries:

mysql> EXPLAIN SELECT a.id,a.date AS date_double
    -> FROM articles a
    -> FORCE INDEX (source_id_date)
    -> JOIN sources s ON s.id = a.source_id WHERE
    -> a.source_id IN (...) --Around 1000 IDs
    -> ORDER BY a.date LIMIT 20;
+----+-------------+-------+--------+-----------------+----------------+---------+---------------------------+----------+------------------------------------------+
| id | select_type | table | type   | possible_keys   | key            | key_len | ref                       | rows     | Extra                                    |
+----+-------------+-------+--------+-----------------+----------------+---------+---------------------------+----------+------------------------------------------+
|  1 | SIMPLE      | a     | range  | source_id_date  | source_id_date | 4       | NULL                      | 13744277 | Using where; Using index; Using filesort |
|  1 | SIMPLE      | s     | eq_ref | PRIMARY         | PRIMARY        | 4       | articles_db.a.source_id   |        1 | Using where; Using index                 |
+----+-------------+-------+--------+-----------------+----------------+---------+---------------------------+----------+------------------------------------------+
2 rows in set (0.01 sec)

The actual SELECT takes around one minute and is completely CPU bound. When I change the index to (date) which in this case the MySQL optimizer also chooses automatically:

mysql> EXPLAIN SELECT a.id,a.date AS date_double
    -> FROM articles a
    -> FORCE INDEX (date)
    -> JOIN sources s ON s.id = a.source_id WHERE
    -> a.source_id IN (...) --Around 1000 IDs
    -> ORDER BY a.date LIMIT 20;

+----+-------------+-------+--------+---------------+---------+---------+---------------------------+------+--------------------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref                       | rows | Extra                    |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------+------+--------------------------+
|  1 | SIMPLE      | a     | index  | NULL          | date    | 8       | NULL                      |   20 | Using where              |
|  1 | SIMPLE      | s     | eq_ref | PRIMARY       | PRIMARY | 4       | articles_db.a.source_id   |    1 | Using where; Using index |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------+------+--------------------------+

2 rows in set (0.01 sec)

And the SELECT takes only 10ms.

But EXPLAINs can be a lot broken here! For example if I EXPLAIN a query with only one source_id in the IN clause and forced index on (date) it tells me that it will scan only 20 rows, but that's not possible, because the table has over 1 Billion rows and only a few match this source_id.

Jacket
  • 520
  • 2
  • 6
  • 13
  • *"When we run analyze on those..."* Do you mean [`EXPLAIN`](http://dev.mysql.com/doc/refman/5.6/en/explain.html)? [`ANALYZE`](http://dev.mysql.com/doc/refman/5.6/en/analyze-table.html) is something different, and is probably something to consider if you haven't, as one possible explanation is that skewed index statistics are distracting the optimizer from choosing wisely. I don't think there's any need for the my.cnf in the question, and that space might be better used to post some `EXPLAIN` output of the variations in behavior that you see... after you investigate `ANALYZE [LOCAL] TABLE`... – Michael - sqlbot Oct 17 '13 at 10:54
  • Yes, this was a typo, thanks for the correction. I have fixed it. Of course we did ANALYZE, but that didn't help at all. I will try to capture some EXPLAINs later. – Jacket Oct 17 '13 at 11:04
  • And `date` is a `DOUBLE` ...? – ypercubeᵀᴹ Oct 17 '13 at 11:57
  • Yes, because we need microsecond precision here. Insert rate at this table is around 400,000 entries per hour and we need the dates to be as unique as possible. – Jacket Oct 17 '13 at 12:30
  • @Jacket Can you post an EXPLAIN off the offending query? i think because it is CPU bound your server is quicksorting ("using filesort in explain)" your result set.. – Raymond Nijland Oct 19 '13 at 20:10
  • Does `articles` have only these 3 columns? I guess you may be simplifying the structure for posting here but the table design may be relevant to the efficiency of the query. – ypercubeᵀᴹ Oct 19 '13 at 20:48
  • And one more thing: The values in `WHERE a.source_id IN (1,2,3,...)` are arbitrary, provided by the application or are they found in another table? – ypercubeᵀᴹ Oct 19 '13 at 20:49
  • Yes, this table has exactly 3 columns. It is meant to contain only the basic metadata, because it was know it will become huge. The IDs in the IN clause are fetched via another query in the client and are different for every user. – Jacket Oct 21 '13 at 05:13
  • Just throwing this out there.. Have you tried running the query using the read uncommitted isolation level? I'm wondering if another (query or queries) has a lock on the date index while you're running your SELECT. – Craig Efrein Oct 21 '13 at 09:38
  • What's the cardinality of `Source`? Are some values much more common than others, and are those the ones which run faster with the date index? – Jon of All Trades Feb 25 '15 at 23:50

1 Answers1

4

You might check your value for the innodb_stats_sample_pages parameter. It controls how many index dives MySQL performs on a table when updating index statistics, which in turn are used to calculate the cost of a candidate join plan. The default value was 8 for the version we were using. We changed it to 128, and observed fewer unexpected join plans.

Eric Rath
  • 292
  • 1
  • 3
  • 11