SELECTing non-indexed column increases 'sending data' 25x - why and how to improve?

Posted 2023-04-19

[Asked]: 2011-06-19 15:53:12
[Question]:

Given this table on a local MySQL 5.1 instance with the query cache turned off:

show create table product_views\G
*************************** 1. row ***************************
       Table: product_views
Create Table: CREATE TABLE `product_views` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `dateCreated` datetime NOT NULL,
  `dateModified` datetime DEFAULT NULL,
  `hibernateVersion` bigint(20) DEFAULT NULL,
  `brandName` varchar(255) DEFAULT NULL,
  `mfrModel` varchar(255) DEFAULT NULL,
  `origin` varchar(255) NOT NULL,
  `price` float DEFAULT NULL,
  `productType` varchar(255) DEFAULT NULL,
  `rebateDetailsViewed` tinyint(1) NOT NULL,
  `rebateSearchZipCode` int(11) DEFAULT NULL,
  `rebatesFoundAmount` float DEFAULT NULL,
  `rebatesFoundCount` int(11) DEFAULT NULL,
  `siteSKU` varchar(255) DEFAULT NULL,
  `timestamp` datetime NOT NULL,
  `uiContext` varchar(255) DEFAULT NULL,
  `siteVisitId` bigint(20) NOT NULL,
  `efficiencyLevel` varchar(255) DEFAULT NULL,
  `siteName` varchar(255) DEFAULT NULL,
  `clicks` varchar(1024) DEFAULT NULL,
  `rebateFormDownloaded` tinyint(1) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `siteVisitId` (`siteVisitId`,`siteSKU`),
  KEY `FK52C29B1E3CAB9CC4` (`siteVisitId`),
  KEY `rebateSearchZipCode_idx` (`rebateSearchZipCode`),
  KEY `FIND_UNPROCESSED_IDX` (`siteSKU`,`siteVisitId`,`timestamp`),
  CONSTRAINT `FK52C29B1E3CAB9CC4` FOREIGN KEY (`siteVisitId`) REFERENCES `site_visits` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=32909504 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

This query takes about 3 seconds:

    SELECT pv.id, pv.siteSKU
      FROM product_views pv 
CROSS JOIN site_visits sv 
     WHERE pv.siteVisitId = sv.id 
       AND pv.siteSKU = 'foo' 
       AND sv.siteId = 'bar' 
       AND sv.postProcessed = 1 
       AND pv.timestamp >= '2011-05-19 00:00:00' 
       AND pv.timestamp < '2011-06-18 00:00:00';

But this one (with a non-indexed column added to the SELECT) takes about 65 seconds:

    SELECT pv.id, pv.siteSKU, pv.hibernateVersion 
      FROM product_views pv 
CROSS JOIN site_visits sv 
     WHERE pv.siteVisitId = sv.id 
       AND pv.siteSKU = 'foo' 
       AND sv.siteId = 'bar' 
       AND sv.postProcessed = 1 
       AND pv.timestamp >= '2011-05-19 00:00:00' 
       AND pv.timestamp < '2011-06-18 00:00:00';

Nothing differs in the WHERE or FROM clauses. All the extra time is spent in 'Sending data':

mysql> show profile for query 1;
+--------------------+-----------+
| Status             | Duration  |
+--------------------+-----------+
| starting           |  0.000155 |
| Opening tables     |  0.000029 |
| System lock        |  0.000007 |
| Table lock         |  0.000019 |
| init               |  0.000072 |
| optimizing         |  0.000032 |
| statistics         |  0.000316 |
| preparing          |  0.000034 |
| executing          |  0.000002 |
| Sending data       | 63.530402 |
| end                |  0.000044 |
| query end          |  0.000005 |
| freeing items      |  0.000091 |
| logging slow query |  0.000002 |
| logging slow query |  0.000109 |
| cleaning up        |  0.000004 |
+--------------------+-----------+
16 rows in set (0.00 sec)

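I understand that using non-indexed columns in the WHERE clause would slow things down, but why here? What can be done to improve the latter case, given that I will actually want to SELECT * from product_views?

EXPLAIN output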

mysql> explain extended select pv.id, pv.siteSKU from product_views pv cross join site_visits sv where pv.siteVisitId=sv.id and pv.siteSKU='foo' and sv.siteId='bar' and sv.postProcessed=1 and pv.timestamp>='2011-05-19 00:00:00' and pv.timestamp<'2011-06-18 00:00:00';
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+--------------------------+
| id | select_type | table | type   | possible_keys                                       | key                  | key_len | ref                  | rows  | filtered | Extra                    |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+--------------------------+
|  1 | SIMPLE      | pv    | ref    | siteVisitId,FK52C29B1E3CAB9CC4,FIND_UNPROCESSED_IDX | FIND_UNPROCESSED_IDX | 258     | const                | 41810 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | sv    | eq_ref | PRIMARY,post_processed_idx                          | PRIMARY              | 8       | clabs.pv.siteVisitId |     1 |   100.00 | Using where              |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)

mysql> explain extended select pv.id, pv.siteSKU, pv.hibernateVersion from product_views pv cross join site_visits sv where pv.siteVisitId=sv.id and pv.siteSKU='foo' and sv.siteId='bar' and sv.postProcessed=1 and pv.timestamp>='2011-05-19 00:00:00' and pv.timestamp<'2011-06-18 00:00:00';
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+-------------+
| id | select_type | table | type   | possible_keys                                       | key                  | key_len | ref                  | rows  | filtered | Extra       |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+-------------+
|  1 | SIMPLE      | pv    | ref    | siteVisitId,FK52C29B1E3CAB9CC4,FIND_UNPROCESSED_IDX | FIND_UNPROCESSED_IDX | 258     | const                | 41810 |   100.00 | Using where |
|  1 | SIMPLE      | sv    | eq_ref | PRIMARY,post_processed_idx                          | PRIMARY              | 8       | clabs.pv.siteVisitId |     1 |   100.00 | Using where |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)

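UPDATE 1: Splitting into 2 queries brings the total time down to the ~30 s range

I'm not sure why, but splitting the latter query into the following two reduces the latency from 65 s to ~30 s:

1) SELECT pv.id .... // FROM and WHERE clauses same as above

2) SELECT * FROM product_views where id in (idList); // idList = the ids returned by query 1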
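
The two-step pattern in UPDATE 1 can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, a cut-down product_views table with made-up rows, and it leaves out the join against site_visits. Step 1 touches only indexed columns, and step 2 fetches full rows by primary key.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL instance (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_views (
        id INTEGER PRIMARY KEY,
        siteSKU TEXT,
        siteVisitId INTEGER NOT NULL,
        timestamp TEXT NOT NULL,
        hibernateVersion INTEGER
    );
    CREATE INDEX FIND_UNPROCESSED_IDX
        ON product_views (siteSKU, siteVisitId, timestamp);
""")

# Made-up sample rows: odd ids get siteSKU 'foo', even ids get 'other'.
rows = [(i, "foo" if i % 2 else "other", i, "2011-06-01 00:00:00", i * 10)
        for i in range(1, 11)]
conn.executemany("INSERT INTO product_views VALUES (?, ?, ?, ?, ?)", rows)

# Step 1: collect only the ids, using columns that are all in the index.
id_list = [r[0] for r in conn.execute(
    "SELECT id FROM product_views "
    "WHERE siteSKU = ? AND timestamp >= ? AND timestamp < ?",
    ("foo", "2011-05-19 00:00:00", "2011-06-18 00:00:00"))]

# Step 2: fetch the full rows by primary key.
placeholders = ",".join("?" * len(id_list))
full_rows = conn.execute(
    f"SELECT * FROM product_views WHERE id IN ({placeholders}) ORDER BY id",
    id_list).fetchall()

print(len(id_list), "ids,", len(full_rows), "full rows")
```

Whether this split helps on a given MySQL server depends on factors the sketch does not model, such as buffer pool state and how the id list is ordered.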

UPDATE 2: Table size

The table has about 10M rows; the query returns about 3k rows.

[Comments]:

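Different columns in the SELECT do affect retrieval. I wouldn't expect a BIGINT to cause a problem like this, though; it's usually binary data (BLOBs etc.) that really highlights the fact. Execution time can also depend on the load on the system, and on the network in between, depending on how you are testing.

Why don't you want the query cache?

Because of the application's use case, I'm interested in first-time query execution performance (table contents almost always change between queries -> the query cache would never be hit anyway).

There is no reason to have a table this large. Please read up on 3NF.

@teresko The table looks normalized to me. How would you reduce its size?

It seems that FIND_UNPROCESSED_IDX is not the best index; it scans 40000+ rows. Try forcing other indexes and see whether they perform better.

[Answer 1]: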

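When only indexed columns are selected, MySQL reads just the index and does not need to read the table data. As far as I know, this is called an index-covered query (the index is a covering index). However, when a selected column is not present in the index being used, MySQL has to open the table and read data from it. That is why the index-covered query is so much faster.

See Using Covering Indexes to Improve Query Performance.

As for improving it: how many rows are in the table, how many does the query return, what is your buffer pool size, how much RAM is available, and so on?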
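
The difference between a covered and a non-covered query can be observed in a small, self-contained experiment. The sketch below is an analogy rather than a reproduction: it assumes SQLite (via Python's built-in sqlite3 module) in place of MySQL, and a cut-down version of product_views. SQLite's EXPLAIN QUERY PLAN plays the role that "Using index" plays in MySQL's EXPLAIN.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL instance (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_views (
        id INTEGER PRIMARY KEY,
        siteSKU TEXT,
        siteVisitId INTEGER NOT NULL,
        timestamp TEXT NOT NULL,
        hibernateVersion INTEGER
    );
    CREATE INDEX FIND_UNPROCESSED_IDX
        ON product_views (siteSKU, siteVisitId, timestamp);
""")

def plan(sql):
    """Return the human-readable detail text of SQLite's query plan."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Every selected column lives in the index (the primary key is stored with it).
covered = plan(
    "SELECT id, siteSKU FROM product_views WHERE siteSKU = 'foo'")

# hibernateVersion is not in the index, so each match needs a row lookup.
not_covered = plan(
    "SELECT id, siteSKU, hibernateVersion FROM product_views WHERE siteSKU = 'foo'")

print(covered)
print(not_covered)
```

The first plan mentions a covering index and the second does not, which mirrors the "Using where; Using index" versus "Using where" difference in the question's EXPLAIN output.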

[Comments]:

+1 This only applies to InnoDB. See here: xaprb.com/blog/2006/07/04/…

[Answer 2]:

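From what I have read about SHOW PROFILE, 'Sending data' is part of the execution process and has almost nothing to do with sending actual data to the client. You can take a look at this thread. The MySQL docs also say this about 'Sending data':

The thread is reading and processing rows for a SELECT statement, and sending data to the client. Because operations occurring during this state tend to perform large amounts of disk access (reads), it is often the longest-running state over the lifetime of a given query.

In my opinion, MySQL would do better not to lump 'reading and processing rows for a SELECT statement' and 'sending data' into one state, especially a state named 'Sending data', which causes a lot of confusion.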
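
A back-of-envelope calculation shows why disk reads are a plausible explanation for the time spent in this state. It rests on two assumptions that the question does not confirm: that EXPLAIN's estimate of 41810 rows for pv is roughly the number of row lookups performed, and that each lookup is an independent read.

```python
# Figures taken from the question's SHOW PROFILE and EXPLAIN output.
sending_data_s = 63.530402   # duration of the 'Sending data' state, in seconds
estimated_rows = 41810       # EXPLAIN's row estimate for pv (an estimate, not a count)

# Average time per row lookup if all of 'Sending data' went into fetching rows.
per_row_ms = sending_data_s / estimated_rows * 1000

print(f"{per_row_ms:.2f} ms per row lookup")
```

A cost on the order of a millisecond or two per row is far more than an in-memory lookup would need, which is consistent with many of the lookups going to disk rather than being served from the buffer pool.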

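[Comments]:

OK, so you're saying the "execution" phase is long? If so, the question remains: why?

Look at what EXPLAIN shows. I think your first query doesn't have to look up the column(s) that are not part of the index, and just scans/seeks the index instead.

There is no difference in the "explain extended" output for the two queries.

Could you add the EXPLAIN output for both queries to your post?

@Nikita: The output you provided shows an important difference: Using where; Using index for the first query versus Using where for the second. Using index means "The column information is retrieved from the table using only information in the index tree without having to do an additional seek to read the actual row" (from the MySQL docs). In the second case it has to do a lookup to retrieve the value of the non-indexed column.

As already mentioned, the first query has a covering index; the second has a non-covering index.

[Answer 3]: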

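I don't know MySQL's internals at all, but Darhazer's explanation looks like the winner to me. When you add the non-indexed field, the whole row has to be retrieved. Your rows are very wide. I can't completely tell from the names how (if at all) the table is denormalized, but I suspect it is. site name and site sku smell like they belong in a site lookup table with an FK. rebates found amount and rebates found count sound like stats from a join to a separate product rebate table. And so on.

[Comments]: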
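
One way to avoid retrieving the whole row is to widen the index so that it also contains the extra column. The sketch below illustrates that idea under assumptions: it uses SQLite through Python's sqlite3 module instead of MySQL, a cut-down table, and a hypothetical index name WIDE_IDX_SKETCH that does not appear in the question.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL instance (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_views (
        id INTEGER PRIMARY KEY,
        siteSKU TEXT,
        siteVisitId INTEGER NOT NULL,
        timestamp TEXT NOT NULL,
        hibernateVersion INTEGER
    );
    -- Hypothetical index: the original three columns plus hibernateVersion.
    CREATE INDEX WIDE_IDX_SKETCH
        ON product_views (siteSKU, siteVisitId, timestamp, hibernateVersion);
""")

# Ask SQLite how it would run the query that was slow in the question.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id, siteSKU, hibernateVersion FROM product_views "
    "WHERE siteSKU = 'foo'").fetchall()
detail = " ".join(row[3] for row in plan_rows)

print(detail)
```

This only helps while the selected columns stay few. For the SELECT * the asker ultimately wants, the index would have to contain every column, which is rarely practical on a table this wide.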
