从大表里随机取若干行的效率问题

Posted 2020-10-26 yingjiang

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了从大表里随机取若干行的效率问题相关的知识，希望对你有一定的参考价值。

知乎里有个问题在 MySQL 中，从 10 万条主键不连续的数据里随机取 3000 条，如何做到高效？，用

select id from t order by rand() limit 3000

需要花三四十秒，怎么办？

以下是我的回答：

慢的原因有两个：

1）rand()执行了10万次(而不是3000次）

2）10万条记录进行了排序

我的方案如下：

我用informationschema.columns表来举例。这个表在我的数据库里有7482条记录，我要从中取30个column_name。(每个mysql数据库的informationschema.columns记录数不同，用这个表的好处是谁都可以试）

先看看这张表：

select count(*) from information_schema.columns;

mysql> select count(*) from information_schema.columns;
+----------+
| count(*) |
+----------+
| 7482 |
+----------+
1 row in set (0.09 sec)

第一步是做一个全表查询，加上一个行号字段n,从1到7482 (Mysql没有row_number功能，所以用此法）

select @x:=@x +1 n,column_name from information_schema.columns,(select @x:=0) x

结果类似这样：

| n | column_name |
| 1 | CHARACTER_SET_NAME |
| 2 | DEFAULT_COLLATE_NAME |
| 3 | DESCRIPTION |
| 4 | MAXLEN |
| 5 | COLLATION_NAME |

……

其次做一个查询，产生30个随机数，1-7482之间

select distinct ceil(r*c) n from (select rand() r from information_schema.columns limit 35) t1
       ,(select count(*) c from information_schema.columns) t2 
limit 30

结果大致是这样的；

| n
+------
| 4452
| 3838
| 5835
| 2694
| 3449
| 1678
|……

主要子查询t1里故意产生了35个随机数，然后在外面用了distinct,这样可以基本保证有30个不重复的整数

最后，组装一下做个左连接就可以了：

select tblrand.n,tbldata.column_name from

(select distinct ceil(r*c) n from (select rand() r from information_schema.columns limit 35) t1,

(select count(*) c from information_schema.columns) t2 limit 30) tblrand

left join

(select @x:[email protected] +1 n,column_name from information_schema.columns,(select @x:=0) x) tbldata on tblrand.n=tbldata.n;

总结：

1）只做了30多次rand()，而不是7400多次。

2) 没有排序

3) 对表的ID连续性没要求。实例中的表压根没ID字段

以上例子，用root用户登入mysql就能测试

以上是关于从大表里随机取若干行的效率问题的主要内容，如果未能解决你的问题，请参考以下文章