MySQL query optimization for large table joins

【Title】: mysql query optimization for large table joins 【Posted】: 2012-08-01 21:49:50 【Question】:

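I am building a report for a radio station. The station logs its online listeners: IP address, date, time, total number of users listening, and so on.

listeners table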

client_ip        date        time      date_time            listeners  
---------------  ----------  --------  -------------------  -----------
166.147.81.179   2012-04-30  00:19:46  2012-04-30 00:19:46            1
64.12.243.203    2012-04-30  04:38:37  2012-04-30 04:38:37            1
198.228.211.195  2012-04-30  05:36:33  2012-04-30 05:36:33            1
198.228.211.195  2012-04-30  05:36:34  2012-04-30 05:36:34            2
198.228.211.195  2012-04-30  05:36:35  2012-04-30 05:36:35            2
198.228.211.195  2012-04-30  05:36:35  2012-04-30 05:36:35            3
166.147.81.179   2012-04-30  05:47:13  2012-04-30 05:47:13            2
76.170.251.97    2012-04-30  06:01:37  2012-04-30 06:01:37            2
76.170.251.97    2012-04-30  06:01:39  2012-04-30 06:01:39            2
76.170.251.97    2012-04-30  06:01:39  2012-04-30 06:01:39            2

It also logs song details (title, artist, album, length, date, time, etc.).

playlists table

title                       artist                           length_in_secs  played_date  played_time  start_date_time      end_date_time        
--------------------------  -------------------------------  --------------  -----------  -----------  -------------------  ---------------------
We Found Love               Rihanna                                     184  2012-04-30   00:00:21     2012-04-30 00:00:21  2012-04-30 00:03:25  
Photograph                  Nickelback                                  216  2012-04-30   00:03:31     2012-04-30 00:03:31  2012-04-30 00:07:07  
Not Over You                Gavin DeGraw                                214  2012-04-30   00:07:18     2012-04-30 00:07:18  2012-04-30 00:10:52  
Stereo Hearts               Gym Class Heroes Ft Adam Levine             210  2012-04-30   00:10:55     2012-04-30 00:10:55  2012-04-30 00:14:25  
I Gotta Feeling             Black  Eyed Peas                            243  2012-04-30   00:15:03     2012-04-30 00:15:03  2012-04-30 00:19:06  
One Thing Leads To Another  Fixx                                        182  2012-04-30   00:19:14     2012-04-30 00:19:14  2012-04-30 00:22:16  
Raise Your Glass            Pink                                        202  2012-04-30   00:22:29     2012-04-30 00:22:29  2012-04-30 00:25:51  
Better In Time              Leona Lewis                                 216  2012-04-30   00:30:13     2012-04-30 00:30:13  2012-04-30 00:33:49  
Tainted Love                Soft Cell                                   153  2012-04-30   00:33:56     2012-04-30 00:33:56  2012-04-30 00:36:29  
Haven't Met You Yet         Michael Buble'                              242  2012-04-30   00:37:14     2012-04-30 00:37:14  2012-04-30 00:41:16  

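The report requirement is "how many users listened to a song on a given date or within a date range", so I wrote the query below. It gives the correct output (as far as I know), but the execution time is out of proportion to the data size: anywhere from 5 seconds to 5-10 minutes, depending on the date range.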

SELECT DATE_FORMAT(p.played_date, "%m/%d/%Y") `played_date`, p.played_time, p.length_in_secs, p.title, p.artist, RTRIM(p.label) `label`, RTRIM(p.album) `album`, IFNULL((SELECT SUM(l.listeners) FROM listeners `l` WHERE l.date_time >= p.start_date_time AND l.date_time <= p.end_date_time LIMIT 1), 0) `listeners` FROM playlists `p` WHERE p.title <> "" AND (p.played_date >= '2012-04-30' AND p.played_date <= '2012-05-30') HAVING listeners > 0 ORDER BY p.title ASC;
// formatted //
SELECT 
    DATE_FORMAT(p.played_date, "%m/%d/%Y") `played_date`,
    p.played_time,
    p.length_in_secs,
    p.title,
    p.artist,
    RTRIM(p.label) `label`,
    RTRIM(p.album) `album`,
    IFNULL(
        (SELECT 
            SUM(l.listeners) 
        FROM
            listeners `l` 
        WHERE l.date_time >= p.start_date_time 
            AND l.date_time <= p.end_date_time 
        LIMIT 1),
        0
    ) `listeners` 
FROM
    playlists `p` 
WHERE p.title <> "" 
    AND (
        p.played_date >= '2012-04-30' 
        AND p.played_date <= '2012-05-30'
    ) 
HAVING listeners > 0 
ORDER BY p.title ASC

Output:

played_date  played_time  length_in_secs  title                  artist                    label               album               listeners  
-----------  -----------  --------------  ---------------------  ------------------------  ------------------  ------------------  -----------
04/30/2012   08:06:26                228  Brighter Than The Sun  Colbie Caillat (Cal-Lay)  Universal Republic  All of You                    9
04/30/2012   08:44:16                248  Breakfast At Tiffanys  Deep Blue Something                                                         6
04/30/2012   18:06:40                253  Bizarre Love Triangle  New Order                                                                   2
04/30/2012   17:05:21                183  Animal                 Neon Trees                Mercury             Habits                        5
04/30/2012   08:58:05                253  Always Be My Baby      Mariah Carey                                                                2
04/30/2012   07:25:52                264  Already Gone           Kelly Clarkson            RCA                 All I Ever Wante              3
04/30/2012   16:21:33                236  All The Right Moves    One Republic              Interscope          Waking Up                     7
04/30/2012   11:58:26                199  All That She Wants     Ace Of Base                                                                12
04/30/2012   11:14:17                247  All I Wanna Do         Sheryl Crow                                                                 2
04/30/2012   16:12:59                235  A Thousand Miles       Vanessa Carlton                                                             5

Is there any way to optimize this query so it runs faster, or to write a new, faster query? Please suggest/help me. Thanks!!

Using EXPLAIN

EXPLAIN playlists;


Field            Type              Null    Key     Default            Extra                        
---------------  ----------------  ------  ------  -----------------  -----------------------------
playlist_id      int(10) unsigned  NO      PRI     (NULL)             auto_increment               
title            varchar(255)      YES             (NULL)                                          
artist           varchar(255)      YES             (NULL)                                          
label            varchar(255)      YES             (NULL)                                          
album            varchar(255)      YES             (NULL)                                          
length_in_secs   int(11)           NO              (NULL)                                          
played_date      date              NO              (NULL)                                          
played_time      time              NO              (NULL)                                          
start_date_time  datetime          NO              (NULL)                                          
end_date_time    datetime          NO              (NULL)                                          
added_date       datetime          NO              (NULL)                                          
modified_date    timestamp         NO              CURRENT_TIMESTAMP  on update CURRENT_TIMESTAMP


EXPLAIN listeners;


Field          Type                 Null    Key     Default            Extra                        
-------------  -------------------  ------  ------  -----------------  -----------------------------
listener_id    bigint(20) unsigned  NO      PRI     (NULL)             auto_increment               
station_id     int(10) unsigned     NO              (NULL)                                          
client_ip      varchar(50)          NO              (NULL)                                          
time           time                 NO              (NULL)                                          
date           date                 NO              (NULL)                                          
date_time      datetime             YES             (NULL)                                          
timestamp      bigint(20) unsigned  NO              (NULL)                                          
listeners      int(10) unsigned     NO              (NULL)                                          
processes      int(10) unsigned     NO              (NULL)                                          
uid            int(10) unsigned     NO              (NULL)                                          
user_agent     varchar(255)         YES             (NULL)                                          
added_date     datetime             NO              (NULL)                                          
modified_date  timestamp            NO              CURRENT_TIMESTAMP  on update CURRENT_TIMESTAMP  
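
Note that `EXPLAIN` followed by a table name is a synonym for `DESCRIBE`: it lists the columns, as above, not an execution plan. The Key column shows only the primary keys, so neither table appears to have an index on its date columns. A minimal sketch of how the plan and the existing indexes could be inspected instead, assuming the table and column names shown above (the select list is shortened for brevity):

```sql
-- Execution plan for the slow report query (select list shortened).
EXPLAIN
SELECT p.title,
       (SELECT SUM(l.listeners)
        FROM listeners l
        WHERE l.date_time >= p.start_date_time
          AND l.date_time <= p.end_date_time) AS listeners
FROM playlists p
WHERE p.title <> ''
  AND p.played_date >= '2012-04-30'
  AND p.played_date <= '2012-05-30';

-- Indexes that currently exist on both tables.
SHOW INDEX FROM playlists;
SHOW INDEX FROM listeners;
```

In the plan, a `type` of `ALL` on `listeners` inside a dependent subquery would mean the whole listeners table is scanned once per playlist row.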

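【Comments】:

Could you run EXPLAIN on one of the long-running queries? Perhaps the problem is simply that the query has no suitable index, and creating a good index would fix the timing. It would also help a lot if you could show the current indexes on the tables. Thanks

How do you identify when a user stopped listening?

@invertedSpear, why would I need stop/start? All I need is the optimization of the query above.

@drew010, I have updated the question with the EXPLAIN data.

It is needed because if you don't know when a user stopped listening ("logged out"), you won't know whether that user heard a given song. Also, what about people who leave in the middle of a song, do they count?

【Answer 1】:

Use an INNER JOIN instead of a correlated subquery, as follows: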

SELECT DATE_FORMAT(p.played_date, "%m/%d/%Y") played_date,
       p.played_time,
       p.length_in_secs,
       p.title,
       p.artist,
       RTRIM(p.label) label,
       RTRIM(p.album) album,
       SUM(l.listeners) listeners
FROM playlists p
     INNER JOIN listeners l
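         ON l.date_time BETWEEN p.start_date_time AND p.end_date_time
WHERE p.title <> "" AND
      p.played_date BETWEEN '2012-04-30' AND  '2012-05-30'
-- One output row per playlist entry; without GROUP BY, SUM() collapses the result into a single row.
GROUP BY p.playlist_id
ORDER BY p.title ASC;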

Consider adding the following indexes to the tables; they may help query performance. Use EXPLAIN to check whether they are actually used:

playlists KEY (played_date, start_date_time, end_date_time, title);

listeners KEY (date_time, listeners);
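
The two key definitions above are shorthand rather than runnable statements. A minimal sketch of how they could be created and checked, assuming the column lists from this answer; the index names `idx_playlists_played` and `idx_listeners_dt` are made up for illustration:

```sql
-- Index names are hypothetical; the column lists follow the answer above.
-- If the server rejects the key as too long (title is VARCHAR(255)),
-- a prefix such as title(50) is one option.
ALTER TABLE playlists
    ADD INDEX idx_playlists_played (played_date, start_date_time, end_date_time, title);

ALTER TABLE listeners
    ADD INDEX idx_listeners_dt (date_time, listeners);

-- Check whether the optimizer picks the new indexes for the join.
EXPLAIN
SELECT p.playlist_id, SUM(l.listeners) AS listeners
FROM playlists p
     INNER JOIN listeners l
         ON l.date_time BETWEEN p.start_date_time AND p.end_date_time
WHERE p.played_date BETWEEN '2012-04-30' AND '2012-05-30'
GROUP BY p.playlist_id;
```

Because `listeners` is included in the second index, the sum can in principle be computed from the index alone, without reading the table rows.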

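【Comments】:

Although the difference is shrinking over time, in MySQL a JOIN is generally still faster than a sub-SELECT.

@Omesh It does not give the desired result. I posted above what the output should look like.

I have tested it, and it gives the same result as your query. For the input data you provided, your output appears to be wrong. Could you set up a sqlfiddle for it?

The INNER JOIN works the same as the subquery, but the execution time is about the same.

【Answer 2】:

As discussed in the comments, your query doesn't actually do what you want. Given the data you have, I would personally process it outside SQL to build a table that stores how many people listened to each song, and then query that table in SQL. But if you absolutely want a SQL query to do this, it would need to be something like this monster: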

SELECT 
DATE_FORMAT(p.played_date, "%m/%d/%Y") `played_date`,
p.played_time,
p.length_in_secs,
p.title,
p.artist,
RTRIM(p.label) `label`,
RTRIM(p.album) `album`,
SUM(SMALLEST(prev_listeners, next_listeners, dur_listeners)) AS listeners
FROM (
  SELECT
  p.start_date_time,
  -- listener count from the last log row before the song, per client_ip
  SUBSTRING_INDEX(GROUP_CONCAT(l_before.listeners ORDER BY l_before.date_time DESC),',',1) AS prev_listeners, 
  -- listener count from the first log row after the song, per client_ip
  SUBSTRING_INDEX(GROUP_CONCAT(l_after.listeners ORDER BY l_after.date_time ASC),',',1) AS next_listeners, 
  -- lowest listener count logged while the song was playing
  MIN(l_during.listeners) AS dur_listeners
  FROM playlists p
  JOIN listeners l_before ON l_before.date_time < p.start_date_time 
  LEFT JOIN listeners l_after ON l_before.client_ip = l_after.client_ip AND l_after.date_time > p.end_date_time 
  LEFT JOIN listeners l_during ON l_before.client_ip = l_during.client_ip AND l_during.date_time BETWEEN p.start_date_time AND p.end_date_time
  WHERE p.title <> ""
  AND p.played_date BETWEEN '2012-04-30' AND '2012-05-30'
  GROUP BY p.start_date_time, l_before.client_ip
) l
JOIN playlists p USING (start_date_time)
GROUP BY p.start_date_time
ORDER BY p.start_date_time

Here SMALLEST is a function that returns the smallest non-NULL argument.
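
SMALLEST is not a MySQL built-in, and the built-in `LEAST()` returns NULL as soon as any argument is NULL, so it cannot be used directly. One possible way to get the same behavior from built-ins, shown here as a sketch on made-up literal values rather than on the real tables:

```sql
-- Each COALESCE starts from a different argument, so every non-NULL
-- value appears at least once and no NULL reaches LEAST() unless all
-- three inputs are NULL.
SELECT a, b, c,
       LEAST(COALESCE(a, b, c),
             COALESCE(b, a, c),
             COALESCE(c, a, b)) AS smallest_non_null
FROM (
    SELECT NULL AS a, 3 AS b, 5 AS c       -- smallest_non_null = 3
    UNION ALL SELECT 2, NULL, NULL         -- smallest_non_null = 2
    UNION ALL SELECT 4, 1, NULL            -- smallest_non_null = 1
) t;
```

In the query above, `prev_listeners` and `next_listeners` come out of `SUBSTRING_INDEX` as strings, so they would likely need a `CAST(... AS UNSIGNED)` before a numeric comparison.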

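This will take longer than your current query, but it is the most efficient way I can think of to get the actual number of listeners for each song.

Oh, and this assumes that when everyone at an IP address has stopped listening, a row with zero listeners is logged; otherwise there is really no way to do this.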
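
The summary-table idea mentioned at the start of this answer could look roughly like the sketch below. The answer suggests building the table outside SQL; for brevity this sketch fills it with a single `INSERT ... SELECT`. The table name `playlist_listener_counts` is hypothetical, and the count uses the question's simple "sum of logged listeners during the song" definition rather than the stricter per-IP logic above.

```sql
-- Hypothetical summary table: one precomputed count per playlist entry.
CREATE TABLE playlist_listener_counts (
    playlist_id INT UNSIGNED NOT NULL PRIMARY KEY,
    listeners   INT UNSIGNED NOT NULL
);

-- Fill it once per reporting period (for example from a nightly job).
INSERT INTO playlist_listener_counts (playlist_id, listeners)
SELECT p.playlist_id, SUM(l.listeners)
FROM playlists p
     INNER JOIN listeners l
         ON l.date_time BETWEEN p.start_date_time AND p.end_date_time
WHERE p.played_date BETWEEN '2012-04-30' AND '2012-05-30'
GROUP BY p.playlist_id;

-- The report then joins on the primary key only.
SELECT p.played_date, p.played_time, p.title, p.artist, c.listeners
FROM playlists p
     INNER JOIN playlist_listener_counts c USING (playlist_id)
WHERE p.played_date BETWEEN '2012-04-30' AND '2012-05-30'
ORDER BY p.title ASC;
```

The expensive range join then runs once when the table is filled, not on every report request.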
