MySQL query optimization for large table joins
[Posted]: 2012-08-01 21:49:50
[Question]: I am building a report for a radio station. The station logs its online listeners, recording the IP address, date, time, total number of users listening, and so on.
listeners table
client_ip date time date_time listeners
--------------- ---------- -------- ------------------- -----------
166.147.81.179 2012-04-30 00:19:46 2012-04-30 00:19:46 1
64.12.243.203 2012-04-30 04:38:37 2012-04-30 04:38:37 1
198.228.211.195 2012-04-30 05:36:33 2012-04-30 05:36:33 1
198.228.211.195 2012-04-30 05:36:34 2012-04-30 05:36:34 2
198.228.211.195 2012-04-30 05:36:35 2012-04-30 05:36:35 2
198.228.211.195 2012-04-30 05:36:35 2012-04-30 05:36:35 3
166.147.81.179 2012-04-30 05:47:13 2012-04-30 05:47:13 2
76.170.251.97 2012-04-30 06:01:37 2012-04-30 06:01:37 2
76.170.251.97 2012-04-30 06:01:39 2012-04-30 06:01:39 2
76.170.251.97 2012-04-30 06:01:39 2012-04-30 06:01:39 2
Song details (title, artist, album, length, date, time, and so on) are logged as well.
playlists table
title artist length_in_secs played_date played_time start_date_time end_date_time
-------------------------- ------------------------------- -------------- ----------- ----------- ------------------- ---------------------
We Found Love Rihanna 184 2012-04-30 00:00:21 2012-04-30 00:00:21 2012-04-30 00:03:25
Photograph Nickelback 216 2012-04-30 00:03:31 2012-04-30 00:03:31 2012-04-30 00:07:07
Not Over You Gavin DeGraw 214 2012-04-30 00:07:18 2012-04-30 00:07:18 2012-04-30 00:10:52
Stereo Hearts Gym Class Heroes Ft Adam Levine 210 2012-04-30 00:10:55 2012-04-30 00:10:55 2012-04-30 00:14:25
I Gotta Feeling Black Eyed Peas 243 2012-04-30 00:15:03 2012-04-30 00:15:03 2012-04-30 00:19:06
One Thing Leads To Another Fixx 182 2012-04-30 00:19:14 2012-04-30 00:19:14 2012-04-30 00:22:16
Raise Your Glass Pink 202 2012-04-30 00:22:29 2012-04-30 00:22:29 2012-04-30 00:25:51
Better In Time Leona Lewis 216 2012-04-30 00:30:13 2012-04-30 00:30:13 2012-04-30 00:33:49
Tainted Love Soft Cell 153 2012-04-30 00:33:56 2012-04-30 00:33:56 2012-04-30 00:36:29
Haven't Met You Yet Michael Buble' 242 2012-04-30 00:37:14 2012-04-30 00:37:14 2012-04-30 00:41:16
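The report requirement is "how many users listened to each song on a given date or within a date range", so I wrote the query below. It gives the correct output (as far as I can tell), but the execution time is out of proportion to the data size: anywhere from 5 seconds to 5-10 minutes, depending on the date range.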
SELECT DATE_FORMAT(p.played_date, "%m/%d/%Y") `played_date`, p.played_time, p.length_in_secs, p.title, p.artist, RTRIM(p.label) `label`, RTRIM(p.album) `album`, IFNULL((SELECT SUM(l.listeners) FROM listeners `l` WHERE l.date_time >= p.start_date_time AND l.date_time <= p.end_date_time LIMIT 1), 0) `listeners` FROM playlists `p` WHERE p.title <> "" AND (p.played_date >= '2012-04-30' AND p.played_date <= '2012-05-30') HAVING listeners > 0 ORDER BY p.title ASC;
The same query, formatted:
SELECT
  DATE_FORMAT(p.played_date, "%m/%d/%Y") `played_date`,
  p.played_time,
  p.length_in_secs,
  p.title,
  p.artist,
  RTRIM(p.label) `label`,
  RTRIM(p.album) `album`,
  IFNULL(
    (SELECT
       SUM(l.listeners)
     FROM
       listeners `l`
     WHERE l.date_time >= p.start_date_time
       AND l.date_time <= p.end_date_time
     LIMIT 1),
    0
  ) `listeners`
FROM
  playlists `p`
WHERE p.title <> ""
  AND (
    p.played_date >= '2012-04-30'
    AND p.played_date <= '2012-05-30'
  )
HAVING listeners > 0
ORDER BY p.title ASC
Output:
played_date played_time length_in_secs title artist label album listeners
----------- ----------- -------------- --------------------- ------------------------ ------------------ ------------------ -----------
04/30/2012 08:06:26 228 Brighter Than The Sun Colbie Caillat (Cal-Lay) Universal Republic All of You 9
04/30/2012 08:44:16 248 Breakfast At Tiffanys Deep Blue Something 6
04/30/2012 18:06:40 253 Bizarre Love Triangle New Order 2
04/30/2012 17:05:21 183 Animal Neon Trees Mercury Habits 5
04/30/2012 08:58:05 253 Always Be My Baby Mariah Carey 2
04/30/2012 07:25:52 264 Already Gone Kelly Clarkson RCA All I Ever Wante 3
04/30/2012 16:21:33 236 All The Right Moves One Republic Interscope Waking Up 7
04/30/2012 11:58:26 199 All That She Wants Ace Of Base 12
04/30/2012 11:14:17 247 All I Wanna Do Sheryl Crow 2
04/30/2012 16:12:59 235 A Thousand Miles Vanessa Carlton 5
Is there a way to optimize this query so that it runs faster, or to write a new, faster query? Please advise. Thanks!
Table definitions from EXPLAIN:
EXPLAIN playlists;
Field Type Null Key Default Extra
--------------- ---------------- ------ ------ ----------------- -----------------------------
playlist_id int(10) unsigned NO PRI (NULL) auto_increment
title varchar(255) YES (NULL)
artist varchar(255) YES (NULL)
label varchar(255) YES (NULL)
album varchar(255) YES (NULL)
length_in_secs int(11) NO (NULL)
played_date date NO (NULL)
played_time time NO (NULL)
start_date_time datetime NO (NULL)
end_date_time datetime NO (NULL)
added_date datetime NO (NULL)
modified_date timestamp NO CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP
EXPLAIN listeners;
Field Type Null Key Default Extra
------------- ------------------- ------ ------ ----------------- -----------------------------
listener_id bigint(20) unsigned NO PRI (NULL) auto_increment
station_id int(10) unsigned NO (NULL)
client_ip varchar(50) NO (NULL)
time time NO (NULL)
date date NO (NULL)
date_time datetime YES (NULL)
timestamp bigint(20) unsigned NO (NULL)
listeners int(10) unsigned NO (NULL)
processes int(10) unsigned NO (NULL)
uid int(10) unsigned NO (NULL)
user_agent varchar(255) YES (NULL)
added_date datetime NO (NULL)
modified_date timestamp NO CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP
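[Comments]:
Could you run EXPLAIN on one of the queries that takes a long time? The problem may simply be that the query has no suitable index, and creating a good one could fix the timing. It would also be very helpful if you could show the indexes that currently exist on the tables. Thanks.
How do you identify when a user stops listening?
@invertedSpear, why would stop/start be needed? All I need is to optimize the query above.
@drew010, I have updated the question with the EXPLAIN output.
It is needed because if you do not know when a user stopped listening ("logged out"), you cannot know whether that user heard a given song. Also, what about people who leave in the middle of a song: do they count?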
[Answer 1]:
Use an INNER JOIN instead of a correlated subquery, like this:
SELECT DATE_FORMAT(p.played_date, "%m/%d/%Y") played_date,
       p.played_time,
       p.length_in_secs,
       p.title,
       p.artist,
       RTRIM(p.label) label,
       RTRIM(p.album) album,
       SUM(l.listeners) listeners
FROM playlists p
     INNER JOIN listeners l
        ON l.date_time BETWEEN p.start_date_time AND p.end_date_time
WHERE p.title <> "" AND
      p.played_date BETWEEN '2012-04-30' AND '2012-05-30'
-- One result row per playlist entry; without GROUP BY, SUM() collapses
-- the whole result into a single row.
GROUP BY p.playlist_id
ORDER BY p.title ASC;
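The rewrite from a correlated subquery to a grouped join can be checked on a tiny data set. The following is a minimal sketch, not the poster's setup: it uses Python's built-in `sqlite3` as a stand-in for MySQL, trimmed-down tables, and a few log rows that are made up for illustration (only the 00:19:46 row comes from the sample data). The `HAVING listeners > 0` filter is expressed as an outer `WHERE` because SQLite handles `HAVING` without `GROUP BY` differently from MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE playlists (
    playlist_id INTEGER PRIMARY KEY,
    title TEXT,
    start_date_time TEXT,
    end_date_time TEXT
);
CREATE TABLE listeners (
    listener_id INTEGER PRIMARY KEY,
    date_time TEXT,
    listeners INTEGER
);
""")
conn.executemany(
    "INSERT INTO playlists (title, start_date_time, end_date_time) VALUES (?, ?, ?)",
    [
        ("I Gotta Feeling", "2012-04-30 00:15:03", "2012-04-30 00:19:06"),
        ("One Thing Leads To Another", "2012-04-30 00:19:14", "2012-04-30 00:22:16"),
        ("Raise Your Glass", "2012-04-30 00:22:29", "2012-04-30 00:25:51"),
    ],
)
conn.executemany(
    "INSERT INTO listeners (date_time, listeners) VALUES (?, ?)",
    [
        ("2012-04-30 00:19:46", 1),  # from the sample data
        ("2012-04-30 00:20:10", 2),  # made-up row for illustration
        ("2012-04-30 00:23:00", 4),  # made-up row for illustration
    ],
)

# Correlated subquery: one inner SELECT is evaluated per playlist row.
correlated = conn.execute("""
    SELECT title, listeners FROM (
        SELECT p.title,
               IFNULL((SELECT SUM(l.listeners)
                       FROM listeners l
                       WHERE l.date_time BETWEEN p.start_date_time
                                             AND p.end_date_time), 0) AS listeners
        FROM playlists p
    )
    WHERE listeners > 0
    ORDER BY title
""").fetchall()

# Join grouped per playlist row: songs without any log row drop out
# because of the INNER JOIN.
joined = conn.execute("""
    SELECT p.title, SUM(l.listeners) AS listeners
    FROM playlists p
    INNER JOIN listeners l
        ON l.date_time BETWEEN p.start_date_time AND p.end_date_time
    GROUP BY p.playlist_id
    ORDER BY p.title
""").fetchall()

print(correlated)
print(joined)
```

On this sample both forms return the same two rows, and the song with no log rows in its time window is absent from both. Whether the join is actually faster on the real tables depends on the available indexes, so it is worth measuring there.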
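Consider adding the following indexes to the tables; they may help query performance. Use EXPLAIN on the query to check whether they are actually used:
ALTER TABLE playlists ADD KEY (played_date, start_date_time, end_date_time, title);
ALTER TABLE listeners ADD KEY (date_time, listeners);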
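Checking whether an index is picked up can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` rather than MySQL, so the statement is SQLite's `EXPLAIN QUERY PLAN` and its output format differs from MySQL's `EXPLAIN`; the index name `idx_listeners_date_time` is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listeners ("
    "listener_id INTEGER PRIMARY KEY, date_time TEXT, listeners INTEGER)"
)
# Same column order as the suggested listeners index: the range column
# first, then the summed column so the index covers the query.
conn.execute(
    "CREATE INDEX idx_listeners_date_time ON listeners (date_time, listeners)"
)

query = "SELECT SUM(listeners) FROM listeners WHERE date_time BETWEEN ? AND ?"
params = ("2012-04-30 00:19:14", "2012-04-30 00:22:16")

# The last column of each plan row is the human-readable description.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

If the plan text names the index, the range lookup is served from it instead of a full table scan. On MySQL the equivalent check is to prefix the real query with EXPLAIN and look at the `key` column.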
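[Comments]:
Although the difference has been narrowing over time, MySQL is usually still faster with a JOIN than with a sub-SELECT.
@Omesh It does not give the required result. The output should look like what I posted above.
I have tested it, and it gives the same result as your query. For the input data you provided, your output seems to be wrong. Can you set up an sqlfiddle for it?
The INNER JOIN works the same as the subquery, but the execution time is more or less the same.
[Answer 2]: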
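As discussed in the comments, your query does not actually do what you want. Given the data you have, I would personally process it outside SQL to build a table that stores how many people listened to each song, and then query that table in SQL. However, if you absolutely want an SQL query to do this, it would need to look something like this monster: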
SELECT
  DATE_FORMAT(p.played_date, "%m/%d/%Y") `played_date`,
  p.played_time,
  p.length_in_secs,
  p.title,
  p.artist,
  RTRIM(p.label) `label`,
  RTRIM(p.album) `album`,
  SUM(SMALLEST(prev_listeners, next_listeners, dur_listeners)) AS listeners
FROM (
  -- One row per song and client IP: the last count logged before the song,
  -- the first count logged after it, and the lowest count logged during it.
  SELECT
    p.start_date_time,
    SUBSTRING_INDEX(GROUP_CONCAT(l_before.listeners ORDER BY l_before.date_time DESC), ',', 1) AS prev_listeners,
    SUBSTRING_INDEX(GROUP_CONCAT(l_after.listeners ORDER BY l_after.date_time ASC), ',', 1) AS next_listeners,
    MIN(l_during.listeners) AS dur_listeners
  FROM playlists p
  JOIN listeners l_before ON l_before.date_time < p.start_date_time
  LEFT JOIN listeners l_after ON l_before.client_ip = l_after.client_ip AND l_after.date_time > p.end_date_time
  LEFT JOIN listeners l_during ON l_before.client_ip = l_during.client_ip AND l_during.date_time BETWEEN p.start_date_time AND p.end_date_time
  WHERE p.title <> ""
    AND p.played_date BETWEEN '2012-04-30' AND '2012-05-30'
  GROUP BY p.start_date_time, l_before.client_ip
) l
JOIN playlists p USING (start_date_time)
GROUP BY p.start_date_time
ORDER BY p.start_date_time
where SMALLEST is a user-defined function (it is not built into MySQL) that returns the smallest non-NULL argument.
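The behavior described for SMALLEST can be sketched as below. This is an assumption about what the answer intends, not its actual implementation: the function is written in Python and registered with the built-in `sqlite3` module purely so that it can be called from SQL in a runnable example. In MySQL it would have to be written as a stored function, because the built-in `LEAST()` returns NULL as soon as any argument is NULL.

```python
import sqlite3


def smallest(*args):
    """Return the smallest argument that is not None, or None if all are None."""
    values = [a for a in args if a is not None]
    return min(values) if values else None


conn = sqlite3.connect(":memory:")
# narg=-1 lets the SQL function accept any number of arguments.
conn.create_function("SMALLEST", -1, smallest)

row = conn.execute(
    "SELECT SMALLEST(3, NULL, 2), SMALLEST(NULL, NULL)"
).fetchone()
print(row)
```

The sketch assumes numeric arguments. The values produced by `GROUP_CONCAT` in the query are strings, so a real implementation would need to cast them before comparing.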
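This query will take longer than your current one, but it is the most efficient way I can think of to get the actual number of listeners for each song.
Oh, and this assumes that when everyone at an IP address stops listening, the log records a row with zero listeners; otherwise there is really no way to do this.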
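The alternative mentioned at the start of this answer, processing the log outside SQL and storing one count per song, could look roughly like the sketch below. It rests on assumptions: the function name `listeners_per_song` is hypothetical, the log rows other than the 00:19:46 entry are made up, and the counting rule is simply the one from the question's query (sum the `listeners` column of every log row inside the song's time window), not the stricter per-IP rule this answer argues for. Sorting the log once and using prefix sums avoids rescanning the log for every song.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def listeners_per_song(songs, log):
    """songs: (title, start, end) tuples; log: (date_time, listeners) tuples.

    Timestamps are 'YYYY-MM-DD HH:MM:SS' strings, which sort chronologically.
    Returns (title, start, total) tuples in the order of `songs`.
    """
    log = sorted(log)
    times = [t for t, _ in log]
    # prefix[i] is the sum of the listeners values of the first i log rows.
    prefix = [0] + list(accumulate(n for _, n in log))
    totals = []
    for title, start, end in songs:
        lo = bisect_left(times, start)   # first row with date_time >= start
        hi = bisect_right(times, end)    # one past the last row <= end
        totals.append((title, start, prefix[hi] - prefix[lo]))
    return totals


songs = [
    ("I Gotta Feeling", "2012-04-30 00:15:03", "2012-04-30 00:19:06"),
    ("One Thing Leads To Another", "2012-04-30 00:19:14", "2012-04-30 00:22:16"),
    ("Raise Your Glass", "2012-04-30 00:22:29", "2012-04-30 00:25:51"),
]
log = [
    ("2012-04-30 00:19:46", 1),  # from the sample data
    ("2012-04-30 00:20:10", 2),  # made-up row for illustration
    ("2012-04-30 00:23:00", 4),  # made-up row for illustration
]
totals = listeners_per_song(songs, log)
print(totals)
```

The resulting rows could then be inserted into a summary table, so the report becomes a plain indexed lookup by date range instead of a join over the raw log.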