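SQL doesn't work on a large sample
【Posted】: 2017-03-17 14:44:51
【Question】:
I am trying to solve a challenge and have come up with a solution. It works on a small data set but does not seem to work on a larger one. Can someone help me see where I went wrong?
I am having trouble calculating the unique users for each day (the second column of the output). The rest of the logic works fine.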
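Julia conducted a 15-day SQL learning contest. The start date of the contest was March 01, 2016 and the end date was March 15, 2016.
Write a query to print the total number of unique hackers who made at least one submission each day (starting on the first day of the contest), and find the hacker_id and name of the hacker who made the maximum number of submissions each day. If more than one such hacker has the maximum number of submissions, print the lowest hacker_id. The query should print this information for each day of the contest, sorted by date.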
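Input Format
The following tables contain contest data:
Hackers: hacker_id is the id of the hacker, and name is the name of the hacker.
Submissions: submission_date is the date of the submission, submission_id is the id of the submission, hacker_id is the id of the hacker who made the submission, and score is the score of the submission.
Sample Input
For the following sample input, assume that the end date of the contest was March 06, 2016.
Hackers table: [image not preserved] Submissions table: [image not preserved]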
**Explanation :-**
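On March 01, 2016, hackers […], […], […], and […] made submissions. There are 4 unique hackers who made at least one submission each day. As each hacker made one submission, 20703 is considered the hacker who made the maximum number of submissions on this day. The name of the hacker is Angela.
On March 02, 2016, hackers […], […], and […] made submissions. Now […] and […] were the only ones to submit every day, so there are 2 unique hackers who made at least one submission each day. 79722 made the most submissions, and the name of the hacker is Michael.
On March 03, 2016, hackers […], […], and […] made submissions. Now […] and […] were the only ones to submit every day, so there are 2 unique hackers who made at least one submission each day. As each hacker made one submission, 20703 is considered the hacker who made the maximum number of submissions on this day. The name of the hacker is Angela.
On March 04, 2016, hackers […], […], […], and […] made submissions. Now […] and […] were the only ones to submit every day, so there are 2 unique hackers who made at least one submission each day. As each hacker made one submission, 20703 is considered the hacker who made the maximum number of submissions on this day. The name of the hacker is Angela.
On March 05, 2016, hackers […], […], and […] made submissions. Now only […] submitted every day, so there is only 1 unique hacker who made at least one submission each day. 36396 made the most submissions, and the name of the hacker is Frank.
On March 06, 2016, only 20703 made a submission, so there is only 1 unique hacker who made at least one submission each day. As the only submitter, 20703 has the maximum number of submissions, and the name of the hacker is Angela.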
Sample Output
2016-03-01 4 20703 Angela
2016-03-02 2 79722 Michael
2016-03-03 2 20703 Angela
2016-03-04 2 20703 Angela
2016-03-05 1 36396 Frank
2016-03-06 1 20703 Angela
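When a query disagrees with the expected output only on larger inputs, a small reference implementation outside SQL can help pin down which column is wrong. The following is a minimal Python sketch of the rules as stated above, not the challenge's own checker. The function name `contest_report`, the hacker names, and the data are all hypothetical, and the sketch assumes every contest day has at least one submission.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta

def contest_report(submissions, names, start):
    """submissions: iterable of (submission_date, hacker_id); names: {hacker_id: name}."""
    per_day = defaultdict(Counter)      # date -> Counter({hacker_id: number of submissions})
    for day, hacker in submissions:
        per_day[day][hacker] += 1

    report = []
    streak = None                       # hackers who have submitted on every day so far
    day = start
    while day in per_day:               # assumption: stop at the first day without submissions
        today = set(per_day[day])
        streak = today if streak is None else streak & today
        # Most submissions first; the lowest hacker_id breaks ties.
        top = min(per_day[day], key=lambda h: (-per_day[day][h], h))
        report.append((day, len(streak), top, names[top]))
        day += timedelta(days=1)
    return report

# Hypothetical data, not the sample tables from the challenge.
names = {1: "Ann", 2: "Bob", 3: "Cy"}
d1, d2, d3 = date(2016, 3, 1), date(2016, 3, 2), date(2016, 3, 3)
submissions = [(d1, 1), (d1, 2), (d1, 3),
               (d2, 3), (d2, 3), (d2, 1),
               (d3, 2), (d3, 1)]
for row in contest_report(submissions, names, d1):
    print(*row)
```

Hacker 2 skips the second day in this data, so that hacker no longer counts on the third day even though a submission was made then.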
Schema & Data :-
http://sqlfiddle.com/#!9/844928
Solution :-
SELECT A.submission_date, A.cnt, B.hacker_id, B.name
FROM
(
    SELECT submission_date, COUNT( DISTINCT hacker_id ) AS cnt
    FROM submissions
    WHERE submission_date = '2016-03-01'
    GROUP BY submission_date
    UNION ALL
    SELECT submission_date, COUNT( DISTINCT hacker_id )
    FROM
    (
        SELECT DATEADD(day, 1, convert( date, A.submission_date )) AS submission_date, A.hacker_id
        FROM
        (
            SELECT submission_date, hacker_id
            FROM submissions
            GROUP BY submission_date, hacker_id
        ) A
        INNER JOIN
        (
            SELECT DATEADD(day, -1, convert( date, submission_date )) AS new_submission_date, hacker_id
            FROM submissions
            GROUP BY DATEADD(day, -1, convert( date, submission_date )) , hacker_id
        ) B
        ON A.submission_date = B.new_submission_date
        AND A.hacker_id = B.hacker_id
    ) Z
    GROUP BY submission_date
) A
INNER JOIN
(
    SELECT s.submission_date, s.hacker_id, h.name
    FROM
    (
        SELECT submission_date, hacker_id
        FROM
        (
            SELECT submission_date, hacker_id,cnt, ROW_NUMBER() OVER ( PARTITION BY submission_date ORDER BY cnt DESC, hacker_id ) AS rn
            FROM
            (
                SELECT submission_date, hacker_id, COUNT(*) AS cnt
                FROM submissions
                GROUP BY submission_date, hacker_id
            ) Z
        ) Y
        WHERE rn = 1
    ) s
    INNER JOIN
    hackers h
    ON s.hacker_id = h.hacker_id
) B
ON A.submission_date = B.submission_date
;
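Reading the query above, the second branch of the UNION ALL joins each (date, hacker) pair to the same hacker's row on the following day. For every day after the first it therefore appears to count hackers who submitted on that day and the day before, which is a weaker condition than submitting on every day since the start. The Python sketch below is a simplified mirror of that join next to the every-day rule. The function names and the data are hypothetical, chosen so that the two counts diverge.

```python
from datetime import date, timedelta

def hackers_by_day(submissions):
    days = {}
    for day, hacker in submissions:
        days.setdefault(day, set()).add(hacker)
    return days

def adjacent_day_counts(submissions, start):
    """Simplified mirror of the question's join: submitted today AND yesterday."""
    days = hackers_by_day(submissions)
    counts = {start: len(days[start])} if start in days else {}
    for day, hackers in days.items():
        both = hackers & days.get(day - timedelta(days=1), set())
        if day > start and both:        # an inner join emits no row when nobody matches
            counts[day] = len(both)
    return counts

def every_day_counts(submissions, start):
    """What the challenge asks for: submitted on every day from start through today."""
    days = hackers_by_day(submissions)
    counts, streak, day = {}, None, start
    while day in days:
        streak = days[day] if streak is None else streak & days[day]
        counts[day] = len(streak)
        day += timedelta(days=1)
    return counts

d1, d2, d3 = date(2016, 3, 1), date(2016, 3, 2), date(2016, 3, 3)
# Hypothetical data: hacker 2 skips the first day, then submits on days 2 and 3.
submissions = [(d1, 1), (d2, 1), (d2, 2), (d3, 1), (d3, 2)]
print(adjacent_day_counts(submissions, d1)[d3])   # counts hacker 2 as well
print(every_day_counts(submissions, d1)[d3])      # counts only hacker 1
```

A small sample in which nobody returns after a missed day would hide this difference, which may be why the query looks correct on the sample input.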
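【Comments】:
I added the sqlfiddle link to create the schema and data....
What exactly does "it doesn't seem to work" mean?
What does the colleges table in your sqlfiddle have to do with this?
This looks a lot like homework to me. So we can point you in the right direction, but you should do it yourself. Or, if it is a challenge as you say, how can you submit an entry if someone else has to write it for you?
@Carra How would an execution plan help debug a logic error?
【Answer 1】: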
WITH
unique_hackers_on_dates AS
(
SELECT
Submissions.submission_date,
Submissions.hacker_id,
COUNT(1) subs_per_hacker_per_day,
MAX(COUNT(1)) OVER (PARTITION BY submission_date) max_subs_per_day
FROM
Submissions
GROUP BY
Submissions.submission_date,
Submissions.hacker_id
),
hacker_with_max_sub AS
(
SELECT
unique_hackers_on_dates.submission_date,
MIN(hacker_id) min_hacker_id
FROM
unique_hackers_on_dates
WHERE
unique_hackers_on_dates.subs_per_hacker_per_day = unique_hackers_on_dates.max_subs_per_day
GROUP BY
unique_hackers_on_dates.submission_date
),
dates AS
(
SELECT
unique_hackers_on_dates.submission_date,
unique_hackers_on_dates.hacker_id
FROM
unique_hackers_on_dates
WHERE
unique_hackers_on_dates.submission_date = CAST('2016-03-01' AS Date)
UNION ALL
SELECT
unique_hackers_on_dates.submission_date,
unique_hackers_on_dates.hacker_id
FROM
dates
INNER JOIN
unique_hackers_on_dates
ON dates.hacker_id = unique_hackers_on_dates.hacker_id AND
DATEADD(DAY, 1, dates.submission_date) = unique_hackers_on_dates.submission_date
),
consec_hackers as
(
SELECT
submission_date,
count(1) num_consec_hackers
FROM
dates
GROUP BY
submission_date
)
SELECT
consec_hackers.submission_date,
consec_hackers.num_consec_hackers,
hacker_with_max_sub.min_hacker_id,
Hackers.name
FROM
consec_hackers
INNER JOIN
hacker_with_max_sub
on consec_hackers.submission_date = hacker_with_max_sub.submission_date
INNER JOIN
Hackers
ON hacker_with_max_sub.min_hacker_id = Hackers.hacker_id
ORDER BY
consec_hackers.submission_date;
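【Comments】:
I've just started working on this problem, and I'm curious whether there is a more efficient way to do this than using subqueries?
【Answer 2】:
Check the following query on rextester.com: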
WITH
a AS
(
SELECT
submission_date,
hacker_id,
COUNT(*) AS submissions_by_hacker,
DENSE_RANK() OVER (ORDER BY submission_date) AS sequence_number_by_date,
DENSE_RANK() OVER
(
PARTITION BY hacker_id ORDER BY submission_date
) AS sequence_number_by_hacker,
RANK() OVER
(
PARTITION BY submission_date ORDER BY count(*) DESC
) AS rank_by_hacker_submissions
FROM #submissions
GROUP BY submission_date, hacker_id
),
b AS
(
SELECT
*,
MIN(IIF(rank_by_hacker_submissions = 1, hacker_id, NULL)) OVER
(
PARTITION BY submission_date
) AS min_hacker_id
FROM a
)
SELECT
b.submission_date,
h.hacker_id,
COUNT(*) AS quantity_of_hackers_who_made_at_least_submission_each_day,
h.name AS hacker_name
FROM b JOIN #hackers AS h ON b.min_hacker_id = h.hacker_id
WHERE b.sequence_number_by_date = b.sequence_number_by_hacker
GROUP BY b.submission_date, h.hacker_id, h.name
ORDER BY b.submission_date, h.hacker_id;
Output:
+---------------------+-----------+-----------------------------------------------------------+-------------+
| submission_date | hacker_id | quantity_of_hackers_who_made_at_least_submission_each_day | hacker_name |
+---------------------+-----------+-----------------------------------------------------------+-------------+
| 01.03.2016 00:00:00 | 20703 | 4 | Angela |
| 02.03.2016 00:00:00 | 79722 | 2 | Michael |
| 03.03.2016 00:00:00 | 20703 | 2 | Angela |
| 04.03.2016 00:00:00 | 20703 | 2 | Angela |
| 05.03.2016 00:00:00 | 36396 | 1 | Frank |
| 06.03.2016 00:00:00 | 20703 | 1 | Angela |
+---------------------+-----------+-----------------------------------------------------------+-------------+
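The filter `sequence_number_by_date = sequence_number_by_hacker` relies on a ranking argument: a hacker's k-th distinct submission date can only coincide with the k-th date in the data if that hacker has not skipped any earlier date. Below is a minimal Python sketch of the same idea. The helper names and the data are hypothetical, and the sketch assumes, like the query, that every contest day appears in the data at least once.

```python
from collections import Counter

def dense_rank(values):
    """Map each distinct value to its 1-based dense rank in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}

def streak_rows(submissions):
    """Keep (date, hacker) pairs where the date's overall rank equals the hacker's own rank."""
    pairs = set(submissions)
    date_rank = dense_rank(d for d, _ in pairs)
    kept = set()
    for hacker in {h for _, h in pairs}:
        own = dense_rank(d for d, h in pairs if h == hacker)
        kept |= {(d, hacker) for d in own if own[d] == date_rank[d]}
    return kept

# Hypothetical data: hacker 2 skips the second day, hacker 3 skips the first.
submissions = [("2016-03-01", 1), ("2016-03-01", 2),
               ("2016-03-02", 1), ("2016-03-02", 3),
               ("2016-03-03", 1), ("2016-03-03", 2), ("2016-03-03", 3)]
per_day = Counter(day for day, _ in streak_rows(submissions))
print(sorted(per_day.items()))
```

Once a hacker misses a day, that hacker's own rank stays behind the date rank for the rest of the contest, so later rows are never kept.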
【Comments】:
【Answer 3】:
IF OBJECT_ID('tempdb..#Results') IS NOT NULL
DROP TABLE #Results;
CREATE TABLE #Results
([Number of Hackers that had a Submission] INT,
SubmissionDate DATE,
[Greatest # of Submissions by Hacker (lowest ID if tied)] INT,
[Hacker Name with Most Submissions] VARCHAR(50)
);
DECLARE @CurrentDate DATE;
DECLARE my CURSOR
FOR SELECT DISTINCT
submission_date
FROM submissions;
OPEN my;
FETCH NEXT FROM my INTO @CurrentDate;
WHILE @@FETCH_STATUS = 0
BEGIN
INSERT INTO #Results
SELECT a.hackers [Number of Hackers that had a Submission],
a.SubmissionDate,
b.Submission_Count [Greatest # of Submissions by Hacker (lowest ID if tied)],
b.Hacker [Hacker Name with Most Submissions]
FROM
(
SELECT COUNT(DISTINCT hacker_ID) hackers,
@CurrentDate [SubmissionDate]
FROM submissions
WHERE submission_date = @CurrentDate
) a
JOIN
(
SELECT TOP 1 COUNT(submission_id) Submission_Count,
b.name [Hacker],
submission_date
FROM submissions a
JOIN hackers b ON a.hacker_id = b.hacker_id
WHERE a.submission_date = @currentDate
GROUP BY b.name,
a.hacker_id,
submission_date
ORDER BY COUNT(submission_id) DESC,
a.hacker_id
) b ON a.SubmissionDate = b.submission_date;
FETCH NEXT FROM my INTO @CurrentDate;
END;
CLOSE my;
DEALLOCATE my;
SELECT *
FROM #Results;
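I generally don't like using cursors, but this one is fast on small data and makes it easy to evaluate each date separately.
Your results were close to mine but not identical. I didn't have time to diagnose your query, so use this one to compare and contrast.
Given that you posted on March 17, I'm guessing and hoping that this was homework and is now past due... and that I'm not helping you cheat...
Good luck!
Results: [image not preserved]
【Comments】:
【Answer 4】:
Try the following query: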
-- MySQL syntax (two-argument DATEDIFF, LIMIT)
select submission_date,
       (select count(distinct hacker_id)
        from Submissions s2
        where s2.submission_date = s1.submission_date
          and (select count(distinct s3.submission_date)
               from Submissions s3
               where s3.hacker_id = s2.hacker_id
                 and s3.submission_date < s1.submission_date) = datediff(s1.submission_date, '2016-03-01')),
       (select hacker_id
        from submissions s2
        where s2.submission_date = s1.submission_date
        group by hacker_id
        order by count(submission_id) desc, hacker_id
        limit 1) as hack,
       (select name from hackers where hacker_id = hack)
from (select distinct submission_date from submissions) s1
group by submission_date
order by submission_date;
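The correlated subquery in this answer counts a hacker on a given day when the number of distinct earlier dates with a submission from that hacker equals the number of days elapsed since the contest start. The Python sketch below restates that test. The function name `streak_count` and the data are hypothetical, and, like the query, the sketch assumes there are no submissions dated before the start.

```python
from datetime import date

def streak_count(submissions, day, start):
    """Count hackers active on `day` whose earlier distinct dates fill every day since `start`."""
    pairs = set(submissions)
    elapsed = (day - start).days
    count = 0
    for hacker in {h for d, h in pairs if d == day}:
        earlier_days = {d for d, h in pairs if h == hacker and d < day}
        if len(earlier_days) == elapsed:
            count += 1
    return count

d1, d2, d3 = date(2016, 3, 1), date(2016, 3, 2), date(2016, 3, 3)
# Hypothetical data: hacker 2 skips the second day.
submissions = [(d1, 1), (d1, 2), (d2, 1), (d3, 1), (d3, 2)]
print([streak_count(submissions, d, d1) for d in (d1, d2, d3)])
```

The sketch rescans the submissions for each hacker and day, much as a correlated subquery is evaluated per outer row, so this formulation may become slow as the table grows.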
【Comments】:
【Answer 5】:
select big_1.submission_date, big_1.hkr_cnt, big_2.hacker_id, h.name
from
(select submission_date, count(distinct hacker_id) as hkr_cnt
from
(select s.*, dense_rank() over(order by submission_date) as date_rank,
dense_rank() over(partition by hacker_id order by submission_date) as hacker_rank
from submissions s ) a
where date_rank = hacker_rank
group by submission_date) big_1
join
(select submission_date,hacker_id,
rank() over(partition by submission_date order by sub_cnt desc, hacker_id) as max_rank
from (select submission_date, hacker_id, count(*) as sub_cnt
from submissions
group by submission_date, hacker_id) b ) big_2
on big_1.submission_date = big_2.submission_date and big_2.max_rank = 1
join hackers h on h.hacker_id = big_2.hacker_id
order by 1;
【Comments】:
【Answer 6】:
Try the simple query below. It was tested with the sample data provided underneath it.
--- This CTE pulls the unique hackers who made at least 1 submission per day
WITH cte_c(submission_date,hacker_id) AS
(
SELECT submission_date,hacker_id FROM Submissions WHERE submission_date = '2016-03-01'
UNION ALL
SELECT A.submission_date,A.hacker_id FROM Submissions A
JOIN cte_c B ON A.submission_date = DATEADD(dd,1,B.submission_date) and A.hacker_id = B.hacker_id
WHERE A.submission_date > '2016-03-01'
)
-- This CTE gives the hackers who made maximum submissions each day and assigns rank 1 to min(hacker_id)
,cte_h as
(
SELECT submission_date,hacker_id, ROW_NUMBER()OVER(PARTITION BY submission_date ORDER BY COUNT(*) DESC, hacker_id) rnk
FROM Submissions
GROUP BY submission_date,hacker_id
)
SELECT c.submission_date,c.hackers_per_day,h.hacker_id,ha.name
FROM (SELECT submission_date, COUNT(DISTINCT hacker_id) as hackers_per_day FROM cte_c GROUP BY submission_date) C
JOIN cte_h H on c.submission_date = H.submission_date and rnk = 1--and c.hacker_id = h.hacker_id
JOIN Hackers ha ON h.hacker_id = ha.hacker_id
ORDER BY c.submission_date
------- Sample Data ---------------------------------------
create table Hackers
(
hacker_id int,
name varchar(10)
)
create table Submissions
(submission_date date,
hacker_id int)
insert into Hackers Values(1,'Test1'),(2,'Test2'),(3,'Test3'),(4,'Test4'),(5,'Test5')
insert into Submissions Values('2016-03-01',1),('2016-03-01',2),('2016-03-01',3),('2016-03-01',4),
('2016-03-02',2),('2016-03-02',2),('2016-03-02',3),('2016-03-02',4),('2016-03-02',3),
('2016-03-03',5),('2016-03-03',1),('2016-03-03',2),('2016-03-03',4),('2016-03-03',1),
('2016-03-04',1),('2016-03-04',2),('2016-03-04',5),('2016-03-04',2)
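One point to watch with this answer on larger inputs: the recursive CTE reads Submissions directly, so a hacker with several submissions on one day contributes several rows, and each recursion step joins every carried row to every raw row for the next day. COUNT(DISTINCT hacker_id) still returns the right figure, but the intermediate row count grows as the product of the daily submission counts. The Python sketch below models that row count on the sample data from this answer. The function name and the `dedupe` flag are hypothetical, and the flag stands for grouping by date and hacker before recursing, as Answer 1 does.

```python
from collections import Counter
from datetime import date, timedelta

def recursive_row_counts(submissions, start, dedupe=False):
    """Rows the recursion would carry per (date, hacker), duplicates included."""
    rows = Counter(set(submissions) if dedupe else submissions)
    produced = {}
    frontier = {h: n for (d, h), n in rows.items() if d == start}   # anchor member
    day = start
    while frontier:
        for h, n in frontier.items():
            produced[(day, h)] = n
        day += timedelta(days=1)
        # Recursive member: each carried row matches every row of the same hacker on the next day.
        frontier = {h: n * rows[(day, h)] for h, n in frontier.items() if rows[(day, h)]}
    return produced

d1, d2, d3, d4 = (date(2016, 3, n) for n in (1, 2, 3, 4))
# The sample data listed in this answer, as (submission_date, hacker_id) pairs.
submissions = [(d1, 1), (d1, 2), (d1, 3), (d1, 4),
               (d2, 2), (d2, 2), (d2, 3), (d2, 4), (d2, 3),
               (d3, 5), (d3, 1), (d3, 2), (d3, 4), (d3, 1),
               (d4, 1), (d4, 2), (d4, 5), (d4, 2)]
raw = recursive_row_counts(submissions, d1)
grouped = recursive_row_counts(submissions, d1, dedupe=True)
print(sum(raw.values()), sum(grouped.values()))
```

Over a 15-day contest in which hackers submit several times a day, that product can become large, so deduplicating before the recursion is likely the safer choice.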
【Comments】: