SQL doesn't work on a large sample

【Posted】: 2017-03-17 14:44:51 【Question】:

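I'm trying to solve a challenge and have come up with a solution. It works on the small data set but doesn't seem to work on a larger one. Could someone help me see where I'm going wrong?

I'm having trouble calculating the number of unique hackers for each day (the second column of the output). The rest of the logic works fine.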

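Julia conducted a 15-day SQL learning contest. The contest started on March 1, 2016 and ended on March 15, 2016.

Write a query to print the total number of unique hackers who made at least one submission each day (starting on the first day of the contest), and find the hacker_id and name of the hacker who made the maximum number of submissions each day. If more than one such hacker has the maximum number of submissions, print the lowest hacker_id. The query should print this information for each day of the contest, sorted by date.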

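Input Format

The following tables contain the contest data:

Hackers: hacker_id is the id of the hacker, and name is the name of the hacker.

Submissions: submission_date is the date of the submission, submission_id is the id of the submission, hacker_id is the id of the hacker who made the submission, and score is the score of the submission.

Sample Input

For the following sample input, assume that the end date of the contest was March 6, 2016.

[Hackers table and Submissions table: images not preserved]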

**Explanation :-**

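On March 1, 2016, hackers […], […], […], and […] made submissions. There are 4 unique hackers who made at least one submission each day. As each hacker made one submission, 20703 is considered to be the hacker who made the maximum number of submissions on this day. The name of the hacker is Angela.

On March 2, 2016, hackers […], […], and […] made submissions. Now […] and […] were the only ones who had submitted every day, so there are 2 unique hackers who made at least one submission each day. 79722 made […] submissions, and the name of the hacker is Michael.

On March 3, 2016, hackers […], […], and […] made submissions. Now […] and […] were the only ones who had submitted every day, so there are 2 unique hackers who made at least one submission each day. As each hacker made one submission, 20703 is considered to be the hacker who made the maximum number of submissions on this day. The name of the hacker is Angela.

On March 4, 2016, hackers […], […], […], and […] made submissions. Now […] and […] were the only ones who had submitted every day, so there are 2 unique hackers who made at least one submission each day. As each hacker made one submission, 20703 is considered to be the hacker who made the maximum number of submissions on this day. The name of the hacker is Angela.

On March 5, 2016, hackers […], […], and […] made submissions. Now only […] had submitted every day, so there is only 1 unique hacker who made at least one submission each day. 36396 made […] submissions, and the name of the hacker is Frank.

On March 6, 2016, only […] made a submission, so there is only 1 unique hacker who made at least one submission each day. 20703 made […] submission, and the name of the hacker is Angela.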

Sample Output

2016-03-01 4 20703 Angela
2016-03-02 2 79722 Michael
2016-03-03 2 20703 Angela
2016-03-04 2 20703 Angela
2016-03-05 1 36396 Frank
2016-03-06 1 20703 Angela
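
The task statement can also be restated as a small reference implementation, which is handy for checking a query's output on larger inputs. This is only a sketch in Python: the rows and names below are made up (they are not the contest data), dates are ISO strings, and it assumes every contest day has at least one submission.

```python
from collections import Counter, defaultdict

# Made-up rows shaped like the Submissions table: (submission_date, hacker_id).
SUBMISSIONS = [
    ("2016-03-01", 1), ("2016-03-01", 2), ("2016-03-01", 3), ("2016-03-01", 4),
    ("2016-03-02", 2), ("2016-03-02", 2), ("2016-03-02", 3), ("2016-03-02", 4),
    ("2016-03-02", 3),
    ("2016-03-03", 5), ("2016-03-03", 1), ("2016-03-03", 2), ("2016-03-03", 4),
    ("2016-03-03", 1),
    ("2016-03-04", 1), ("2016-03-04", 2), ("2016-03-04", 5), ("2016-03-04", 2),
]
# Made-up stand-in for the Hackers table: hacker_id -> name.
HACKERS = {1: "Test1", 2: "Test2", 3: "Test3", 4: "Test4", 5: "Test5"}


def daily_report(submissions, hackers):
    """Return (date, streak_count, top_hacker_id, top_hacker_name) per day."""
    per_day = defaultdict(Counter)  # date -> {hacker_id: submissions that day}
    for day, hacker_id in submissions:
        per_day[day][hacker_id] += 1

    report = []
    streak = None  # hackers who have submitted on every day seen so far
    for day in sorted(per_day):  # ISO date strings sort chronologically
        todays_hackers = set(per_day[day])
        streak = todays_hackers if streak is None else streak & todays_hackers
        counts = per_day[day]
        # Highest submission count wins; ties go to the lowest hacker_id.
        top_id = min(counts, key=lambda h: (-counts[h], h))
        report.append((day, len(streak), top_id, hackers[top_id]))
    return report


for row in daily_report(SUBMISSIONS, HACKERS):
    print(*row)
```

The second column is a running intersection: a hacker stays in the set only while they keep submitting every day from the first day onward.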

Schema & Data :-

http://sqlfiddle.com/#!9/844928

Solution :-


SELECT A.submission_date, A.cnt, B.hacker_id, B.name 
  FROM
    (
        SELECT submission_date, COUNT( DISTINCT hacker_id ) AS cnt
          FROM submissions
         WHERE submission_date = '2016-03-01'
         GROUP BY submission_date 
        UNION ALL
        SELECT submission_date, COUNT( DISTINCT hacker_id )
          FROM
            (
                SELECT DATEADD(day, 1, convert( date, A.submission_date ))  AS submission_date, A.hacker_id
                  FROM 
                    (
                       SELECT submission_date, hacker_id
                         FROM submissions
                       GROUP BY submission_date, hacker_id
                     ) A
                INNER  JOIN  
                    (
                         SELECT DATEADD(day, -1, convert( date, submission_date )) AS new_submission_date, hacker_id
                           FROM submissions
                          GROUP BY DATEADD(day, -1, convert( date, submission_date )) , hacker_id
                     ) B
              ON A.submission_date = B.new_submission_date
             AND A.hacker_id = B.hacker_id  
            ) Z
        GROUP BY submission_date
    ) A
INNER JOIN 
(
    SELECT s.submission_date, s.hacker_id, h.name
      FROM
    (
        SELECT submission_date, hacker_id 
          FROM
        ( 
            SELECT submission_date, hacker_id,cnt, ROW_NUMBER() OVER ( PARTITION BY submission_date ORDER BY cnt DESC, hacker_id ) AS rn
              FROM 
            (
             SELECT submission_date, hacker_id, COUNT(*) AS cnt
               FROM submissions
              GROUP BY submission_date, hacker_id
            ) Z
        ) Y
        WHERE rn = 1
    ) s
    INNER JOIN
    hackers h
    ON s.hacker_id = h.hacker_id
) B
ON A.submission_date = B.submission_date
;
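
One possible reading of the query above, offered as a hypothesis rather than a confirmed diagnosis: the self-join pairs each date only with the following calendar day, so the second column counts hackers who submitted on that day and the day before, not hackers who submitted on every day since March 1. The Python sketch below contrasts the two counts on made-up rows (not the contest data) and assumes the dates in the data are consecutive.

```python
from collections import defaultdict

# Made-up rows shaped like the Submissions table: (submission_date, hacker_id).
ROWS = [
    ("2016-03-01", 1), ("2016-03-01", 2), ("2016-03-01", 3), ("2016-03-01", 4),
    ("2016-03-02", 2), ("2016-03-02", 3), ("2016-03-02", 4),
    ("2016-03-03", 5), ("2016-03-03", 1), ("2016-03-03", 2), ("2016-03-03", 4),
    ("2016-03-04", 1), ("2016-03-04", 2), ("2016-03-04", 5),
]


def hackers_by_day(rows):
    """Return one set of hacker ids per date, in chronological order."""
    by_day = defaultdict(set)
    for day, hacker_id in rows:
        by_day[day].add(hacker_id)
    return [by_day[day] for day in sorted(by_day)]


def previous_day_counts(days):
    """Count hackers who submitted today and yesterday (pairwise join)."""
    counts = [len(days[0])]
    for previous, current in zip(days, days[1:]):
        counts.append(len(previous & current))
    return counts


def full_streak_counts(days):
    """Count hackers who submitted on every day from the first day onward."""
    counts, streak = [], None
    for current in days:
        streak = set(current) if streak is None else streak & current
        counts.append(len(streak))
    return counts


days = hackers_by_day(ROWS)
print(previous_day_counts(days))  # [4, 3, 2, 3]
print(full_streak_counts(days))   # [4, 3, 2, 1]
```

The two counts agree until a hacker who skipped a day comes back (hackers 1 and 5 on the last day here), which a small sample may never exercise.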

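【Comments】:

I added the sqlfiddle link to create the schema and data....

What exactly does "it doesn't seem to work" mean?

What does the colleges table in your sqlfiddle have to do with this?

This looks a lot like homework to me. So we can point you in the right direction, but you should do it yourself. Or, if you say this is a challenge, how can you submit an entry if someone else has to write it for you?

@Carra How would an execution plan help debug a logic error?

【Answer 1】: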
WITH 
  unique_hackers_on_dates AS
(
   SELECT   
      Submissions.submission_date,
      Submissions.hacker_id,
      COUNT(1) subs_per_hacker_per_day,
      MAX(COUNT(1)) OVER (PARTITION BY submission_date) max_subs_per_day
   FROM
      Submissions
   GROUP BY
      Submissions.submission_date,
      Submissions.hacker_id      
), 
   hacker_with_max_sub AS
(
   SELECT
      unique_hackers_on_dates.submission_date,
      MIN(hacker_id) min_hacker_id
   FROM
      unique_hackers_on_dates
   WHERE
      unique_hackers_on_dates.subs_per_hacker_per_day = unique_hackers_on_dates.max_subs_per_day
   GROUP BY
      unique_hackers_on_dates.submission_date
), 
   dates AS
(
   SELECT   
      unique_hackers_on_dates.submission_date, 
      unique_hackers_on_dates.hacker_id
   FROM
      unique_hackers_on_dates
   WHERE
      unique_hackers_on_dates.submission_date = CAST('2016-03-01' AS Date)
   UNION ALL
      SELECT   
         unique_hackers_on_dates.submission_date, 
         unique_hackers_on_dates.hacker_id
      FROM     
         dates
      INNER JOIN
         unique_hackers_on_dates 
         ON dates.hacker_id = unique_hackers_on_dates.hacker_id AND
            DATEADD(DAY, 1, dates.submission_date) = unique_hackers_on_dates.submission_date
), 
   consec_hackers as
(
   SELECT 
      submission_date,
      count(1) num_consec_hackers
   FROM 
      dates
   GROUP BY
      submission_date
)
SELECT
   consec_hackers.submission_date,
   consec_hackers.num_consec_hackers,
   hacker_with_max_sub.min_hacker_id,
   Hackers.name
FROM
   consec_hackers
INNER JOIN
   hacker_with_max_sub
   on consec_hackers.submission_date = hacker_with_max_sub.submission_date
INNER JOIN
   Hackers 
   ON hacker_with_max_sub.min_hacker_id = Hackers.hacker_id
ORDER BY 
    consec_hackers.submission_date;
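
The recursive step in the `dates` CTE can be tried out without SQL Server. The sketch below is an assumption-laden port, not the answer's code: it runs a similar recursive CTE in SQLite through Python's `sqlite3` module, uses SQLite's `date(x, '+1 day')` in place of `DATEADD`, stores dates as ISO text, and loads made-up rows.

```python
import sqlite3

# Made-up rows shaped like the Submissions table: (submission_date, hacker_id).
ROWS = [
    ("2016-03-01", 1), ("2016-03-01", 2), ("2016-03-01", 3), ("2016-03-01", 4),
    ("2016-03-02", 2), ("2016-03-02", 2), ("2016-03-02", 3), ("2016-03-02", 4),
    ("2016-03-03", 5), ("2016-03-03", 1), ("2016-03-03", 2), ("2016-03-03", 4),
    ("2016-03-04", 1), ("2016-03-04", 2), ("2016-03-04", 5),
]

QUERY = """
WITH RECURSIVE
  daily AS (
    -- one row per hacker per day, however many submissions they made
    SELECT DISTINCT submission_date, hacker_id FROM submissions
  ),
  streak(submission_date, hacker_id) AS (
    -- anchor: everyone who submitted on the first day
    SELECT submission_date, hacker_id FROM daily
    WHERE submission_date = '2016-03-01'
    UNION ALL
    -- step: keep a hacker only if they also submitted on the next day
    SELECT d.submission_date, d.hacker_id
    FROM streak
    JOIN daily AS d
      ON d.hacker_id = streak.hacker_id
     AND d.submission_date = date(streak.submission_date, '+1 day')
  )
SELECT submission_date, COUNT(*) FROM streak
GROUP BY submission_date
ORDER BY submission_date
"""


def streak_counts(rows):
    """Load rows into an in-memory SQLite table and run the recursive query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE submissions (submission_date TEXT, hacker_id INTEGER)"
        )
        conn.executemany("INSERT INTO submissions VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


for day, count in streak_counts(ROWS):
    print(day, count)
```

Because each recursive step only extends rows that already exist, a hacker who misses a day drops out and cannot re-enter, which is the behavior the task asks for.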

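【Comments】:

I've just started working on this problem, and I'm curious whether there is a more efficient way to do this than using subqueries.

【Answer 2】:

Check the following query on rextester.com: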

WITH
  a AS
  (
    SELECT
      submission_date,
      hacker_id,
      COUNT(*) AS submissions_by_hacker,
      DENSE_RANK() OVER (ORDER BY submission_date) AS sequence_number_by_date,
      DENSE_RANK() OVER
      (
        PARTITION BY hacker_id ORDER BY submission_date
      ) AS sequence_number_by_hacker,
      RANK() OVER
      (
        PARTITION BY submission_date ORDER BY count(*) DESC
      ) AS rank_by_hacker_submissions
    FROM #submissions
    GROUP BY submission_date, hacker_id
  ),
  b AS
  (
    SELECT
      *,
      MIN(IIF(rank_by_hacker_submissions = 1, hacker_id, NULL)) OVER
      (
        PARTITION BY submission_date
      ) AS min_hacker_id
    FROM a
  )
SELECT
  b.submission_date,
  h.hacker_id,
  COUNT(*) AS quantity_of_hackers_who_made_at_least_submission_each_day,
  h.name AS hacker_name
FROM b JOIN #hackers AS h ON b.min_hacker_id = h.hacker_id
WHERE b.sequence_number_by_date = b.sequence_number_by_hacker
GROUP BY b.submission_date, h.hacker_id, h.name
ORDER BY b.submission_date, h.hacker_id;

Output:

+---------------------+-----------+-----------------------------------------------------------+-------------+
|   submission_date   | hacker_id | quantity_of_hackers_who_made_at_least_submission_each_day | hacker_name |
+---------------------+-----------+-----------------------------------------------------------+-------------+
| 01.03.2016 00:00:00 |     20703 |                                                         4 | Angela      |
| 02.03.2016 00:00:00 |     79722 |                                                         2 | Michael     |
| 03.03.2016 00:00:00 |     20703 |                                                         2 | Angela      |
| 04.03.2016 00:00:00 |     20703 |                                                         2 | Angela      |
| 05.03.2016 00:00:00 |     36396 |                                                         1 | Frank       |
| 06.03.2016 00:00:00 |     20703 |                                                         1 | Angela      |
+---------------------+-----------+-----------------------------------------------------------+-------------+
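
The filter `sequence_number_by_date = sequence_number_by_hacker` carries the streak logic: a hacker's n-th distinct submission date can only be the n-th distinct date overall if that hacker skipped none of the earlier dates. A Python sketch of the same comparison follows, using made-up rows. Like the query, it ranks only dates that appear in the data, so it assumes at least one submission exists for every contest day.

```python
# Made-up rows shaped like the Submissions table: (submission_date, hacker_id).
ROWS = [
    ("2016-03-01", 1), ("2016-03-01", 2), ("2016-03-01", 3), ("2016-03-01", 4),
    ("2016-03-02", 2), ("2016-03-02", 2), ("2016-03-02", 3), ("2016-03-02", 4),
    ("2016-03-03", 5), ("2016-03-03", 1), ("2016-03-03", 2), ("2016-03-03", 4),
    ("2016-03-04", 1), ("2016-03-04", 2), ("2016-03-04", 5),
]


def streak_counts_by_rank(rows):
    """Count, per date, the hackers whose own date rank equals the global one."""
    all_days = sorted({day for day, _ in rows})
    # Plays the role of DENSE_RANK() OVER (ORDER BY submission_date).
    date_rank = {day: rank for rank, day in enumerate(all_days, start=1)}

    days_by_hacker = {}
    for day, hacker_id in rows:
        days_by_hacker.setdefault(hacker_id, set()).add(day)

    counts = dict.fromkeys(all_days, 0)
    for hacker_days in days_by_hacker.values():
        # Plays the role of DENSE_RANK() OVER (PARTITION BY hacker_id ORDER BY date).
        for hacker_rank, day in enumerate(sorted(hacker_days), start=1):
            if hacker_rank == date_rank[day]:
                counts[day] += 1
    return counts


print(streak_counts_by_rank(ROWS))
```

Hacker 5 first appears on the third date, so their ranks (1, 2) never match the global ranks (3, 4) and they are never counted.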

【Comments】:

【Answer 3】:
IF OBJECT_ID('tempdb..#Results') IS NOT NULL
    DROP TABLE #Results;
CREATE TABLE #Results
([Number of Hackers that had a Submission]                 INT,
 SubmissionDate                                            DATE,
 [Greatest # of Submissions by Hacker (lowest ID if tied)] INT,
 [Hacker Name with Most Submissions]                       VARCHAR(50)
);
DECLARE @CurrentDate DATE;
DECLARE my CURSOR
FOR SELECT DISTINCT
           submission_date
    FROM submissions;
OPEN my;
FETCH NEXT FROM my INTO @CurrentDate;
WHILE @@FETCH_STATUS = 0
    BEGIN
        INSERT INTO #Results
               SELECT a.hackers [Number of Hackers that had a Submission],
                      a.SubmissionDate,
                      b.Submission_Count [Greatest # of Submissions by Hacker (lowest ID if tied)],
                      b.Hacker [Hacker Name with Most Submissions]
               FROM
               (
                   SELECT COUNT(DISTINCT hacker_ID) hackers,
                          @CurrentDate [SubmissionDate]
                   FROM submissions
                   WHERE submission_date = @CurrentDate
               ) a
               JOIN
               (
                   SELECT TOP 1 COUNT(submission_id) Submission_Count,
                                b.name [Hacker],
                                submission_date
                   FROM submissions a
                        JOIN hackers b ON a.hacker_id = b.hacker_id
                   WHERE a.submission_date = @currentDate
                   GROUP BY b.name,
                            a.hacker_id,
                            submission_date
                   ORDER BY COUNT(submission_id) DESC,
                            a.hacker_id
               ) b ON a.SubmissionDate = b.submission_date;
        FETCH NEXT FROM my INTO @CurrentDate;
    END;
CLOSE my;
DEALLOCATE my;
SELECT *
FROM #Results;

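I usually don't like using cursors, but this one is fast for small data and makes it easy to evaluate each date on its own.

Your results are close but not the same as what I get. I don't have time to diagnose your query, so use this to compare and contrast.

Given that you posted this on March 17, I'm guessing, and hoping, that this was homework and is now past due... and that I'm not helping you cheat...

Good luck!

Results: [image not preserved]

【Comments】:

【Answer 4】:

Try the following query: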

select submission_date ,( SELECT COUNT(distinct hacker_id)  
                    FROM Submissions s2  
                    WHERE s2.submission_date = s1.submission_date AND 
                    (SELECT COUNT(distinct s3.submission_date) 
                     FROM Submissions s3 
                     WHERE s3.hacker_id = s2.hacker_id AND  
     s3.submission_date < s1.submission_date) = dateDIFF(s1.submission_date , '2016-03-01')) ,

        (select hacker_id  from submissions s2 
         where s2.submission_date = s1.submission_date 
           group by hacker_id 
         order by count(submission_id) desc , hacker_id limit 1) as hack,
    (select name from hackers where hacker_id = hack)
    from 
    (select distinct submission_date from submissions) s1
    group by submission_date;
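
The correlated subquery above relies on a counting argument: a hacker has an unbroken streak up to a given day exactly when the number of distinct earlier dates on which they submitted equals the number of days elapsed since the contest start. A Python sketch of that test, on made-up rows and with the start date passed in explicitly (both are assumptions for illustration):

```python
from datetime import date


def d(day):
    """Helper for the made-up data: a date in March 2016."""
    return date(2016, 3, day)


# Made-up rows shaped like the Submissions table: (submission_date, hacker_id).
ROWS = [
    (d(1), 1), (d(1), 2), (d(1), 3), (d(1), 4),
    (d(2), 2), (d(2), 2), (d(2), 3), (d(2), 4),
    (d(3), 5), (d(3), 1), (d(3), 2), (d(3), 4),
    (d(4), 1), (d(4), 2), (d(4), 5),
]


def streak_counts_by_elapsed_days(rows, start):
    """Per date, count hackers whose earlier distinct dates fill every prior day."""
    days_by_hacker = {}
    for day, hacker_id in rows:
        days_by_hacker.setdefault(hacker_id, set()).add(day)

    counts = {}
    for day in sorted({day for day, _ in rows}):
        elapsed = (day - start).days  # same role as DATEDIFF(day, start)
        counts[day] = sum(
            1
            for hacker_days in days_by_hacker.values()
            if day in hacker_days
            and sum(1 for other in hacker_days if other < day) == elapsed
        )
    return counts


for day, count in streak_counts_by_elapsed_days(ROWS, d(1)).items():
    print(day.isoformat(), count)
```

This version does not depend on every contest day having data, but it rescans each hacker's dates for every day, which mirrors the cost of the nested subqueries.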

【Comments】:

【Answer 5】:
select big_1.submission_date, big_1.hkr_cnt, big_2.hacker_id, h.name
from
(select submission_date, count(distinct hacker_id) as hkr_cnt
from 
(select s.*, dense_rank() over(order by submission_date) as date_rank, 
dense_rank() over(partition by hacker_id order by submission_date) as hacker_rank 
from submissions s ) a 
where date_rank = hacker_rank 
group by submission_date) big_1 
join 
(select submission_date,hacker_id, 
rank() over(partition by submission_date order by sub_cnt desc, hacker_id) as max_rank 
from (select submission_date, hacker_id, count(*) as sub_cnt 
from submissions 
group by submission_date, hacker_id) b ) big_2
on big_1.submission_date = big_2.submission_date and big_2.max_rank = 1 
join hackers h on h.hacker_id = big_2.hacker_id 
order by 1;

【Comments】:

【Answer 6】:

Try the simple query below. It was tested with the sample data provided underneath it.

--- This CTE pulls the unique hackers who made at least 1 submission per day
WITH cte_c(submission_date,hacker_id) AS
(
SELECT submission_date,hacker_id FROM Submissions WHERE  submission_date = '2016-03-01'
UNION ALL
SELECT A.submission_date,A.hacker_id FROM Submissions A
JOIN cte_c B ON A.submission_date = DATEADD(dd,1,B.submission_date) and A.hacker_id = B.hacker_id
WHERE A.submission_date > '2016-03-01'
)
-- This CTE gives the hackers who made maximum submissions each day and assigns rank 1 to min(hacker_id)
,cte_h as
(
SELECT submission_date,hacker_id, ROW_NUMBER()OVER(PARTITION BY submission_date ORDER BY COUNT(*) DESC, hacker_id) rnk
FROM Submissions
GROUP BY submission_date,hacker_id
)
SELECT c.submission_date,c.hackers_per_day,h.hacker_id,ha.name 
FROM (SELECT submission_date, COUNT(DISTINCT hacker_id) as hackers_per_day FROM cte_c GROUP BY submission_date) C
JOIN cte_h H on c.submission_date = H.submission_date  and rnk = 1--and c.hacker_id = h.hacker_id
JOIN Hackers ha  ON h.hacker_id = ha.hacker_id
ORDER BY c.submission_date
------- Sample Data ---------------------------------------
create table Hackers
(
hacker_id int,
name varchar(10)
)

create table Submissions
(submission_date date,
hacker_id int)

insert into Hackers Values(1,'Test1'),(2,'Test2'),(3,'Test3'),(4,'Test4'),(5,'Test5')
insert into Submissions Values('2016-03-01',1),('2016-03-01',2),('2016-03-01',3),('2016-03-01',4),
('2016-03-02',2),('2016-03-02',2),('2016-03-02',3),('2016-03-02',4),('2016-03-02',3),
('2016-03-03',5),('2016-03-03',1),('2016-03-03',2),('2016-03-03',4),('2016-03-03',1),
('2016-03-04',1),('2016-03-04',2),('2016-03-04',5),('2016-03-04',2)

【Comments】:
