How to get distinct count for last x weeks data but group by week in Redshift?
【Title】: How to get distinct count for last x weeks data but group by week in redshift?
【Posted】: 2020-10-29 05:44:05
【Question】: I have the query below. When I run it, it gives me a single count, with month set to the current month and dates_for_week listing all the dates of the previous week, from Sunday to Saturday.
select COUNT(DISTINCT(CLIENTID))
FROM process.data
where type = 'pots'
and stype= 'kites'
and tires IN ('abc', 'def', 'ghi', 'jkl')
and comp IN ('data', 'hello', 'world')
AND year = '2020'
-- this is for month october but week 43
and (month = '10' and dates_for_week IN ('18', '19', '20', '21', '22', '23', '24'))
The output I see so far looks like this:
Count
-----
982
Now I am trying to make this query dynamic so that it gives me the counts for the past 6 weeks, like this:
Count Week
------------
982 W43
123 W42
126 W41
127 W40
128 W39
129 W38
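I was able to make the above query dynamic, as shown below. It gives me the count for the current month (October) and the previous week (43), and it works fine. But I am not sure how to change it so that it gives me the past 6 weeks of data in the output format above. It looks like I also need to change the month dynamically along with the week to get the past 6 weeks.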
select COUNT(DISTINCT(CLIENTID))
FROM process.data
where type = 'pots'
and stype= 'kites'
and tires IN ('abc', 'def', 'ghi', 'jkl')
and comp IN ('data', 'hello', 'world')
AND year = '2020'
-- this is for month october but week 43
and (
month = extract(month from current_date)
and dates_for_week IN (
select
date_part('d',((DATE_TRUNC('week', CURRENT_DATE) - 9) + row_number() over (order by true))::date)
from process.data
limit 7
)
)
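So what I need is the data for the past 6 weeks, grouped by week, with counts as shown above. Is this possible? The per-week filters would look like this: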
and (month = '10' and dates_for_week IN ('18', '19', '20', '21', '22', '23', '24'))
and (month = '10' and dates_for_week IN ('11', '12', '13', '14', '15', '16', '17'))
and (month = '10' and dates_for_week IN ('4', '5', '6', '7', '8', '9', '10'))
and (month IN ('9', '10') and dates_for_week IN ('27', '28', '29', '30', '1', '2', '3'))
and (month = '9' and dates_for_week IN ('20', '21', '22', '23', '24', '25', '26'))
and (month = '9' and dates_for_week IN ('13', '14', '15', '16', '17', '18', '19'))
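The six filters above can also be derived rather than typed by hand. The following is a minimal sketch in Python, not part of the original thread, that lists the most recent complete Sunday-to-Saturday weeks for a given day. The function name last_full_weeks is hypothetical, and labelling each week with the ISO week number of the Monday after its Sunday is an assumption chosen so that Oct 18 to Oct 24 comes out as W43, as in the expected output.

```python
from datetime import date, timedelta


def last_full_weeks(today: date, n: int = 6):
    """Return (label, sunday, saturday) for the n most recent complete
    Sunday-to-Saturday weeks before `today`, newest first."""
    # Python's weekday() is Monday=0 ... Sunday=6, so this is the number
    # of days elapsed since the most recent Sunday (0 if today is Sunday).
    days_since_sunday = (today.weekday() + 1) % 7
    current_sunday = today - timedelta(days=days_since_sunday)

    weeks = []
    for k in range(1, n + 1):
        sunday = current_sunday - timedelta(weeks=k)
        saturday = sunday + timedelta(days=6)
        # Assumption: label with the ISO week of the following Monday.
        iso_week = (sunday + timedelta(days=1)).isocalendar()[1]
        weeks.append((f"W{iso_week}", sunday, saturday))
    return weeks


for label, start, end in last_full_weeks(date(2020, 10, 29)):
    print(label, start, end)
```

For 2020-10-29 this lists W43 (2020-10-18 to 2020-10-24) down to W38 (2020-09-13 to 2020-09-19). The W40 week runs from 2020-09-27 to 2020-10-03, which is why a filter built from separate month and day lists is awkward for a week that crosses a month boundary.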
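【Comments】:
I had to read Philipp Johannis's answer to finally understand that dates_for_week is just the day of the month, e.g. 29 for 2020-10-29. The name confused me. Is there a reason for storing year, month and day separately? Why not store the date as a date? That would make querying the data much easier and would also stop the database from storing invalid dates (e.g. 2020-02-30 or 2020-29-10).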
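To illustrate the point about invalid dates, here is a small Python sketch (not from the thread) that combines separately stored year, month and day strings into a real date and rejects combinations that are not calendar dates. The helper name build_date is hypothetical.

```python
from datetime import date
from typing import Optional


def build_date(year: str, month: str, day: str) -> Optional[date]:
    """Combine separately stored year/month/day strings into a date.

    Returns None when the parts do not form a real calendar date.
    """
    try:
        return date(int(year), int(month), int(day))
    except ValueError:
        # Raised both for non-numeric parts and for impossible dates.
        return None


print(build_date("2020", "10", "29"))  # 2020-10-29
print(build_date("2020", "02", "30"))  # None (February has no 30th)
print(build_date("2020", "29", "10"))  # None (there is no month 29)
```

A proper date column gives the same guarantee at write time, so the check does not have to be repeated in every query.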
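【Answer 1】:
If I understand correctly, you have year, month and day in separate columns. I think the easiest approach is to "build" a proper date column and then work with that.
The following query should give you the last 6 full weeks plus the current week so far.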
select
EXTRACT(week from TO_DATE(year||'-'||month||'-'|| dates_for_week,'YYYY-MM-DD')) week_num
,COUNT(DISTINCT(CLIENTID))
FROM process.data
where type = 'pots'
and stype= 'kites'
and tires IN ('abc', 'def', 'ghi', 'jkl')
and comp IN ('data', 'hello', 'world')
and TO_DATE(year||'-'||month||'-'|| dates_for_week,'YYYY-MM-DD') >= DATEADD(day,-42,DATE_TRUNC('week', sysdate))
GROUP BY 1
ORDER BY 1 desc
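However, since weeks in Redshift start on Monday, this can be a problem for Sunday-to-Saturday weeks, so a small adjustment (adding one day) may be needed: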
select
EXTRACT(week from DATEADD(day,1,TO_DATE(year||'-'||month||'-'|| dates_for_week,'YYYY-MM-DD'))) week_num
,COUNT(DISTINCT(CLIENTID))
FROM process.data
where type = 'pots'
and stype= 'kites'
and tires IN ('abc', 'def', 'ghi', 'jkl')
and comp IN ('data', 'hello', 'world')
and DATEADD(day,1,TO_DATE(year||'-'||month||'-'|| dates_for_week,'YYYY-MM-DD')) BETWEEN DATEADD(day,-42,DATE_TRUNC('week', sysdate)) AND DATEADD(day,-1,DATE_TRUNC('week', sysdate))
GROUP BY 1
ORDER BY 1 desc
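The one-day shift is easier to check outside the database. The sketch below is a Python mirror of the date arithmetic in the adjusted query, written for illustration only. The function names are hypothetical, and it assumes that EXTRACT(week ...) follows ISO 8601 numbering and that DATE_TRUNC('week', ...) returns the Monday of the current week.

```python
from datetime import date, timedelta


def shifted_week(d: date) -> int:
    """ISO week of d + 1 day, so a Sunday is counted with the week
    that starts on the following Monday."""
    return (d + timedelta(days=1)).isocalendar()[1]


def in_last_six_weeks(d: date, today: date) -> bool:
    """Mirror of the BETWEEN predicate in the adjusted query."""
    # Like DATE_TRUNC('week', sysdate): the Monday of the current week.
    monday = today - timedelta(days=today.weekday())
    # Like DATEADD(day, 1, <built date>).
    shifted = d + timedelta(days=1)
    lower = monday - timedelta(days=42)
    upper = monday - timedelta(days=1)
    return lower <= shifted <= upper


today = date(2020, 10, 29)
print(in_last_six_weeks(date(2020, 9, 13), today))   # True  (first Sunday)
print(in_last_six_weeks(date(2020, 10, 24), today))  # True  (last Saturday)
print(in_last_six_weeks(date(2020, 10, 25), today))  # False (current week)
print(shifted_week(date(2020, 10, 18)))              # 43
```

In this mirror, a run on a Sunday still truncates to the previous Monday, so the week that ended the day before falls outside the window. Whether that matters depends on when the report is scheduled.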
Debugging:
I would start by running this query to check that the dates are computed correctly:
select COUNT(DISTINCT(CLIENTID))
FROM process.data
where type = 'pots'
and stype= 'kites'
and tires IN ('abc', 'def', 'ghi', 'jkl')
and comp IN ('data', 'hello', 'world')
AND year = '2020'
-- this is for month october but week 43
and TO_DATE(year||'-'||month||'-'|| dates_for_week,'YYYY-MM-DD') between '2020-10-18' and '2020-10-24'
Then I would check whether the week number is computed correctly:
select
EXTRACT(week from DATEADD(day,1,TO_DATE(year||'-'||month||'-'|| dates_for_week,'YYYY-MM-DD'))) week_num
,COUNT(DISTINCT(CLIENTID))
FROM process.data
where type = 'pots'
and stype= 'kites'
and tires IN ('abc', 'def', 'ghi', 'jkl')
and comp IN ('data', 'hello', 'world')
AND year = '2020'
-- this is for month october but week 43
and TO_DATE(year||'-'||month||'-'|| dates_for_week,'YYYY-MM-DD') between '2020-10-18' and '2020-10-24'
group by 1
order by 1
Last but not least, I would extend the time range and make it dynamic:
select
EXTRACT(week from DATEADD(day,1,TO_DATE(year||'-'||month||'-'|| dates_for_week,'YYYY-MM-DD'))) week_num
,COUNT(DISTINCT(CLIENTID))
FROM process.data
where type = 'pots'
and stype= 'kites'
and tires IN ('abc', 'def', 'ghi', 'jkl')
and comp IN ('data', 'hello', 'world')
AND year = '2020'
-- this is for month october but week 43
and DATEADD(day,1,TO_DATE(year||'-'||month||'-'|| dates_for_week,'YYYY-MM-DD')) Between DATEADD(day,-42,DATE_TRUNC('week', sysdate)) and DATEADD(day,-1,DATE_TRUNC('week', sysdate))
group by 1
order by 1
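To see which dates end up in which week bucket, the grouping key can be reproduced on a handful of dates. This is a minimal Python sketch under the same assumption as the queries above (week number taken from the date plus one day, ISO numbering). The name bucket_by_week and the sample dates are hypothetical and chosen only to sit on week boundaries.

```python
from collections import defaultdict
from datetime import date, timedelta


def bucket_by_week(dates):
    """Group dates by the ISO week of (date + 1 day), which yields
    Sunday-to-Saturday buckets."""
    buckets = defaultdict(list)
    for d in sorted(dates):
        week = (d + timedelta(days=1)).isocalendar()[1]
        buckets[week].append(d)
    return dict(buckets)


# Hypothetical sample dates around the week 43 boundaries.
sample = [date(2020, 10, 17), date(2020, 10, 18),
          date(2020, 10, 24), date(2020, 10, 25)]
for week, days in bucket_by_week(sample).items():
    print(week, [d.isoformat() for d in days])
```

Saturday 2020-10-17 lands in week 42, Sunday 2020-10-18 and Saturday 2020-10-24 in week 43, and Sunday 2020-10-25 in week 44. The key is the week number alone, so data spanning a year boundary would also need the year in the key.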
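【Comments】:
This is great. I learned a lot. I didn't know we could write queries like this. Looks like I was overcomplicating it myself. Thanks for your help! Let me try this out and make sure I understand it.
I tried this query, but when I compare the output of the dynamic query with my manual per-week queries, the results somehow don't match. Also, in my case I don't need the current week. I need the trailing six weeks, i.e. the previous week and the 5 weeks before it. Any idea why the results don't match? Is there a way to split this query so that I can debug it and see which dates it uses for each week, since I need every week to run from Sunday to Saturday?
Added some debugging and found a small bug. I was subtracting a day to get Sunday to Saturday, but I should have been adding a day. Can you try again?
I ran all 3 queries from your debugging section. If I compare your first and second debug queries with my original manual query, the data matches exactly, but when I run the third debug query I notice a few things. First, it gives data for weeks 37, 38, 39, 40, 41, 42, 43, but I only need the last 6 weeks, so I don't need week 37. Second, the data for weeks 38, 39, 40, 41, 42 from the third query matches exactly, but the week 43 data from the third query doesn't match your first and second queries. Any ideas?
So I guess the problem is in the computed date range. I double-checked it, and I think this should be the correct where condition: and DATEADD(day,1,TO_DATE(year||'-'||month||'-'|| dates_for_week,'YYYY-MM-DD')) Between DATEADD(day,-42,DATE_TRUNC('week', sysdate)) and DATEADD(day,-1,DATE_TRUNC('week', sysdate)) - you may need to play with the -42 - it means you go back 42 days from the start of the current week, which should be 6 weeks since 6*7 = 42.
【Answer 2】: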
Assuming you have some kind of date column, you could simply use something like this:
select date_part(w, your_date_column) as week_number,
COUNT(DISTINCT(CLIENTID))
FROM process.data
where type = 'pots'
and stype= 'kites'
and tires IN ('abc', 'def', 'ghi', 'jkl')
and comp IN ('data', 'hello', 'world')
AND year = '2020'
group by 1
【Comments】:
【Answer 3】: You can use order by and limit:
select year, week, COUNT(DISTINCT CLIENTID)
from process.data
where type = 'pots' and
stype= 'kites' and
tires IN ('abc', 'def', 'ghi', 'jkl') and
comp IN ('data', 'hello', 'world')
group by year, week
order by year desc, week desc
limit 6;
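This assumes that you have a week column, which seems like a reasonable assumption.
This is a simple way to do what you want. I would guess it has decent performance on Redshift.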
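The logic of this approach can be sketched in Python: group by (year, week), count distinct clients, sort newest first, and keep the first six groups. This is an illustration only. The function name latest_weeks and the rows are hypothetical, and year and week are assumed to be numeric.

```python
from collections import defaultdict


def latest_weeks(rows, n=6):
    """rows: iterable of (year, week, client_id) tuples.

    Returns [(year, week, distinct_client_count)], newest week first,
    limited to the n most recent weeks that have any rows.
    """
    clients = defaultdict(set)
    for year, week, client_id in rows:
        clients[(year, week)].add(client_id)
    # Sorting (year, week) tuples in reverse corresponds to
    # ORDER BY year DESC, week DESC; the slice corresponds to LIMIT n.
    newest_first = sorted(clients, reverse=True)[:n]
    return [(y, w, len(clients[(y, w)])) for y, w in newest_first]


# Hypothetical rows for illustration only.
rows = [(2020, 43, "a"), (2020, 43, "a"), (2020, 43, "b"),
        (2020, 42, "c"), (2021, 1, "a")]
print(latest_weeks(rows, n=2))  # [(2021, 1, 1), (2020, 43, 2)]
```

Two caveats follow from this shape. The limit keeps the six most recent weeks that have data, which includes the current week if it already has rows. And if week were stored as text, the ordering would be lexicographic ('9' sorts after '10'), so a cast to a number would be needed.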
【Comments】: