SUM OVER with GROUP BY
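【Title】: SUM OVER with GROUP BY 【Posted】: 2017-10-30 18:15:54 【Question】: I am working with a large database containing millions of rows, and I am trying to make my queries more efficient. The database contains periodic snapshots of a loan portfolio, and occasionally loans default (the status changes from '1' to a value other than '1'). When they do, they appear only once, in the corresponding snapshot, and are not reported again. I am trying to get a cumulative count of such loans as they build up over time, split into many buckets by country of origin, vintage, and so on. SUM (...) OVER looks like a very efficient function for this, but when I run the following query: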
Select
assetcountry, edcode, vintage, aa25 as inclusionYrMo, poolcutoffdate, aa74 as status,
AA16 AS employment, AA36 AS product, AA48 AS newUsed, aa55 as customerType,
count(1) as Loans, sum(aa26) as OrigBal, sum(aa27) as CurBal,
SUM(count(1)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as LoanCountCumul,
SUM(aa27) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as CurBalCumul,
SUM(aa26) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as OrigBalCumul
from myDatabase
where aa22>='2014-01' and aa22<='2014-12' and vintage='2015' and active=0 and aa74<>'1'
group by assetcountry, edcode, vintage, aa25, aa74, aa16, aa36, aa48, aa55, poolcutoffdate
order by poolcutoffdate
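I get:
SQL Error (8120): Column aa27 is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Can anyone explain? Thanks.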
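【Comments】:
Does this question shed any light on it: ***.com/questions/10039431/how-can-i-use-sum-over
Tag your question with the database you are using.
Only 400 days after the request to add a database tag :)
【Answer 1】: I believe you want: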
Select assetcountry, edcode, vintage, aa25 as inclusionYrMo, poolcutoffdate, aa74 as status,
AA16 AS employment, AA36 AS product, AA48 AS newUsed, aa55 as customerType,
count(1) as Loans, sum(aa26) as OrigBal, sum(aa27) as CurBal,
SUM(count(1)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as LoanCountCumul,
SUM(SUM(aa27)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as CurBalCumul,
SUM(SUM(aa26)) OVER (ORDER BY [poolcutoffdate] ROWS UNBOUNDED PRECEDING) as OrigBalCumul
from myDatabase
where aa22 >= '2014-01' and aa22 <= '2014-12' and vintage = '2015' and
active = 0 and aa74 <> '1'
group by assetcountry, edcode, vintage, aa25, aa74, aa16, aa36, aa48, aa55, poolcutoffdate
order by poolcutoffdate;
Note the SUM(SUM()) in the cumulative sum expressions.
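The reason this works is that, in a query with GROUP BY, window functions are evaluated after grouping. The expression inside SUM(...) OVER can therefore only refer to grouping columns and aggregates: a bare aa27 no longer exists at that stage, which is what error 8120 reports, while SUM(aa27) does. Below is a minimal sketch of the pattern. It uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made only to get a runnable example; window functions need SQLite 3.25 or later), and the two-column table loans and its rows are made up for illustration.

```python
import sqlite3

# Hypothetical, heavily reduced version of the table in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (poolcutoffdate TEXT, aa27 REAL)")
conn.executemany(
    "INSERT INTO loans VALUES (?, ?)",
    [
        ("2015-01", 100.0),
        ("2015-01", 50.0),
        ("2015-02", 25.0),
        ("2015-03", 10.0),
        ("2015-03", 5.0),
    ],
)

# The inner COUNT/SUM are the GROUP BY aggregates (one value per date);
# the outer SUM ... OVER accumulates those per-date values in date order.
# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the long form of
# ROWS UNBOUNDED PRECEDING used in the thread.
running_totals = conn.execute(
    """
    SELECT poolcutoffdate,
           COUNT(1)  AS Loans,
           SUM(aa27) AS CurBal,
           SUM(COUNT(1)) OVER (ORDER BY poolcutoffdate
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS LoanCountCumul,
           SUM(SUM(aa27)) OVER (ORDER BY poolcutoffdate
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS CurBalCumul
    FROM loans
    GROUP BY poolcutoffdate
    ORDER BY poolcutoffdate
    """
).fetchall()

for row in running_totals:
    print(row)
```

With this sample data the per-date loan counts are 2, 1 and 2, and the cumulative column accumulates them to 2, 3 and 5.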
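【Comments】:
I thought an aggregate inside an aggregate was not allowed, i.e. sum(sum())? Whenever I try something like this I get: Cannot perform an aggregate function on an expression containing an aggregate or a subquery
Thank you very much for the insight. This is a big step in the right direction and it gets rid of the error message (!), but the total keeps accumulating, whereas I would like it to restart from zero at each change of GROUP BY bucket. For example, if the cumulative defaults are 20 for employees and 30 for the self-employed, with this solution I get an overall count of 50.
@GIG . . . Each "group by bucket" produces one row, so I don't know what you mean by "keeps accumulating". Perhaps you should ask another question, using a somewhat simplified data set.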
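One possible reading of the request above is a running total that restarts for each bucket, for example per employment category. Under that assumption, a sketch is to keep the GROUP BY and add PARTITION BY to the window. As before, Python's sqlite3 stands in for SQL Server, and the table loans with its rows is invented for illustration.

```python
import sqlite3

# Hypothetical sample data: one row per defaulted loan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (poolcutoffdate TEXT, employment TEXT)")
conn.executemany(
    "INSERT INTO loans VALUES (?, ?)",
    [
        ("2015-01", "employed"),
        ("2015-01", "employed"),
        ("2015-01", "self-employed"),
        ("2015-02", "employed"),
        ("2015-02", "self-employed"),
        ("2015-02", "self-employed"),
    ],
)

# PARTITION BY restarts the running total for every employment value;
# ORDER BY inside the window accumulates over cutoff dates within it.
per_bucket = conn.execute(
    """
    SELECT employment,
           poolcutoffdate,
           COUNT(1) AS Loans,
           SUM(COUNT(1)) OVER (PARTITION BY employment
                               ORDER BY poolcutoffdate
                               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
               AS LoanCountCumul
    FROM loans
    GROUP BY employment, poolcutoffdate
    ORDER BY employment, poolcutoffdate
    """
).fetchall()

for row in per_bucket:
    print(row)
```

Each employment category ends with its own cumulative count of 3 here, rather than a combined 6. With more grouping columns, every column that should define a separate running total would go into the PARTITION BY list.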
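I agree that I need to learn more, starting from simpler sets. In the meantime, I believe that using PARTITION BY and removing GROUP BY might go in the right direction. I also found a useful explanation of SUM OVER, PARTITION, RANGE, etc. here: sqlservercentral.com/articles/Over+Clause/132079
【Answer 2】:
This is what I found to work, after comparing my results with some external research data. I have simplified the fields for readability: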
select
poolcutoffdate,
count(1) as LoanCount,
MAX(sum(case status when 'default' then 1 else 0 end))
over (order by poolcutoffdate
ROWS between unbounded preceding AND CURRENT ROW) as CumulDefaults
from myDatabase
group by poolcutoffdate
order by poolcutoffdate asc
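This way I count all loans that were in 'default' status at least once between the start and the current cutoff date.
Note the use of MAX(SUM()), so that the result is the largest of the per-date values from the first row to the current row. Using SUM(SUM()) would instead add up the per-date values, giving a cumulation of cumulations.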
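To make the difference concrete, the sketch below computes both variants side by side on a small made-up table. It again uses Python's sqlite3 as a stand-in for SQL Server; the table loans, its status values and its rows are assumptions for illustration only.

```python
import sqlite3

# Hypothetical snapshots: 1, 3 and 2 loans in 'default' on three dates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (poolcutoffdate TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO loans VALUES (?, ?)",
    [
        ("2015-01", "default"), ("2015-01", "current"),
        ("2015-02", "default"), ("2015-02", "default"), ("2015-02", "default"),
        ("2015-03", "default"), ("2015-03", "default"), ("2015-03", "current"),
    ],
)

# Defaults   : per-date count of defaulted loans (plain GROUP BY aggregate)
# RunningMax : largest per-date count seen so far   -> MAX(SUM()) OVER
# RunningSum : per-date counts added up over time   -> SUM(SUM()) OVER
comparison = conn.execute(
    """
    SELECT poolcutoffdate,
           SUM(CASE status WHEN 'default' THEN 1 ELSE 0 END) AS Defaults,
           MAX(SUM(CASE status WHEN 'default' THEN 1 ELSE 0 END))
               OVER (ORDER BY poolcutoffdate
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningMax,
           SUM(SUM(CASE status WHEN 'default' THEN 1 ELSE 0 END))
               OVER (ORDER BY poolcutoffdate
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningSum
    FROM loans
    GROUP BY poolcutoffdate
    ORDER BY poolcutoffdate
    """
).fetchall()

for row in comparison:
    print(row)
```

On the last date the per-date count drops to 2, the running maximum stays at 3, and the running sum reaches 6. Which of the two matches the intended "cumulative defaults" depends on whether each snapshot already carries earlier defaults forward.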
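I considered using SUM(SUM()) with "PARTITION BY poolcutoffdate" so that the count would restart from 0 and not add to the previous cutoff date, but that would only include loans from the most recent cutoff date, so a loan that defaulted and was removed from the pool would wrongly be left out of the count.
Note the CASE expression inside the SUM() that the MAX() ... OVER is applied to.
Thanks everyone for the help.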