Selecting a subset of rows that exceed a percentage of total values
【Posted】: 2016-07-26 22:40:00
【Question】: I have a table containing customers, users and revenue, as shown below (in reality there are thousands of records):
Customer User Revenue
001 James 500
002 James 750
003 James 450
004 Sarah 100
005 Sarah 500
006 Sarah 150
007 Sarah 600
008 James 150
009 James 100
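What I want is to return only the highest-spending customers that together account for 80% of each user's total revenue.
To do this manually, I would sort James's customers by revenue, work out each row's percentage of the total and the running-total percentage, and then return only the records up to the point where the running total reaches 80%: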
Customer User Revenue % of total Running Total %
002 James 750 0.38 0.38
001 James 500 0.26 0.64
003 James 450 0.23 0.87 <- Greater than 80%, last record
008 James 150 0.08 0.95
009 James 100 0.05 1.00
I have tried using a CTE but have come up blank so far. Is there a way to do this with a single query rather than manually in an Excel sheet?
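The manual procedure described above can be sketched outside SQL to pin down the expected result. This is only an illustration in Python, not part of the original question; the function name `top_customers` and the `threshold` parameter are assumptions made for the example.

```python
from collections import defaultdict

# Sample rows from the question: (customer, user, revenue).
ROWS = [
    ("001", "James", 500), ("002", "James", 750), ("003", "James", 450),
    ("004", "Sarah", 100), ("005", "Sarah", 500), ("006", "Sarah", 150),
    ("007", "Sarah", 600), ("008", "James", 150), ("009", "James", 100),
]

def top_customers(rows, threshold=0.8):
    """Per user, keep the highest-revenue customers up to and including
    the first one whose running share reaches the threshold."""
    by_user = defaultdict(list)
    for customer, user, revenue in rows:
        by_user[user].append((customer, revenue))

    result = []
    for user in sorted(by_user):
        customers = sorted(by_user[user], key=lambda c: c[1], reverse=True)
        total = sum(revenue for _, revenue in customers)
        if total == 0:
            continue  # no revenue at all, so no share can be computed
        running = 0
        for customer, revenue in customers:
            running += revenue
            result.append((customer, user, revenue, running / total))
            if running / total >= threshold:
                break  # this row crossed the threshold; it is the last one kept
    return result

for row in top_customers(ROWS):
    print(row)
```

For the sample data this keeps customers 002, 001 and 003 for James and 007 and 005 for Sarah, which is the result a query should reproduce.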
【Question comments】:
【Answer 1】: In SQL Server 2012+, you would use a cumulative sum, which is more efficient. In SQL Server 2008, you can use a correlated subquery or cross apply:
select t.*,
sum(t.Revenue*1.0) / sum(t.Revenue) over (partition by user) as [% of Total],
sum(RunningRevenue*1.0) / sum(t.Revenue) over (partition by user) as [Running Total %]
from t cross apply
(select sum(Revenue) as RunningRevenue
from t t2
where t2.Revenue >= t.Revenue and t2.user = t.user
) t2;
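Note: the *1.0 is there just in case Revenue is stored as an integer. SQL Server performs integer division, which would return 0 in both columns for almost all rows.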
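The effect of integer division is easy to reproduce outside SQL Server. The sketch below uses Python purely as an illustration; Python's `//` is floor division, which gives the same result as truncating division for the non-negative values used here.

```python
revenue, total = 750, 1950  # James's largest customer and James's total

# Integer division discards the fractional part, so any share below 100% becomes 0.
print(revenue // total)       # 0

# Multiplying by 1.0 first makes the division non-integer and keeps the fraction.
print(revenue * 1.0 / total)  # about 0.3846
```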
Edit:
If you only need the results for James, add where user = 'James'.
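【Comments】:
The [% of Total] column seems to work, but only for a single user, and the running total seems to be all over the place.
@bendataclear . . . Your original question had only one user. Adjusting the running total for a single user is trivial. And it is much simpler than lad's answer.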
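There is no need for the sum around t.Revenue. It won't work because there is no GROUP BY (or I'm missing something). Second, user should be quoted as [user], otherwise you'll get an error. Third, SUM OVER() calculates the percentage over the whole table rather than per user. And there is no filtering.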
@lad2025 . . . Of course this works. It is an apply, which uses aggregation for each row. You might want to review the documentation on apply (technet.microsoft.com/en-us/library/ms175156(v=sql.105).aspx) or try it yourself.
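@GordonLinoff Please check the Demo. Even if you remove the sum and wrap user in [], the percentage is still computed over the whole table, sum(t.Revenue) over (). As it stands, the code does not even run.
【Answer 2】: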
SQL Server 2012+ only
You can use a windowed SUM:
WITH cte AS
(
SELECT *,
1.0 * Revenue/SUM(Revenue) OVER(PARTITION BY [User]) AS percentile,
1.0 * SUM(Revenue) OVER(PARTITION BY [User] ORDER BY [Revenue] DESC)
/SUM(Revenue) OVER(PARTITION BY [User]) AS running_percentile
FROM tab
)
SELECT *
FROM cte
WHERE running_percentile <= 0.8;
LiveDemo
SQL Server 2008:
WITH cte AS
(
SELECT *, ROW_NUMBER() OVER(PARTITION BY [User] ORDER BY Revenue DESC) AS rn
FROM t
), cte2 AS
(
SELECT c.Customer, c.[User], c.[Revenue]
,percentile = 1.0 * Revenue / NULLIF(c3.s,0)
,running_percentile = 1.0 * c2.s / NULLIF(c3.s,0)
FROM cte c
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM cte c2
WHERE c.[User] = c2.[User]
AND c2.rn <= c.rn) c2
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM cte c2
WHERE c.[User] = c2.[User]) AS c3
)
SELECT *
FROM cte2
WHERE running_percentile <= 0.8;
LiveDemo2
Output:
╔══════════╦═══════╦═════════╦════════════════╦════════════════════╗
║ Customer ║ User ║ Revenue ║ percentile ║ running_percentile ║
╠══════════╬═══════╬═════════╬════════════════╬════════════════════╣
║ 2 ║ James ║ 750 ║ 0,384615384615 ║ 0,384615384615 ║
║ 1 ║ James ║ 500 ║ 0,256410256410 ║ 0,641025641025 ║
║ 7 ║ Sarah ║ 600 ║ 0,444444444444 ║ 0,444444444444 ║
╚══════════╩═══════╩═════════╩════════════════╩════════════════════╝
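The output stops at James's second row because the final filter keeps only rows whose running share is at or below 0.8, so the row that actually crosses the threshold is excluded. A minimal Python sketch of that predicate, using James's rows from the sample data (the helper name `filter_at_or_below` is hypothetical):

```python
# (customer, revenue) for James, already sorted by revenue descending.
JAMES = [(2, 750), (1, 500), (3, 450), (8, 150), (9, 100)]

def filter_at_or_below(rows, limit=0.8):
    """Keep customers whose running share of the total is <= limit."""
    total = sum(revenue for _, revenue in rows)
    running = 0
    kept = []
    for customer, revenue in rows:
        running += revenue
        # Same predicate as the final WHERE clause: running_percentile <= 0.8.
        if running / total <= limit:
            kept.append(customer)
    return kept

print(filter_at_or_below(JAMES))  # [2, 1]
```

Customer 3 brings the running share to about 0.87, so it fails the `<= 0.8` test even though it is the row that reaches the 80% mark.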
Edit 2:
"Looks almost there; the only problem is that it is missing the last row. James's third row takes him over 0.80 but needs to be included."
WITH cte AS
(
SELECT *, ROW_NUMBER() OVER(PARTITION BY [User] ORDER BY Revenue DESC) AS rn
FROM t
), cte2 AS
(
SELECT c.Customer, c.[User], c.[Revenue]
,percentile = 1.0 * Revenue / NULLIF(c3.s,0)
,running_percentile = 1.0 * c2.s / NULLIF(c3.s,0)
FROM cte c
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM cte c2
WHERE c.[User] = c2.[User]
AND c2.rn <= c.rn) c2
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM cte c2
WHERE c.[User] = c2.[User]) AS c3
)
SELECT a.*
FROM cte2 a
CROSS APPLY (SELECT MIN(running_percentile) AS rp
FROM cte2
WHERE running_percentile >= 0.8
AND cte2.[User] = a.[User]) AS s
WHERE a.running_percentile <= s.rp;
LiveDemo3
Output:
╔══════════╦═══════╦═════════╦════════════════╦════════════════════╗
║ Customer ║ User ║ Revenue ║ percentile ║ running_percentile ║
╠══════════╬═══════╬═════════╬════════════════╬════════════════════╣
║ 2 ║ James ║ 750 ║ 0,384615384615 ║ 0,384615384615 ║
║ 1 ║ James ║ 500 ║ 0,256410256410 ║ 0,641025641025 ║
║ 3 ║ James ║ 450 ║ 0,230769230769 ║ 0,871794871794 ║
║ 7 ║ Sarah ║ 600 ║ 0,444444444444 ║ 0,444444444444 ║
║ 5 ║ Sarah ║ 500 ║ 0,370370370370 ║ 0,814814814814 ║
╚══════════╩═══════╩═════════╩════════════════╩════════════════════╝
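"Looks perfect; I translated it to my big table and it returned what I needed. I spent 5 minutes going through it and still can't understand what you have done!"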
SQL Server 2008 does not support everything in the OVER() clause, but ROW_NUMBER is supported.
First, compute the position within each group:
╔═══════════╦════════╦══════════╦════╗
║ Customer ║ User ║ Revenue ║ rn ║
╠═══════════╬════════╬══════════╬════╣
║ 2 ║ James ║ 750 ║ 1 ║
║ 1 ║ James ║ 500 ║ 2 ║
║ 3 ║ James ║ 450 ║ 3 ║
║ 8 ║ James ║ 150 ║ 4 ║
║ 9 ║ James ║ 100 ║ 5 ║
║ 7 ║ Sarah ║ 600 ║ 1 ║
║ 5 ║ Sarah ║ 500 ║ 2 ║
║ 6 ║ Sarah ║ 150 ║ 3 ║
║ 4 ║ Sarah ║ 100 ║ 4 ║
╚═══════════╩════════╩══════════╩════╝
Second cte:
the c2 subquery computes the running total based on the rank from ROW_NUMBER
the c3 subquery computes the full sum for each user
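In the final query, the s subquery finds the lowest running total that is at least 80%; every row with a running total at or below that value is returned.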
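The steps above can be mirrored in Python to make the data flow easier to follow. This is a sketch of the same logic, not a translation the answer provides; the function name `rows_up_to_threshold` and the tuple layout are assumptions for the example.

```python
ROWS = [
    (1, "James", 500), (2, "James", 750), (3, "James", 450),
    (4, "Sarah", 100), (5, "Sarah", 500), (6, "Sarah", 150),
    (7, "Sarah", 600), (8, "James", 150), (9, "James", 100),
]

def rows_up_to_threshold(rows, threshold=0.8):
    # cte: rn = position within each user's group, highest revenue first.
    ranked = []
    for user in sorted({u for _, u, _ in rows}):
        group = sorted((r for r in rows if r[1] == user),
                       key=lambda r: r[2], reverse=True)
        ranked += [(c, u, rev, rn)
                   for rn, (c, u, rev) in enumerate(group, start=1)]

    # cte2: c2 = running total up to this rn, c3 = the user's full total.
    cte2 = []
    for c, u, rev, rn in ranked:
        c2 = sum(r[2] for r in ranked if r[1] == u and r[3] <= rn)
        c3 = sum(r[2] for r in ranked if r[1] == u)
        if c3 == 0:
            continue  # the SQL yields NULL here (NULLIF), which the filter drops
        cte2.append((c, u, rev, c2 / c3))

    # s: per user, the lowest running share that is at least the threshold.
    kept = []
    for c, u, rev, share in cte2:
        cutoff = min((x[3] for x in cte2 if x[1] == u and x[3] >= threshold),
                     default=None)
        if cutoff is not None and share <= cutoff:
            kept.append((c, u, rev, share))
    return kept

for row in rows_up_to_threshold(ROWS):
    print(row)
```

Comparing against the cutoff rather than against 0.8 directly is what lets the row that crosses the threshold through: for James the cutoff is about 0.87, for Sarah about 0.81.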
Edit 3:
Using ROW_NUMBER is actually redundant.
WITH cte AS
(
SELECT c.Customer, c.[User], c.[Revenue]
,percentile = 1.0 * Revenue / NULLIF(c3.s,0)
,running_percentile = 1.0 * c2.s / NULLIF(c3.s,0)
FROM t c
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM t c2
WHERE c.[User] = c2.[User]
AND c2.Revenue >= c.Revenue) c2
CROSS APPLY
(SELECT SUM(Revenue) AS s
FROM t c2
WHERE c.[User] = c2.[User]) AS c3
)
SELECT a.*
FROM cte a
CROSS APPLY (SELECT MIN(running_percentile) AS rp
FROM cte c2
WHERE running_percentile >= 0.8
AND c2.[User] = a.[User]) AS s
WHERE a.running_percentile <= s.rp
ORDER BY [User], Revenue DESC;
LiveDemo4
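Without ROW_NUMBER, the running total for a row is the sum of every revenue in the same group that is greater than or equal to that row's revenue. A small Python sketch of that predicate follows; the second data set is hypothetical and only there to show how tied revenues behave, which the answer itself does not discuss.

```python
def running_share_by_value(revenues):
    """For each revenue, sum every revenue in the group that is >= it and
    divide by the group total, like c2.Revenue >= c.Revenue in the query."""
    total = sum(revenues)
    if total == 0:
        return []  # no share can be computed for an all-zero group
    return [sum(r for r in revenues if r >= rev) / total for rev in revenues]

# Sarah's revenues: all distinct, so the result matches a rank-based running total.
print(running_share_by_value([600, 500, 150, 100]))

# Hypothetical group with a tie: both 300 rows receive the same running share.
print(running_share_by_value([400, 300, 300]))  # [0.4, 1.0, 1.0]
```

With distinct revenues the two approaches agree. With ties, each tied row counts all of its peers, so tied rows are either all returned or all dropped by the final filter; whether that is acceptable depends on the data.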
【Comments】:
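Looks almost there; the only problem is that it is missing the last row. James's third row takes him over 0.80 but needs to be included. It is not a disaster if this isn't possible.
Looks perfect; I translated it to my big table and it returned what I needed. I spent 5 minutes going through it and still can't understand what you have done! Thanks.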