How do I exclude the top and bottom 5% of annual salary on 18 million records in TSQL, then use the selected data to calculate the average

Posted 2023-04-18


【Asked】: 2015-08-24 23:49:14 【Question】:

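I have 18 million rows in a single column (income), and I want to exclude the top and bottom 5% of incomes in order to calculate a more accurate average income.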

【Comments】:

Compute PERCENT_RANK, filter the data, and compute AVG.

【Answer 1】:

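You haven't provided any details about the structure, grouping, and so on, so this is a proof of concept. Compute PERCENT_RANK(), filter the data, then compute the average.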

SqlFiddleDemo

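/* Preparing data: 100 rows with income 1..100 */
CREATE TABLE tab(id INT IDENTITY(1,1), income INT);

;WITH Nums(Number) AS
(SELECT 1 AS Number
  UNION ALL
 /* Warning: recursive CTE. Generating more rows than this needs
    OPTION (MAXRECURSION n), because the default limit is 100 recursions. */
 SELECT Number+1 FROM Nums WHERE Number<100
)
INSERT INTO tab(income)
SELECT Number FROM Nums;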


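/* Main query */
WITH cte(id, income, [percent]) AS
(
  SELECT 
       id
      ,income
      ,[percent] = PERCENT_RANK() OVER(ORDER BY income)
  FROM tab
)
/* The cast avoids integer truncation of the result and an arithmetic
   overflow when the INT sum over millions of rows exceeds the INT range. */
SELECT [average_income] = AVG(CAST(income AS DECIMAL(18, 2)))
FROM cte
WHERE 
   [percent] > 0.05 
   AND [percent] < 0.95;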
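
As a sanity check on the demo data, the sketch below reports what the filter keeps. It assumes the `tab` table populated above with incomes 1 to 100; the output column names (`kept_rows`, `lowest_kept`, `highest_kept`) are made up for illustration. PERCENT_RANK() is (rank - 1) / (rows - 1), so with 100 distinct incomes the filter should keep the 90 rows with incomes 6 through 95, whose average is 50.5.

```sql
/* Sanity check against the demo table tab (incomes 1..100). */
WITH cte AS
(
  SELECT income
        ,[percent] = PERCENT_RANK() OVER (ORDER BY income)
  FROM tab
)
SELECT kept_rows      = COUNT(*)
      ,lowest_kept    = MIN(income)
      ,highest_kept   = MAX(income)
      /* Cast so the average is not truncated to an integer */
      ,average_income = AVG(CAST(income AS DECIMAL(18, 2)))
FROM cte
WHERE [percent] > 0.05
  AND [percent] < 0.95;
```

On real data with many tied incomes the kept share can differ from exactly 90%, because tied rows all receive the same PERCENT_RANK().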


【Answer 2】:

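-- bottom5: lowest 5% of rows; top5: highest 5% of rows.
-- TOP ... PERCENT needs an ORDER BY, otherwise the rows chosen are arbitrary.
with bottom5 as 
(select top 5 percent income from tablename order by income)
, top5 as
(select top 5 percent income from tablename order by income desc)
-- cast avoids integer truncation and INT overflow of the running sum
select avg(cast(income as decimal(18, 2)))
from tablename
where income not in (select income from bottom5 union all select income from top5)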

You can do this by computing the bottom 5% and the top 5%, then using NOT IN to exclude them from the final calculation.
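
Because NOT IN compares values rather than rows, a middle row whose income equals an income in either tail is dropped as well. A minimal sketch of a row-based alternative follows, assuming the same placeholder names `tablename` and `income`; the aliases `rn`, `cnt`, and `cut` are hypothetical. It numbers the rows by income and trims the same number of rows from each end, which is one possible reading of "exclude the top and bottom 5%", not the only one.

```sql
/* Trim by row position instead of by value, so ties are not over-excluded. */
WITH ranked AS
(
  SELECT income
        ,rn  = ROW_NUMBER() OVER (ORDER BY income)
        ,cnt = COUNT_BIG(*) OVER ()
         /* rows to drop at each end; integer division rounds down */
        ,cut = COUNT_BIG(*) OVER () * 5 / 100
  FROM tablename
)
SELECT average_income = AVG(CAST(income AS DECIMAL(18, 2)))
FROM ranked
WHERE rn > cut
  AND rn <= cnt - cut;
```

With 18 million rows this drops 900,000 rows from each end. Which of several tied rows lands on either side of a cutoff is arbitrary, but since tied rows carry the same income the average is unaffected.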

【Comments】:

NOT IN will also exclude values in the middle 90% that are equal to values in the other 10%, which changes the "average".

【Answer 3】:

This is a bit tricky. PERCENT_RANK() is probably the way to go. However, the following might be faster:

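select t.*
from tablename t cross join
     -- lower cutoff: largest salary within the bottom 5% of rows
     (select max(salary) as low_cut
      from (select top 5 percent salary
            from tablename
            order by salary
           ) b
     ) m1 cross join
     -- upper cutoff: smallest salary within the top 5% of rows
     (select min(salary) as high_cut
      from (select top 5 percent salary
            from tablename
            order by salary desc
           ) h
     ) m2
where t.salary >= m1.low_cut and t.salary <= m2.high_cut;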

The idea is that the min and max subqueries can use an index efficiently to get the 5% and 95% values. The rest of the query is then just a full table scan.
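
A rough sketch of how this could be carried through to the average itself is shown below. Everything named here is an assumption: the table `tablename`, an INT column `salary`, the index name `ix_tablename_salary`, and the variables `@low_cut` and `@high_cut`. Whether the optimizer actually uses the index for the TOP ... PERCENT subqueries depends on the data and the plan, so it is worth checking the execution plan rather than taking it for granted.

```sql
/* Hypothetical index on the filtered column */
CREATE INDEX ix_tablename_salary ON tablename (salary);

DECLARE @low_cut INT, @high_cut INT;

/* Largest salary within the bottom 5% of rows */
SELECT @low_cut = MAX(salary)
FROM (SELECT TOP 5 PERCENT salary FROM tablename ORDER BY salary) AS b;

/* Smallest salary within the top 5% of rows */
SELECT @high_cut = MIN(salary)
FROM (SELECT TOP 5 PERCENT salary FROM tablename ORDER BY salary DESC) AS h;

/* Average over the rows between the two cutoffs (cutoff values included) */
SELECT AVG(CAST(salary AS DECIMAL(18, 2))) AS average_salary
FROM tablename
WHERE salary >= @low_cut
  AND salary <= @high_cut;
```

Note that this filters by value, so every row equal to a cutoff salary is kept, including rows that sit inside the tails.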

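【Comments】:

Many income values may fall right on the cutoff, so you can't use their values as the filter. You have to work out the number of rows that makes up 5%.

@TomBlodget . . . I'm confused. You're not the OP, so how do you know what the data looks like? The question poses a general problem. If this answer is optimized for something more specific, please just say so.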
