Partition Function COUNT() OVER possible using DISTINCT

Posted 2023-03-24


【Title】Partition Function COUNT() OVER possible using DISTINCT 【Posted】2012-06-27 12:11:26 【Question】:

I'm trying to write the following to get a running total of distinct NumUsers, like so:

NumUsers = COUNT(DISTINCT [UserAccountKey]) OVER (PARTITION BY [Mth])

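Management Studio doesn't seem too happy about this. The error disappears when I remove the DISTINCT keyword, but then it is no longer a distinct count.

DISTINCT does not appear to be possible within the partition functions. How do I go about finding the distinct count? Do I fall back on a more traditional method such as a correlated subquery?

Looking into this a bit further, maybe these OVER functions work differently from Oracle's, in that they cannot be used in SQL Server to calculate running totals.

I've added a live example on SQLFiddle where I attempt to use a partition function to calculate a running total.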

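【Comments on the question】:

COUNT with ORDER BY instead of PARTITION BY is ill-defined in SQL Server 2008. I'm surprised it lets you have it. According to the documentation, you're not allowed to specify an ORDER BY for an aggregate function.

Yep, I think I'm getting confused with some Oracle functionality; these running totals and running counts will be a bit more involved.

【Answer 1】:

There is a very simple solution using dense_rank():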

dense_rank() over (partition by [Mth] order by [UserAccountKey]) 
+ dense_rank() over (partition by [Mth] order by [UserAccountKey] desc) 
- 1

This gives you exactly what you asked for: the number of distinct UserAccountKey values within each month.
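The reason the formula works: if a partition holds d distinct values and a row's value has ascending dense rank r, its descending dense rank is d - r + 1, so the two ranks minus 1 always add up to d. A minimal sketch on made-up data follows; the `Logins` CTE and its values are assumptions for illustration, not part of the original question.

```sql
-- Hypothetical sample data; the CTE name and values are illustrative only.
WITH Logins ([Mth], [UserAccountKey]) AS
(
    SELECT 1, 100 UNION ALL
    SELECT 1, 100 UNION ALL
    SELECT 1, 200 UNION ALL
    SELECT 2, 100 UNION ALL
    SELECT 2, 300 UNION ALL
    SELECT 2, 400
)
SELECT
    [Mth],
    [UserAccountKey],
    -- ascending rank + descending rank - 1 = number of distinct keys in the partition
    DENSE_RANK() OVER (PARTITION BY [Mth] ORDER BY [UserAccountKey])
  + DENSE_RANK() OVER (PARTITION BY [Mth] ORDER BY [UserAccountKey] DESC)
  - 1 AS NumUsers
FROM Logins;
-- Mth 1 contains keys {100, 200}, so every Mth 1 row gets NumUsers = 2.
-- Mth 2 contains keys {100, 300, 400}, so every Mth 2 row gets NumUsers = 3.
```

Note that this is a per-partition distinct count repeated on every row, not a running total across months.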

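【Discussion】:

One thing to be careful about with dense_rank() is that it counts NULLs, whereas COUNT(field) OVER does not. Because of this I can't use it in my solution, but I still think it is quite clever.

But I am looking for a running total of distinct user account keys over the months of each year: not sure how this answers that?

@bf2020, if UserAccountKey can contain NULL values, you need to add this term: -MAX(CASE WHEN UserAccountKey IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY Mth). The idea is taken from LarsRönnbäck's answer below. Essentially, if UserAccountKey has NULL values, you need to subtract an extra 1 from the result, because DENSE_RANK counts NULL as a value.

Using this dense_rank solution when the window function has a frame is discussed here. SQL Server does not allow dense_rank to be used with a window frame: ***.com/questions/63527035/…

【Answer 2】:

Necromancing:

It is relatively simple to emulate COUNT DISTINCT over PARTITION BY by taking the MAX of a DENSE_RANK: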

;WITH baseTable AS
(
    SELECT 'RM1' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM1' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR3' AS ADR
    UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM3' AS RM, 'ADR2' AS ADR
)
,CTE AS
(
    SELECT RM, ADR, DENSE_RANK() OVER(PARTITION BY RM ORDER BY ADR) AS dr 
    FROM baseTable
)
SELECT
     RM
    ,ADR

    ,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY ADR) AS cnt1 
    ,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM) AS cnt2 
    -- Not supported
    --,COUNT(DISTINCT CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY CTE.ADR) AS cntDist
    ,MAX(CTE.dr) OVER (PARTITION BY CTE.RM ORDER BY CTE.RM) AS cntDistEmu 
FROM CTE

Note: this assumes the field in question is non-nullable. If the field contains one or more NULL entries, you need to subtract 1.
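As a sketch of that NULL adjustment, the variant below subtracts 1 only in partitions that actually contain a NULL, using the MAX(CASE ...) term suggested in the comments on this thread. The sample rows are made up for illustration.

```sql
-- Hypothetical sample data: partition RM1 contains a NULL, RM2 does not.
WITH baseTable (RM, ADR) AS
(
    SELECT 'RM1', 'ADR1' UNION ALL
    SELECT 'RM1', NULL   UNION ALL
    SELECT 'RM1', 'ADR2' UNION ALL
    SELECT 'RM2', 'ADR1' UNION ALL
    SELECT 'RM2', 'ADR1'
),
CTE AS
(
    SELECT RM, ADR,
           DENSE_RANK() OVER (PARTITION BY RM ORDER BY ADR) AS dr
    FROM baseTable
)
SELECT
    RM,
    ADR,
    -- DENSE_RANK gives NULL its own rank, so remove it again where a NULL exists
    MAX(dr) OVER (PARTITION BY RM)
  - MAX(CASE WHEN ADR IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY RM) AS cntDistEmu
FROM CTE;
-- RM1: ranks are NULL = 1, ADR1 = 2, ADR2 = 3, so 3 - 1 = 2 distinct non-NULL values.
-- RM2: the only rank is 1 and there is no NULL, so 1 - 0 = 1.
```

With this term the result matches what COUNT(DISTINCT ADR) would return per RM, including 0 for a partition that holds only NULLs.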

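【Discussion】:

【Answer 3】:

I use a solution similar to David's above, but with an additional twist for the case where some rows should be excluded from the count. This assumes that [UserAccountKey] is never null.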

-- subtract an extra 1 if null was ranked within the partition,
-- which only happens if there were rows where [Include] <> 'Y'
dense_rank() over (
  partition by [Mth] 
  order by case when [Include] = 'Y' then [UserAccountKey] else null end asc
) 
+ dense_rank() over (
  partition by [Mth] 
  order by case when [Include] = 'Y' then [UserAccountKey] else null end desc
)
- max(case when [Include] = 'Y' then 0 else 1 end) over (partition by [Mth])
- 1

An SQL Fiddle with an extended example can be found here.

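【Discussion】:

Your idea can be used to make the original dense_rank() formula work when UserAccountKey can be NULL (without the extra complexity of [Include] that your answer deals with). Add this term to the formula: -MAX(CASE WHEN UserAccountKey IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY Mth).

【Answer 4】:

I think the only way of doing this in SQL Server 2008 R2 is to use a correlated subquery or an OUTER APPLY: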

SELECT  datekey,
        COALESCE(RunningTotal, 0) AS RunningTotal,
        COALESCE(RunningCount, 0) AS RunningCount,
        COALESCE(RunningDistinctCount, 0) AS RunningDistinctCount
FROM    document
        OUTER APPLY
        (   SELECT  SUM(Amount) AS RunningTotal,
                    COUNT(1) AS RunningCount,
                    COUNT(DISTINCT d2.dateKey) AS RunningDistinctCount
            FROM    Document d2
            WHERE   d2.DateKey <= document.DateKey
        ) rt;

The running total can be done in SQL Server 2012 using the syntax you suggested:

SELECT  datekey,
        SUM(Amount) OVER(ORDER BY DateKey) AS RunningTotal
FROM    document

However, DISTINCT is still not allowed there, so if DISTINCT is required and/or upgrading isn't an option, I think OUTER APPLY is your best option.
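Applied to the columns from the question, the OUTER APPLY approach might look like the sketch below. The `Logins` and `Months` CTEs and the sample values are assumptions made for illustration; only [Mth] and [UserAccountKey] come from the question.

```sql
-- Hypothetical sample data; only the column names come from the question.
WITH Logins ([Mth], [UserAccountKey]) AS
(
    SELECT 1, 100 UNION ALL
    SELECT 1, 100 UNION ALL
    SELECT 1, 200 UNION ALL
    SELECT 2, 100 UNION ALL
    SELECT 2, 300 UNION ALL
    SELECT 3, 200 UNION ALL
    SELECT 3, 400
),
Months AS
(
    -- one output row per month
    SELECT DISTINCT [Mth] FROM Logins
)
SELECT m.[Mth], rt.RunningDistinctUsers
FROM Months AS m
OUTER APPLY
(
    -- distinct users seen in this month or any earlier month
    SELECT COUNT(DISTINCT l.[UserAccountKey]) AS RunningDistinctUsers
    FROM Logins AS l
    WHERE l.[Mth] <= m.[Mth]
) AS rt
ORDER BY m.[Mth];
-- Mth 1: {100, 200}           -> 2
-- Mth 2: {100, 200, 300}      -> 3
-- Mth 3: {100, 200, 300, 400} -> 4
```

The subquery is re-evaluated for every month, so on large tables this is likely to cost more than a single windowed pass.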

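【Discussion】:

Cool, thanks. I found this SO answer, which features the OUTER APPLY option that I will try. Have you seen the looping UPDATE approach in that answer? It's pretty far out and apparently fast. Life will be easier in 2012. Is it a straight Oracle copy?

【Answer 5】:

There is a solution in plain SQL. The first query below is the one we would like to write; the second is a rewrite that works: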

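-- Desired query (not supported: DISTINCT is not allowed in a windowed aggregate)
SELECT time, COUNT(DISTINCT user) OVER(ORDER BY time) AS users
FROM users

-- Rewrite: count each user once, at the time of their first appearance
-- (in SQL Server, USER is a reserved word, so such a column must be written as [user])
SELECT time, COUNT(*) OVER(ORDER BY time) AS users
FROM (
    SELECT user, MIN(time) AS time
    FROM users
    GROUP BY user
) t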
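A worked sketch of this first-appearance technique on made-up data follows. The `Visits` CTE, its column names and its values are assumptions for illustration; the windowed COUNT(*) with ORDER BY needs SQL Server 2012 or later.

```sql
-- Hypothetical sample data; integer values stand in for timestamps.
WITH Visits ([user_id], [visit_time]) AS
(
    SELECT 'a', 1 UNION ALL
    SELECT 'b', 2 UNION ALL
    SELECT 'a', 3 UNION ALL
    SELECT 'c', 4 UNION ALL
    SELECT 'b', 5
)
SELECT first_seen.[visit_time],
       -- running count of first appearances = running count of distinct users
       COUNT(*) OVER (ORDER BY first_seen.[visit_time]) AS users
FROM (
    -- keep one row per user: the time that user was first seen
    SELECT [user_id], MIN([visit_time]) AS [visit_time]
    FROM Visits
    GROUP BY [user_id]
) AS first_seen
ORDER BY first_seen.[visit_time];
-- visit_time 1 -> 1 user, visit_time 2 -> 2 users, visit_time 4 -> 3 users
```

Only times at which a new user first appears are returned, so times 3 and 5 are absent here; joining back to the full list of times would be needed to carry the count forward onto those rows.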

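【Discussion】:

【Answer 6】:

I wandered in here with essentially the same question as whytheq and found David's solution, but then had to review my old self-tutorial notes on DENSE_RANK because I use it so rarely: why DENSE_RANK instead of RANK or ROW_NUMBER, and how does it actually work? In the process I updated that tutorial to include my version of David's solution for this particular problem, and then thought it might be helpful to SQL newcomers (or to others like me who forget things).

The whole tutorial text can be copied and pasted into a query editor, and each example query can then be uncommented and run (separately) to see its results. (By default, the solution to this problem is left uncommented at the bottom.) Alternatively, each example can be copied into its own query editor instance, but each one must include the TBLx CTE.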

--WITH /* DB2 version */
--TBLx (Col_A, Col_B) AS (VALUES 
--     (  7,     7  ),
--     (  7,     7  ),
--     (  7,     7  ),
--     (  7,     8  ))

WITH /* SQL-Server version */
TBLx    (Col_A, Col_B) AS
  (SELECT  7,     7    UNION ALL
   SELECT  7,     7    UNION ALL
   SELECT  7,     7    UNION ALL
   SELECT  7,     8)

/*** Example-A: demonstrates the difference between ROW_NUMBER, RANK and DENSE_RANK ***/

  --SELECT Col_A, Col_B,
  --  ROW_NUMBER() OVER(PARTITION BY Col_A ORDER BY Col_B) AS ROW_NUMBER_,
  --  RANK() OVER(PARTITION BY Col_A ORDER BY Col_B)       AS RANK_,
  --  DENSE_RANK() OVER(PARTITION BY Col_A ORDER BY Col_B) AS DENSE_RANK_
  --FROM TBLx

  /* RESULTS:
    Col_A  Col_B  ROW_NUMBER_  RANK_  DENSE_RANK_
      7      7        1          1        1
      7      7        2          1        1
      7      7        3          1        1
      7      8        4          4        2

     ROW_NUMBER: Just increments for the three identical rows and increments again for the final unique row.
                 That is, it’s an order-value (based on "sort" order) but makes no other distinction.
                 
           RANK: Assigns the same rank value to the three identical rows, then jumps to 4 for the fourth row,
                 which is *unique* with regard to the others.
                 That is, each identical row is ranked by the rank-order of the first row-instance of that
                 (identical) value-set.
                 
     DENSE_RANK: Also assigns the same rank value to the three identical rows but the fourth *unique* row is
                 assigned a value of 2.
                 That is, DENSE_RANK identifies that there are (only) two *unique* row-types in the row set.
  */

/*** Example-B: to get only the distinct resulting "count-of-each-row-type" rows ***/

--  SELECT DISTINCT -- For unique returned "count-of-each-row-type" rows, the DISTINCT operator is necessary because
--                  -- the calculated DENSE_RANK value is appended to *all* rows in the data set.  Without DISTINCT,
--                  -- its value for each original-data row-type would just be replicated for each of those rows.
--                  
--    Col_A, Col_B,                
--    DENSE_RANK() OVER(PARTITION BY Col_A ORDER BY Col_B) AS DISTINCT_ROWTYPE_COUNT_
--  FROM TBLx

  /* RESULTS:
    Col_A  Col_B  DISTINCT_ROWTYPE_COUNT_
      7      7            1
      7      8            2
  */

/*** Example-C.1: demonstrates the derivation of the "count-of-all-row-types" (finalized in Example-C.2, below) ***/

--  SELECT
--    Col_A, Col_B,
--    
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC) AS ROW_TYPES_COUNT_DESC_,
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC) AS ROW_TYPES_COUNT_ASC_,
--    
--    -- Adding the above cases together and subtracting one gives the same total count on each resulting row:
--    
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC)
--       +
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC)
--      - 1   /* (Because DENSE_RANK values are one-based) */
--      AS ROW_TYPES_COUNT_
--  FROM TBLx

  /* RESULTS:
    COL_A  COL_B  ROW_TYPES_COUNT_DESC_  ROW_TYPES_COUNT_ASC_  ROW_TYPES_COUNT_
      7      7            2                     1                    2
      7      7            2                     1                    2
      7      7            2                     1                    2
      7      8            1                     2                    2
      
  */

/*** Example-C.2: uses the above technique to get a *single* resulting "count-of-all-row-types" row ***/

  SELECT DISTINCT -- For a single returned "count-of-all-row-types" row, the DISTINCT operator is necessary because the
                  -- calculated DENSE_RANK value is appended to *all* rows in the data set.  Without DISTINCT, that
                  -- value would just be replicated for each original-data row.
                  
--    Col_A, Col_B, -- In order to get a *single* returned "count-of-all-row-types" row (and field), all other fields
                    -- must be excluded because their respective differing row-values will defeat the purpose of the
                    -- DISTINCT operator, above.
                   
    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC)
       +
    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC)
      - 1   /* (Because DENSE_RANK values are one-based) */
      AS ROW_TYPES_COUNT_
  FROM TBLx
  
  /* RESULTS:

    ROW_TYPES_COUNT_
          2
  */

【Discussion】:
