Select rows with a unique column value in SQL

【Title】: Select Rows with unique column value in SQL 【Posted】: 2021-07-28 13:22:31 【Question】:

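I have done a lot of searching, and I feel this should have a simple solution, but I can't get anything to work.

I have a database table that contains nested arrays. After I unnest the arrays and select only the columns I want, I end up with a table that looks like this: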

SELECT time_period, name, value
FROM 
    TBL.TBLValues,
    UNNEST(nested_name) as unnested_name
WHERE time_period > '2021-07-01 00:00:00'
AND name != "None"
ORDER BY time_period;
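
+-----+---------------------+-------+-------+
| row | time_period         | name  | value |
+-----+---------------------+-------+-------+
|   1 | 2021-07-01T00:00:00 | Name1 |   100 |
|   2 | 2021-07-01T00:00:00 | Name2 |   105 |
|   3 | 2021-07-01T00:05:00 | Name1 |   120 |
|   4 | 2021-07-01T00:10:00 | Name3 |   500 |
|   5 | 2021-07-01T00:15:00 | Name1 |   110 |
|   6 | 2021-07-01T00:15:00 | Name3 |   450 |
|   7 | 2021-07-01T00:20:00 | Name1 |  1000 |
+-----+---------------------+-------+-------+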

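What I want to do is filter my query (possibly with a nested query?) so that I only get rows whose time_period is unique. In the table above, I would only have rows 3 and 4 returned, as all other rows have multiple names for the same time_period.

I tried SELECT DISTINCT(time_period). In terms of the number of rows returned it does filter things down, but it definitely returns rows for time_periods that have multiple names. I'm not sure why; my understanding of that function was that it should return only rows whose time_period occurs exactly once.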

SELECT DISTINCT(time_period)
FROM
    TBL.TBLValues,
    UNNEST(nested_name) as unnested_name
WHERE time_period > '2021-07-01 00:00:00'
AND name != 'None'
ORDER BY time_period;
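
To see what DISTINCT actually does here, the following is a minimal sketch in Python that uses the built-in sqlite3 module as a stand-in for the real database (an assumption; the UNNEST syntax in the question suggests BigQuery). The table name `samples` and the helper `distinct_periods` are hypothetical, and the rows are the sample rows from the table above. DISTINCT collapses repeated values into one copy each, so a period shared by several names still appears once instead of being dropped.

```python
import sqlite3

# Sample rows from the question: (time_period, name, value).
ROWS = [
    ("2021-07-01T00:00:00", "Name1", 100),
    ("2021-07-01T00:00:00", "Name2", 105),
    ("2021-07-01T00:05:00", "Name1", 120),
    ("2021-07-01T00:10:00", "Name3", 500),
    ("2021-07-01T00:15:00", "Name1", 110),
    ("2021-07-01T00:15:00", "Name3", 450),
    ("2021-07-01T00:20:00", "Name1", 1000),
]


def distinct_periods(rows):
    """Return what SELECT DISTINCT time_period yields for the given rows."""
    conn = sqlite3.connect(":memory:")
    # 'samples' is a hypothetical stand-in for the unnested table.
    conn.execute("CREATE TABLE samples (time_period TEXT, name TEXT, value INTEGER)")
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT DISTINCT time_period FROM samples ORDER BY time_period"
    )
    result = [r[0] for r in cursor]
    conn.close()
    return result


periods = distinct_periods(ROWS)
print(periods)  # five periods, one copy of each
```

The result still contains 2021-07-01T00:00:00 and 2021-07-01T00:15:00, each of which has two names in the sample data.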

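I also tried COUNT(time_period) AS counter, with HAVING counter = 1 at the end of the query. That got closest to what I want. It returned very few results, and I think the GROUP BY may be doing something odd? It gives me only one time_period per name, but some of those time_periods are duplicated. Ideally, the next step after this filter is to take every unique time_period and then, for each unique name, filter down to the latest time_period, so it would be best for it to return every unique time_period first so that I can do that next.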

SELECT COUNT(time_period) as counter, time_period, name, value
FROM(
    SELECT time_period, name, value
    FROM  TBL.TBLValues,
            UNNEST(nested_name) as unnested_name
            WHERE time_period > '2021-07-01 00:00:00'
            AND name != 'None')
GROUP BY name, value, time_period
HAVING counter = 1
ORDER BY time_period;
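
The effect of the GROUP BY columns can be checked on the sample rows with a small Python sketch, again using sqlite3 as a stand-in (an assumption) and a hypothetical table `samples`. Grouping by name, value and time_period counts each distinct triple, so the count is 1 for every row that is not an exact duplicate. Grouping by time_period alone counts rows per period, which is the quantity the filter needs. The helper `run` and the two query constants are illustrative names.

```python
import sqlite3

# Sample rows from the question: (time_period, name, value).
ROWS = [
    ("2021-07-01T00:00:00", "Name1", 100),
    ("2021-07-01T00:00:00", "Name2", 105),
    ("2021-07-01T00:05:00", "Name1", 120),
    ("2021-07-01T00:10:00", "Name3", 500),
    ("2021-07-01T00:15:00", "Name1", 110),
    ("2021-07-01T00:15:00", "Name3", 450),
    ("2021-07-01T00:20:00", "Name1", 1000),
]

# Groups are (name, value, time_period) triples: every sample row is its own group.
PER_TRIPLE_SQL = """
SELECT time_period, name, value
FROM samples
GROUP BY name, value, time_period
HAVING COUNT(*) = 1
ORDER BY time_period, name
"""

# Groups are time periods: only periods holding exactly one row survive.
# MIN() is safe here because each surviving group has a single row.
PER_PERIOD_SQL = """
SELECT time_period, MIN(name) AS name, MIN(value) AS value
FROM samples
GROUP BY time_period
HAVING COUNT(*) = 1
ORDER BY time_period
"""


def run(sql, rows=ROWS):
    """Load the rows into an in-memory table and return the query result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (time_period TEXT, name TEXT, value INTEGER)")
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


per_triple = run(PER_TRIPLE_SQL)
per_period = run(PER_PERIOD_SQL)
print(len(per_triple))  # 7: nothing is filtered out
print(per_period)       # only the 00:05, 00:10 and 00:20 periods
```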

I also tried adapting the solution from this question, using PARTITION BY, but I could not get it to filter my results at all.

SELECT time_period, name, value
FROM(
    SELECT time_period, name, value, ROW_NUMBER() OVER (PARTITION BY time_period ORDER BY value ASC) AS row_num
    FROM  TBL.TBLValues,
            UNNEST(nested_name) as unnested_name
            WHERE time_period > '2021-07-01 00:00:00'
            AND name != 'None') as all_duplicated
WHERE row_num = 1
ORDER BY time_period;
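
A short Python sketch may help show the difference between the two window functions involved. It uses sqlite3 as a stand-in (an assumption; window functions need SQLite 3.25 or later) and a hypothetical table `samples` loaded with the sample rows. ROW_NUMBER() = 1 keeps the first row of every partition, so every time_period still appears once, while COUNT(*) OVER (...) = 1 keeps only partitions that hold a single row.

```python
import sqlite3

# Sample rows from the question: (time_period, name, value).
ROWS = [
    ("2021-07-01T00:00:00", "Name1", 100),
    ("2021-07-01T00:00:00", "Name2", 105),
    ("2021-07-01T00:05:00", "Name1", 120),
    ("2021-07-01T00:10:00", "Name3", 500),
    ("2021-07-01T00:15:00", "Name1", 110),
    ("2021-07-01T00:15:00", "Name3", 450),
    ("2021-07-01T00:20:00", "Name1", 1000),
]

# Keeps one row per period (the lowest value), even for crowded periods.
ROW_NUMBER_SQL = """
SELECT time_period, name, value FROM (
    SELECT time_period, name, value,
           ROW_NUMBER() OVER (PARTITION BY time_period ORDER BY value) AS row_num
    FROM samples
) WHERE row_num = 1
ORDER BY time_period
"""

# Keeps only rows whose period contains exactly one row.
COUNT_SQL = """
SELECT time_period, name, value FROM (
    SELECT time_period, name, value,
           COUNT(*) OVER (PARTITION BY time_period) AS n
    FROM samples
) WHERE n = 1
ORDER BY time_period
"""


def run(sql, rows=ROWS):
    """Load the rows into an in-memory table and return the query result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (time_period TEXT, name TEXT, value INTEGER)")
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


first_per_period = run(ROW_NUMBER_SQL)
single_row_periods = run(COUNT_SQL)
print(len(first_per_period))    # 5: one row for each period
print(len(single_row_periods))  # 3: periods with exactly one row
```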

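【Comments】:

Add a tag for your database platform. And no, that is not the purpose of DISTINCT. It forces only one copy of each unique result to be retrieved, whether or not rows are duplicated. 【Answer 1】: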

There are several ways to do this. Here is one (using standard SQL):

WITH xrows AS (
       SELECT tbl.*
            , COUNT(*) OVER (PARTITION BY time_period) AS n
         FROM tbl
     )
SELECT *
  FROM xrows
 WHERE n = 1
 ORDER BY time_period
;

And with your SQL as the starting point:

WITH your_sql AS (
        SELECT time_period, name, value
             , COUNT(*) OVER (PARTITION BY time_period) AS n
          FROM TBL.TBLValues,
            UNNEST(nested_name) as unnested_name
         WHERE time_period > '2021-07-01 00:00:00'
           AND name != 'None'
     )
SELECT *
  FROM your_sql
 WHERE n = 1
 ORDER BY time_period
;

Now with the sample data you provided:

WITH your_sql (row, time_period, name, value) AS (
        SELECT 1, '2021-07-01T00:00:00', 'Name1', 100 UNION ALL
        SELECT 2, '2021-07-01T00:00:00', 'Name2', 105 UNION ALL
        SELECT 3, '2021-07-01T00:05:00', 'Name1', 120 UNION ALL
        SELECT 4, '2021-07-01T00:10:00', 'Name3', 500 UNION ALL
        SELECT 5, '2021-07-01T00:15:00', 'Name1', 110 UNION ALL
        SELECT 6, '2021-07-01T00:15:00', 'Name3', 450
     )
   , xrows AS (
          SELECT t.*
               , COUNT(*) OVER (PARTITION BY time_period) AS n
            FROM your_sql AS t
     )
SELECT * FROM xrows WHERE n = 1
 ORDER BY time_period
;

Result:

+-----+---------------------+-------+-------+---+
| row | time_period         | name  | value | n |
+-----+---------------------+-------+-------+---+
|   3 | 2021-07-01T00:05:00 | Name1 |   120 | 1 |
|   4 | 2021-07-01T00:10:00 | Name3 |   500 | 1 |
+-----+---------------------+-------+-------+---+

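Here is an updated solution for the new requirement. I added a duplicate of row=3 (as row=7), and only one of the two will be shown. With the previous COUNT logic, this case would have been dropped entirely: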

WITH your_sql (row, time_period, name, value) AS (
        SELECT 1, '2021-07-01T00:00:00', 'Name1', 100 UNION ALL
        SELECT 2, '2021-07-01T00:00:00', 'Name2', 105 UNION ALL
        SELECT 3, '2021-07-01T00:05:00', 'Name1', 120 UNION ALL
        SELECT 7, '2021-07-01T00:05:00', 'Name1', 120 UNION ALL
        SELECT 4, '2021-07-01T00:10:00', 'Name3', 500 UNION ALL
        SELECT 5, '2021-07-01T00:15:00', 'Name1', 110 UNION ALL
        SELECT 6, '2021-07-01T00:15:00', 'Name3', 450
     )
   , xrows0 AS (
          SELECT t.*
               , ROW_NUMBER() OVER (PARTITION BY time_period ORDER BY name, value, row) AS n1
               , RANK()       OVER (PARTITION BY time_period ORDER BY name, value     ) AS n2
            FROM your_sql AS t
     )
   , xrows AS (
          SELECT t.*
               , MAX(n2) OVER (PARTITION BY time_period) AS m2
            FROM xrows0 AS t
     )
SELECT *
  FROM xrows
 WHERE m2 = 1
   AND n1 = 1
 ORDER BY time_period
;

Result:
+-----+---------------------+-------+-------+----+----+------+
| row | time_period         | name  | value | n1 | n2 | m2   |
+-----+---------------------+-------+-------+----+----+------+
|   3 | 2021-07-01T00:05:00 | Name1 |   120 |  1 |  1 |    1 |
|   4 | 2021-07-01T00:10:00 | Name3 |   500 |  1 |  1 |    1 |
+-----+---------------------+-------+-------+----+----+------+

And with the newer requirement that only the name needs to be the same within a time period, plus your new data row:

WITH your_sql (row, time_period, name, value) AS (
        SELECT 1, '2021-07-01T00:00:00', 'Name1',  100 UNION ALL
        SELECT 2, '2021-07-01T00:00:00', 'Name2',  105 UNION ALL
        SELECT 3, '2021-07-01T00:05:00', 'Name1',  120 UNION ALL
        SELECT 8, '2021-07-01T00:05:00', 'Name1',  120 UNION ALL
        SELECT 4, '2021-07-01T00:10:00', 'Name3',  500 UNION ALL
        SELECT 5, '2021-07-01T00:15:00', 'Name1',  110 UNION ALL
        SELECT 6, '2021-07-01T00:15:00', 'Name3',  450 UNION ALL
        SELECT 7, '2021-07-01T00:20:00', 'Name1', 1000
     )
   , xrows0 AS (
          SELECT t.*
               , ROW_NUMBER() OVER (PARTITION BY time_period ORDER BY name, row  ) AS n1
               , RANK()       OVER (PARTITION BY time_period ORDER BY name       ) AS n2
            FROM your_sql AS t
     )
   , xrows AS (
          SELECT t.*
               , MAX(n2) OVER (PARTITION BY time_period) AS m2
            FROM xrows0 AS t
     )
SELECT *
  FROM xrows
 WHERE m2 = 1
   AND n1 = 1
 ORDER BY time_period
;

+-----+---------------------+-------+-------+----+----+------+
| row | time_period         | name  | value | n1 | n2 | m2   |
+-----+---------------------+-------+-------+----+----+------+
|   3 | 2021-07-01T00:05:00 | Name1 |   120 |  1 |  1 |    1 |
|   4 | 2021-07-01T00:10:00 | Name3 |   500 |  1 |  1 |    1 |
|   7 | 2021-07-01T00:20:00 | Name1 |  1000 |  1 |  1 |    1 |
+-----+---------------------+-------+-------+----+----+------+
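
For the name-only requirement, a shorter formulation may also work: keep the periods where COUNT(DISTINCT name) is 1, then take the lowest row number in each. The Python sketch below tries this on the data set above. It is an alternative technique, not the RANK/MAX query from this answer, and it uses sqlite3 as a stand-in database (an assumption). The table `samples`, the column `row_id` (renamed from `row`, which is a keyword in some engines) and the helper `one_name_periods` are hypothetical.

```python
import sqlite3

# Data set from the answer above: (row_id, time_period, name, value).
ROWS = [
    (1, "2021-07-01T00:00:00", "Name1", 100),
    (2, "2021-07-01T00:00:00", "Name2", 105),
    (3, "2021-07-01T00:05:00", "Name1", 120),
    (8, "2021-07-01T00:05:00", "Name1", 120),
    (4, "2021-07-01T00:10:00", "Name3", 500),
    (5, "2021-07-01T00:15:00", "Name1", 110),
    (6, "2021-07-01T00:15:00", "Name3", 450),
    (7, "2021-07-01T00:20:00", "Name1", 1000),
]

# Inner query: periods with a single distinct name, and their lowest row_id.
# Outer query: fetch the full row for that row_id.
ONE_NAME_SQL = """
SELECT s.row_id, s.time_period, s.name, s.value
FROM samples AS s
JOIN (
    SELECT time_period, MIN(row_id) AS first_row
    FROM samples
    GROUP BY time_period
    HAVING COUNT(DISTINCT name) = 1
) AS k ON s.row_id = k.first_row
ORDER BY s.time_period
"""


def one_name_periods(rows):
    """Return one row per period that contains exactly one distinct name."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples "
        "(row_id INTEGER, time_period TEXT, name TEXT, value INTEGER)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(ONE_NAME_SQL).fetchall()
    conn.close()
    return result


kept = one_name_periods(ROWS)
for row in kept:
    print(row)  # rows 3, 4 and 7
```

On this data the sketch keeps the same rows (3, 4 and 7) as the result table above. It assumes row_id is unique, which the join relies on.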

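【Comments】:

This is still giving me far fewer results than I expected.

Provide a test case (with INSERT statements) that produces the behavior you don't understand. Then we can review it together. Note: this is easy to test. I just don't have your data, which shouldn't really change the behavior.

Your note about "as all other rows have multiple names for the same time_period" may be a hint. Do you actually care whether the names are different, or just that there is only one row in the result for each time_period? If you mean something else, the logic may need to change slightly. Remember, your original requirement was only that time_period be unique in the result.

Maybe a better way to explain it: this data set is updated every 5 minutes with either None or the name and value of an object of interest. I am trying to extract every period that has exactly 1 name/value pair associated with it. That will give me a list with many repeated names. I then want to filter that list down to the latest entry for each unique name. It is a large data set and I'm not entirely sure what is going wrong, only that I'm getting fewer results than I know I should.

Would you mind adding that data to the test case in the question? I think you want to add a case where a time_period is seen more than once but the name/value pair is also the same. That case will not show up in the current result, which uses only the COUNT window function partitioned by time_period alone. That sounds like the problem. Adjust your requirements to reflect this.

【Answer 2】: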

Try the following:

select * from (
  select * from (
    SELECT time_period, name, value
    FROM TBL.TBLValues,
      UNNEST(nested_name) as unnested_name
    WHERE time_period > '2021-07-01 00:00:00'
    AND name != "None"
  )
  where true
  qualify count(*) over(partition by time_period) = 1
)
where true
qualify row_number() over(partition by name order by time_period desc) = 1    

If applied to the sample data in your question, the output is (the output image is not preserved here):
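
The two-stage filter can be checked locally with a Python sketch. SQLite has no QUALIFY clause, so the version below expresses the same two steps with GROUP BY and a join, using sqlite3 as a stand-in (an assumption). The table `samples`, the CTE `unique_periods` and the helper `latest_unique_per_name` are hypothetical names, and the rows are the seven sample rows from the question.

```python
import sqlite3

# Sample rows from the question: (time_period, name, value).
ROWS = [
    ("2021-07-01T00:00:00", "Name1", 100),
    ("2021-07-01T00:00:00", "Name2", 105),
    ("2021-07-01T00:05:00", "Name1", 120),
    ("2021-07-01T00:10:00", "Name3", 500),
    ("2021-07-01T00:15:00", "Name1", 110),
    ("2021-07-01T00:15:00", "Name3", 450),
    ("2021-07-01T00:20:00", "Name1", 1000),
]

# Step 1 (unique_periods): periods that contain exactly one row.
# Step 2: for each name, keep the row with the latest such period.
TWO_STEP_SQL = """
WITH unique_periods AS (
    SELECT time_period, MIN(name) AS name, MIN(value) AS value
    FROM samples
    GROUP BY time_period
    HAVING COUNT(*) = 1
)
SELECT u.time_period, u.name, u.value
FROM unique_periods AS u
JOIN (
    SELECT name, MAX(time_period) AS latest
    FROM unique_periods
    GROUP BY name
) AS l ON u.name = l.name AND u.time_period = l.latest
ORDER BY u.name
"""


def latest_unique_per_name(rows):
    """Return the latest single-row period for each name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (time_period TEXT, name TEXT, value INTEGER)")
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
    result = conn.execute(TWO_STEP_SQL).fetchall()
    conn.close()
    return result


latest = latest_unique_per_name(ROWS)
for row in latest:
    print(row)
```

On the sample rows this sketch returns one row for Name1 (the 00:20:00 period, value 1000) and one for Name3 (the 00:10:00 period, value 500). Name2 never appears alone in a period, so it is absent.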

【Comments】:

【Answer 3】:

The simplest way is probably qualify:

SELECT time_period, name, value
FROM TBL.TBLValues t CROSS JOIN
     UNNEST(t.nested_name) as unnested_name
WHERE time_period > '2021-07-01 00:00:00' AND
      name <> 'None'
QUALIFY COUNT(*) OVER (PARTITION BY time_period) = 1
ORDER BY time_period;

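【Comments】:

This gives me an error: "Name time_period not found inside t"

@ghw29 . . . Then remove the t. prefix. Your question doesn't show the layout of the table, so I assumed the column was in tblValues rather than in unnested_name.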
