Select Rows with unique column value in SQL
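【Posted】: 2021-07-28 13:22:31
【Question】: I have done a lot of searching, and I feel this should have a simple solution, but I cannot get anything to work.
I have a database table that contains a nested array. After I unnest the array and select only the columns I want, I end up with a table that looks like this: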
SELECT time_period, name, value
FROM
TBL.TBLValues,
UNNEST(nested_name) as unnested_name
WHERE time_period > '2021-07-01 00:00:00'
AND name != "None"
ORDER BY time_period;
row | time_period | name | value |
---|---|---|---|
1 | 2021-07-01T00:00:00 | Name1 | 100 |
2 | 2021-07-01T00:00:00 | Name2 | 105 |
3 | 2021-07-01T00:05:00 | Name1 | 120 |
4 | 2021-07-01T00:10:00 | Name3 | 500 |
5 | 2021-07-01T00:15:00 | Name1 | 110 |
6 | 2021-07-01T00:15:00 | Name3 | 450 |
7 | 2021-07-01T00:20:00 | Name1 | 1000 |
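What I want to do is filter my query (possibly with a nested query?) so that I only get rows where time_period is unique. In the table above, I would only return rows 3 and 4, as all other rows have multiple names for the same time_period.
I tried SELECT DISTINCT(time_period), which did cut down the number of rows returned, but it definitely still returns rows where multiple names share the same time_period. I don't know why; my understanding of that function was that it should only return rows whose time_period exists exactly once.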
SELECT DISTINCT(time_period)
FROM
TBL.TBLValues,
UNNEST(nested_name) as unnested_name
WHERE time_period > '2021-07-01 00:00:00'
AND name != 'None'
ORDER BY time_period;
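DISTINCT keeps one copy of every value, including values that repeat; it does not drop values that occur more than once. A minimal sketch of the difference, using Python's built-in sqlite3 module as a stand-in for the real database. The table name samples and the in-memory data (the first six sample rows) are assumptions made for illustration, not the asker's actual schema:

```python
import sqlite3

# Hypothetical stand-in table holding the first six sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (time_period TEXT, name TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [
        ("2021-07-01T00:00:00", "Name1", 100),
        ("2021-07-01T00:00:00", "Name2", 105),
        ("2021-07-01T00:05:00", "Name1", 120),
        ("2021-07-01T00:10:00", "Name3", 500),
        ("2021-07-01T00:15:00", "Name1", 110),
        ("2021-07-01T00:15:00", "Name3", 450),
    ],
)

# DISTINCT collapses repeats: every period appears once, repeated or not.
distinct_periods = [r[0] for r in conn.execute(
    "SELECT DISTINCT time_period FROM samples ORDER BY time_period")]

# GROUP BY ... HAVING COUNT(*) = 1 keeps only periods that occur exactly once.
single_periods = [r[0] for r in conn.execute(
    "SELECT time_period FROM samples GROUP BY time_period "
    "HAVING COUNT(*) = 1 ORDER BY time_period")]

print(distinct_periods)  # four periods, 00:00 through 00:15
print(single_periods)    # only 00:05 and 00:10
```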
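I also tried COUNT(time_period) AS counter and then HAVING counter = 1 at the end of the query. This got closest to what I want, but it returned very few results, and I think the GROUP BY may be doing something odd? It gives me only one time_period per name, but some of those time_periods are duplicated. Ideally, the next step after this filter is to take each unique time_period and then, for each unique Name, filter down to the most recent time_period, so it would be best if this first returned every unique time_period so that I can do that next.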
SELECT COUNT(time_period) as counter, time_period, name, value
FROM(
SELECT time_period, name, value
FROM TBL.TBLValues,
UNNEST(nested_name) as unnested_name
WHERE time_period > '2021-07-01 00:00:00'
AND name != 'None')
GROUP BY name, value, time_period
HAVING counter = 1
ORDER BY time_period;
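One plausible reason the query above does not isolate single-row periods (a guess, since the real data is not shown): grouping by name, value, and time_period together makes every distinct row its own group, so the count is 1 almost everywhere. A minimal sketch with Python's sqlite3 module, where the samples table and its data are assumptions based on the first six sample rows:

```python
import sqlite3

# Hypothetical stand-in table holding the first six sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (time_period TEXT, name TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [
        ("2021-07-01T00:00:00", "Name1", 100),
        ("2021-07-01T00:00:00", "Name2", 105),
        ("2021-07-01T00:05:00", "Name1", 120),
        ("2021-07-01T00:10:00", "Name3", 500),
        ("2021-07-01T00:15:00", "Name1", 110),
        ("2021-07-01T00:15:00", "Name3", 450),
    ],
)

# Grouping by every selected column: each distinct row is its own group,
# so COUNT(*) is 1 for all six rows and HAVING removes nothing.
per_row = conn.execute(
    "SELECT time_period, name, value, COUNT(*) AS counter FROM samples "
    "GROUP BY name, value, time_period HAVING COUNT(*) = 1 "
    "ORDER BY time_period, name"
).fetchall()

# Grouping by time_period alone: COUNT(*) is the number of rows per period.
per_period = conn.execute(
    "SELECT time_period, COUNT(*) AS counter FROM samples "
    "GROUP BY time_period HAVING COUNT(*) = 1 ORDER BY time_period"
).fetchall()

print(len(per_row))  # 6: nothing was filtered out
print(per_period)    # only the 00:05 and 00:10 periods remain
```

Grouping by time_period alone loses the name and value columns, which is why the answers in this thread use a window function instead.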
I also tried reworking the solution from this question using PARTITION BY, but I could not get it to filter my results at all.
SELECT time_period, name, value
FROM(
SELECT time_period, name, value, ROW_NUMBER() OVER (PARTITION BY time_period ORDER BY value ASC) AS row_num
FROM TBL.TBLValues,
UNNEST(nested_name) as unnested_name
WHERE time_period > '2021-07-01 00:00:00'
AND name != 'None') as all_duplicated
WHERE row_num = 1
ORDER BY time_period;
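【Comments】:
Add a tag for your database platform. No, that is not what DISTINCT is for. It forces only one copy of each distinct value to be retrieved, whether or not the rows are duplicated.
【Answer 1】: There are several ways to do this. Here is one (using standard SQL):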
WITH xrows AS (
SELECT tbl.*
, COUNT(*) OVER (PARTITION BY time_period) AS n
FROM tbl
)
SELECT *
FROM xrows
WHERE n = 1
ORDER BY time_period
;
And with your SQL as the starting point:
WITH your_sql AS (
SELECT time_period, name, value
, COUNT(*) OVER (PARTITION BY time_period) AS n
FROM TBL.TBLValues,
UNNEST(nested_name) as unnested_name
WHERE time_period > '2021-07-01 00:00:00'
AND name != 'None'
)
SELECT *
FROM your_sql
WHERE n = 1
ORDER BY time_period
;
Now with the sample data you provided:
WITH your_sql (row, time_period, name, value) AS (
SELECT 1, '2021-07-01T00:00:00', 'Name1', 100 UNION ALL
SELECT 2, '2021-07-01T00:00:00', 'Name2', 105 UNION ALL
SELECT 3, '2021-07-01T00:05:00', 'Name1', 120 UNION ALL
SELECT 4, '2021-07-01T00:10:00', 'Name3', 500 UNION ALL
SELECT 5, '2021-07-01T00:15:00', 'Name1', 110 UNION ALL
SELECT 6, '2021-07-01T00:15:00', 'Name3', 450
)
, xrows AS (
SELECT t.*
, COUNT(*) OVER (PARTITION BY time_period) AS n
FROM your_sql AS t
)
SELECT * FROM xrows WHERE n = 1
ORDER BY time_period
;
Result:
+-----+---------------------+-------+-------+---+
| row | time_period | name | value | n |
+-----+---------------------+-------+-------+---+
| 3 | 2021-07-01T00:05:00 | Name1 | 120 | 1 |
| 4 | 2021-07-01T00:10:00 | Name3 | 500 | 1 |
+-----+---------------------+-------+-------+---+
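Here is an updated solution for the new requirement. I added a duplicate of row=3 (as row=7), but only one of the two will be shown. The earlier COUNT logic would have removed this case entirely: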
WITH your_sql (row, time_period, name, value) AS (
SELECT 1, '2021-07-01T00:00:00', 'Name1', 100 UNION ALL
SELECT 2, '2021-07-01T00:00:00', 'Name2', 105 UNION ALL
SELECT 3, '2021-07-01T00:05:00', 'Name1', 120 UNION ALL
SELECT 7, '2021-07-01T00:05:00', 'Name1', 120 UNION ALL
SELECT 4, '2021-07-01T00:10:00', 'Name3', 500 UNION ALL
SELECT 5, '2021-07-01T00:15:00', 'Name1', 110 UNION ALL
SELECT 6, '2021-07-01T00:15:00', 'Name3', 450
)
, xrows0 AS (
SELECT t.*
, ROW_NUMBER() OVER (PARTITION BY time_period ORDER BY name, value, row) AS n1
, RANK() OVER (PARTITION BY time_period ORDER BY name, value ) AS n2
FROM your_sql AS t
)
, xrows AS (
SELECT t.*
, MAX(n2) OVER (PARTITION BY time_period) AS m2
FROM xrows0 AS t
)
SELECT *
FROM xrows
WHERE m2 = 1
AND n1 = 1
ORDER BY time_period
;
Result:
+-----+---------------------+-------+-------+----+----+------+
| row | time_period | name | value | n1 | n2 | m2 |
+-----+---------------------+-------+-------+----+----+------+
| 3 | 2021-07-01T00:05:00 | Name1 | 120 | 1 | 1 | 1 |
| 4 | 2021-07-01T00:10:00 | Name3 | 500 | 1 | 1 | 1 |
+-----+---------------------+-------+-------+----+----+------+
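To make the difference between the two approaches concrete, here is a small sketch that runs both filters on the same seven rows through Python's sqlite3 module. It assumes SQLite 3.25 or newer (needed for window functions); the table name samples and the column name row_id (used instead of row to avoid a keyword clash) are illustrative choices, not part of the original answer:

```python
import sqlite3  # window functions require SQLite 3.25 or newer

# Hypothetical stand-in table; row_id replaces the "row" column of the answer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples (row_id INTEGER, time_period TEXT, name TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?, ?)",
    [
        (1, "2021-07-01T00:00:00", "Name1", 100),
        (2, "2021-07-01T00:00:00", "Name2", 105),
        (3, "2021-07-01T00:05:00", "Name1", 120),
        (7, "2021-07-01T00:05:00", "Name1", 120),  # exact duplicate of row 3
        (4, "2021-07-01T00:10:00", "Name3", 500),
        (5, "2021-07-01T00:15:00", "Name1", 110),
        (6, "2021-07-01T00:15:00", "Name3", 450),
    ],
)

# COUNT-only filter: the duplicated 00:05 period has two rows, so it is dropped.
count_only = [r[0] for r in conn.execute("""
    SELECT row_id FROM (
        SELECT row_id, COUNT(*) OVER (PARTITION BY time_period) AS n
        FROM samples
    ) AS x
    WHERE n = 1
    ORDER BY row_id
""")]

# RANK/MAX filter: a period is kept when all its rows tie on (name, value),
# and ROW_NUMBER picks a single representative row.
rank_based = [r[0] for r in conn.execute("""
    SELECT row_id FROM (
        SELECT row_id, n1, MAX(n2) OVER (PARTITION BY time_period) AS m2
        FROM (
            SELECT row_id, time_period,
                   ROW_NUMBER() OVER (PARTITION BY time_period
                                      ORDER BY name, value, row_id) AS n1,
                   RANK() OVER (PARTITION BY time_period
                                ORDER BY name, value) AS n2
            FROM samples
        ) AS x0
    ) AS x
    WHERE m2 = 1 AND n1 = 1
    ORDER BY row_id
""")]

print(count_only)  # [4]
print(rank_based)  # [3, 4]
```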
And with the newer requirement that only the name needs to be the same within the time period, plus your new data row:
WITH your_sql (row, time_period, name, value) AS (
SELECT 1, '2021-07-01T00:00:00', 'Name1', 100 UNION ALL
SELECT 2, '2021-07-01T00:00:00', 'Name2', 105 UNION ALL
SELECT 3, '2021-07-01T00:05:00', 'Name1', 120 UNION ALL
SELECT 8, '2021-07-01T00:05:00', 'Name1', 120 UNION ALL
SELECT 4, '2021-07-01T00:10:00', 'Name3', 500 UNION ALL
SELECT 5, '2021-07-01T00:15:00', 'Name1', 110 UNION ALL
SELECT 6, '2021-07-01T00:15:00', 'Name3', 450 UNION ALL
SELECT 7, '2021-07-01T00:20:00', 'Name1', 1000
)
, xrows0 AS (
SELECT t.*
, ROW_NUMBER() OVER (PARTITION BY time_period ORDER BY name, row ) AS n1
, RANK() OVER (PARTITION BY time_period ORDER BY name, value) AS n2
FROM your_sql AS t
)
, xrows AS (
SELECT t.*
, MAX(n2) OVER (PARTITION BY time_period) AS m2
FROM xrows0 AS t
)
SELECT *
FROM xrows
WHERE m2 = 1
AND n1 = 1
ORDER BY time_period
;
+-----+---------------------+-------+-------+----+----+------+
| row | time_period | name | value | n1 | n2 | m2 |
+-----+---------------------+-------+-------+----+----+------+
| 3 | 2021-07-01T00:05:00 | Name1 | 120 | 1 | 1 | 1 |
| 4 | 2021-07-01T00:10:00 | Name3 | 500 | 1 | 1 | 1 |
| 7 | 2021-07-01T00:20:00 | Name1 | 1000 | 1 | 1 | 1 |
+-----+---------------------+-------+-------+----+----+------+
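If the requirement really is "every row in a period carries the same name", a possibly simpler formulation is to group by time_period and keep groups with a single distinct name. This is an alternative sketch, not the answer's method; it runs through Python's sqlite3 module, and the samples table and row_id column are assumed names:

```python
import sqlite3

# Hypothetical stand-in table with the eight rows used in the answer above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples (row_id INTEGER, time_period TEXT, name TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?, ?)",
    [
        (1, "2021-07-01T00:00:00", "Name1", 100),
        (2, "2021-07-01T00:00:00", "Name2", 105),
        (3, "2021-07-01T00:05:00", "Name1", 120),
        (8, "2021-07-01T00:05:00", "Name1", 120),
        (4, "2021-07-01T00:10:00", "Name3", 500),
        (5, "2021-07-01T00:15:00", "Name1", 110),
        (6, "2021-07-01T00:15:00", "Name3", 450),
        (7, "2021-07-01T00:20:00", "Name1", 1000),
    ],
)

# Keep periods whose rows all share one name; report the lowest row_id
# as the representative row for each surviving period.
single_name_periods = conn.execute("""
    SELECT MIN(row_id) AS row_id, time_period, MIN(name) AS name
    FROM samples
    GROUP BY time_period
    HAVING COUNT(DISTINCT name) = 1
    ORDER BY time_period
""").fetchall()

for row in single_name_periods:
    print(row)
```

On this data the surviving rows are 3, 4, and 7, the same rows as in the result table above. The value column is left out on purpose, because under this reading rows in one period may share a name but differ in value.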
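【Discussion】:
This still gives me far fewer results than I expected.
Provide a test case (with INSERT statements) that produces the behavior you don't understand. Then we can review it together. Note: this is easy to test. I just don't have your data, and that shouldn't really change the behavior.
Your remark "as all other rows have multiple names for the same time_period" may be a hint. Do you actually care whether the names differ, or only that there is a single row per time_period in the result? If you mean something else, the logic may need to change slightly. Keep in mind that your original requirement was only that time_period be unique in the result.
Perhaps a better way to explain it: this dataset is updated every 5 minutes with either None or the name and value of an object of interest. I am trying to extract every period that has exactly 1 name/value pair associated with it. That will give me a list with many repeated names. I then want to filter that list down to the most recent entry for each unique name. It is a large dataset, and I'm not entirely sure what is going wrong, only that I'm getting fewer results than I know there should be.
Would you mind adding that data to the test case in your question? I think you want to add a case where a time_period is seen more than once but the name/value pair is also the same. That case will not appear in the current result, which uses only the COUNT window function partitioned by time_period alone. That sounds like the problem. Adjust your requirement to reflect this.
【Answer 2】:
Try the following: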
select * from (
select * from (
SELECT time_period, name, value
FROM TBL.TBLValues,
UNNEST(nested_name) as unnested_name
WHERE time_period > '2021-07-01 00:00:00'
AND name != "None"
)
where true
qualify count(*) over(partition by time_period) = 1
)
where true
qualify row_number() over(partition by name order by time_period desc) = 1
If applied to the sample data in your question, the output is:
[output image not preserved in the source]
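A rough sketch of what the two QUALIFY steps compute, run on the seven sample rows from the question through Python's sqlite3 module. SQLite has no QUALIFY clause, so each window filter is written as a subquery plus WHERE; the samples table name is an assumption, and SQLite 3.25 or newer is required for window functions:

```python
import sqlite3  # window functions require SQLite 3.25 or newer

# Hypothetical stand-in table holding the seven sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (time_period TEXT, name TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [
        ("2021-07-01T00:00:00", "Name1", 100),
        ("2021-07-01T00:00:00", "Name2", 105),
        ("2021-07-01T00:05:00", "Name1", 120),
        ("2021-07-01T00:10:00", "Name3", 500),
        ("2021-07-01T00:15:00", "Name1", 110),
        ("2021-07-01T00:15:00", "Name3", 450),
        ("2021-07-01T00:20:00", "Name1", 1000),
    ],
)

latest_per_name = conn.execute("""
    SELECT time_period, name, value
    FROM (
        -- Step 2: rank the surviving rows per name, newest period first.
        SELECT time_period, name, value,
               ROW_NUMBER() OVER (PARTITION BY name
                                  ORDER BY time_period DESC) AS rn
        FROM (
            -- Step 1: count how many rows share each time_period.
            SELECT time_period, name, value,
                   COUNT(*) OVER (PARTITION BY time_period) AS n
            FROM samples
        ) AS counted
        WHERE n = 1
    ) AS ranked
    WHERE rn = 1
    ORDER BY name
""").fetchall()

for row in latest_per_name:
    print(row)
```

Step 1 leaves the 00:05, 00:10, and 00:20 rows; step 2 then keeps the 00:20 row for Name1 and the 00:10 row for Name3.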
【Discussion】:
【Answer 3】: The simplest approach is probably QUALIFY:
SELECT time_period, name, value
FROM TBL.TBLValues t CROSS JOIN
UNNEST(t.nested_name) as unnested_name
WHERE time_period > '2021-07-01 00:00:00' AND
name <> 'None'
QUALIFY COUNT(*) OVER (PARTITION BY time_period) = 1
ORDER BY time_period;
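【Discussion】:
This gives me the error "Name time_period not found inside t".
@ghw29 . . . Then remove the t. prefix. Your question doesn't show the layout of the table, so I assumed the column was in tblValues rather than in unnested_name.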