How to select entire rows based on distinct columns

Posted 2023-04-15


【Title】How to select entire rows based on distinct columns 【Posted】2016-06-25 23:07:15 【Question】:

I am doing this in Spark.

cityId  PhysicalAddress      EmailAddress         ..many other columns of other meta info...   
1       b st                 something@email.com   
1       b st                 something@email.com   <- some rows can be entirely duplicates
1       a avenue             random@gmail.com
2       c square             anything@yahoo.com
2       d blvd               d@d.com

There is no primary key on this table, and I want to grab one random row for each distinct cityId.

For example, this is a correct answer:

cityId  PhysicalAddress      EmailAddress        ..many other columns 
1       b st                 something@email.com   
2       c square             anything@yahoo.com

This is also a correct answer:

cityId  PhysicalAddress      EmailAddress       ..many other columns 
1       a avenue             random@gmail.com
2       c square             anything@yahoo.com

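One approach that comes to mind is GROUP BY. However, that requires an aggregate function (for example min()) on the other columns, whereas I just want to pull out one entire row, no matter which one.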

【Comments on the question】:

【Answer 1】:

;WITH CTE AS
(
   SELECT *, ROW_NUMBER() OVER(PARTITION BY cityId ORDER BY cityId) AS RN
   FROM [TABLE_NAME]
) SELECT * FROM CTE WHERE RN = 1
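
To make the query above easier to try out, here is a minimal sketch that runs the same ROW_NUMBER() pattern against an in-memory SQLite database from Python. It assumes SQLite 3.25 or later (the release that added window functions). The table name `contacts`, the helper `one_row_per_city`, and the use of SQLite instead of SQL Server or Spark are assumptions made only to keep the example self-contained.

```python
import sqlite3

# Sample data from the question, including one exact duplicate row.
SAMPLE_ROWS = [
    (1, "b st", "something@email.com"),
    (1, "b st", "something@email.com"),
    (1, "a avenue", "random@gmail.com"),
    (2, "c square", "anything@yahoo.com"),
    (2, "d blvd", "d@d.com"),
]


def one_row_per_city(rows):
    """Return one arbitrary full row per distinct cityId."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE contacts ("
            "cityId INTEGER, PhysicalAddress TEXT, EmailAddress TEXT)"
        )
        conn.executemany("INSERT INTO contacts VALUES (?, ?, ?)", rows)
        # Number the rows inside each cityId partition and keep row 1.
        # Ordering by the partition key leaves the choice of row arbitrary.
        query = """
            WITH cte AS (
                SELECT *,
                       ROW_NUMBER() OVER (
                           PARTITION BY cityId ORDER BY cityId
                       ) AS rn
                FROM contacts
            )
            SELECT cityId, PhysicalAddress, EmailAddress
            FROM cte
            WHERE rn = 1
            ORDER BY cityId
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in one_row_per_city(SAMPLE_ROWS):
        print(row)
```

Which row comes back for each cityId is not specified, which matches the question's "any row will do" requirement, but the choice is arbitrary rather than uniformly random.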

【Comments】:

【Answer 2】:

I have SQL Server 2008 R2, but I tried to find approaches that would also work on other DBMSs.

create table contacts( cityId int, PhysicalAddress varchar(max), EmailAddress varchar(max) )

delete contacts
insert contacts( cityId, PhysicalAddress, EmailAddress )    /** ..many other columns of other meta info...  */  
values 
   ( 1, 'b st', 'something@email.com' )  
 , ( 1, 'b st', 'something@email.com' )                     /* some rows can be entirely duplicates */
 , ( 1, 'a avenue', 'random@gmail.com' )
 , ( 2, 'c square', 'anything@yahoo.com' )
 , ( 2, 'd blvd', 'd@d.com' )
 , ( 3, 'e circuit', 'e@e.com' )

-- using row_number()

with c as (
      select *, row_number() over (partition by cityId order by cityId) as seqnum
      from contacts
     )
    select * from c
    where seqnum = 1;


-- Add a new identity column

alter table contacts
    add id int identity(1,1)

select * from contacts where id in (select min(id) from contacts group by cityId)

-- Variation: Create a copy into a temp table and add an identity column
-- Note: It may not be possible to modify original table

select * into #contacts from contacts
alter table #contacts
    add id int identity(1,1)
select * from #contacts where id in (select min(id) from #contacts group by cityID)
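
The temp-table variation can be sketched outside SQL Server as well. The snippet below is a rough Python/SQLite analogue, not the answer's exact code: an `INTEGER PRIMARY KEY` column stands in for `identity(1,1)`, and the names `contacts_copy` and `first_row_per_city` are made up for the example.

```python
import sqlite3

# Sample data from the answer, including the extra cityId 3 row.
SAMPLE_ROWS = [
    (1, "b st", "something@email.com"),
    (1, "b st", "something@email.com"),
    (1, "a avenue", "random@gmail.com"),
    (2, "c square", "anything@yahoo.com"),
    (2, "d blvd", "d@d.com"),
    (3, "e circuit", "e@e.com"),
]


def first_row_per_city(rows):
    """Copy the table, add a surrogate key, keep the lowest key per cityId."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE contacts ("
            "cityId INTEGER, PhysicalAddress TEXT, EmailAddress TEXT)"
        )
        conn.executemany("INSERT INTO contacts VALUES (?, ?, ?)", rows)
        # The original table stays untouched; the copy gets an
        # auto-assigned integer key (SQLite's stand-in for an identity).
        conn.execute(
            "CREATE TEMP TABLE contacts_copy ("
            "id INTEGER PRIMARY KEY, "
            "cityId INTEGER, PhysicalAddress TEXT, EmailAddress TEXT)"
        )
        conn.execute(
            "INSERT INTO contacts_copy (cityId, PhysicalAddress, EmailAddress) "
            "SELECT cityId, PhysicalAddress, EmailAddress FROM contacts"
        )
        query = """
            SELECT cityId, PhysicalAddress, EmailAddress
            FROM contacts_copy
            WHERE id IN (SELECT MIN(id) FROM contacts_copy GROUP BY cityId)
            ORDER BY cityId
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in first_row_per_city(SAMPLE_ROWS):
        print(row)
```

Because the surrogate key is unique, the `MIN(id)` subquery picks exactly one row per cityId even when two rows are complete duplicates of each other.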

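I also tried a computed column based on newid(), but my excitement was short-lived: when you join the table to itself or use a subquery on it, the computed column is recomputed for every SELECT, so it is useless. You also cannot make the computed column PERSISTED; that is not allowed for a non-deterministic expression such as newid(), which returns a different value each time it is called for a given row.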
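
One possible way around the recomputation problem is to materialize the random key once in a working table and only then query it. The sketch below illustrates that idea with Python and SQLite, where `random()` plays the role of `newid()`. This is a swapped-in technique rather than something the answer tested, and the names `keyed`, `rnd`, and `random_row_per_city` are hypothetical.

```python
import sqlite3

# Sample data from the question, including one exact duplicate row.
SAMPLE_ROWS = [
    (1, "b st", "something@email.com"),
    (1, "b st", "something@email.com"),
    (1, "a avenue", "random@gmail.com"),
    (2, "c square", "anything@yahoo.com"),
    (2, "d blvd", "d@d.com"),
]


def random_row_per_city(rows):
    """Pick one row per cityId using a random key that is stored once."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE contacts ("
            "cityId INTEGER, PhysicalAddress TEXT, EmailAddress TEXT)"
        )
        conn.executemany("INSERT INTO contacts VALUES (?, ?, ?)", rows)
        # Materialize the random key: each row's value is written to the
        # working table, so later queries read it instead of recomputing it.
        conn.execute(
            "CREATE TEMP TABLE keyed AS "
            "SELECT cityId, PhysicalAddress, EmailAddress, random() AS rnd "
            "FROM contacts"
        )
        # Rank rows within each cityId by the stored key and keep the first.
        query = """
            WITH ranked AS (
                SELECT *,
                       ROW_NUMBER() OVER (
                           PARTITION BY cityId ORDER BY rnd
                       ) AS rn
                FROM keyed
            )
            SELECT cityId, PhysicalAddress, EmailAddress
            FROM ranked
            WHERE rn = 1
            ORDER BY cityId
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in random_row_per_city(SAMPLE_ROWS):
        print(row)
```

Repeated calls can return different rows for the same cityId, since a fresh key is drawn each time the working table is built.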

【Comments】:
