How to avoid repeat occurrences in a column


【Title】: How to avoid repeat occurrences in a column 【Posted】: 2019-11-05 16:22:22 【Question】:

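I have a table that describes agents booking tickets for different customers. The data below describes one customer.

From the data above, the output I expect is

The output means that I want to group the sequence: the agent first booked some tickets to Singapore, then to Austin, then to Singapore again, and then to Delhi.

How can we achieve this in SQL? Please help.

Output like the following would also help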

【Comments】:

Please post your data as tabular text (or, better yet, as `insert` statements) rather than as images.

Why is "Singapore" there twice?

【Answer 1】:

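This is a gaps-and-islands problem. To solve it, you need to build groups of adjacent records. This is usually done by comparing row numbers computed over two different partitions: within a run of consecutive rows that share the same destination, the difference between the two row numbers stays constant.

Consider: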

select 
    agent_id,   
    travel_destination,
    min(date_of_booking) first_date_of_booking,
    max(date_of_booking) max_date_of_booking
from (
    select 
        t.*,
        row_number() 
            over(partition by agent_id order by date_of_booking) rn1,
        row_number() 
            over(partition by agent_id, travel_destination order by date_of_booking) rn2
    from mytable t
) t 
group by 
    agent_id, 
    rn1 - rn2,
    travel_destination
order by first_date_of_booking

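Note that I added the start and end dates of each group to the result, because I find that it makes the answer more meaningful.

Side note: from your sample data, it is unclear whether you want customerid to be part of the group; I assumed not (if you do, you need to add that column to both partitions).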

Demo on DB Fiddle

Given this (simplified) dataset:

agent_id | travel_destination | customer ID | date_of_booking
:------- | :----------------- | :---------- | :--------------
A1001 | Singapore | C1001 | 2019-06-10
A1001 | Singapore | C1001 | 2019-06-11
A1001 | Austin | C1001 | 2019-06-12
A1001 | Singapore | C1001 | 2019-06-13
A1001 | Singapore | C1001 | 2019-06-14
A1001 | New Delhi | C1001 | 2019-06-15

The query returns:

agent_id | travel_destination | first_date_of_booking | max_date_of_booking
:------- | :----------------- | :-------------------- | :------------------
A1001 | Singapore | 2019-06-10 | 2019-06-11
A1001 | Austin | 2019-06-12 | 2019-06-12
A1001 | Singapore | 2019-06-13 | 2019-06-14
A1001 | New Delhi | 2019-06-15 | 2019-06-15
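To see why grouping on rn1 - rn2 isolates each run, here is a minimal Python sketch of the same bookkeeping. It is only an illustration of the idea, not a replacement for the query: the function name `islands` and the tuple layout are assumptions made for this example, and dates are kept as ISO strings so that plain string ordering matches date ordering.

```python
from collections import defaultdict

# Rows mirror the simplified dataset: (agent_id, travel_destination, date_of_booking).
rows = [
    ("A1001", "Singapore", "2019-06-10"),
    ("A1001", "Singapore", "2019-06-11"),
    ("A1001", "Austin", "2019-06-12"),
    ("A1001", "Singapore", "2019-06-13"),
    ("A1001", "Singapore", "2019-06-14"),
    ("A1001", "New Delhi", "2019-06-15"),
]


def islands(rows):
    """Group consecutive bookings per agent using the rn1 - rn2 difference."""
    rn1 = defaultdict(int)  # row number per agent
    rn2 = defaultdict(int)  # row number per (agent, destination)
    groups = {}  # (agent, destination, rn1 - rn2) -> [first_date, last_date]
    for agent, dest, day in sorted(rows, key=lambda r: (r[0], r[2])):
        rn1[agent] += 1
        rn2[(agent, dest)] += 1
        key = (agent, dest, rn1[agent] - rn2[(agent, dest)])
        if key in groups:
            groups[key][1] = day  # extend the current island
        else:
            groups[key] = [day, day]  # start a new island
    # Same ordering as "order by first_date_of_booking" in the query.
    return sorted(
        ((a, d, first, last) for (a, d, _), (first, last) in groups.items()),
        key=lambda g: g[2],
    )


result = islands(rows)
for row in result:
    print(row)
```

The two Singapore runs get differences 0 and 1 respectively, which is what keeps them apart even though the destination is the same.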

To get the second output that you showed, you can add another level of aggregation and use string_agg():

select 
    agent_id,
    string_agg(travel_destination order by first_date_of_booking) travel_destination
from (
  -- above query
) t
group by agent_id

【Comments】:

【Answer 2】:

Try this, at least if your database has a function like LISTAGG, as Vertica does...

WITH
-- this is your input - next time put it in so it can be 
-- copy-pasted and formatted to the below ....                                                                                                                                                    
input(agent_id,travel_dest,cust_id,bookdt) AS (
          SELECT 'A1001','Singapore','C1001',DATE '2019-06-10'
UNION ALL SELECT 'A1001','Singapore','C1001',DATE '2019-06-11'
UNION ALL SELECT 'A1001','Austin'   ,'C1001',DATE '2019-06-19'
UNION ALL SELECT 'A1001','Austin'   ,'C1001',DATE '2019-06-19'
UNION ALL SELECT 'A1001','Austin'   ,'C1001',DATE '2019-06-20'
UNION ALL SELECT 'A1001','Singapore','C1001',DATE '2019-07-30'
UNION ALL SELECT 'A1001','Singapore','C1001',DATE '2019-07-31'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-01'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-10'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-10'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-10'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-10'
UNION ALL SELECT 'A1001','Delhi'    ,'C1001',DATE '2019-08-25'
)
-- real WITH clause starts here - substitute comma below with "WITH" ...
,
with_prev AS (
  SELECT
    agent_id
  , travel_dest
  , LAG(travel_dest,1,'') OVER (PARTITION BY agent_id ORDER BY bookdt) AS prev_dest
  FROM input
)
,
de_duped AS (
  SELECT
    agent_id
  , travel_dest
   FROM with_prev
   WHERE travel_dest <> prev_dest
)
SELECT
  agent_id
, LISTAGG(travel_dest) AS travel_dest
FROM de_duped
GROUP BY 1
;

You get:

 agent_id |           travel_dest
----------+----------------------------------
 A1001    | Singapore,Austin,Singapore,Delhi

【Comments】:

【Answer 3】:

I would just use lag():

SELECT t.agent_id, t.travel_dest
FROM (SELECT t.*,
             LAG(travel_dest) OVER (PARTITION BY agent_id ORDER BY bookdt) as prev_travel_dest
      FROM t
     ) t
WHERE prev_travel_dest IS NULL OR prev_travel_dest <> travel_dest
ORDER BY agent_id, bookdt;

I can't think of a simpler solution.
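The same "compare with the previous row" logic can be sketched in Python for readers who want to trace it by hand. This is an illustration under assumptions: the function name `drop_consecutive_repeats` and the sample rows are made up for the example, and a missing previous value stands in for the NULL that LAG() returns on the first row of each agent.

```python
# Rows are (agent_id, travel_dest, bookdt); the values are illustrative only.
rows = [
    ("A1001", "Singapore", "2019-06-10"),
    ("A1001", "Singapore", "2019-06-11"),
    ("A1001", "Austin", "2019-06-12"),
    ("A1001", "Singapore", "2019-06-13"),
    ("A1001", "Singapore", "2019-06-14"),
    ("A1001", "Delhi", "2019-06-15"),
]


def drop_consecutive_repeats(rows):
    """Keep a row only if its destination differs from the agent's previous row."""
    kept = []
    prev = {}  # agent_id -> destination of that agent's previous row
    for agent, dest, _bookdt in sorted(rows, key=lambda r: (r[0], r[2])):
        # prev.get() returns None for an agent's first row, like LAG() yielding NULL.
        if prev.get(agent) != dest:
            kept.append((agent, dest))
        prev[agent] = dest
    return kept


deduped = drop_consecutive_repeats(rows)
print(deduped)
```

Because only the immediately preceding row is consulted, a destination that reappears after a different one (Singapore after Austin here) is kept, which is the behavior the question asks for.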

【Comments】:

【Answer 4】:

The following is for BigQuery Standard SQL:

#standardSQL
SELECT agent_id, 
  STRING_AGG(DISTINCT travel_destination) AS travel_destination
FROM `project.dataset.table`
GROUP BY agent_id    

It produces the following output:

Row agent_id    travel_destination   
1   A1001       Singapore,Austin,Delhi      

It looks like the expected output is Singapore,Austin,Singapore,Delhi, so below is another option:

#standardSQL
CREATE TEMP FUNCTION DedupConsecutive(line STRING) RETURNS STRING LANGUAGE js AS """
  return line.split(",").filter(function(value, index, arr) { return value != arr[index + 1]; }).join(",");
""";
SELECT agent_id, 
  DedupConsecutive(STRING_AGG(travel_destination ORDER BY date_of_booking)) destinations
FROM `project.dataset.table`
GROUP BY agent_id   

Same as Gordon's point: I cannot think of a simpler solution. :o)
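For comparison, the string-level step performed by DedupConsecutive can be sketched in Python with itertools.groupby. This is an illustrative equivalent rather than part of the BigQuery answer; the name `dedup_consecutive` is an assumption for the example. It keeps the first item of each run where the JavaScript filter keeps the last, which yields the same string.

```python
from itertools import groupby


def dedup_consecutive(line: str) -> str:
    """Collapse runs of identical adjacent items in a comma-separated string."""
    # groupby yields one key per run of equal adjacent items.
    return ",".join(key for key, _ in groupby(line.split(",")))


collapsed = dedup_consecutive("Singapore,Singapore,Austin,Singapore,Singapore,Delhi")
print(collapsed)
```

As with the UDF, the input must already be ordered by booking date, since only adjacent items are compared.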

【Comments】:

OP wants Singapore,Austin,Singapore,Delhi.
