PostgreSQL Group By Sum

Posted: 2016-11-09 12:16:03

[Question]:

I have been scratching my head over this problem in PostgreSQL. I have a table test with two columns: id and content. For example:

create table test (id integer, 
                   content varchar(1024));

insert into test (id, content) values 
    (1, 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.'),
    (2, 'Lorem Ipsum has been the industrys standard dummy text '),
    (3, 'ever since the 1500s, when an unknown printer took a galley of type and scrambled it to'),
    (4, 'make a type specimen book.'),
    (5, 'It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.'),
    (6, 'It was popularised in the 1960s with the release of Letraset sheets containing Lorem '),
    (7, 'Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker'),
    (8, ' including versions of Lorem Ipsum.');

If I run the following query...

select id, length(content) as characters from test order by id

...then I get:

id | characters
---+-----------
 1 |         74
 2 |         55
 3 |         87
 4 |         26
 5 |        120
 6 |         85
 7 |         87
 8 |         35

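What I want to do is group the ids into rows where the sum of the content lengths exceeds a threshold. For example, if the threshold is 100, the desired result would look like this: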

ids | characters
----+-----------   
1,2 |        129
3,4 |        113    
5   |        120
6,7 |        172    
8   |         35 

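Note (1): The query does not need to produce the characters column, only the ids. It is shown here to convey that each group goes over 100, except for the last row, which is 35.

Note (2): ids can be a comma-separated string or a PostgreSQL array. The type matters less than the values.

Can I do this with a window function, or do I need something more complex, such as a lateral join?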
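For reference, a plain window sum is not enough by itself: it yields a running total that never resets once the threshold is crossed. A minimal sketch against the test table above (the running_total alias is only illustrative):

```sql
-- Running total of content lengths, ordered by id.
-- The total keeps growing and never restarts after passing the threshold,
-- so on its own it cannot mark where a new group should begin.
select id,
       length(content) as characters,
       sum(length(content)) over (order by id) as running_total
from test
order by id;
```

With the sample data the running total starts 74, 129, 216, so there is no point at which the sum restarts for the group 3,4. Some form of state that resets after each group is needed.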

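[Comments]:

The answer to your question is that you need something more complex: a recursive CTE. Performance will not be particularly good.

I accepted @Abelisto's answer because it is the one I used in my code. However, I was very impressed by @Gordon's answer, because I learned a lot by trying to understand it. Thanks, everyone!

[Answer 1]: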

This kind of problem requires a recursive CTE (or similar functionality). Here is an example:

with recursive t as (
      select id, length(content) as len,
             row_number() over (order by id) as seqnum
      from test 
     ),
     cte(id, len, cumelen, ids, seqnum, grp) as (
      select id, len, len as cumelen, t.id::text, 1::int as seqnum, 1 as grp
      from t
      where seqnum = 1
      union all
      select t.id,
             t.len,
             (case when cte.cumelen >= 100 then t.len else cte.cumelen + t.len end) as cumelen,
             (case when cte.cumelen >= 100 then t.id::text else cte.ids || ',' || t.id::text end) as ids,
             t.seqnum::int,
             (case when cte.cumelen >= 100 then cte.grp + 1 else cte.grp end) as grp
      from t join
           cte
           on cte.seqnum = t.seqnum - 1
     )
select grp, max(ids)
from cte
group by grp;

Here is a small working example:

with recursive test as (
      select 1 as id, 'abcd'::text as content union all
      select 2 as id, 'abcd'::text as content union all
      select 3 as id, 'abcd'::text as content 
     ),
     t as (
      select id, length(content) as len,
             row_number() over (order by id) as seqnum
      from test 
     ),
     cte(id, len, cumelen, ids, seqnum, grp) as (
      select id, len, len as cumelen, t.id::text, 1::int as seqnum, 1 as grp
      from t
      where seqnum = 1
      union all
      select t.id,
             t.len,
             (case when cte.cumelen >= 5 then t.len else cte.cumelen + t.len end) as cumelen,
             (case when cte.cumelen >= 5 then t.id::text else cte.ids || ',' || t.id::text end) as ids,
             t.seqnum::int,
             (case when cte.cumelen >= 5 then cte.grp + 1 else cte.grp end)
      from t join
           cte
           on cte.seqnum = t.seqnum - 1
     )
select grp, max(ids)
from cte
group by grp;
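As a variation that is not part of the original answer, the ids can be collected as a PostgreSQL array, and the last row of each group can be picked with DISTINCT ON instead of relying on max(ids). This is only a sketch under the same assumptions as above (the test table from the question and a threshold of 100); the characters alias is illustrative:

```sql
with recursive t as (
      select id, length(content) as len,
             (row_number() over (order by id))::int as seqnum
      from test
     ),
     cte(id, len, cumelen, ids, seqnum, grp) as (
      -- anchor: the first row starts group 1
      select id, len, len, array[id], 1, 1
      from t
      where seqnum = 1
      union all
      -- step: start a new group once the previous row has reached the threshold
      select t.id,
             t.len,
             case when cte.cumelen >= 100 then t.len else cte.cumelen + t.len end,
             case when cte.cumelen >= 100 then array[t.id] else cte.ids || t.id end,
             t.seqnum,
             case when cte.cumelen >= 100 then cte.grp + 1 else cte.grp end
      from t join
           cte
           on cte.seqnum = t.seqnum - 1
     )
-- the last row of each group carries the complete id list and the group total
select distinct on (grp) ids, cumelen as characters
from cte
order by grp, seqnum desc;
```

Taking the last row per group avoids comparing the ids as strings, and the explicit ordering by grp returns the groups in their original sequence.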

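[Comments]:

That is not how I understood the question. He wants to sum consecutive values until the sum reaches a certain threshold, then break and continue summing from the next consecutive value. I am scratching my head too. I am not sure a solution exists in pure SQL without a procedure.

@ThomasG . . . Thank you. I misunderstood the question and have fixed the answer accordingly.

You are really too fast... I had thought of this too and was writing a recursive CTE on SQLfiddle. The unbeatable Gordon :)

The aggregation can be skipped by flagging the row that is the last element of each group (where the cumulative sum > 100) and filtering on it.

@DuduMarkovitz . . . Given how slow recursive CTEs can be, that may not be a useful optimization.

[Answer 2]: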

Using a stored function can spare you the (sometimes) head-scratching queries.

create or replace function fn_foo(ids out int[], characters out int) returns setof record language plpgsql as $$
declare
  r record;
  threshold int := 100;
begin
  ids := '{}'; characters := 0;
  for r in (
    select id, coalesce(length(content),0) as lng
    from test order by id)
  loop
    characters := characters + r.lng;
    ids := ids || r.id;
    if characters > threshold then
      return next;
      ids := '{}'; characters := 0;
    end if;
  end loop;
  if ids <> '{}' then
    return next;
  end if;
end $$;

select * from fn_foo();

╔═══════╤════════════╗
║  ids  │ characters ║
╠═══════╪════════════╣
║ {1,2} │        129 ║
║ {3,4} │        113 ║
║ {5}   │        120 ║
║ {6,7} │        172 ║
║ {8}   │         35 ║
╚═══════╧════════════╝
(5 rows)
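The threshold is hard-coded in the function above. A variant that takes it as an input parameter with a default might look like the following sketch; the name fn_foo_threshold is hypothetical, and the body otherwise follows the same loop:

```sql
create or replace function fn_foo_threshold(
    threshold int default 100,   -- input parameter; callers may override it
    ids out int[],
    characters out int)
returns setof record language plpgsql as $$
declare
  r record;
begin
  ids := '{}'; characters := 0;
  for r in
    select id, coalesce(length(content), 0) as lng
    from test order by id
  loop
    characters := characters + r.lng;
    ids := ids || r.id;
    -- emit the current group once it goes over the threshold, then reset
    if characters > threshold then
      return next;
      ids := '{}'; characters := 0;
    end if;
  end loop;
  -- emit the trailing group that never reached the threshold, if any
  if ids <> '{}' then
    return next;
  end if;
end $$;

select * from fn_foo_threshold();      -- uses the default threshold of 100
select * from fn_foo_threshold(150);   -- overrides the threshold
```

Only input parameters that follow a defaulted parameter need defaults of their own, so the two OUT parameters can stay as they are.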

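[Comments]:

Great, this works! I assume performance will be very good too, since it is just a loop. I also added a threshold integer default 100, parameter at the start so that I can override the threshold!

[Answer 3]: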

Here is a query that uses the LEAD() window function:

SELECT id || ',' || next_id, characters + next_characters total_characters 
FROM  (SELECT id, characters, row_num, 
              CASE 
                WHEN row_num % 2 = 0 
                     AND characters < 100 THEN Lead(id) OVER(ORDER BY id) 
                ELSE NULL 
              END next_id, 
              CASE 
                WHEN row_num % 2 = 0 
                     AND characters < 100 THEN NULL 
                ELSE Lead(characters) OVER(ORDER BY id) 
              END AS next_characters 
       FROM  (SELECT id, 
                     Length(content)  AS characters, 
                     Row_number() 
                       OVER( 
                         ORDER BY id) row_num 
              FROM   test 
              ORDER  BY id) AS numbered) AS paired
WHERE  next_id IS NULL;

Hope this helps.
