Breaking the results of an SQL-like query into groups by percentiles: in Redshift / PostgreSQL

Posted 2023-03-31


【Title】: Breaking the results of an sql like query into groups by percentiles : In Redshift / postgresql 【Posted】: 2014-11-27 15:35:15 【Question】:

I have a set of group_name values and their counts. Suppose it comes from the following statement:

--sample input set --
select group_name, count(*) as group_count 
     from mytable group by group_name 
     order by group_count desc ;

    group_name  group_count 
    A 205
    B 200
    C 67
    D 55
    E 50 
    F 12
    and so on..

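I want to organize the resulting group_count values, together with their group names, into 3 groups, e.g. head, core and tail, where each group is understood to account for about 33% of the total group_count. So the 10, 5, etc. in the query further down would be replaced by their respective percentiles. I need to do all of this in Redshift (postgres 8.0.2).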

As a first pass, it would look something like this:

-- sketch only: the thresholds 10 and 5 are placeholders for the percentile cut-offs --

select case when group_count > 10 then group_name end as Head_group,
       case when group_count > 5 and group_count < 10 then group_name end as core_group,
       case when group_count < 5 then group_name end as tail_group
  from
  ( select group_name, count(*) as group_count
      from mytable group by group_name
     order by group_count desc ) as counts ;

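In the desired syntax, the selection would be based on sum(group_count), i.e. the sum of all group counts. How do I get the same result in PostgreSQL, and more specifically in Redshift? Note that Redshift does not support creating functions. Also, prepare & set are available in Redshift, but not the prepare statement.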

   --sample output set---
    Head_group core_group tail_group 
    A           D          F
    B           E
    C
    --Alternative sample output set---
    Head_group 
    A
    B
    C
    core_group 
    D
    E
    tail_group 
    F

Note that each group can return a different number of rows. In MySQL, I could do the following:

set @total_group_count =(select count(*) from mytable ) ;
set @percentile_group_count = ( select @total_group_count*(30/100))  ;
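
For comparison, the same two values can be computed in Redshift / PostgreSQL without session variables. This is only a sketch: mytable is taken from the question, while the derived-table alias t and the output column names are assumptions chosen to mirror the MySQL variables.

```sql
-- Sketch: compute the total row count and a 30% threshold in one query.
-- The literal 0.30 is used instead of 30/100 because integer division
-- in PostgreSQL and Redshift would evaluate 30/100 to 0.
select t.total_group_count,
       t.total_group_count * 0.30 as percentile_group_count
  from ( select count(*) as total_group_count
           from mytable ) as t ;
```

The derived table can be joined or cross-joined into a larger query wherever the threshold is needed.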

See my related question: Storing the results of a prepared statement as a table in mysql?

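【Comments】:

Could you provide a complete example with a sample input set and a sample desired output set? For instance, the output set you give is clearly not the result of a typical SELECT statement. Could the output be one row per input row, but with an identifier for which group (head, core, tail) it is assigned to?

@JohnR "Could the output be one row per input row, but with an identifier for which group (head, core, tail) it is assigned to?" -> Yes. I mean, I need to logically split the groups into 3 sets by percentile (ordered by count). The actual result could be 3 rows from 1 statement giving head, core and tail, or 3 statements giving head, core and tail, one per call. I only need the actual group_name(s) as the result for each of the three logical groups, as described above.

Added the select statement for the first result, plus more visuals.

【Answer 1】: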

The ntile window function is most likely what you want here.

It can be used in your query like this:

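-- the window ORDER BY repeats count(*): PostgreSQL does not accept the
-- group_count alias inside the OVER clause of the same select list
select group_name, count(*) as group_count,
       ntile(3) over(order by count(*) desc) AS group_ntile
     from mytable group by group_name 
     order by group_count desc;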

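This should split the (descending) values of the group_count column into three groups of equal size, measured in number of rows. You can then use the group_ntile value in a CASE statement to do whatever you want depending on the group a row falls in.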
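
The CASE step described above could be sketched as follows. The labels head, core and tail come from the question; the aliases counts, ranked and group_label are assumptions made for this illustration. The counts are computed in a derived table first so that the window function only refers to plain columns.

```sql
-- Sketch: label each ntile bucket as head / core / tail.
select group_name,
       group_count,
       case group_ntile
            when 1 then 'head'
            when 2 then 'core'
            else 'tail'
       end as group_label
  from ( select group_name,
                group_count,
                -- bucket 1 holds the largest counts, bucket 3 the smallest
                ntile(3) over (order by group_count desc) as group_ntile
           from ( select group_name, count(*) as group_count
                    from mytable
                   group by group_name ) as counts ) as ranked
 order by group_count desc ;
```

This returns one row per group_name with its label, the row-per-input shape agreed on in the comments. To list only one band, wrap the query or add a filter on group_ntile in a further outer select.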

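According to the Redshift documentation, ntile appears to be available.

Edit, based on the OP's comment:

The argument to ntile is the number of ranking groups.

That is, ntile buckets the results (ordered as specified in the window clause) into the number of groups given as the function argument. So if you want true percentiles, you should use ntile(100).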
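
Note that ntile balances the number of rows per bucket, not the summed group_count. If each band should instead cover roughly a third of the total count, as the question describes, a running total is one possible alternative. The sketch below is a different technique from ntile and is not part of the original answer; the aliases counts, running, count_before, total_count and group_label are assumptions, and a group is assigned to the third in which its cumulative range starts.

```sql
-- Sketch: band groups by their share of the summed group_count.
select group_name,
       group_count,
       case
            when count_before < total_count / 3.0     then 'head'
            when count_before < 2 * total_count / 3.0 then 'core'
            else 'tail'
       end as group_label
  from ( select group_name,
                group_count,
                -- summed count of all groups ranked ahead of this one;
                -- the explicit frame clause is included because Redshift
                -- requires one for aggregate windows with ORDER BY
                sum(group_count) over (order by group_count desc, group_name
                                       rows unbounded preceding)
                  - group_count as count_before,
                sum(group_count) over () as total_count
           from ( select group_name, count(*) as group_count
                    from mytable
                   group by group_name ) as counts ) as running
 order by group_count desc, group_name ;
```

With very skewed data a single large group can span more than one third of the total, so the bands may end up far from equal in either rows or counts; group_name in the window ORDER BY only serves to break ties deterministically.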

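【Comments】:

Given that we use ntile(3), will ntile make sure the split is computed by percentile?

Got it, thanks. Would you also like to take a look at this - ***.com/questions/27122670/… - I have worked on that problem, but it still seems to be missing something.

Your original answer for 3 groups is correct. I verified that the split is by the total number of group_names, then ordered by group count, which implies percentiles! If you are looking at that other question, see Edit 4, which is where I currently am.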
