Get most common value for each value of another column in SQL

Posted 2023-02-16

【Posted】: 2010-09-25 13:43:19

【Question】:

I have a table like this:

 Column  | Type | Modifiers 
---------+------+-----------
 country | text | 
 food_id | int  | 
 eaten   | date |

For each country, I want to get the food that is eaten most often. The best I can come up with (I'm using Postgres) is:

CREATE TEMP TABLE counts AS 
   SELECT country, food_id, count(*) as count FROM munch GROUP BY country, food_id;

CREATE TEMP TABLE max_counts AS 
   SELECT country, max(count) as max_count FROM counts GROUP BY country;

SELECT country, max(food_id) FROM counts 
   WHERE (country, count) IN (SELECT * from max_counts) GROUP BY country;

In the last statement, the GROUP BY and max() are needed to break ties, where two different foods have the same count.
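To make the tie case concrete, here is a minimal sketch that runs the same three steps as CTEs against an in-memory SQLite database through Python's sqlite3 module. SQLite is used only so the example is self-contained. The sample rows and the column names `cnt` and `max_cnt` are assumptions for illustration, and the row-value `IN` syntax is expected to carry over to Postgres.

```python
import sqlite3

# Hypothetical sample rows: in the US, foods 1 and 2 are tied at two meals each.
ROWS = [
    ("US", 1, "2017-01-01"), ("US", 1, "2017-01-02"),
    ("US", 2, "2017-01-03"), ("US", 2, "2017-01-04"),
    ("GB", 3, "2017-01-01"), ("GB", 3, "2017-01-02"), ("GB", 2, "2017-01-03"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE munch (country TEXT, food_id INTEGER, eaten TEXT)")
conn.executemany("INSERT INTO munch VALUES (?, ?, ?)", ROWS)

# The two temp tables from the question, written as CTEs.
BASE = """
    WITH counts AS (
        SELECT country, food_id, count(*) AS cnt
        FROM munch GROUP BY country, food_id
    ),
    max_counts AS (
        SELECT country, max(cnt) AS max_cnt FROM counts GROUP BY country
    )
"""

# Without GROUP BY / max(): every food that reaches the top count is returned.
tied = conn.execute(BASE + """
    SELECT country, food_id FROM counts
    WHERE (country, cnt) IN (SELECT country, max_cnt FROM max_counts)
    ORDER BY country, food_id
""").fetchall()

# With GROUP BY / max(): exactly one row per country (the largest tied food_id).
one_per_country = conn.execute(BASE + """
    SELECT country, max(food_id) FROM counts
    WHERE (country, cnt) IN (SELECT country, max_cnt FROM max_counts)
    GROUP BY country
    ORDER BY country
""").fetchall()

print(tied)             # [('GB', 3), ('US', 1), ('US', 2)]
print(one_per_country)  # [('GB', 3), ('US', 2)]
```

The tie is resolved purely by the size of food_id, which is arbitrary from the data's point of view.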

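This seems like a lot of work for something that is conceptually simple. Is there a more direct way to do it?

【Question comments】:

【Answer 1】:

It is even simpler now: PostgreSQL 9.4 introduced the mode() function: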

select country, mode() within group (order by food_id)
from munch
group by country

which returns (for user2247323's example data):

country | mode
--------------
GB      | 3
US      | 1

See the documentation here: https://wiki.postgresql.org/wiki/Aggregate_Mode

https://www.postgresql.org/docs/current/static/functions-aggregate.html#FUNCTIONS-ORDEREDSET-TABLE
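As a rough illustration of what a per-group mode computes, the following Python sketch groups (country, food_id) pairs and picks the most frequent food_id in each group. It is not PostgreSQL's implementation. The function name `mode_per_group` is hypothetical, and sending ties to the smallest food_id is an assumption made here for determinism.

```python
from collections import Counter, defaultdict

# (country, food_id) pairs mirroring the example data referenced above.
ROWS = [
    ("US", 1), ("US", 1), ("US", 2), ("US", 3),
    ("GB", 3), ("GB", 3), ("GB", 2),
]


def mode_per_group(rows):
    """Return {country: most frequent food_id}; ties go to the smallest food_id."""
    counters = defaultdict(Counter)
    for country, food_id in rows:
        counters[country][food_id] += 1

    result = {}
    for country, counter in counters.items():
        top = max(counter.values())
        # Assumption: among equally frequent values, keep the smallest one.
        result[country] = min(v for v, n in counter.items() if n == top)
    return result


modes = mode_per_group(ROWS)
print(modes)  # {'US': 1, 'GB': 3}
```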

【Comments】:

【Answer 2】:

PostgreSQL introduced support for window functions in 8.4, the year after this question was asked. It is worth noting that today it could be solved as follows:

SELECT country, food_id
  FROM (SELECT country, food_id, ROW_NUMBER() OVER (PARTITION BY country ORDER BY freq DESC) AS rn
          FROM (  SELECT country, food_id, COUNT('x') AS freq
                    FROM country_foods
                GROUP BY 1, 2) food_freq) ranked_food_req
 WHERE rn = 1;

The query above breaks ties, returning a single row per country. If you do not want to break ties, you can use DENSE_RANK() instead.
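The difference between the two ranking functions can be checked on a small data set that contains a tie. This is a sketch run against in-memory SQLite via Python's sqlite3 (window functions need SQLite 3.25 or later). The sample rows and the helper name `top_foods` are assumptions for illustration, and COUNT(*) stands in for COUNT('x').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country_foods (country TEXT, food_id INTEGER)")
# Hypothetical rows: foods 1 and 2 are tied in the US.
conn.executemany(
    "INSERT INTO country_foods VALUES (?, ?)",
    [("US", 1), ("US", 1), ("US", 2), ("US", 2),
     ("GB", 3), ("GB", 3), ("GB", 2)],
)


def top_foods(ranking_function):
    """Run the ranked query with ROW_NUMBER or DENSE_RANK and return rank-1 rows."""
    # The name is interpolated into the SQL text, so accept only known values.
    if ranking_function not in ("ROW_NUMBER", "DENSE_RANK"):
        raise ValueError(ranking_function)
    sql = f"""
        SELECT country, food_id
        FROM (SELECT country, food_id,
                     {ranking_function}() OVER (PARTITION BY country
                                                ORDER BY freq DESC) AS rnk
              FROM (SELECT country, food_id, COUNT(*) AS freq
                    FROM country_foods
                    GROUP BY country, food_id) AS food_freq) AS ranked
        WHERE rnk = 1
        ORDER BY country, food_id
    """
    return conn.execute(sql).fetchall()


row_number_result = top_foods("ROW_NUMBER")  # one row per country
dense_rank_result = top_foods("DENSE_RANK")  # all tied foods are kept

print(row_number_result)
print(dense_rank_result)  # [('GB', 3), ('US', 1), ('US', 2)]
```

With ROW_NUMBER, which of the tied US foods is returned is not determined by the query; adding a second ORDER BY key inside the window would make the choice explicit.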

【Comments】:

【Answer 3】:

SELECT DISTINCT
"F1"."food",
"F1"."country"
FROM "foo" "F1"
WHERE
"F1"."food" =
    (SELECT "food" FROM
        (
            SELECT "food", COUNT(*) AS "count"
            FROM "foo" "F2" 
            WHERE "F2"."country" = "F1"."country" 
            GROUP BY "F2"."food" 
            ORDER BY "count" DESC
        ) AS "F5"
        LIMIT 1
    )

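Well, I wrote this in a hurry and did not check it very carefully. The sub-select might be slow, but this is the shortest and simplest SQL statement I could think of. I will probably say more when I am less drunk.

PS: Oh well, "foo" is the name of my table, "food" contains the name of the food, and "country" contains the name of the country. Sample output: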

   food    |  country   
-----------+------------
 Bratwurst | Germany
 Fisch     | Frankreich

【Comments】:

I think those would need to be single quotes in most places.

【Answer 4】:

Try this:

Select Country, Food_id
From Munch T1
Where Food_id= 
    (Select Food_id
     from Munch T2
     where T1.Country= T2.Country
     group by Food_id
     order by count(Food_id) desc
      limit 1)
group by Country, Food_id

【Comments】:

【Answer 5】:

Try something like this:

select country, food_id, count(*) cnt 
into #tempTbl 
from mytable 
group by country, food_id

select country, food_id
from  #tempTbl as x
where cnt = 
  (select max(y.cnt) 
  from #tempTbl as y 
  where y.country = x.country)

This could all be put into a single select, but I don't have time to work on it right now.

Good luck.
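One way the single-select form mentioned above might look is to replace the temp table with a CTE. This sketch runs it against in-memory SQLite via Python's sqlite3 so it is self-contained. The sample rows and the CTE name `counts` are assumptions, and the per-country maximum is taken over the aggregated counts, not the raw table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (country TEXT, food_id INTEGER)")
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [("US", 1), ("US", 1), ("US", 2), ("US", 3),
     ("GB", 3), ("GB", 3), ("GB", 2)],
)

SINGLE_SELECT = """
    WITH counts AS (
        -- same aggregation that the temp table held
        SELECT country, food_id, count(*) AS cnt
        FROM mytable
        GROUP BY country, food_id
    )
    SELECT x.country, x.food_id
    FROM counts AS x
    WHERE x.cnt = (SELECT max(y.cnt)      -- highest count within the same country
                   FROM counts AS y
                   WHERE y.country = x.country)
    ORDER BY x.country, x.food_id
"""

rows = conn.execute(SINGLE_SELECT).fetchall()
print(rows)  # [('GB', 3), ('US', 1)]
```

If two foods share the top count in a country, this form returns both rows.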

【Comments】:

【Answer 6】:

Here is how to do it without using any temp tables:

Edit: simplified

select nf.country, nf.food_id as most_frequent_food_id
from national_foods nf
group by country, food_id 
having
  (country,count(*)) in (  
                        select country, max(cnt)
                        from
                          (
                          select country, food_id, count(*) as cnt
                          from national_foods nf1
                          group by country, food_id
                          )
                        group by country
                        having country = nf.country
                        )

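【Comments】:

I would be interested to see the execution plan for this compared with the temp tables. Those "having" clauses are evaluated after the select has retrieved the matching rows, right? It seems like there could be a lot of extra IO.

There are several full table scans in the plan, yes.

【Answer 7】: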

SELECT country, MAX( food_id )
  FROM( SELECT m1.country, m1.food_id
          FROM munch m1
         INNER JOIN ( SELECT country
                           , food_id
                           , COUNT(*) as food_counts
                        FROM munch m2
                    GROUP BY country, food_id ) as m3
                 ON m1.country = m3.country
         GROUP BY m1.country, m1.food_id 
        HAVING COUNT(*) / COUNT(DISTINCT m3.food_id) = MAX(food_counts) ) AS max_foods
  GROUP BY country

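I don't like using MAX(.) GROUP BY to break ties... There must be a way to bring the eaten date into the JOIN somehow, so that the most recently eaten food is picked instead of an arbitrary one...

If you run this on your live data, I would be interested in the query plan for this thing!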
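One possible way to let the eaten date decide ties is to swap the join for a window function: rank each country's foods by count, then by the most recent eaten date. This is a sketch of that alternative, not the join-based query above. It runs against in-memory SQLite (3.25 or later) via Python's sqlite3; the sample rows and the names `cnt`, `last_eaten` and `rn` are assumptions. Dates are stored as ISO-8601 text here because SQLite has no date type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE munch (country TEXT, food_id INTEGER, eaten TEXT)")
# Hypothetical rows: US foods 1 and 2 are tied, food 2 was eaten more recently.
conn.executemany(
    "INSERT INTO munch VALUES (?, ?, ?)",
    [("US", 1, "2017-01-01"), ("US", 1, "2017-01-02"),
     ("US", 2, "2017-01-03"), ("US", 2, "2017-01-05"),
     ("GB", 3, "2017-01-01"), ("GB", 3, "2017-01-02"),
     ("GB", 2, "2017-01-09")],
)

QUERY = """
    SELECT country, food_id
    FROM (SELECT country, food_id,
                 ROW_NUMBER() OVER (PARTITION BY country
                                    ORDER BY cnt DESC,         -- most eaten first
                                             last_eaten DESC   -- then most recent
                                   ) AS rn
          FROM (SELECT country, food_id,
                       count(*)   AS cnt,
                       max(eaten) AS last_eaten
                FROM munch
                GROUP BY country, food_id) AS per_food) AS ranked
    WHERE rn = 1
    ORDER BY country
"""

result = conn.execute(QUERY).fetchall()
print(result)  # [('GB', 3), ('US', 2)]
```

GB's food 2 has the latest date overall but only one row, so the count still takes priority; the date only matters among equally frequent foods.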

【Comments】:

【Answer 8】:

select country,food_id, count(*) ne  
from   food f1  
group by country,food_id    
having count(*) = (select max(count(*))  
                   from   food f2  
                   where  country = f1.country  
                   group by food_id)

【Comments】:

【Answer 9】:

I believe this is a simple and clear statement that does what you need:

select distinct on (country) country, food_id
from munch
group by country, food_id
order by country, count(*) desc

Please let me know what you think.

By the way, the distinct on feature is only available in Postgres.

Example, source data:

country | food_id | eaten
US        1         2017-1-1
US        1         2017-1-1
US        2         2017-1-1
US        3         2017-1-1
GB        3         2017-1-1
GB        3         2017-1-1
GB        2         2017-1-1

Output:

country | food_id
US        1
GB        3

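【Comments】:

If you are going to offer a new answer this long afterwards, I suggest you try it on a sample table and post the results you get. Also, please say which database server you are using (MySQL or something else).

The distinct on feature is only available in Postgres, so I'm not sure how you would do something like this in another database. The OP is using Postgres, so it seems appropriate.

I wrote this using the table named munch that the OP proposed, which has three fields: country (text), food_id (int), and eaten (date).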
