Hive - multiple (average) count distincts over layered groups

Posted 2023-03-31

[Title]: Hive - multiple (average) count distincts over layered groups [Posted]: 2018-08-22 02:38:52 [Question]:

Given the following source data (assume the table name is user_activity):

+---------+-----------+------------+
| user_id | user_type | some_date  |
+---------+-----------+------------+
| 1       | a         | 2018-01-01 |
| 1       | a         | 2018-01-02 |
| 2       | a         | 2018-01-01 |
| 3       | a         | 2018-01-01 |
| 4       | b         | 2018-01-01 |
| 4       | b         | 2018-01-02 |
| 5       | b         | 2018-01-02 |
+---------+-----------+------------+

I would like to get the following result:

+-----------+------------+---------------------+
| user_type | user_count | average_daily_users |
+-----------+------------+---------------------+
| a         | 3          | 2                   |
| b         | 2          | 1.5                 |
+-----------+------------+---------------------+

using a single query, without multiple subqueries on the same table.

Using multiple queries, I can get:

user_count:

select
  user_type,
  count(distinct user_id)
from user_activity
group by user_type

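For average_daily_users:

select
  user_type,
  avg(distinct_users) as average_daily_users
from (
  -- inner layer: distinct users per user_type and day
  select
    user_type,
    count(distinct user_id) as distinct_users
  from user_activity
  group by user_type, some_date
) daily  -- Hive requires an alias on a subquery in the from clause
group by user_type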
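
To make the target numbers concrete, the two-level grouping above can be traced by hand on the sample rows. The following is a small Python sketch rather than Hive: the `rows` list simply copies the sample table, and the helper name `summarize` is hypothetical, introduced only for illustration.

```python
from collections import defaultdict

# Sample rows copied from the user_activity table above:
# (user_id, user_type, some_date)
rows = [
    (1, "a", "2018-01-01"),
    (1, "a", "2018-01-02"),
    (2, "a", "2018-01-01"),
    (3, "a", "2018-01-01"),
    (4, "b", "2018-01-01"),
    (4, "b", "2018-01-02"),
    (5, "b", "2018-01-02"),
]


def summarize(rows):
    """Return {user_type: (user_count, average_daily_users)} (hypothetical helper)."""
    users_by_type = defaultdict(set)      # outer layer: group by user_type
    users_by_type_day = defaultdict(set)  # inner layer: group by user_type, some_date
    for user_id, user_type, some_date in rows:
        users_by_type[user_type].add(user_id)
        users_by_type_day[(user_type, some_date)].add(user_id)

    # Collect the distinct-user count of each day under its user_type.
    daily_counts = defaultdict(list)
    for (user_type, _day), users in users_by_type_day.items():
        daily_counts[user_type].append(len(users))

    # Average the daily counts per user_type.
    return {
        t: (len(users_by_type[t]), sum(counts) / len(counts))
        for t, counts in daily_counts.items()
    }


result = summarize(rows)
print(result)
```

For type a the daily distinct counts are 3 and 1, which average to 2; for type b they are 1 and 2, which average to 1.5. Both match the desired result table.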

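But I can't seem to write a single query that does what I want in one go. I'm worried about the performance impact of multiple subqueries on the same table (it would have to scan the table twice... right?). I have a fairly large data source and would like to minimize the running time.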

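Note: This question is titled Hive because that is what I am working with, but I think it is a generic enough SQL problem, so I am not ruling out answers in other languages.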

Note 2: This question shares details with my other question about partition by columns in window functions (used to calculate the average daily users column).

[Comments on the question]:

[Answer 1]:

This should do what you want:

select ua.user_type,
       count(distinct ua.user_id) as user_count,
       count(distinct ua.some_date || ':' || ua.user_id) / count(distinct ua.some_date) as average_daily_users
from user_activity ua
group by ua.user_type;
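
The reason this works: within one user_type, the sum of the per-day distinct user counts equals the number of distinct (some_date, user_id) pairs, so dividing that by the number of distinct dates gives the same average as the nested query. Below is a minimal sketch that checks this on the sample data. It assumes SQLite (through Python's sqlite3 module) as a stand-in for Hive, so it is an illustration, not a Hive benchmark. The `1.0 *` factor is an addition needed only in SQLite, where dividing two integers truncates; Hive's `/` already returns a double.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table user_activity (user_id integer, user_type text, some_date text)"
)
conn.executemany(
    "insert into user_activity values (?, ?, ?)",
    [
        (1, "a", "2018-01-01"),
        (1, "a", "2018-01-02"),
        (2, "a", "2018-01-01"),
        (3, "a", "2018-01-01"),
        (4, "b", "2018-01-01"),
        (4, "b", "2018-01-02"),
        (5, "b", "2018-01-02"),
    ],
)

# Two-level version from the question: daily distinct counts, then their average.
nested_sql = """
select user_type, avg(distinct_users) as average_daily_users
from (
  select user_type, count(distinct user_id) as distinct_users
  from user_activity
  group by user_type, some_date
) daily
group by user_type
order by user_type
"""

# Single-level version from the answer.
single_sql = """
select ua.user_type,
       count(distinct ua.user_id) as user_count,
       -- 1.0 * forces real division in SQLite only
       1.0 * count(distinct ua.some_date || ':' || ua.user_id)
           / count(distinct ua.some_date) as average_daily_users
from user_activity ua
group by ua.user_type
order by ua.user_type
"""

nested = conn.execute(nested_sql).fetchall()
single = conn.execute(single_sql).fetchall()
print(nested)
print(single)
```

Both queries report an average of 2.0 for type a and 1.5 for type b on the sample rows. Note that the average is taken over days on which the user type has at least one row, exactly as in the nested query. If the Hive version in use does not accept the `||` operator, `concat(ua.some_date, ':', ua.user_id)` should be an equivalent way to build the pair key.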

[Comments]:

Great idea! I didn't even consider doing the average myself. It feels so wrong and yet so right. TYVM!
