Hive SQL aggregate

Posted 2023-04-18


【Asked】: 2011-09-29 16:56:12
【Question】:

I have two tables in Hive, t1 and t2:

>describe t1;
>date_id    string

>describe t2;
>messageid string,
 createddate string,
 userid int

> select * from t1 limit 3;        
> 2011-01-01 00:00:00 
  2011-01-02 00:00:00 
  2011-01-03 00:00:00 

> select * from t2 limit 3;
87211389    2011-01-03 23:57:01 13864753
87211656    2011-01-03 23:57:59 13864769
87211746    2011-01-03 23:58:25 13864785

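What I want is to count the distinct user IDs over the three days ending on a given date. For example, for the date 2011-01-03, I want to count the distinct user IDs from 2011-01-01 to 2011-01-03. For the date 2011-01-04, I want to count the distinct user IDs from 2011-01-02 to 2011-01-04.

I wrote the following query, but it does not return the three-day result. Instead, it returns the distinct user IDs for each single day.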

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id) 
GROUP by to_date(t1.date_id);

`to_date()` and `date_sub()` are date functions in Hive.

In other words, the following part has no effect:

WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id)

Edit: one solution could be the following (but it is very slow):

SELECT to_date(t3.date_id), count(distinct t3.userid) FROM
(
 SELECT * FROM t1  LEFT OUTER JOIN t2
 WHERE 
 (date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3)
  AND to_date(t2.createddate) <= to_date(t1.date_id)
 )
) t3 
GROUP by to_date(t3.date_id);

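Update: thanks for all the answers. They are good, but Hive differs somewhat from standard SQL, and unfortunately none of them can be used in Hive. My current solution is to use UNION ALL.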

 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = to_date(t2.createddate))
 UNION ALL
 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 1))
 UNION ALL
 SELECT * FROM t1 JOIN t2 ON (to_date(t1.date_id) = date_add(to_date(t2.createddate), 2))

Then I do the GROUP BY and the COUNT on top of that. This way I get what I want. It is not elegant, but it is much more efficient than a cross join.
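To see why three equality joins add up to a three-day window, here is a small Python model of the UNION ALL approach. It is only a sketch: the sample rows and the helper name `equi_join` are made up for illustration, and plain Python lists stand in for the Hive tables.

```python
from datetime import date, timedelta

# Hypothetical sample rows: t1 holds dates, t2 holds (createddate, userid).
t1 = [date(2011, 1, 3), date(2011, 1, 4)]
t2 = [
    (date(2011, 1, 1), 10),
    (date(2011, 1, 2), 11),
    (date(2011, 1, 3), 10),
    (date(2011, 1, 4), 12),
]

def equi_join(offset_days):
    """One branch of the UNION ALL: t1.date_id = createddate + offset_days."""
    return [
        (d, userid)
        for d in t1
        for created, userid in t2
        if d == created + timedelta(days=offset_days)
    ]

# UNION ALL of the three equality joins (offsets of 0, 1 and 2 days).
unioned = equi_join(0) + equi_join(1) + equi_join(2)

# GROUP BY date_id with COUNT(DISTINCT userid).
users_by_date = {}
for d, userid in unioned:
    users_by_date.setdefault(d, set()).add(userid)
distinct_counts = {d: len(users) for d, users in users_by_date.items()}

print(distinct_counts)
```

Each branch only ever compares for equality, which is the kind of join condition Hive accepts in the ON clause; the window comes from shifting the date by a different offset in each branch.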


【Answer 1】:

The following should work in standard SQL...

SELECT
  to_date(t1.date_id),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON  to_date(t2.createddate) >= date_sub(to_date(t1.date_id), 2)
    AND to_date(t2.createddate) <  date_add(to_date(t1.date_id), 1)
GROUP BY
  to_date(t1.date_id)

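It will, however, be slow. Because you store the dates as strings, you have to use to_date() to convert them to dates. That means indexes cannot be used, and the SQL engine cannot do anything clever to reduce the amount of work.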

As a result, every possible combination of rows has to be compared. If you have 100 entries in t1 and 10,000 entries in t2, your SQL engine is processing a million combinations.
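The arithmetic behind that figure can be checked with a few lines of Python. This is just an illustration: the table sizes come from the example above, and the row contents are placeholders.

```python
from itertools import product

# Sizes taken from the example above; the row contents are placeholders.
t1_rows = range(100)
t2_rows = range(10_000)

# A join whose condition cannot be narrowed must test every (t1, t2) pair.
comparisons = sum(1 for _ in product(t1_rows, t2_rows))

print(comparisons)  # 1000000
```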

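If you store these values as dates, you do not need to_date(). If you index the dates, the SQL engine can quickly seek to the specified date range.

Note: the way the ON clause is written (>= the start date and < the day after the end date) means you do not need to round t2.createddate down to a daily value.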
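The point of that note can be shown with a small Python check. It is a sketch only: the function name `in_three_day_window` is invented here, and Python `datetime` values stand in for a real date or timestamp column.

```python
from datetime import datetime, timedelta

def in_three_day_window(created, day):
    """Half-open test: day - 2 days <= created < day + 1 day."""
    return day - timedelta(days=2) <= created < day + timedelta(days=1)

day = datetime(2011, 1, 3)               # midnight at the start of the target date
late = datetime(2011, 1, 3, 23, 57, 1)   # a timestamp from the sample t2 rows
next_day = datetime(2011, 1, 4, 0, 0, 0)

print(in_three_day_window(late, day))      # True
print(in_three_day_window(next_day, day))  # False
```

Because the upper bound is exclusive and sits at midnight of the following day, a raw timestamp late on the last day still falls inside the window without being truncated first.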

Edit: why your code does not work...

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
AND to_date(t2.createddate) <= to_date(t1.date_id) 
GROUP by to_date(t1.date_id);

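This joins t1 to t2 with an ON clause of (to_date(t2.createddate) = to_date(t1.date_id)). Because the join matches on equality, every t2.createddate that survives the join falls on the same day as t1.date_id (with a LEFT OUTER JOIN it could additionally be NULL where nothing matches).

The WHERE clause allows a wider range (3 days), but the JOIN's ON clause has already restricted your data to a single day.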
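The effect is easy to reproduce outside Hive. The following Python sketch models both versions of the query on made-up data (one user per day); the names `in_window`, `original` and `rewritten` are illustrative and not part of any query above.

```python
from datetime import date, timedelta

# Hypothetical sample data: one user per day, 1 to 4 January 2011.
t1 = [date(2011, 1, 3)]
t2 = [(date(2011, 1, d), 100 + d) for d in range(1, 5)]  # (createddate, userid)

def in_window(created, d):
    """The question's WHERE clause: created > d - 3 days AND created <= d."""
    return d - timedelta(days=3) < created <= d

# Original query: ON requires the same day, WHERE is applied afterwards.
joined = [(d, c, u) for d in t1 for c, u in t2 if c == d]
original = {u for d, c, u in joined if in_window(c, d)}

# Rewritten query: the range condition itself is the join condition.
rewritten = {u for d in t1 for c, u in t2 if in_window(c, d)}

print(len(original), len(rewritten))
```

With the equality join, only the row for 2011-01-03 reaches the WHERE clause, so the three-day filter has nothing left to widen.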

The example I gave above simply takes your WHERE clause and puts it in place of the old ON clause.

Edit

Hive does not allow <= and >= in the ON clause? Are you really stuck with using Hive???

If you really are, what about BETWEEN?

SELECT
  to_date(t1.date_id),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON to_date(t2.createddate) BETWEEN date_sub(to_date(t1.date_id), 2) AND to_date(t1.date_id)
GROUP BY
  to_date(t1.date_id)

Or, restructure your dates table to enumerate the dates you want to include...

TABLE t1 (calendar_date, inclusive_date) =
  2011-01-03, 2011-01-01
  2011-01-03, 2011-01-02
  2011-01-03, 2011-01-03

  2011-01-04, 2011-01-02
  2011-01-04, 2011-01-03
  2011-01-04, 2011-01-04

  2011-01-05, 2011-01-03
  2011-01-05, 2011-01-04
  2011-01-05, 2011-01-05

SELECT
  to_date(t1.calendar_date),
  count(distinct t2.userid)
FROM
  t1
LEFT JOIN
  t2
    ON to_date(t2.createddate) = to_date(t1.inclusive_date)
GROUP BY
  to_date(t1.calendar_date)
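The restructured dates table could be generated rather than typed by hand. Below is a minimal Python sketch of that step, assuming a trailing window of three days; the function name `expand_calendar` and its `window_days` parameter are made up for illustration, and loading the rows into Hive is left out.

```python
from datetime import date, timedelta

def expand_calendar(calendar_dates, window_days=3):
    """Return (calendar_date, inclusive_date) pairs covering each trailing window."""
    rows = []
    for cal in calendar_dates:
        # Oldest day first, so each group ends on the calendar date itself.
        for back in range(window_days - 1, -1, -1):
            rows.append((cal, cal - timedelta(days=back)))
    return rows

rows = expand_calendar([date(2011, 1, 3), date(2011, 1, 4), date(2011, 1, 5)])
for cal, inc in rows:
    print(cal.isoformat(), inc.isoformat())
```

The design choice here is to move the range logic into data: once every (calendar_date, inclusive_date) pair exists as a row, the join against t2 only needs an equality condition.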

【Discussion】:

The problem is that Hive does not support BETWEEN ... AND in the ON clause. We can use it in the WHERE clause, but a Hive JOIN that has only a WHERE and no ON is very slow.

【Answer 2】:

You need a subquery.

Try something like this (I cannot test it, because I do not have Hive):

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN t2 
ON (to_date(t2.createddate) = to_date(t1.date_id))  
WHERE t2.messageid in 
    (
    select t2.messageid from t2 where 
    date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
    AND 
    to_date(t2.createddate) <= to_date(t1.date_id) 
   )
GROUP by to_date(t1.date_id);

The key is to use the subquery to select, FOR EACH date in t1, the right records in t2.

Edit:

To force the subquery into the FROM clause, you can try this:

SELECT to_date(t1.date_id), count(distinct t2.userid) FROM t1 JOIN 

(select userid, createddate  from t2 where 

    date_sub(to_date(t2.createddate),0) > date_sub(to_date(t1.date_id), 3) 
    AND 
    to_date(t2.createddate) <= to_date(t1.date_id) 
) as t2

ON (to_date(t2.createddate) = to_date(t1.date_id))  

GROUP by to_date(t1.date_id);

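But I do not know whether it will work.

【Discussion】:

It does not work and cannot be used in Hive. It tries to pass t1.date_id into the subquery, but Hive does not allow that.

【Answer 3】:

I am assuming t1 is used to define the 3-day periods. I suspect this convoluted approach is due to Hive's shortcomings. It lets you have any number of 3-day periods. Try the following 2 queries: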

SELECT substring(t1.date_id,1,10), count(distinct t2.userid) 
FROM t1 
JOIN t2 
ON substring(t2.createddate,1,10) >= date_sub(substring(t1.date_id,1,10), 2) 
AND substring(t2.createddate,1,10) <=  substring(t1.date_id,1,10) 
GROUP BY t1.date_id

--or--

SELECT substring(t1.date_id,1,10), count(distinct t2.userid) 
FROM t1 
JOIN t2 
ON t2.createddate like concat(substring(t1.date_id, 1, 10), '%')
OR t2.createddate like concat(substring(date_sub(t1.date_id, 1), 1, 10), '%')
OR t2.createddate like concat(substring(date_sub(t1.date_id, 2), 1, 10), '%')
GROUP BY t1.date_id

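The latter minimizes function calls on the t2 table. I am also assuming that t1 is the smaller of the two tables. substring should return the same result as to_date. According to the documentation at https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions, to_date returns a string data type. Support for date data types seems minimal, but I am not familiar with Hive.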
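The claim that taking the first ten characters gives the same value as extracting the date can be illustrated in Python. This is only an analogy under the assumption that every timestamp string has the 'yyyy-MM-dd HH:mm:ss' layout shown in the question; it does not call Hive's functions.

```python
from datetime import datetime

stamp = "2011-01-03 23:57:01"  # layout of the createddate values in the question

# Counterpart of substring(createddate, 1, 10): the first ten characters.
prefix = stamp[:10]

# Counterpart of to_date(createddate): parse, then keep only the date part.
parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date().isoformat()

print(prefix, parsed, prefix == parsed)
```

The equivalence depends on the fixed-width layout; a string in another format would parse differently or not at all, while the slice would silently return whatever its first ten characters are.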

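【Discussion】:

If you only want the period for a single date, you can optimize this query by removing the join. Also, if you have a lot of dates populated in t1, the table is not really needed: just remove the join and specify the period in the WHERE clause. If this does not improve performance, please explain how t1 is used.

The first solution has >= and <= in the ON clause, which is not allowed in Hive. As for the second solution, Hive does not support OR in the ON clause.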
