HiveSQL & SparkSQL — Fixing the COUNT(DISTINCT ) OVER (PARTITION BY ) Error
Posted by 扫地增
Background:
While building student knowledge-point profiles for my company, I ran into this scenario during Spark DataFrame development: count(distinct user_id) over(partition by knowledge_id order by exam_time desc) raised an error, as shown below:
select
count(distinct user_id) over(partition by knowledge_id order by exam_time desc)
from exam_knowledge_detail;
Error in query: Distinct window functions are not supported: count(distinct user_id#0)
windowspecdefinition(knowledge_id#3,
exam_time#1 DESC NULLS LAST,
specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$()));;
- Investigation
Some digging showed that the window functions in both Spark and Hive reject COUNT(DISTINCT ) OVER (PARTITION BY ).
Working out a solution:
Since deduplication and aggregation are not supported together in one window function, can we split them: deduplicate first, then aggregate?
We know count() over() works in both Hive and Spark, so could we simply deduplicate the detail rows up front? In this case, no: the requirement is to add a distinct-user count while leaving the detail rows unchanged, which rules out the usual deduplication tricks, window ranking and DISTINCT, entirely. How, then, can we deduplicate user_id inside the window? If we had something like a Redis set to hold the user_ids, the set itself would deduplicate them. Does Hive offer such a collection? It does: collect_set deduplicates by nature, and wrapping it in size() gives us the count() we need.
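The set idea can be sketched in plain Python (a hypothetical simulation, not the actual Hive/Spark code; the knowledge_id and user_id values are made up):

```python
# Hypothetical sketch: a set deduplicates on insert, and its size plays
# the role of count(distinct ...).
rows = [
    ("k1", "u1"), ("k1", "u2"), ("k1", "u1"),  # u1 appears twice under k1
    ("k2", "u3"),
]

# Collect user_ids into a set per knowledge_id (what collect_set does per partition)
distinct_users = {}
for knowledge_id, user_id in rows:
    distinct_users.setdefault(knowledge_id, set()).add(user_id)

# size(collect_set(...)) is just the length of each set
counts = {k: len(v) for k, v in distinct_users.items()}
print(counts)  # {'k1': 2, 'k2': 1}
```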
Sample table
CREATE EXTERNAL TABLE test.student_score (
`student_id` string,
`date_key` string,
`school_id` string,
`grade` string,
`class` string,
`score` string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
stored as textfile
location '/tmp/test/student_score/';
Sample data
10001,2021-05-20,1001,初一,1,11
10002,2021-05-21,1001,初二,2,55
10003,2021-05-23,1001,初三,1,77
10004,2021-05-24,1001,初一,3,33
10005,2021-05-25,1001,初一,1,22
10006,2021-05-26,1001,初三,2,99
10007,2021-05-27,1001,初二,2,99
10001,2021-05-20,1002,初一,1,22
10002,2021-05-21,1002,初二,2,66
10003,2021-05-23,1002,初三,1,88
10004,2021-05-24,1002,初一,3,44
10005,2021-05-25,1002,初一,1,33
10006,2021-05-26,1002,初三,2,33
10007,2021-05-27,1002,初二,2,11
size(collect_set() over(partition by order by))
- Implementation:
Replace count(distinct ) over(partition by order by) with size(collect_set() over(partition by order by)), which returns the number of distinct values within each group.
- Applies when:
you need per-group distinct counts while keeping the detail rows of the original table intact.
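As a sanity check on the semantics, here is a hypothetical plain-Python simulation of what the rewrite computes: every detail row is kept as-is and gains the distinct count of its group, which is what size(collect_set(...) over(partition by ...)) produces per row. The grades and student_ids mirror the sample data above.

```python
# Hypothetical simulation of size(collect_set(student_id) over(partition by grade))
rows = [
    {"student_id": "10001", "grade": "初一"},
    {"student_id": "10004", "grade": "初一"},
    {"student_id": "10001", "grade": "初一"},  # duplicate student in the group
    {"student_id": "10003", "grade": "初三"},
]

# Pass 1: build the distinct set per partition key (collect_set)
groups = {}
for r in rows:
    groups.setdefault(r["grade"], set()).add(r["student_id"])

# Pass 2: annotate each detail row with its group's size; no row is dropped
annotated = [dict(r, group_size=len(groups[r["grade"]])) for r in rows]

print([r["group_size"] for r in annotated])  # [2, 2, 2, 1]
```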
Test
select *,
collect_set(student_id) over(partition by school_id,grade) AS group_detail,
size(collect_set(student_id) over(partition by school_id,grade)) AS group_size,
collect_set(student_id) over(partition by grade rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS group_detail_1,
size(collect_set(student_id) over(partition by grade rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) group_size_1
from test.student_score
order by school_id,grade,date_key asc;
Result:
student_id | date_key | school_id | grade | class | score | group_detail | group_size | group_detail_1 | group_size_1 |
---|---|---|---|---|---|---|---|---|---|
10001 | 2021-05-20 | 1001 | 初一 | 1 | 11 | ["10004","10005","10001"] | 3 | ["10004","10005","10001"] | 3 |
10004 | 2021-05-24 | 1001 | 初一 | 3 | 33 | ["10004","10005","10001"] | 3 | ["10004","10005","10001"] | 3 |
10005 | 2021-05-25 | 1001 | 初一 | 1 | 22 | ["10004","10005","10001"] | 3 | ["10004","10005","10001"] | 3 |
10003 | 2021-05-23 | 1001 | 初三 | 1 | 77 | ["10006","10003"] | 2 | ["10003"] | 1 |
10006 | 2021-05-26 | 1001 | 初三 | 2 | 99 | ["10006","10003"] | 2 | ["10006","10003"] | 2 |
10002 | 2021-05-21 | 1001 | 初二 | 2 | 55 | ["10002","10007"] | 2 | ["10002","10007"] | 2 |
10007 | 2021-05-27 | 1001 | 初二 | 2 | 99 | ["10002","10007"] | 2 | ["10002","10007"] | 2 |
10001 | 2021-05-20 | 1002 | 初一 | 1 | 22 | ["10004","10005","10001"] | 3 | ["10001"] | 1 |
10004 | 2021-05-24 | 1002 | 初一 | 3 | 44 | ["10004","10005","10001"] | 3 | ["10004","10001"] | 2 |
10005 | 2021-05-25 | 1002 | 初一 | 1 | 33 | ["10004","10005","10001"] | 3 | ["10004","10005","10001"] | 3 |
10003 | 2021-05-23 | 1002 | 初三 | 1 | 88 | ["10006","10003"] | 2 | ["10006","10003"] | 2 |
10006 | 2021-05-26 | 1002 | 初三 | 2 | 33 | ["10006","10003"] | 2 | ["10006","10003"] | 2 |
10002 | 2021-05-21 | 1002 | 初二 | 2 | 66 | ["10002","10007"] | 2 | ["10002"] | 1 |
10007 | 2021-05-27 | 1002 | 初二 | 2 | 11 | ["10002","10007"] | 2 | ["10002","10007"] | 2 |
Analysis:
For school_id=1002 and grade=初一, the first row's group_detail is ["10004","10005","10001"] while its group_detail_1 is only ["10001"]; that is exactly the difference made by adding
rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
to the window.
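The frame difference can likewise be simulated in plain Python (hypothetical sketch; the row order inside the partition is assumed):

```python
# Hypothetical sketch of the two window frames:
# - no explicit frame, no order by: every row sees the whole partition
# - rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: row i sees rows 0..i
partition = ["10001", "10004", "10005", "10004"]  # one partition, in window order

# Whole-partition frame: one shared distinct set
full = sorted(set(partition))

# Cumulative frame: a running distinct count that grows row by row
seen = set()
running = []
for student_id in partition:
    seen.add(student_id)
    running.append(len(seen))

print(full)     # ['10001', '10004', '10005']
print(running)  # [1, 2, 3, 3]
```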
A Cartesian-product-based approach is also described elsewhere online; I won't cover it here, but interested readers can look it up.