HiveSql & SparkSql: Working Around the COUNT(DISTINCT) OVER (PARTITION BY) Error

Posted by 扫地增

Background:

While building student knowledge-point profiles for my company, I ran into the following scenario: developing against the Spark DataFrame API, the expression count(distinct user_id) over(partition by knowledge_id order by exam_time desc) failed with an error:

select
    count(distinct user_id) over(partition by knowledge_id order by exam_time desc)
from exam_knowledge_detail;

Error in query: Distinct window functions are not supported: count(distinct user_id#0)
windowspecdefinition(knowledge_id#3, 
exam_time#1 DESC NULLS LAST, 
specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$()));;
  • Investigation
    Some digging showed that the window functions in both Spark and Hive reject COUNT(DISTINCT ) OVER (PARTITION BY ).

Working out a solution:

Since distinct and window aggregation cannot be combined, can they be split apart: deduplicate first, then aggregate?

We know count() over() is supported in both Hive and Spark, so could we deduplicate the detail rows before counting? In this case, no: the actual requirement is to add a per-user count while leaving the detail data unchanged. That rules out the usual deduplication techniques (DISTINCT, or a window rank plus a filter), since each of them collapses the detail rows.

So how do we deduplicate user_id while still keeping every detail row? The idea: if we could build a set-like collection (think of a Redis set) to store the user_ids, the set itself would deduplicate them. Does Hive offer such a collection? It does: collect_set() deduplicates by construction, and wrapping it in size() recovers the count() behavior. The next sketch applies this to the failing query.
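A minimal sketch of the rewrite, reusing the columns of the failing query above. Note that when a window has an ORDER BY, Hive and Spark default its frame to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (visible in the error message above), so this version yields a running distinct count; drop the order by to get one total per knowledge_id instead.

select
    -- collect_set deduplicates user_id within the window; size() counts the set
    size(collect_set(user_id) over(partition by knowledge_id order by exam_time desc)) AS user_cnt
from exam_knowledge_detail;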

Example table

CREATE EXTERNAL TABLE test.student_score (
    `student_id` string,
    `date_key`   string,
    `school_id`  string,
    `grade`      string,
    `class`      string,
    `score`      string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/test/student_score/';

Sample data

10001,2021-05-20,1001,初一,1,11
10002,2021-05-21,1001,初二,2,55
10003,2021-05-23,1001,初三,1,77
10004,2021-05-24,1001,初一,3,33
10005,2021-05-25,1001,初一,1,22
10006,2021-05-26,1001,初三,2,99
10007,2021-05-27,1001,初二,2,99
10001,2021-05-20,1002,初一,1,22
10002,2021-05-21,1002,初二,2,66
10003,2021-05-23,1002,初三,1,88
10004,2021-05-24,1002,初一,3,44
10005,2021-05-25,1002,初一,1,33
10006,2021-05-26,1002,初三,2,33
10007,2021-05-27,1002,初二,2,11
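One way to load these rows, as a sketch: assume they are saved on the Hive client's local filesystem as /tmp/student_score.csv (an illustrative path, not from the original setup).

-- /tmp/student_score.csv is an assumed local path holding the fourteen rows above
LOAD DATA LOCAL INPATH '/tmp/student_score.csv'
OVERWRITE INTO TABLE test.student_score;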

size(collect_set() over(partition by order by))

  • Implementation:
    Replace count(distinct ) over(partition by order by) with size(collect_set() over(partition by order by)), which returns the number of distinct values within each group; see the minimal example after this list.
  • When to use:
    Whenever you need per-group distinct counts while keeping the original detail rows intact.
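A minimal before/after instance of the substitution, run against the table above:

-- Rejected by Hive and Spark:
--   count(distinct student_id) over(partition by school_id, grade)
-- Supported equivalent:
select *,
    size(collect_set(student_id) over(partition by school_id, grade)) AS distinct_students
from test.student_score;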

Test

select *,
    collect_set(student_id) over(partition by school_id,grade) AS group_detail,
    size(collect_set(student_id) over(partition by school_id,grade)) AS group_size,
    collect_set(student_id) over(partition by grade rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS group_detail_1,
    size(collect_set(student_id) over(partition by grade rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) group_size_1
from test.student_score
order by school_id,grade,date_key asc;

Result:

| student_id | date_key   | school_id | grade | class | score | group_detail              | group_size | group_detail_1            | group_size_1 |
|------------|------------|-----------|-------|-------|-------|---------------------------|------------|---------------------------|--------------|
| 10001      | 2021-05-20 | 1001      | 初一  | 1     | 11    | ["10004","10005","10001"] | 3          | ["10004","10005","10001"] | 3            |
| 10004      | 2021-05-24 | 1001      | 初一  | 3     | 33    | ["10004","10005","10001"] | 3          | ["10004","10005","10001"] | 3            |
| 10005      | 2021-05-25 | 1001      | 初一  | 1     | 22    | ["10004","10005","10001"] | 3          | ["10004","10005","10001"] | 3            |
| 10003      | 2021-05-23 | 1001      | 初三  | 1     | 77    | ["10006","10003"]         | 2          | ["10003"]                 | 1            |
| 10006      | 2021-05-26 | 1001      | 初三  | 2     | 99    | ["10006","10003"]         | 2          | ["10006","10003"]         | 2            |
| 10002      | 2021-05-21 | 1001      | 初二  | 2     | 55    | ["10002","10007"]         | 2          | ["10002","10007"]         | 2            |
| 10007      | 2021-05-27 | 1001      | 初二  | 2     | 99    | ["10002","10007"]         | 2          | ["10002","10007"]         | 2            |
| 10001      | 2021-05-20 | 1002      | 初一  | 1     | 22    | ["10004","10005","10001"] | 3          | ["10001"]                 | 1            |
| 10004      | 2021-05-24 | 1002      | 初一  | 3     | 44    | ["10004","10005","10001"] | 3          | ["10004","10001"]         | 2            |
| 10005      | 2021-05-25 | 1002      | 初一  | 1     | 33    | ["10004","10005","10001"] | 3          | ["10004","10005","10001"] | 3            |
| 10003      | 2021-05-23 | 1002      | 初三  | 1     | 88    | ["10006","10003"]         | 2          | ["10006","10003"]         | 2            |
| 10006      | 2021-05-26 | 1002      | 初三  | 2     | 33    | ["10006","10003"]         | 2          | ["10006","10003"]         | 2            |
| 10002      | 2021-05-21 | 1002      | 初二  | 2     | 66    | ["10002","10007"]         | 2          | ["10002"]                 | 1            |
| 10007      | 2021-05-27 | 1002      | 初二  | 2     | 11    | ["10002","10007"]         | 2          | ["10002","10007"]         | 2            |

Result analysis:

Notice that when school_id=1002 and grade=初一, group_detail is ["10004","10005","10001"] while group_detail_1 starts out as ["10001"]. That is the difference made by the frame clause rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: without it the set is built over the entire partition, whereas with it the set grows row by row as a running accumulation.
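It is worth stressing that the restriction is specific to window functions: count(distinct ) is perfectly legal in an ordinary GROUP BY, which gives a quick cross-check of the group_size column above (at the cost of collapsing the detail rows):

select school_id, grade,
    count(distinct student_id) AS distinct_students
from test.student_score
group by school_id, grade;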

There are also write-ups online that achieve this with a Cartesian product (self-join); that approach is not covered here, but interested readers can look it up.
