Grouping__ID behaves differently across Hive versions
Posted javartisan
Hive 2.3 and later
Grouping__ID function
When a column is aggregated (rolled up) in a result row, its value is displayed as NULL. This is ambiguous when the column itself contains NULL values: there needs to be a way to tell a NULL that marks an aggregated column from a NULL that is an actual value. The GROUPING__ID function solves this.
This function returns a bitvector corresponding to whether each column is present or not. For each column, a value of "1" is produced for a row in the result set if that column has been aggregated in that row, otherwise the value is "0". This can be used to differentiate when there are nulls in the data.
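The bitvector rule can be sketched in a few lines of Python. This is an illustration only, not Hive code: `grouping_id` and `is_aggregated` are hypothetical helper names, and the sketch assumes the first GROUP BY column maps to the highest bit, which matches the example results in this section.

```python
def grouping_id(group_by_cols, kept_cols):
    """Hive 2.3+ style GROUPING__ID (assumed layout: first column = highest bit).

    A bit is 1 when the column has been aggregated away in the row,
    and 0 when the row is still grouped by that column.
    """
    gid = 0
    for col in group_by_cols:
        gid = (gid << 1) | (0 if col in kept_cols else 1)
    return gid


def is_aggregated(gid, group_by_cols, col):
    """Return True when the NULL shown for `col` means 'aggregated', not a data NULL."""
    shift = len(group_by_cols) - 1 - group_by_cols.index(col)
    return (gid >> shift) & 1 == 1


cols = ["key", "value"]
print(grouping_id(cols, {"key", "value"}))  # grouped by both columns
print(grouping_id(cols, {"key"}))           # value rolled up
print(grouping_id(cols, set()))             # grand total
```

With `is_aggregated(gid, cols, "value")`, a NULL in the value column can be classified as a rollup marker (True) or a genuine NULL in the data (False).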
Consider the following example:
Column 1 (key) | Column 2 (value) |
---|---|
1 | NULL |
1 | 1 |
2 | 2 |
3 | 3 |
3 | NULL |
4 | 5 |
The following query:
SELECT key, value, GROUPING__ID, count(*) FROM T1 GROUP BY key, value WITH ROLLUP;
will have the following results:
Column 1 (key) | Column 2 (value) | GROUPING__ID | count(*) |
---|---|---|---|
NULL | NULL | 3 | 6 |
1 | NULL | 0 | 1 |
1 | NULL | 1 | 2 |
1 | 1 | 0 | 1 |
2 | NULL | 1 | 1 |
2 | 2 | 0 | 1 |
3 | NULL | 0 | 1 |
3 | NULL | 1 | 2 |
3 | 3 | 0 | 1 |
4 | NULL | 1 | 1 |
4 | 5 | 0 | 1 |
Note that the third column is a bitvector in which a set bit marks a column that has been aggregated (rolled up) in that row.
For the first row, neither column is selected, so both bits are set, which explains the value 3.
For the second row, both columns are selected (and the second column happens to be NULL in the data), which explains the value 0.
For the third row, only the first column is selected and the second is aggregated, which explains the value 1.
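To check the counts and bit values by hand, the rollup can be emulated on the six sample rows. This is a rough sketch in plain Python, not how Hive executes the query; `rollup_with_grouping_id` is a hypothetical name, `None` stands in for SQL NULL, and the Hive 2.3+ bit layout is assumed.

```python
from collections import Counter

# Sample rows (key, value) from the example table; None stands for SQL NULL.
ROWS = [(1, None), (1, 1), (2, 2), (3, 3), (3, None), (4, 5)]


def rollup_with_grouping_id(rows):
    """Emulate GROUP BY key, value WITH ROLLUP with Hive 2.3+ GROUPING__ID bits."""
    n = len(rows[0])
    counts = Counter()
    # ROLLUP keeps the first `kept` columns and aggregates the rest, for kept = n..0.
    for kept in range(n, -1, -1):
        # The aggregated columns are the trailing ones, so the low (n - kept) bits are 1.
        gid = (1 << (n - kept)) - 1
        for row in rows:
            out = row[:kept] + (None,) * (n - kept)
            counts[out + (gid,)] += 1
    return counts


result = rollup_with_grouping_id(ROWS)
for (key, value, gid), cnt in result.items():
    print(key, value, gid, cnt)
```

The two output rows that both show `1 None` differ only in the GROUPING__ID column: the one with 0 is the genuine (1, NULL) pair from the data, the one with 1 is the subtotal for key 1.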
Before Hive 2.3
Grouping__ID function (before Hive 2.3.0)
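Note: before 2.3, each bit of the binary GROUPING__ID value is 1 when the corresponding column is used for grouping in that row, and 0 when it is not.
The Grouping__ID function was fixed in Hive 2.3.0, so its behavior before that release is different (this is expected). For each column, the function would return "0" if and only if that column has been aggregated in that row, and "1" otherwise. The results also imply a reversed bit order: before 2.3.0 the first GROUP BY column occupies the lowest bit, whereas from 2.3.0 on it is the last column that does.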
Hence the following query:
SELECT key, value, GROUPING__ID, count(*) FROM T1 GROUP BY key, value WITH ROLLUP;
will have the following results.
Column 1 (key) | Column 2 (value) | GROUPING__ID | count(*) |
---|---|---|---|
NULL | NULL | 0 | 6 |
1 | NULL | 1 | 2 |
1 | NULL | 3 | 1 |
1 | 1 | 3 | 1 |
2 | NULL | 1 | 1 |
2 | 2 | 3 | 1 |
3 | NULL | 1 | 2 |
3 | NULL | 3 | 1 |
3 | 3 | 3 | 1 |
4 | NULL | 1 | 1 |
4 | 5 | 3 | 1 |
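The pre-2.3 and post-2.3 values carry the same information in different encodings, so one can be translated into the other. Below is a minimal sketch of such a translation, assuming the bit layouts implied by the two result tables (before 2.3: first column is the lowest bit and 1 means "grouped by"; from 2.3: first column is the highest bit and 1 means "aggregated"). `legacy_to_new` is a hypothetical helper, not a Hive function.

```python
def legacy_to_new(legacy_id, n_cols):
    """Convert a pre-2.3 GROUPING__ID to the Hive 2.3+ value (assumed layouts).

    Walks the GROUP BY columns in order: column i is bit i of the legacy id,
    and is pushed into the new id from the high end with its meaning inverted.
    """
    new_id = 0
    for i in range(n_cols):
        kept = (legacy_id >> i) & 1           # 1 = row is grouped by column i
        new_id = (new_id << 1) | (1 - kept)   # 1 = column i is aggregated
    return new_id


for legacy in (0, 1, 3):
    print(legacy, "->", legacy_to_new(legacy, 2))
```

A helper like this could be useful when a query written against the old numbering has to keep working after an upgrade, though checking the results on real data remains advisable.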
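Analysis of the pre-2.3 results:
For the first row, GROUPING__ID is 0, which means no column is used for grouping: the whole table is aggregated as a single group.
For the second row, GROUPING__ID is 1, which means the first column (key) is used for grouping, so the count is 2.
For the third row, GROUPING__ID is 3 (binary 11), which means both key and value are used for grouping, so the count is only 1.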
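The same reading can be done mechanically by testing each bit. A small sketch, assuming the pre-2.3 layout in which the first GROUP BY column is the lowest bit; `legacy_grouped_columns` is a hypothetical name used only for this illustration.

```python
def legacy_grouped_columns(legacy_id, group_by_cols):
    """List the columns a row is grouped by, given a pre-2.3 GROUPING__ID."""
    return [col for i, col in enumerate(group_by_cols) if (legacy_id >> i) & 1]


for gid in (0, 1, 3):
    # Print the id, its two-bit binary form, and the decoded grouping columns.
    print(gid, format(gid, "02b"), legacy_grouped_columns(gid, ["key", "value"]))
```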
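Additional note:
In my opinion the Hive documentation is not as good as the Spark documentation. It is worth reading the Spark docs instead, which I find simpler and clearer: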
https://spark.apache.org/docs/latest/api/sql/index.html#grouping_id