Data arrangement with window in Hive or Spark Scala

Posted 2023-03-23

Published: 2020-05-19 15:33:06

Question:

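I need to rearrange some data: each null VALUE should be replaced by the next non-null VALUE that follows it in ID order.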

Input (I/P):

ID|VALUE
1|a
2|null
3|null
4|b
5|null
6|null
7|c

I need to produce the following output using Hive or a Spark DataFrame.

Output (O/P):

ID|VALUE
1|a
2|b
3|b
4|b
5|c
6|c
7|c

Answer 1:

In Spark, use first(expr[, isIgnoreNull=true]) over a window ordered by monotonically_increasing_id(), with rowsBetween running from currentRow to unboundedFollowing.

Example:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

df.show()
//+---+-----+
//| ID|VALUE|
//+---+-----+
//|  1|    a|
//|  2| null|
//|  3| null|
//|  4|    b|
//|  5| null|
//|  6| null|
//|  7|    c|
//+---+-----+

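// If ID is sequentially increasing, the window can be ordered by ID.
val wById = Window.orderBy("ID").rowsBetween(Window.currentRow, Window.unboundedFollowing)

// Otherwise, order by monotonically_increasing_id().
val wByRowOrder = Window.orderBy(monotonically_increasing_id()).rowsBetween(Window.currentRow, Window.unboundedFollowing)

// first(..., ignoreNulls = true) picks the first non-null VALUE from the current row onward.
df.withColumn("VALUE", first("VALUE", true).over(wByRowOrder)).show()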

//+---+-----+
//| ID|VALUE|
//+---+-----+
//|  1|    a|
//|  2|    b|
//|  3|    b|
//|  4|    b|
//|  5|    c|
//|  6|    c|
//|  7|    c|
//+---+-----+
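
To make the window semantics concrete, here is a minimal plain-Scala sketch (no Spark required) of the same backward fill: each row takes the first non-null value found from itself to the end of the ordered data. The names `Row`, `backfill`, and `BackfillSketch` are hypothetical and exist only for this illustration; they are not part of the Spark API, and the sketch assumes the data fits in memory.

```scala
object BackfillSketch {
  // Hypothetical in-memory row type; None stands in for a SQL null.
  final case class Row(id: Int, value: Option[String])

  // For each row, take the first non-null value from that row onward,
  // mirroring first(VALUE, ignoreNulls = true) over currentRow..unboundedFollowing.
  def backfill(rows: Seq[Row]): Seq[Row] = {
    val sorted = rows.sortBy(_.id)
    // scanRight walks from the last row to the first, carrying the nearest
    // non-null value seen so far. The seed None covers trailing nulls,
    // and .init drops that seed from the result.
    val filled = sorted
      .map(_.value)
      .scanRight(Option.empty[String])((current, next) => current.orElse(next))
      .init
    sorted.zip(filled).map { case (row, v) => row.copy(value = v) }
  }

  def main(args: Array[String]): Unit = {
    val input = Seq(
      Row(1, Some("a")), Row(2, None), Row(3, None),
      Row(4, Some("b")), Row(5, None), Row(6, None),
      Row(7, Some("c"))
    )
    backfill(input).foreach(r => println(s"${r.id}|${r.value.getOrElse("null")}"))
  }
}
```

As in the window version, a null that has no non-null value after it stays null.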

Comments:

Why order by monotonically_increasing_id() instead of ID?

@RaphaelRoth, if the data is sequential we can use ID; if ID is not sequential, monotonically_increasing_id also works.

Answer 2:

Hive solution:

with mytable as (
select stack(7,
 1,'a'  ,
 2,null ,
 3,null ,
 4,'b'  ,
 5,null ,
 6,null ,
 7,'c'
) as (id, value)
)

SELECT id, 
       first_value(value,true) over(order by id rows between current row and unbounded following) value
  FROM mytable;

Result:

id  value
1   a
2   b
3   b
4   b
5   c
6   c
7   c
