Partition functions in Spark Scala
Posted: 2018-07-05 00:54:08

Question:

DF:
ID  col1 ... coln  Date
1 1991-01-11 11:03:46.0
1 1991-01-11 11:03:46.0
1 1991-02-22 12:05:58.0
1 1991-02-22 12:05:58.0
1 1991-02-22 12:05:58.0
I am creating a new column "identify" that numbers the (ID, DATE) combinations, so that the topmost combination can then be selected by ordering on "identify".
Expected DF:
ID  col1 ... coln  Date                   identify
1   1991-01-11 11:03:46.0   1
1   1991-01-11 11:03:46.0   1
1   1991-02-22 12:05:58.0   2
1   1991-02-22 12:05:58.0   2
1   1991-02-22 12:05:58.0   2
Code attempt 1:
var window = Window.partitionBy("ID").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", row_number().over(window))
My output:
ID  col1 ... coln  Date                   identify
1   1991-01-11 11:03:46.0   1
1   1991-01-11 11:03:46.0   2
1   1991-02-22 12:05:58.0   3
1   1991-02-22 12:05:58.0   4
1   1991-02-22 12:05:58.0   5
Code attempt 2:
var window = Window.partitionBy("ID","DATE").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", row_number().over(window))
My output:
ID  col1 ... coln  Date                   identify
1   1991-01-11 11:03:46.0   1
1   1991-01-11 11:03:46.0   2
1   1991-02-22 12:05:58.0   1
1   1991-02-22 12:05:58.0   2
1   1991-02-22 12:05:58.0   3
Any suggestions on how to adjust the code to get the desired output would be helpful.
Answer 1:

var window = Window.partitionBy("ID").orderBy("DATE")
df = df.orderBy($"DATE").withColumn("identify", dense_rank().over(window))
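This works because row_number() (attempt 1) assigns every row in the window partition its own sequential number even when the Date values tie, and partitioning by (ID, DATE) (attempt 2) only restarts that numbering inside each date; dense_rank() instead gives tied Date values the same rank with no gaps, which is exactly the desired "identify" numbering. Below is a minimal self-contained sketch of this approach — the SparkSession setup, the sample rows, and the final filter are illustrative assumptions, not part of the original post:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank

val spark = SparkSession.builder().master("local[*]").appName("identify-demo").getOrCreate()
import spark.implicits._

// Sample data shaped like the question's DF (col1 values are made up)
val df = Seq(
  (1, "a", "1991-01-11 11:03:46.0"),
  (1, "b", "1991-01-11 11:03:46.0"),
  (1, "c", "1991-02-22 12:05:58.0"),
  (1, "d", "1991-02-22 12:05:58.0"),
  (1, "e", "1991-02-22 12:05:58.0")
).toDF("ID", "col1", "Date")

// Tied Dates share a rank and ranks increase without gaps: 1, 1, 2, 2, 2
val window = Window.partitionBy("ID").orderBy("Date")
val withIdentify = df.withColumn("identify", dense_rank().over(window))
withIdentify.orderBy("Date").show(false)

// To keep only the topmost (ID, Date) combination, filter on the new column
withIdentify.filter($"identify" === 1).show(false)

Here Date is kept as a string for brevity; ordering on the string happens to match chronological order for this fixed timestamp format, but casting it to a proper timestamp column would be the more robust choice.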