How to add a new column not based on an existing column in a DataFrame with Scala/Spark? [duplicate]

Posted 2023-04-15

【Published】: 2017-07-21 03:05:21

【Question】:

I have a DataFrame and I want to add a new column, but one that is not based on an existing column. How can I do that?

Here is my DataFrame:

+----+
|time|
+----+
|   1|
|   4|
|   3|
|   2|
|   5|
|   7|
|   3|
|   5|
+----+

Here is my expected result:

+----+-----+  
|time|index|  
+----+-----+  
|   1|    1|  
|   4|    2|  
|   3|    3|  
|   2|    4|  
|   5|    5|  
|   7|    6|  
|   3|    7|  
|   5|    8|  
+----+-----+

【Answer 1】:

Using the RDD method zipWithIndex may be what you want.

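import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// zipWithIndex pairs each Row with a 0-based Long index; append it as a new field
val newRdd = yourDF.rdd.zipWithIndex.map { case (r: Row, id: Long) => Row.fromSeq(r.toSeq :+ id) }
val schema = StructType(Array(StructField("time", IntegerType, nullable = true), StructField("index", LongType, nullable = true)))
val newDF = spark.createDataFrame(newRdd, schema)
newDF.show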
+----+-----+                                                                    
|time|index|
+----+-----+
|   1|    0|
|   4|    1|
|   3|    2|
|   2|    3|
|   5|    4|
|   7|    5|
|   3|    6|
|   5|    7|
+----+-----+

I am assuming here that your time column is IntegerType.
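The output above is 0-based, while the question asks for an index starting at 1. A minimal sketch of one way to close that gap, assuming a local SparkSession and the question's sample data; the object name ZipWithIndexSketch and the helper withRowIndex are hypothetical names introduced only for illustration:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object ZipWithIndexSketch {
  // Hypothetical helper: appends a 1-based "index" column that follows
  // the DataFrame's current row order.
  def withRowIndex(df: DataFrame): DataFrame = {
    // zipWithIndex is 0-based, so add 1 to match the expected result.
    val indexedRdd = df.rdd.zipWithIndex.map {
      case (row, id) => Row.fromSeq(row.toSeq :+ (id + 1))
    }
    // Reuse the existing schema rather than spelling out each field by hand.
    val schema = StructType(
      df.schema.fields :+ StructField("index", LongType, nullable = false))
    df.sparkSession.createDataFrame(indexedRdd, schema)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("ZipWithIndexSketch")
      .getOrCreate()
    import spark.implicits._

    val yourDF = Seq(1, 4, 3, 2, 5, 7, 3, 5).toDF("time")
    withRowIndex(yourDF).show()
    spark.stop()
  }
}
```

Because the schema is derived from df.schema, this variant does not depend on the time column being IntegerType. For the question's sample data it should yield index values 1 through 8 in the original row order.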

【Comments】:

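With your approach I have to convert the DataFrame to an RDD and then convert the RDD back to a DataFrame, which seems inefficient.

I'm not sure this is the best solution, but it should not cause a serious performance problem.

@mentongwu Converting to an RDD will not be a serious performance problem. However, DataFrames use the Tungsten format, whereas RDDs do not.

【Answer 2】: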

Using a Window function, or converting to an RDD and using zipWithIndex, is slower. You can use the built-in function monotonically_increasing_id instead:

import org.apache.spark.sql.functions._
df.withColumn("index", monotonically_increasing_id())

Hope this helps!
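One caveat worth illustrating: monotonically_increasing_id guarantees unique, increasing values but not consecutive ones, so on a multi-partition DataFrame the result can contain large gaps instead of 1, 2, 3, and so on. The sketch below shows one possible way to turn those IDs into a consecutive 1-based index with row_number over a window. It is an assumption-laden illustration, not part of the original answer: the object name RowNumberSketch, the helper withConsecutiveIndex, and the temporary column name _tmp_id are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, monotonically_increasing_id, row_number}

object RowNumberSketch {
  // Hypothetical helper: derives a consecutive 1-based "index" column
  // from the generated (possibly gapped) IDs.
  def withConsecutiveIndex(df: DataFrame): DataFrame = {
    // "_tmp_id" is a made-up temporary column name; it is dropped at the end.
    val withId = df.withColumn("_tmp_id", monotonically_increasing_id())
    // No partitionBy here, so Spark moves all rows into a single partition
    // to number them; this can be costly on large data.
    val byId = Window.orderBy(col("_tmp_id"))
    withId.withColumn("index", row_number().over(byId)).drop("_tmp_id")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("RowNumberSketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1, 4, 3, 2, 5, 7, 3, 5).toDF("time")
    withConsecutiveIndex(df).show()
    spark.stop()
  }
}
```

If gaps in the index are acceptable, the plain withColumn call above is the cheaper choice, since it avoids collapsing the data into one partition. Note that row_number produces an IntegerType column, whereas monotonically_increasing_id produces LongType.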
