Spark dataframe column content modification
【Posted】2019-03-13 05:53:23
【Question】I have a dataframe df; df.show() produces:
+--------+---------+---------+---------+---------+
| Col11 | Col22 | Expend1 | Expend2 | Expend3 |
+--------+---------+---------+---------+---------+
| Value1 | value1 | 123 | 2264 | 56 |
| Value1 | value2 | 124 | 2255 | 23 |
+--------+---------+---------+---------+---------+
Can I use some SQL to transform the dataframe above into the one below?
+--------+---------+-------------+---------------+------------+
| Col11 | Col22 | Expend1 | Expend2 | Expend3 |
+--------+---------+-------------+---------------+------------+
| Value1 | value1 | Expend1:123 | Expend2: 2264 | Expend3:56 |
| Value1 | value2 | Expend1:124 | Expend2: 2255 | Expend3:23 |
+--------+---------+-------------+---------------+------------+
【Comments】:
【Answer 1】: You can use the foldLeft idea here:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.sparkContext.parallelize(Seq(
("Value1", "value1", "123", "2264", "56"),
("Value1", "value2", "124", "2255", "23")
)).toDF("Col11", "Col22", "Expend1", "Expend2", "Expend3")
// The columns to transform
val cols = List("Expend1", "Expend2", "Expend3")
val newDF = cols.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, concat(lit(name + ":"), col(name)))
}
newDF.show()
Output:
+------+------+-----------+------------+----------+
| Col11| Col22| Expend1| Expend2| Expend3|
+------+------+-----------+------------+----------+
|Value1|value1|Expend1:123|Expend2:2264|Expend3:56|
|Value1|value2|Expend1:124|Expend2:2255|Expend3:23|
+------+------+-----------+------------+----------+
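The foldLeft above threads the DataFrame through the column list: each step returns a new DataFrame with one more column rewritten, and that result becomes the accumulator for the next step. The same pattern can be sketched with plain Scala collections (rows as Maps instead of a Spark DataFrame, so no cluster is needed); the names here mirror the answer's code but the Map-based rows are purely illustrative:

```scala
object FoldLeftSketch {
  type Row = Map[String, String]

  // Two rows matching the question's data, modeled as Maps
  val rows: List[Row] = List(
    Map("Col11" -> "Value1", "Col22" -> "value1",
        "Expend1" -> "123", "Expend2" -> "2264", "Expend3" -> "56"),
    Map("Col11" -> "Value1", "Col22" -> "value2",
        "Expend1" -> "124", "Expend2" -> "2255", "Expend3" -> "23")
  )

  val cols = List("Expend1", "Expend2", "Expend3")

  // Analogue of cols.foldLeft(df) { (acc, name) => acc.withColumn(...) }:
  // each pass rewrites one column in every row, then hands the result on.
  val prefixed: List[Row] = cols.foldLeft(rows) { (acc, name) =>
    acc.map(row => row.updated(name, s"$name:${row(name)}"))
  }

  def main(args: Array[String]): Unit = prefixed.foreach(println)
}
```

Because each step only touches one column, the order of `cols` does not affect the result.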
【Comments】:
【Answer 2】: You can do this with a plain SQL select statement, or with a udf if you want.
Ex -> select Col11, Col22, concat('Expend1:', cast(Expend1 as string)) as Expend1, .... from table
【Comments】:
【Answer 3】:
val df = Seq(
  ("Value1", "value1", "123", "2264", "56"),
  ("Value1", "value2", "124", "2255", "23")
).toDF("Col11", "Col22", "Expend1", "Expend2", "Expend3")

// Keep only the columns whose names do not start with the "Col" prefix
val cols = df.columns.filter(!_.startsWith("Col"))

// udf that combines a column name with its value
val getCombineData = udf { (colName: String, colvalue: String) => colName + ":" + colvalue }

var in = df
for (e <- cols)
  in = in.withColumn(e, getCombineData(lit(e), col(e)))
in.show
// results
+------+------+-----------+------------+----------+
| Col11| Col22| Expend1| Expend2| Expend3|
+------+------+-----------+------------+----------+
|Value1|value1|Expend1:123|Expend2:2264|Expend3:56|
|Value1|value2|Expend1:124|Expend2:2255|Expend3:23|
+------+------+-----------+------------+----------+
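The first line of this answer selects the target columns with a plain Scala filter over the column-name array; assuming the header from the question, the selection behaves like this (the object name is just for illustration):

```scala
object ColumnFilterSketch {
  // The schema from the question, as df.columns would return it
  val columns = Array("Col11", "Col22", "Expend1", "Expend2", "Expend3")

  // Keep every column whose name does NOT start with "Col",
  // i.e. the three Expend columns
  val cols = columns.filter(!_.startsWith("Col"))

  def main(args: Array[String]): Unit = println(cols.mkString(", "))
}
```

Note this relies on the naming convention holding: any new column whose name happens to start with "Col" would be silently excluded, so an explicit list (as in answer 1) is safer if the schema may change.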
【Comments】: