UDF over array elements in PySpark that also adds a static element
[Posted]: 2021-08-16 09:45:09
[Question]: I have a dataframe like the one shown below.
df.select("col1").show(1,False)
col1
-------------
[[2,1,0,1,,free],[3,1,0,1,4,free]]
Another way to show it:
df.select(to_json(struct("col1"))).show(1,False)
col1
-----------------
{"col1":[{"0":"2","1":"1","2":"0","3":"1","5":"free"},{"0":"3","1":"1","2":"0","3":"1","4":"4","5":"free"}]}
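Now I want to get the dataframe below: each element should be rebuilt as a struct from the existing fields, field "0" should move into a nested struct "newattrib", and a new static field "value":"ZZZ" should be added.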
col1
--------------
{"col1":[{"1":"1","2":"0","3":"1","5":"free","value":"ZZZ","newattrib":{"0":"2"}},{"1":"1","2":"0","3":"1","4":"4","5":"free","value":"ZZZ","newattrib":{"0":"3"}}]}
Please suggest a way to achieve this.
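For reference, the per-element rule implied by the input and the desired output can be modelled in plain Python, without Spark. This is only a sketch of the intended reshaping: the function name `reshape_element` and the dict representation of each struct are assumptions made for illustration, and missing (null) fields are skipped to mirror the JSON shown in the question.

```python
import json


def reshape_element(elem, static_value="ZZZ"):
    """Hypothetical helper: rebuild one array element as described in the question."""
    # Keep fields "1".."5" in order, skipping null ones (field "4" is null in the first element).
    out = {k: elem[k] for k in ("1", "2", "3", "4", "5") if elem.get(k) is not None}
    # Add the static field.
    out["value"] = static_value
    # Move field "0" into a nested struct.
    out["newattrib"] = {"0": elem["0"]}
    return out


# The two structs from the question, written as dicts.
rows = [
    {"0": "2", "1": "1", "2": "0", "3": "1", "4": None, "5": "free"},
    {"0": "3", "1": "1", "2": "0", "3": "1", "4": "4", "5": "free"},
]

result = {"col1": [reshape_element(e) for e in rows]}
print(json.dumps(result, separators=(",", ":")))
```

A function like this could in principle be wrapped in a Python UDF, but that would also require declaring the full return schema, which is not shown here.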
[Comments]:
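Are we supposed to understand what you are doing just by looking at the input and output? Please explain the logic...
[Answer 1]: Use the transform function.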
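from pyspark.sql.functions import to_json, struct

# Rebuild each array element: keep fields 1-5, add the static 'value',
# and nest field 0 under 'newattrib'.
df = (df
    .selectExpr("""transform(col1, v -> struct(v.`1` `1`,
                                               v.`2` `2`,
                                               v.`3` `3`,
                                               v.`4` `4`,
                                               v.`5` `5`,
                                               'ZZZ' value,
                                               struct(v.`0` `0`) newattrib)) col1""")
    .select(to_json(struct("col1")).alias('col1'))
)
df.show(truncate=False)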
# +--------------------------------------------------------------------------------------------------------------------------------------------------+
# |col1 |
# +--------------------------------------------------------------------------------------------------------------------------------------------------+
# |{"col1":[{"1":1,"2":0,"3":1,"5":"free","value":"ZZZ","newattrib":{"0":2}},{"1":1,"2":0,"3":1,"4":4,"5":"free","value":"ZZZ","newattrib":{"0":3}}]}|
# +--------------------------------------------------------------------------------------------------------------------------------------------------+