pyspark to convert single col into multiple cols [duplicate]
Posted: 2020-01-28 13:54:31

I have a dataframe like this:
+-----------------------------------------------------------------+
|ID|DATASET |
+-----------------------------------------------------------------+
|4A|["col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"] |
|8B|["col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD"] |
+-----------------------------------------------------------------+
Expected output:
+----------------------------------+
|ID|col_1 | col_2 | col_3| col_4 |
+----------------------------------+
|4A|"12ABC"|"141"||"ABCD" |
|8B|"12ABC"|"141"||"ABCD" |
+----------------------------------+
I tried using regexp_extract:
df.withColumn("col_1", regexp_extract("DATASET", "(?
but I get empty values in the result:
+----------------------------------+
|ID|col_1 | col_2 | col_3| col_4 |
+----------------------------------+
|4A|||| |
|8B|||| |
+----------------------------------+
Any input on this?
Thanks in advance.
EDITED
Thanks for the reply, it works well. My input has changed slightly: I now want to group by col.1 and put each group's values on a separate row.
Updated dataset:
+---------------------------------------------------------------------------------------------------------------------------+
|ID|DATASET |
+---------------------------------------------------------------------------------------------------------------------------+
|4A|["col.1":"12ABC","col.2":"141","col.3":"","col.4":"ABCD","col.1":"13ABC","col.2":"141","col.3":"","col.4":"ABCD"] |
+---------------------------------------------------------------------------------------------------------------------------+
Expected result:
+-----------------------------------------------------------------+
|ID|col_1 | col_2 | col |
+-----------------------------------------------------------------+
|4A|"12ABC"|""col.2":"141","col.3":"","col.4":"ABCD"" |
|4A|"13ABC"|""col.2":"141","col.3":"","col.4":"ABCD"" |
+-----------------------------------------------------------------+
Thanks in advance.
Answer 1: Try parsing the string into struct fields with from_json:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# schema field names must match the JSON keys exactly (no backticks)
schema = StructType(
    [
        StructField('col.1', StringType(), True),
        StructField('col.2', StringType(), True),
        StructField('col.3', StringType(), True),
        StructField('col.4', StringType(), True),
    ]
)

# parse DATASET into a struct, then expand the struct into one column
# per field; the expansion must reference the parsed column, i.e.
# 'DATASET.*', not 'data.*'
df.withColumn("DATASET", from_json("DATASET", schema))\
  .select(col('ID'), col('DATASET.*'))\
  .show()