Convert dataframe schema in PySpark [duplicate]
Posted: 2020-06-13 13:48:40

Question: I have a dataframe
+------------------+-------------------+--------------------+
| name| sku| description|
+------------------+-------------------+--------------------+
| Mary Rodriguez| hand-couple-manage|Senior word socia...|
| Jose Henderson| together-table-oil|Apply girl treatm...|
| Karen Villegas| child-somebody|Every tell serve....|
| Olivia Lynch|forget-matter-avoid|Perhaps environme...|
| Whitney Wiley| side-blue-dream|Quickly short soc...|
| Brittany Johnson| east-pretty|Indicate view sim...|
| Paul Morris| radio-window-us|Society month sho...|
| Jason Patterson| night-art-be-act|Entire around pla...|
| Kiara Gentry| compare-politics|Air my kind staff...|
+------------------+-------------------+--------------------+
Desired schema:
root
|-- sku: string (nullable = true)
|-- name_description: array (nullable = true)
| |-- element: string (containsNull = true)
How can I group by the sku column and merge the values of name and description into a name_description column, whose value is an array of JSON objects in the format [{"name": ..., "description": ...}, {"name": ..., "description": ...}, ...], in PySpark?
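For clarity, the target structure can be sketched in plain Python (no Spark involved). The rows below reuse the truncated example values from the dataframe above; the grouping and JSON shape are the point, not the data:

```python
import json
from collections import defaultdict

# A few rows from the example dataframe (descriptions truncated as shown above).
rows = [
    {"name": "Mary Rodriguez", "sku": "hand-couple-manage", "description": "Senior word socia..."},
    {"name": "Jose Henderson", "sku": "together-table-oil", "description": "Apply girl treatm..."},
    {"name": "Karen Villegas", "sku": "child-somebody", "description": "Every tell serve...."},
]

# Group by sku, collecting one {"name": ..., "description": ...} object per row.
grouped = defaultdict(list)
for r in rows:
    grouped[r["sku"]].append({"name": r["name"], "description": r["description"]})

# One JSON document per sku, with name_description as an array of objects.
for sku, name_description in grouped.items():
    print(json.dumps({"sku": sku, "name_description": name_description}))
```

Each sku here happens to be unique, so every array has one element; with repeated skus the array would accumulate one object per matching row.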
Comments:
Do these answers help? pyspark create dictionary from data in two columns, Spark scala dataframe: Merging multiple columns into single column, etc. … @mazaneicha No, I'm looking for a specific format.

Answer 1: Check the code below.
df.show(truncate=False)
+----------------+-------------------+--------------------+
|name            |sku                |description         |
+----------------+-------------------+--------------------+
|Mary Rodriguez  |hand-couple-manage |Senior word socia...|
|Jose Henderson  |together-table-oil |Apply girl treatm...|
|Karen Villegas  |child-somebody     |Every tell serve....|
|Olivia Lynch    |forget-matter-avoid|Perhaps environme...|
|Whitney Wiley   |side-blue-dream    |Quickly short soc...|
|Brittany Johnson|east-pretty        |Indicate view sim...|
|Paul Morris     |radio-window-us    |Society month sho...|
|Jason Patterson |night-art-be-act   |Entire around pla...|
|Kiara Gentry    |compare-politics   |Air my kind staff...|
+----------------+-------------------+--------------------+
result = df.groupBy("sku") \
    .agg(F.collect_list(F.struct("name", "description")).alias("name_description"))

# Note: DataFrame.toJSON() returns an RDD of strings in PySpark (no .show()),
# so build the JSON string with F.to_json() to keep a DataFrame.
result.select(F.to_json(F.struct("sku", "name_description")).alias("value")).show(truncate=False)
+----------------------------------------------------------------------------------------------------------------+
|value                                                                                                           |
+----------------------------------------------------------------------------------------------------------------+
|{"sku":"hand-couple-manage","name_description":[{"name":"Mary Rodriguez","description":"Senior word socia..."}]}|
|{"sku":"night-art-be-act","name_description":[{"name":"Jason Patterson","description":"Entire around pla..."}]} |
|{"sku":"forget-matter-avoid","name_description":[{"name":"Olivia Lynch","description":"Perhaps environme..."}]} |
|{"sku":"compare-politics","name_description":[{"name":"Kiara Gentry","description":"Air my kind staff..."}]}    |
|{"sku":"child-somebody","name_description":[{"name":"Karen Villegas","description":"Every tell serve...."}]}    |
|{"sku":"side-blue-dream","name_description":[{"name":"Whitney Wiley","description":"Quickly short soc..."}]}    |
|{"sku":"radio-window-us","name_description":[{"name":"Paul Morris","description":"Society month sho..."}]}      |
|{"sku":"east-pretty","name_description":[{"name":"Brittany Johnson","description":"Indicate view sim..."}]}     |
|{"sku":"together-table-oil","name_description":[{"name":"Jose Henderson","description":"Apply girl treatm..."}]}|
+----------------------------------------------------------------------------------------------------------------+
Comments:
Do you have a SQL approach?
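In Spark SQL the same shape can be produced with collect_list, struct, and to_json. As an illustration only (not Spark), here is the equivalent GROUP BY in plain SQL using SQLite's JSON1 functions json_group_array and json_object; the products table and its rows are hypothetical stand-ins for the dataframe above:

```python
import json
import sqlite3

# Assumes a SQLite build with the JSON1 functions, which ships with modern Python.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, sku TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Mary Rodriguez", "hand-couple-manage", "Senior word socia..."),
        ("Jose Henderson", "together-table-oil", "Apply girl treatm..."),
    ],
)

# GROUP BY sku, aggregating each row into a JSON object inside a JSON array --
# the SQL analogue of collect_list(struct(name, description)).
rows = conn.execute("""
    SELECT sku,
           json_group_array(json_object('name', name, 'description', description))
    FROM products
    GROUP BY sku
""").fetchall()

for sku, name_description in rows:
    print(sku, json.loads(name_description))
```

The same SELECT shape, with Spark's collect_list and to_json in place of the SQLite functions, could be run via spark.sql() against a registered temp view.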