Convert dataframe schema in PySpark [duplicate]

Posted: 2020-06-13 13:48:40


Question:

I have a dataframe:

+------------------+-------------------+--------------------+
|              name|                sku|         description|
+------------------+-------------------+--------------------+
|    Mary Rodriguez| hand-couple-manage|Senior word socia...|
|    Jose Henderson| together-table-oil|Apply girl treatm...|
|    Karen Villegas|     child-somebody|Every tell serve....|
|      Olivia Lynch|forget-matter-avoid|Perhaps environme...|
|     Whitney Wiley|    side-blue-dream|Quickly short soc...|
|  Brittany Johnson|        east-pretty|Indicate view sim...|
|       Paul Morris|    radio-window-us|Society month sho...|
|   Jason Patterson|   night-art-be-act|Entire around pla...|
|      Kiara Gentry|   compare-politics|Air my kind staff...|
+------------------+-------------------+--------------------+

The desired schema:

root
 |-- sku: string (nullable = true)
 |-- name_description: array (nullable = true)
 |    |-- element: string (containsNull = true)

How can I group by the sku column and merge the values from name and description into a name_description column, whose value is an array of JSON of the form [{"name": ..., "description": ...}, {"name": ..., "description": ...}, ...], in PySpark?

Comments:

Do these answers help? pyspark create dictionary from data in two columns, Spark scala dataframe: Merging multiple columns into single column, and so on? - mazaneicha

@mazaneicha No, I am looking for a specific format.

Answer 1:

Check the code below.


df.show(truncate=False)
+----------------+-------------------+--------------------+
|name            |sku                |description         |
+----------------+-------------------+--------------------+
|Mary Rodriguez  |hand-couple-manage |Senior word socia...|
|Jose Henderson  |together-table-oil |Apply girl treatm...|
|Karen Villegas  |child-somebody     |Every tell serve....|
|Olivia Lynch    |forget-matter-avoid|Perhaps environme...|
|Whitney Wiley   |side-blue-dream    |Quickly short soc...|
|Brittany Johnson|east-pretty        |Indicate view sim...|
|Paul Morris     |radio-window-us    |Society month sho...|
|Jason Patterson |night-art-be-act   |Entire around pla...|
|Kiara Gentry    |compare-politics   |Air my kind staff...|
+----------------+-------------------+--------------------+

from pyspark.sql import functions as F

df.groupBy(F.col("sku")) \
  .agg(F.collect_list(F.struct(F.col("name"), F.col("description"))).alias("name_description")) \
  .toJSON() \
  .show(truncate=False)
+----------------------------------------------------------------------------------------------------------------+
|value                                                                                                             |
+----------------------------------------------------------------------------------------------------------------+
|{"sku":"hand-couple-manage","name_description":[{"name":"Mary Rodriguez","description":"Senior word socia..."}]}|
|{"sku":"night-art-be-act","name_description":[{"name":"Jason Patterson","description":"Entire around pla..."}]} |
|{"sku":"forget-matter-avoid","name_description":[{"name":"Olivia Lynch","description":"Perhaps environme..."}]} |
|{"sku":"compare-politics","name_description":[{"name":"Kiara Gentry","description":"Air my kind staff..."}]}    |
|{"sku":"child-somebody","name_description":[{"name":"Karen Villegas","description":"Every tell serve...."}]}    |
|{"sku":"side-blue-dream","name_description":[{"name":"Whitney Wiley","description":"Quickly short soc..."}]}    |
|{"sku":"radio-window-us","name_description":[{"name":"Paul Morris","description":"Society month sho..."}]}      |
|{"sku":"east-pretty","name_description":[{"name":"Brittany Johnson","description":"Indicate view sim..."}]}     |
|{"sku":"together-table-oil","name_description":[{"name":"Jose Henderson","description":"Apply girl treatm..."}]}|
+----------------------------------------------------------------------------------------------------------------+
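
Note that collect_list(struct(...)) yields name_description as an array of structs. To match the schema in the question exactly (name_description as an array of JSON strings), each struct can be serialized with to_json before collecting. A minimal PySpark sketch, assuming df holds the original name/sku/description columns (the result variable name is illustrative):

from pyspark.sql import functions as F

# Serialize each (name, description) pair to a JSON string, then collect per sku.
result = df.groupBy("sku").agg(
    F.collect_list(
        F.to_json(F.struct(F.col("name"), F.col("description")))
    ).alias("name_description")
)

result.printSchema()           # name_description: array<string>
result.show(truncate=False)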

Comments:

Do you have an SQL approach?
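
The same aggregation can also be written in Spark SQL with collect_list, struct and to_json. A minimal sketch, assuming the dataframe is registered as a temporary view named products (the view name is illustrative):

# Register the dataframe under a hypothetical view name.
df.createOrReplaceTempView("products")

spark.sql("""
    SELECT sku,
           collect_list(to_json(struct(name, description))) AS name_description
    FROM products
    GROUP BY sku
""").show(truncate=False)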
