Filtering a dictionary in PySpark by key name
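【Title】: Filter dictionary in pyspark with key names 【Posted】: 2021-07-21 17:30:47 【Question】: I'm new to pyspark. Given a dictionary-like column in a dataset, I want to get the value of one key whenever another key's value matches a condition.
Example: suppose the dataset has a column "statistics", where each row looks like this: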
array
0: {"hair": "black", "eye": "white", "metric": "feet"}
1: {"hair": "blue", "eye": "white", "metric": "m"}
2: {"hair": "red", "eye": "brown", "metric": "feet"}
3: {"hair": "yellow", "eye": "white", "metric": "cm"}
Whenever "hair" is "black", I want to get the value of "eye".
I tried:
select
statistics.eye("*").filter(statistics.hair, x -> x == 'black')
from arrayData
But it raises an error and I can't get the value of "eye". Please help.
【Answer 1】: You can convert the data to a DataFrame and filter it there. You can also register it as a temporary view and query it with SQL.
from pyspark.sql import functions as F
# Assumes an active SparkContext `sc` and SparkSession `spark`, as in the pyspark shell.
df = sc.parallelize([
    {"hair": "black", "eye": "white", "metric": "feet"},
    {"hair": "blue", "eye": "white", "metric": "m"},
    {"hair": "red", "eye": "brown", "metric": "feet"},
    {"hair": "yellow", "eye": "white", "metric": "cm"},
]).toDF()
>>> df.show()
+-----+------+------+
|  eye|  hair|metric|
+-----+------+------+
|white| black|  feet|
|white|  blue|     m|
|brown|   red|  feet|
|white|yellow|    cm|
+-----+------+------+
>>> df.filter(F.col("hair") == 'black').show()
+-----+-----+------+
|  eye| hair|metric|
+-----+-----+------+
|white|black|  feet|
+-----+-----+------+
df.createOrReplaceTempView("data")
spark.sql("select * from data where hair = 'black'").show()
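【Answer 2】: I eventually figured it out without converting to a DataFrame first.
The aggregate function lets you pull the value of one key whenever another key's value matches a condition. For this case, the following query is enough: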
select
aggregate(statistics,"",(agg,item)->concat(agg,CASE WHEN item.hair == 'black' THEN item.eye ELSE "" END)) as EyeColor
from arrayData
For more details on how to use this function, see here
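To make the fold performed by aggregate easier to follow, here is a minimal plain-Python sketch of the same logic. It does not use Spark: the `statistics` list is a hypothetical in-memory copy of one row from the question, and the helper names `eye_color_for_hair` and `eye_colors_for_hair` are made up for illustration.

```python
from functools import reduce

# Hypothetical in-memory copy of one row's `statistics` array from the question.
statistics = [
    {"hair": "black", "eye": "white", "metric": "feet"},
    {"hair": "blue", "eye": "white", "metric": "m"},
    {"hair": "red", "eye": "brown", "metric": "feet"},
    {"hair": "yellow", "eye": "white", "metric": "cm"},
]


def eye_color_for_hair(items, hair):
    """Fold over the items, appending `eye` to the accumulator when `hair` matches."""
    return reduce(
        lambda agg, item: agg + (item["eye"] if item["hair"] == hair else ""),
        items,
        "",  # initial accumulator, playing the role of the "" passed to aggregate
    )


def eye_colors_for_hair(items, hair):
    """List variant: keeps each match separate instead of concatenating them."""
    return [item["eye"] for item in items if item["hair"] == hair]


eye_color = eye_color_for_hair(statistics, "black")
print(eye_color)  # prints: white
```

Note that the concatenating form joins several matches into one string with no separator, so two black-haired entries with "white" and "brown" eyes would yield "whitebrown". If separate values are needed, the list form is the safer shape. In Spark SQL the built-in higher-order functions filter and transform should give that shape, for example transform(filter(statistics, x -> x.hair = 'black'), x -> x.eye), though that expression has not been run against the asker's data.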