How to combine all functions into one column on PySpark?

[Posted]: 2020-05-19 10:13:31

[Question]:

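I'm trying to combine all of my mapping functions into a single column called 'Gender'. I've done this successfully with Pandas, but now I want to do the same with PySpark, which works a bit differently from Pandas. I can't call .apply in PySpark.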

Here is the version I wrote with Pandas:

df['Gender'] = df['Gender'].str.lower()

male = ["male", "m", "male-ish", "maile", "mal", "male (cis)", "make", "male ", "man", "msle", "mail", "malr","cis man", "cis male"]
female = ["cis female", "f", "female", "woman",  "femake", "female ","cis-female/femme", "female (cis)", "femail", "trans-female",  "trans woman", "female (trans)"]
other = ["non-binary", "nah", "all", "enby", "fluid", "genderqueer", "androgyne", "agender", "male leaning androgynous", "guy (-ish) ^_^", "neuter", "queer", "ostensibly male, unsure what that really means", "queer/she/they", "something kinda male?", "a little about you", "p"]

new_df['Gender'] = new_df['Gender'].apply(lambda x:"Male" if x in male else x)
new_df['Gender'] = new_df['Gender'].apply(lambda x:"Female" if x in female else x)
new_df['Gender'] = new_df['Gender'].apply(lambda x:"Other" if x in other else x)

Here is my attempt to replicate it in PySpark, but I can't get the converted values back into the 'Gender' column:

from pyspark.sql.functions import lower, col, udf
import pyspark.sql.functions as f 

na_df = na_df.withColumn('Gender', lower(col('Gender')))

Male = ["male", "m", "male-ish", "maile", "mal", "male (cis)", "make", "male ", "man", "msle", "mail", "malr","cis man", "cis male"]
Female = ["cis female", "f", "female", "woman",  "femake", "female ","cis-female/femme", "female (cis)", "femail", "trans-female",  "trans woman", "female (trans)"]
Other = ["non-binary", "nah", "all", "enby", "fluid", "genderqueer", "androgyne", "agender", "male leaning androgynous", "guy (-ish) ^_^", "neuter", "queer", "ostensibly male, unsure what that really means", "queer/she/they", "something kinda male?", "a little about you", "p"]

na_df2 = na_df.withColumn('Gender',f.when(f.col('Gender').isin(Male),f.lit('Male')).\
when(f.col('Gender').isin(Other),f.lit('Other')).\
when(f.col('Gender').isin(Female),f.lit('Female')).\
otherwise(f.col('Gender'))).show()

na_df2.select('Gender').distinct().show()

Here is another version I tried, but it fails with an error saying the column cannot be converted to bool:

from pyspark.sql.functions import lower, col, udf

na_df = na_df.withColumn('Gender', lower(col('Gender')))

genders = {
    'Male': ["male", "m", "male-ish", "maile", "mal", "male (cis)", "make", "male ", "man", "msle", "mail", "malr","cis man", "cis male"],
    'Female': ["cis female", "f", "female", "woman",  "femake", "female ","cis-female/femme", "female (cis)", "femail", "trans-female",  "trans woman", "female (trans)"],
    'Other': ["non-binary", "nah", "all", "enby", "fluid", "genderqueer", "androgyne", "agender", "male leaning androgynous", "guy (-ish) ^_^", "neuter", "queer", "ostensibly male, unsure what that really means", "queer/she/they", "something kinda male?", "a little about you", "p"]
}

na_df.withColumn('Gender', (lambda x: [g for g in genders if x in genders[g]][0])(col('Gender'))).show()

The result is that the 'Gender' column is not updated, so please advise what I can do to fix this. Thanks in advance!

[Comments]:

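By the way, there is a better alternative for your pandas code (unrelated to this question, just saying: don't use apply in this case); take a look at how np.select works. With PySpark, you can try the answer below, or use a CASE WHEN expression with selectExpr, or when and otherwise.

[Answer 1]: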

You can do this by chaining when functions.

import pyspark.sql.functions as f

df.show()
+---+----------+
| id|    gender|
+---+----------+
|  1|      male|
|  1|         m|
|  1|  male-ish|
|  1|     maile|
|  1|       mal|
|  1|male (cis)|
|  1|      make|
|  1|     male |
|  1|       man|
|  1|      msle|
|  1|      mail|
|  1|      malr|
|  1|   cis man|
|  1|  cis male|
|  1|cis female|
|  1|         f|
|  1|    female|
|  1|     woman|
|  1|    femake|
|  1|   female |
+---+----------+

# male, female and other are the lowercase lists defined in the question
df = df.withColumn('gender',f.when(f.col('gender').isin(male),f.lit('Male')).\
when(f.col('gender').isin(other),f.lit('Other')).\
when(f.col('gender').isin(female),f.lit('Female')).\
otherwise(f.col('gender')))

df.show()
+---+------+
| id|gender|
+---+------+
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|  Male|
|  1|Female|
|  1|Female|
|  1|Female|
|  1|Female|
|  1|Female|
|  1|Female|
+---+------+
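The dictionary-based attempt in the question fails because `x in genders[g]` is evaluated by Python itself: the comparison against a Column yields another Column, and PySpark refuses to turn that into a bool. A minimal sketch of the same idea expressed as one chained when() built from a dictionary follows. The names GENDERS, clean_groups and normalize_gender are hypothetical, and the mapping is shortened for illustration; the full lists from the question would go in its place.

```python
# Hypothetical, shortened mapping; substitute the full lists from the question.
GENDERS = {
    "Male": ["male", "m", "man", "cis male", "male "],
    "Female": ["female", "f", "woman", "cis female"],
    "Other": ["non-binary", "enby", "queer"],
}


def clean_groups(groups):
    """Lowercase and strip every raw value so the lists match a cleaned column."""
    return {label: [raw.strip().lower() for raw in raws]
            for label, raws in groups.items()}


def normalize_gender(df, column="gender", groups=GENDERS):
    """Return df with `column` replaced by its group label where one matches."""
    # Imported lazily so clean_groups can be used without a Spark installation.
    from pyspark.sql import functions as F

    # Compare on a trimmed, lowercased copy of the column.
    cleaned = F.lower(F.trim(F.col(column)))
    expr = None
    for label, raws in clean_groups(groups).items():
        cond = cleaned.isin(raws)
        # The first branch starts the chain with F.when; later ones extend it.
        expr = F.when(cond, F.lit(label)) if expr is None else expr.when(cond, F.lit(label))
    if expr is None:  # empty mapping: nothing to replace
        return df
    # Values that match no group are left unchanged.
    return df.withColumn(column, expr.otherwise(F.col(column)))
```

Under these assumptions it could be called as `na_df2 = normalize_gender(na_df, column="Gender")`, followed by `na_df2.select("Gender").distinct().show()` on a separate line. Note that DataFrame.show() returns None, so its result should not be assigned to a variable that is used later.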

[Discussion]:

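This doesn't work, and I'm not sure why. I have updated my code above for review.

What error are you getting? I hope you imported pyspark.sql.functions as f.

There is no error, but the result is still the same; nothing happened. It doesn't group the values so that only "Male", "Female" and "Other" are shown. I have updated the code above for review.

Can you update the question with the data in your DataFrame?

@Shubham Jain Never mind, it's solved, thanks. It was my mistake: I had changed the values from male to capitalized Male, female to capitalized Female, and so on. Thanks for your help, it's been a learning journey.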
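The resolution above comes down to a case mismatch: the column had been lowercased, so lists containing capitalized values could never match and isin silently left every row unchanged. A small check along the following lines might surface that kind of mistake early. It is only a sketch in plain Python; unmatched_values and the sample data are made up for illustration.

```python
def unmatched_values(values, groups):
    """Return the values found in none of the group lists (exact, case-sensitive)."""
    known = {raw for raws in groups.values() for raw in raws}
    return sorted(v for v in set(values) if v not in known)


# Values as they look after lower() has been applied to the column.
column_values = ["male", "m", "female", "f", "enby"]

# Lists accidentally written with capitals: nothing matches.
capitalized = {"Male": ["Male", "M"], "Female": ["Female", "F"], "Other": ["Enby"]}
print(unmatched_values(column_values, capitalized))
# ['enby', 'f', 'female', 'm', 'male']

# Lists in lower case: everything matches.
lowercase = {"Male": ["male", "m"], "Female": ["female", "f"], "Other": ["enby"]}
print(unmatched_values(column_values, lowercase))
# []
```

With a real DataFrame, the input values could be gathered with something like `[row[0] for row in df.select('gender').distinct().collect()]`, assuming the number of distinct values is small enough to collect to the driver.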
