Select not-null values from multiple columns in PySpark
Posted: 2021-08-21 03:34:18

Question: Below is my dataframe:
+---------+-----+--------+---------+------+
|     NAME|Actor|  Doctor|Professor|Singer|
+---------+-----+--------+---------+------+
| Samantha| null|Samantha|     null|  null|
|Christeen| null|    null|Christeen|  null|
|    Meera| null|    null|     null| Meera|
|    Julia|Julia|    null|     null|  null|
|    Priya| null|    null|     null| Priya|
|   Ashley| null|    null|   Ashley|  null|
|    Jenny| null|   Jenny|     null|  null|
|    Maria|Maria|    null|     null|  null|
|     Jane| Jane|    null|     null|  null|
|    Ketty| null|    null|    Ketty|  null|
+---------+-----+--------+---------+------+
I want to select all the not-null values from the ACTOR, DOCTOR, PROFESSOR and SINGER columns.
Answer 1:

This can be achieved with isNotNull: build a condn expressing the rule you want, and finally filter on it. You can modify condn further to match your exact requirement.

Prepare the data
from functools import reduce

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# "sql" below is the SparkSession handle used throughout this answer
sql = SparkSession.builder.getOrCreate()

input_str = """
Samantha|None|Samantha| None| None|
Christeen| None| None|Christeen| None|
Meera| None| None| None| Meera|
Julia|Julia| None| None| None|
Priya| None| None| None| Priya|
Ashley| None| None| Ashley| None|
Jenny| None| Jenny| None| None|
Maria|Maria| None| None| None|
Jane| Jane| None| None| None|
Ketty| None| None| Ketty| None|
Aditya| Aditya| Aditya| Aditya| Aditya|
""".split("|")

# strip whitespace from every token; the trailing newline after the last "|"
# becomes a stray token, dropped with [:-1]
input_values = list(map(lambda x: x.strip(), input_str))[:-1]

# group the flat token list into 5-tuples: (Name, Actor, Doctor, Professor, Singer)
i = 0
n = len(input_values)
input_list = []
while i < n:
    input_list += [tuple(input_values[i:i + 5])]
    i += 5
input_list
[('Samantha', 'None', 'Samantha', 'None', 'None'),
('Christeen', 'None', 'None', 'Christeen', 'None'),
('Meera', 'None', 'None', 'None', 'Meera'),
('Julia', 'Julia', 'None', 'None', 'None'),
('Priya', 'None', 'None', 'None', 'Priya'),
('Ashley', 'None', 'None', 'Ashley', 'None'),
('Jenny', 'None', 'Jenny', 'None', 'None'),
('Maria', 'Maria', 'None', 'None', 'None'),
('Jane', 'Jane', 'None', 'None', 'None'),
('Ketty', 'None', 'None', 'Ketty', 'None'),
('Aditya', 'Aditya', 'Aditya', 'Aditya', 'Aditya')]
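As an aside, if you control how the data is created you can skip the string round-trip entirely: build the tuples with Python None and Spark stores real nulls from the start. A minimal sketch under that assumption (rows and df are illustrative names, with the same column layout as above):

# Hypothetical alternative: rows carry real Python None values, so the
# "None"-string-to-null conversion in the next step becomes unnecessary.
rows = [
    ("Samantha", None, "Samantha", None, None),
    ("Julia", "Julia", None, None, None),
    ("Aditya", "Aditya", "Aditya", "Aditya", "Aditya"),
]
df = sql.createDataFrame(rows, ["Name", "Actor", "Doctor", "Professor", "Singer"])
df.show()  # the null cells print as "null", no post-processing needed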
Convert the "None" strings to null
def blank_as_null(x):
    # replace the literal string "None" with a real null, keep other values
    return F.when(F.col(x) == "None", None).otherwise(F.col(x))

sparkDF = sql.createDataFrame(input_list, ['Name', 'Actor', 'Doctor', 'Professor', 'Singer'])
to_convert = set(['Actor', 'Doctor', 'Professor', 'Singer'])

# apply the conversion to each occupation column in turn
sparkDF = reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, sparkDF)
sparkDF.show()
+---------+------+--------+---------+------+
|     Name| Actor|  Doctor|Professor|Singer|
+---------+------+--------+---------+------+
| Samantha|  null|Samantha|     null|  null|
|Christeen|  null|    null|Christeen|  null|
|    Meera|  null|    null|     null| Meera|
|    Julia| Julia|    null|     null|  null|
|    Priya|  null|    null|     null| Priya|
|   Ashley|  null|    null|   Ashley|  null|
|    Jenny|  null|   Jenny|     null|  null|
|    Maria| Maria|    null|     null|  null|
|     Jane|  Jane|    null|     null|  null|
|    Ketty|  null|    null|    Ketty|  null|
|   Aditya|Aditya|  Aditya|   Aditya|Aditya|
+---------+------+--------+---------+------+
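As a side note, recent PySpark versions can do this string-to-null conversion without a helper function via DataFrame.replace; a hedged sketch, assuming the same sparkDF and columns as above:

# Alternative sketch: map the literal string 'None' to a real null in the
# four occupation columns in one call.
sparkDF = sparkDF.replace('None', None, subset=['Actor', 'Doctor', 'Professor', 'Singer'])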
Filter the non-null values
condn = (
(F.col('Actor').isNotNull())
& (F.col('Doctor').isNotNull())
& (F.col('Professor').isNotNull())
& (F.col('Singer').isNotNull())
)
sparkDF.filter(condn).show()
+------+------+------+---------+------+
|  Name| Actor|Doctor|Professor|Singer|
+------+------+------+---------+------+
|Aditya|Aditya|Aditya|   Aditya|Aditya|
+------+------+------+---------+------+
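If the column list grows, writing condn by hand gets tedious. Two equivalent shortcuts, sketched with the same sparkDF and imports as above:

# Build the same conjunction programmatically: AND together one
# isNotNull() predicate per column.
cols = ['Actor', 'Doctor', 'Professor', 'Singer']
condn = reduce(lambda a, b: a & b, [F.col(c).isNotNull() for c in cols])
sparkDF.filter(condn).show()

# Or drop any row holding a null in those columns directly:
sparkDF.na.drop(how='any', subset=cols).show()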