Select non-null values from multiple columns in PySpark

Posted: 2021-08-21 03:34:18

【Question】:

Below is my dataframe:

+---------+-----+--------+---------+-------+
|     NAME|Actor|  Doctor|Professor|Singer |
+---------+-----+--------+---------+-------+
| Samantha| null|Samantha|     null|   null|
|Christeen| null|    null|Christeen|   null|
|    Meera| null|    null|     null|  Meera|
|    Julia|Julia|    null|     null|   null|
|    Priya| null|    null|     null|  Priya|
|   Ashley| null|    null|   Ashley|   null|
|    Jenny| null|   Jenny|     null|   null|
|    Maria|Maria|    null|     null|   null|
|     Jane| Jane|    null|     null|   null|
|    Ketty| null|    null|    Ketty|   null|
+---------+-----+--------+---------+-------+

I want to select all the non-null values from the ACTOR, DOCTOR, PROFESSOR and SINGER columns.

【Comments】:

【Answer 1】:

This can be achieved with isNotNull: build a condn expressing the rule you want, and finally apply filter -

You can further modify condn to match your requirements (a variant is sketched after the final output below) -

Prepare the data

input_str = """
Samantha|None|Samantha|     None|   None|
Christeen| None|    None|Christeen|   None|
Meera| None|    None|     None|  Meera|
Julia|Julia|    None|     None|   None|
Priya| None|    None|     None|  Priya|
Ashley| None|    None|   Ashley|   None|
Jenny| None|   Jenny|     None|   None|
Maria|Maria|    None|     None|   None|
Jane| Jane|    None|     None|   None|
Ketty| None|    None|    Ketty|   None|
Aditya| Aditya|  Aditya|   Aditya|  Aditya|
""".split("|")


# strip whitespace from each token and drop the trailing empty token left after the last "|"
input_values = list(map(lambda x:x.strip(),input_str))[:-1]

i = 0

n = len(input_values)

input_list = []

# group the flat token list into tuples of 5 - (Name, Actor, Doctor, Professor, Singer)
while i < n:
    input_list += [ tuple(input_values[i:i+5]) ]
    i += 5

input_list

[('Samantha', 'None', 'Samantha', 'None', 'None'),
 ('Christeen', 'None', 'None', 'Christeen', 'None'),
 ('Meera', 'None', 'None', 'None', 'Meera'),
 ('Julia', 'Julia', 'None', 'None', 'None'),
 ('Priya', 'None', 'None', 'None', 'Priya'),
 ('Ashley', 'None', 'None', 'Ashley', 'None'),
 ('Jenny', 'None', 'Jenny', 'None', 'None'),
 ('Maria', 'Maria', 'None', 'None', 'None'),
 ('Jane', 'Jane', 'None', 'None', 'None'),
 ('Ketty', 'None', 'None', 'Ketty', 'None'),
 ('Aditya', 'Aditya', 'Aditya', 'Aditya', 'Aditya')]
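
Incidentally, an equivalent test DataFrame could be built directly with Python None values, which Spark stores as nulls; a minimal sketch follows (the string-based rows are kept above so the conversion step below can be shown, and direct_rows is an illustrative name):

# sketch: rows built with real Python None land as nulls without any conversion step
direct_rows = [
    ('Samantha', None, 'Samantha', None, None),
    ('Christeen', None, None, 'Christeen', None),
    # ... remaining rows ...
    ('Aditya', 'Aditya', 'Aditya', 'Aditya', 'Aditya'),
]
# sparkDF = sql.createDataFrame(direct_rows, ['Name','Actor','Doctor','Professor','Singer'])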

Convert the "None" strings to null

from functools import reduce
from pyspark.sql import SparkSession, functions as F

# assumption: `sql` is the active SparkSession
sql = SparkSession.builder.getOrCreate()

def blank_as_null(x):
    # replace the literal string "None" with a real null
    return F.when(F.col(x) == "None", None).otherwise(F.col(x))

sparkDF = sql.createDataFrame(input_list,['Name','Actor','Doctor','Professor','Singer'])

to_convert = set(['Actor','Doctor','Professor','Singer'])

# apply blank_as_null to every role column
sparkDF = reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, sparkDF)

sparkDF.show()
+---------+------+--------+---------+------+
|     Name| Actor|  Doctor|Professor|Singer|
+---------+------+--------+---------+------+
| Samantha|  null|Samantha|     null|  null|
|Christeen|  null|    null|Christeen|  null|
|    Meera|  null|    null|     null| Meera|
|    Julia| Julia|    null|     null|  null|
|    Priya|  null|    null|     null| Priya|
|   Ashley|  null|    null|   Ashley|  null|
|    Jenny|  null|   Jenny|     null|  null|
|    Maria| Maria|    null|     null|  null|
|     Jane|  Jane|    null|     null|  null|
|    Ketty|  null|    null|    Ketty|  null|
|   Aditya|Aditya|  Aditya|   Aditya|Aditya|
+---------+------+--------+---------+------+
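
As a side note, the reduce / withColumn chain above can also be written as a single select projection; below is a minimal sketch of that alternative, assuming the same F import (role_cols is just an illustrative name):

# sketch: the same "None" -> null conversion as one projection instead of chained withColumn
role_cols = ['Actor','Doctor','Professor','Singer']

sparkDF = sparkDF.select(
    'Name',
    *[F.when(F.col(c) == 'None', None).otherwise(F.col(c)).alias(c) for c in role_cols]
)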

Filter the non-null values

# all four role columns must be non-null
condn = (
           (F.col('Actor').isNotNull())
         & (F.col('Doctor').isNotNull())
         & (F.col('Professor').isNotNull())
         & (F.col('Singer').isNotNull())
)

sparkDF.filter(condn).show()

+------+------+------+---------+------+
|  Name| Actor|Doctor|Professor|Singer|
+------+------+------+---------+------+
|Aditya|Aditya|Aditya|   Aditya|Aditya|
+------+------+------+---------+------+
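
As mentioned earlier, condn is just a Column expression and can be adapted to other rules. For example, here is a sketch of a looser rule that keeps rows with at least one non-null role column, combining the checks with OR via functools.reduce (any_role_condn is an illustrative name):

import operator
from functools import reduce

# sketch: keep rows where at least one of the four role columns is non-null (OR instead of AND)
any_role_condn = reduce(
    operator.or_,
    [F.col(c).isNotNull() for c in ['Actor','Doctor','Professor','Singer']]
)

sparkDF.filter(any_role_condn).show()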

【Discussion】:
