Categories of columns that are always NULL for every value of a specific column in pyspark

Posted 2023-04-15

【Title】: Categories of columns that are always NULL for every value of a specific column in pyspark 【Published】: 2020-05-11 11:34:10 【Question】:

I have a very large table stored in Spark. Suppose it looks like this:

+----------------------------------------------------------------------------+
| col1  | col2 | col3  | col4(event_type) | col5  | col6  |   ...     |col20 |
+----------------------------------------------------------------------------+
|  null | val1 | val2  |       'A'        | val3  | null  |   ...     | null |
|  val4 | null | val5  |       'B'        | val6  | null  |   ...     | null |
|  null | null | val7  |       'C'        | null  | val8  |   ...     | val9 |
|  null | val1 | val8  |       'A'        | val2  | null  |   ...     | null |
|............................................................................|
+----------------------------------------------------------------------------+

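This table has many columns that contain many NULL values. It also has a type column. For each value of type, certain columns are always NULL. For example, in the table above, for type='A', col1, col5 and col20 are NULL.

For each type, I want to extract the names of all columns that are not null (for example, for type 'A', I want to get the names col1, col5 and col20).

Can anyone help me with how to do this?

Update:

As @Mohammad suggested, I tried this pyspark code: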

from pyspark.sql import functions as F

df.groupBy("event_type").agg\
(F.first(F.concat_ws(",",*[(F.when(F.col(x).isNotNull(), F.lit(x)))\
                           for x in df.columns if x!='event_type'])).alias("non_null_columns")).show()

It seems to be correct, but the result is not displayed properly. It looks like this:

+--------------------+-----------------------------------------+
|     event_type     |all_not_null_columns_for_each_event      |
+--------------------+-----------------------------------------+
|    event1_name     |                     timestamp,created...|
|    event2_comple...|                     timestamp,created...|
|    event3_name     |                     timestamp,battery...|
|    event5_name     |                     timestamp,battery...|
|    event6_name     |                     timestamp,battery...|
|    event7_comple...|                     timestamp,created...|
+--------------------+-----------------------------------------+

As you can see, the result is not shown in full; instead we see ...

【Comments】:

【Answer 1】:

You can loop over the columns with the PySpark DataFrame API, which is not possible in pure SQL.

df.show()  # sample data
#+----+----+----+----------+
#|col1|col2|col3|event_type|
#+----+----+----+----------+
#|null|val1|val2|         A|
#|val4|null|val5|         B|
#|null|null|val7|         C|
#|null|val1|val8|         A|
#+----+----+----+----------+


from pyspark.sql import functions as F

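# For each row, emit the column name when the value is not null (else null);
# concat_ws skips the nulls and joins the remaining names with commas.
non_null_names = [
    F.when(F.col(x).isNotNull(), F.lit(x))
    for x in df.columns
    if x != "event_type"
]

# Keep the first such list per event_type.
df.groupBy("event_type").agg(
    F.first(F.concat_ws(",", *non_null_names)).alias("non_null_columns")
).show()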


#+----------+----------------+
#|event_type|non_null_columns|
#+----------+----------------+
#|         B|       col1,col3|
#|         C|            col3|
#|         A|       col2,col3|
#+----------+----------------+
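
One point that may be worth checking against real data: `F.first` keeps the value from a single row per group, so the list it returns describes only that row, and Spark does not guarantee which row is "first" unless the data is ordered. The sketch below is a plain-Python model (not Spark code) of the two possible meanings of "non-null columns per type". The function names and the extra fifth row (`val9`, `val0`) are hypothetical and were added only to make the difference visible.

```python
from collections import defaultdict

# Sample rows mirroring the answer's table, plus one hypothetical extra
# row for type "A" in which col1 is set and col2 is not.
rows = [
    {"col1": None, "col2": "val1", "col3": "val2", "event_type": "A"},
    {"col1": "val4", "col2": None, "col3": "val5", "event_type": "B"},
    {"col1": None, "col2": None, "col3": "val7", "event_type": "C"},
    {"col1": None, "col2": "val1", "col3": "val8", "event_type": "A"},
    {"col1": "val9", "col2": None, "col3": "val0", "event_type": "A"},
]


def non_null_first_row(rows, key="event_type"):
    """Model of first(concat_ws(...)): only the first row seen per group counts."""
    result = {}
    for row in rows:
        if row[key] not in result:
            result[row[key]] = [
                c for c, v in row.items() if c != key and v is not None
            ]
    return result


def non_null_any_row(rows, key="event_type"):
    """Model of a count(col) > 0 check: a column counts if any row has a value."""
    columns = [c for c in rows[0] if c != key]
    seen = defaultdict(set)
    for row in rows:
        group = seen[row[key]]  # registers the group even if every column is null
        for c in columns:
            if row[c] is not None:
                group.add(c)
    # Report the columns in their original order.
    return {k: [c for c in columns if c in v] for k, v in seen.items()}


print(non_null_first_row(rows)["A"])
print(non_null_any_row(rows)["A"])
```

In this model the first-row version reports only col2 and col3 for type "A", while the any-row version also reports col1. If the any-row meaning is the one wanted, a per-column `count(col) > 0` test, as in the SQL answer further down, is one way to express it.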

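【Comments】:

This is a good idea, but the result has many duplicate rows that are actually identical.

Thanks. The duplicate rows are gone now, but I don't know why I can see at most two values in non_null_columns, and the rest is shown as .... For example, one row of the result looks like this: "A" | val1, val2...

@Saeed I'm not sure what you mean by at most two values, since the code gets all non-null columns. Can you elaborate?

I updated my question and added my output, so you can see it.

Use .show(truncate=False) to see the full rows.

【Answer 2】:

In SQL, you can do this: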

select event_type,
       concat_ws(',',
                 (case when count(col1) > 0 then 'col1' end),
                 (case when count(col2) > 0 then 'col2' end),
                 (case when count(col3) > 0 then 'col3' end),
                 . . .
                ) as non_null_columns
from t
group by event_type;
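
Writing one `case` line per column by hand gets tedious for a wide table, so the query text can be generated instead. The following is a minimal sketch, assuming plain column names that need no quoting; the helper `build_non_null_query` and its parameters are hypothetical, and the table name `t` is taken from the query above.

```python
def build_non_null_query(columns, key="event_type", table="t"):
    """Build the GROUP BY query with one CASE expression per non-key column."""
    cases = ",\n".join(
        f"                 (case when count({c}) > 0 then '{c}' end)"
        for c in columns
        if c != key  # the grouping column itself is not checked
    )
    return (
        f"select {key},\n"
        f"       concat_ws(',',\n{cases}\n"
        f"                ) as non_null_columns\n"
        f"from {table}\n"
        f"group by {key}"
    )


# In practice the list would come from df.columns; it is hard-coded here.
query = build_non_null_query(["col1", "col2", "col3", "event_type"])
print(query)
```

Assuming a SparkSession named `spark` and a DataFrame registered with `df.createOrReplaceTempView("t")`, the generated string could then be run with `spark.sql(query).show(truncate=False)`.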

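【Comments】:

Can you explain your code? It doesn't work for me. Which part of your query can we use to filter by event value?

@Saeed . . . You can add a WHERE clause. I adjusted the answer to use aggregation, so you get a result for each event_type.

Thanks. It works now. Now I just want to find a way to loop, because I have a lot of columns. Anyway, thanks for your reply.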
