Applying describe() with a column-specific filter in PySpark

Posted 2023-04-15


Original title: Apply describe with filter on column specific in pyspark
Originally asked: 2018-08-03 11:46:41

Question:

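I have a Hive table hive_tbl with the columns 'col_1', 'col_2', 'col_3', and I have created a DataFrame on top of that data.

I am now getting statistics for the specified columns with describe(), and the result looks like this.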

+-------+------------------+------------------+------------------+
|summary|          col1    |          col2    |   col3           |
+-------+------------------+------------------+------------------+
|  count|          17547479|          17547479|          17547479|
|   mean|2.0946498354549963| 1.474746257282603|1943.9881619448768|
| stddev|1.7921560893864912|1.2898177241581452| 40126.73218327477|
|    min|               0.0|               0.0|               0.0|
|    max|              99.0|              60.0|       1.6240624E8|
+-------+------------------+------------------+------------------+

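The count above is the number of records in the whole table. Is it possible to apply a column-specific filter when using describe()? Some columns contain blanks or other values that should be ignored when counting; for example, the number of records with a good value in col_1 is 549023.

Can we get a result like the following?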

+-------+------------------+------------------+------------------+
|summary|          col1    |          col2    |   col3           |
+-------+------------------+------------------+------------------+
|  count|            549023|            854049|          17547479|
|   mean|2.0946498354549963| 1.474746257282603|1943.9881619448768|
| stddev|1.7921560893864912|1.2898177241581452| 40126.73218327477|
|    min|               0.0|               0.0|               0.0|
|    max|              99.0|              60.0|       1.6240624E8|
+-------+------------------+------------------+------------------+

Comments:

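What is the condition for ignoring values in the summary statistics? As far as I know, describe() does not take null values into account, so it looks like your DataFrame does not contain any nulls?

Filter like col_1='NAME' and get the statistics for that column in the same describe() call.

You can do df.filter(df['col1'].isNotNull()).describe()

Answer 1: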

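You can use df.na.drop() to drop any rows that contain NaN or NULL values in specific columns (in PySpark, na is a property, so it is written without parentheses). For example,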

df.na.drop(subset=["col1"])

drops all rows where col1 is NaN or NULL. Finally, you can call describe() on the filtered DataFrame:

filtered_df = df.na.drop(subset=["col1"])
filtered_df.describe().show()
