Force consistent conversion of null to NaN when using toPandas

Posted 2023-04-15

【Title】Force consistent conversion of null to nan when using toPandas 【Posted】2020-05-20 09:54:54 【Question】:

The toPandas method in pyspark is inconsistent in how it handles null values in numeric columns. Is there a way to make it more consistent?

An example

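sc is the Spark entry point used to build the DataFrame (it needs a createDataFrame method, so it has to be a SparkSession or SQLContext rather than a bare SparkContext). The Spark version is 2.3.2. I am not sure how to include notebook results, so I only describe the output in comments. The example is simple enough to run in a notebook yourself.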

import numpy as np
from pyspark.sql.functions import col

sparkTest = sc.createDataFrame(
    [
        (1,    1   ),
        (2,    None),
        (None, None),
    ],
    ['a', 'b']
)
sparkTest.show() # all None values are neatly converted to null

pdTest1 = sparkTest.toPandas()
pdTest1 # all None values are NaN
np.isnan(pdTest1['b']) # this is a Series of dtype bool

pdTest2 = sparkTest.filter(col('b').isNull()).toPandas()
pdTest2 # the null value in column a is still NaN, but the two nulls in column b are now None
np.isnan(pdTest2['b']) # this throws an error

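This is obviously a problem when programming, because there is no way to predict in advance whether a column will be entirely null.

As an aside, I wanted to report this as an issue, but I am not sure where. The github page does not seem to have an issues section?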

【Comments】:

【Answer 1】:

np.isnan can be applied to NumPy arrays of a native dtype (such as np.float64), but it raises a TypeError when applied to object arrays:

pdTest1['b']
0    1.0
1    NaN
2    NaN
Name: b, dtype: float64

pdTest2['b']
0    None
1    None
Name: b, dtype: object
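
The dtype difference can also be reproduced without a Spark cluster. The following is a minimal sketch that assumes only pandas and NumPy are installed; float_col and object_col are illustrative stand-ins for pdTest1['b'] and pdTest2['b'], not objects from the original post.

```python
import numpy as np
import pandas as pd

# Stand-in for pdTest1['b']: mixed values, stored as float64 with NaN.
float_col = pd.Series([1.0, np.nan, np.nan], name="b")

# Stand-in for pdTest2['b']: only nulls, stored as Python None in an object column.
object_col = pd.Series([None, None], dtype=object, name="b")

# np.isnan is defined for native float dtypes.
float_result = np.isnan(float_col)
print(float_result.tolist())

# np.isnan is not defined for object dtype, so it raises TypeError.
try:
    np.isnan(object_col)
    raised_type_error = False
except TypeError as exc:
    raised_type_error = True
    print("TypeError:", exc)
```

Checking the dtype of a column before calling np.isnan is therefore one way to see which of the two cases applies.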

Since you already have pandas, you can use pandas.isnull instead:

import pandas as pd


pd.isnull(pdTest1['b'])
0    False
1     True
2     True
Name: b, dtype: bool


pd.isnull(pdTest2['b'])
0    True
1    True
Name: b, dtype: bool

This behaves consistently for both np.nan and None.

Alternatively (where possible), you can cast your pdTest2['b'] array to one of the native numpy types (such as np.float64) so that np.isnan works, for example:

import numpy as np
import pyspark.sql.functions as f

pdTest2 = sparkTest.filter(f.col('b').isNull()).toPandas()
np.isnan(pdTest2['b'].astype(np.float64))
0    True
1    True
Name: b, dtype: bool
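
The cast above can be applied to several columns at once. The following is a minimal sketch, not part of pyspark or pandas: coerce_nulls_to_nan and its numeric_columns parameter are hypothetical names, and the small DataFrame is a hand-built stand-in for an all-null toPandas() result so that the snippet runs without Spark.

```python
import numpy as np
import pandas as pd


def coerce_nulls_to_nan(pdf, numeric_columns):
    """Return a copy of pdf with the listed columns cast to float64.

    Hypothetical helper: casting an object column of None values to
    float64 turns every None into NaN, as in the astype example above.
    """
    out = pdf.copy()
    for name in numeric_columns:
        out[name] = out[name].astype(np.float64)
    return out


# Stand-in for pdTest2: column a is already float64, column b holds None objects.
pd_test2 = pd.DataFrame(
    {
        "a": [2.0, np.nan],
        "b": pd.Series([None, None], dtype=object),
    }
)

fixed = coerce_nulls_to_nan(pd_test2, ["a", "b"])
print(fixed.dtypes)
print(np.isnan(fixed["b"]))
```

In a real job the column list would have to come from somewhere; one option, stated here as an assumption, is to read it from the Spark DataFrame's dtypes before calling toPandas, so that the choice does not depend on the data that happens to be returned.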

【Comments】:

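This definitely helps, but strictly speaking it does not answer my question of how to make the conversion more consistent. It only sidesteps the problem in this particular case. The inconsistent conversion may still cause trouble elsewhere (in ways I cannot think of right now).

I don't think you can. This is expected behaviour, because numpy only implements np.isnan for native types, while pyspark has no concept of NaN and instead translates Python None to JVM null. Otherwise you would need to check for NaN, catch the exception, and then check for None, because you are checking two different types.

@Willem I have added some more flavour with another solution: casting your second array to np.float64 so that np.isnan works. Hope this helps and answers the question more clearly :)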
