Spark DataFrame unable to compare null values

Posted 2023-04-15

【Original title】: Spark dataframe not able to Compare Null values
【Posted】: 2020-05-16 19:12:08

【Question】:

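Hi all, I have two DataFrames. I compare their values and, based on the comparison, assign a value to a column in a new DataFrame. Every scenario works except the comparison of null fields: if the value is null in both DataFrames, the result should be "Varified", but I get "Not Varified" instead. Below are my DataFrames, the code I am using, and the resulting DataFrame.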

    scala> df1.show()
    +---+-----+---+--------+------+-------+
    | id| name|age|lastname|  city|country|
    +---+-----+---+--------+------+-------+
    |  1|rohan| 26|  sharma|mumbai|  india|
    |  2|rohan| 26|  sharma|  null|  india|
    |  3|rohan| 26|    null|mumbai|  india|
    |  4|rohan| 26|  sharma|mumbai|  india|
    +---+-----+---+--------+------+-------+
    scala> df2.show()
    +----+------+-----+----------+------+---------+
    |o_id|o_name|o_age|o_lastname|o_city|o_country|
    +----+------+-----+----------+------+---------+
    |   1| rohan|   26|    sharma|mumbai|    india|
    |   2| rohan|   26|    sharma|  null|    india|
    |   3| rohan|   26|    sharma|mumbai|    india|
    |   4| rohan|   26|      null|mumbai|    india|
    +----+------+-----+----------+------+---------+

    val df3 = df1.join(df2, df1("id") === df2("o_id"))
    .withColumn("result", when(df1("name") === df2("o_name") && 
    df1("age") === df2("o_age") && 
    df1("lastname") === df2("o_lastname") && 
    df1("city") === df2("o_city")  &&
    df1("country") === df2("o_country"), "Varified")
    .otherwise("Not Varified")).show()

    +---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
    | id| name|age|lastname|  city|country|o_id|o_name|o_age|o_lastname|o_city|o_country|      result|
    +---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
    |  1|rohan| 26|  sharma|mumbai|  india|   1| rohan|   26|    sharma|mumbai|    india|    Varified|
    |  2|rohan| 26|  sharma|  null|  india|   2| rohan|   26|    sharma|  null|    india|Not Varified|
    |  3|rohan| 26|    null|mumbai|  india|   3| rohan|   26|    sharma|mumbai|    india|Not Varified|
    |  4|rohan| 26|  sharma|mumbai|  india|   4| rohan|   26|      null|mumbai|    india|Not Varified|
    +---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+

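I expect id '2' to be shown as 'Varified' as well. But because city is null in both DataFrames, it is shown as 'Not Varified'. Can someone guide me on how to modify my df3 query so that it also handles nulls and id '2' shows 'Varified' in the result column?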

【Comments】:

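In general, null never equals null (in both SQL and Spark): the comparison evaluates to null, not true. There is, however, a null-safe equality operator, <=>, that may help you. For details see ***.com/questions/41728762/…

Possible duplicate of Including null values in an Apache Spark Join

【Answer 1】: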

Use <=> instead of ===

    // Keep the DataFrame in df3; calling .show() inside the assignment would make df3 a Unit.
    val df3 = df1.join(df2, df1("id") === df2("o_id"))
      .withColumn("result", when(df1("name") <=> df2("o_name") &&
        df1("age") <=> df2("o_age") &&
        df1("lastname") <=> df2("o_lastname") &&
        df1("city") <=> df2("o_city") &&
        df1("country") <=> df2("o_country"), "Varified")
      .otherwise("Not Varified"))
    df3.show()

spark.sql("SELECT NULL AS city1, NULL AS city2").select($"city1" <=> $"city2").show

Result:

+-----------------+
|(city1 <=> city2)|
+-----------------+
|             true|
+-----------------+

【Answer 2】:

In your when + otherwise expression, either use the <=> operator, or add an || branch that checks .isNull for the lastname and city columns.

null = null returns null rather than true, which is why those rows fail to match.

spark.sql("select null=null").show()
//+-------------+
//|(NULL = NULL)|
//+-------------+
//|         null|
//+-------------+
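
To make the consequence for `when(...).otherwise(...)` concrete, here is a minimal plain-Scala sketch of SQL three-valued logic. It does not use Spark at all: `None` stands in for SQL NULL, and every name in it (`ThreeValuedLogic`, `sqlEq`, `nullSafeEq`, `and`, `verdict`) is made up for illustration only.

```scala
// A plain-Scala model of SQL three-valued logic (no Spark required).
// None stands in for SQL NULL; all names here are illustrative.
object ThreeValuedLogic {
  type SqlBool = Option[Boolean]

  // Models `=` / `===`: the result is NULL if either operand is NULL.
  def sqlEq[A](a: Option[A], b: Option[A]): SqlBool =
    for (x <- a; y <- b) yield x == y

  // Models `<=>`: never NULL; two NULLs compare as equal.
  def nullSafeEq[A](a: Option[A], b: Option[A]): Boolean = a == b

  // Models SQL AND: false wins over NULL, NULL wins over true.
  def and(l: SqlBool, r: SqlBool): SqlBool = (l, r) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  // Models when(cond, ...).otherwise(...): only a true condition matches.
  def verdict(cond: SqlBool): String =
    if (cond.contains(true)) "Varified" else "Not Varified"

  def main(args: Array[String]): Unit = {
    val city1: Option[String] = None
    val city2: Option[String] = None
    val othersMatch: SqlBool = Some(true) // all other columns are equal

    println(verdict(and(othersMatch, sqlEq(city1, city2))))            // Not Varified
    println(verdict(and(othersMatch, Some(nullSafeEq(city1, city2))))) // Varified
  }
}
```

With `===`, the single null comparison turns the whole `&&` chain into null, so the row falls through to `otherwise`; with `<=>` the chain stays a plain true or false.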

Using <=> and isnull():

spark.sql("select null<=>null, isnull(null) = isnull(null)").show()
//+---------------+---------------------------------+
//|(NULL <=> NULL)|((NULL IS NULL) = (NULL IS NULL))|
//+---------------+---------------------------------+
//|           true|                             true|
//+---------------+---------------------------------+

Example:

// A pair counts as equal when the values match, or when both sides are null.
// (isNull === isNull would also be true for two different non-null values.)
df1.join(df2, df1("id") === df2("o_id")).
  withColumn("result", when(
    (df1("name") === df2("o_name")) && (df1("age") === df2("o_age")) &&
    (df1("lastname") === df2("o_lastname") || (df1("lastname").isNull && df2("o_lastname").isNull)) &&
    (df1("city") === df2("o_city") || (df1("city").isNull && df2("o_city").isNull)) &&
    (df1("country") === df2("o_country")), "Varified").otherwise("Not Varified")).
  show()

// or using <=>
df1.join(df2, df1("id") === df2("o_id")).
  withColumn("result", when(
    (df1("name") === df2("o_name")) && (df1("age") === df2("o_age")) &&
    (df1("lastname") <=> df2("o_lastname")) && (df1("city") <=> df2("o_city")) &&
    (df1("country") === df2("o_country")), "Varified").otherwise("Not Varified")).
  show()

//+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
//| id| name|age|lastname|  city|country|o_id|o_name|o_age|o_lastname|o_city|o_country|      result|
//+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
//|  1|rohan| 26|  sharma|mumbai|  india|   1| rohan|   26|    sharma|mumbai|    india|    Varified|
//|  2|rohan| 26|  sharma|  null|  india|   2| rohan|   26|    sharma|  null|    india|    Varified|
//|  3|rohan| 26|    null|mumbai|  india|   3| rohan|   26|    sharma|mumbai|    india|Not Varified|
//|  4|rohan| 26|  sharma|mumbai|  india|   4| rohan|   26|      null|mumbai|    india|Not Varified|
//+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
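
When many columns are involved, the null-safe condition can also be built from a list of column names instead of being written out by hand. The following is only a sketch under some assumptions: it recreates the question's sample data, assumes every column in `df2` is named `o_` plus the `df1` name, and uses helper names (`PersonRow`, `columns`, `allEqual`) that are not from the original post. It is written as a standalone script; inside spark-shell the two session lines can be skipped, since `spark` and its implicits are already in scope.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

val spark = SparkSession.builder().master("local[*]").appName("null-safe-compare").getOrCreate()
import spark.implicits._

// Option[String] fields become nullable columns; None is stored as null.
type PersonRow = (Int, String, Int, Option[String], Option[String], String)

val df1 = Seq[PersonRow](
  (1, "rohan", 26, Some("sharma"), Some("mumbai"), "india"),
  (2, "rohan", 26, Some("sharma"), None, "india"),
  (3, "rohan", 26, None, Some("mumbai"), "india"),
  (4, "rohan", 26, Some("sharma"), Some("mumbai"), "india")
).toDF("id", "name", "age", "lastname", "city", "country")

val df2 = Seq[PersonRow](
  (1, "rohan", 26, Some("sharma"), Some("mumbai"), "india"),
  (2, "rohan", 26, Some("sharma"), None, "india"),
  (3, "rohan", 26, Some("sharma"), Some("mumbai"), "india"),
  (4, "rohan", 26, None, Some("mumbai"), "india")
).toDF("o_id", "o_name", "o_age", "o_lastname", "o_city", "o_country")

// One null-safe comparison per column pair, combined with &&.
val columns = Seq("name", "age", "lastname", "city", "country")
val allEqual = columns.map(c => df1(c) <=> df2(s"o_$c")).reduce(_ && _)

val df3 = df1.join(df2, df1("id") === df2("o_id"))
  .withColumn("result", when(allEqual, "Varified").otherwise("Not Varified"))

df3.select("id", "result").orderBy("id").show()
```

On this data, ids 1 and 2 come out as "Varified" and ids 3 and 4 as "Not Varified", matching the table above. `Column.eqNullSafe` is the named-method equivalent of `<=>`, for callers who prefer to avoid the symbolic operator.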
