Python string matching with Spark dataframe

Posted 2023-02-22


Asked: 2022-01-07 10:30:00

Question:

I have a Spark dataframe:

id  | city  | fruit | quantity
-------------------------
0   |  CA   | apple | 300
1   |  CA   | appel | 100
2   |  CA   | orange| 20
3   |  CA   | berry | 10

I want to get the rows where the fruit is apple or orange, so I use Spark SQL:

SELECT * FROM table WHERE fruit LIKE '%apple%' OR fruit LIKE '%orange%';

This returns:

id  | city  | fruit | quantity
-------------------------
0   |  CA   | apple | 300
2   |  CA   | orange| 20

But it should return:

id  | city  | fruit | quantity
-------------------------
0   |  CA   | apple | 300
1   |  CA   | appel | 100
2   |  CA   | orange| 20

because row 1 is just a misspelling.

So I plan to use fuzzywuzzy for string matching. I know that:

from fuzzywuzzy import fuzz

print(fuzz.partial_ratio('apple', 'apple'))  # 100
print(fuzz.partial_ratio('apple', 'appel'))  # 83

But I am not sure how to apply this to a column of the dataframe to get the relevant rows.

Comments:

You can register the fuzzywuzzy function as a udf.

Answer 1:

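Since you want to use fuzzy matching as a filter, you first have to decide on a threshold for how similar a value must be to count as a match.

Approach 1

With your fuzzywuzzy import, this could be 80 for the purposes of this demo (adjust it to your needs). You can then implement a udf that applies the fuzzy-matching code you imported, e.g.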

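from pyspark.sql import functions as F
from pyspark.sql import types as T

# The decorator turns the function into a udf that returns a boolean column.
@F.udf(returnType=T.BooleanType())
def is_fuzzy_match(field_value, search_value, threshold=80):
    # Imported inside the function so the import happens on the executors.
    from fuzzywuzzy import fuzz
    return fuzz.partial_ratio(field_value, search_value) > threshold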

Then apply your udf as a filter on your dataframe

df = (
    df.where(
          is_fuzzy_match(F.col("fruit"),F.lit("apple"))  | 
          is_fuzzy_match(F.col("fruit"),F.lit("orange"))  
    )
)
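To get a feel for what a partial-ratio score measures before wiring it into a udf, here is a minimal stdlib-only sketch. It is not fuzzywuzzy's implementation: `partial_ratio_sketch` is a hypothetical helper that slides the shorter string across the longer one and scores each window with `difflib.SequenceMatcher`, so its scores can differ from `fuzz.partial_ratio`.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best 0-100 similarity between the shorter string and any equally
    long window of the longer string (hypothetical helper, not
    fuzzywuzzy's algorithm)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    best = 0.0
    # Slide a window of len(shorter) characters across the longer string.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


print(partial_ratio_sketch("apple", "kashmir apple"))  # 100: "apple" occurs verbatim
print(partial_ratio_sketch("apple", "berry"))
```

A verbatim substring scores 100 no matter what surrounds it, which is the property that makes a partial ratio attractive for multi-word values.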

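Approach 2 (recommended)

However, executing udfs on Spark can be expensive, and Spark already provides a levenshtein function that is useful here. You can start by reading more about how levenshtein distance accomplishes fuzzy matching.

With this approach, your code would look like this, using a threshold of less than 3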

from pyspark.sql import functions as F

df = df.where(
    (
        F.levenshtein(
            F.col("fruit"),
            F.lit("apple")
        ) < 3
     ) |
     (
        F.levenshtein(
            F.col("fruit"),
            F.lit("orange")
        ) < 3
     ) 
)

df.show()

+---+----+------+--------+
| id|city| fruit|quantity|
+---+----+------+--------+
|  0|  CA| apple|     300|
|  1|  CA| appel|     100|
|  2|  CA|orange|      20|
+---+----+------+--------+
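Since the question started from Spark SQL, the same filter can also be written as a SQL predicate, because `levenshtein` is available as a built-in Spark SQL function. The sketch below only builds the predicate as a string. `fuzzy_where_clause` and its parameters are hypothetical names, and it deliberately rejects anything but plain letters and spaces instead of attempting to escape quotes.

```python
import re
from typing import Iterable


def fuzzy_where_clause(column: str, terms: Iterable[str], max_distance: int = 3) -> str:
    """Build "levenshtein(col, 'term') < n OR ..." for a Spark SQL WHERE."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", column):
        raise ValueError(f"unsupported column name: {column!r}")
    parts = []
    for term in terms:
        # Only plain letters and spaces are accepted, so no quoting is needed.
        if not re.fullmatch(r"[A-Za-z ]+", term):
            raise ValueError(f"unsupported search term: {term!r}")
        parts.append(f"levenshtein({column}, '{term}') < {int(max_distance)}")
    if not parts:
        raise ValueError("at least one search term is required")
    return " OR ".join(parts)


print(fuzzy_where_clause("fruit", ["apple", "orange"]))
# levenshtein(fruit, 'apple') < 3 OR levenshtein(fruit, 'orange') < 3
```

The resulting string can be passed to `df.where(...)`, which accepts a SQL expression, or appended to a `SELECT ... WHERE` statement run through `spark.sql(...)` against a registered view.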

For debugging purposes, the levenshtein results are included below

df.withColumn("diff",
    F.levenshtein(
        F.col("fruit"),
        F.lit("apple")
    )
).show()

+---+----+------+--------+----+
| id|city| fruit|quantity|diff|
+---+----+------+--------+----+
|  0|  CA| apple|     300|   0|
|  1|  CA| appel|     100|   2|
|  2|  CA|orange|      20|   5|
|  3|  CA| berry|      10|   5|
+---+----+------+--------+----+
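To make the diff column concrete, here is a plain-Python sketch of the Levenshtein (edit) distance, computed row by row with the classic dynamic-programming recurrence. `edit_distance` is a hypothetical helper for illustration only; in Spark you would still use the built-in `F.levenshtein`.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `source` into `target`."""
    # previous[j] = distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


for fruit in ["apple", "appel", "orange", "berry"]:
    print(fruit, edit_distance(fruit, "apple"))
```

For these four fruits the sketch yields 0, 2, 5 and 5, the same values as the diff column, which is why a threshold of less than 3 keeps appel for the apple search while excluding berry.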

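Update 1

In response to the additional sample data the OP provided in the comments:

If I have a fruit like kashmir apple and want it to match apple

Approach 3

You can try the following approach and adjust the thresholds as needed.

Since you want to match a possibly misspelled fruit anywhere in the text, you can try applying levenshtein to each part of the full fruit name. The functions below (not udfs, just plain functions that make the task easier to read and apply) implement this approach. matches_fruit_ratio estimates how closely a value matches the search term, while matches_fruit takes the maximum matches_fruit_ratio over the space-separated pieces of each fruit name and compares it with the threshold.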


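from pyspark.sql import functions as F

def matches_fruit_ratio(fruit_column, fruit_search, threshold=0.3):
    # Similarity ratio: (length - levenshtein distance) / length.
    # Note: threshold is accepted but not used in this function.
    return (F.length(fruit_column) - F.levenshtein(
        fruit_column,
        F.lit(fruit_search)
    )) / F.length(fruit_column)

def matches_fruit(fruit_column, fruit_search, threshold=0.6):
    # Split the name on spaces, score each piece, and keep the best score.
    # F.transform with a Python lambda requires Spark 3.1 or later.
    return F.array_max(F.transform(
        F.split(fruit_column, " "),
        lambda fruit_piece: matches_fruit_ratio(fruit_piece, fruit_search)
    )) >= threshold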

They can be used like this:

df = df.where(
    
    matches_fruit(
        F.col("fruit"),
        "apple"
    ) | matches_fruit(
        F.col("fruit"),
        "orange"
    )
)
df.show()

+---+----+-------------+--------+
| id|city|        fruit|quantity|
+---+----+-------------+--------+
|  0|  CA|        apple|     300|
|  1|  CA|        appel|     100|
|  2|  CA|       orange|      20|
|  4|  CA|  apply berry|       3|
|  5|  CA|  apple berry|       1|
|  6|  CA|kashmir apple|       5|
|  7|  CA|kashmir appel|       8|
+---+----+-------------+--------+

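For debugging purposes, I have added extra sample data and output columns for the different components of each function, which also demonstrates how these functions can be used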

df.withColumn("length",
    F.length(
        "fruit"
    )
).withColumn("levenshtein",
    F.levenshtein(
        F.col("fruit"),
        F.lit("apple")
    )
).withColumn("length - levenshtein",
    F.length(
        "fruit"
    ) - F.levenshtein(
        F.col("fruit"),
        F.lit("apple")
    )
).withColumn(
    "matches_fruit_ratio",
    matches_fruit_ratio(
        F.col("fruit"),
        "apple"
    )
).withColumn(
    "matches_fruit_values_before_threshold",
    F.array_max(F.transform(
        F.split("fruit"," "),
        lambda fruit_piece : matches_fruit_ratio(fruit_piece,"apple")
    ))
).withColumn(
    "matches_fruit",
    matches_fruit(
        F.col("fruit"),
        "apple"
    )
).show()

+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
| id|city|        fruit|quantity|length|levenshtein|length - levenshtein|matches_fruit_ratio|matches_fruit_values_before_threshold|matches_fruit|
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
|  0|  CA|        apple|     300|     5|          0|                   5|                1.0|                                  1.0|         true|
|  1|  CA|        appel|     100|     5|          2|                   3|                0.6|                                  0.6|         true|
|  2|  CA|       orange|      20|     6|          5|                   1|0.16666666666666666|                  0.16666666666666666|        false|
|  3|  CA|        berry|      10|     5|          5|                   0|                0.0|                                  0.0|        false|
|  4|  CA|  apply berry|       3|    11|          6|                   5|0.45454545454545453|                                  0.8|         true|
|  5|  CA|  apple berry|       1|    11|          6|                   5|0.45454545454545453|                                  1.0|         true|
|  6|  CA|kashmir apple|       5|    13|          8|                   5|0.38461538461538464|                                  1.0|         true|
|  7|  CA|kashmir appel|       8|    13|         10|                   3|0.23076923076923078|                                  0.6|         true|
+---+----+-------------+--------+------+-----------+--------------------+-------------------+-------------------------------------+-------------+
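The arithmetic behind those columns can be reproduced in plain Python, which may be handy for tuning the threshold without a Spark session. This is a sketch under assumptions: `edit_distance`, `piece_ratio` and `matches_fruit_py` are hypothetical stand-ins for `F.levenshtein`, `matches_fruit_ratio` and `matches_fruit`, and empty pieces (from repeated spaces) are simply skipped, which is a choice made here rather than the Spark behaviour.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a single rolling row."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        diagonal, row[0] = row[0], i
        for j, cb in enumerate(b, start=1):
            diagonal, row[j] = row[j], min(
                row[j] + 1,             # deletion
                row[j - 1] + 1,         # insertion
                diagonal + (ca != cb),  # substitution or match
            )
    return row[-1]


def piece_ratio(piece: str, search: str) -> float:
    """(length - distance) / length, mirroring matches_fruit_ratio."""
    return (len(piece) - edit_distance(piece, search)) / len(piece)


def matches_fruit_py(fruit: str, search: str, threshold: float = 0.6) -> bool:
    """True if any space-separated piece reaches the threshold."""
    pieces = [p for p in fruit.split(" ") if p]  # assumption: skip empty pieces
    return any(piece_ratio(p, search) >= threshold for p in pieces)


for name in ["appel", "orange", "apply berry", "kashmir appel"]:
    best = max(piece_ratio(p, "apple") for p in name.split(" "))
    print(name, round(best, 3), matches_fruit_py(name, "apple"))
```

Because the ratio is taken per piece, the extra word in kashmir appel no longer drags the score down: the piece appel alone reaches 0.6, which meets the default threshold.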

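Comments:

What if I have a fruit like kashmir apple and want it to match apple? If I use levenshtein distance, it shows the distance as 8, which is why I wanted to use fuzz.partial_ratio.

@JohnConstantine Thanks for sharing the additional example, which gives a better idea of your use case. You can try the udf mentioned in the answer, or implement partial_ratio in Spark as described here. I have updated the answer with another approach that may be more useful. Let me know if this works for you.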
