Fuzzy Match columns of Different Dataframe

Posted 2023-05-08

【Title】: Fuzzy Match columns of Different Dataframe 【Posted】: 2019-02-14 19:34:24 【Question】:

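Background

I have 2 dataframes that share no common key on which they could be merged. Both dfs have a column containing an "entity name". One df contains more than 8000 entities and the other close to 2000.

Sample data: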

vendor_df=
     Name of Vendor                             City         State  ZIP
     FREDDIE LEES AMERICAN GOURMET SAUCE       St. Louis    MO     63101
     CITYARCHRIVER 2015 FOUNDATION             St. Louis    MO     63102
     GLAXOSMITHKLINE CONSUMER HEALTHCARE       St. Louis    MO     63102
     LACKEY SHEET METAL                        St. Louis    MO     63102

regulator_df = 
     Name of Entity                    Committies
     LACKEY SHEET METAL                 Private
     PRIMUS STERILIZER COMPANY LLC      Private  
     HELGET GAS PRODUCTS INC            Autonomous
     ORTHOQUEST LLC                     Governmant

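Problem statement:

I need to fuzzy match the entities in these two columns (Name of vendor & Name of Entity) and get a score. So, for example, I need to know whether the first value of dataframe 1 (vendor_df) matches any of the 2000 entities in dataframe 2 (regulator_df).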
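To make "score" concrete, here is a minimal sketch of a 0 to 100 similarity score between two names. It uses `difflib` from the standard library as a stand-in so that it runs without extra packages; it is not fuzzywuzzy's implementation, and `fuzz.ratio` / `fuzz.token_sort_ratio` may return somewhat different numbers. The function names `simple_score`, `normalize_tokens` and `token_sort_score` are hypothetical.

```python
import re
from difflib import SequenceMatcher

def simple_score(a, b):
    """Character-level similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b, autojunk=False).ratio())

def normalize_tokens(name):
    """Lowercase, replace non-alphanumerics with spaces, and sort the words."""
    words = re.sub(r"[^0-9a-z]+", " ", name.lower()).split()
    return " ".join(sorted(words))

def token_sort_score(a, b):
    """Same score, but insensitive to word order, case and punctuation."""
    return simple_score(normalize_tokens(a), normalize_tokens(b))

print(simple_score("LACKEY SHEET METAL", "LACKEY SHEET METAL"))       # identical strings score 100
print(token_sort_score("SHEET METAL, LACKEY", "Lackey Sheet Metal"))  # same words, so also 100
print(simple_score("LACKEY SHEET METAL", "ORTHOQUEST LLC"))           # unrelated names score low
```

The token-sorted variant is the one that tolerates reordered words, which is common in company names.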

Links I have already checked:

fuzzy match between 2 columns (Python)

create new column in dataframe using fuzzywuzzy

Apply fuzzy matching across a dataframe column and save results in a new column

Code

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

vendor_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Vendors_Sheet.xlsx', sheet_name=0)

regulator_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Regulated_Vendors_Sheet.xlsx', sheet_name=0)

compare = pd.MultiIndex.from_product([vendor_df['Name of vendor'],
                                      regulator_df['Name of Entity']]).to_series()


def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

#compare.apply(metrics) -- Either this works or the below line

result = compare.apply(metrics).unstack().idxmax().unstack(0)

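Problem with the above code:

The code works when both dataframes are small, but when I feed it the full datasets it takes a very long time. The code above is taken from the third link.

Is there any solution that does the same thing faster, or that can handle large datasets?

Update 1

Can the above code be made faster if we pass in or hard-code a score, say 80, so that the series/dataframe is filtered to only the pairs with a fuzzy score > 80?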
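On the threshold idea: a cutoff by itself only filters the output, because every pair still has to be scored first. It can save time when a cheap upper bound on the score lets you skip pairs that cannot reach the cutoff. The sketch below illustrates this with `difflib` from the standard library (a stand-in, not fuzzywuzzy), whose `real_quick_ratio()` and `quick_ratio()` are documented upper bounds on `ratio()`. It assumes the two name columns have been pulled into plain lists; `best_match` is a hypothetical helper name.

```python
from difflib import SequenceMatcher

def best_match(name, candidates, cutoff=80):
    """Return (candidate, score) for the best candidate scoring above cutoff, else None."""
    best = None
    matcher = SequenceMatcher(autojunk=False)
    # SequenceMatcher caches its analysis of the second sequence, so set it once.
    matcher.set_seq2(name)
    for cand in candidates:
        matcher.set_seq1(cand)
        # The score to beat: the cutoff at first, then the best score found so far.
        to_beat = best[1] if best else cutoff
        # Both quick ratios are upper bounds on ratio(); if a bound cannot
        # exceed the score to beat, the full comparison is skipped.
        if matcher.real_quick_ratio() * 100 <= to_beat:
            continue
        if matcher.quick_ratio() * 100 <= to_beat:
            continue
        score = matcher.ratio() * 100
        if score > to_beat:
            best = (cand, score)
    return best

entities = ["PRIMUS STERILIZER COMPANY LLC", "LACKEY SHEET METAL", "ORTHOQUEST LLC"]
print(best_match("LACKEY SHEET METAL", entities))
print(best_match("FREDDIE LEES AMERICAN GOURMET SAUCE", entities))
```

With lists built from the dataframes, for example `vendor_df['Name of vendor'].tolist()`, the helper would be called once per vendor. How much work the bounds save depends on the data, so this is worth timing on a sample before relying on it.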

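【Comments】:

I ran into the same problem. Here you run compare.apply(metrics) twice, and applying both ratio and token takes a long time, so you may be better off commenting out the second-to-last line.

Actually I have tried both approaches... both of them take forever for me.

You should try using multiprocessing or threading.

【Answer 1】: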

The following solution is faster than what I posted, but if anyone has a faster approach, please share it:

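matched_vendors = []

# DataFrame.get_value() was removed in pandas 1.0; .at[] is the scalar accessor that replaces it.
for row in vendor_df.index:
    vendor_name = vendor_df.at[row, "Name of vendor"]
    for idx in regulator_df.index:
        regulated_vendor_name = regulator_df.at[idx, "Name of Entity"]
        # partial_ratio scores the best-matching substring on a 0-100 scale
        matched_token = fuzz.partial_ratio(vendor_name, regulated_vendor_name)
        # keep only the pairs scoring above the threshold of 80
        if matched_token > 80:
            matched_vendors.append([vendor_name, regulated_vendor_name, matched_token])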
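One caveat for a loop like this: the fuzzy scorers expect strings, while pandas typically reads empty Excel cells as NaN, which is a float, and calling `len()` on a float raises a TypeError. A minimal sketch of a guard that drops missing values and coerces the rest to strings before matching; `clean_names` is a hypothetical helper, and it assumes the column has been converted to a plain list.

```python
import math

def clean_names(values):
    """Drop missing values and return the rest as stripped, non-empty strings."""
    cleaned = []
    for value in values:
        # NaN is a float, so it has to be detected before calling str().
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        text = str(value).strip()
        if text:
            cleaned.append(text)
    return cleaned

print(clean_names(["LACKEY SHEET METAL", float("nan"), None, "  ORTHOQUEST LLC ", ""]))
```

Staying in pandas, `vendor_df["Name of vendor"].dropna().astype(str)` should have a similar effect.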

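【Comments】:

Aarwal, I am getting AttributeError: 'collections.OrderedDict' object has no attribute 'index'... any thoughts?

You need to convert your dict values to a list.

Thanks. So my run time is at 24 hours and still going. Here is my second question, any help please: ***.com/questions/66856883/…

I posted a link on your question... check it out... I implemented it and it works great!

I get an error on line 7, "matched_token=fuzz.partial_ratio(vendor_name,regulated_vendor_name)": object of type 'float' has no len(). I don't have any float objects in my dataframe, only string objects. Do you know what the problem is?

【Answer 2】: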

In my case I also only needed to look for scores above 80, so I modified your code for my use case. Hope it helps.

compare = compare.apply(metrics)
compare_80=compare[(compare['ratio'] >80) & (compare['token'] >80)]

【Answer 3】:

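I have implemented the code in Python using parallel processing, which is much faster than serial computation. In addition, the computations run in parallel keep only the cases where the fuzzy metric score exceeds a threshold. The code is at the following link: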

https://github.com/ankitcoder123/Important-Python-Codes/blob/main/Faster%20Fuzzy%20Match%20between%20two%20columns/Fuzzy_match.py
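The linked script is not reproduced here. As a rough, independent sketch of the same idea (score pairs in parallel and keep only those above a threshold), the block below uses only the standard library: `concurrent.futures` for the workers and `difflib` as a stand-in scorer, rather than the joblib and fuzzywuzzy the answer lists. The names `matches_for` and `parallel_matches` are hypothetical, and the inputs are assumed to be plain lists of strings.

```python
from difflib import SequenceMatcher
from functools import partial

def matches_for(vendor, entities, cutoff):
    """Score one vendor against every entity and keep the pairs above cutoff."""
    hits = []
    for entity in entities:
        score = 100 * SequenceMatcher(None, vendor, entity, autojunk=False).ratio()
        if score > cutoff:
            hits.append((vendor, entity, score))
    return hits

def parallel_matches(vendors, entities, executor, cutoff=80):
    """Spread the vendors over the executor's workers and flatten the results."""
    worker = partial(matches_for, entities=entities, cutoff=cutoff)
    # map() preserves input order; chunksize batches several vendors per task
    # to reduce the overhead of passing work between processes.
    per_vendor = executor.map(worker, vendors, chunksize=64)
    return [hit for hits in per_vendor for hit in hits]

# Usage (keep it under an `if __name__ == "__main__":` guard, which process
# pools need on some platforms):
#     from concurrent.futures import ProcessPoolExecutor
#     with ProcessPoolExecutor() as pool:
#         hits = parallel_matches(vendors, entities, pool)
```

Processes rather than threads are suggested here because the scoring is CPU-bound pure Python, so threads would mostly take turns instead of running at the same time.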

Version compatibility:

pandas version :: 1.1.5 ,
numpy version :: 1.19.5,
fuzzywuzzy version :: 1.1.0 ,
joblib version :: 0.18.0

Explanation of the fuzzywuzzy metrics: link text

