Speed up matching strings python

Posted 2023-03-11


【Title】 Speed up matching strings python 【Posted】 2020-05-17 03:03:39 【Question】:

I have 2 different data frames and I am trying to match a string column (the names).

Below are just some samples of the DFs

df1 (127000,3)
Code     Name     PostalCode
150      Maarc    47111
250      Kirc     41111
170      Moic     42111
140      Nirc     44111
550      Lacter   47111

df2 (38000,3)
Code     NAME     POSTAL_CODE
150      Marc     47111
250      Kikc     41111
170      Mosc     49111
140      NiKc     44111
550      Lacter   47111

The aim is to create another DF3 like the one below

Code     NAME    Best Match   Score
150      Marc    Maarc        0.9
250      Kikc    Kirc         0.9

The following code gives the expected output

import difflib
from functools import partial
f = partial(difflib.get_close_matches, possibilities= df1['Name'].tolist(), n=1)

matches = df2['NAME'].map(f).str[0].fillna('')

scores = [difflib.SequenceMatcher(None, x, y).ratio()
          for x, y in zip(matches, df2['NAME'])]

df3 = df2.assign(best=matches, score=scores)
df3.sort_values(by='score')
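
To make the two difflib calls above concrete, here is a minimal standalone sketch on one name from the sample tables. It uses only the standard library; the variable names (`candidates`, `word`, `best`, `score`) are illustrative and not part of the original code.

```python
import difflib

# Sample names copied from the df1 table above
candidates = ["Maarc", "Kirc", "Moic", "Nirc", "Lacter"]
word = "Marc"

# get_close_matches ranks candidates by SequenceMatcher ratio and drops
# anything below cutoff (0.6 by default); n=1 keeps only the top hit
best = difflib.get_close_matches(word, candidates, n=1)

# ratio() is 2 * M / T: M matched characters, T total characters in both strings
score = difflib.SequenceMatcher(None, best[0], word).ratio()
print(best[0], round(score, 2))  # prints: Maarc 0.89
```

Every call to `get_close_matches` scans the whole candidate list, so mapping it over all of df2 compares each df2 name against every df1 name.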

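Problem

Matching these strings for only 2 rows takes about 30 seconds. The task has to be done for 1K rows, which will take hours!

Question

How can I speed up the code? I was thinking of something like fetchall?

Edit

I even tried the fuzzywuzzy library, which takes longer than difflib, with the following code: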

from fuzzywuzzy import fuzz

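def get_fuzz(df, w):
    # Score every candidate name in df against the query string w
    s = df['Name'].apply(lambda y: fuzz.token_set_ratio(y, w))
    # idxmax returns an index label, so look the row up with .loc
    idx = s.idxmax()
    return {'Name': df['Name'].loc[idx], 'CODE': df['Code'].loc[idx], 'Value': s.max()}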

df2['NAME'].apply(lambda x: get_fuzz(df1, x))

df2 = df2.assign(search= df2['NAME'].apply(lambda x: get_fuzz(df1, x)))

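【Discussion】:

Unfortunately, I don't think difflib is the right tool for this task; it is not fast. You could try building a distance matrix with the sklearn module or something similar. The Levenshtein distance may be worth a look for your case.

【Answer 1】:

So I was able to speed up the matching step by using the postal code column as a discriminant. Computation time went from 1 h 40 min to 7 min.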

Below are just some samples of the DFs

df1 (127000,3)
Code     Name     PostalCode
150      Maarc    47111
250      Kirc     41111
170      Moic     42111
140      Nirc     44111
550      Lacter   47111

df2 (38000,3)
Code     NAME     POSTAL_CODE
150      Marc     47111
250      Kikc     41111
170      Mosc     49111
140      NiKc     44111
550      Lacter   47111

Below is the code that matches the Name columns and retrieves the name with the best score

%%time
import difflib
from functools import partial

import numpy as np

def difflib_match(df1, df2, set_nan=True):

    # Initialise the result columns (the set_nan parameter is unused)
    df2['best'] = None
    df2['score'] = np.nan

    # Unique postal codes present in df2
    postal_codes = df2['POSTAL_CODE'].unique()

    # Match names only between rows that share the same postal code
    for m, code in enumerate(postal_codes):

        # Print progress every 100 postal codes
        if m % 100 == 0:
            print(m, 'of', len(postal_codes))

        df1_sub = df1[df1['PostalCode'] == code]
        df2_sub = df2[df2['POSTAL_CODE'] == code]

        # Closest df1 name (if any) for a given df2 name
        f = partial(difflib.get_close_matches, possibilities=df1_sub['Name'].tolist(), n=1)

        # Best match per df2 name; '' when nothing passes the cutoff
        matches = df2_sub['NAME'].map(f).str[0].fillna('')

        # Similarity ratio of each match
        scores = [difflib.SequenceMatcher(None, x, y).ratio()
                  for x, y in zip(matches, df2_sub['NAME'])]

        # Write the results back by row index, so identical names under
        # other postal codes are not overwritten
        df2.loc[df2_sub.index, 'best'] = matches
        df2.loc[df2_sub.index, 'score'] = scores

    return df2

# Apply Function
df_diff= difflib_match(df1, df2)

# Display DF
print('Shape: ', df_diff.shape)
df_diff.head()
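
The same idea (only compare names that share a postal code) can also be sketched without pandas. This is a minimal illustration, not the code used for the timings above: the row tuples mirror the sample tables, and `match_within_blocks`, `reference_rows` and `query_rows` are hypothetical names.

```python
import difflib
from collections import defaultdict

# Hypothetical sample rows (code, name, postal code), mirroring the tables above
reference_rows = [(150, "Maarc", 47111), (250, "Kirc", 41111), (170, "Moic", 42111),
                  (140, "Nirc", 44111), (550, "Lacter", 47111)]
query_rows = [(150, "Marc", 47111), (250, "Kikc", 41111), (170, "Mosc", 49111),
              (140, "NiKc", 44111), (550, "Lacter", 47111)]


def match_within_blocks(reference_rows, query_rows, cutoff=0.6):
    # Group the reference names by postal code once, up front
    blocks = defaultdict(list)
    for _, name, postal in reference_rows:
        blocks[postal].append(name)

    results = []
    for code, name, postal in query_rows:
        # Only names with the same postal code are candidates
        candidates = blocks.get(postal, [])
        best = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
        match = best[0] if best else ""
        score = difflib.SequenceMatcher(None, match, name).ratio()
        results.append((code, name, match, score))
    return results


results = match_within_blocks(reference_rows, query_rows)
for row in results:
    print(row)
```

Each query name is compared only against the names in its own block, so the work per row depends on the block size rather than on the full 127,000 names. The trade-off is that a name whose postal code differs between the two frames (such as Mosc in the sample) gets no match at all.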

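【Discussion】:

【Answer 2】:

The fastest way I can think of to match strings is to use regex.

It is a search language designed for finding matches in strings.

You can see an example here: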

import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

# x is a re.Match object, which is truthy (re.search returns None on no match)

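*Taken from: https://www.w3schools.com/python/python_regex.asp

Since I don't know anything about DataFrames, I don't know how to implement regex in your code, but I hope the regex functions help you.

【Discussion】:

Why do you think regex would be faster than literal matching?

That is debatable, but it usually depends on the complexity of the match and on how well the regex is written, as the following links show: ***.com/questions/16638637/… blog.codinghorror.com/regex-performance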
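
One way to check the regex versus literal matching question on your own machine is a small timeit comparison. This is only a sketch under assumptions: the name list and the `needle` substring are made up for illustration, and timings will vary with the pattern and the data.

```python
import re
import timeit

# Made-up data: the sample names repeated to get a measurable workload
names = ["Maarc", "Kirc", "Moic", "Nirc", "Lacter"] * 1000
needle = "Lact"
pattern = re.compile(re.escape(needle))


def literal_hits():
    # Plain substring test
    return [n for n in names if needle in n]


def regex_hits():
    # Same test through a precompiled regex
    return [n for n in names if pattern.search(n)]


# Both approaches must select the same names before timing means anything
assert literal_hits() == regex_hits()

literal_time = timeit.timeit(literal_hits, number=20)
regex_time = timeit.timeit(regex_hits, number=20)
print(f"literal: {literal_time:.4f}s  regex: {regex_time:.4f}s")
```

Note that both variants do exact pattern matching. Neither produces a similarity score, so they would not find Maarc for Marc the way difflib does.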
