Slow performance when replacing strings in a pandas DataFrame using a dict


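The following code works but needs to run faster. The dictionary has ~25K keys and the DataFrame has ~3M rows. Is there a way to produce the same result with Python code that runs faster? (Without multiprocessing, it runs 8x slower.)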

import datetime
import multiprocessing

import numpy as np
import pandas as pd

miscdict={" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}

df=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})

def parse_text(data):
    for key, replacement in miscdict.items():
        data['q1'] = data['q1'].str.replace( key, replacement )
    return data

if __name__ == '__main__':
    t1_1 = datetime.datetime.now()
    p = multiprocessing.Pool(processes=8)
    split_dfs = np.array_split(df,8)
    pool_results = p.map(parse_text, split_dfs)
    p.close()
    p.join()
    # Stack the processed chunks back into a single DataFrame.
    df = pd.concat(pool_results, axis=0)
    t2_1 = datetime.datetime.now()
    print("done"+ str(t2_1-t1_1)) 
Answer

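I tested some of these. @A-Za-z's suggestion is a significant improvement, but it is possible to go faster still.

Edit: I re-ran the tests with the replacement dict and the DataFrame (and the precompiled regexes) built ahead of time, outside the timed code. The new timings are:

  • Original: 11.71 seconds
  • @A-Za-z: 4.72 seconds, a 60% improvement.
  • @piRSquared: 4.95 seconds, a 58% improvement.
  • Precompiled: 2.81 seconds, a 76% improvement.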
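The percentages above are the share of the original run time that each variant saves. A minimal sketch of that arithmetic, using the timings from the list (the helper name `improvement` is made up for illustration):

```python
# Timings in seconds, copied from the list above.
timings = {
    "original": 11.71,
    "@A-Za-z": 4.72,
    "@piRSquared": 4.95,
    "precompiled": 2.81,
}

baseline = timings["original"]


def improvement(seconds, baseline=baseline):
    """Fraction of the baseline run time that was saved (hypothetical helper)."""
    return (baseline - seconds) / baseline


for name, seconds in timings.items():
    # Format the saved fraction as a whole-number percentage.
    print(f"{name}: {seconds:.2f} s, {improvement(seconds):.0%} less time than the original")
```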

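The original results, where data generation and regex compilation were included in the timing:

"Testing your code I got 15 seconds, @A-Za-z's code gave 8-9 seconds, and my own solution brought it down to 6 seconds. It uses precompiled regular expressions. See the end of this answer."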
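The two sets of numbers differ only in whether the setup work is inside the timed function. A rough sketch of that distinction with `timeit`, using plain Python strings instead of a DataFrame; the function names here are invented for illustration and this is not the benchmark that produced the figures above:

```python
import re
import timeit

miscdict = {" isn't ": " is not ", " aren't ": " are not ",
            " wasn't ": " was not ", " snevada ": " Sierra Nevada "}
rows = ["beer is ok", "beer isn't ok", "beer wasn't available", " snevada is good"]


def build_patterns():
    # Setup step: compile every key once (escaped, since the keys are literals).
    return {re.compile(re.escape(k)): v for k, v in miscdict.items()}


def replace_all(patterns, texts):
    # Work step: apply every pattern to every string, one after another.
    out = []
    for text in texts:
        for pattern, replacement in patterns.items():
            text = pattern.sub(replacement, text)
        out.append(text)
    return out


def with_setup():
    # Setup is repeated and timed on every call.
    return replace_all(build_patterns(), rows)


patterns = build_patterns()  # done once, outside the timed function


def without_setup():
    # Only the replacement work is timed.
    return replace_all(patterns, rows)


t_with = timeit.timeit(with_setup, number=1000)
t_without = timeit.timeit(without_setup, number=1000)
print(f"setup timed: {t_with:.4f} s, setup excluded: {t_without:.4f} s")
```

On a four-key dict the gap is likely to be small, partly because the `re` module caches recently compiled patterns; no particular ratio is claimed here.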


Imports:

import pandas as pd
import re
import timeit

Your original code:

miscdict = {" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}
data=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})
def org(printout=False):
    def parse_text(data):
        for key, replacement in miscdict.items():
            data['q1'] = data['q1'].str.replace( key, replacement )
        return data
    data2 = parse_text(data)
    if printout:
        print(data2)
org(printout=True)
print(timeit.timeit(org, number=10000))

This took 11.7 seconds:

                       q1
0              beer is ok
1          beer is not ok
2  beer was not available
3   Sierra Nevada is good
11.71043858179268

User @A-Za-z's code:

miscdict = {" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}
data=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})
def alt1(printout=False):
    data['q1'].replace(miscdict, regex = True, inplace = True)
    if printout:
        print(data)
alt1(printout=True)
print(timeit.timeit(alt1, number=10000))

This took 4.7 seconds:

                       q1
0              beer is ok
1          beer is not ok
2  beer was not available
3   Sierra Nevada is good
4.721581550644499

User @piRSquared's code:

miscdict = {" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}
data=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})
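def alt2(printout=False):
    # Rebind the module-level DataFrame; without this declaration the
    # assignment below makes `data` local and raises UnboundLocalError.
    global data
    # regex=True was added later because it doesn't work without it.
    data = data.replace(miscdict, regex=True)
    if printout:
        print(data)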
alt2(printout=True)
print(timeit.timeit(alt2, number=10000))

This took 5.0 seconds:

                       q1
0              beer is ok
1          beer is not ok
2  beer was not available
3   Sierra Nevada is good
4.951810616074919

My solution, with precompiled regular expressions:

miscdict = {" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}
miscdict_comp = {re.compile(k): v for k, v in miscdict.items()}
data=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})
def alt3(printout=False):
    def parse_text(text):
        for pattern, replacement in miscdict_comp.items():
            text = pattern.sub(replacement, text)
        return text
    data["q1"] = data["q1"].apply(parse_text)
    if printout:
        print(data)
alt3(printout=True)
print(timeit.timeit(alt3, number=10000))

This took 2.8 seconds:

                       q1
0              beer is ok
1          beer is not ok
2  beer was not available
3   Sierra Nevada is good
2.810334940701157

The idea is to precompile the patterns you want to replace.

I got the idea from here: https://jerel.co/blog/2011/12/using-python-for-super-fast-regex-search-and-replace
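A related technique, which is not what the code above does, is to merge all keys into one alternation pattern and look up the replacement in a callback, so each string is scanned once instead of once per key. The sketch below assumes the keys are literal strings (hence `re.escape`); `replace_once` is a name made up for illustration, and no timing is claimed for it:

```python
import re

miscdict = {" isn't ": " is not ", " aren't ": " are not ",
            " wasn't ": " was not ", " snevada ": " Sierra Nevada "}

# Longest keys first, so a longer key is tried before a shorter key
# that happens to be its prefix.
keys = sorted(miscdict, key=len, reverse=True)

# One pattern matching any key; re.escape keeps the keys literal.
combined = re.compile("|".join(re.escape(k) for k in keys))


def replace_once(text):
    # The callback receives each match and returns its replacement.
    return combined.sub(lambda m: miscdict[m.group(0)], text)


rows = ["beer is ok", "beer isn't ok", "beer wasn't available", " snevada is good"]
print([replace_once(r) for r in rows])
```

The function could be applied with `data["q1"].apply(replace_once)`, in the same way as `parse_text` in `alt3`. One behavioral difference to keep in mind: a single pass never rescans replaced text and matches cannot overlap, so two keys that share a separating space (for example `" isn't "` directly followed by `"snevada "`) will not both fire, whereas the key-by-key loop would catch the second one.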

Another answer

You don't need a loop here. df.replace with regex=True does the job, and it cuts the time by more than half.

df['q1'].replace(miscdict, regex = True, inplace = True)
1000 loops, best of 3: 1.08 ms per loop

This gives you:

                       q1
0              beer is ok
1          beer is not ok
2  beer was not available
3   Sierra Nevada is good

Compare that with the current solution:

for key, replacement in miscdict.items(): df['q1'] = df['q1'].str.replace( key, replacement )
100 loops, best of 3: 2.35 ms per loop
Another answer

Wow! We reinvented the wheel and designed some fancy spokes and spikes...

...just do this:

# regex=True is needed so that the keys match inside each string.
df.replace(miscdict, regex=True)

                       q1
0              beer is ok
1          beer is not ok
2  beer was not available
3   Sierra Nevada is good

Unless I'm missing something obvious.
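Whether `regex=True` is passed makes a real difference to `DataFrame.replace`: with a plain dict, a key only replaces a cell whose entire value equals it, while `regex=True` treats the keys as patterns and substitutes them inside longer strings. A small check, assuming pandas is installed; the sample rows are invented to show both cases:

```python
import pandas as pd

miscdict = {" isn't ": " is not ", " snevada ": " Sierra Nevada "}
df = pd.DataFrame({"q1": ["beer isn't ok", " snevada ", " snevada is good"]})

# Without regex: only the cell that is exactly " snevada " changes.
exact = df.replace(miscdict)

# With regex=True: keys are matched anywhere inside each string.
substring = df.replace(miscdict, regex=True)

print(exact["q1"].tolist())
print(substring["q1"].tolist())
```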

Another answer

Using a precompiled miscdict with Vaishali's example was about 10x faster in my case, with different data, as shown below:

import re
import pandas as pd

data=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})

miscdict = {" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}
miscdict_comp = {re.compile(k): v for k, v in miscdict.items()}

data['q1'].replace(miscdict_comp, regex = True, inplace = True)
