Python/Pandas - Performance improvement - Breaking a column in multiple parts and turn string sequences into lists
【Posted】: 2017-11-28 13:43:01
【Question】: I have a DataFrame named target. It has a column named "CNAE2".
If I run print(target.CNAE2), I get the following:
id
3 NaN
7 NaN
17 50.30-1-02, 52.11-7-01, 52.32-0-00, 52.50-8-05
18 32.67-1-00
19 46.93-1-00, 49.40-0-00
20 NaN
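The non-NaN values in the column are strings. They follow a hierarchical logic, and my intention is to:
a) turn each value into a list;
b) break the codes into several levels (which I call "pai", "vo" and "bisavo") and put each level in its own column: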
id  CNAE2                                 CNAE2pai                     CNAE2vo                CNAE2bisavo
3   NaN                                   NaN                          NaN                    NaN
7   NaN                                   NaN                          NaN                    NaN
17  [50.30-1-02, 52.32-0-00, 52.50-8-05]  [50.30-1, 52.32-0, 52.50-8]  [50.30, 52.32, 52.50]  [50, 52, 52]
18  [32.67-1-00]                          [32.67-1]                    [32.67]                [32]
19  [46.93-1-00, 46.40-0-00]              [46.93-1, 46.40-0]           [46.93, 46.40]         [46, 46]
20  NaN                                   NaN                          NaN                    NaN
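To make the three levels concrete: judging from the slice widths used in the question's own loop (7, 5 and 2 characters), each level appears to be a fixed-length prefix of the full code. A tiny illustration under that assumption, using one code from the sample data:

```python
# One full code from the sample column.
code = "50.30-1-02"

# Each level is assumed to be a fixed-width prefix of the full code.
levels = {
    "CNAE2pai": code[:7],     # "50.30-1"
    "CNAE2vo": code[:5],      # "50.30"
    "CNAE2bisavo": code[:2],  # "50"
}
print(levels)
```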
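I was able to get this result, but my code relies on many loops, and since I am running it on a fairly large DataFrame it takes far too long to be feasible. I used the following code: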
for i in target.index:
    cnaes=str(target['CNAE2'][i]).split(', ')
    target.CNAE2[i]=cnaes
    if cnaes == ['nan'] or cnaes == 'NaN' or cnaes == "":
        target.CNAE2[i]='NaN'
    else:
        target.CNAE2pai[i]=[]
        target.CNAE2vo[i]=[]
        target.CNAE2bisavo[i]=[]
        for k in range(len(cnaes)):
            y=cnaes[k][:7]
            target['CNAE2pai'][i].append(y)
        for k in range(len(cnaes)):
            y=cnaes[k][:5]
            target['CNAE2vo'][i].append(y)
        for k in range(len(cnaes)):
            y=cnaes[k][:2]
            target['CNAE2bisavo'][i].append(y)
        target.CNAE2pai[i]=list(set(target.CNAE2pai[i]))
        target.CNAE2vo[i]=list(set(target.CNAE2vo[i]))
        target.CNAE2bisavo[i]=list(set(target.CNAE2bisavo[i]))
Can anyone suggest a more efficient way to achieve this result?
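【Answer 1】: I haven't tried this, but it is best to avoid calling .append on values stored in the DataFrame. Build a plain list first and append to that, then put the finished result into your DataFrame.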
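A minimal sketch of that suggestion, assuming the CNAE2 column holds either comma-separated strings or real NaN values. The helper name truncate_codes and the three-row sample frame are made up for illustration. Each new column is built as an ordinary Python list and assigned to the DataFrame in a single step:

```python
import numpy as np
import pandas as pd

# Small stand-in for the real data (assumption: missing values are real NaN).
target = pd.DataFrame(
    {"CNAE2": [np.nan, "50.30-1-02, 52.11-7-01", "32.67-1-00"]},
    index=pd.Index([3, 17, 18], name="id"),
)

def truncate_codes(value, width):
    """Return the codes in `value` cut to `width` characters, or NaN."""
    if pd.isna(value):
        return np.nan
    return [code[:width] for code in value.split(", ")]

# Build each column as a plain list first, then assign it once.
for name, width in [("CNAE2pai", 7), ("CNAE2vo", 5), ("CNAE2bisavo", 2)]:
    target[name] = [truncate_codes(v, width) for v in target["CNAE2"]]
```

This avoids writing into individual DataFrame cells inside the loop, which is where most of the time in the original code is likely spent.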
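【Answer 2】: I used the apply function here, which should be faster than iterating over the rows; a set lookup, which should be faster than your chain of or comparisons; and a list comprehension, which tends to be faster than nested for loops. I have not tested this, but I hope it helps.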
import pandas as pd

# Create dummy data and dataframe (missing values are the string "NaN" here)
d = {"3":"NaN","7":"NaN","17":"50.30-1-02, 52.11-7-01, 52.32-0-00, 52.50-8-05","18":"32.67-1-00",
     "19":"46.93-1-00, 49.40-0-00","20":"NaN"}
target = pd.DataFrame([[k, d[k]] for k in d], columns = ["id","CNAE"])

# Loop across desired columns: (new column name, number of characters to keep)
nans = set(["nan","NaN",""])
for col in [("CNAE2pai",7),("CNAE2vo",5),("CNAE2bisavo",2)]:
    target[col[0]] = target.CNAE.apply(lambda x: "NaN" if x in nans else [i[:col[1]] for i in x.split(", ")])
# Full codes as a list
target["CNAE2"] = target.CNAE.apply(lambda x: "NaN" if x in nans else [i for i in x.split(", ")])
Edit:
On my system, the lambda function with list comprehensions gives faster results than groupby:
d = {"3":"NaN","7":"NaN","17":"50.30-1-02, 52.11-7-01, 52.32-0-00, 52.50-8-05","18":"32.67-1-00",
     "19":"46.93-1-00, 49.40-0-00","20":"NaN"}
target = pd.DataFrame([[k, d[k]] for k in d], columns = ["id","CNAE"])

def lambda_func(target):
    # Loop across desired columns
    nans = set(["nan","NaN",""])
    for col in [("CNAE2pai",7),("CNAE2vo",5),("CNAE2bisavo",2)]:
        target[col[0]] = target.CNAE.apply(lambda x: "NaN" if x in nans else [i[:col[1]] for i in x.split(", ")])
    target["CNAE2"] = target.CNAE.apply(lambda x: "NaN" if x in nans else [i for i in x.split(", ")])
    return target

def groupby_func(target):
    # One row per individual code, indexed by (original row, position)
    s = target.CNAE.str.split(', ', expand=True).stack()
    # n is passed by keyword; pandas 2.0+ no longer accepts it positionally
    pai = s.str.rsplit('-', n=1).str[0].groupby(level=0).apply(list)
    vo = s.str.split('-', n=1).str[0].groupby(level=0).apply(list)
    bisavo = s.str.split('.').str[0].groupby(level=0).apply(list)
    base = s.groupby(level=0).apply(list)
    target = pd.concat(
        [base, pai, vo, bisavo], axis=1,
        keys=['', 'pai', 'vo', 'bisavo']
    ).add_prefix('CNAE2').reindex(target.index)
    return target
Results:
%timeit lambda_func(target)
1000 loops, best of 3: 930 µs per loop
%timeit groupby_func(target)
100 loops, best of 3: 6.3 ms per loop
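The timings above come from a six-row frame, so they may not carry over to a large DataFrame. A hedged sketch of how the comparison could be repeated on more rows with the standard timeit module follows. The frame big, the row count and the two function names are assumptions for illustration, only the "pai" level is computed, and no timings are claimed here:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame: the six sample values repeated 2000 times.
values = [np.nan, np.nan, "50.30-1-02, 52.11-7-01, 52.32-0-00, 52.50-8-05",
          "32.67-1-00", "46.93-1-00, 49.40-0-00", np.nan]
big = pd.DataFrame({"CNAE2": values * 2000})

def with_apply(df):
    # apply + list comprehension, keeping real NaN values as they are
    return df["CNAE2"].apply(
        lambda x: x if pd.isna(x) else [c[:7] for c in x.split(", ")]
    )

def with_groupby(df):
    # split / stack / groupby; dropna() is a defensive addition in case
    # stack() keeps the padding values produced by expand=True
    s = df["CNAE2"].str.split(", ", expand=True).stack().dropna()
    out = s.str.rsplit("-", n=1).str[0].groupby(level=0).apply(list)
    return out.reindex(df.index)

for func in (with_apply, with_groupby):
    seconds = timeit.timeit(lambda: func(big), number=3)
    print(f"{func.__name__}: {seconds / 3:.4f} s per call")
```

Running this on a frame close to the real size is the only reliable way to choose between the two approaches.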
【Answer 3】:
# One row per individual code, indexed by (id, position)
s = target.CNAE2.str.split(', ', expand=True).stack()
# n is passed by keyword; pandas 2.0+ no longer accepts it positionally
pai = s.str.rsplit('-', n=1).str[0].groupby(level=0).apply(list)
vo = s.str.split('-', n=1).str[0].groupby(level=0).apply(list)
bisavo = s.str.split('.').str[0].groupby(level=0).apply(list)
base = s.groupby(level=0).apply(list)
pd.concat(
    [base, pai, vo, bisavo], axis=1,
    keys=['', 'pai', 'vo', 'bisavo']
).add_prefix('CNAE2').reindex(target.index)
    CNAE2                                             CNAE2pai                              CNAE2vo                       CNAE2bisavo
id
3   NaN                                               NaN                                   NaN                           NaN
7   NaN                                               NaN                                   NaN                           NaN
17  [50.30-1-02, 52.11-7-01, 52.32-0-00, 52.50-8-05]  [50.30-1, 52.11-7, 52.32-0, 52.50-8]  [50.30, 52.11, 52.32, 52.50]  [50, 52, 52, 52]
18  [32.67-1-00]                                      [32.67-1]                             [32.67]                       [32]
19  [46.93-1-00, 49.40-0-00]                          [46.93-1, 49.40-0]                    [46.93, 49.40]                [46, 49]
20  NaN                                               NaN                                   NaN                           NaN
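The loop in the question also removes duplicates within each row via list(set(...)), which the output above does not do (row 17 ends with [50, 52, 52, 52]). If de-duplication is wanted, one possible sketch uses dict.fromkeys, which keeps the first-seen order, unlike set. The helper name unique_prefixes and the three-row sample frame are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Small stand-in for the real data (assumption: missing values are real NaN).
target = pd.DataFrame(
    {"CNAE2": [np.nan,
               "50.30-1-02, 52.11-7-01, 52.32-0-00, 52.50-8-05",
               "46.93-1-00, 49.40-0-00"]},
    index=pd.Index([3, 17, 19], name="id"),
)

def unique_prefixes(value, width):
    """Cut each code to `width` characters and drop repeats, keeping order."""
    if pd.isna(value):
        return np.nan
    prefixes = (code[:width] for code in value.split(", "))
    return list(dict.fromkeys(prefixes))

target["CNAE2bisavo"] = target["CNAE2"].map(lambda v: unique_prefixes(v, 2))
print(target["CNAE2bisavo"])
```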