Pandas: dynamically pattern match from a second dataframe and extract string
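[Title]: Pandas dynamically pattern match from second dataframe and extract string [Posted]: 2020-05-21 16:28:07 [Question]: I am trying to dynamically build a regex extract pattern from a list held in a second dataframe, and to populate another column with the matched string if the text contains an entry from that list.
Here are the two starting tables: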
import re

import numpy as np
import pandas as pd

# this is a short extract; there are thousands of records in this table
provinces = pd.DataFrame({'country': ['Brazil','Brazil','Brazil','Colombia','Colombia','Colombia'],
                          'area': ['Cerrado','Sul de Minas', 'Mococoa','Tolima','Huila','Quindio'],
                          'index': [13,21,19,35,36,34]})
# test dataframe
df_test = pd.DataFrame({'country':['brazil','brazil','brazil','brazil','colombia','colombia','brazil'],
                        'locality':['sul de minas minas gerais','chapadao cerrado','cerrado cerrado','mococa sao paulo','pitalito huila','pijao quindio','espirito santo']})
print(provinces)
    country          area  index
0    Brazil       Cerrado     13
1    Brazil  Sul de Minas     21
2    Brazil       Mococoa     19
3  Colombia        Tolima     35
4  Colombia         Huila     36
5  Colombia       Quindio     34
print(df_test)
    country                   locality
0    brazil  sul de minas minas gerais
1    brazil           chapadao cerrado
2    brazil            cerrado cerrado
3    brazil           mococa sao paulo
4  colombia             pitalito huila
5  colombia              pijao quindio
6    brazil             espirito santo
And the desired end result:
df_result = pd.DataFrame({'country':['brazil','brazil','brazil','brazil','colombia','colombia','brazil'],
                          'locality':['minas gerais','chapadao','cerrado','sao paulo','pitalito','pijao','espirito santo'],
                          'area': ['sul de minas','cerrado','cerrado','mococoa','huila','quindio',''],
                          'index': [21,13,13,19,36,34,np.nan]})
print(df_result)
    country        locality          area  index
0    brazil    minas gerais  sul de minas   21.0
1    brazil        chapadao       cerrado   13.0
2    brazil         cerrado       cerrado   13.0
3    brazil       sao paulo       mococoa   19.0
4  colombia        pitalito         huila   36.0
5  colombia           pijao       quindio   34.0
6    brazil  espirito santo                  NaN
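I can't get past the first step of populating the area column. Once the area column contains a string, removing that same string from the locality column and adding the index column with a left join on the country and area columns is the easy part (!)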
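The "easy part" described above could be sketched roughly as follows. This is only an illustration under assumptions: the small frame `df` is a hypothetical stand-in in which `area` has already been populated, and the helper `strip_area` and the name `lookup` are not from the original post.

```python
import re

import pandas as pd

provinces = pd.DataFrame({
    'country': ['Brazil', 'Brazil', 'Brazil', 'Colombia', 'Colombia', 'Colombia'],
    'area': ['Cerrado', 'Sul de Minas', 'Mococoa', 'Tolima', 'Huila', 'Quindio'],
    'index': [13, 21, 19, 35, 36, 34],
})

# Hypothetical starting point: 'area' is assumed to be populated already.
df = pd.DataFrame({
    'country': ['brazil', 'brazil', 'colombia', 'brazil'],
    'locality': ['sul de minas minas gerais', 'cerrado cerrado',
                 'pitalito huila', 'espirito santo'],
    'area': ['sul de minas', 'cerrado', 'huila', None],
})


def strip_area(row):
    """Remove the first occurrence of the area name from the locality text."""
    if pd.isna(row['area']):
        return row['locality']
    pattern = r'\b{}\b'.format(re.escape(row['area']))
    stripped = re.sub(pattern, '', row['locality'], count=1)
    # Collapse the whitespace left behind by the removal.
    return ' '.join(stripped.split())


df['locality'] = df.apply(strip_area, axis=1)

# Lower-case the join keys so they line up with the lower-case test data.
lookup = provinces.assign(country=provinces['country'].str.lower(),
                          area=provinces['area'].str.lower())

# Left join: rows without a matching (country, area) pair keep NaN in 'index'.
result = df.merge(lookup, on=['country', 'area'], how='left')
```

Only the first occurrence is removed (`count=1`) so that a locality such as 'cerrado cerrado' keeps one 'cerrado', as in the desired result.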
# to create the area column and extract the area string if there's a match (by string and country) in the provinces table
df_test['area'] = ''
# note: the == comparison between the two differently-labelled country Series is what raises the error
df_test.area = df_test.locality.str.extract(flags=re.IGNORECASE, pat=r'(\b{}\b)'.format('|'.join(provinces.loc[provinces.country.str.lower()==df_test.country,'area'].str.lower().to_list())), expand=False)
I also need to apply a boolean mask (mapping) to exclude some records in this step.
# as above but, for added complexity, populate the area column only if df_test.country == 'brazil':
df_test['area'] = ''
mapping = df_test.country == 'brazil'
df_test.loc[mapping,'area'] = df_test.loc[mapping,'locality'].str.extract(flags=re.IGNORECASE, pat=r'(\b{}\b)'.format('|'.join(provinces.loc[provinces.country.str.lower()==df_test.country,'area'].str.lower().to_list())), expand=False)
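All the vectorised regex extract solutions I have found rely on pre-defined regex patterns. Since these patterns need to come from the provinces dataframe where the country matches, this question and answer seems to be the closest match for my case, but I can't make sense of it...
Thanks in advance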
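[Answer 1]: Following the error messages (and some sleep!), "Can only compare identically-labeled Series objects" was resolved with this answer,
and then "ValueError: Lengths must match to compare" with this answer.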
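As a rough illustration of why those two errors appear, the sketch below uses two toy Series (`left` and `right`, hypothetical stand-ins for `provinces.country` and `df_test.country`) with different lengths and index labels. The helper `raises_value_error` is introduced only for this example.

```python
import pandas as pd

# Toy stand-ins with different lengths, and therefore different index labels.
left = pd.Series(['brazil', 'brazil', 'colombia'])
right = pd.Series(['brazil', 'colombia', 'brazil', 'peru'])


def raises_value_error(compare):
    """Return True if calling compare() raises ValueError."""
    try:
        compare()
    except ValueError:
        return True
    return False


# Series == Series requires identical index labels on both sides.
series_vs_series_fails = raises_value_error(lambda: left == right)

# Series == array avoids the label check but still requires equal lengths.
series_vs_array_fails = raises_value_error(lambda: left == right.to_numpy())

# isin() is a membership test, so the lengths and labels need not agree.
membership = left.isin(right)
```

`isin` returns a boolean Series aligned with `left`, which is why it can be used directly inside `provinces.loc[...]`.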
The solution is as follows:
df_test['area'] = ''
# isin() replaces the == comparison, so the two country Series no longer need matching labels or lengths
df_test.area = df_test.locality.str.extract(flags=re.IGNORECASE, pat=r'({})'.format('|'.join(provinces.loc[provinces.country.str.lower().isin(df_test.country),'area'].str.lower().to_list())), expand=False)
[out]
    country                   locality          area
0    brazil  sul de minas minas gerais  sul de minas
1    brazil           chapadao cerrado       cerrado
2    brazil            cerrado cerrado       cerrado
3    brazil          mococoa sao paulo       mococoa
4  colombia             pitalito huila         huila
5  colombia              pijao quindio       quindio
6    brazil             espirito santo           NaN
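Note that the `isin` filter selects the areas of every country that appears in `df_test`, so a single pattern is shared by all rows. If the match should be restricted to areas of the row's own country, as the question seems to intend, one possible variant is sketched below. It is an assumption about the intent, not part of the original answer: the names `build_pattern`, `patterns` and `extract_area` are hypothetical, and it uses a row-wise `apply` rather than a vectorised `str.extract`. The sample data mirror the output shown above.

```python
import re

import pandas as pd

provinces = pd.DataFrame({
    'country': ['Brazil', 'Brazil', 'Brazil', 'Colombia', 'Colombia', 'Colombia'],
    'area': ['Cerrado', 'Sul de Minas', 'Mococoa', 'Tolima', 'Huila', 'Quindio'],
})
df_test = pd.DataFrame({
    'country': ['brazil', 'brazil', 'brazil', 'brazil', 'colombia', 'colombia', 'brazil'],
    'locality': ['sul de minas minas gerais', 'chapadao cerrado', 'cerrado cerrado',
                 'mococoa sao paulo', 'pitalito huila', 'pijao quindio', 'espirito santo'],
})


def build_pattern(areas):
    """Compile one alternation pattern from a collection of area names."""
    # Escape each name and try longer names first, so that a longer area
    # name wins over a shorter one starting at the same position.
    alternatives = sorted((re.escape(a.lower()) for a in areas), key=len, reverse=True)
    return re.compile(r'\b(?:{})\b'.format('|'.join(alternatives)), flags=re.IGNORECASE)


# One compiled pattern per country, keyed by the lower-cased country name.
patterns = {country.lower(): build_pattern(areas)
            for country, areas in provinces.groupby('country')['area']}


def extract_area(row):
    """Return the first area of the row's own country found in its locality."""
    pattern = patterns.get(row['country'])
    match = pattern.search(row['locality']) if pattern is not None else None
    return match.group(0).lower() if match else None


df_test['area'] = df_test.apply(extract_area, axis=1)
```

With thousands of rows, a row-wise `apply` will be slower than `str.extract`; grouping `df_test` by country and calling `str.extract` once per group would be one way to keep the per-country restriction while staying vectorised.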