Pandas dynamically pattern match from second dataframe and extract string

Posted: 2020-05-21 16:28:07

Question:

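I'm trying to dynamically build a regex extraction pattern from a list held in a second dataframe, and to populate another column with the matched string when the text contains one of the list entries.

Here are the two starting tables: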

import pandas as pd
import re

# this is a short extract, there are 1000's of records in this table:
provinces = pd.DataFrame({'country': ['Brazil','Brazil','Brazil','Colombia','Colombia','Colombia'],
                          'area': ['Cerrado','Sul de Minas', 'Mococoa','Tolima','Huila','Quindio'],
                          'index': [13,21,19,35,36,34]})

# test dataframe
df_test = pd.DataFrame({'country':['brazil','brazil','brazil','brazil','colombia','colombia','brazil'],
                        'locality':['sul de minas minas gerais','chapadao cerrado','cerrado cerrado','mococoa sao paulo','pitalito huila','pijao quindio','espirito santo']})
print(provinces)

    country          area  index
0    Brazil       Cerrado     13
1    Brazil  Sul de Minas     21
2    Brazil       Mococoa     19
3  Colombia        Tolima     35
4  Colombia         Huila     36
5  Colombia       Quindio     34

print(df_test)
    country                   locality
0    brazil  sul de minas minas gerais
1    brazil           chapadao cerrado
2    brazil            cerrado cerrado
3    brazil          mococoa sao paulo
4  colombia             pitalito huila
5  colombia              pijao quindio
6    brazil             espirito santo

And the final result:

import numpy as np  # needed for np.nan below

df_result = pd.DataFrame({'country':['brazil','brazil','brazil','brazil','colombia','colombia','brazil'],
                          'locality':['minas gerais','chapadao','cerrado','sao paulo','pitalito','pijao','espirito santo'],
                          'area': ['sul de minas','cerrado','cerrado','mococoa','huila','quindio',''],
                          'index': [21,13,13,19,36,34,np.nan]})
print(df_result)
    country        locality          area  index
0    brazil    minas gerais  sul de minas   21.0
1    brazil        chapadao       cerrado   13.0
2    brazil         cerrado       cerrado   13.0
3    brazil       sao paulo       mococoa   19.0
4  colombia        pitalito         huila   36.0
5  colombia           pijao       quindio   34.0
6    brazil  espirito santo                  NaN

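I can't get past the first step, populating the area column. Once the area column contains a string, removing that same string from the locality column and adding the index column with a left join on the country and area columns is the easy part (!)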
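The "easy part" described above could be sketched roughly as follows, assuming the area column has already been filled. The cut-down frames and the helper name `strip_area` are illustrative assumptions, not part of the original post.

```python
import re
import pandas as pd

# Cut-down stand-ins for the two tables in the question (illustrative data only).
provinces = pd.DataFrame({
    'country': ['Brazil', 'Colombia'],
    'area': ['Cerrado', 'Huila'],
    'index': [13, 36],
})
df = pd.DataFrame({
    'country': ['brazil', 'colombia', 'brazil'],
    'locality': ['chapadao cerrado', 'pitalito huila', 'espirito santo'],
    'area': ['cerrado', 'huila', ''],
})

def strip_area(row):
    """Hypothetical helper: drop the first occurrence of the area from locality."""
    if not row['area']:
        return row['locality']
    cleaned = re.sub(r'\b{}\b'.format(re.escape(row['area'])), '',
                     row['locality'], count=1)
    return ' '.join(cleaned.split())  # collapse any leftover whitespace

df['locality'] = df.apply(strip_area, axis=1)

# Lower-case the lookup keys so they line up with df, then left join on both keys.
lookup = provinces.assign(country=provinces['country'].str.lower(),
                          area=provinces['area'].str.lower())
result = df.merge(lookup, on=['country', 'area'], how='left')
print(result)
```

Rows whose area is empty find no partner in the lookup table, so their index comes back as NaN, which is in line with the desired result shown earlier.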

# to create the area column and extract the area string if there's a match (by string and country) in the provinces table
df_test['area'] = ''
df_test.area = df_test.locality.str.extract(flags=re.IGNORECASE, pat = r'(\b{}\b)'.format('|'.join(provinces.loc[provinces.country.str.lower()==df_test.country,'area'].str.lower().to_list()), expand=False))

I also need to apply a mask to exclude some records at this step.

# as above but for added complexity, populate the area column only if df_test.country == 'brazil':
df_test['area'] = ''
mapping = df_test.country =='brazil'
df_test.loc[mapping,'area'] = df_test.loc[mapping,'locality'].str.extract(flags=re.IGNORECASE, pat = r'(\b{}\b)'.format('|'.join(provinces.loc[provinces.country.str.lower()==df_test.country,'area'].str.lower().to_list()), expand=False))

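All of the vectorised regex extract solutions I have found rely on pre-defined regex patterns. Given that the patterns here need to come from the provinces dataframe, restricted to rows where the country matches, this question and answer seemed to be the closest fit for this scenario, but I couldn't make sense of it...

Thanks in advance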

Answer 1:

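Following the error messages (and getting some sleep!), "Can only compare identically-labeled Series objects" was resolved by this answer

and then "ValueError: Lengths must match to compare" by this answer.

The solution is as follows: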
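To illustrate why the original comparison fails, here is a minimal sketch using two made-up Series of different lengths. An element-wise `==` between Series needs identical labels, whereas `isin` is a membership test that returns one boolean per row of the Series it is called on.

```python
import pandas as pd

# Illustrative Series only; the real columns are df_test.country and provinces.country.
left = pd.Series(['brazil', 'colombia', 'brazil'])   # 3 rows
right = pd.Series(['brazil', 'colombia'])            # 2 rows

try:
    left == right          # element-wise comparison of differently-labelled Series
    raised = False
except ValueError as err:
    raised = True
    print(err)

# Membership test: works regardless of the length of `left`.
mask = right.isin(left)
print(mask.tolist())
```

Note that `isin` answers a different question from a row-by-row equality: it only says whether each country appears anywhere in the other column.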

df_test['area'] = ''
# build one alternation from every area whose country appears in df_test
df_test.area = df_test.locality.str.extract(
    flags=re.IGNORECASE,
    pat=r'({})'.format('|'.join(provinces.loc[provinces.country.str.lower().isin(df_test.country), 'area'].str.lower().to_list())),
    expand=False)

[out]

    country                   locality          area
0    brazil  sul de minas minas gerais  sul de minas
1    brazil           chapadao cerrado       cerrado
2    brazil            cerrado cerrado       cerrado
3    brazil          mococoa sao paulo       mococoa
4  colombia             pitalito huila         huila
5  colombia              pijao quindio       quindio
6    brazil             espirito santo           NaN
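One thing to be aware of: with `isin`, the pattern pools the areas of every country present in df_test, so a Colombian row could in principle match a Brazilian area name. If matching strictly per country is needed, a possible variant is sketched below. It is an assumption-laden alternative rather than the answer's method: the sample data is made up, and the `patterns` dict is an illustrative name.

```python
import re
import pandas as pd

# Illustrative data; row 3 deliberately contains a Brazilian area name.
provinces = pd.DataFrame({
    'country': ['Brazil', 'Brazil', 'Colombia', 'Colombia'],
    'area': ['Cerrado', 'Sul de Minas', 'Huila', 'Quindio'],
})
df = pd.DataFrame({
    'country': ['brazil', 'brazil', 'colombia', 'colombia', 'brazil'],
    'locality': ['sul de minas minas gerais', 'chapadao cerrado',
                 'pitalito huila', 'cerrado quindio', 'espirito santo'],
})

# Build one alternation per country; longer names go first so that
# multi-word areas are preferred over shorter ones.
patterns = {}
for country, areas in provinces.groupby(provinces['country'].str.lower())['area']:
    names = sorted(areas.str.lower(), key=len, reverse=True)
    patterns[country] = r'\b({})\b'.format('|'.join(re.escape(n) for n in names))

# Apply each country's pattern only to that country's rows.
df['area'] = None
for country, pat in patterns.items():
    mask = df['country'] == country
    df.loc[mask, 'area'] = df.loc[mask, 'locality'].str.extract(
        pat, flags=re.IGNORECASE, expand=False)

print(df)
```

To populate the column for Brazil only, as in the second part of the question, the loop could be limited to the `'brazil'` key of `patterns`.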
