Use dictionary to replace a string within a string in Pandas columns

Posted: 2018-03-02 17:18:37

Question:

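I am trying to use a dictionary's keys to replace strings in a pandas column with the corresponding values. However, each column contains sentences, so I first have to tokenize each sentence, check whether any word in it matches a key in my dictionary, and then replace that word with the corresponding value.

However, the result I keep getting is None. Is there a better, more Pythonic way to solve this?

Here is my MVCE so far. In the comments, I have marked where the problem occurs.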

import pandas as pd

data = {'Categories': ['animal','plant','object'],
        'Type': ['tree','dog','rock'],
        'Comment': ['The NYC tree is very big','The cat from the UK is small','The rock was found in LA.']}

ids = {'Id':['NYC','LA','UK'],
       'City':['New York City','Los Angeles','United Kingdom']}

df = pd.DataFrame(data)
ids = pd.DataFrame(ids)

def col2dict(ids):
    data = ids[['Id', 'City']]
    idDict = data.set_index('Id').to_dict()['City']
    return idDict

def replaceIds(data,idDict):
    ids = idDict.keys()
    types = idDict.values()
    data['commentTest'] = data['Comment']
    words = data['commentTest'].apply(lambda x: x.split())
    for (i,word) in enumerate(words):
        #Here we can see that the words appear
        print word
        print ids
        if word in ids:
        #Here we can see that they are not being recognized. What happened?
            print ids
            print word
            words[i] = idDict[word]
            data['commentTest'] = ' '.apply(lambda x: ''.join(x))
    return data

idDict = col2dict(ids)
results = replaceIds(df, idDict)

Result:

None

I am using Python 2.7, and when I print the dict, the strings are shown with the Unicode prefix u'.

My expected result is:

  Categories                       Comment  Type                               commentTest
0     animal      The NYC tree is very big  tree        The New York City tree is very big
1      plant  The cat from the UK is small   dog  The cat from the United Kingdom is small
2     object     The rock was found in LA.  rock        The rock was found in Los Angeles.

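Answer 1:

You can create a dictionary from the two lists and then pass it to replace with regex=True, so that the keys are matched as substrings inside each sentence: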

ids = {'Id':['NYC','LA','UK'],
       'City':['New York City','Los Angeles','United Kingdom']}

ids = dict(zip(ids['Id'], ids['City']))
print (ids)
{'UK': 'United Kingdom', 'LA': 'Los Angeles', 'NYC': 'New York City'}

df['commentTest'] = df['Comment'].replace(ids, regex=True)
print (df)
  Categories                       Comment  Type  \
0     animal      The NYC tree is very big  tree   
1      plant  The cat from the UK is small   dog   
2     object     The rock was found in LA.  rock   

                                commentTest  
0        The New York City tree is very big  
1  The cat from the United Kingdom is small  
2        The rock was found in Los Angeles.  
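One caveat with regex=True is that each dictionary key is treated as a regular expression and matches anywhere in the text, including inside longer words. The following is a minimal sketch, not part of the original answer, that wraps each key in word boundaries to avoid this. The extra 'LAX' row and the names patterns and naive are assumptions added purely for illustration.

```python
import re

import pandas as pd

# Sample data from the question, plus one hypothetical row ('LAX')
# added here only to show the substring pitfall.
df = pd.DataFrame({'Comment': ['The NYC tree is very big',
                               'The cat from the UK is small',
                               'The rock was found in LA.',
                               'The LAX terminal is busy']})
ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}

# Plain regex replacement: 'LA' also matches inside 'LAX'.
naive = df['Comment'].replace(ids, regex=True)

# Escape each key and wrap it in \b so that only whole words match.
patterns = {r'\b' + re.escape(key) + r'\b': value for key, value in ids.items()}
df['commentTest'] = df['Comment'].replace(patterns, regex=True)

print(naive.tolist())
print(df['commentTest'].tolist())
```

Using re.escape also guards against keys that happen to contain regex metacharacters such as '.' or '+'.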

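Comments:

Why regex=True? From the docs I thought it should be False: "Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None."

@pceccon - I think the docs should note that it is more commonly used for replacing substrings; at the moment that is not at all clear from the docs.

Answer 2: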

Using str.replace() is actually much faster than replace(), even though str.replace() requires a loop:

ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}

for old, new in ids.items():
    df['Comment'] = df['Comment'].str.replace(old, new, regex=False)

#   Categories  Type                                   Comment
# 0     animal  tree        The New York City tree is very big
# 1      plant   dog  The cat from the United Kingdom is small
# 2     object  rock        The rock was found in Los Angeles.

The only case where replace() outperforms the str.replace() loop is with small dataframes.

Timing functions for reference:

def Series_replace(df):
    df['Comment'] = df['Comment'].replace(ids, regex=True)
    return df

def Series_str_replace(df):
    for old, new in ids.items():
        df['Comment'] = df['Comment'].str.replace(old, new, regex=False)
    return df
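To reproduce the comparison on your own machine, a small harness along these lines could be used. It is a sketch under stated assumptions: the helper time_both, the frame sizes, and the repeat count are hypothetical choices, not taken from the original answer, and the two functions are restated so that they return a new Series instead of modifying the frame in place.

```python
import timeit

import pandas as pd

ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}
base = pd.DataFrame({'Comment': ['The NYC tree is very big',
                                 'The cat from the UK is small',
                                 'The rock was found in LA.']})

def series_replace(df):
    # Dictionary-based regex replacement in a single call.
    return df['Comment'].replace(ids, regex=True)

def series_str_replace(df):
    # One literal str.replace() pass per dictionary entry.
    out = df['Comment']
    for old, new in ids.items():
        out = out.str.replace(old, new, regex=False)
    return out

def time_both(n_rows, repeats=3):
    # Hypothetical helper: tile the sample rows up to roughly n_rows and
    # return the best wall-clock time (in seconds) for each approach.
    df = pd.concat([base] * max(1, n_rows // len(base)), ignore_index=True)
    t_replace = min(timeit.repeat(lambda: series_replace(df),
                                  number=1, repeat=repeats))
    t_str = min(timeit.repeat(lambda: series_str_replace(df),
                              number=1, repeat=repeats))
    return t_replace, t_str

if __name__ == '__main__':
    for n in (30, 30000):
        print(n, time_both(n))
```

Absolute numbers will depend on the pandas version and hardware, so treat any single run as indicative only.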

Note that if ids is a dataframe rather than a dictionary, you can get the same performance with itertuples():

ids = pd.DataFrame({'Id': ['NYC', 'LA', 'UK'], 'City': ['New York City', 'Los Angeles', 'United Kingdom']})

for row in ids.itertuples():
    df['Comment'] = df['Comment'].str.replace(row.Id, row.City, regex=False)
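As a variation on the loops above, all keys can also be replaced in one pass by joining them into a single alternation pattern and passing a callable as the replacement. This is a different technique from the one in this answer, shown only as a sketch; the names keys and pattern are illustrative, and no claim is made here about how its speed compares.

```python
import re

import pandas as pd

df = pd.DataFrame({'Comment': ['The NYC tree is very big',
                               'The cat from the UK is small',
                               'The rock was found in LA.']})
ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}

# Try longer keys first so that a short key cannot pre-empt a longer one
# that starts with the same characters.
keys = sorted(ids, key=len, reverse=True)
pattern = '|'.join(re.escape(key) for key in keys)

# With regex=True, str.replace() accepts a callable that receives the
# match object and returns the replacement text.
df['commentTest'] = df['Comment'].str.replace(
    pattern, lambda match: ids[match.group(0)], regex=True)

print(df['commentTest'].tolist())
```

Because each piece of text is matched only once, a replacement value is never scanned again for other keys, which can matter when one value contains another key.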
