Use a dictionary to replace a string within a string in Pandas columns
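【Posted】: 2018-03-02 17:18:37
【Question】: I am trying to use the keys of a dictionary to replace strings in a pandas column with the corresponding values. However, each column contains sentences, so I first have to tokenize each sentence, check whether a word in the sentence matches a key in my dictionary, and then replace that word with the corresponding value.
However, the result I keep getting is None. Is there a better, more Pythonic way to solve this?
Here is my minimal example so far. In the comments, I point out where the problem occurs.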
import pandas as pd

data = {'Categories': ['animal','plant','object'],
        'Type': ['tree','dog','rock'],
        'Comment': ['The NYC tree is very big','The cat from the UK is small','The rock was found in LA.']}
ids = {'Id':['NYC','LA','UK'],
       'City':['New York City','Los Angeles','United Kingdom']}

df = pd.DataFrame(data)
ids = pd.DataFrame(ids)

def col2dict(ids):
    data = ids[['Id', 'City']]
    idDict = data.set_index('Id').to_dict()['City']
    return idDict

def replaceIds(data,idDict):
    ids = idDict.keys()
    types = idDict.values()
    data['commentTest'] = data['Comment']
    words = data['commentTest'].apply(lambda x: x.split())
    for (i,word) in enumerate(words):
        #Here we can see that the words appear
        print word
        print ids
        if word in ids:
            #Here we can see that they are not being recognized. What happened?
            print ids
            print word
            words[i] = idDict[word]
            data['commentTest'] = ' '.apply(lambda x: ''.join(x))
    return data

idDict = col2dict(ids)
results = replaceIds(df, idDict)
Result:
None
I am using Python 2.7, and when I print the dict, the strings are shown with the u' Unicode prefix.
My expected result is:
  Categories                       Comment  Type                               commentTest
0     animal      The NYC tree is very big  tree        The New York City tree is very big
1      plant  The cat from the UK is small   dog  The cat from the United Kingdom is small
2     object     The rock was found in LA.  rock        The rock was found in Los Angeles.
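One likely reason the membership test in replaceIds never succeeds: words is a Series whose elements are lists of tokens, so word inside the loop is a whole list rather than a single string, and it never matches a dictionary key. Below is a minimal sketch of the token-by-token idea described in the question, assuming plain whitespace tokenization; the helper name replace_tokens and the variable id_dict are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Comment': ['The NYC tree is very big',
                'The cat from the UK is small',
                'The rock was found in LA.'],
})
id_dict = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}


def replace_tokens(sentence, mapping):
    # Look up each whitespace-separated token; keep it unchanged if it is not a key.
    return ' '.join(mapping.get(token, token) for token in sentence.split())


df['commentTest'] = df['Comment'].apply(lambda s: replace_tokens(s, id_dict))
print(df['commentTest'].tolist())
```

With this tokenization the last row stays unchanged, because its token is 'LA.' (with the trailing period) rather than 'LA'. That is one reason the answers below replace substrings instead of whole tokens.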
【Answer 1】: You can create a dictionary and then use replace:
ids = {'Id':['NYC','LA','UK'],
       'City':['New York City','Los Angeles','United Kingdom']}

ids = dict(zip(ids['Id'], ids['City']))
print (ids)
# {'UK': 'United Kingdom', 'LA': 'Los Angeles', 'NYC': 'New York City'}

df['commentTest'] = df['Comment'].replace(ids, regex=True)
print (df)
#   Categories                       Comment  Type  \
# 0     animal      The NYC tree is very big  tree
# 1      plant  The cat from the UK is small   dog
# 2     object     The rock was found in LA.  rock
#
#                                 commentTest
# 0        The New York City tree is very big
# 1  The cat from the United Kingdom is small
# 2        The rock was found in Los Angeles.
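Because regex=True treats every dictionary key as a regular expression and matches it anywhere in the string, a short key such as LA also matches inside a longer word. The sketch below shows one way to guard against that, assuming the keys should only match whole words: each key is escaped with re.escape and wrapped in \b word boundaries. The name word_patterns and the second sample sentence are made up for illustration.

```python
import re

import pandas as pd

ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}
s = pd.Series(['The rock was found in LA.', 'ATLAS is not in LA'])

# Plain keys: 'LA' is matched anywhere, including inside 'ATLAS'.
naive = s.replace(ids, regex=True)

# Escaped keys wrapped in word boundaries only match whole words.
word_patterns = {r'\b' + re.escape(key) + r'\b': value
                 for key, value in ids.items()}
guarded = s.replace(word_patterns, regex=True)

print(naive.tolist())
print(guarded.tolist())
```

Escaping also matters when a key contains characters such as . or ( that have a special meaning in regular expressions.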
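【Comments】:
Why regex=True? From the docs I thought it should be False: "Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Otherwise, to_replace must be None because this parameter will be interpreted as a regular expression or a list, dict, or array of regular expressions."
@pceccon - I think the docs should mention that it is more commonly used for replacing substrings; right now that is not at all clear from the docs.
【Answer 2】: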
Using str.replace() is actually much faster than using replace(), even though str.replace() requires a loop:
ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}

for old, new in ids.items():
    df['Comment'] = df['Comment'].str.replace(old, new, regex=False)

#   Categories  Type                                   Comment
# 0     animal  tree        The New York City tree is very big
# 1      plant   dog  The cat from the United Kingdom is small
# 2     object  rock        The rock was found in Los Angeles.
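As an alternative to looping once per key, all keys can be combined into one regular expression and replaced in a single pass with a callable replacement. This is a different technique from the loop above, offered only as a sketch; the name pattern is illustrative, and sorting the keys longest-first is an assumption made so that a key that is a prefix of another key cannot shadow it.

```python
import re

import pandas as pd

ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}
df = pd.DataFrame({
    'Comment': ['The NYC tree is very big',
                'The cat from the UK is small',
                'The rock was found in LA.'],
})

# Build one alternation such as 'NYC|LA|UK'; re.escape guards keys that
# contain regex metacharacters, and longer keys are tried first.
pattern = '|'.join(re.escape(key) for key in sorted(ids, key=len, reverse=True))

# With regex=True, repl may be a callable that receives each match object.
df['commentTest'] = df['Comment'].str.replace(
    pattern, lambda match: ids[match.group(0)], regex=True)
print(df['commentTest'].tolist())
```

Whether this is faster than the loop depends on the data and the number of keys, so it is worth measuring before switching.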
replace() only outperforms the str.replace() loop on small dataframes.
Timing functions for reference:
def Series_replace(df):
    df['Comment'] = df['Comment'].replace(ids, regex=True)
    return df

def Series_str_replace(df):
    for old, new in ids.items():
        df['Comment'] = df['Comment'].str.replace(old, new, regex=False)
    return df
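The two functions can be compared with timeit. The sketch below is an assumption about how such a measurement might be set up, not the benchmark behind the claim above: the frame sizes, the repetition count, and the helper make_frame are all made up. No timings are shown because they depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

ids = {'NYC': 'New York City', 'LA': 'Los Angeles', 'UK': 'United Kingdom'}


def Series_replace(df):
    df['Comment'] = df['Comment'].replace(ids, regex=True)
    return df


def Series_str_replace(df):
    for old, new in ids.items():
        df['Comment'] = df['Comment'].str.replace(old, new, regex=False)
    return df


def make_frame(n_rows):
    # Hypothetical test data: repeat the three sample comments to reach n_rows rows.
    comments = ['The NYC tree is very big',
                'The cat from the UK is small',
                'The rock was found in LA.']
    return pd.DataFrame({'Comment': [comments[i % 3] for i in range(n_rows)]})


for n_rows in (10, 1000, 10000):
    base = make_frame(n_rows)
    for func in (Series_replace, Series_str_replace):
        # Copy inside the timed call so every run starts from unreplaced text;
        # the cost of the copy is included in both measurements.
        seconds = timeit.timeit(lambda: func(base.copy()), number=5)
        print(n_rows, func.__name__, round(seconds, 4))
```

Copying the frame on every call keeps the comparison fair, since both functions overwrite the Comment column in place.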
Note that if ids is a dataframe rather than a dictionary, you can get the same performance with itertuples():
ids = pd.DataFrame({'Id': ['NYC', 'LA', 'UK'], 'City': ['New York City', 'Los Angeles', 'United Kingdom']})

for row in ids.itertuples():
    df['Comment'] = df['Comment'].str.replace(row.Id, row.City, regex=False)