How to create a series in a dataframe that returns a value if another series contains a specific string?
[Posted]: 2021-02-12 04:48:44
[Question]: I am trying to create a series in a dataframe that returns a value for each cell, depending on whether the corresponding cell in another series contains a given string. Let me explain:
I have a dataframe with the columns "restaurant_name" and "brand_name":
import pandas as pd

data = [["mcdonalds central london", ""], ["pizza hut downtown new york", ""],
        ["dominos new jersey", ""], ["mac donald berlin", ""]]
restaurants = pd.DataFrame(data, columns=['restaurant_name', 'brand_name'])
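I also have a dictionary with search strings as keys and formatted brand names as values. I want the algorithm to check whether restaurants["restaurant_name"] contains a key from brand_dictionary and, if it does, to write the value for that key into restaurants["brand_name"]: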
brand_dictionary = {
    "mcdonalds": "McDonald's",
    "mac donald": "McDonald's",
    "dominos": "Dominos Pizza",
    "pizza hut": "Pizza Hut"
}
I really don't know how to do this.
[Answer 1]: You can search the dictionary with a custom function passed to .apply() (it returns '-' if no brand is found):
import pandas as pd

data = [["mcdonalds central london", ""], ["pizza hut downtown new york", ""],
        ["dominos new jersey", ""], ["mac donald berlin", ""]]
restaurants = pd.DataFrame(data, columns=['restaurant_name', 'brand_name'])

brand_dictionary = {
    "mcdonalds": "McDonald's",
    "mac donald": "McDonald's",
    "dominos": "Dominos Pizza",
    "pizza hut": "Pizza Hut"
}

def get_name(restaurant, dct):
    # Return the brand for the first dictionary key found in the restaurant name
    for r in dct:
        if r in restaurant:
            return dct[r]
    # No key matched
    return '-'

restaurants['brand_name'] = restaurants['restaurant_name'].apply(lambda x: get_name(x, brand_dictionary))
print(restaurants)
Prints:
               restaurant_name     brand_name
0     mcdonalds central london     McDonald's
1  pizza hut downtown new york      Pizza Hut
2           dominos new jersey  Dominos Pizza
3            mac donald berlin     McDonald's
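The function above assumes every cell is a lowercase string. As a minimal sketch (not part of the original answer), the variant below lowercases each name before searching and returns the default for missing or non-string cells. The helper name get_name_safe, its default parameter, and the extra sample rows are assumptions made for illustration.

```python
import pandas as pd

# Sample data with mixed case, an unknown restaurant, and a missing value
# (these rows are illustrative, not from the question).
restaurants = pd.DataFrame(
    {"restaurant_name": ["McDonalds Central London",
                         "pizza hut downtown new york",
                         "corner cafe paris",
                         None]}
)

brand_dictionary = {
    "mcdonalds": "McDonald's",
    "mac donald": "McDonald's",
    "dominos": "Dominos Pizza",
    "pizza hut": "Pizza Hut",
}

def get_name_safe(restaurant, dct, default="-"):
    # Hypothetical helper: tolerate missing or non-string cells.
    if not isinstance(restaurant, str):
        return default
    lowered = restaurant.lower()
    # Keys are checked in dictionary insertion order; the first hit wins.
    for key, brand in dct.items():
        if key in lowered:
            return brand
    return default

# Extra positional arguments for the function go in `args`.
restaurants["brand_name"] = restaurants["restaurant_name"].apply(
    get_name_safe, args=(brand_dictionary,)
)
print(restaurants)
```

Because the first matching key wins, the order of the dictionary matters if one restaurant name could contain several keys.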
[Answer 2]: You can use str.extract to pull out the first matching keyword, then map that match through the dictionary.
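# Build one alternation pattern with a single capture group from the dictionary keys
pat = f'({"|".join(brand_dictionary.keys())})'
# '(mcdonalds|mac donald|dominos|pizza hut)'
restaurants['brand_name'] = restaurants['restaurant_name'].str.extract(pat)[0].map(brand_dictionary)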
               restaurant_name     brand_name
0     mcdonalds central london     McDonald's
1  pizza hut downtown new york      Pizza Hut
2           dominos new jersey  Dominos Pizza
3            mac donald berlin     McDonald's
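Two details of the str.extract approach are worth a sketch: dictionary keys are inserted into a regular expression as-is, and rows without any match come back as NaN. The example below is an illustration under assumptions, not the answerer's code: it escapes each key with re.escape, matches case-insensitively, and fills unmatched rows with '-'. The sample rows are made up.

```python
import re
import pandas as pd

brand_dictionary = {
    "mcdonalds": "McDonald's",
    "mac donald": "McDonald's",
    "dominos": "Dominos Pizza",
    "pizza hut": "Pizza Hut",
}

# Illustrative rows: one plain match, one without a match, one in mixed case.
restaurants = pd.DataFrame(
    {"restaurant_name": ["mcdonalds central london",
                         "corner cafe paris",
                         "Pizza Hut downtown new york"]}
)

# Escape each key so characters such as "." or "+" are matched literally.
pat = "(" + "|".join(re.escape(key) for key in brand_dictionary) + ")"

# Extract the first match per row, ignoring case; unmatched rows give NaN.
matched = restaurants["restaurant_name"].str.extract(pat, flags=re.IGNORECASE)[0]

# Lowercase the match so it lines up with the lowercase dictionary keys,
# then replace the NaN of unmatched rows with a placeholder.
restaurants["brand_name"] = matched.str.lower().map(brand_dictionary).fillna("-")
print(restaurants)
```

Lowercasing the extracted text only works here because all dictionary keys are lowercase; with mixed-case keys a different normalisation would be needed.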
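If you want to handle the possibility of several matches in a single row, you can switch to str.extractall and then use an aggregation (for example, list) to store all matched brands.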
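# extractall returns one row per match; group by the original row index to collect them
restaurants['brand_name'] = (restaurants['restaurant_name'].str.extractall(pat)[0].map(brand_dictionary)
                             .groupby(level=0).agg(list))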
               restaurant_name       brand_name
0     mcdonalds central london     [McDonald's]
1  pizza hut downtown new york      [Pizza Hut]
2           dominos new jersey  [Dominos Pizza]
3            mac donald berlin     [McDonald's]
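With str.extractall, rows that have no match at all are absent from the grouped result, so they end up as NaN after assignment. As an alternative sketch (a different technique from the answer above, shown under assumptions), str.findall returns a list per row, so unmatched rows get an empty list. The sample rows and the de-duplication step are illustrative choices.

```python
import pandas as pd

brand_dictionary = {
    "mcdonalds": "McDonald's",
    "mac donald": "McDonald's",
    "dominos": "Dominos Pizza",
    "pizza hut": "Pizza Hut",
}

# Illustrative rows: two brands, no brand, and the same brand spelled two ways.
restaurants = pd.DataFrame(
    {"restaurant_name": ["mcdonalds and pizza hut food court",
                         "corner cafe paris",
                         "mcdonalds next to mac donald berlin"]}
)

# Plain alternation without a capture group; findall returns whole matches.
pat = "|".join(brand_dictionary.keys())

# One list of matched keys per row (empty list when nothing matches).
hits = restaurants["restaurant_name"].str.findall(pat)

# Map keys to brands, drop duplicates, and sort for a stable order.
restaurants["brand_name"] = hits.apply(
    lambda keys: sorted({brand_dictionary[key] for key in keys})
)
print(restaurants)
```

This sketch assumes the column has no missing values; str.findall leaves those as NaN, which the lambda would not handle.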
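[Comments]:
Thanks! I had trouble understanding @Andrej Kesely's answer, but this definitely works, and the aggregation part is helpful too.
[Answer 3]: Well, first make sure all elements in the dictionary are strings: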
brand_dictionary = {
    "mcdonalds": "McDonald's",
    "mac donald": "McDonald's",
    "dominos": "Dominos Pizza",
    "pizza hut": "Pizza Hut"
}
Then we can loop like this:
import numpy as np

brand_names = []
for i in range(restaurants.shape[0]):
    # Read the name in the column
    current_key = restaurants['restaurant_name'].iloc[i]
    try:
        # If that exact name is a key in the dictionary, add its value to the list
        brand_names.append(brand_dictionary[current_key])
    except KeyError:
        # If that key doesn't exist in the dict, add NaN (or 'NA')
        brand_names.append(np.nan)  # or .append('NA')
# Add the final list to the main dataframe
restaurants['BrandNames'] = brand_names
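The loop above looks up each whole restaurant name as a dictionary key, which is the same exact-match behaviour that Series.map gives with a dict. The sketch below, an illustration rather than part of the answer, shows what that means for the question's data: names that merely contain a key produce NaN, and only a name equal to a key is mapped. The extra "dominos" row is an assumption added to show a successful lookup.

```python
import pandas as pd

brand_dictionary = {
    "mcdonalds": "McDonald's",
    "mac donald": "McDonald's",
    "dominos": "Dominos Pizza",
    "pizza hut": "Pizza Hut",
}

# The question's four names plus one name that equals a key exactly.
restaurants = pd.DataFrame(
    {"restaurant_name": ["mcdonalds central london",
                         "pizza hut downtown new york",
                         "dominos new jersey",
                         "mac donald berlin",
                         "dominos"]}
)

# Series.map with a dict treats each whole cell value as the key;
# values that are not keys become NaN.
restaurants["BrandNames"] = restaurants["restaurant_name"].map(brand_dictionary)
print(restaurants)
```

For the substring matching the question asks for, the approaches in the first two answers are the ones that apply.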