How to create a series in a dataframe that returns a value if another series contains a specific string?

Posted 2023-03-11

【Title】: How to create a series in a dataframe that returns a value if another series contains a specific string?
【Asked】: 2021-02-12 04:48:44
【Question】:

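I'm trying to create a series in a dataframe whose value in each row depends on whether the cell in another series contains a given string. Let me explain:

I have a dataframe with the columns "restaurant_name" and "brand_name":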

data = [["mcdonalds central london", ""], ["pizza hut downtown new york" ,""], 
        ["dominos new jersey",""], ["mac donald berlin", ""]]

restaurants = pd.DataFrame(data, columns=['restaurant_name', 'brand_name'])

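I also have a dictionary with search strings as keys and the formatted brand names as values. I want the algorithm to check whether restaurants["restaurant_name"] contains a key from brand_dictionary and, if it does, to write the value for that key into restaurants["brand_name"].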

brand_dictionary = {
    "mcdonalds" : "McDonald's",
    "mac donald" : "McDonald's",
    "dominos" : "Dominos Pizza",
    "pizza hut" : "Pizza Hut"
}

I really have no idea how to do this.

【Answer 1】:

You can search the dictionary with a custom function passed to .apply() (it returns '-' if no brand is found):

import pandas as pd


data = [["mcdonalds central london", ""], ["pizza hut downtown new york" ,""], ["dominos new jersey",""], [ "mac donald berlin", ""] ]

restaurants = pd.DataFrame(data, columns = ['restaurant_name', 'brand_name'])
    
brand_dictionary = {
    "mcdonalds" : "McDonald's",
    "mac donald" : "McDonald's",
    "dominos" : "Dominos Pizza",
    "pizza hut" : "Pizza Hut"
}

def get_name(restaurant, dct):
    for r in dct:
        if r in restaurant:
            return dct[r]
    return '-'

restaurants['brand_name'] = restaurants['restaurant_name'].apply(lambda x: get_name(x, brand_dictionary))
print(restaurants)

Prints:

               restaurant_name     brand_name
0     mcdonalds central london     McDonald's
1  pizza hut downtown new york      Pizza Hut
2           dominos new jersey  Dominos Pizza
3            mac donald berlin     McDonald's
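
The sample data never reaches the fallback branch of get_name. As a minimal sketch, assuming a made-up extra name ("burger king paris", not part of the original data) that contains none of the dictionary keys, the '-' placeholder shows up like this:

```python
import pandas as pd

brand_dictionary = {
    "mcdonalds": "McDonald's",
    "mac donald": "McDonald's",
    "dominos": "Dominos Pizza",
    "pizza hut": "Pizza Hut",
}

def get_name(restaurant, dct):
    # Return the brand for the first dictionary key found inside the name
    for key in dct:
        if key in restaurant:
            return dct[key]
    # No key occurs in the name
    return '-'

# "burger king paris" is a hypothetical row added only for illustration
names = pd.Series(["mcdonalds central london", "burger king paris"])
brands = names.apply(lambda x: get_name(x, brand_dictionary))
print(brands.tolist())
```

The first name resolves to its brand and the second one falls through to '-'. Note that the `in` test is case-sensitive, so names would need to be lowercased first if the data mixes cases.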

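【Answer 2】:

You can use str.extract to pull out the first matching key, then map that match through the dictionary (df below is the restaurants dataframe from the question).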

pat = f'({"|".join(brand_dictionary.keys())})'
#'(mcdonalds|mac donald|dominos|pizza hut)'

df['brand_name'] = df['restaurant_name'].str.extract(pat)[0].map(brand_dictionary)

               restaurant_name     brand_name
0     mcdonalds central london     McDonald's
1  pizza hut downtown new york      Pizza Hut
2           dominos new jersey  Dominos Pizza
3            mac donald berlin     McDonald's
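
For reference, here is a self-contained sketch of the same extract-then-map idea. Two things in it are assumptions that go beyond the answer: the keys are passed through re.escape in case a key ever contains regex metacharacters, and a made-up row ("burger king paris") is added to show what happens when nothing matches.

```python
import re
import pandas as pd

restaurants = pd.DataFrame({
    "restaurant_name": [
        "mcdonalds central london",
        "pizza hut downtown new york",
        "dominos new jersey",
        "mac donald berlin",
        "burger king paris",  # hypothetical row with no known brand
    ]
})

brand_dictionary = {
    "mcdonalds": "McDonald's",
    "mac donald": "McDonald's",
    "dominos": "Dominos Pizza",
    "pizza hut": "Pizza Hut",
}

# Build one alternation pattern with a single capture group.
# re.escape makes every key match literally.
pat = "(" + "|".join(re.escape(key) for key in brand_dictionary) + ")"

# Column 0 holds the text captured by the group (NaN where nothing matched)
matched = restaurants["restaurant_name"].str.extract(pat)[0]
restaurants["brand_name"] = matched.map(brand_dictionary)
print(restaurants)
```

Rows without any matching key end up with NaN in brand_name, which can be replaced afterwards with fillna if a placeholder is preferred.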

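If you want to handle the possibility of several matches on a single row, you can switch to str.extractall and then use an aggregation (for example, list) to store all the matched brands.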

df['brand_name'] = (df['restaurant_name'].str.extractall(pat)[0].map(brand_dictionary)
                      .groupby(level=0).agg(list))

               restaurant_name       brand_name
0     mcdonalds central london     [McDonald's]
1  pizza hut downtown new york      [Pizza Hut]
2           dominos new jersey  [Dominos Pizza]
3            mac donald berlin     [McDonald's]
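
With the original data every list has exactly one element, so the benefit of extractall is not visible. A small sketch, assuming two made-up rows (one naming two brands, one naming none), illustrates the difference:

```python
import pandas as pd

restaurants = pd.DataFrame({
    "restaurant_name": [
        "mcdonalds central london",
        "pizza hut and dominos food court",  # hypothetical row with two brands
        "burger king paris",                 # hypothetical row with no known brand
    ]
})

brand_dictionary = {
    "mcdonalds": "McDonald's",
    "mac donald": "McDonald's",
    "dominos": "Dominos Pizza",
    "pizza hut": "Pizza Hut",
}

pat = "(" + "|".join(brand_dictionary) + ")"

# extractall returns one row per match, indexed by (original row, match number)
all_matches = restaurants["restaurant_name"].str.extractall(pat)[0]

# Map each match to its brand, then collect the brands per original row
brand_lists = all_matches.map(brand_dictionary).groupby(level=0).agg(list)

# Assignment aligns on the index, so rows without matches become NaN
restaurants["brand_name"] = brand_lists
print(restaurants)
```

The second row collects both brands in the order they appear in the name, and the row without a match is left as NaN rather than an empty list.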

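【Comments】:

Thanks! I had a hard time understanding @Andrej Keselys's answer, but this definitely works, and the aggregation part is very helpful too.

【Answer 3】: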

Well, first make sure that all the elements in the dictionary are strings:

brand_dictionary = {
    "mcdonalds" : "McDonald's",
    "mac donald" : "McDonald's",
    "dominos" : "Dominos Pizza",
    "pizza hut" : "Pizza Hut"
}

Then we can loop like this:

import numpy as np

brand_names = []
for i in range(restaurants.shape[0]):
    # Read the restaurant name in the current row
    current_name = restaurants['restaurant_name'].iloc[i]
    # Default to NaN (or 'NA') when no dictionary key occurs in the name
    brand = np.nan
    for key, value in brand_dictionary.items():
        # The keys are substrings of the names, so test containment;
        # an exact lookup brand_dictionary[current_name] would never match
        if key in current_name:
            brand = value
            break
    brand_names.append(brand)

# Add the final list to the main dataframe.
restaurants['BrandNames'] = brand_names
