Pandas - check if dataframe columns contain key:value pairs from a dictionary
Posted: 2017-09-12 16:42:00
Question:
This question is related to another one I posted: Pandas - check if a string column in one dataframe contains a pair of strings from another dataframe
My goal is to check whether two different columns of a dataframe contain a pair of string values and, if the condition is met, extract one of those values.
I have two dataframes like this:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'consumption': ['squirrelate apple', 'monkey likesapple',
    'monkey banana gets', 'badger/getsbanana', 'giraffe eats grass', 'badger apple.loves', 'elephant is huge', 'elephant/eats/', 'squirrel.digsingrass'],
    'name': ['apple', 'appleisred', 'banana is tropical', 'banana is soft', 'lemon is sour', 'washington apples', 'kiwi', 'bananas', 'apples']})
df2 = pd.DataFrame({'food': ['apple', 'apple', 'banana', 'banana'], 'creature': ['squirrel', 'badger', 'monkey', 'elephant']})
In [187]:df1
Out[187]:
consumption name
0 squirrelate apple apple
1 monkey likesapple appleisred
2 monkey banana gets banana is tropical
3 badger/getsbanana banana is soft
4 giraffe eats grass lemon is sour
5 badger apple.loves washington apples
6 elephant is huge kiwi
7 elephant/eats/ bananas
8 squirrel.digsingrass apples
In[188]: df2
Out[188]:
creature food
0 squirrel apple
1 badger apple
2 monkey banana
3 elephant banana
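What I want to do is test whether 'apple' occurs in df1['name'] and 'squirrel' occurs in df1['consumption'], and if both conditions are met, extract 'squirrel' from df1['consumption'] into a new column df1['creature']. The result should look like this: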
Out[189]:
consumption creature name
0 squirrelate apple squirrel apple
1 monkey likesapple NaN appleisred
2 monkey banana gets monkey banana is tropical
3 badger/getsbanana NaN banana is soft
4 giraffe eats grass NaN lemon is sour
5 badger apple.loves badger washington apples
6 elephant is huge NaN kiwi
7 elephant/eats/ elephant bananas
8 squirrel.digsingrass NaN apples
Without the paired-value constraint, I could do something simple like:
np.where((df1['consumption'].str.contains(<creature_string>, case = False)) &
         (df1['name'].str.contains(<food_string>, case = False)),
         df1['consumption'].str.extract(<creature_string>), np.nan)
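To make the single-pair case concrete, here is a minimal sketch that fills in the placeholders with one hard-coded pair. The variables `creature`, `food`, `hit` and `extracted` are illustrative names, not part of the original post, and `Series.where` is used in place of `np.where` only to keep the result a pandas Series.

```python
import pandas as pd

df1 = pd.DataFrame({
    'consumption': ['squirrelate apple', 'monkey likesapple', 'monkey banana gets',
                    'badger/getsbanana', 'giraffe eats grass', 'badger apple.loves',
                    'elephant is huge', 'elephant/eats/', 'squirrel.digsingrass'],
    'name': ['apple', 'appleisred', 'banana is tropical', 'banana is soft',
             'lemon is sour', 'washington apples', 'kiwi', 'bananas', 'apples'],
})

# Hypothetical single pair standing in for <creature_string> / <food_string>.
creature, food = 'squirrel', 'apple'

# Rows where both substrings occur (case-insensitive).
hit = (df1['consumption'].str.contains(creature, case=False)
       & df1['name'].str.contains(food, case=False))

# str.extract needs a capture group, hence the parentheses around the pattern.
extracted = df1['consumption'].str.extract('(' + creature + ')', expand=False)

# Keep the extracted creature only where both conditions hold; NaN elsewhere.
df1['creature'] = extracted.where(hit)
print(df1)
```

This only handles one pair at a time, which is why the paired lookup against df2 is needed.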
But I have to check pairs, so I tried building a dictionary with the foods as keys and the creatures as values, then creating one string of all creatures for a given food key and searching for those with str.contains:
unique_food = df2.food.unique()
food_dict = {elem: pd.DataFrame for elem in unique_food}
for key in food_dict.keys():
    food_dict[key] = df2[:][df2.food == key]
# create key:value pairs of food key and creature strings
food_strings = {}
for key, values in food_dict.items():
    food_strings.update({key: '|'.join(map(str, list(food_dict[key]['creature'].unique())))})
In[199]: food_strings
Out[199]: {'apple': 'squirrel|badger', 'banana': 'monkey|elephant'}
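The same food-to-pattern mapping can probably be built more directly with a groupby. This is a sketch of an alternative, not the poster's method; it assumes the df2 shown above and joins the unique creatures per food with `|`.

```python
import pandas as pd

df2 = pd.DataFrame({'food': ['apple', 'apple', 'banana', 'banana'],
                    'creature': ['squirrel', 'badger', 'monkey', 'elephant']})

# One regex alternation per food: all unique creatures joined with '|'.
food_strings = (
    df2.groupby('food')['creature']
       .agg(lambda s: '|'.join(s.unique()))
       .to_dict()
)
print(food_strings)
```

If creature names could contain regex metacharacters, each one would need `re.escape` before joining.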
The problem comes when I now try to apply str.contains:
for key, value in food_strings.items():
    np.where((df1['name'].str.contains('('+food_strings[key]+')', case = False)) &
             (df1['consumption'].str.contains('('+food_strings[value]+')', case = False)),
             df1['consumption'].str.extract('('+food_strings[value]+')'), np.nan)
I get a KeyError:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-62-7ab718066040> in <module>()
1 for key, value in food_strings.items():
2 np.where((df1['name'].str.contains('('+food_strings[key]+')', case = False)) &
----> 3 (df1['consumption'].str.contains('('+food_strings[value]+')', case = False)), df1['consumption'].str.extract('('+food_strings[value]+')'), np.nan)
KeyError: 'squirrel|badger'
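The traceback follows from how `dict.items()` works: inside the loop, `value` is already the creature pattern, so `food_strings[value]` looks up `'squirrel|badger'` as if it were a key. A small sketch of just that lookup, using the illustrative names `food` and `creature_pattern`:

```python
food_strings = {'apple': 'squirrel|badger', 'banana': 'monkey|elephant'}

for food, creature_pattern in food_strings.items():
    # creature_pattern is the dict value itself, not a key of the dict.
    print(food, '->', creature_pattern, '| is a key:', creature_pattern in food_strings)

# Reproduce the failing lookup in isolation.
try:
    food_strings['squirrel|badger']
    raised = False
except KeyError as exc:
    raised = True
    print('KeyError:', exc)
```

Note also that `food_strings[key]` is the creature pattern, so testing it against df1['name'] compares creatures with food names; the food itself is `key`.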
When I try only the values instead of the keys, it works for the first key:value pair but not for the second:
for key in food_strings.keys():
    df1['test'] = np.where(df1['consumption'].str.contains('('+food_strings[key]+')', case =False),
                           df1['consumption'].str.extract('('+food_strings[key]+')', expand=False),
                           np.nan)
df1
Out[196]:
consumption name test
0 squirrelate apple apple squirrel
1 monkey likesapple appleisred NaN
2 monkey banana gets banana is tropical NaN
3 badger/getsbanana banana is soft badger
4 giraffe eats grass lemon is sour NaN
5 badger apple.loves washington apples badger
6 elephant is huge kiwi NaN
7 elephant/eats/ bananas NaN
8 squirrel.digsingrass apples squirrel
I get the matches for apple and squirrel|badger, but miss banana: monkey|elephant.
Can anyone help?
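One possible reason a loop like this keeps only one pair's matches is that df1['test'] is reassigned on every iteration, so each pass overwrites the previous one. The sketch below accumulates matches across pairs instead and also applies the food condition on df1['name']. It is an assumption-laden illustration, not the accepted answer: `result`, `food_match`, `found` and `mask` are made-up names, and matching is plain case-insensitive substring matching.

```python
import re
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'consumption': ['squirrelate apple', 'monkey likesapple', 'monkey banana gets',
                    'badger/getsbanana', 'giraffe eats grass', 'badger apple.loves',
                    'elephant is huge', 'elephant/eats/', 'squirrel.digsingrass'],
    'name': ['apple', 'appleisred', 'banana is tropical', 'banana is soft',
             'lemon is sour', 'washington apples', 'kiwi', 'bananas', 'apples'],
})
food_strings = {'apple': 'squirrel|badger', 'banana': 'monkey|elephant'}

# Start with all-NaN and fill in matches pair by pair instead of overwriting.
result = pd.Series(np.nan, index=df1.index, dtype=object)
for food, creature_pattern in food_strings.items():
    # The food (dict key) must occur in 'name'.
    food_match = df1['name'].str.contains(re.escape(food), case=False)
    # One of the paired creatures (dict value) must occur in 'consumption'.
    found = df1['consumption'].str.extract(
        '(' + creature_pattern + ')', flags=re.IGNORECASE, expand=False)
    mask = food_match & found.notna()
    # Replace only the rows matched in this iteration.
    result = result.mask(mask, found)

df1['creature'] = result
print(df1)
```

With substring matching, row 8 ('squirrel.digsingrass' / 'apples') comes out as squirrel, as in the answer's output, rather than NaN as in the expected table in the question.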
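Comments:
I think each value of food_dict holds a dataframe rather than a string. The error occurs when the loop enters for key, value in food_dict.items():. You are passing value, a dataframe, to food_strings[value].
@titipat That was a typo, sorry, but good catch. I edited the question and pasted the exact error I get.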
Answer 1:
# drop rows with missing values so the `in` checks below never see None
d1 = df1.dropna()
d2 = df2.dropna()
sump = d1.consumption.values.tolist()
name = d1.name.values.tolist()
cret = d2.creature.values.tolist()
food = d2.food.values.tolist()
# check[i, j] is True when row i of d1 contains both creature j and food j
check = np.array(
    [
        [c in s and f in n for c, f in zip(cret, food)]
        for s, n in zip(sump, name)
    ]
)
# create a new series with the index of `d1` where we dropped na
# then reindex with `df1.index` prior to `assign`
test = pd.Series(check.dot(d2[['creature']].values).ravel(), d1.index)
test = test.reindex(df1.index, fill_value='')
df1.assign(test=test)
consumption name test
0 squirrelate apple apple squirrel
1 monkey likesapple appleisred
2 monkey banana gets banana is tropical monkey
3 badger/getsbanana banana is soft
4 giraffe eats grass lemon is sour
5 badger apple.loves washington apples badger
6 elephant is huge kiwi
7 elephant/eats/ bananas elephant
8 squirrel.digsingrass apples squirrel
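The `check.dot(...)` step may look surprising. As far as I understand it, multiplying a boolean matrix by an object array of strings makes NumPy fall back to Python's `*` and `+`, so `True * 'squirrel'` gives `'squirrel'`, `False * 'squirrel'` gives `''`, and the sum concatenates them. A reduced sketch with hand-written arrays (the `check` matrix here is made up for illustration):

```python
import numpy as np

# Column vector of creature names, object dtype so elements stay Python str.
creatures = np.array([['squirrel'], ['badger'], ['monkey'], ['elephant']], dtype=object)

# One row per df1 row, one column per (creature, food) pair in df2.
check = np.array([
    [True,  False, False, False],   # matches pair 0 only
    [False, False, False, False],   # matches nothing
    [False, False, True,  False],   # matches pair 2 only
    [True,  True,  False, False],   # matches two pairs at once
])

# bool * str repeats the string 0 or 1 times; the dot product concatenates.
result = check.dot(creatures).ravel()
print(result.tolist())
```

The last row shows why a row that matches several pairs ends up with concatenated names.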
Comments:
Hi! Thanks, great solution. One problem: it breaks when the lists contain None values. I get this error: TypeError: argument of type 'NoneType' is not iterable. I built lists without None values, e.g. sump = df1[df1.consumption.notnull()]['consumption'].values.tolist(), for sump, name, cret and food. The check step then works, but in df1.assign I get: ValueError: Length of values does not match length of index. When iterating over zip(sump, name), I somehow have to get a NaN / None value whenever c or s is None, or f or n is None.
dropna doesn't work... then I would be changing the dataframe!
As an update: when I do test = test.reindex(df1.index, fill_value='') followed by df1.assign(test=test), the reindex creates multiple matches, for example: squirrelsquirrelsquirrelsquirrel. Instead I did test = pd.DataFrame(check.dot(d2[['creature']].values).ravel(), d1.index, columns=['some_var']) and then merged on the index: d1 = d1.merge(test, how='left', left_index = True, right_index = True). Does that make sense?
The merge makes sense... however, join "merges" on the index by default and might be a better fit. That said, you shouldn't get multiple matches from the reindex unless you have a non-unique index in df1 to begin with. In the data you've shown, you don't.
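To illustrate the join suggestion, here is a small sketch with a made-up three-row df1 that contains a None. The `test` values are typed in by hand rather than computed, purely to show how results built on the dropna'd index line up with the original rows.

```python
import pandas as pd

# Hypothetical data with a missing value in 'consumption'.
df1 = pd.DataFrame({'consumption': ['squirrelate apple', None, 'monkey banana gets'],
                    'name': ['apple', 'kiwi', 'banana is tropical']})

# A non-unique index is what would multiply rows on reindex or join.
print('index is unique:', df1.index.is_unique)

# Work on the complete rows only; d1 keeps the original index labels (0 and 2).
d1 = df1.dropna()
test = pd.Series(['squirrel', 'monkey'], index=d1.index, name='test')

# join aligns on the index and is a left join by default,
# so the dropped row comes back with NaN in 'test'.
out = df1.join(test)
print(out)
```

Because df1 itself is never modified, the dropped rows are not lost.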