pandas df subset by string in column with lists

Posted 2023-03-12

【Title】: pandas df subset by string in column with lists 【Posted】: 2018-08-28 10:55:49 【Question】:

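I have a large, complex pandas DataFrame with a column X that can contain a list or a list of lists. I'm curious whether a solution can be applied to any content, so here is a mock example in which one element of X is also a plain string: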

import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 1, 3],
    'B': ['a', 'e', 'f'],
    'X': ['something', ['hello'], [['something'], ['hello']]]
})

I want to get a subset of this DataFrame, df2, in which column X contains the substring "hello" when whatever is in it is read as a string.

>>> df2
   A  B                       X
0  1  e                 [hello]
1  3  f  [[something], [hello]]

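I have tried extensive combinations of str(), .str.contains, apply, map, .find(), and list comprehensions, and nothing seems to work without resorting to a loop (related questions here and here). What am I missing?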

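【Question discussion】:

【Answer 1】:

Borrowing from @wim: https://***.com/a/49247980/2336654

The most general solution allows arbitrarily nested lists. Also, we can test whether a string element is equal to the target rather than whether it contains it.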

# This import is for Python 3
# for Python 2 use `from collections import Iterable`
from collections.abc import Iterable

def flatten(collection):
    for x in collection:
        if isinstance(x, Iterable) and not isinstance(x, str):
            yield from flatten(x)
        else:
            yield x

df1[df1.X.map(lambda x: any('hello' == s for s in flatten(x)))]

   A  B                       X
1  1  e                 [hello]
2  3  f  [[something], [hello]]

So now, if we complicate it:

df1 = pd.DataFrame({
    'A': [1, 1, 3, 7, 7],
    'B': ['a', 'e', 'f', 's', 's'],
    'X': [
        'something',
        ['hello'],
        [['something'], ['hello']],
        ['hello world'],
        [[[[[['hello']]]]]]
    ]
})

df1

   A  B                       X
0  1  a               something
1  1  e                 [hello]
2  3  f  [[something], [hello]]
3  7  s           [hello world]
4  7  s     [[[[[['hello']]]]]]

Because it tests for equality, our filter does not pick up 'hello world', but it does pick up the deeply nested 'hello':

df1[df1.X.map(lambda x: any('hello' == s for s in flatten(x)))]

   A  B                       X
1  1  e                 [hello]
2  3  f  [[something], [hello]]
4  7  s     [[[[[['hello']]]]]]
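
The question asked for a substring match, while the filter above tests equality. A minimal sketch of a containment variant follows, using plain Python lists in place of the X column. The helper name `contains_substring` is hypothetical, and `flatten` is modified (an assumption, not part of the original answer) so that a top-level string or a non-iterable value is yielded whole instead of being walked character by character.

```python
from collections.abc import Iterable

def flatten(collection):
    """Yield the leaves of an arbitrarily nested iterable; strings count as leaves."""
    # Change from the answer's version: treat a bare string (or a non-iterable
    # value) as a single leaf rather than iterating over its characters.
    if isinstance(collection, str) or not isinstance(collection, Iterable):
        yield collection
        return
    for x in collection:
        if isinstance(x, Iterable) and not isinstance(x, str):
            yield from flatten(x)
        else:
            yield x

def contains_substring(value, needle):
    """Hypothetical helper: True if any string leaf of `value` contains `needle`."""
    return any(isinstance(s, str) and needle in s for s in flatten(value))

# Same values as the X column in the extended example.
column_x = [
    'something',
    ['hello'],
    [['something'], ['hello']],
    ['hello world'],
    [[[[[['hello']]]]]],
]

matches = [contains_substring(x, 'hello') for x in column_x]
print(matches)
```

On the DataFrame itself, the same predicate could be applied as `df1[df1.X.map(lambda x: contains_substring(x, 'hello'))]`, which would also keep the 'hello world' row.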

【Discussion】:

【Answer 2】:

Add astype(str) before str.contains:

df1[df1.X.astype(str).str.contains('hello')]
Out[538]: 
   A  B                       X
1  1  e                 [hello]
2  3  f  [[something], [hello]]
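
Why this works can be sketched in plain Python, on the assumption that `astype(str)` on an object column amounts to calling `str()` on each element: every cell becomes its printed representation, brackets and quotes included, and the search then runs over that text. The extra value `['othello']` is not in the original data; it is added here only to show that this approach matches substrings anywhere in the text.

```python
# Values as in column X, plus one extra (hypothetical) entry to show a false positive.
values = ['something', ['hello'], [['something'], ['hello']], ['othello']]

# What each cell looks like once converted to a string.
as_text = [str(v) for v in values]
for text in as_text:
    print(repr(text))

# A plain substring test over the converted text.
mask = ['hello' in text for text in as_text]
print(mask)
```

Note that `Series.str.contains` treats its pattern as a regular expression by default; passing `regex=False` makes it a literal substring search, which is the safer choice when the search term may contain characters such as `[` or `.`.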

【Discussion】:

@JRCX yw~ :-) Happy coding

【Answer 3】:

You can use np.ravel() to flatten the nested lists and then use the in operator:

import numpy as np
df1[df1['X'].apply(lambda x: 'hello' in np.ravel(x))]

    A   B   X
1   1   e   [hello]
2   3   f   [[something], [hello]]

【Discussion】: