Counting all the elements in a nested list

Posted 2023-03-11

【Title】: Calculate count of all the elements in nested list
【Posted】: 2018-07-25 02:24:48
【Question】:

I have a list of lists and want to create a DataFrame containing the count of every unique element. Here is my test data:

test = [["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
        ["P1", "P1", "P1"],
        ["P1", "P1", "P1", "P2"],
        ["P4"],
        ["P1", "P4", "P2"],
        ["P1", "P1", "P1"]]

I can do something like this using Counter and a for loop:

from collections import Counter
for item in test:
     print(Counter(item))

But how can I aggregate the results of this loop into a new DataFrame?

The expected output is a DataFrame:

P1 P2 P3 P4
15 4  1  2

【Comments】:

【Answer 1】:

For the best performance, use collections.Counter with itertools.chain.from_iterable:

>>> from collections import Counter
>>> from itertools import chain

>>> Counter(chain.from_iterable(test))
Counter({'P1': 15, 'P2': 4, 'P4': 2, 'P3': 1})

Alternatively, use collections.Counter with a list comprehension (this saves the itertools import and performs about the same):

>>> from collections import Counter

>>> Counter([x for a in test for x in a])
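Counter({'P1': 15, 'P2': 4, 'P4': 2, 'P3': 1})

Read on for more alternative solutions and a performance comparison. (Otherwise, skip the rest.)

Approach 1: Concatenate the sublists into a single list and use collections.Counter to find the counts.

Solution 1: Concatenate the lists with itertools.chain.from_iterable and find the counts with collections.Counter: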

test = [
    ["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
    ["P1", "P1", "P1"],
    ["P1", "P1", "P1", "P2"],
    ["P4"],
    ["P1", "P4", "P2"],
    ["P1", "P1", "P1"]
]

from itertools import chain 
from collections import Counter

my_counter = Counter(chain.from_iterable(test))

Solution 2: Combine the lists with a list comprehension:

from collections import Counter

my_counter = Counter([x for a in test for x in a])

Solution 3: Concatenate the lists with sum:

from collections import Counter

my_counter = Counter(sum(test, []))

Approach 2: Count the elements of each sublist with collections.Counter, then sum the resulting Counter objects.

Solution 4: Build a Counter for each sublist with collections.Counter and map:

from collections import Counter

my_counter = sum(map(Counter, test), Counter())

Solution 5: Build a Counter for each sublist with a list comprehension:

from collections import Counter

my_counter = sum([Counter(t) for t in test], Counter())
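A related option, not part of the original answer, is to keep the for loop from the question and feed every sublist into one running Counter with Counter.update, which adds counts in place instead of building a new Counter per sublist. A minimal sketch, where the name total is chosen only for illustration:

```python
from collections import Counter

test = [
    ["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
    ["P1", "P1", "P1"],
    ["P1", "P1", "P1", "P2"],
    ["P4"],
    ["P1", "P4", "P2"],
    ["P1", "P1", "P1"],
]

# Start from an empty Counter and accumulate each sublist into it.
total = Counter()
for item in test:
    total.update(item)  # update() adds to existing counts rather than replacing them

print(total)
```

Because update mutates the existing object, this avoids the intermediate Counter objects that sum creates at every step.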

In all of the solutions above, my_counter holds the value:

>>> my_counter
Counter({'P1': 15, 'P2': 4, 'P4': 2, 'P3': 1})
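As a quick sanity check, the five solutions can be run side by side on the question's data and compared against the expected counts. This is only a sketch; the dictionary labels and the names results, expected and all_equal are made up for the illustration:

```python
from collections import Counter
from itertools import chain

test = [
    ["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
    ["P1", "P1", "P1"],
    ["P1", "P1", "P1", "P2"],
    ["P4"],
    ["P1", "P4", "P2"],
    ["P1", "P1", "P1"],
]

# One entry per solution; the keys are just labels for this sketch.
results = {
    "chain.from_iterable": Counter(chain.from_iterable(test)),
    "list comprehension": Counter([x for a in test for x in a]),
    "sum of lists": Counter(sum(test, [])),
    "map + sum of Counters": sum(map(Counter, test), Counter()),
    "comprehension + sum of Counters": sum([Counter(t) for t in test], Counter()),
}

expected = Counter({"P1": 15, "P2": 4, "P4": 2, "P3": 1})

# Counter equality compares keys and counts, not insertion order.
all_equal = all(counter == expected for counter in results.values())
print(all_equal)
```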

Performance comparison

Below is a timeit comparison on Python 3, using 1000 sublists with 100 elements each:

Using chain.from_iterable is the fastest (17.1 msec):

mquadri$ python3 -m timeit "from collections import Counter; from itertools import chain; my_list = [list(range(100)) for i in range(1000)]" "Counter(chain.from_iterable(my_list))"
100 loops, best of 3: 17.1 msec per loop

Second is combining the lists with a list comprehension and then counting (a result similar to the one above, but without the extra itertools import) (18.36 msec):

mquadri$ python3 -m timeit "from collections import Counter; my_list = [list(range(100)) for i in range(1000)]" "Counter([x for a in my_list for x in a])"
100 loops, best of 3: 18.36 msec per loop

Third in terms of performance is applying Counter to each sublist inside a list comprehension (162 msec):

mquadri$ python3 -m timeit "from collections import Counter; my_list = [list(range(100)) for i in range(1000)]" "sum([Counter(t) for t in my_list], Counter())"
10 loops, best of 3: 162 msec per loop

Fourth is using Counter with map (the result is very close to the list comprehension version above) (176 msec):

mquadri$ python3 -m timeit "from collections import Counter; my_list = [list(range(100)) for i in range(1000)]" "sum(map(Counter, my_list), Counter())"
10 loops, best of 3: 176 msec per loop

The solution that concatenates the lists with sum is far too slow (526 msec):

mquadri$ python3 -m timeit "from collections import Counter; my_list = [list(range(100)) for i in range(1000)]" "Counter(sum(my_list, []))"
10 loops, best of 3: 526 msec per loop
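Note that the commands above pass the list construction as a timed statement rather than as setup, so building my_list is included in every measurement. A hedged sketch of the same comparison using the timeit module with a separate setup string follows; the smaller input size, the number of runs, and the names SETUP, STATEMENTS and timings are assumptions made to keep the example quick, and absolute numbers will differ by machine and Python version:

```python
import timeit

# Setup runs once per measurement and is not included in the timing.
SETUP = (
    "from collections import Counter\n"
    "from itertools import chain\n"
    "my_list = [list(range(50)) for _ in range(200)]"
)

STATEMENTS = {
    "chain.from_iterable": "Counter(chain.from_iterable(my_list))",
    "list comprehension": "Counter([x for a in my_list for x in a])",
    "sum of Counters": "sum(map(Counter, my_list), Counter())",
    "sum of lists": "Counter(sum(my_list, []))",
}

RUNS = 5
timings = {}
for label, stmt in STATEMENTS.items():
    # timeit.timeit returns the total time in seconds for RUNS executions.
    timings[label] = timeit.timeit(stmt, setup=SETUP, number=RUNS)

# Report fastest first, as milliseconds per execution.
for label, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{label}: {seconds / RUNS * 1000:.3f} msec per loop")
```

The sum-based variants tend to fall behind because each addition builds a new list or Counter that copies everything accumulated so far.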

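【Comments】:

Yes, but it does not sum the Counters; it runs a single Counter over the combined list. I think I should mention it in the answer as well, though (a nice way to skip importing itertools).

【Answer 2】:

Here is another approach, using itertools.groupby: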

>>> from itertools import groupby, chain

>>> out = [(k,len(list(g))) for k,g in groupby(sorted(chain(*test)))]
>>> out
[('P1', 15), ('P2', 4), ('P3', 1), ('P4', 2)]

Convert it to a dictionary like this:

>>> dict(out)
{'P2': 4, 'P3': 1, 'P1': 15, 'P4': 2}

To convert it to a DataFrame, use:

>>> import pandas as pd

>>> pd.DataFrame(dict(out), index=[0])
   P1  P2  P3  P4
0  15   4   1   2
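The sorted call in this answer matters: groupby only groups consecutive equal items, so unsorted input yields one group per run rather than one per distinct value. A small sketch on a made-up list (items is not from the original post) shows the difference:

```python
from itertools import groupby

# Hypothetical input where equal values are not all adjacent.
items = ["P1", "P2", "P1", "P1"]

# Without sorting, each run of equal neighbours becomes its own group.
unsorted_runs = [(k, len(list(g))) for k, g in groupby(items)]

# Sorting first brings equal values together, giving one group per value.
sorted_runs = [(k, len(list(g))) for k, g in groupby(sorted(items))]

print(unsorted_runs)  # [('P1', 1), ('P2', 1), ('P1', 2)]
print(sorted_runs)    # [('P1', 3), ('P2', 1)]
```

Sorting costs O(n log n), which is one reason the Counter-based answers are usually the simpler choice for plain counting.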

【Comments】:

【Answer 3】:

Here is one way.

from collections import Counter
from itertools import chain

test = [["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
        ["P1", "P1", "P1"],
        ["P1", "P1", "P1", "P2"],
        ["P4"],
        ["P1", "P4", "P2"],
        ["P1", "P1", "P1"]]

c = Counter(chain.from_iterable(test))

for k, v in c.items():
    print(k, v)

# P1 15
# P2 4
# P3 1
# P4 2

Output as a DataFrame:

import pandas as pd

df = pd.DataFrame.from_dict(c, orient='index').transpose()

#    P1 P2 P3 P4
# 0  15  4  1  2

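【Comments】:

There is already functionality that handles the import the way you do. It is from itertools import chain.from_iterable as concat

@Ev.Kounis Not exactly, actually; from itertools import chain as concat would be possible, though. I agree that the one-liner they currently have is nasty, but otherwise it is a nice answer (I made an edit, hope that's OK).

You don't need the loop to convert it to a DataFrame: pd.DataFrame.from_dict(c, orient='index').transpose(), or shorter: pd.DataFrame(c, index=[0])

@StefanPochmann It turns out they are renaming itertools.chain.from_iterable, which I always thought was too long. Anyway, I think they were referring to toolz.readthedocs.io/en/latest/api.html#toolz.itertoolz.concat

【Answer 4】:

The set function keeps only the unique elements of a list. So with len(set(mylist)) you get the number of unique elements in the list. Then you just iterate over the sublists.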

dict_nb_item = {}
i = 0
for test_item in test:
    dict_nb_item[i] = len(set(test_item))
    i += 1
print(dict_nb_item)
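To make clear what this loop computes, here is a compact sketch of the same idea using enumerate, next to the number of distinct values across the whole nested list. The names distinct_per_sublist and distinct_overall are chosen for illustration only. Both are counts of distinct values, not the per-element totals asked for in the question:

```python
test = [
    ["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
    ["P1", "P1", "P1"],
    ["P1", "P1", "P1", "P2"],
    ["P4"],
    ["P1", "P4", "P2"],
    ["P1", "P1", "P1"],
]

# Number of distinct values in each sublist, keyed by the sublist's index.
distinct_per_sublist = {i: len(set(sub)) for i, sub in enumerate(test)}

# Number of distinct values across all sublists combined.
distinct_overall = len({x for sub in test for x in sub})

print(distinct_per_sublist)  # {0: 3, 1: 1, 2: 2, 3: 1, 4: 3, 5: 1}
print(distinct_overall)      # 4
```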

【Comments】:

How does this produce the result the OP is after? This one outputs {0: 3, 1: 1, 2: 2, 3: 1, 4: 3, 5: 1} (Python 3); is that what the OP is looking for?
