Python: How do I compare two lists for repeated words?

For example:

q=["hi", "sky"]
p=["here","sky","sky","sky","sky"]

How should the function be defined so that it behaves like this:

count_words(["hi", "sky"], ["here","sky","sky","sky","sky"])
[0, 4]
# answer where hi appears 0 times and sky appears 4 times

I started with code like this:

def count_words(q, p):
    count = 0
    for word in q:
        if q==p:
            (q.count("hi"))
            (q.count("sky"))
        return count

I keep getting a value of 0, which accounts for q, but I can't get a value for p.

Answer

Here is a simpler answer (by simple I mean a one-liner that does not use any extra libraries):

q=["hi", "sky"] 
p=["here","sky","sky","sky","sky"]

def count_words(q,p):
    return [ p.count(i) for i in q ]

print(count_words(q,p))

Output

[0, 4]

Explanation

[ p.count(i) for i in q ] is a list comprehension: it iterates over q and, for each element, counts how many times that element occurs in p.
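For readers less used to comprehensions, the same logic can be written as an explicit loop. This is only a sketch of an equivalent form, not part of the original answer. It also shows the two things that went wrong in the question's attempt: the list q was compared with the list p instead of counting each word, and the return sat inside the loop, so the function exited on the first iteration.

```python
def count_words(q, p):
    """Return, for each word in q, how many times it occurs in p."""
    counts = []
    for word in q:
        # list.count scans p and counts the elements equal to word
        counts.append(p.count(word))
    # Return only after every word in q has been processed
    return counts


print(count_words(["hi", "sky"], ["here", "sky", "sky", "sky", "sky"]))  # [0, 4]
```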

Timings (these depend on the data)

# My solution
1.78 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# @Delirious Solution 1
7.55 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# @Delirious Solution 2
3.86 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Another answer

>>> from collections import Counter
... 
... 
... def count_words(a, b):
...     cnt = Counter(b)
...     return [cnt[word] for word in a]
... 
>>> count_words(["hi", "sky"], ["here", "sky", "sky", "sky", "sky"])
[0, 4]
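The Counter version works for words that never appear in the second list because a Counter returns 0 for a missing key instead of raising KeyError, and the lookup does not add the key. A small illustration using the data from the question:

```python
from collections import Counter

cnt = Counter(["here", "sky", "sky", "sky", "sky"])

print(cnt["sky"])   # 4
print(cnt["hi"])    # 0, no KeyError is raised for a missing key
print("hi" in cnt)  # False: the lookup above did not insert "hi"
```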

Or, if for some reason you can't use collections.Counter:

>>> def count_words(a, b):
...     cnt = {}
...     for word in b:
...         try:
...             cnt[word] += 1
...         except KeyError:
...             cnt[word] = 1
...     return [cnt.get(word, 0) for word in a]
... 
>>> count_words(["hi", "sky"], ["here", "sky", "sky", "sky", "sky"])
[0, 4]

Edit:

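It looks like some timings are needed to "clear up" which solution is more efficient. Since @chrisz forgot to post his actual test code, I had to do it myself.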

Unfortunately, on the largest test below (100,000 random words), Vivek's code takes 11 min 10 s to run, while mine takes only 46.5 ms.

In[18]: def vivek_count_words(q, p):
   ...:     return [p.count(i) for i in q]
   ...: 
In[19]: def lettuce_count_words(a, b):
   ...:     cnt = {}
   ...:     for word in b:
   ...:         try:
   ...:             cnt[word] += 1
   ...:         except KeyError:
   ...:             cnt[word] = 1
   ...:     return [cnt.get(word, 0) for word in a]
   ...: 
In[20]: # https://www.gutenberg.org/files/2701/2701-0.txt
   ...: with open('moby_dick.txt', 'r') as f:
   ...:     moby_dick_words = []
   ...:     for line in f:
   ...:         moby_dick_words.extend(line.rstrip().split())
   ...: 
In[21]: len(moby_dick_words)
Out[21]: 215829
In[22]: from random import choice

In[23]: random_words = [choice(moby_dick_words) for _ in range(10)]
In[24]: %timeit vivek_count_words(moby_dick_words, random_words)
31.3 ms ± 99.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In[25]: %timeit lettuce_count_words(moby_dick_words, random_words)
20.7 ms ± 54.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In[26]: random_words = [choice(moby_dick_words) for _ in range(100)]
In[27]: %timeit vivek_count_words(moby_dick_words, random_words)
211 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In[28]: %timeit lettuce_count_words(moby_dick_words, random_words)
20.6 ms ± 68.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In[29]: random_words = [choice(moby_dick_words) for _ in range(1000)]
In[30]: %timeit vivek_count_words(moby_dick_words, random_words)
2.18 s ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In[31]: %timeit lettuce_count_words(moby_dick_words, random_words)
22.2 ms ± 97.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In[32]: random_words = [choice(moby_dick_words) for _ in range(10000)]
In[33]: %timeit vivek_count_words(moby_dick_words, random_words)
29.2 s ± 865 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In[34]: %timeit lettuce_count_words(moby_dick_words, random_words)
25.7 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In[35]: random_words = [choice(moby_dick_words) for _ in range(100000)]
In[36]: %timeit vivek_count_words(moby_dick_words, random_words)
11min 10s ± 7.51 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In[37]: %timeit lettuce_count_words(moby_dick_words, random_words)
46.5 ms ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
