PYTHON: How do I count how often the words of one list appear in another list?
For example:
q=["hi", "sky"]
p=["here","sky","sky","sky","sky"]
What function definition would produce this result:
count_words(["hi", "sky"], ["here","sky","sky","sky","sky"])
[0, 4]
# answer where hi appears 0 times and sky appears 4 times
I started with code like this:
def count_words(q, p):
    count = 0
    for word in q:
        if q==p:
            (q.count("hi"))
            (q.count("sky"))
    return count
I keep getting a value of 0, which accounts for q, but I can't get the values for p.
Answer
Here is a simpler answer (by simple I mean a one-liner that uses no extra libraries):
q=["hi", "sky"]
p=["here","sky","sky","sky","sky"]
def count_words(q,p):
    return [ p.count(i) for i in q ]
print(count_words(q,p))
Output
[0, 4]
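Explanation
[ p.count(i) for i in q ] is a list comprehension: it iterates over the list q on the fly and counts how many times each of its elements occurs in p.
Timings (these depend on the data)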
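For readers less familiar with comprehensions, the one-liner can be unrolled into an explicit loop. This is only a sketch of the same idea; the name count_words_loop is made up for illustration and does not come from the original answer.

```python
def count_words_loop(q, p):
    """Return, for each word in q, how many times it occurs in p."""
    counts = []
    for word in q:
        # list.count scans all of p, so each lookup costs O(len(p)).
        counts.append(p.count(word))
    return counts


q = ["hi", "sky"]
p = ["here", "sky", "sky", "sky", "sky"]
print(count_words_loop(q, p))  # [0, 4]
```

Unlike the attempt in the question, the loop uses the loop variable word and counts in p rather than in q.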
# My solution
1.78 µs ± 214 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# @Delirious Solution 1
7.55 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# @Delirious Solution 2
3.86 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Another answer
>>> from collections import Counter
...
...
... def count_words(a, b):
...     cnt = Counter(b)
...     return [cnt[word] for word in a]
...
>>> count_words(["hi", "sky"], ["here", "sky", "sky", "sky", "sky"])
[0, 4]
Or, if for some reason you can't use collections.Counter:
>>> def count_words(a, b):
...     cnt = {}
...     for word in b:
...         try:
...             cnt[word] += 1
...         except KeyError:
...             cnt[word] = 1
...     return [cnt.get(word, 0) for word in a]
...
>>> count_words(["hi", "sky"], ["here", "sky", "sky", "sky", "sky"])
[0, 4]
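Edit:
It looks like some timings are in order to "clear up" which solution is more efficient. Since @chrisz forgot to post his actual test code, I had to do it myself.
Unfortunately, Vivek's code takes 11min 10s to run in the largest test, while mine takes only 46.5 ms.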
In[18]: def vivek_count_words(q, p):
   ...:     return [p.count(i) for i in q]
   ...:
In[19]: def lettuce_count_words(a, b):
   ...:     cnt = {}
   ...:     for word in b:
   ...:         try:
   ...:             cnt[word] += 1
   ...:         except KeyError:
   ...:             cnt[word] = 1
   ...:     return [cnt.get(word, 0) for word in a]
   ...:
In[20]: # https://www.gutenberg.org/files/2701/2701-0.txt
   ...: with open('moby_dick.txt', 'r') as f:
   ...:     moby_dick_words = []
   ...:     for line in f:
   ...:         moby_dick_words.extend(line.rstrip().split())
   ...:
In[21]: len(moby_dick_words)
Out[21]: 215829
In[22]: from random import choice
In[23]: random_words = [choice(moby_dick_words) for _ in range(10)]
In[24]: %timeit vivek_count_words(moby_dick_words, random_words)
31.3 ms ± 99.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In[25]: %timeit lettuce_count_words(moby_dick_words, random_words)
20.7 ms ± 54.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In[26]: random_words = [choice(moby_dick_words) for _ in range(100)]
In[27]: %timeit vivek_count_words(moby_dick_words, random_words)
211 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In[28]: %timeit lettuce_count_words(moby_dick_words, random_words)
20.6 ms ± 68.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In[29]: random_words = [choice(moby_dick_words) for _ in range(1000)]
In[30]: %timeit vivek_count_words(moby_dick_words, random_words)
2.18 s ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In[31]: %timeit lettuce_count_words(moby_dick_words, random_words)
22.2 ms ± 97.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In[32]: random_words = [choice(moby_dick_words) for _ in range(10000)]
In[33]: %timeit vivek_count_words(moby_dick_words, random_words)
29.2 s ± 865 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In[34]: %timeit lettuce_count_words(moby_dick_words, random_words)
25.7 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In[35]: random_words = [choice(moby_dick_words) for _ in range(100000)]
In[36]: %timeit vivek_count_words(moby_dick_words, random_words)
11min 10s ± 7.51 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In[37]: %timeit lettuce_count_words(moby_dick_words, random_words)
46.5 ms ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
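The gap in these timings follows from how the work scales: p.count(i) rescans the whole of p for every element of q, which is roughly len(q) * len(p) comparisons, whereas building a lookup table once needs a single pass over each list. A minimal sketch with the standard timeit module is below. The synthetic vocabulary, the list sizes and the function names are assumptions chosen for illustration, not the Moby Dick setup above, and absolute numbers will vary by machine.

```python
import random
import timeit
from collections import Counter


def count_with_list_count(q, p):
    # Rescans p once per element of q: about len(q) * len(p) comparisons.
    return [p.count(word) for word in q]


def count_with_counter(q, p):
    # One pass over p to build the table, then one lookup per element of q.
    table = Counter(p)
    return [table[word] for word in q]


# Synthetic data (hypothetical sizes, kept small so the script finishes quickly).
random.seed(0)
vocabulary = [f"word{i}" for i in range(500)]
q = [random.choice(vocabulary) for _ in range(2000)]
p = [random.choice(vocabulary) for _ in range(2000)]

# Both approaches must agree before their speed is compared.
assert count_with_list_count(q, p) == count_with_counter(q, p)

for func in (count_with_list_count, count_with_counter):
    seconds = timeit.timeit(lambda: func(q, p), number=3)
    print(f"{func.__name__}: {seconds / 3:.4f} s per call")
```

For short lists like the ones in the question, the one-liner can still be the faster choice, since building the table has its own fixed cost.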