My code works for a small sample but not for a large one

Posted 2023-03-11


【Asked】: 2022-01-11 11:04:01 【Question】:

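I am trying to count how often each word occurs in a variable. The variable holds more than 700,000 observations. The output should be a dictionary containing the most frequent words. I use the code below to do this: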

d1 = {}
for i in range(len(words)-1):
    x=words[i]
    c=0
    for j in range(i,len(words)):
        c=words.count(x)
    count=dict({x:c})
    if x not in d1.keys():
        d1.update(count)

I have run the code on the first 1,000 observations and it works fine. The output looks like this:

[('semantic', 23),
 ('representations', 11),
 ('models', 10),
 ('task', 10),
 ('data', 9),
 ('parser', 9),
 ('language', 8),
 ('languages', 8),
 ('paper', 8),
 ('meaning', 8),
 ('rules', 8),
 ('results', 7),
 ('performance', 7),
 ('parsing', 7),
 ('systems', 7),
 ('neural', 6),
 ('tasks', 6),
 ('entailment', 6),
 ('generic', 6),
 ('te', 6),
 ('natural', 5),
 ('method', 5),
 ('approaches', 5)]

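When I try to run it on 100,000 observations, it just keeps running. I have let it run for more than 24 hours and it still has not finished. Does anyone have an idea?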

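【Comments】:

Define a dictionary and iterate over the list once. Each time you see a new word, add it as a key with a value of 1; otherwise, if the word is already in the dictionary, increment its value.

That makes sense. I'm fairly new to Python, so could you perhaps help me with the code?

【Answer 1】: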

You can use collections.Counter.

from collections import Counter

counts = Counter(words)
print(counts.most_common(20))
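As a quick illustration of what the two lines above return, here is a minimal sketch on a tiny made-up list (the sample `words` and the name `top_two` are assumptions for the example, not the asker's data):

```python
from collections import Counter

# Hypothetical sample data standing in for the real `words` list
words = ["dog", "able", "dog", "charlie", "able", "dog"]

counts = Counter(words)          # a single pass over the list
top_two = counts.most_common(2)  # list of (word, count) pairs, highest count first
print(top_two)                   # [('dog', 3), ('able', 2)]
```

A Counter is a dict subclass, so `counts["dog"]` works like a normal lookup, and a word that never appeared returns 0 instead of raising KeyError.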


【Answer 2】:

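@Jon's answer is the best one for you, but in some cases collections.Counter can be slower than a plain loop (especially if you don't need to sort by frequency afterwards), as I asked in this question.

You can compute the frequencies by iterating.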

d1 = {}
for item in words:
  if item in d1:
    d1[item] += 1
  else:
    d1[item] = 1

# finally sort the dictionary of frequencies
print(dict(sorted(d1.items(), key=lambda item: item[1])))

But again, for your case, @Jon's answer is faster and more compact.
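If you want to check which of the two is faster on your own machine, a rough timing sketch might look like this. The random data, the vocabulary size, and the function names are all made up for the comparison, and the result will depend on your Python version and data:

```python
import random
import timeit
from collections import Counter

# Hypothetical data: 100,000 words drawn from a 1,000-word vocabulary
random.seed(0)
vocab = [f"w{i}" for i in range(1000)]
words = [random.choice(vocab) for _ in range(100_000)]

def count_with_loop(items):
    """Plain dictionary loop, as in this answer."""
    counts = {}
    for item in items:
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
    return counts

def count_with_counter(items):
    """collections.Counter, as in @Jon's answer."""
    return Counter(items)

# Both approaches should produce identical counts
loop_result = count_with_loop(words)
counter_result = count_with_counter(words)
same = loop_result == dict(counter_result)

# Time five runs of each; no output is claimed here, numbers vary by machine
loop_time = timeit.timeit(lambda: count_with_loop(words), number=5)
counter_time = timeit.timeit(lambda: count_with_counter(words), number=5)
print(f"same result: {same}  loop: {loop_time:.3f}s  Counter: {counter_time:.3f}s")
```

Either way, both finish in a fraction of a second on 100,000 words, so the choice between them matters far less than getting rid of the nested loops.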


【Answer 3】:

#...
for i in range(len(words)-1):
    #...
    #...
    for j in range(i,len(words)):
        c=words.count(x)
    #...
    if x not in d1.keys():
        #...

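I've tried to highlight above where your code runs into trouble. In plain English, it reads like:

"For every word in the whole list, count the occurrences of the word I'm looking at, and repeat that count once for each word that comes after it. Also, look through the entire dictionary I'm building, again for every word in the list, while I'm still building it."

This is far more work than you need to do; you only need to look at each word in the list once. You do need to look each word up in the dictionary once, but checking d1.keys() can slow things down a lot: in Python 2 it copies the keys into a new list and scans the whole thing. (In Python 3, keys() returns a view and the lookup is fast, but writing "x not in d1" is still the idiomatic form.) The following code does what you want much faster: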

words = ['able', 'baker', 'charlie', 'dog', 'easy', 'able', 'charlie', 'dog', 'dog']

word_counts = {}

# Look at each word in our list once
for word in words:
    # If we haven't seen it before, create a new count in our dictionary
    if word not in word_counts:
        word_counts[word] = 0

    # We've made sure our count exists, so just increment it by 1
    word_counts[word] += 1

print(word_counts.items())

The example above gives the following pairs (their order may differ depending on your Python version):

[
    ('charlie', 2),
    ('baker', 1),
    ('able', 2),
    ('dog', 3),
    ('easy', 1)
]
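To put a rough number on the extra work described above, here is a small sketch that counts how many times the original nested loops call words.count(x), and how many list elements those calls scan in total. The function names are made up for this illustration, and it assumes each count() call scans the whole list:

```python
def count_calls(n):
    """Number of words.count(x) calls the original nested loops make for a list of length n."""
    # Outer loop: i in range(n - 1); inner loop: j in range(i, n), i.e. n - i iterations
    return sum(n - i for i in range(n - 1))

def element_comparisons(n):
    """Total list elements scanned, assuming every count() call walks all n elements."""
    return count_calls(n) * n

for n in (1_000, 100_000):
    print(f"n={n:>7,}: nested loops scan {element_comparisons(n):,} elements, "
          f"a single pass looks at {n:,}")
```

The work grows with roughly the cube of the list length, so going from 1,000 to 100,000 words multiplies it by about a million (around 5 × 10^14 element scans). That is why the 1,000-word test finished and the larger run did not, while the single-pass loop only does 100,000 dictionary lookups.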
