Reasons for this disparity in execution speed?

Posted 2023-03-06


[Posted]: 2009-12-07 23:57:48
[Question]:

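I wrote a quick Python script to compare two files, each containing unordered hashes, to verify that the two files are identical apart from order. I then rewrote it in Ruby for educational purposes.

The Python implementation takes seconds, while the Ruby implementation takes around 4 minutes.

I have a feeling this is most likely due to my lack of Ruby knowledge. Any ideas on what I am doing wrong?

The environment is Windows XP x64, Python 2.6, Ruby 1.8.6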

Python

f = open('c:\\file1.txt', 'r')

hashes = dict()

for line in f.readlines():
    if not line in hashes:
        hashes[line] = 1
    else:
        hashes[line] += 1


print "Done file 1"

f.close()

f = open('c:\\file2.txt', 'r')

for line in f.readlines():
    if not line in hashes:
        print "Hash not found!"
    else:
        hashes[line] -= 1

f.close()

print "Done file 2"

num_errors = 0

for key in hashes.keys():
    if hashes[key] != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

print "Total of %d mismatches found" % num_errors

Ruby

file = File.open("c:\\file1.txt", "r")
hashes = {}

file.each_line { |line|
  if hashes.has_key?(line)
    hashes[line] += 1
  else
    hashes[line] = 1
  end
}

file.close()

puts "Done file 1"

file = File.open("c:\\file2.txt", "r")

file.each_line { |line|
  if hashes.has_key?(line)
    hashes[line] -= 1
  else
    puts "Hash not found!"
  end
}

file.close()

puts "Done file 2"

num_errors = 0
hashes.each_key { |key|
  if hashes[key] != 0
    num_errors += 1
  end
}

puts "Total of #{num_errors} mismatches found"

Edit: To give an idea of scale, each file is quite large, with over 900 000 hashes.

Progress

Using nathanvda's suggestions, here is the optimized Ruby script:

f1 = "c:\\file1.txt"
f2 = "c:\\file2.txt"

hashes = Hash.new(0)

File.open(f1, "r") do |f|
  while line = f.gets
    hashes[line] += 1
  end
end  

not_founds = 0

File.open(f2, "r") do |f|
  while line = f.gets
    if hashes.has_key?(line)
      hashes[line] -= 1
    else
      not_founds += 1
    end
  end
end

num_errors = hashes.values.to_a.select { |z| z != 0 }.size

puts "Total of #{not_founds} lines not found in file2"
puts "Total of #{num_errors} mismatches found"

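On Windows with Ruby 1.8.7, the original version took 250 seconds and the optimized version took 223 seconds.

On a Linux VM running Ruby 1.9.1, the original version ran in 81 seconds, roughly 1/3 of the Windows 1.8.7 time. Interestingly, the optimized version took longer, at 89 seconds. Note that the while line = ... form was required because of memory constraints.

On Windows with Ruby 1.9.1, the original version took 457 seconds and the optimized version took 543 seconds.

On Windows with jRuby, the original version took 45 seconds and the optimized version took 43 seconds.

I am somewhat surprised by the results; I expected 1.9.1 to be an improvement over 1.8.7.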

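[Comments on the question]:

There are a number of other Ruby VMs available, and most, if not all, of them are faster than Ruby 1.8.6 on Windows. IronRuby is probably your best bet. See antoniocangiano.com/2009/08/03/…

Using readlines() on a file reads the whole file into memory and creates a huge list. As I showed in my answer, you can iterate over the file one line at a time. That may actually be slower than using readlines, but it is more memory efficient.

The first thing you should do before posting is simplify the problem. Remove the comparison and the second file entirely and you will still see the difference, which makes it easier for everyone. From there, it is easy to see that both Python's file reading and its dicts are much faster than Ruby's (1.8's, at least). In this case, Python is 2-3 times faster for me; the extra code only changes the scale for me, not the factor.

Unrelated warning: readlines and each_line return lines including any trailing newline. If the last hash in the file is not followed by a newline terminator, it will come back without \n and will not match an earlier hash from another line that does have \n.

@bobince: Variations in the amount of trailing whitespace on lines, etc. It depends on how the OP defines "same": line.rstrip(), line.rstrip('\n'), or no strip at all.

[Answer 1]: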

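I find the reference implementation of Ruby (well, Ruby) to be slow (unscientifically speaking).

If you get the chance, try running your program under JRuby! Charles Nutter and other Sun folks claim to have sped Ruby up considerably.

I would be most interested in your results.

[Comments]:

Ruby 1.8.6 is particularly slow. The Python version in question is much newer. Ruby 1.9 will surely do better.

I tried jRuby, and while it was much faster than the 1.8.6 interpreter, at ~45 seconds it was still 3x slower than Python (~15 seconds).

Interesting. Thanks a lot!

[Answer 2]: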

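This is probably because dicts in Python are much faster than hashes in Ruby.

I just ran a quick test: building a hash of 12345678 items took 3 times as long in Ruby 1.8.7 as in Python. Ruby 1.9 took about twice as long as Python.

Here is how I tested. Python:

$ time python -c "d={}
for i in xrange(12345678):d[i]=1"

Ruby:

$ time ruby -e "d={};12345678.times{|i|d[i]=1}"

But that is not enough to account for your discrepancy.

Perhaps file I/O is worth looking into: comment out all the hash code and see how long the empty loops take to run over the files.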
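A minimal sketch of that experiment in Ruby, assuming a generated temporary file in place of the real c:\file1.txt. One pass only reads the lines, the other reads and counts them. The helper names (elapsed, read_only, read_and_count) and the sample data are made up for illustration and are not part of the original scripts.

```ruby
require 'tempfile'

# Returns the wall-clock seconds the given block took to run.
def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Pass 1: read every line but do nothing with it (file I/O only).
def read_only(path)
  lines = 0
  File.foreach(path) { |_line| lines += 1 }
  lines
end

# Pass 2: read every line and count it in a hash (file I/O plus hashing).
def read_and_count(path)
  counts = Hash.new(0)
  File.foreach(path) { |line| counts[line] += 1 }
  counts
end

# Hypothetical sample data: 20 000 lines of 32-character hex strings.
sample = Tempfile.new('hashes')
20_000.times { |i| sample.puts((i % 5_000).to_s(16).rjust(32, '0')) }
sample.close

line_count = nil
counts = nil
io_seconds    = elapsed { line_count = read_only(sample.path) }
total_seconds = elapsed { counts = read_and_count(sample.path) }

puts "lines read:       #{line_count}"
puts "distinct lines:   #{counts.size}"
puts format('I/O only:         %.3fs', io_seconds)
puts format('I/O plus hashing: %.3fs', total_seconds)

sample.unlink
```

If the two timings are close, most of the cost is in reading the file; if the second is much larger, the hash work dominates.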

Here is another Python version using defaultdict and context managers:

from collections import defaultdict
hashes = defaultdict(int)

with open('c:\\file1.txt', 'r') as f:
    for line in f:
        hashes[line] += 1

print "Done file 1"

with open('c:\\file2.txt', 'r') as f:
    for line in f:
        if line in hashes:
            hashes[line] -= 1
        else:
            print "Hash not found!"

print "Done file 2"

num_errors = 0
for key,value in hashes.items():  # hashes.iteritems() might be better here
    if value != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

print "Total of %d mismatches found" % num_errors

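[Comments]:

I believe you meant for key, value in hashes.items() (or hashes.iteritems()) rather than hashes.keys().

In his example, he is hashing strings. Changing your Python test to use d[str(i)]=1 quadrupled my time. I do not know enough Ruby to change the Ruby example and compare.

@Mark, for Ruby it should be i.to_s

@gnibbler, my times: Python (int): 0m4.359s, Python (str): 0m16.202s, Ruby (int): 0m8.092s, Ruby (str): 0m36.648s. It seems they scale about the same.

[Answer 3]:

On the Python side, you could iterate over the dictionary items like this: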

for key, value in hashes.iteritems():
    if value != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

Also:

for line in f.readlines():
    hashes[line] = hashes.setdefault(line, 0) + 1

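...but I can't help you on the Ruby side, other than to suggest you hunt down a profiler.

[Comments]:

If the dictionary has a lot of elements, you should use hashes.iteritems() instead.

[Answer 4]:

I am not a Ruby expert, so someone please correct me if I am wrong:

I see a small optimization potential.

If you say

hashes = Hash.new(0)

then a reference to an undefined key will return 0 and store the key, so you can do

hashes[line] += 1

every time, without the enclosing if and else.

Caveat: untested!

If storing the key does not happen automatically, there is another Hash constructor that takes a block, in which you can do it explicitly.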
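A short sketch of the two constructors, to make the caveat concrete. In current Ruby, a plain read on a Hash.new(0) returns 0 without storing the key; it is the assignment hidden inside += that stores it. The block form stores the key on the first read. The count_lines helper is a hypothetical name used only for this illustration.

```ruby
# Default-value form: a missing key reads as 0, but the read stores nothing.
counts = Hash.new(0)
value = counts["missing"]
puts value                   # prints 0
puts counts.key?("missing")  # prints false

# `counts[k] += 1` expands to `counts[k] = counts[k] + 1`,
# so the assignment is what stores the key.
counts["abc"] += 1
puts counts.key?("abc")      # prints true

# Block form: the block runs on a missing key and stores it explicitly.
stored = Hash.new { |hash, key| hash[key] = 0 }
stored["missing"]
puts stored.key?("missing")  # prints true

# Counting without the if/else branch, as suggested above.
def count_lines(lines)
  counts = Hash.new(0)
  lines.each { |line| counts[line] += 1 }
  counts
end

p count_lines(%w[a b a])
```

For the counting loop the default-value form is enough, since every access goes through +=.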

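[Comments]:

Similarly, collections.defaultdict(lambda: 0) for Python. docs.python.org/library/collections.html#defaultdict-objects

I believe it is more typically written as collections.defaultdict(int).

[Answer 5]:

Python's dictionaries are very fast. See How are Python's Built In Dictionaries Implemented. Ruby's may not be as snappy.

I doubt it is the hash function. There is no way the Ruby developers would have a hash function that is an order of magnitude worse than Python's.

Perhaps Ruby 1.8 is slow at dynamically resizing large hash tables? How does your problem scale with smaller files?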
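One way to check the scaling, sketched in Ruby under the assumption that randomly generated 32-character hex lines are a fair stand-in for the real hashes. The helper names sample_lines and timed_count and the chosen sizes are illustrative only.

```ruby
require 'securerandom'

# Builds n pseudo-random 32-character hex lines (stand-ins for the real hashes).
def sample_lines(n)
  Array.new(n) { SecureRandom.hex(16) + "\n" }
end

# Counts occurrences of each line; returns [distinct_keys, elapsed_seconds].
def timed_count(lines)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  counts = Hash.new(0)
  lines.each { |line| counts[line] += 1 }
  [counts.size, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

# Time the same loop at growing input sizes and report the cost per line.
[1_000, 10_000, 100_000].each do |n|
  distinct, seconds = timed_count(sample_lines(n))
  puts format('%7d lines  %7d keys  %.4fs  %.2f us/line',
              n, distinct, seconds, seconds / n * 1e6)
end
```

If the time per line stays roughly flat, the growth is linear. A sharp rise at the largest size would be consistent with a resizing cost or with hitting a memory limit, but the sketch alone cannot tell those two apart.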

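[Comments]:

In terms of scale, the difference between inputs of 1 000 and 10 000 is small. The difference between 10 000 and 100 000 is almost linear. The difference between 100 000 and nearly 1 000 000 is huge. This suggests it is growing exponentially.

@t_scho: Could the latter also indicate that it is hitting some kind of limit (RAM, for example)?

[Answer 6]:

I was able to speed up your Ruby code a little, as follows: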

require 'benchmark'

Benchmark.bm(10) do |x|

  x.report("original version") do
    file = File.open("c:\\file1.txt", "r")
    hashes = {}

    file.each_line { |line|
      if hashes.has_key?(line)
        hashes[line] += 1
      else
        hashes[line] = 1
      end
    }

    file.close()

    #puts "Done file 1"

    not_founds = 0

    file = File.open("c:\\file2.txt", "r")

    file.each_line { |line|
      if hashes.has_key?(line)
        hashes[line] -= 1
      else
        not_founds += 1
      end
    }

    file.close()

    #puts "Done file 2"

    num_errors = 0
    hashes.each_key { |key|
      if hashes[key] != 0
        num_errors += 1
      end
    }

    puts "Total of #{not_founds} lines not found in file2"
    puts "Total of #{num_errors} mismatches found"

  end


  x.report("my speedup") do
    hashes = {}
    File.open("c:\\file1.txt", "r") do |f|
      lines = f.readlines
      lines.each { |line|
        if hashes.has_key?(line)
          hashes[line] += 1
        else
          hashes[line] = 1
        end
      }
    end

    not_founds = 0

    File.open("c:\\file2.txt", "r") do |f|
      lines = f.readlines
      lines.each { |line|
        if hashes.has_key?(line)
          hashes[line] -= 1
        else
          not_founds += 1
        end
      }
    end

    num_errors = hashes.values.to_a.select { |z| z != 0 }.size

    puts "Total of #{not_founds} lines not found in file2"
    puts "Total of #{num_errors} mismatches found"

  end

end

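So I read the files in one big chunk, which in my case is a bit faster (I tested on Windows XP, Ruby 1.8.6, with a file of 100000 lines). I benchmarked all the different ways of reading files that I could think of, and this was the fastest. I also sped up the counting of the hash values, but that is only visible when you do it for very large numbers :)

So I get a very small speed increase here. The output on my machine is as follows: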

                user     system      total        real
original versionTotal of 16 lines not found in file2
Total of 4 mismatches found
   1.000000   0.015000   1.015000 (  1.016000)
my speedup v1Total of 16 lines not found in file2
Total of 4 mismatches found
   0.812000   0.047000   0.859000 (  0.859000)

Does anyone have ideas for further improvement?

If f.readlines gets slower because of the file size, I found that

File.open("c:\\file2.txt", "r") do |f|
  while (line=f.gets)
    if hashes.has_key?(line)
      hashes[line] -= 1
    else
      not_founds += 1
    end
  end
end

is just a little faster for me.

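I was trying to think of a way to improve the

if hashes.has_key?(line) ...

code a bit, but could not come up with anything.

Have you tried using Ruby 1.9?

I have a Windows 7 virtual machine with Ruby 1.9.1, and f.readlines was slower there; because of memory constraints I needed to use while (line=f.gets) :)

Since many Ruby users test mainly on Unix-related platforms, I guess that could explain why the code is sub-optimal on Windows. Has anyone compared the performance described above on Unix? Is this a Ruby vs. Python issue, or a Ruby-on-Windows vs. Ruby-on-Unix issue?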

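[Comments]:

[Answer 7]:

I would bet that the results for Ruby 1.9.x (which is faster than or comparable to Python in most areas) are caused by the extra overhead of the hash/dictionary implementation, because in Ruby, unlike Python, hashes are ordered.

[Comments]:

[Answer 8]:

I will try to do a benchmark in my copious free time, but try using group_by. Not only is it more in the style of functional programming, I have also found it to be much faster than the procedural version in MRI.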

def convert_to_hash(file)
  values_hash = file.each_line.group_by { |line| line }
  # Hash.[] converts an array of pairs into a hash
  count_hash = Hash[ values_hash.map { |line, lines| [line, lines.length] } ]
  count_hash
end

hash1 = convert_to_hash(file)
hash2 = convert_to_hash(file2)
# compare if the two hashes are equal
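The comparison step left as a comment above could look like the following sketch. It uses in-memory StringIO objects instead of the real files so it can run anywhere; line_counts and same_lines? are hypothetical helper names, and the sample contents are invented.

```ruby
require 'stringio'

# Counts each distinct line of an IO-like object via group_by.
def line_counts(io)
  io.each_line
    .group_by { |line| line }
    .map { |line, group| [line, group.length] }
    .to_h
end

# True when both inputs hold the same lines with the same multiplicities,
# in any order. Hash#== compares contents and ignores insertion order.
def same_lines?(io_a, io_b)
  line_counts(io_a) == line_counts(io_b)
end

a = StringIO.new("h1\nh2\nh1\n")
b = StringIO.new("h2\nh1\nh1\n")
c = StringIO.new("h1\nh2\nh2\n")

puts same_lines?(a, b)  # prints true: same lines, different order
a.rewind                # a was consumed by the first comparison
puts same_lines?(a, c)  # prints false: the counts differ
```

With real files, File.open(path) { |f| line_counts(f) } would take the place of the StringIO objects. Note that group_by keeps every line in memory until the counts are built.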

[Comments]:
