In Python, is there a concise way of comparing whether the contents of two text files are the same?

Posted 2023-02-15

【Title】: In Python, is there a concise way of comparing whether the contents of two text files are the same? 【Posted】: 2010-09-20 05:46:33 【Question】:

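I don't care what the differences are. I just want to know whether the contents differ.

【Comments on the question】:

【Answer 1】:

A simple and efficient solution: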

import os


def is_file_content_equal(
    file_path_1: str, file_path_2: str, buffer_size: int = 1024 * 8
) -> bool:
    """Checks if two files content is equal
    Arguments:
        file_path_1 (str): Path to the first file
        file_path_2 (str): Path to the second file
        buffer_size (int): Size of the buffer to read the file
    Returns:
        bool that indicates if the file contents are equal
    Example:
        >>> is_file_content_equal("filecomp.py", "filecomp copy.py")
        True
        >>> is_file_content_equal("filecomp.py", "diagram.dio")
        False
    """
    # First check sizes
    s1, s2 = os.path.getsize(file_path_1), os.path.getsize(file_path_2)
    if s1 != s2:
        return False
    # If the sizes are the same check the content
    with open(file_path_1, "rb") as fp1, open(file_path_2, "rb") as fp2:
        while True:
            b1 = fp1.read(buffer_size)
            b2 = fp2.read(buffer_size)
            if b1 != b2:
                return False
            # if the content is the same and they are both empty bytes
            # the file is the same
            if not b1:
                return True

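【Comments】:

【Answer 2】:

Here is a functional-style file comparison function. It returns False immediately if the files differ in size; otherwise it reads the files in 4 KiB chunks and returns False at the first difference: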

from __future__ import with_statement
import os
import itertools, functools, operator
try:
    izip= itertools.izip  # Python 2
except AttributeError:
    izip= zip  # Python 3

def filecmp(filename1, filename2):
    "Do the two files have exactly the same contents?"
    with open(filename1, "rb") as fp1, open(filename2, "rb") as fp2:
        if os.fstat(fp1.fileno()).st_size != os.fstat(fp2.fileno()).st_size:
            return False # different sizes ∴ not equal

        # set up one 4k-reader for each file
        fp1_reader= functools.partial(fp1.read, 4096)
        fp2_reader= functools.partial(fp2.read, 4096)

        # pair each 4k-chunk from the two readers while they do not return b'' (EOF)
        cmp_pairs= izip(iter(fp1_reader, b''), iter(fp2_reader, b''))

        # return True for all pairs that are not equal
        inequalities= itertools.starmap(operator.ne, cmp_pairs)

        # voilà; any() stops at first True value
        return not any(inequalities)

if __name__ == "__main__":
    import sys
    print(filecmp(sys.argv[1], sys.argv[2]))

Just a different take :)

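【Comments】:

Quite a hack, using all the shortcuts, itertools and partial. Kudos, this is the best solution!

I had to make a slight change for Python 3, otherwise the function never returned: cmp_pairs= izip(iter(fp1_reader, b''), iter(fp2_reader, b''))

@TedStriker You are right! Thanks for helping improve this answer :)

【Answer 3】: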

filename1 = "G:\\test1.TXT"
filename2 = "G:\\test2.TXT"

with open(filename1) as f1, open(filename2) as f2:
    file1list = f1.read().splitlines()
    file2list = f2.read().splitlines()

if len(file1list) == len(file2list):
    # Same number of lines: compare line by line and report each result
    for line1, line2 in zip(file1list, file2list):
        if line1 == line2:
            print(line1 + "==" + line2)
        else:
            print(line1 + "!=" + line2 + " Not-Equal")
else:
    print("difference in the size of the file and number of lines")
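The question only asks whether the files differ, so the per-line printing above is optional. As a minimal sketch of the same line-by-line idea that stops at the first mismatch and never holds a whole file in memory: the function name `text_files_equal` and the `encoding` default are assumptions made for this example, and it deliberately treats files that differ only in their line endings as equal.

```python
from itertools import zip_longest


def text_files_equal(path1, path2, encoding="utf-8"):
    """Compare two text files line by line without loading them fully."""
    # Text mode with the default newline=None translates \r\n and \r to \n,
    # so files that differ only in line endings compare equal.
    with open(path1, encoding=encoding) as f1, open(path2, encoding=encoding) as f2:
        # zip_longest pads the shorter file with None, so a difference in
        # the number of lines shows up as an unequal pair.
        return all(a == b for a, b in zip_longest(f1, f2))
```

If differences in line endings should count as differences, a binary comparison is the better fit.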

【Comments】:

【Answer 4】:

The low-level way:

from __future__ import with_statement
with open(filename1) as f1:
   with open(filename2) as f2:
      if f1.read() == f2.read():
         ...

The high-level way:

import filecmp
if filecmp.cmp(filename1, filename2, shallow=False):
   ...
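To see what `shallow=False` does in practice, here is a small self-contained sketch that writes three temporary files and compares them. The helper `write_temp` and the file names are made up for the illustration; only `filecmp.cmp` comes from the answer.

```python
import filecmp
import os
import tempfile


def write_temp(directory, name, data):
    """Hypothetical helper: write bytes to a file and return its path."""
    path = os.path.join(directory, name)
    with open(path, "wb") as fh:
        fh.write(data)
    return path


with tempfile.TemporaryDirectory() as tmp:
    a = write_temp(tmp, "a.txt", b"hello\n")
    b = write_temp(tmp, "b.txt", b"hello\n")
    c = write_temp(tmp, "c.txt", b"HELLO\n")  # same size, different bytes

    # shallow=False makes filecmp compare the file contents
    same_ab = filecmp.cmp(a, b, shallow=False)
    same_ac = filecmp.cmp(a, c, shallow=False)

print(same_ab, same_ac)  # True False
```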

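【Comments】:

I corrected your filecmp.cmp call, because without a non-true shallow argument it doesn't do what the question asks for.

You're right. python.org/doc/2.5.2/lib/module-filecmp.html . Thank you very much.

By the way, the files should be opened in binary mode to be safe, since they may differ in their line separators.

This can be a problem if the files are huge. You can save the computer some effort if the first thing you do is compare the file sizes. If the sizes differ, the files are obviously different. You only need to read the files if the sizes are the same.

I just found out that filecmp.cmp() also compares metadata, such as inode number, ctime and other stats. This was undesirable in my use case. If you only want to compare contents without comparing metadata, f1.read() == f2.read() is probably a better way.

【Answer 5】:

Since I can't comment on other people's answers, I'll write my own.

If you use md5, you definitely must not just call md5.update(f.read()), because that uses too much memory.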

import hashlib

def get_file_md5(f, chunk_size=8192):
    # f is a file object opened in binary mode
    h = hashlib.md5()
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()

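【Comments】:

I believe any hashing operation is overkill for this question; a direct piece-by-piece comparison is faster and more straightforward.

I was just cleaning up the actual hashing part that someone suggested.

+1 I like your version better. Also, I don't think using a hash is overkill. If all you want to know is whether they differ, there is really no good reason not to.

@Jeremy Cantrell: You compute hashes when they are going to be cached/stored, or compared against cached/stored ones. Otherwise, just compare the strings. Whatever the hardware, str1 != str2 is faster than md5.new(str1).digest() != md5.new(str2).digest(). Hashes also have collisions (unlikely but not impossible).

【Answer 6】:

I would use MD5 to hash the files' contents.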

import hashlib

def checksum(f):
    md5 = hashlib.md5()
    # Read in binary mode: update() requires bytes, and the file gets closed
    with open(f, 'rb') as fp:
        md5.update(fp.read())
    return md5.hexdigest()

def is_contents_same(f1, f2):
    return checksum(f1) == checksum(f2)

if not is_contents_same('foo.txt', 'bar.txt'):
    print('The contents are not the same!')

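【Comments】:

【Answer 7】:

For larger files, you could compute an MD5 or SHA hash of the files.

【Comments】:

So what about two 32 GiB files that differ only in the first byte? Why spend CPU time and wait so long for an answer?

See my solution; for larger files it is better to do buffered reads.

【Answer 8】:

If you're going for even basic efficiency, you probably want to check the file sizes first: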
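As a rough sketch of what that could look like with SHA-256, reading each file in fixed-size chunks so memory use stays flat: the function names `file_sha256` and `same_by_hash` and the 64 KiB chunk size are choices made for this example, not part of the answer.

```python
import hashlib


def file_sha256(path, chunk_size=64 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling read() until it returns b""
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_by_hash(path1, path2):
    """Compare two files by digest (both files are always read in full)."""
    return file_sha256(path1) == file_sha256(path2)
```

Note that this always reads both files to the end, so it mainly pays off when the digests are stored and reused for later comparisons.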

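import os

if os.path.getsize(filename1) == os.path.getsize(filename2):
    with open(filename1, 'rb') as f1, open(filename2, 'rb') as f2:
        if f1.read() == f2.read():
            pass  # Files are the same.

This saves you from reading every line of two files that aren't even the same size and therefore cannot be the same.

(Going even further, you could call out to a fast MD5sum of each file and compare those, but that's not "in Python", so I'll stop here.)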

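【Comments】:

The md5sum approach will be slower with just 2 files (you still need to read the files to compute the sums). It only pays off when you are looking for duplicates among several files.

@Brian: You're assuming that md5sum's file reading is no faster than Python's, and that reading the whole file into the Python environment as a string has no overhead! Try this with 2GB files...

There is no reason to expect md5sum's file reading to be faster than Python's; IO is pretty much independent of language. The large-file problem is a reason to iterate in chunks (or use filecmp), not to use md5, where you needlessly pay an extra CPU penalty. This is especially true when you consider the case where the files are not identical. Comparing by blocks can bail out early, but md5sum has to carry on reading the entire file.

【Answer 9】: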


f = open(filename1, "r").read()
f2 = open(filename2,"r").read()
print(f == f2)

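【Comments】:

“Well, I have this 8 GiB file and that 32 GiB file that I want to compare…”

This is not a good approach. A big issue is that the files are never closed after being opened. Less critically, there is no optimization before opening and reading the files, such as comparing the file sizes.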
