Find duplicate rows and the file which contains the duplicated row in a large dataframe split over multiple files

Posted 2023-04-18


【Asked】2019-04-19 09:10:34
【Question】:

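I have a large dataframe split across 404 Excel files. The dataframe has an ID column, and I have to:

1. Find out whether any duplicated rows exist.
2. If a duplicated row occurs, output the two files that contain it.

For example, say the row with key ID "ID_101" is contained in file #10 and in file #209. The script should output "Duplicated row: ID_101 is contained in file #10 and file #209".

I tried this approach: create a set containing all key IDs, and a dictionary that maps each ID to a file. Then, as I iterate over the files and their rows:

If the ID is in the set, the script looks it up in the dictionary and prints where that row was already found. If the ID is not in the set, the script adds it to the set and creates a new dictionary entry that maps the ID to the current file.

So an MWE would be: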

import os, sys, pandas

ids_set = set()
ids_map = dict()

for root, dirs, files in os.walk(sys.argv[1]):
    for file in files:
        # Full path of the current file; also stored in ids_map below
        filen = os.path.join(root, file)
        in_file = pandas.read_excel(filen, header=0, sheet_name="Results")

        # Check for duplicated companies
        this_ids = list(in_file['BvD ID number'])
        for this_id in this_ids:
            if this_id in ids_set:
                print("ERROR: duplicate ID '{}', already found in '{}'".format(this_id, ids_map[this_id]))
            else:
                ids_set.add(this_id)
                ids_map[this_id] = filen

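The problem is that at the 300th file I get a MemoryError when I try to access the dictionary, supposedly because it has become too large.

How can I achieve this with such a large dataframe?

【Comments】:

Are you looking only for duplicates across files, or also for duplicates within a file? Another thing: you can drop ids_set and check if this_id in ids_map instead.

@QuangHoang The ones within a file count as well. Is a lookup in a set faster than a lookup in a dictionary?

I believe it is the same. Also, you don't need to create this_ids. You can iterate over the column itself.

【Answer 1】: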

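You get a memory error because you are doing this recursively, whereas Pandas is optimized for vectorized operations.

The best approach would be to append all the dataframes into one very large dataframe, create a column that holds the source file, and look for duplicates.

Something like: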

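import os, sys, pandas

frames = []

for root, dirs, files in os.walk(sys.argv[1]):
    for file in files:
        filen = os.path.join(root, file)
        current_df = pandas.read_excel(filen, header=0, sheet_name="Results")
        current_df["source_file"] = filen  # remember which file each row came from
        frames.append(current_df)

# DataFrame.append was removed in pandas 2.0; concatenate once instead
df = pandas.concat(frames, ignore_index=True)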

Then get the duplicated rows:

# "ID" stands for the key column ('BvD ID number' in the question's code)
duplicated_df = df[df.duplicated(subset="ID", keep=False)]
print(duplicated_df)

I can't try it because I have neither your data nor your exact expected output, but something like this should work.
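To get from the duplicated rows to the message format requested in the question, one option is to group them by ID and collect the source files of each group. The following is only a minimal sketch: the function name `report_duplicates`, the column names `ID` and `source_file`, and the tiny example frame are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

def report_duplicates(df, id_col="ID", source_col="source_file"):
    """Return one message per ID that occurs in more than one row."""
    # keep=False marks every occurrence of a repeated ID, not only the later ones
    dup = df[df.duplicated(subset=id_col, keep=False)]
    messages = []
    for dup_id, files in dup.groupby(id_col)[source_col]:
        # A set collapses repeats inside one file to a single file name
        file_list = " and ".join(sorted(set(files)))
        messages.append(f"Duplicated row: {dup_id} is contained in {file_list}")
    return messages

# Made-up data standing in for the concatenated dataframe
example = pd.DataFrame({
    "ID": ["ID_101", "ID_102", "ID_101", "ID_103"],
    "source_file": ["file10.xlsx", "file10.xlsx", "file209.xlsx", "file209.xlsx"],
})
for line in report_duplicates(example):
    print(line)
```

This still requires the concatenated dataframe to fit in memory, so it only helps with the output format, not with the MemoryError.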

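【Discussion】:

He isn't doing things recursively, more like serially. If anything, his code needs less memory than yours, since it handles one dataframe at a time instead of 401.

@ggrelet Thanks, I had already tried that, but I cannot read the merged dataframe into memory. I get a MemoryError with low_memory=True, and pandas.errors.ParserError: Error tokenizing data. C error: out of memory with low_memory=False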
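Since neither the in-memory dictionary nor the merged dataframe fits in memory, one alternative that none of the posters proposed is to keep the ID-to-file index on disk, for instance in an SQLite table, and let the database find the repeated IDs. The sketch below is an assumption-laden illustration: the table name `seen`, the helper names `build_index` and `find_duplicates`, and the made-up batches are all hypothetical.

```python
import sqlite3

def build_index(conn, id_batches):
    """Store (id, source file) pairs from an iterable of (file, ids) batches."""
    conn.execute("CREATE TABLE IF NOT EXISTS seen (id TEXT, source_file TEXT)")
    for source_file, ids in id_batches:
        conn.executemany(
            "INSERT INTO seen (id, source_file) VALUES (?, ?)",
            ((str(i), source_file) for i in ids),
        )
    conn.commit()

def find_duplicates(conn):
    """Return {id: [files]} for every ID that was stored more than once."""
    rows = conn.execute(
        "SELECT id, source_file FROM seen "
        "WHERE id IN (SELECT id FROM seen GROUP BY id HAVING COUNT(*) > 1) "
        "ORDER BY id, source_file"
    )
    duplicates = {}
    for dup_id, source_file in rows:
        duplicates.setdefault(dup_id, []).append(source_file)
    return duplicates

# Made-up batches standing in for the ID column of each Excel file
batches = [
    ("file10.xlsx", ["ID_101", "ID_102"]),
    ("file209.xlsx", ["ID_103", "ID_101"]),
]
conn = sqlite3.connect(":memory:")  # pass a file path to keep the index on disk
build_index(conn, batches)
for dup_id, files in find_duplicates(conn).items():
    print(f"Duplicated row: {dup_id} is contained in {' and '.join(files)}")
```

In the real script, each batch would presumably come from reading only the key column of one file, for example with `pandas.read_excel(path, sheet_name="Results", usecols=["BvD ID number"])`, so that just one file's IDs are held in memory at a time. Whether this is fast enough for 404 files has not been tested here.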
