What can I do using Python built-ins to successfully process a massive .txt file?

Posted 2023-02-16

【Title】: What can I do using Python built-ins to successfully process a massive .txt file? 【Posted】: 2019-01-11 14:22:57 【Question】:

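I have a project where I need to read data from a relatively large .txt file containing 5 columns and about 25 million rows of comma-separated data, process the data, and then write the processed data to a new .txt file. My computer freezes when I try to process a file this large.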

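I have already written the function that processes the data, and it works on small input .txt files, so I only need to adapt it to work with the larger file.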

Here is an abridged version of my code:

import csv
import sys

def process_data(input_file, output_file):

    prod_dict = {}
    with open(input_file, "r") as file:

        # some code that reads all data from input file into dictionary


    # some code that sorts dictionary into an array with desired row order

    # list comprehension code that puts array into desired output form

    with open(output_file, 'w', newline='') as myfile:
        wr = csv.writer(myfile)
        for i in final_array:
            wr.writerow(i)

def main():
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    process_data(input_file, output_file)

if __name__ == '__main__':
    main()

【Comments】:

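What is the problem with the larger file?

My computer freezes when I try to process the larger file.

Do you need to read the whole file at once, or can you read and process it in chunks?

It is important to know why you need to read the entire file into memory in order to provide an answer here. What operations are you performing on the data you read?

@sundance I don't need to read the whole file at once. I could read it in chunks, but I don't know how to do that.

【Answer 1】: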

The file is clearly too large to read into memory all at once. It sounds like you need to process the file in chunks.
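As a minimal sketch of what "in chunks" could look like with only the standard library: the generator below yields a fixed number of lines at a time, so no more than one chunk is held in memory. The function name `read_in_chunks` and the default `chunk_size` are illustrative assumptions, not part of the original question.

```python
from itertools import islice


def read_in_chunks(path, chunk_size=100_000):
    """Yield lists of at most chunk_size lines from the file at path.

    Hypothetical helper: the name and default size are assumptions.
    """
    with open(path, "r") as handle:
        while True:
            # islice pulls the next chunk_size lines without reading further.
            chunk = list(islice(handle, chunk_size))
            if not chunk:
                break
            yield chunk
```

Each chunk can then be passed to the existing processing step, for example `for chunk in read_in_chunks(sys.argv[1]): ...`, provided that step does not need to see the whole file at once.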

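There are many sorting algorithms, including some that do not need the whole file in memory at once. In particular, look at the concept of "merge sort". The Wikipedia article on it has a good animation that demonstrates the idea. A merge sort can be arranged so that it never has to sort more than two items in memory at a time. It is essentially "divide and conquer".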

General procedure:

readline

sort
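One way to apply the merge sort idea to a file on disk is an external merge sort: sort manageable chunks, write each as a temporary "run", then merge the runs. The sketch below is not the answerer's code. It assumes the rows are comma-separated, that every row contains the key column, and that comparing the key as a string is acceptable. The names `external_sort`, `key_column` and `chunk_size` are hypothetical.

```python
import csv
import heapq
import os
import tempfile
from itertools import islice


def external_sort(input_path, output_path, key_column=0, chunk_size=100_000):
    """Sort a large comma-separated file by one column with bounded memory.

    Assumption: keys are compared as strings (lexicographic order).
    """
    run_paths = []
    try:
        # Phase 1: read chunk_size rows, sort them, write one sorted run file.
        with open(input_path, "r", newline="") as source:
            reader = csv.reader(source)
            while True:
                rows = list(islice(reader, chunk_size))
                if not rows:
                    break
                rows.sort(key=lambda row: row[key_column])
                fd, run_path = tempfile.mkstemp(suffix=".csv")
                with os.fdopen(fd, "w", newline="") as run:
                    csv.writer(run).writerows(rows)
                run_paths.append(run_path)

        # Phase 2: merge the runs lazily; heapq.merge keeps one row per run.
        run_files = [open(p, "r", newline="") for p in run_paths]
        try:
            readers = [csv.reader(f) for f in run_files]
            merged = heapq.merge(*readers, key=lambda row: row[key_column])
            with open(output_path, "w", newline="") as out:
                csv.writer(out).writerows(merged)
        finally:
            for f in run_files:
                f.close()
    finally:
        # Remove the temporary run files whether or not the merge succeeded.
        for p in run_paths:
            os.remove(p)
```

Every run file stays open during the merge, so `chunk_size` trades memory per chunk against the number of files open at once. With about 25 million rows, a chunk of 100,000 rows gives roughly 250 runs, which may exceed the default open-file limit on some systems; a larger chunk reduces that count.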

【Comments】:

【Answer 2】:

It sounds like you need to process the file line by line.

(Rather than loading the whole file into memory.)

for line in open('really_big_file.dat'):
    process_data(line)

Explanation: https://***.com/a/519653/9914705
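Applied to the comma-separated file in the question, line-by-line processing could look like the sketch below, which reads one row, transforms it, and writes it out before reading the next. `transform_row` is a hypothetical placeholder for the asker's own per-row logic, and `stream_process` is an assumed name.

```python
import csv


def transform_row(row):
    # Hypothetical per-row step; stands in for the asker's own processing.
    return [field.strip() for field in row]


def stream_process(input_path, output_path):
    """Copy rows from input to output one at a time, transforming each."""
    with open(input_path, "r", newline="") as source, \
            open(output_path, "w", newline="") as target:
        reader = csv.reader(source)
        writer = csv.writer(target)
        for row in reader:
            # Only the current row is held in memory.
            writer.writerow(transform_row(row))
```

This pattern only fits work that can be done one row at a time. A step such as sorting, which needs to see every row, does not fit it on its own.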

【Comments】:

How do I process it line by line?

for line in open('file.txt'): process_line(line)

***.com/questions/519633/… This seems to do what you want.

for line in open('really_big_file.dat'): process_data(line)
