Splitting large text file into smaller text files by line numbers using Python

Posted 2023-03-12

【Asked】2013-04-23 18:44:49 【Question】:

I have a text file, say really_big_file.txt, that contains:

line 1
line 2
line 3
line 4
...
line 99999
line 100000

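I want to write a Python script that divides really_big_file.txt into smaller files of 300 lines each. For example, small_file_300.txt would hold lines 1-300, small_file_600.txt lines 301-600, and so on, until enough small files have been made to contain all the lines from the big file.

I would appreciate any suggestions on the easiest way to accomplish this using Python.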

【Question comments】:

【Answer 1】:

Use the itertools grouper recipe:

from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

n = 300

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}'.format(i * n), 'w') as fout:
            fout.writelines(g)

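The advantage of this method over storing every line in a list is that it works on iterables, line by line, so it never has to hold a whole small_file in memory at once.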
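
As a small illustration of that lazy behaviour, the sketch below feeds grouper from a generator that records which lines have been pulled so far. The names numbered_lines and consumed are made up for this demonstration, and a 7-line input with groups of 3 stands in for the real file.

```python
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    """Collect data into fixed-length chunks, padding the last one."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

consumed = []  # hypothetical log of the line numbers read so far

def numbered_lines(count):
    """Stand-in for a file object: yields 'line k' and records each read."""
    for k in range(1, count + 1):
        consumed.append(k)
        yield "line {}\n".format(k)

groups = grouper(3, numbered_lines(7), fillvalue='')

first = next(groups)               # pulls only the first three lines
consumed_after_first = list(consumed)

rest = list(groups)                # the last group is padded with ''
print(first)
print(consumed_after_first)
print(rest[-1])
```

Only three lines are read to build the first group, and the final group is padded with empty strings because 7 is not a multiple of 3.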

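Note that in this case the last file will be named small_file_100200 but will only go up to line 100000. This happens because of fillvalue='': since the group sizes do not divide evenly, I write nothing to the file once there are no more lines left. You can fix this by writing to a temporary file and renaming it afterwards, instead of naming it first as I did. Here is how that can be done.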

import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
        # dir='.' keeps the temporary file on the same filesystem so os.rename works
        with tempfile.NamedTemporaryFile('w', dir='.', delete=False) as fout:
            for j, line in enumerate(g, 1): # count number of lines in group
                if line is None:
                    j -= 1 # don't count this line
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))

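This time fillvalue=None, and I go through each line checking for None. When it occurs, I know the input is exhausted, so I subtract 1 from j to leave the filler out of the count, and the file is then renamed using the real last line number.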
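
A different way to avoid fill values altogether, shown here only as a sketch and not as the answer's own method, is to take each group with itertools.islice, so the last group is simply shorter. The helper names chunks and split_file, and the prefix parameter, are assumptions for this example. The trade-off is that one group of n lines is held in memory at a time.

```python
from itertools import islice

def chunks(iterable, n):
    """Yield lists of up to n items; the last list may be shorter."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk

def split_file(path, n=300, prefix='small_file_'):
    """Write n-line pieces of `path`; each name ends with its real last line number."""
    names = []
    last_line = 0
    with open(path) as f:
        for chunk in chunks(f, n):
            last_line += len(chunk)
            name = '{}{}.txt'.format(prefix, last_line)
            with open(name, 'w') as fout:
                fout.writelines(chunk)
            names.append(name)
    return names
```

Calling split_file('really_big_file.txt') on the 100000-line file from the question would, under these assumptions, name the final piece small_file_100000.txt without any renaming step.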

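【Comments】:

If you are using the first script in Python 3.x, replace izip_longest with the new zip_longest: docs.python.org/3/library/itertools.html#itertools.zip_longest

@YuvalPruss I updated the answer per your comment, since Py3 is now the standard.

【Answer 2】: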

lines_per_file = 300  # Lines on each small file
lines = []  # Stores lines not yet written on a small file
lines_counter = 0  # Same as len(lines)
created_files = 0  # Counting how many small files have been created

with open('really_big_file.txt') as big_file:
    for line in big_file:  # Go through the whole big file
        lines.append(line)
        lines_counter += 1
        if lines_counter == lines_per_file:
            idx = lines_per_file * (created_files + 1)
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                # Write all lines on small file
                small_file.writelines(lines)  # lines already end with '\n'
            lines = []  # Reset variables
            lines_counter = 0
            created_files += 1  # One more small file has been created
    # After for-loop has finished
    if lines_counter:  # There are still some lines not written on a file?
        idx = lines_per_file * (created_files + 1)
        with open('small_file_%s.txt' % idx, 'w') as small_file:
            # Write them on a last small file
            small_file.writelines(lines)  # lines already end with '\n'
        created_files += 1

print('%s small files (with %s lines each) were created.' % (created_files,
                                                             lines_per_file))

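【Comments】:

The only issue is that with this method you have to hold each small_file in memory before writing it, which may or may not be a problem. Of course, you can work around that by writing to the file line by line.

【Answer 3】:

The way I did this is easier to follow and uses fewer shortcuts, to give you a better understanding of how and why it works. The previous answers work, but if you are not familiar with certain built-in functions, you will not understand what the code is doing.

Because you posted no code, I decided to do it this way: you may be unfamiliar with anything beyond basic Python syntax, given that the way you phrased the question makes it look as though you did not try it or did not know how to approach the problem.

Here are the steps to do this in basic Python:

First, read your file into a list for safekeeping: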

my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)

Second, you need a way to create the new files by name. I would suggest a loop along with a couple of counters:

outer_count = 1
line_count = 0
sorting = True
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"

Third, inside that loop you need some nested loops that save the correct rows into a list:

    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1

Last, still inside your first loop, you need to write the new file and add the final counter increment so that the loop runs again and writes a new file:

    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

Note: if the number of lines is not divisible by 300, the name of the last file will not correspond to its last line.
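
If the file names should reflect the real last line of each piece, one possible adjustment is to compute the slice boundaries first and cap the final one at the total line count. This is a sketch only; the helper name plan_chunks is hypothetical and is not part of the steps above.

```python
def plan_chunks(total_lines, chunk_size=300):
    """Return (start, end, file_name) for each piece; `end` is exclusive."""
    plan = []
    for start in range(0, total_lines, chunk_size):
        # cap the last piece at the real number of lines
        end = min(start + chunk_size, total_lines)
        plan.append((start, end, "small_file_" + str(end) + ".txt"))
    return plan

# A 700-line file: the last piece is named after line 700, not 900.
for start, end, file_name in plan_chunks(700):
    print(start, end, file_name)
```

Each (start, end) pair can then be used as hold_lines[start:end] when writing file_name.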

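It is important to understand why these loops work. You have set things up so that on the next pass the name of the file you write changes, because the name depends on a variable that keeps changing. This is a very useful scripting technique for accessing, opening, writing, and organizing files.

If you had trouble following what belongs inside the loop, here is the whole script: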

my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

【Comments】:

【Answer 4】:

lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()

【Comments】:

【Answer 5】:

import csv
import os
import re

MAX_CHUNKS = 300  # rows per output file


def writeRow(idr, row):
    # newline='' lets the csv module control line endings (Python 3)
    with open("file_%d.csv" % idr, 'a', newline='') as file:
        writer = csv.writer(file, delimiter=',', quotechar='\"', quoting=csv.QUOTE_ALL)
        writer.writerow(row)

def cleanup():
    # remove output files left over from a previous run
    for f in os.listdir("."):
        if re.match(r"file_\d+\.csv$", f):
            os.remove(os.path.join(".", f))

def main():
    cleanup()
    with open("large_file.csv", newline='') as results:
        r = csv.reader(results, delimiter=',', quotechar='\"')
        for i, x in enumerate(r):
            # rows 0-299 go to file_1.csv, rows 300-599 to file_2.csv, ...
            idr = i // MAX_CHUNKS + 1
            writeRow(idr, x)

if __name__ == "__main__": main()

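【Comments】:

Hey, quick question: would you mind explaining why you use quotechar='\"'? Thanks.

I am using it because in my case I had a different quote character ( | ). You can skip setting it, since the default quote character is the double quote (").

For anyone who cares about speed: a CSV file with 98500 records (about 13MB in size) was split by this code in about 2.31 seconds. I would say that is pretty good.

【Answer 6】:

I had to do the same thing with a 650000-line file.

Use the enumerate index and integer-divide it (//) by the chunk size.

When that number changes, close the current file and open a new one.

Here is a Python 3 solution using format strings.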

chunk = 50000  # number of lines from the big file to put in small file
this_small_file = open('./a_folder/0', 'a')

with open('massive_web_log_file') as file_to_read:
    for i, line in enumerate(file_to_read.readlines()):
        file_name = f'./a_folder/{i // chunk}'
        print(i, file_name)  # a bit of feedback that slows the process down

        if file_name != this_small_file.name:
            # the chunk number changed: close the current file and open the next one
            this_small_file.close()
            this_small_file = open(file_name, 'a')
        this_small_file.write(line)

this_small_file.close()
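
The same "index // chunk" idea can also be expressed with itertools.groupby, which starts a new group whenever the key changes. The sketch below is an illustration rather than part of the answer: the function name split_by_chunk_index is made up, and it collects the pieces in a dict instead of writing files so that the grouping is easy to inspect.

```python
from itertools import groupby

def split_by_chunk_index(lines, chunk):
    """Group lines by i // chunk, where i is the zero-based line index."""
    pieces = {}
    numbered = enumerate(lines)
    for index, group in groupby(numbered, key=lambda pair: pair[0] // chunk):
        # each group holds (i, line) pairs that share the same chunk number
        pieces[index] = [line for _, line in group]
    return pieces

sample = ["a\n", "b\n", "c\n", "d\n", "e\n"]
print(split_by_chunk_index(sample, 2))
```

With a real file, each group could be written to its own output file inside the loop, so only one line is in flight at a time.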

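【Comments】:

You can get a significant speedup by commenting out print(i, file_name).

You could also change file_to_read.readlines() to file_to_read...

【Answer 7】:

Set files to the number of files you want to split the main file into. In my case, I wanted 10 files from my main file.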

files = 10
with open("data.txt","r") as data :
    emails = data.readlines()
    batchs = max(1, len(emails) // files)  # lines per output file
    for id,log in enumerate(emails):
        # leftover lines are appended to the last file
        fileid = min(id // batchs, files - 1)
        with open("minifile{file}.txt".format(file=fileid+1),'a+') as file:
            file.write(log)

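【Comments】:

Thanks @JoeVenner, I tried this approach, but it gets slow for large files.

【Answer 8】:

Here is a very simple way to do it if you want to split the file into 2 files, for example: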

with open("myInputFile.txt",'r') as file:
    lines = file.readlines()

with open("OutputFile1.txt",'w') as file:
    for line in lines[:int(len(lines)/2)]:
        file.write(line)

with open("OutputFile2.txt",'w') as file:
    for line in lines[int(len(lines)/2):]:
        file.write(line)

Making this dynamic would be:

with open("inputFile.txt",'r') as file:
    lines = file.readlines()

Batch = 10
start = 0
increase = int(len(lines)/Batch)
for i in range(1,Batch + 1):
    # the last file also takes any leftover lines
    end = len(lines) if i == Batch else start + increase
    with open("splitText_" + str(i) + ".txt",'w') as file:
        for line in lines[start:end]:
            file.write(line)
    start = end
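
When the line count is not a multiple of the number of files, the split can also be balanced so that piece sizes differ by at most one line. The following is a sketch under that assumption; split_evenly is a hypothetical helper, not something from the answer.

```python
def split_evenly(lines, parts):
    """Split `lines` into `parts` consecutive lists whose sizes differ by at most one."""
    base, extra = divmod(len(lines), parts)
    result = []
    start = 0
    for i in range(parts):
        # the first `extra` pieces each take one additional line
        end = start + base + (1 if i < extra else 0)
        result.append(lines[start:end])
        start = end
    return result

print([len(piece) for piece in split_evenly(list(range(10)), 3)])
```

Each returned list can be written to its own file in the same way as the loop above.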

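【Comments】:

【Answer 9】:

In Python, files are simple iterators. That makes it possible to iterate over them several times, always continuing from the position the previous iteration reached. With this in mind, we can use islice to take the next 300 lines of the file on each pass of a continuous loop. The tricky part is knowing when to stop. To do that, we "sample" the file for its next line, and once it is exhausted we can break out of the loop: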

from itertools import islice

lines_per_file = 300
with open("really_big_file.txt") as file:
    i = 1
    while True:
        try:
            checker = next(file)
        except StopIteration:
            break
        with open(f"small_file_{i*lines_per_file}.txt", 'w') as out_file:
            out_file.write(checker)
            for line in islice(file, lines_per_file-1):
                out_file.write(line)
        i += 1
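
To see the "continue where the last iteration stopped" behaviour in isolation, here is a tiny demonstration that uses io.StringIO as a stand-in for a real file. The 7-line content and the slice size of 3 are arbitrary choices for the example.

```python
import io
from itertools import islice

# An in-memory "file" with seven lines.
fake_file = io.StringIO("".join("line {}\n".format(k) for k in range(1, 8)))

first = list(islice(fake_file, 3))    # lines 1-3
second = list(islice(fake_file, 3))   # continues with lines 4-6
third = list(islice(fake_file, 3))    # only line 7 is left
fourth = list(islice(fake_file, 3))   # exhausted: empty list

print(first, second, third, fourth, sep="\n")
```

Each islice call picks up where the previous one ended, which is why the loop above needs the next() check to notice that the file has run out.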

【Comments】:

【Answer 10】:

with open('/really_big_file.txt') as infile:
    file_line_limit = 300
    counter = -1
    file_index = 0
    outfile = None
    for line in infile.readlines():
        counter += 1
        if counter % file_line_limit == 0:
            # close old file
            if outfile is not None:
                outfile.close()
            # create new file
            file_index += 1
            outfile = open('small_file_%03d.txt' % file_index, 'w')
        # write to file
        outfile.write(line)
    # close the last output file
    if outfile is not None:
        outfile.close()

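【Comments】:

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.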
