Splitting large text file into smaller text files by line numbers using Python
【Posted】: 2013-04-23 18:44:49
【Question】: I have a text file, say really_big_file.txt, that contains:
line 1
line 2
line 3
line 4
...
line 99999
line 100000
I want to write a Python script that splits really_big_file.txt into smaller files, each containing 300 lines. For example, small_file_300.txt would hold lines 1-300, small_file_600 lines 301-600, and so on until enough small files have been created to hold all the lines of the big file.
I would appreciate any suggestions on the easiest way to accomplish this in Python.
【Comments】:
【Answer 1】: Use the itertools grouper recipe:
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)
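As a quick illustration of what the recipe yields (hypothetical values, not part of the original answer):

>>> list(grouper(3, 'ABCDEFG', 'x'))
[('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]

Each group is a tuple of n items, with the last one padded out by fillvalue, which is exactly why the fillvalue='' trick below works.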
n = 300

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}.txt'.format(i * n), 'w') as fout:
            fout.writelines(g)
An advantage of this method over storing each line in a list is that it works with iterables line by line, so it never has to hold an entire small_file in memory at once.
Note that the last file in this case will be small_file_100200.txt but will only go up to line 100000 (100000 lines at 300 per file makes 334 files, and 334 × 300 = 100200). This happens because fillvalue='': nothing is written to the file when there are no more lines left to write, since the group size doesn't divide equally. You can fix this by writing to a temp file and renaming it afterwards, instead of naming it first as above. Here's how that can be done:
import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
        with tempfile.NamedTemporaryFile('w', delete=False) as fout:
            for j, line in enumerate(g, 1):  # count number of lines in group
                if line is None:
                    j -= 1  # don't count this line
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))
This time fillvalue=None, and I go through each line checking for None; when it occurs, I know the process has finished, so I subtract 1 from j to avoid counting the filler, and only then write the file.
【Comments】:
If you're using the first script in Python 3.x, replace izip_longest with the new zip_longest: docs.python.org/3/library/itertools.html#itertools.zip_longest
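If the same script has to run under both versions, a small compatibility shim (my own sketch, not from the original comment) would be:

try:
    from itertools import zip_longest                    # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest   # Python 2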
@YuvalPruss I've updated per your comment now that Py3 is the standard.
【Answer 2】:
lines_per_file = 300  # Lines on each small file
lines = []            # Stores lines not yet written to a small file
lines_counter = 0     # Same as len(lines)
created_files = 0     # Counting how many small files have been created

with open('really_big_file.txt') as big_file:
    for line in big_file:  # Go through the whole big file
        lines.append(line)
        lines_counter += 1
        if lines_counter == lines_per_file:
            idx = lines_per_file * (created_files + 1)
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                # Write all lines to the small file (each line keeps its own '\n')
                small_file.write(''.join(lines))
            lines = []  # Reset variables
            lines_counter = 0
            created_files += 1  # One more small file has been created

# After the for-loop has finished
if lines_counter:  # There are still some lines not written to a file?
    idx = lines_per_file * (created_files + 1)
    with open('small_file_%s.txt' % idx, 'w') as small_file:
        # Write them to a last small file
        small_file.write(''.join(lines))
    created_files += 1

print('%s small files (with %s lines each) were created.'
      % (created_files, lines_per_file))
【Comments】:
The only problem with this is that you have to store each small_file in memory at once before writing it out, which may or may not be a problem. Of course you could fix that by writing to the file line by line.
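A minimal sketch of that line-by-line fix (my own illustration; Answer 4 below takes essentially the same approach):

chunk = 300
out = None
with open('really_big_file.txt') as big_file:
    for i, line in enumerate(big_file):
        if i % chunk == 0:  # time to start a new small file
            if out:
                out.close()
            out = open('small_file_%s.txt' % (i + chunk), 'w')
        out.write(line)
if out:
    out.close()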
【Answer 3】:
The way I do this is more understandable and uses fewer shortcuts, so you get a better sense of how and why it works. Previous answers work, but if you are not familiar with certain built-in functions, you will not understand what they are doing.
Since you posted no code I decided to do it this way, given that you may be unfamiliar with anything beyond basic Python syntax, and the way you phrased the question made it seem as though you had not tried anything and had no clue how to approach it.
Here are the steps for doing this in basic Python:
First, you should read your file into a list for safekeeping:
my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file, 'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
Second, you need to set up a way of creating new files by name! I would suggest a loop along with a couple of counters:
outer_count = 1
line_count = 0
sorting = True
while sorting:
    count = 0
    increment = (outer_count - 1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
Third, inside that loop you need some nested loops that save the correct rows into an array:
hold_new_lines = []
if left < 300:
    while count < left:
        hold_new_lines.append(hold_lines[line_count])
        count += 1
        line_count += 1
    sorting = False
else:
    while count < 300:
        hold_new_lines.append(hold_lines[line_count])
        count += 1
        line_count += 1
Last thing: again inside your first loop, you need to write the new file and add the final counter increment, so that the loop runs again and writes a new file:
outer_count += 1
with open(file_name, 'w') as next_file:
    for row in hold_new_lines:
        next_file.write(row)
Note: if the number of lines is not divisible by 300, the last file's name will not correspond to its last line number.
It is important to understand why these loops work. You have it set up so that on the next iteration, the name of the file you write changes, because the name depends on a changing variable. This is a very useful scripting tool for file accessing, opening, writing, organizing, etc.
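As a tiny illustration (hypothetical counter value), the file name is rebuilt from the counter on every pass:

>>> outer_count = 2
>>> "small_file_" + str(outer_count * 300) + ".txt"
'small_file_600.txt'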
In case you could not follow what was in which loop, here is the entire function:
my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file, 'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count - 1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name, 'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)
【Comments】:
【Answer 4】:
lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()
【Comments】:
【Answer 5】:
import csv
import os
import re

MAX_CHUNKS = 300

def writeRow(idr, row):
    # Python 2 style: the csv module expects binary-mode files ('ab'/'rb')
    with open("file_%d.csv" % idr, 'ab') as file:
        writer = csv.writer(file, delimiter=',', quotechar='\"', quoting=csv.QUOTE_ALL)
        writer.writerow(row)

def cleanup():
    for f in os.listdir("."):
        if re.search("file_.*", f):
            os.remove(os.path.join(".", f))

def main():
    cleanup()
    with open("large_file.csv", 'rb') as results:
        r = csv.reader(results, delimiter=',', quotechar='\"')
        idr = 1
        for i, x in enumerate(r):
            temp = i + 1
            if not (temp % (MAX_CHUNKS + 1)):
                idr += 1
            writeRow(idr, x)

if __name__ == "__main__":
    main()
【Comments】:
Hey, small question: would you mind explaining why you use quotechar='\"'? Thanks.
I'm using it because in my case I had a different quote character (|); you can skip it to keep the default quote character (").
For anyone who cares about speed: a CSV file with 98500 records (about 13 MB in size) was split by this code in about 2.31 seconds. I'd say that's pretty good.
【Answer 6】: I had to do the same thing with a 650000 line file.
Use the enumerate index, integer-divided (//) by the chunk size.
When that number changes, close the current file and open a new one.
This is a Python 3 solution using format strings (f-strings):
chunk = 50000  # number of lines from the big file to put in each small file
this_small_file = open('./a_folder/0', 'a')

with open('massive_web_log_file') as file_to_read:
    for i, line in enumerate(file_to_read.readlines()):
        file_name = f'./a_folder/{i // chunk}'
        print(i, file_name)  # a bit of feedback that slows the process down a little
        if file_name == this_small_file.name:
            this_small_file.write(line)
        else:
            this_small_file.close()                 # chunk boundary: close the old file
            this_small_file = open(file_name, 'a')  # ...open the next one
            this_small_file.write(line)             # ...and write the line there

this_small_file.close()
【Comments】:
You can get a significant speedup by commenting out print(i, file_name).
You can also change file_to_read.readlines() to just file_to_read ...
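A sketch with both suggestions applied (my own variant of the answer's code, same assumed paths); iterating the file object directly streams lines instead of loading the whole file into memory first:

chunk = 50000
this_small_file = open('./a_folder/0', 'a')
with open('massive_web_log_file') as file_to_read:
    for i, line in enumerate(file_to_read):  # no readlines(): streams line by line
        file_name = f'./a_folder/{i // chunk}'
        if file_name != this_small_file.name:
            this_small_file.close()
            this_small_file = open(file_name, 'a')
        this_small_file.write(line)
this_small_file.close()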
【Answer 7】:
Set files to the number of files you want to split the master file into; in my case I wanted to get 10 files from my master file.
files = 10
with open("data.txt", "r") as data:
    emails = data.readlines()
    batchs = int(len(emails) / files)
    for id, log in enumerate(emails):
        fileid = id / batchs
        # Note: this reopens the output file in append mode for every single line
        file = open("minifile{file}.txt".format(file=int(fileid) + 1), 'a+')
        file.write(log)
        file.close()
【Comments】:
Thanks @JoeVenner, I tried this approach, but it gets slow for big files.
【Answer 8】: A very easy way, if you want to split it into 2 files for example:
with open("myInputFile.txt",'r') as file:
lines = file.readlines()
with open("OutputFile1.txt",'w') as file:
for line in lines[:int(len(lines)/2)]:
file.write(line)
with open("OutputFile2.txt",'w') as file:
for line in lines[int(len(lines)/2):]:
file.write(line)
Making this dynamic would be:
with open("inputFile.txt",'r') as file:
lines = file.readlines()
Batch = 10
end = 0
for i in range(1,Batch + 1):
if i == 1:
start = 0
increase = int(len(lines)/Batch)
end = end + increase
with open("splitText_" + str(i) + ".txt",'w') as file:
for line in lines[start:end]:
file.write(line)
start = end
【Comments】:
【Answer 9】: In Python, files are simple iterators. That gives the option to iterate over them multiple times, always continuing from the last place the previous iterator reached. Keeping this in mind, we can use islice to get the next 300 lines of the file on each pass of a continuous loop. The tricky part is knowing when to stop. For this we "sample" the file for the next line; once it is exhausted, we can break the loop:
from itertools import islice

lines_per_file = 300
with open("really_big_file.txt") as file:
    i = 1
    while True:
        try:
            checker = next(file)
        except StopIteration:
            break
        with open(f"small_file_{i * lines_per_file}.txt", 'w') as out_file:
            out_file.write(checker)
            for line in islice(file, lines_per_file - 1):
                out_file.write(line)
        i += 1
【Comments】:
【Answer 10】:
with open('/really_big_file.txt') as infile:
    file_line_limit = 300
    counter = -1
    file_index = 0
    outfile = None
    for line in infile.readlines():
        counter += 1
        if counter % file_line_limit == 0:
            # close old file
            if outfile is not None:
                outfile.close()
            # create new file
            file_index += 1
            outfile = open('small_file_%03d.txt' % file_index, 'w')
        # write to file
        outfile.write(line)
    # close the last file
    if outfile is not None:
        outfile.close()
【Comments】:
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.