Split one file into multiple files based on delimiter

Posted 2023-02-24


【Title】: Split one file into multiple files based on delimiter 【Posted】: 2012-07-04 01:24:13 【Question】:

I have a file with -| as a delimiter after each section. I need to create a separate file for each section using Unix.

Sample input file

wertretr
ewretrtret
1212132323
000232
-|
ereteertetet
232434234
erewesdfsfsfs
0234342343
-|
jdhg3875jdfsgfd
sjdhfdbfjds
347674657435
-|

Expected result in file 1

wertretr
ewretrtret
1212132323
000232
-|

Expected result in file 2

ereteertetet
232434234
erewesdfsfsfs
0234342343
-|

Expected result in file 3

jdhg3875jdfsgfd
sjdhfdbfjds
347674657435
-|

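【Comments on the question】:

Are you writing a program, or do you want to do this with command-line utilities?

Command-line utilities would be preferable.

You could use awk; a 3 or 4 line program would do it easily. Unfortunately I am out of practice.

【Answer 1】: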

A one-liner, no programming (apart from the regex, etc.).

csplit --digits=2  --quiet --prefix=outfile infile "/-|/+1" "{*}"

Tested on: csplit (GNU coreutils) 8.30
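For readers without GNU csplit, the same splitting can be sketched in Python. This is only an illustration, not how csplit is implemented: the function names `split_sections` and `write_sections` are made up for this example, and it assumes the delimiter occupies a whole line, whereas the csplit pattern above matches `-|` anywhere in a line. The `prefix` and `digits` parameters mirror `--prefix` and `--digits`.

```python
from pathlib import Path
import tempfile


def split_sections(lines, delimiter="-|"):
    """Group lines into sections; each delimiter line closes its section."""
    sections, current = [], []
    for line in lines:
        current.append(line)
        if line.rstrip("\n") == delimiter:
            sections.append(current)
            current = []
    if current:  # trailing lines that are not followed by a delimiter
        sections.append(current)
    return sections


def write_sections(sections, outdir, prefix="outfile", digits=2):
    """Write each section to prefix00, prefix01, ... and return the paths."""
    paths = []
    for index, section in enumerate(sections):
        path = Path(outdir) / f"{prefix}{index:0{digits}d}"
        path.write_text("".join(section))
        paths.append(path)
    return paths


if __name__ == "__main__":
    sample = ["wertretr\n", "000232\n", "-|\n", "ereteertetet\n", "-|\n"]
    with tempfile.TemporaryDirectory() as tmp:
        for p in write_sections(split_sections(sample), tmp):
            print(p.name, repr(p.read_text()))
```

Because a section is only stored once it contains at least one line, this sketch never produces an empty trailing file.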

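Notes for use on Apple Mac

“For OS X users, note that the version of csplit that ships with the OS does not work. You will want the version in coreutils (installable via Homebrew), which is called gcsplit.” - @Danial

“Just to add, you can get the OS X version to work (at least on High Sierra). You just need to tweak the arguments a bit: csplit -k -f=outfile infile "/-\|/+1" "{3}". The feature that does not seem to work is "{*}": I had to be specific about the number of separators, and needed to add -k to stop it deleting all the outfiles when it cannot find a final separator. Also, if you want --digits, you need to use -n instead.” - @Pebbl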

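【Comments】:

+1 - shorter: csplit -n2 -s -f outfile infile "/-|/+1" "{*}"

@zb226 I wrote it out in long form so that no explanation was needed.

I suggest adding --elide-empty-files, otherwise there will be an empty file at the end.

For those wondering what the parameters mean: --digits=2 controls the number of digits used to number the output files (2 is the default for me, so it is not necessary). --quiet suppresses output (also not really needed or asked for here). --prefix specifies the prefix of the output files (default is xx). So you can skip all the parameters and get output files like xx12.

I have updated the question to include the unread comments about Apple Mac.

【Answer 2】: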

awk '{f="file" NR; print $0 " -|"> f}' RS='-\\|'  input-file

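Explanation (edited):

RS is the record separator, and this solution uses a GNU awk extension that allows it to be more than one character. NR is the record number.

The print statement prints a record followed by " -|" into a file whose name contains the record number.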
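To make the record-separator idea concrete, here is a rough Python sketch of reading a stream as records split on a multi-character separator. It is not how awk is implemented; `iter_records` and its `chunk_size` parameter are hypothetical names chosen for this example. The generator reads fixed-size chunks and keeps only the unfinished tail in memory, so it holds at most one record plus one chunk at a time.

```python
import io


def iter_records(stream, separator="-|", chunk_size=65536):
    """Yield the text between occurrences of separator, reading in chunks."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last separator seen so far is complete;
        # the remaining tail may still grow (or hold half a separator).
        *complete, buffer = buffer.split(separator)
        yield from complete
    if buffer:
        yield buffer


if __name__ == "__main__":
    text = "wertretr\n000232\n-|\nereteertetet\n-|\n"
    for number, record in enumerate(iter_records(io.StringIO(text)), start=1):
        print(number, repr(record))
```

As with the awk command, the newline that follows each separator becomes the start of the next record, so the text after the final separator forms a last record that contains only a newline.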

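【Comments】:

@rzetterbeg This should work well on large files. awk processes the file one record at a time, so it only reads as much as it needs. If the first occurrence of the record separator shows up very late in the file, memory may get tight, since one whole record must fit in memory. Also note that using more than one character in RS is not standard awk, but it works in GNU awk.

For me it split 3.3 GB in 31.728 seconds.

@ccf The file name is just the string on the right-hand side of the >, so you can construct it however you like. For example, print $0 "-|" > "file" NR ".txt"

@AGrush That depends on the version. You can do awk '{f="file" NR; print $0 " -|" > f}'

【Answer 3】: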

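Debian has csplit, but I do not know whether that is common to all/most/other distributions. If not, it should not be too hard to track down the source and compile it...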

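【Comments】:

I agree. My Debian box says that csplit is part of GNU coreutils. So any GNU operating system, such as all the GNU/Linux distributions, will have it.

*** also mentions "The Single UNIX® Specification, Issue 7" on its csplit page, so I suspect you have got it.

Since csplit is in POSIX, I would expect it to be available on essentially all Unix-like systems.

Although csplit is POSIX, the problem (it seems, having tested it on the Ubuntu system in front of me) is that there is no obvious way to make it use a more modern regex syntax. Compare: csplit --prefix gold-data - "/^==*$/ with csplit --prefix gold-data - "/^=+$/. At least GNU grep has -E.

【Answer 4】: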

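I solved a slightly different problem, where the file contains a line naming the file that the following text should go into. This perl code does the job for me: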

#!/path/to/perl -w

# Windows only and unused below, so it is commented out here
# use Win32::Clipboard;

use Getopt::Long;

# Get command line flags; show usage when called without arguments
if ($#ARGV < 0) {
    print STDERR "usage: ncsplit.pl --mff=MARKER filename.txt [...]\n\nThe mff is found on a line followed by a filename.  All of the contents of filename.txt are written to that file until another mff is found.\n";
    exit;
}

# 'mff=s' makes --mff take a string value; GetOptions removes it from @ARGV
my $mff = "";
GetOptions('mff=s' => \$mff);

# set a default $mff variable
if ($mff eq "") { $mff = "-#-"; }
print("using file switch=", $mff, "\n\n");

# keep only the arguments that name existing files
while ($_ = shift @ARGV) {
    if (-f "$_") {
        push @filelist, $_;
    }
}

# Could be more than one file name on the command line,
# but this version throws away the subsequent ones.
$readfile = $filelist[0];

open SOURCEFILE, "<$readfile" or die "File not found...\n\n";

while (<SOURCEFILE>) {
    if (/^\Q$mff\E (.*)$/) {
        # marker line: the rest of the line names the next output file;
        # reopening OUTFILE implicitly closes the previous file
        $outname = $1;
        open OUTFILE, ">$outname" or die "Cannot open $outname\n";
        print "opened $outname\n";
    }
    else {
        # lines before the first marker have no output file and are skipped
        print OUTFILE $_ if defined $outname;
    }
}
close OUTFILE if defined $outname;

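【Comments】:

Can you explain why this code works? My situation is similar to the one you describe here: the required output file names are embedded in the file. But I am not a regular perl user, so I cannot quite make sense of this code.

The real beef is in the final while loop. If it finds the mff regex at the beginning of a line, it uses the rest of the line as the name of the file to open and start writing to. It never closes anything, so it will run out of file handles after a few dozen.

The script would actually be improved by removing most of the code before the final while loop and switching to while (<>)

【Answer 5】: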

The following command works for me. I hope it helps.

awk 'BEGIN { file = 0; filename = "output_" file ".txt" }
    # a delimiter line is skipped via getline, then output switches to the next file
    /-\|/ { getline; file++; filename = "output_" file ".txt" }
    { print $0 > filename }' input

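【Comments】:

This will typically run out of file handles after a few dozen files. The fix is to explicitly close the old file when you start a new one.

@tripleee How do you close it (beginner awk question)? Can you provide an updated example?

@JesperRønn-Jensen This box is probably too small for any useful example, but basically if (file) close(filename); before assigning a new filename value.

Aah, found out how to close it: ; close(filename). Really simple, but it does fix the example above.

@JesperRønn-Jensen I rolled back your edit because you provided a broken script. Significant edits to other people's answers should be avoided. Feel free to post a new answer of your own (perhaps as a community wiki) if you think a separate answer is merited.

【Answer 6】: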

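You can also use awk. I am not very familiar with awk, but the following seemed to work for me. It generated part1.txt, part2.txt, part3.txt and part4.txt. Note that the last partn.txt file it generates is empty. I am not sure how to fix that, but I am sure it could be done with a little tweaking. Any suggestions?

awk_pattern file: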

BEGIN { fn = "part1.txt"; n = 1 }
{
   print > fn
   if (substr($0,1,2) == "-|") {
       close (fn)
       n++
       fn = "part" n ".txt"
   }
}

bash command:

awk -f awk_pattern input.file
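One possible way to avoid an empty trailing file, sketched here in Python rather than awk, is to open an output file only when there is a line to write into it. The function name `split_stream` and the `template` parameter are hypothetical, chosen for this illustration; the delimiter test mirrors the `substr($0,1,2) == "-|"` check above.

```python
import os
import tempfile


def split_stream(lines, outdir, delimiter="-|", template="part{}.txt"):
    """Write sections to part1.txt, part2.txt, ...; return the file names."""
    names = []
    out = None
    try:
        for line in lines:
            if out is None:
                # Open the next file only once there is a line to put in it.
                names.append(template.format(len(names) + 1))
                out = open(os.path.join(outdir, names[-1]), "w")
            out.write(line)
            if line.startswith(delimiter):
                # Close before moving on, so only one handle is ever open.
                out.close()
                out = None
    finally:
        if out is not None:
            out.close()
    return names


if __name__ == "__main__":
    sample = ["wertretr\n", "-|\n", "ereteertetet\n", "-|\n"]
    with tempfile.TemporaryDirectory() as tmp:
        print(split_stream(sample, tmp))
```

Passing an open file object as `lines` processes the input line by line, so the whole file never has to fit in memory.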

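【Comments】:

【Answer 7】:

Here is a Python 3 script that splits one file into multiple files, taking each output file name from the delimiter line. Example input file: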

# Ignored

######## FILTER BEGIN foo.conf
This goes in foo.conf.
######## FILTER END

# Ignored

######## FILTER BEGIN bar.conf
This goes in bar.conf.
######## FILTER END

Here is the script:

#!/usr/bin/env python3

import os
import argparse

# global settings
start_delimiter = '######## FILTER BEGIN'
end_delimiter = '######## FILTER END'

# parse command line arguments
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input-file", required=True, help="input filename")
parser.add_argument("-o", "--output-dir", required=True, help="output directory")

args = parser.parse_args()

# read the input file
with open(args.input_file, 'r') as input_file:
    input_data = input_file.read()

# iterate through the input data by line
input_lines = input_data.splitlines()
while input_lines:
    # discard lines until the next start delimiter
    while input_lines and not input_lines[0].startswith(start_delimiter):
        input_lines.pop(0)

    # corner case: no delimiter found and no more lines left
    if not input_lines:
        break

    # extract the output filename from the start delimiter
    output_filename = input_lines.pop(0).replace(start_delimiter, "").strip()
    output_path = os.path.join(args.output_dir, output_filename)

    # open the output file
    print("extracting file: {0}".format(output_path))
    with open(output_path, 'w') as output_file:
        # while we have lines left and they don't match the end delimiter
        while input_lines and not input_lines[0].startswith(end_delimiter):
            output_file.write("{0}\n".format(input_lines.pop(0)))

        # remove end delimiter if present
        if input_lines:
            input_lines.pop(0)

And finally, here is how you run it:

$ python3 script.py -i input-file.txt -o ./output-folder/

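【Comments】:

【Answer 8】:

If you have it, use csplit.

If you do not, but you have Python... do not use Perl.

Lazy reading of the file

Your file may be too large to hold in memory all at once, so reading it line by line may be preferable. Assume the input file is named "samplein":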

$ python3 -c "from itertools import count
with open('samplein') as file:
    for i in count():
        firstline = next(file, None)
        if firstline is None:
            break
        with open(f'out{i}', 'w') as out:
            out.write(firstline)
            for line in file:
                out.write(line)
                if line == '-|\n':
                    break"

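【Comments】:

This reads the entire file into memory, which means it will be inefficient or even fail for large files.

@tripleee I have updated the answer to handle very large files.

【Answer 9】: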

cat file| ( I=0; echo -n "">file0; while read line; do echo $line >> file$I; if [ "$line" == '-|' ]; then I=$[I+1]; echo -n "" > file$I; fi; done )

And the formatted version:

#!/bin/bash
cat FILE | (
  I=0;
  echo -n "" > file0;
  while read line; 
  do
    echo $line >> file$I;
    if [ "$line" == '-|' ];
    then I=$[I+1];
      echo -n "" > file$I;
    fi;
  done;
)

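【Comments】:

As ever, the cat is Useless.

@Reishin The linked page explains in more detail how to avoid cat on a single file in every situation. There is a Stack Overflow question with more discussion (the accepted answer notwithstanding, IMHO); ***.com/questions/11710552/useless-use-of-cat

The shell is typically very inefficient at this sort of thing anyway; if you cannot use csplit, an Awk solution is probably preferable to this one (even if you were to fix the problems reported by shellcheck.net etc.; note that it does not currently find all the bugs).

@tripleee But what if the task is to do it without awk, csplit and so on, using only bash?

Then the cat is still useless, and the rest of the script could be simplified and corrected a good deal; but it will still be slow. See e.g. ***.com/questions/13762625/…

【Answer 10】:

This is the kind of problem I wrote context-split for: http://stromberg.dnsalias.org/~strombrg/context-split.html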

$ ./context-split -h
usage:
./context-split [-s separator] [-n name] [-z length]
        -s specifies what regex should separate output files
        -n specifies how output files are named (default: numeric
        -z specifies how long numbered filenames (if any) should be
        -i include line containing separator in output files
        operations are always performed on stdin
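The idea of splitting on a regex separator, with an option to keep or drop the separator line, can be sketched in a few lines of Python. This is not context-split's actual code: `split_by_regex` and `include_separator` are hypothetical names, and placing a kept separator line at the end of the group it closes is an assumption made for this example.

```python
import re


def split_by_regex(lines, separator, include_separator=False):
    """Split lines into groups at every line matching the separator regex."""
    pattern = re.compile(separator)
    groups, current = [], []
    for line in lines:
        if pattern.search(line):
            if include_separator:
                current.append(line)  # assumed: separator ends its group
            groups.append(current)
            current = []
        else:
            current.append(line)
    if current:  # lines after the last separator
        groups.append(current)
    return groups


if __name__ == "__main__":
    sample = ["jdhg3875jdfsgfd\n", "-|\n", "sjdhfdbfjds\n", "-|\n"]
    for group in split_by_regex(sample, r"^-\|$", include_separator=True):
        print(group)
```

Each returned group can then be written to its own file under whatever naming scheme is needed.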

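【Comments】:

Uh, this looks like essentially a duplicate of the standard csplit utility. See @richard's answer.

This is actually the best solution, imo. I had to split a 98G mysql dump, and for some reason csplit ate up all my RAM and was killed, even though it should only need to match one line at a time. Makes no sense. This python script works much better and does not eat up all the RAM.

【Answer 11】:

Here is perl code that does the job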

#!/usr/bin/perl
open(FI,"file.txt") or die "Input file not found";
$cur=0;
open(FO,">res.$cur.txt") or die "Cannot open output file $cur";
while(<FI>)
{
    print FO $_;
    if(/^-\|/)
    {
        close(FO);
        $cur++;
        open(FO,">res.$cur.txt") or die "Cannot open output file $cur"
    }
}
close(FO);

【Comments】:
