Parsing a CSV file using gawk

Posted 2023-02-24


【Title】: Parsing a CSV file using gawk 【Posted】: 2010-09-23 18:49:26 【Question】:

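How can this be done? Simply setting FS="," is not enough, because a quoted field that contains a comma is treated as multiple fields.

An example where FS="," does not work:

File contents: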

one,two,"three, four",five
"six, seven",eight,"nine"

gawk script:

BEGIN { FS="," }
{
  for (i=1; i<=NF; i++) printf "field #%d: %s\n", i, $(i)
  printf "---------------------------\n"
}

Incorrect output:

field #1: one
field #2: two
field #3: "three
field #4:  four"
field #5: five
---------------------------
field #1: "six
field #2:  seven"
field #3: eight
field #4: "nine"
---------------------------

Desired output:

field #1: one
field #2: two
field #3: "three, four"
field #4: five
---------------------------
field #1: "six, seven"
field #2: eight
field #3: "nine"
---------------------------

【Comments】:

See also: ***.com/questions/45420535/… Possible duplicate of "What's the most robust way to efficiently parse CSV using awk?"

【Answer 1】:

The gawk version 4 manual says to use FPAT = "([^,]*)|(\"[^\"]+\")"

When FPAT is defined, it disables FS and specifies fields by their content rather than by their separator.
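To make the "fields by content" idea concrete outside gawk, here is a minimal Python sketch of the same approach using the `re` module. It is an illustration, not gawk's implementation: the function names `fields_by_content` and `unquote` are hypothetical, the non-quoted alternative uses `[^,]+` (so empty fields are dropped), and the quoted alternative is listed first because Python tries alternatives in order instead of picking the longest match. The second pattern additionally accepts doubled quotes inside a quoted field.

```python
import re

# A quoted run, or a run of non-comma characters.
# The quoted alternative comes first: Python's re tries alternatives in order.
SIMPLE = re.compile(r'"[^"]+"|[^,]+')

# Variant that also accepts doubled quotes ("") inside a quoted field.
WITH_DOUBLED = re.compile(r'"(?:[^"]|"")*"|[^,]+')


def fields_by_content(line, pattern=SIMPLE):
    """Return the fields of one CSV line, with quotes left in place.

    Empty unquoted fields are skipped, because [^,]+ needs at least one character.
    """
    return pattern.findall(line)


def unquote(field):
    """Drop surrounding quotes and collapse "" to " (hypothetical helper)."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1].replace('""', '"')
    return field


if __name__ == "__main__":
    for line in ['one,two,"three, four",five', '"six, seven",eight,"nine"']:
        for i, field in enumerate(fields_by_content(line), start=1):
            print(f"field #{i}: {field}")
        print("---------------------------")
```

Run on the two sample lines from the question, this prints the fields with their quotes intact. Neither pattern handles quoted fields that span several lines.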

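【Comments】:

The FPAT concept is interesting. However, the quoted regex does not allow double quotes inside a quoted string. That needs a more complex regex, for example: FPAT="([^,]*)|(\"([^\"]|\"\")+\"[^,]*)". The trailing [^,]* allows malformed fields that begin with a quoted part, such as "abc"def,; it treats def as part of the field. Inside the double quotes, two consecutive double quotes are accepted. This stuff is nasty, which is why CSV-specific modules are usually the best way to handle CSV data, unless the CSV data is clean and simple.

FPAT requires gawk 4. Took me some time... ;)

I set up an alias so that I can easily run gawk on CSVs: alias awkcsv="gawk -v FPAT='([^,]+)|(\"[^\"]+\")'"

【Answer 2】: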

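The short answer is "I wouldn't use gawk to parse CSV if the CSV contains awkward data", where "awkward" means things like commas inside the CSV field data.

The next question is "What other processing are you going to be doing?", since that influences which alternatives you use.

I would probably use Perl and the Text::CSV or Text::CSV_XS modules to read and process the data. Remember, Perl was originally written in part as an awk and sed killer, hence the a2p and s2p programs still distributed with Perl, which convert awk and sed scripts (respectively) into Perl.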

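【Comments】:

See also the csvfix program. Of course, Python (as well as Ruby, Tcl, and most other extensible scripting languages) can be used instead of Perl; it becomes a matter of personal taste or of corporate-mandated (Hobson's) choice.

【Answer 3】: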

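You can use a simple wrapper program called csvquote to sanitize the input and then restore it after awk has finished processing. Pipe your data through it at the start and at the end, and everything should work: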

Before:

gawk -f myprogram.awk input.csv

After:

csvquote input.csv | gawk -f myprogram.awk | csvquote -u

For code and documentation, see https://github.com/dbro/csvquote.
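The sanitize-then-restore idea can be sketched in a few lines of Python. This is not csvquote's code, only an illustration of the technique: while inside a quoted field, commas and newlines are swapped for placeholder characters that a naive comma splitter will not trip over, and they are swapped back afterwards. The function names and the choice of the ASCII unit and record separators as placeholders are assumptions made for this sketch.

```python
# Placeholder characters are an assumption for this sketch
# (ASCII unit separator and record separator).
FIELD_SENTINEL = "\x1f"
RECORD_SENTINEL = "\x1e"


def sanitize(text):
    """Replace commas and newlines that occur inside double quotes."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes  # a doubled "" toggles twice, so state is kept
            out.append(ch)
        elif in_quotes and ch == ",":
            out.append(FIELD_SENTINEL)
        elif in_quotes and ch == "\n":
            out.append(RECORD_SENTINEL)
        else:
            out.append(ch)
    return "".join(out)


def restore(text):
    """Undo sanitize(): put the original commas and newlines back."""
    return text.replace(FIELD_SENTINEL, ",").replace(RECORD_SENTINEL, "\n")


if __name__ == "__main__":
    line = 'one,two,"three, four",five'
    for field in sanitize(line).split(","):  # naive split is now safe
        print(restore(field))
```

The sketch assumes the placeholder characters never appear in the data itself; if they can, a different pair has to be chosen.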

【Comments】:

【Answer 4】:

If it is permitted, I would use the Python csv module, paying special attention to the dialect used and the formatting parameters required, to parse the CSV file you have.
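A minimal sketch of that route, using only the standard library `csv.reader`. The wrapper name `parse_csv` is hypothetical; formatting parameters such as `delimiter`, `quotechar` or `skipinitialspace` are passed straight through to the reader, and the values you need depend on the file you actually have.

```python
import csv
import io


def parse_csv(text, **fmtparams):
    """Parse CSV text into a list of rows (each row is a list of strings).

    fmtparams are forwarded to csv.reader, e.g. delimiter=';'.
    """
    return list(csv.reader(io.StringIO(text), **fmtparams))


if __name__ == "__main__":
    sample = 'one,two,"three, four",five\n"six, seven",eight,"nine"\n'
    for row in parse_csv(sample):
        for i, field in enumerate(row, start=1):
            print(f"field #{i}: {field}")
        print("---------------------------")
```

Unlike the FPAT approach, the reader removes the surrounding quotes and also copes with doubled quotes and quoted fields that contain newlines. When reading from a real file, the csv documentation recommends opening it with `newline=''`.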

【Comments】:

【Answer 5】:

csv2delim.awk

# csv2delim.awk converts comma delimited files with optional quotes to delim separated file
#     delim can be any character, defaults to tab
# assumes no repl characters in text, any delim in line converts to repl
#     repl can be any character, defaults to ~
# changes two consecutive quotes within quotes to '

# usage: gawk -f csv2delim.awk [-v delim=d] [-v repl=`"] input-file > output-file
#       -v delim    delimiter, defaults to tab
#       -v repl     replacement char, defaults to ~

# e.g. gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > test.txt

# abe 2-28-7
# abe 8-8-8 1.0 fixed empty fields, added replacement option
# abe 8-27-8 1.1 used split
# abe 8-27-8 1.2 inline rpl and "" = '
# abe 8-27-8 1.3 revert to 1.0 as it is much faster, split most of the time
# abe 8-29-8 1.4 better message if delim present

BEGIN {
    if (delim == "") delim = "\t"
    if (repl == "") repl = "~"
    print "csv2delim.awk v.m 1.4 run at " strftime() > "/dev/stderr" ###########################################
}

{
    #if ($0 ~ repl) {
    #   print "Replacement character " repl " is on line " FNR ":" $0 ";" > "/dev/stderr"
    #}
    if ($0 ~ delim) {
        # warn that the chosen delimiter already occurs in the input line
        print "Temp delimiter character " delim " is on line " FNR ":" $0 ";" > "/dev/stderr"
        print "    replaced by " repl > "/dev/stderr"
    }
    gsub(delim, repl)

    $0 = gensub(/([^,])\"\"/, "\\1'", "g")
#   $0 = gensub(/\"\"([^,])/, "'\\1", "g")  # not needed above covers all cases

    out = ""
    #for (i = 1;  i <= length($0);  i++)
    n = length($0)
    for (i = 1;  i <= n;  i++)
        if ((ch = substr($0, i, 1)) == "\"")
            inString = (inString) ? 0 : 1 # toggle inString
        else
            out = out ((ch == "," && ! inString) ? delim : ch)
    print out
}

END {
    print NR " records processed from " FILENAME " at " strftime() > "/dev/stderr"
}

test.csv

"first","second","third"
"fir,st","second","third"
"first","sec""ond","third"
" first ",sec   ond,"third"
"first" , "second","th  ird"
"first","sec;ond","third"
"first","second","th;ird"
1,2,3
,2,3
1,2,
,2,
1,,2
1,"2",3
"1",2,"3"
"1",,"3"
1,"",3
"","",""
"","""aiyn","oh"""
"""","""",""""
11,2~2,3

test.bat

rem test csv2delim
rem default is: -v delim=tab -v repl=~
gawk                      -f csv2delim.awk test.csv > test.txt
gawk -v delim=;           -f csv2delim.awk test.csv > testd.txt
gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > testdr.txt
gawk            -v repl=` -f csv2delim.awk test.csv > testr.txt

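【Comments】:

【Answer 6】:

I'm not sure this is the right way to go about it. I would rather work on a CSV file in which either all values are quoted or none are. By the way, awk allows a regex to be the field separator. Check whether that is useful.

【Comments】:

I would also go with the regex approach and try to make it match something like ^"|","|"$ (this is a quick shot; you would of course have to escape the ", I wanted to keep it simple)

【Answer 7】: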


{
  ColumnCount = 0
  $0 = $0 ","                           # Assures all fields end with comma
  while($0)                             # Get fields by pattern, not by delimiter
  {
    match($0, / *"[^"]*" *,|[^,]*,/)    # Find a field with its delimiter suffix
    Field = substr($0, RSTART, RLENGTH) # Get the located field with its delimiter
    gsub(/^ *"?|"? *,$/, "", Field)     # Strip delimiter text: comma/space/quote
    Column[++ColumnCount] = Field       # Save field without delimiter in an array
    $0 = substr($0, RLENGTH + 1)        # Remove processed text from the raw data
  }
}

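Patterns that follow this one can access the fields in Column[]. ColumnCount indicates the number of elements found in Column[]. If not all rows contain the same number of columns, Column[] will contain leftover data after Column[ColumnCount] when processing the shorter rows.

This implementation is slow, but it appears to emulate the FPAT/patsplit() feature found in gawk >= 4.0.0 that was mentioned in a previous answer.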
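The same peel-one-field-at-a-time loop, sketched in Python for comparison. The regular expression is the one used in the awk code above; the function name `split_fields` is hypothetical. One deliberate difference: where the awk version strips an optional quote at each end independently, this sketch strips the quotes only when the field both starts and ends with one.

```python
import re

# A field together with its trailing comma: optionally quoted and padded
# with spaces, or any run of non-comma characters.
FIELD_WITH_COMMA = re.compile(r' *"[^"]*" *,|[^,]*,')


def split_fields(line):
    """Split one CSV line by repeatedly matching a field at the front."""
    rest = line + ","          # ensure every field ends with a comma
    columns = []
    while rest:
        m = FIELD_WITH_COMMA.match(rest)   # always matches: rest ends with ","
        field = m.group(0)[:-1].strip(" ")  # drop the comma and padding
        if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
            field = field[1:-1]             # drop the surrounding quotes
        columns.append(field)
        rest = rest[m.end():]               # remove the processed text
    return columns


if __name__ == "__main__":
    print(split_fields('one,two,"three, four",five'))
    print(split_fields(',2,'))
```

Because a fresh list is built for every line, the stale-entries caveat about Column[] does not apply here. Like the awk original, the pattern does not understand doubled quotes inside a quoted field.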

Reference

【Comments】:

【Answer 8】:

This is what I came up with. Any comments and/or better solutions would be appreciated.

BEGIN { FS="," }
{
  for (i=1; i<=NF; i++) {
    f[++n] = $i
    if (substr(f[n],1,1)=="\"") {
      while (substr(f[n], length(f[n]))!="\"" || substr(f[n], length(f[n])-1, 1)=="\\") {
        f[n] = sprintf("%s,%s", f[n], $(++i))
      }
    }
  }
  for (i=1; i<=n; i++) printf "field #%d: %s\n", i, f[i]
  print "----------------------------------\n"
}

The basic idea is to loop through the fields; any field that starts with a quote but does not end with one gets the next field appended to it.
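A Python sketch of the split-then-rejoin idea follows. Note that it is a variant, not a port: instead of testing whether a piece ends with a quote, it keeps appending pieces while the number of double quotes collected so far is odd, which also copes with doubled quotes next to a comma. The function name `split_and_rejoin` is hypothetical, and the list is rebuilt for every line, so nothing carries over between records.

```python
def split_and_rejoin(line):
    """Split on every comma, then glue pieces back while a quote is still open."""
    fields = []
    pending = None  # a partially assembled quoted field, if any
    for piece in line.split(","):
        current = piece if pending is None else pending + "," + piece
        if current.count('"') % 2 == 1:
            pending = current       # odd quote count: still inside quotes
        else:
            fields.append(current)  # quotes balanced: the field is complete
            pending = None
    if pending is not None:
        fields.append(pending)      # unterminated quote: keep what we have
    return fields


if __name__ == "__main__":
    for line in ['one,two,"three, four",five', '"six, seven",eight,"nine"']:
        for i, field in enumerate(split_and_rejoin(line), start=1):
            print(f"field #{i}: {field}")
        print("----------------------------------")
```

The quotes are left in place, matching the desired output in the question.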

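【Comments】:

This looks more like C.. Are we using the right tool for the right job? I am a newbie to awk, but can't think of any straightforward solution..

@Vijay Dev, newbie means beginner, not expert.

Ah, my English!! I meant to say: "I am a newbie, so I can't think of any straightforward solution".

FYI, this works, but you need "n=0" as the last line for it to behave correctly on multi-line files.

Note that a valid field could be: """Jump"", he said!". This will be split at the comma, but the preceding character is a double quote. The script splits at the comma even though it should not, because the comma is embedded in a quoted field. An odd number of double quotes before the comma indicates the end of the field; an even number does not.

【Answer 9】: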

Perl has the Text::CSV_XS module, which is purpose-built to handle the quoted-comma weirdness. Alternatively, try the Text::CSV module.

perl -MText::CSV_XS -ne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();for $n (0..$#f) {print "field #$n: $f[$n]\n"};print "---\n"}' file.csv

Produces this output:

field #0: one
field #1: two
field #2: three, four
field #3: five
---
field #0: six, seven
field #1: eight
field #2: nine
---

Here is a human-readable version. Save it as parsecsv, chmod +x, and run it as "parsecsv file.csv"

#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new();
open(my $data, '<', $ARGV[0]) or die "Could not open '$ARGV[0]' $!\n";
while (my $line = <$data>) {
    if ($csv->parse($line)) {
        my @f = $csv->fields();
        for my $n (0..$#f) {
            print "field #$n: $f[$n]\n";
        }
        print "---\n";
    }
}

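You may need to point to a different version of perl on your machine, since the Text::CSV_XS module may not be installed on your default version of perl. In that case you will see an error like this: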

Can't locate Text/CSV_XS.pm in @INC (@INC contains: /home/gnu/lib/perl5/5.6.1/i686-linux /home/gnu/lib/perl5/5.6.1 /home/gnu/lib/perl5/site_perl/5.6.1/i686-linux /home/gnu/lib/perl5/site_perl/5.6.1 /home/gnu/lib/perl5/site_perl .).
BEGIN failed--compilation aborted.

If none of your versions of Perl has Text::CSV_XS installed, you will need to:

sudo apt-get install cpanminus
sudo cpanm Text::CSV_XS

【Comments】:
