Ignoring comma in field of CSV file with awk

Posted: 2017-09-14 14:47:13

Question:

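I'm trying to get a number from the second field of the last line of a CSV file. So far I have this:

awk -F"," 'END {print $2}' /file/path/fileName.csv

This works unless the first field of the last line contains a comma. So for a line that looks like this,

"Company Name, LLC", 12345, Type1, SubType3

..."Company Name, LLC" is actually the first field, but the awk command returns  LLC" (the text after the embedded comma) instead of 12345.

How can I ignore the comma inside the first field so that I get the information in the second field?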

Comments:

If the last three fields cannot contain a comma, you can use $(NF-2), assuming there are four fields.

Answer 1:

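I think your requirement is a perfect use case for FPAT in GNU Awk.

Quoting the man page as-is:

Normally, when using FS, gawk defines the fields as the parts of the record that occur in between each field separator. In other words, FS defines what a field is not, instead of what a field is. However, there are times when you really want to define the fields by what they are, and not by what they are not.

The most notorious such case is so-called comma-separated values (CSV) data. If commas only separated the data, there wouldn't be an issue. The problem comes when one of the fields contains an embedded comma. In such cases, most programs embed the field in double quotes.

In the case of CSV data as presented here, each field is either "anything that is not a comma," or "a double quote, anything that is not a double quote, and a closing double quote." If written as a regular expression constant (see Regexp), we would have /([^,]+)|("[^"]+")/. Writing this as a string requires us to escape the double quotes, leading to:

FPAT = "([^,]+)|(\"[^\"]+\")"

Using it on your input file,

awk 'BEGIN{FPAT = "([^,]+)|(\"[^\"]+\")"}{print $1}' file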
"Company Name, LLC"

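Comments:

This is a good answer, but you need to replace the plus sign (+) with an asterisk (*); otherwise, if the CSV contains commas with nothing between them or double quotes with nothing inside ( ,,,, ), it will skip columns.

Answer 2: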

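There is no general answer to this question, because regular expressions are not powerful enough (in the general case) to parse CSV. My solution is a C program that preprocesses the input with a finite state machine; its output can then be fed to awk: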

/* NAME
 *
 *     csv -- convert comma-separated values file to character-delimited
 *
 *
 * SYNOPSIS
 *
 *     csv [-Cc] [-Fc] [filename ...]
 *
 *
 * DESCRIPTION
 *
 *     Csv reads from standard input or from one or more files named on
 *     the command line a sequence of records in comma-separated values
 *     format and writes on standard output the same records in character-
 *     delimited format.  Csv returns 0 on success, 1 for option errors,
 *     and 2 if any file couldn't be opened.
 *
 *     The comma-separated values format has developed over time as a
 *     set of conventions that has never been formally defined, and some
 *     implementations are in conflict about some of the details.  In
 *     general, the comma-separated values format is used by databases,
 *     spreadsheets, and other programs that need to write data consisting
 *     of records containing fields.  The data is written as ascii text,
 *     with records terminated by newlines and fields containing zero or
 *     more characters separated by commas.  Leading and trailing space in
 *     unquoted fields is preserved.  Fields may be surrounded by double-
 *     quote characters (ascii \042); such fields may contain newlines,
 *     literal commas (ascii \054), and double-quote characters
 *     represented as two successive double-quotes.  The examples shown
 *     below clarify many irregular situations that may arise.
 *
 *     The field separator is normally a comma, but can be changed to an
 *     arbitrary character c with the command line option -Cc.  This is
 *     useful in those european countries that use a comma instead of a
 *     decimal point, where the field separator is normally changed to a
 *     semicolon.
 *
 *     Character-delimited format has records terminated by newlines and
 *     fields separated by a single character, which is \034 by default
 *     but may be changed with the -Fc option on the command line.
 *
 *
 * EXAMPLE
 *
 *     Each record below has five fields.  For readability, the three-
 *     character sequence TAB represents a single tab character (ascii
 *     \011).
 *
 *         $ cat testdata.csv
 *         1,abc,def ghi,jkl,unquoted character strings
 *         2,"abc","def ghi","jkl",quoted character strings
 *         3,123,456,789,numbers
 *         4,   abc,def   ,   ghi   ,strings with whitespace
 *         5,   "abc","def"   ,   "ghi"   ,quoted strings with whitespace
 *         6,   123,456   ,   789   ,numbers with whitespace
 *         7,TAB123,456TAB,TAB789TAB,numbers with tabs for whitespace
 *         8,   -123,   +456,   1E3,more numbers with whitespace
 *         9,123 456,123"456,  123 456   ,strange numbers
 *         10,abc",de"f,g"hi,embedded quotes
 *         11,"abc""","de""f","g""hi",quoted embedded quotes
 *         12,"","" "",""x"",doubled quotes
 *         13,"abc"def,abc"def","abc" "def",strange quotes
 *         14,,"",   ,empty fields
 *         15,abc,"def
 *         ghi",jkl,embedded newline
 *         16,abc,"def",789,multiple types of fields
 *
 *         $ csv -F'|' testdata.csv
 *         1|abc|def ghi|jkl|unquoted character strings
 *         2|abc|def ghi|jkl|quoted character strings
 *         3|123|456|789|numbers
 *         4|   abc|def   |   ghi   |strings with whitespace
 *         5|   "abc"|def   |   "ghi"   |quoted strings with whitespace
 *         6|   123|456   |   789   |numbers with whitespace
 *         7|TAB123|456TAB|TAB789TAB|numbers with tabs for whitespace
 *         8|   -123|   +456|   1E3|more numbers with whitespace
 *         9|123 456|123"456|  123 456   |strange numbers
 *         10|abc"|de"f|g"hi|embedded quotes
 *         11|abc"|de"f|g"hi|quoted embedded quotes
 *         12|| ""|x""|doubled quotes
 *         13|abcdef|abc"def"|abc "def"|strange quotes
 *         14|||   |empty fields
 *         15|abc|def
 *         ghi|jkl|embedded newline
 *         16|abc|def|789|multiple types of fields
 *
 *     It is particularly easy to pipe the output from csv into any of
 *     the unix tools that accept character-delimited fielded text data
 *     files, such as sort, join, or cut.  For example:
 *
 *         csv datafile.csv | awk -F'\034' -f program.awk
 *
 *
 * BUGS
 *
 *     On DOS, Windows, and OS/2 systems, processing of each file stops
 *     at the first appearance of the ascii \032 (control-Z) end of file
 *     character.
 *
 *     Because newlines embedded in quoted fields are treated literally,
 *     a missing closing quote can suck up all remaining input.
 *
 *
 * LICENSE
 *
 *     This program was written by Philip L. Bewig of Saint Louis,
 *     Missouri, United States of America on February 28, 2002 and
 *     placed in the public domain.
 */

#include <stdio.h>
#include <stdlib.h>   /* malloc, exit */
#include <string.h>   /* strlen, strcpy */

/* dofile -- convert one file from comma-separated to delimited */
void dofile(char ofs, char fs, FILE *f) {
    int c; /* current input character */

    START:
        c = fgetc(f);
        if (c == EOF)  {                     return; }
        if (c == '\r') {                     goto CARRIAGE_RETURN; }
        if (c == '\n') {                     goto LINE_FEED; }
        if (c == '\"') {                     goto QUOTED_FIELD; }
        if (c == fs)   { putchar(ofs);       goto NOT_FIELD; }
        /* default */  { putchar(c);         goto UNQUOTED_FIELD; }

    NOT_FIELD:
        c = fgetc(f);
        if (c == EOF)  { putchar('\n');      return; }
        if (c == '\r') {                     goto CARRIAGE_RETURN; }
        if (c == '\n') {                     goto LINE_FEED; }
        if (c == '\"') {                     goto QUOTED_FIELD; }
        if (c == fs)   { putchar(ofs);       goto NOT_FIELD; }
        /* default */  { putchar(c);         goto UNQUOTED_FIELD; }

    QUOTED_FIELD:
        c = fgetc(f);
        if (c == EOF)  { putchar('\n');      return; }
        if (c == '\"') {                     goto MAY_BE_DOUBLED_QUOTES; }
        /* default */  { putchar(c);         goto QUOTED_FIELD; }

    MAY_BE_DOUBLED_QUOTES:
        c = fgetc(f);
        if (c == EOF)  { putchar('\n');      return; }
        if (c == '\r') {                     goto CARRIAGE_RETURN; }
        if (c == '\n') {                     goto LINE_FEED; }
        if (c == '\"') { putchar('\"');      goto QUOTED_FIELD; }
        if (c == fs)   { putchar(ofs);       goto NOT_FIELD; }
        /* default */  { putchar(c);         goto UNQUOTED_FIELD; }

    UNQUOTED_FIELD:
        c = fgetc(f);
        if (c == EOF)  { putchar('\n');      return; }
        if (c == '\r') {                     goto CARRIAGE_RETURN; }
        if (c == '\n') {                     goto LINE_FEED; }
        if (c == fs)   { putchar(ofs);       goto NOT_FIELD; }
        /* default */  { putchar(c);         goto UNQUOTED_FIELD; }

    CARRIAGE_RETURN:
        c = fgetc(f);
        if (c == EOF)  { putchar('\n');      return; }
        if (c == '\r') { putchar('\n');      goto CARRIAGE_RETURN; }
        if (c == '\n') { putchar('\n');      goto START; }
        if (c == '\"') { putchar('\n');      goto QUOTED_FIELD; }
        if (c == fs)   { printf("\n%c",ofs); goto NOT_FIELD; }
        /* default */  { printf("\n%c",c);   goto UNQUOTED_FIELD; }

    LINE_FEED:
        c = fgetc(f);
        if (c == EOF)  { putchar('\n');      return; }
        if (c == '\r') { putchar('\n');      goto START; }
        if (c == '\n') { putchar('\n');      goto LINE_FEED; }
        if (c == '\"') { putchar('\n');      goto QUOTED_FIELD; }
        if (c == fs)   { printf("\n%c",ofs); goto NOT_FIELD; }
        /* default */  { printf("\n%c",c);   goto UNQUOTED_FIELD; }
}

/* main -- process command line, call appropriate conversion */
int main(int argc, char *argv[]) {
    char ofs = '\034'; /* output field separator */
    char fs = ',';     /* input field separator */
    int  status = 0;   /* error status for return to operating system */
    char *progname;    /* name of program for error messages */

    FILE *f;
    int i;

    progname = (char *) malloc(strlen(argv[0])+1);
    strcpy(progname, argv[0]);

    /* consume leading -Cc / -Fc options */
    while (argc > 1 && argv[1][0] == '-') {
        switch (argv[1][1]) {
            case 'c':
            case 'C':
                fs = argv[1][2];
                break;
            case 'f':
            case 'F':
                ofs = argv[1][2];
                break;
            default:
                fprintf(stderr, "%s: unknown argument %s\n",
                    progname, argv[1]);
                fprintf(stderr,
                   "usage: %s [-Cc] [-Fc] [filename ...]\n",
                    progname);
                exit(1);
        }
        argc--;
        argv++;
    }

    if (argc == 1)
        dofile(ofs, fs, stdin);
    else
        for (i = 1; i < argc; i++)
            if ((f = fopen(argv[i], "r")) == NULL) {
                fprintf(stderr, "%s: can't open %s\n",
                    progname, argv[i]);
                status = 2;
            } else {
                dofile(ofs, fs, f);
                fclose(f);
            }

    exit(status);
}
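
For the narrower task in the question (one field from the last line), a similar character-by-character state machine can be sketched in portable awk, without FPAT or a separate C program. The names last_line_field and csv_field, the file name sample.csv, and the sample rows are illustrative assumptions, not part of the program above. The sketch handles quoted commas and doubled quotes, but it is deliberately simpler than the C program: it treats every double quote outside a quoted section as a delimiter (even after leading spaces), trims surrounding blanks, and does not support newlines embedded in quoted fields.

```shell
# Hypothetical sample data.
printf '%s\n' \
  'Acme Inc, 11111, Type2, SubType1' \
  '"Company Name, LLC", 12345, Type1, SubType3' > sample.csv

# Illustrative helper: last_line_field FILE N prints field N of the last line.
last_line_field() {
  awk -v n="$2" '
  # Walk the line one character at a time, tracking whether we are inside quotes.
  function csv_field(line, want,    i, c, inq, cur, out) {
      cur = 1; inq = 0; out = ""
      for (i = 1; i <= length(line); i++) {
          c = substr(line, i, 1)
          if (inq) {
              if (c == "\"") {
                  if (substr(line, i + 1, 1) == "\"") { i++ }   # doubled quote: keep one
                  else { inq = 0; continue }                     # closing quote
              }
          } else if (c == "\"") { inq = 1; continue }            # opening quote
          else if (c == ",") {                                   # separator outside quotes
              if (cur == want) break
              cur++; continue
          }
          if (cur == want) out = out c
      }
      gsub(/^[ \t]+|[ \t]+$/, "", out)                           # trim surrounding blanks
      return out
  }
  { last = $0 }                        # remember the most recent record
  END { print csv_field(last, n + 0) }' "$1"
}

last_line_field sample.csv 2
```

Calling last_line_field sample.csv 1 returns the company name without its surrounding quotes, which is a design choice of this sketch rather than something the thread prescribes.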
