Pig CSVExcelStorage Double-Quoted Commas

Posted: 2016-07-11 20:20:52

Question:

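I am ingesting CSV files (fields separated by commas and enclosed in double quotes) into HDFS. I have developed a Pig script that removes the header row and strips the double quotes before I insert the data into Hive with an HQL script.

This process has been working well; however, today I found a data problem in one of the tables. The files for that table have a string field that can contain multiple commas inside the double quotes. As a result, for some records the data is loaded into the wrong columns in Hive.

I cannot change the format of the source files.

Currently I am using the PiggyBank CSVExcelStorage to handle the CSV format, as shown below. Can it be modified to produce the correct result? What other options do I have? I noticed there is now also a CSVLoader, but I have not found any examples showing how to use or implement it.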

USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','NOCHANGE','SKIP_INPUT_HEADER')

Edited to add sample data and test results

Sample input file data:

"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"    
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"

Output using CSVExcelStorage with the arguments shown above:

SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q

Output using CSVLoader as CSVLoader() (note: I did not see any options for arguments to pass to the constructor):

P_NAME,,,C_NAME,C_TYPE,PROT,I_NAME,,A_NAME,,C_NM,CO 
SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q

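The only real difference I see is that CSVLoader did not remove the header row, since I did not find an option for that, and instead dropped some of the header names.

Am I doing something wrong? A working solution would be greatly appreciated.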
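For reference, here is a minimal Python sketch (not part of the original Pig pipeline) of why the unquoted output is ambiguous. It assumes that whatever reads the output downstream splits each line on plain commas, as a comma-delimited Hive table would. The variable names are illustrative only.

```python
import csv
import io

# Sample record copied from the input file shown in the question.
quoted_line = (
    '"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME",'
    '"This Sample Name of A, B, and C","3234",'
    '"This Sample Name of A, B, and C","3234","c_name","R"'
)

# A quote-aware parser keeps the embedded commas inside their fields.
fields = next(csv.reader(io.StringIO(quoted_line)))

# Joining the fields with bare commas (no quotes) reproduces the output
# line shown in the question; a plain split on "," can no longer recover
# the original field boundaries.
unquoted_line = ",".join(fields)
naive_fields = unquoted_line.split(",")

print(len(fields))        # number of real fields in the record
print(len(naive_fields))  # number of pieces after a plain comma split
```

For this record the quote-aware parse yields 12 fields, matching the 12 header columns, while the plain comma split yields 16 pieces, so every column after the first embedded comma shifts to the right.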

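Comments:

Does the data in Hive need to keep the commas inside the field? One way to handle this is to replace the commas in the field with another character, e.g. "|", and then load the data.

@inquisitive_mind Yes, I need to preserve the original format of the data.

Answer 1: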

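To get around the commas inside fields, you can try this workaround:

1. Load each record as a single line.
2. Treat "," (quote, comma, quote) as the field separator and replace it with the pipe character "|".
3. Replace the leading and trailing quotes " with an empty string.
4. Load the lines into Hive using "|" as the delimiter.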

A = LOAD 'test1.csv' AS (lines:chararray);
ranked = rank A;
B = FILTER ranked BY (rank_A > 1);
C = FOREACH B GENERATE REPLACE($1,'","','|');
D = FOREACH C GENERATE REPLACE($0,'"','');
DUMP D;

A = LOAD 'test1.csv' AS (lines:chararray);

"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"

ranked = rank A;

(1,"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO")
(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q")

B = FILTER ranked BY (rank_A > 1);

(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q")

C = FOREACH B GENERATE REPLACE($1,'","','|');

("SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This Sample Name of A, B, and C|3234|c_name|R")
("SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name|3235|c_name2|Q")

D = FOREACH C GENERATE REPLACE($0,'"','');

(SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This Sample Name of A, B, and C|3234|c_name|R)
(SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name|3235|c_name2|Q)

You can now load this data into Hive using "|" as the delimiter.
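The two REPLACE steps can be checked outside Pig with a short Python sketch. This is only an illustration of the same string transformation, not a substitute for the Pig script; the function name `to_pipe_delimited` is hypothetical, and Python's `str.replace` is literal whereas Pig's REPLACE takes a regular expression (for these two patterns the result is the same).

```python
def to_pipe_delimited(line: str) -> str:
    """Mirror the two Pig REPLACE steps on one quoted CSV line."""
    # The three-character sequence "," sits between adjacent quoted fields.
    piped = line.replace('","', '|')
    # Drop every remaining quote; in the sample data only the outer pair is left.
    return piped.replace('"', '')


# Sample record copied from the question's input file.
sample = (
    '"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME",'
    '"This Sample Name of A, B, and C","3234",'
    '"This Sample Name of A, B, and C","3234","c_name","R"'
)

converted = to_pipe_delimited(sample)
columns = converted.split("|")

print(converted)
print(len(columns))  # number of columns after splitting on the pipe
```

On the sample record this yields 12 columns with the embedded commas intact. The approach relies on the data never containing a "|" character or the sequence "," inside a field, and on every field being quoted; if any of these assumptions fails, the columns will shift again.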
