Pig CSVExcelStorage DoubleQuoted Commas
Posted: 2016-07-11 20:20:52
Question:
I am ingesting CSV files (fields separated by commas and enclosed in double quotes) into HDFS, and I have developed a Pig script that removes the header row and strips the double quotes before I insert the data into Hive with an HQL script.
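This process has been working fine; however, today I found a data problem in one of the tables. The file for that table has a string field that can contain multiple commas inside the double quotes. As a result, for some records the data is loaded into the wrong columns in Hive.
I cannot change the format of the source files.
Currently I am using the PiggyBank CSVExcelStorage to handle the CSV format, as shown below. Can it be modified to produce the correct result? What other options do I have? I noticed that there is now also a CSVLoader, but I have not found any examples that show how to use or implement it.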
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','NOCHANGE','SKIP_INPUT_HEADER')
Edited to add sample data and test results:
Sample input file data:
"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"
Output using CSVExcelStorage with the arguments shown above:
SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q
Output using CSVLoader, called as CSVLoader() (note: I did not see any options that can be passed to the constructor):
P_NAME,,,C_NAME,C_TYPE,PROT,I_NAME,,A_NAME,,C_NM,CO
SAMPLEPNAME,123456,789123,SAMPLECNAME,Upload,SAMPLEINAME,This Sample Name of A, B, and C,3234,This Sample Name of A, B, and C,3234,c_name,R
SAMPLEPNAME2,123457,789124,SAMPLECNAME2,Download,SAMPLEINAME2,This Sample Name,3235,This Sample Name,3235,c_name2,Q
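The only real difference I see is that CSVLoader did not remove the header row, since I saw no option for that, and instead dropped some of the header names.
Am I doing something wrong? A working solution would be much appreciated.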
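Comments:
Does the data in Hive need to keep the commas inside the fields? One way to handle this is to replace the commas inside the fields with another character, such as "|", and then load the data.
@inquisitive_mind Yes, I need to keep the data in its original format.
Answer 1:
To get around the commas inside the fields, you can try this workaround:
1. Load each line of the data as a single field.
2. Treat "," (quote, comma, quote) as the field separator and replace it with the pipe character "|".
3. Replace the remaining leading and trailing double quotes with an empty string.
4. Load the lines into Hive using "|" as the delimiter.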
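A = LOAD 'test1.csv' AS (lines:chararray);    -- one chararray per line (assumes no tabs in the data)
ranked = rank A;                              -- prepends a rank_A column
B = FILTER ranked BY (rank_A > 1);            -- drop the header row (rank 1)
C = FOREACH B GENERATE REPLACE($1,'","','|'); -- "," between fields becomes |
D = FOREACH C GENERATE REPLACE($0,'"','');    -- remove the remaining quotes
DUMP D;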
A = LOAD 'test1.csv' AS (lines:chararray);
"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO"
"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R"
"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q"
ranked = rank A;
(1,"P_NAME","P_ID","C_ID","C_NAME","C_TYPE","PROT","I_NAME","I_ID","A_NAME","A_IDS","C_NM","CO")
(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q")
B = FILTER ranked BY (rank_A > 1);
(2,"SAMPLEPNAME","123456","789123","SAMPLECNAME","Upload","SAMPLEINAME","This Sample Name of A, B, and C","3234","This Sample Name of A, B, and C","3234","c_name","R")
(3,"SAMPLEPNAME2","123457","789124","SAMPLECNAME2","Download","SAMPLEINAME2","This Sample Name","3235","This Sample Name","3235","c_name2","Q")
C = FOREACH B GENERATE REPLACE($1,'","','|');
("SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This Sample Name of A, B, and C|3234|c_name|R")
("SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name|3235|c_name2|Q")
D = FOREACH C GENERATE REPLACE($0,'"','');
(SAMPLEPNAME|123456|789123|SAMPLECNAME|Upload|SAMPLEINAME|This Sample Name of A, B, and C|3234|This Sample Name of A, B, and C|3234|c_name|R)
(SAMPLEPNAME2|123457|789124|SAMPLECNAME2|Download|SAMPLEINAME2|This Sample Name|3235|This Sample Name|3235|c_name2|Q)
You can now load this data into Hive using "|" as the delimiter.
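For that last step the cleaned lines have to be written to HDFS rather than printed to the console. A minimal sketch of the same pipeline, ending in STORE instead of DUMP, might look like the following. Two things here are assumptions and not part of the answer: the output directory name 'test1_pipe' is made up, and the built-in TextLoader is swapped in for the default loader so that each whole line lands in one field even if it contains a tab.

```pig
-- Sketch only: same steps as the script above, written out with STORE.
-- TextLoader (built in) reads each whole line as one chararray field.
A = LOAD 'test1.csv' USING TextLoader() AS (lines:chararray);

-- RANK prepends a sequential rank_A column; the header row gets rank 1.
ranked = RANK A;
B = FILTER ranked BY rank_A > 1;

-- Turn the "," field boundaries into pipes, then drop the outer quotes.
C = FOREACH B GENERATE REPLACE(lines, '","', '|') AS line;
D = FOREACH C GENERATE REPLACE(line, '"', '') AS line;

-- 'test1_pipe' is a hypothetical output directory.
STORE D INTO 'test1_pipe';
```

The Hive table would then be declared with FIELDS TERMINATED BY '|'. Note that this approach only holds as long as no field contains a literal | character, an embedded double quote, or the three-character sequence "," itself.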