Pig: how to format a semi-structured CSV with filters


【Title】: Pig how to format a semi-structured CSV with filters 【Posted】: 2015-06-03 08:55:03 【Question】:

I have a semi-structured CSV that looks like this.

VTS,01,0099,7022606164,SP,GP,33,060646,A,1258.9805,N,07735.9303,E,0.0,278.6,280515,0000,00,4000,11,999,842,4B61
VTS,01,0099,7022606164,NM,GP,20,060637,A,1258.9805,N,07735.9302,E,0.0,278.6,280515,0000,00,4000,11,999,841,7407+++
VTS,66,0065,7022606164,NM,0,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++
VTS,01,0099,7022606164,NM,GP,22,060656,A,1258.9804,N,07735.9301,E,0.0,278.6,280515,0000,00,4000,11,999,843,8FEB+++
VTS,01,0099,7022606164,NM,GP,22,060721,A,1258.9803,N,07735.9304,E,0.0,278.6,280515,0000,00,4000,11,999,845,044D++++++
VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++
VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE

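I want to build three separate tables from this data: one for the VTS,01 records, one for VTS,99, and one for VTS,66. I also need to remove the "+++" that is appended to some lines in error. I wrote this Pig script for that: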

data = load '/user/simulator/SKYTRACK/27thMay2015' using PigStorage('\n') as (f1:chararray);
splt = foreach data generate FLATTEN(STRSPLIT($0, '\\+++'));
data_pkt = FILTER splt BY $0 MATCHES '.*VTS,01+.*';
sos_pkt = FILTER splt BY $1 MATCHES '.*VTS,99+.*';
health_pkt = FILTER splt BY $2 MATCHES '.*VTS,66+.*';

When I test this script for each table separately, only one of them produces output; the others return nothing:

dump data_pkt; dump sos_pkt; dump health_pkt;

I am very new to Pig, so any help with this would be much appreciated.

【Comments】:

【Answer 1】:

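To remove the +++, escape the '+' and state the repetition explicitly instead of escaping only the first '+'. You were not very specific about what form these plus signs can take, so this regex splits on any run of three or more of them:

 "\\+{3,}"

So, in your Pig script:

splt = foreach data generate FLATTEN(STRSPLIT($0, '\\+{3,}'));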
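
For context, here is a minimal sketch of how that split might fit into the full script from the question. It assumes each input line is loaded as a single field (TextLoader is used here in place of the question's PigStorage('\n')) and that the record type is always the second comma-separated value. After the split each row most likely holds only one field, so all three filters reference $0; that is probably why the filters on $1 and $2 returned nothing.

```pig
-- Load every line as one chararray field (path taken from the question).
data = LOAD '/user/simulator/SKYTRACK/27thMay2015' USING TextLoader() AS (line:chararray);

-- Split on any run of three or more '+'; trailing empty pieces are dropped,
-- so a line ending in "+++" or "++++++" yields a single field.
splt = FOREACH data GENERATE FLATTEN(STRSPLIT(line, '\\+{3,}'));

-- MATCHES must match the whole string, hence the trailing '.*'.
-- Every filter looks at $0 because that is the only field left.
data_pkt   = FILTER splt BY (chararray)$0 MATCHES 'VTS,01,.*';
sos_pkt    = FILTER splt BY (chararray)$0 MATCHES 'VTS,99,.*';
health_pkt = FILTER splt BY (chararray)$0 MATCHES 'VTS,66,.*';
```

Anchoring the pattern on 'VTS,01,' rather than '.*VTS,01+.*' avoids accidental matches on values such as 'VTS,011'.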

Although Aman's answer is correct, I would rather use SPLIT than FILTER to separate the data sets:

 a = load '/abc.txt';
 SPLIT a INTO 
     b01 IF $1 == 01,
     b66 IF $1 == 66,
     b99 IF $1 == 99;
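
A slightly more defensive variant of the same idea, as a sketch only: it assumes the file is comma-delimited, compares the type code as text so the leading zero in '01' is preserved, and adds an OTHERWISE branch (available in Pig 0.10 and later) to catch unexpected record types. The relation name other is illustrative.

```pig
-- Comma-delimited load; without a schema the fields are addressed by position.
a = LOAD '/abc.txt' USING PigStorage(',');

-- Route each record by the type code in the second field ($1).
SPLIT a INTO
    b01 IF (chararray)$1 == '01',
    b66 IF (chararray)$1 == '66',
    b99 IF (chararray)$1 == '99',
    other OTHERWISE;  -- anything that is not 01, 66 or 99
```

SPLIT declares all branches in one statement, which is the reason for preferring it over three separate FILTER statements.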

【Comments】:

You forgot to add using PigStorage(','). By default Pig splits on tabs, so your answer will not work as written.

It works fine for me now: a = load '/abc.txt' using PigStorage(','); SPLIT a INTO b01 IF $1 == 01, b66 IF $1 == 66, b99 IF $1 == 99; Thanks, everyone.

【Answer 2】:

This filters your records by value.

 a = load '/abc.txt' using PigStorage(',');
 b1 = FILTER a by $1==01;
 b66 = FILTER a by $1==66;
 b99 = FILTER a by $1==99;

To remove the +++, you will have to write a simple Pig UDF.

Output:

(VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE+++)
(VTS,99,0065,7022606164,NM,0,A,GP,22,060648,280515,1258.9804,N,07735.9301,E,04AE)
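
Note that the output above still carries the trailing +++ on one record. Before writing a UDF, it may be worth trying Pig's built-in REPLACE function. The sketch below assumes the lines are loaded whole and that a run of three or more '+' never appears inside legitimate data; the relation names are illustrative.

```pig
-- Load each line as a single chararray so the cleanup sees the whole record.
raw = LOAD '/abc.txt' USING TextLoader() AS (line:chararray);

-- Delete every run of three or more '+' characters.
cleaned = FOREACH raw GENERATE REPLACE(line, '\\+{3,}', '') AS line;

-- Filter on the cleaned line, then break it into comma-separated fields.
b99_lines  = FILTER cleaned BY line MATCHES 'VTS,99,.*';
b99_fields = FOREACH b99_lines GENERATE FLATTEN(STRSPLIT(line, ','));
```

Cleaning the whole line first sidesteps the fact that the three record types have different numbers of fields, so the position of the last field varies.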

【Comments】:

You forgot to add using PigStorage(','). By default Pig splits on tabs, so your answer will not work as written.

【Answer 3】:

This works reasonably well for now.

data = load '/user/simulator/SKYTRACK/27thMay2015' using PigStorage(',');

splt = foreach data generate $0 as col0:chararray,$1 as col1:chararray,$2 as col2:chararray,$3 as col3:chararray,$4 as col4:chararray,$5 as col5:chararray,$6 as col6:chararray,$7 as col7:chararray,$8 as col8:chararray,$9 as col9:chararray,$10 as col10:chararray,$11 as col11:chararray,$12 as col12:chararray,$13, FLATTEN(STRSPLIT($14, '\\+++'));

data_pkt = FILTER splt BY $1 MATCHES '.*01+.*';
health_pkt = FILTER splt BY $1 MATCHES '.*66+.*';
sos_pkt = FILTER splt BY $1 MATCHES '.*99+.*';

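The problem is that it takes three separate steps.

【Comments】:

To split the data by the value of the record-type column, see @g.l's answer, which avoids the three FILTER operations.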
