Loading a large amount of CSV data into Pig

Posted: 2016-03-26 21:45:26

【Question】:

I am using this query in Pig to load data from a CSV file that contains 50000 records.

Q2 = LOAD '/home/user/q2.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE') AS (Id:chararray,
PostTypeId:chararray, 
AcceptedAnswerId:chararray, 
ParentId:chararray, 
CreationDate:chararray, 
DeletionDate:chararray, 
Score:chararray, 
ViewCount:chararray, 
Body:chararray, 
OwnerUserId:chararray, 
OwnerDisplayName:chararray, 
LastEditorUserId:chararray, 
LastEditorDisplayName:chararray, 
LastEditDate:chararray, 
LastActivityDate:chararray, 
Title:chararray, 
Tags:chararray, 
AnswerCount:chararray, 
CommentCount:chararray, 
FavoriteCount:chararray, 
ClosedDate:chararray, 
CommunityOwnedDate:chararray);

Here is the query that cleans the data by removing \n, commas, and stray quotes from the Body field and several other fields.

Q2Clean = FOREACH Q2 GENERATE
Id as Id, 
PostTypeId as PostTypeId, 
AcceptedAnswerId as AcceptedAnswerId, 
(chararray)REPLACE(ParentId,'"','')  as ParentId, 
CreationDate as CreationDate, 
(chararray)REPLACE(DeletionDate,'"','') as DeletionDate, 
Score as Score, 
ViewCount as ViewCount,  
(chararray)REPLACE(REPLACE(Body,'\n',''),',','') as Body, 
OwnerUserId as OwnerUserId, 
(chararray)REPLACE(OwnerDisplayName,'"','') as OwnerDisplayName, 
LastEditorUserId as LastEditorUserId, 
(chararray)REPLACE(LastEditorDisplayName,'"','') as LastEditorDisplayName, 
LastEditDate as LastEditDate, 
LastActivityDate as LastActivityDate, 
(chararray)REPLACE(Title,',','') as Title, 
(chararray)REPLACE(Tags,',','') as Tags, 
AnswerCount as AnswerCount, 
CommentCount as CommentCount, 
FavoriteCount as FavoriteCount, 
(chararray)REPLACE(ClosedDate,'"','') as ClosedDate, 
(chararray)REPLACE(CommunityOwnedDate,'"','') as CommunityOwnedDate;

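The problem is that when I store the output, Pig reports 617538 records written. It creates two files. The first file has 27000 correctly formatted records, but the second one is not stored properly: it contains about 610000 lines, many of which hold nothing but commas. How can I load the data correctly so that the output shows 50000 rows instead of 617538?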
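
One way to check whether the input really holds the expected number of logical records is to parse it outside Pig with a CSV reader that honours line breaks inside quoted fields. The sketch below uses Python's `csv` module on a small in-memory sample; the sample rows and the helper name `count_lines_and_records` are made up for illustration and are not taken from the asker's file. It only shows how physical lines and logical records diverge once a quoted field such as Body contains newlines.

```python
import csv
import io

# Hypothetical sample: 2 logical records. The first record has a quoted
# second field that spans three physical lines.
SAMPLE = '1,"line one\nline two\nline three",First\r\n2,"single line",Second\r\n'


def count_lines_and_records(text):
    """Return (physical_lines, logical_records) for a CSV string."""
    # Physical lines: what a line-oriented tool such as `wc -l` would count.
    physical_lines = len(text.splitlines())
    # Logical records: csv.reader keeps quoted line breaks inside one field.
    reader = csv.reader(io.StringIO(text, newline=""))
    logical_records = sum(1 for _ in reader)
    return physical_lines, logical_records


lines, records = count_lines_and_records(SAMPLE)
print(lines, records)
```

If a count like this on the real file gave roughly 50000 records but far more physical lines, that would suggest the extra rows in the output come from multi-line fields being split, rather than from extra records in the source.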

【Answer 1】:

I think the problem is in the following part of the script.

(chararray)REPLACE(REPLACE(Body,'\n',''),',','')as Body, 

You have to add another backslash so that the '\n' in the pattern is escaped:

(chararray)REPLACE(REPLACE(Body,'\\n',''),',','')as Body, 
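
Pig's REPLACE takes a regular expression as its second argument. As a rough analogy only, the Python sketch below compares a pattern holding a literal newline character with a pattern holding the two-character regex escape `\n`. How Pig itself parses the quoted literal before the regex engine sees it is not shown here and is an assumption left open; the helper name `strip_pattern` and the sample strings are hypothetical.

```python
import re

# Hypothetical Body value with an embedded line break and a comma.
body = "first line\nsecond line, with a comma"


def strip_pattern(text, pattern):
    """Remove every match of the regex `pattern` from `text`."""
    return re.sub(pattern, "", text)


# Pattern is a single character: an actual newline.
with_literal_newline = strip_pattern(body, "\n")
# Pattern is two characters: a backslash followed by 'n' (a regex escape).
with_regex_escape = strip_pattern(body, "\\n")
print(with_literal_newline == with_regex_escape)

# Windows-style line endings leave a carriage return behind unless the
# pattern also covers '\r'.
crlf_body = "first line\r\nsecond line"
print(repr(strip_pattern(crlf_body, "\\n")))
print(repr(strip_pattern(crlf_body, "[\\r\\n]+")))
```

At the regex level both spellings match a line break, so if the same holds inside Pig, the extra backslash alone would not be expected to change the record count. A character class such as `[\r\n]+` additionally covers carriage returns.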

【Comments】:

I tried adding another backslash to the \n pattern, but it still shows the same number of records.

@user6118910 Can you post some sample data?
