Apache Pig group by function is not giving expected output

Posted: 2016-04-04 12:50:49

Question:

I have data in CSV format with the following columns:

"first_name","last_name","company_name","address","city","county","postal","phone1","phone2","email","web"

The sample data is in a file named User.csv, which contains the following rows:

"Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk"
"Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk"
"France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk"

When I load it using PigStorage:

user = LOAD '/home/abhijit/Downloads/User.csv' USING PigStorage(',');

DUMP user;

the output looks like this:

("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk")
("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk")
("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk")

I want to group by city, so I wrote:

grp = group user by $4; 
dump grp;

The output I get is:

( Binney St",("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk"))
("8 Moor Place",("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk"))
("St. Stephens Ward",("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk"))

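The company_name and address fields cause the problem because they contain ','. For example, "14, Taylor St" in the address or "Elliott, John W Esq" in the company name.

So my $4 is taken to be "Taylor St" instead of "St. Stephens Ward".

Because of the extra delimiter inside the address or company name data, the rows are not split into the correct columns when loaded, and the group by does not give the correct result.

How can I get grouped output like the following?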

("Abbey Ward",("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas@gmail.com","http://www.capgeminiamerica.co.uk"))
("St. Stephens Ward",("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz@hotmail.com","http://www.alandrosenburgcpapc.co.uk"))
("East Southbourne and Tuckton W",("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade@hotmail.com","http://www.elliottjohnwesq.co.uk"))


grp = group a by $5 ;

This is not a solution for me. I have already considered it.

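Comments:

Try loading the data with CSVExcelStorage. It should respect the escaping and load the data correctly.

Will try it and update you.

@LiMuBei: Thanks. Using "CSVExcelStorage" worked for me. Now I am able to get the correct data after grouping...

Guess I'll create an answer then.

Answer 1: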

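The problem is that PigStorage does not take escaping into account, so it creates columns where there should not be any (every time an entry contains a comma).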

Using CSVExcelStorage will solve this, because this storage can handle escaping and therefore creates the correct number and order of columns.
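A minimal sketch of the load and group from the question with the loader swapped for CSVExcelStorage. It assumes the loader comes from the piggybank library and that piggybank.jar is available locally; the jar path below is hypothetical and depends on the Pig installation.

```pig
-- Assumption: the piggybank.jar path is hypothetical; adjust it to the local install.
REGISTER /path/to/piggybank.jar;

-- CSVExcelStorage treats a comma inside a double-quoted field as data,
-- so "14, Taylor St" and "Elliott, John W Esq" each stay in a single column.
user = LOAD '/home/abhijit/Downloads/User.csv'
       USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');

-- With the columns aligned, $4 refers to the city field in every row.
grp = GROUP user BY $4;
DUMP grp;
```

Grouping by position keeps the script close to the original; declaring a schema with AS in the LOAD statement and grouping by a named field such as city would probably be easier to read.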
