Pig Error on SUM function

Posted: 2015-01-01 20:23:57

I have data like this:

store   trn_date    dept_id sale_amt
1       2014-12-14  101     10007655
1       2014-12-14  101     10007654
1       2014-12-14  101     10007544
6       2014-12-14  104     100086544
8       2014-12-14  101     1000000
9       2014-12-14  106     1000000

I want to get the sum of sale_amt. First I load the data using:

table = LOAD 'table' USING org.apache.hcatalog.pig.HCatLoader();

Then I group the data by store, tran_date, dept_id:

grp_table = GROUP table BY (store, tran_date, dept_id);

Finally, I try to get the SUM of sale_amt using:

grp_gen = FOREACH grp_table GENERATE 
           FLATTEN(group) AS (store, tran_date, dept_id),
           SUM(table.sale_amt) AS tota_sale_amt;

but I get the error below:

================================================================================
Pig Stack Trace
---------------
ERROR 2103: Problem doing work on Longs

org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grouped_all: Local Rearrange[tuple]tuple(false) - scope-1317 Operator Key: scope-1317): org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem doing work on Longs
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:263)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:183)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
        at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1645)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1611)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1462)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:700)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem doing work on Longs
        at org.apache.pig.builtin.AlgebraicLongMathBase.doTupleWork(AlgebraicLongMathBase.java:84)
        at org.apache.pig.builtin.AlgebraicLongMathBase$Intermediate.exec(AlgebraicLongMathBase.java:108)
        at org.apache.pig.builtin.AlgebraicLongMathBase$Intermediate.exec(AlgebraicLongMathBase.java:102)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextTuple(POUserFunc.java:369)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:333)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:281)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Number
        at org.apache.pig.builtin.AlgebraicLongMathBase.doTupleWork(AlgebraicLongMathBase.java:77)
================================================================================

Since I am reading the table with the HCatalog loader, and the column's data type in the Hive table is string, I also tried casting in the script, but I still get the same error.
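For reference, the attempted cast would look roughly like this (a sketch using the relation and field names from the question; per the question, this still produced the same error):

```pig
-- Load through HCatLoader; field types come from the Hive table schema,
-- so a Hive string column arrives in Pig as a chararray.
table = LOAD 'table' USING org.apache.hcatalog.pig.HCatLoader();

-- Attempted fix: cast sale_amt to long in a FOREACH before grouping.
casted = FOREACH table GENERATE store, trn_date, dept_id, (long)sale_amt AS sale_amt;

grp_table = GROUP casted BY (store, trn_date, dept_id);
grp_gen   = FOREACH grp_table GENERATE
              FLATTEN(group) AS (store, trn_date, dept_id),
              SUM(casted.sale_amt) AS total_sale_amt;
```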

Answer 1:

I don't have HCatalog installed on my system, so I tested with a plain file, but the following approach and code should work for you.

1. SUM works only on numeric types (int, long, float, double, bigdecimal, biginteger, or a bytearray cast as double). Your sale_amt column appears to be a string, so you need to cast it to long or double before calling SUM.

2. You should not use store as a field name, because it is a reserved keyword in Pig; you have to rename it or you will get an error. I renamed it to stores.

Example:

Input file (table):

1       2014-12-14      101     10007655
1       2014-12-14      101     10007654
1       2014-12-14      101     10007544
6       2014-12-14      104     100086544
8       2014-12-14      101     1000000
9       2014-12-14      106     1000000

PigScript:

A = LOAD 'table' USING PigStorage() AS (store:chararray,trn_date:chararray,dept_id:chararray,sale_amt:chararray);
B = FOREACH A GENERATE $0 AS stores,trn_date,dept_id,(long)sale_amt; --Renamed the variable store to stores and typecasted the sale_amt to long.
C = GROUP B BY (stores,trn_date,dept_id);
D = FOREACH C GENERATE FLATTEN(group),SUM(B.sale_amt);
DUMP D;

Output:

(1,2014-12-14,101,30022853)
(6,2014-12-14,104,100086544)
(8,2014-12-14,101,1000000)
(9,2014-12-14,106,1000000)

Comments:

Thanks Jayaraman, I know this would work, but the problem is with the Hive column's data type; it causes issues when I cast it.
