Pig schema tuple not set. Will not generate code


【Title】Pig schema tuple not set. Will not generate code
【Posted】2018-04-06 01:59:40
【Question】:

I ran the following commands in Pig on the Google n-gram dataset:

inp = LOAD 'link to file' AS (ngram:chararray, year:int, occurences:float, books:float);

filter_input = FILTER inp BY (occurences >= 400) AND (books >= 8);

groupinp = GROUP filter_input BY ngram;

sum_occ = FOREACH groupinp GENERATE FLATTEN(group) as ngram, SUM(filter_input.occurences) / SUM(filter_input.books) AS ntry;

roundto = FOREACH sum_occ GENERATE sum_occ.ngram, ROUND_TO( sum_occ.ntry , 2 );

But I get the following error:

DUMP roundto;
601062 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan  - Encountered Warning IMPLICIT_CAST_TO_FLOAT 2 time(s).
18/04/06 01:46:03 WARN newplan.BaseOperatorPlan: Encountered Warning IMPLICIT_CAST_TO_FLOAT 2 time(s).
601067 [main] INFO  org.apache.pig.tools.pigstats.ScriptState  - Pig features used in the script: GROUP_BY,FILTER
18/04/06 01:46:03 INFO pigstats.ScriptState: Pig features used in the script: GROUP_BY,FILTER
601111 [main] INFO  org.apache.pig.data.SchemaTupleBackend  - Key [pig.schematuple] was not set... will not generate code.
18/04/06 01:46:03 INFO data.SchemaTupleBackend: Key [pig.schematuple] was not set... will not generate code.
601111 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer  - RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]
18/04/06 01:46:03 INFO optimizer.LogicalPlanOptimizer: RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]
601238 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher  - Tez staging directory is /tmp/temp-336429202 and resources directory is /tmp/temp-336429202
18/04/06 01:46:03 INFO tez.TezLauncher: Tez staging directory is /tmp/temp-336429202 and resources directory is /tmp/temp-336429202
601239 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.plan.TezCompiler  - File concatenation threshold: 100 optimistic? false
18/04/06 01:46:03 INFO plan.TezCompiler: File concatenation threshold: 100 optimistic? false
601241 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.CombinerOptimizerUtil  - Choosing to move algebraic foreach to combiner
18/04/06 01:46:03 INFO util.CombinerOptimizerUtil: Choosing to move algebraic foreach to combiner
601265 [main] INFO  org.apache.pig.builtin.PigStorage  - Using PigTextInputFormat
18/04/06 01:46:03 INFO builtin.PigStorage: Using PigTextInputFormat
18/04/06 01:46:03 INFO input.FileInputFormat: Total input files to process : 1
601285 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths to process : 1
18/04/06 01:46:03 INFO util.MapRedUtil: Total input paths to process : 1
601285 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil  - Total input paths (combined) to process : 1
18/04/06 01:46:03 INFO util.MapRedUtil: Total input paths (combined) to process : 1
18/04/06 01:46:03 INFO hadoop.MRInputHelpers: NumSplits: 1, SerializedSize: 408
601322 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: joda-time-2.9.4.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: joda-time-2.9.4.jar
601322 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: pig-0.17.0-core-h2.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: pig-0.17.0-core-h2.jar
601322 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: antlr-runtime-3.4.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: antlr-runtime-3.4.jar
601322 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Local resource: automaton-1.11-8.jar
18/04/06 01:46:03 INFO tez.TezJobCompiler: Local resource: automaton-1.11-8.jar
601402 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - For vertex - scope-141: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
18/04/06 01:46:03 INFO tez.TezDagBuilder: For vertex - scope-141: parallelism=1, memory=1536, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1229m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
601402 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Processing aliases: filter_input,groupinp,inp,sum_occ
18/04/06 01:46:03 INFO tez.TezDagBuilder: Processing aliases: filter_input,groupinp,inp,sum_occ
601402 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],sum_occ[4,10],groupinp[3,11]
18/04/06 01:46:03 INFO tez.TezDagBuilder: Detailed locations: inp[1,6],inp[-1,-1],filter_input[2,15],sum_occ[4,10],groupinp[3,11]
601402 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Pig features in the vertex: 
18/04/06 01:46:03 INFO tez.TezDagBuilder: Pig features in the vertex: 
601449 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Set auto parallelism for vertex scope-142
18/04/06 01:46:03 INFO tez.TezDagBuilder: Set auto parallelism for vertex scope-142
601450 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - For vertex - scope-142: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
18/04/06 01:46:03 INFO tez.TezDagBuilder: For vertex - scope-142: parallelism=1, memory=3072, java opts=-Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2458m -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=<LOG_DIR> -Dtez.root.logger=INFO,CLA 
601450 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Processing aliases: roundto,sum_occ
18/04/06 01:46:03 INFO tez.TezDagBuilder: Processing aliases: roundto,sum_occ
601450 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Detailed locations: sum_occ[4,10],roundto[6,10]
18/04/06 01:46:03 INFO tez.TezDagBuilder: Detailed locations: sum_occ[4,10],roundto[6,10]
601450 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezDagBuilder  - Pig features in the vertex: GROUP_BY
18/04/06 01:46:03 INFO tez.TezDagBuilder: Pig features in the vertex: GROUP_BY
601489 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJobCompiler  - Total estimated parallelism is 2
18/04/06 01:46:04 INFO tez.TezJobCompiler: Total estimated parallelism is 2
601531 [PigTezLauncher-0] INFO  org.apache.pig.tools.pigstats.tez.TezScriptState  - Pig script settings are added to the job
18/04/06 01:46:04 INFO tez.TezScriptState: Pig script settings are added to the job
18/04/06 01:46:04 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.8.4, revision=300391394352b074b85b529e870816a72c6f314a, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2018-03-21T23:55:28Z ]
18/04/06 01:46:04 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-28-12.ec2.internal/172.31.28.12:8032
18/04/06 01:46:04 INFO client.TezClient: Using org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager to manage Timeline ACLs
18/04/06 01:46:04 INFO impl.TimelineClientImpl: Timeline service address: http://ip-172-31-28-12.ec2.internal:8188/ws/v1/timeline/
18/04/06 01:46:04 INFO client.TezClient: Session mode. Starting session.
18/04/06 01:46:04 INFO client.TezClientUtils: Using tez.lib.uris value from configuration: hdfs:///apps/tez/tez.tar.gz
18/04/06 01:46:04 INFO client.TezClientUtils: Using tez.lib.uris.classpath value from configuration: null
18/04/06 01:46:04 INFO client.TezClient: Tez system stage directory hdfs://ip-172-31-28-12.ec2.internal:8020/tmp/temp-336429202/.tez/application_1522978297921_0003 doesn't exist and is created
18/04/06 01:46:04 INFO acls.ATSHistoryACLPolicyManager: Created Timeline Domain for History ACLs, domainId=Tez_ATS_application_1522978297921_0003
18/04/06 01:46:04 INFO impl.YarnClientImpl: Submitted application application_1522978297921_0003
18/04/06 01:46:04 INFO client.TezClient: The url to track the Tez Session: http://ip-172-31-28-12.ec2.internal:20888/proxy/application_1522978297921_0003/
607861 [PigTezLauncher-0] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJob  - Submitting DAG PigLatin:DefaultJobName-0_scope-2
18/04/06 01:46:10 INFO tez.TezJob: Submitting DAG PigLatin:DefaultJobName-0_scope-2
18/04/06 01:46:10 INFO client.TezClient: Submitting dag to TezSession, sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003, dagName=PigLatin:DefaultJobName-0_scope-2, callerContext= context=PIG, callerType=PIG_SCRIPT_ID, callerId=PIG-default-d73e19dc-5287-4ee2-a85d-e931327011dc 
18/04/06 01:46:10 INFO client.TezClient: Submitted dag to TezSession, sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003, dagName=PigLatin:DefaultJobName-0_scope-2
18/04/06 01:46:10 INFO client.RMProxy: Connecting to ResourceManager at ip-172-31-28-12.ec2.internal/172.31.28.12:8032
608409 [PigTezLauncher-0] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJob  - Submitted DAG PigLatin:DefaultJobName-0_scope-2. Application id: application_1522978297921_0003
18/04/06 01:46:10 INFO tez.TezJob: Submitted DAG PigLatin:DefaultJobName-0_scope-2. Application id: application_1522978297921_0003
608528 [main] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher  - HadoopJobId: job_1522978297921_0003
18/04/06 01:46:11 INFO tez.TezLauncher: HadoopJobId: job_1522978297921_0003
609410 [Timer-1] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJob  - DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 0 Failed: 0 Killed: 0, diagnostics=, counters=null
18/04/06 01:46:11 INFO tez.TezJob: DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 0 Failed: 0 Killed: 0, diagnostics=, counters=null
629410 [Timer-1] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezJob  - DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 1 Failed: 0 Killed: 0, diagnostics=, counters=null
18/04/06 01:46:31 INFO tez.TezJob: DAG Status: status=RUNNING, progress=TotalTasks: 2 Succeeded: 0 Running: 1 Failed: 0 Killed: 0, diagnostics=, counters=null
646404 [pool-1-thread-1] INFO  org.apache.pig.backend.hadoop.executionengine.tez.TezSessionManager  - Shutting down Tez session org.apache.tez.client.TezClient@3a371843
18/04/06 01:46:48 INFO tez.TezSessionManager: Shutting down Tez session org.apache.tez.client.TezClient@3a371843
2018-04-06 01:46:48 Shutting down Tez session , sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003
18/04/06 01:46:48 INFO client.TezClient: Shutting down Tez Session, sessionName=PigLatin:DefaultJobName, applicationId=application_1522978297921_0003

How do I fix this error? The DUMP command works for every earlier relation, just not for roundto. Also, what exactly is the Tez client?

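【Comments】:

Please read Under what circumstances may I add "urgent" or other similar phrases to my question, in order to obtain faster answers? - the summary is that this is not an ideal way to address volunteers, and it may be counterproductive. Please do not add this to your questions.

【Answer 1】: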

I can't reproduce your output, because I get an error as soon as I try this line:

roundto = FOREACH sum_occ GENERATE sum_occ.ngram, ROUND_TO( sum_occ.ntry , 2 );

You don't need the dot operator to refer to these fields (e.g. sum_occ.ngram), because they are not nested inside a tuple or a bag. Try the line above without the dot operator:

roundto = FOREACH sum_occ GENERATE ngram, ROUND_TO( ntry , 2 );
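For context, here is a sketch of how the whole script from the question might look with that one line changed. It assumes the same placeholder path and schema as the question; the output alias ntry_rounded is a name added here for illustration and is not in the original post.

```pig
-- 'link to file' is the placeholder path from the question, not a real location.
inp = LOAD 'link to file' AS (ngram:chararray, year:int, occurences:float, books:float);

filter_input = FILTER inp BY (occurences >= 400) AND (books >= 8);

groupinp = GROUP filter_input BY ngram;

-- After GROUP, filter_input is a bag inside each record, so the dot
-- operator is needed here to project a column out of that bag.
sum_occ = FOREACH groupinp GENERATE FLATTEN(group) AS ngram,
          SUM(filter_input.occurences) / SUM(filter_input.books) AS ntry;

-- ngram and ntry are top-level fields of sum_occ, so they are referenced
-- directly. ntry_rounded is a hypothetical alias chosen for this sketch.
roundto = FOREACH sum_occ GENERATE ngram, ROUND_TO(ntry, 2) AS ntry_rounded;

DUMP roundto;
```

The dot operator stays in the sum_occ statement because filter_input is a bag at that point; it is dropped only in roundto, where the fields are no longer nested.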

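To answer your second question: MapReduce and Tez are both frameworks that can be used to run Pig scripts. Tez can sometimes speed up the run time of a Pig script. You can explicitly choose MapReduce or Tez by starting the Pig shell with pig -x mapreduce or pig -x tez. MapReduce is the default, so if you did not specify Tez yourself, your Hadoop cluster must be configured to run Pig on Tez.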
