Loading Data from Hive to Pig Error while Dumping DataSet

Posted 2023-04-13

【Title】: Loading Data from Hive to Pig Error while Dumping DataSet
【Posted】: 2018-04-15 11:01:16
【Question】:

The Hive table retail_db.categories has 58 rows.

$ pig -useHCatalog
grunt> pcategories = LOAD 'retail_db.categories' USING org.apache.hive.hcatalog.pig.HCatLoader();
grunt> b = LIMIT pcategories 100;
grunt> DUMP b;

This returns all the records. But when I try to dump the original dataset:

grunt> DUMP pcategories;

I get the following error:

2018-04-15 16:27:46,444 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2018-04-15 16:27:46,723 [main] INFO org.apache.hadoop.hive.metastore.ObjectStore - ObjectStore, initialize called
2018-04-15 16:27:47,170 [main] INFO org.apache.hadoop.hive.metastore.MetaStoreDirectSql - Using direct SQL, underlying DB is MYSQL
2018-04-15 16:27:47,171 [main] INFO org.apache.hadoop.hive.metastore.ObjectStore - Initialized ObjectStore
2018-04-15 16:27:47,171 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore - 0: get_databases: NonExistentDatabaseUsedForHealthCheck
2018-04-15 16:27:47,171 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit - ugi=jay ip=unknown-ip-addr cmd=get_databases: NonExistentDatabaseUsedForHealthCheck
2018-04-15 16:27:47,184 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore - 0: get_table : db=retail_db tbl=categories
2018-04-15 16:27:47,184 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit - ugi=jay ip=unknown-ip-addr cmd=get_table : db=retail_db tbl=categories
2018-04-15 16:27:47,219 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2018-04-15 16:27:47,244 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore - 0: get_databases: NonExistentDatabaseUsedForHealthCheck
2018-04-15 16:27:47,244 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit - ugi=jay ip=unknown-ip-addr cmd=get_databases: NonExistentDatabaseUsedForHealthCheck
2018-04-15 16:27:47,247 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore - 0: get_table : db=retail_db tbl=departments
2018-04-15 16:27:47,247 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit - ugi=jay ip=unknown-ip-addr cmd=get_table : db=retail_db tbl=departments
2018-04-15 16:27:47,261 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2018-04-15 16:27:47,284 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore - 0: get_databases: NonExistentDatabaseUsedForHealthCheck
2018-04-15 16:27:47,284 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit - ugi=jay ip=unknown-ip-addr cmd=get_databases: NonExistentDatabaseUsedForHealthCheck
2018-04-15 16:27:47,286 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore - 0: get_table : db=retail_db tbl=categories
2018-04-15 16:27:47,286 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit - ugi=jay ip=unknown-ip-addr cmd=get_table : db=retail_db tbl=categories
2018-04-15 16:27:47,386 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2018-04-15 16:27:47,388 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2018-04-15 16:27:47,397 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2018-04-15 16:27:47,397 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2018-04-15 16:27:47,397 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, ConstantCalculator, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NestedLimitOptimizer, PartitionFilterOptimizer, PredicatePushdownOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter]
2018-04-15 16:27:47,398 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2018-04-15 16:27:47,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2018-04-15 16:27:47,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2018-04-15 16:27:47,406 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2018-04-15 16:27:47,407 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2018-04-15 16:27:47,409 [main] INFO org.apache.pig.tools.pigstats.mapreduce.MRScriptState - Pig script settings are added to the job
2018-04-15 16:27:47,409 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2018-04-15 16:27:47,435 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore - 0: get_databases: NonExistentDatabaseUsedForHealthCheck
2018-04-15 16:27:47,435 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit - ugi=jay ip=unknown-ip-addr cmd=get_databases: NonExistentDatabaseUsedForHealthCheck
2018-04-15 16:27:47,437 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore - 0: get_table : db=retail_db tbl=categories
2018-04-15 16:27:47,437 [main] INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit - ugi=jay ip=unknown-ip-addr cmd=get_table : db=retail_db tbl=categories
2018-04-15 16:27:47,458 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2018-04-15 16:27:47,458 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - This job cannot be converted run in-process
2018-04-15 16:27:48,419 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/apache-hive-2.3.2-bin/lib/hive-metastore-2.3.2.jar to DistributedCache through /tmp/temp-1113251818/tmp122824794/hive-metastore-2.3.2.jar
2018-04-15 16:27:48,608 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/apache-hive-2.3.2-bin/lib/libthrift-0.9.3.jar to DistributedCache through /tmp/temp-1113251818/tmp1608619006/libthrift-0.9.3.jar
2018-04-15 16:27:49,708 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/apache-hive-2.3.2-bin/lib/hive-exec-2.3.2.jar to DistributedCache through /tmp/temp-1113251818/tmp1023486409/hive-exec-2.3.2.jar
2018-04-15 16:27:50,352 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/apache-hive-2.3.2-bin/lib/libfb303-0.9.3.jar to DistributedCache through /tmp/temp-1113251818/tmp-207303388/libfb303-0.9.3.jar
2018-04-15 16:27:51,375 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/apache-hive-2.3.2-bin/lib/jdo-api-3.0.1.jar to DistributedCache through /tmp/temp-1113251818/tmp120570913/jdo-api-3.0.1.jar
2018-04-15 16:27:51,497 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/apache-hive-2.3.2-bin/lib/slf4j-api-1.7.25.jar to DistributedCache through /tmp/temp-1113251818/tmp1251741235/slf4j-api-1.7.25.jar
2018-04-15 16:27:51,786 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/apache-hive-2.3.2-bin/lib/hive-hbase-handler-2.3.2.jar to DistributedCache through /tmp/temp-1113251818/tmp1351750668/hive-hbase-handler-2.3.2.jar
2018-04-15 16:27:52,653 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.17.0/pig-0.17.0-core-h2.jar to DistributedCache through /tmp/temp-1113251818/tmp1548980484/pig-0.17.0-core-h2.jar
2018-04-15 16:27:53,042 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/apache-hive-2.3.2-bin/hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-2.3.2.jar to DistributedCache through /tmp/temp-1113251818/tmp-2078279932/hive-hcatalog-pig-adapter-2.3.2.jar
2018-04-15 16:27:53,197 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/pig-0.17.0/lib/automaton-1.11-8.jar to DistributedCache through /tmp/temp-1113251818/tmp1231439146/automaton-1.11-8.jar
2018-04-15 16:27:53,875 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Added jar file:/usr/local/apache-hive-2.3.2-bin/lib/antlr-runtime-3.5.2.jar to DistributedCache through /tmp/temp-1113251818/tmp970518288/antlr-runtime-3.5.2.jar
2018-04-15 16:27:53,900 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2018-04-15 16:27:53,920 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2018-04-15 16:27:53,922 [JobControl] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2018-04-15 16:27:54,152 [JobControl] INFO org.apache.hadoop.mapreduce.JobResourceUploader - Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/jay/.staging/job_1523787662857_0004
2018-04-15 16:27:54,197 [JobControl] WARN org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2018-04-15 16:27:54,232 [JobControl] INFO org.apache.hadoop.mapred.FileInputFormat - Total input files to process : 1
2018-04-15 16:27:54,232 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2018-04-15 16:27:54,631 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2018-04-15 16:27:55,247 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1523787662857_0004
2018-04-15 16:27:55,247 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Executing with tokens: []
2018-04-15 16:27:55,253 [JobControl] INFO org.apache.hadoop.mapred.YARNRunner - Job jar is not present. Not adding any jar to the list of resources.
2018-04-15 16:27:55,503 [JobControl] INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1523787662857_0004
2018-04-15 16:27:55,733 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://jay-Lenovo-Z50-70:8088/proxy/application_1523787662857_0004/
2018-04-15 16:27:55,733 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1523787662857_0004
2018-04-15 16:27:55,733 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases pcategories
2018-04-15 16:27:55,733 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: pcategories[3,14] C: R:
2018-04-15 16:27:55,877 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2018-04-15 16:27:55,877 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1523787662857_0004]
2018-04-15 16:28:27,422 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2018-04-15 16:28:27,422 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_1523787662857_0004 has failed! Stop running all dependent jobs
2018-04-15 16:28:27,422 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2018-04-15 16:28:27,424 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2018-04-15 16:28:27,580 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2018-04-15 16:28:27,827 [main] ERROR org.apache.pig.tools.pigstats.mapreduce.MRPigStatsUtil - 1 map reduce job(s) failed!
2018-04-15 16:28:27,827 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:

HadoopVersion PigVersion UserId StartedAt FinishedAt Features
3.0.0 0.17.0 jay 2018-04-15 16:27:47 2018-04-15 16:28:27 UNKNOWN

Failed!

Failed Jobs:
JobId Alias Feature Message Outputs
job_1523787662857_0004 pcategories MAP_ONLY Message: Job failed! hdfs://localhost:9000/tmp/temp-1113251818/tmp-83503168,

Input(s):
Failed to read data from "retail_db.categories"

Output(s):
Failed to produce result in "hdfs://localhost:9000/tmp/temp-1113251818/tmp-83503168"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1523787662857_0004

2018-04-15 16:28:27,828 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2018-04-15 16:28:27,836 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias pcategories
Details at logfile: /home/jay/pig_1523787729987.log

AM Container for appattempt_1523799060075_0001_000002 exited with exitCode: 1
Failing this attempt.Diagnostics: [2018-04-15 19:02:58.344]Exception from container-launch.
Container id: container_1523799060075_0001_02_000001
Exit code: 1
[2018-04-15 19:02:58.348]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
[2018-04-15 19:02:58.348]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
For more detailed output, check the application tracking page: http://jay-Lenovo-Z50-70:8088/cluster/app/application_1523799060075_0001 Then click on links to logs of each attempt.

The diagnostics above are what I get after clicking the tracking link.
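
The diagnostics shown here contain only log4j warnings, so the real cause is probably recorded in the container logs rather than on the tracking page. A minimal sketch of how one might locate them, assuming YARN log aggregation is enabled on the cluster (if it is not, the logs stay in the NodeManager's local log directory). `job_to_app` is a hypothetical helper introduced for illustration, not part of Hadoop; it relies on the fact that a MapReduce job ID and its YARN application ID share the same timestamp and sequence number.

```shell
#!/bin/sh
# Hypothetical helper: map a MapReduce job ID to its YARN application ID
# by swapping the "job_" prefix for "application_".
job_to_app() {
  printf '%s\n' "$1" | sed 's/^job_/application_/'
}

# Job ID taken from the Pig client log in the question.
app_id=$(job_to_app job_1523787662857_0004)

# Print the command to run on a host with the Hadoop client installed.
# "yarn logs -applicationId <id>" prints the aggregated container logs.
echo "yarn logs -applicationId $app_id"
```

The script only prints the command, so it can be run anywhere; the `yarn logs` call itself has to be executed against the cluster that ran the job.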

【Comments】:

The log says "The url to track the job: ...". Your actual error output should be available in YARN.

【Answer 1】:

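It works fine for me. I ran the following commands:

$ pig -useHCatalog
grunt> pcategories = LOAD 'hive_testing.address' USING org.apache.hive.hcatalog.pig.HCatLoader();
grunt> DUMP pcategories;

Here I created a dummy address table in my own database.

Output:

(101,India,xxx)

So the problem is probably with your dataset rather than with the commands you are running.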

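【Comments】:

Please read the question and the error message properly. I know my command is correct, but I don't know why my DUMP is not working.

Your question says it fails when dumping the actual dataset. That is what I did: I dumped the actual dataset and it worked fine. You could try another table and dump the original dataset.

I tried that and hit the same problem. I think it is caused by my configuration.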
