Importing csv into HBase using Pig


Posted: 2015-06-11 20:42:34

Question:

I want to import the following sample data (tab-separated) into HBase using Pig:

1       2       3
4       5       6
7       8       9

and I am using the following commands to do so.

grunt> A = LOAD '/idn/home/mvenk9/Test' USING PigStorage('\t') as (id:int, id1:int, id2:int);

 STORE A INTO 'hbase://mydata' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('mycf:intdata');

While executing the second statement I get the exception below. I don't know why this isn't working, and I'm new to all of these tools.

2015-06-11 13:34:37,125 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2015-06-11 13:34:37,126 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]
2015-06-11 13:34:37,442 [main] INFO  org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper - The identifier of this process is 29965@lppbd0030.gso.aexp.com
2015-06-11 13:34:37,554 [main] INFO  org.apache.hadoop.hbase.mapreduce.TableOutputFormat - Created table instance for mydata
2015-06-11 13:34:37,557 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2015-06-11 13:34:37,559 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2015-06-11 13:34:37,559 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2015-06-11 13:34:37,561 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2015-06-11 13:34:37,562 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2015-06-11 13:34:37,563 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job2235913801538823778.jar
2015-06-11 13:34:40,868 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job2235913801538823778.jar created
2015-06-11 13:34:40,882 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2015-06-11 13:34:40,885 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2017: Internal error creating job configuration.
Details at logfile: /idn/home/mvenk9/pig_1434054848332.log

From the log file:

Pig Stack Trace
---------------
ERROR 2017: Internal error creating job configuration.

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias A
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1635)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:575)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
        at org.apache.pig.Main.run(Main.java:541)
        at org.apache.pig.Main.main(Main.java:156)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:197)
Caused by: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobCreationException: ERROR 2017: Internal error creating job configuration.
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:861)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:296)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:192)
        at org.apache.pig.PigServer.launchPlan(PigServer.java:1322)
        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1307)
        at org.apache.pig.PigServer.execute(PigServer.java:1297)
        at org.apache.pig.PigServer.access$400(PigServer.java:122)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1630)
        ... 13 more
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hbase://mydata_logs
        at org.apache.hadoop.fs.Path.initialize(Path.java:155)
        at org.apache.hadoop.fs.Path.<init>(Path.java:74)
        at org.apache.hadoop.fs.Path.<init>(Path.java:48)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:613)
        ... 20 more
Caused by: java.net.URISyntaxException: Relative path in absolute URI: hbase://mydata_logs
        at java.net.URI.checkPath(URI.java:1804)
        at java.net.URI.<init>(URI.java:752)
        at org.apache.hadoop.fs.Path.initialize(Path.java:152)
        ... 23 more
================================================================================

Any help would be greatly appreciated.

Thanks in advance.

Question comments:

Answer 1:

Add the individual HBase column names as arguments to HBaseStorage. You have only given it a single cell, mycf:intdata. See the examples here and here.
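A minimal sketch of what this answer suggests, assuming the table mydata with column family mycf already exists. Note that HBaseStorage treats the first field of each tuple as the row key, so you list one column only for each of the remaining fields:

```
-- create the table first in the hbase shell (names match the question):
--   create 'mydata', 'mycf'

A = LOAD '/idn/home/mvenk9/Test' USING PigStorage('\t')
    AS (id:int, id1:int, id2:int);

-- 'id' becomes the row key; map the other two fields to columns:
STORE A INTO 'hbase://mydata'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('mycf:id1 mycf:id2');
```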

Comments:

Thank you. I had already referred to those links and tried it with all the columns, and got the same exception. The updated statement is: STORE A INTO 'hbase://mydata' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('mycf:id mycf:id1 mycf:id2');

Are you running locally or on a cluster? Please add details about how you are running it.

Hi, I am running these on a cluster. After logging in to the cluster with PuTTY, I start Pig with the pig command, and once inside Pig I execute the statements above. After the first statement I can see the results in A using DUMP A, but when I execute the second statement I get the error.
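For the `Relative path in absolute URI: hbase://mydata_logs` error itself, one workaround sometimes suggested for older Pig versions (not confirmed in this thread, so treat it as an assumption) is to drop the `hbase://` scheme: HBaseStorage also accepts a bare table name, which avoids the store location being parsed as a filesystem Path when Pig builds job-side paths like `mydata_logs`:

```
-- same store, but with a plain table name instead of an hbase:// URI
STORE A INTO 'mydata'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('mycf:id1 mycf:id2');
```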
