Import JSON to HBase using a Pig script

Posted: 2015-08-24 11:35:56

Question:

I am trying to write a Pig script that lets me load JSON (pulled from Elasticsearch and dumped to HDFS) into HBase.

I have been struggling with this for days now; maybe someone can give me some insight into the problem I am running into.

Here is a quick Pig script I wrote that reads data from HBase, modifies it trivially, and stores it back into HBase (just to make sure everything works):

REGISTER hbase-common-1.1.1.jar
REGISTER /tmp/udfs/json-simple-1.1.1.jar
REGISTER /tmp/udfs/elephant-bird-hadoop-compat-4.9.jar
REGISTER /tmp/udfs/elephant-bird-pig-4.9.jar
REGISTER /user/hdfs/share/libs/guava-11.0.jar
REGISTER /user/hdfs/share/libs/zookeeper-3.4.6.2.2.4.2-2.jar

set hbase.zookeeper.quorum 'list of servers';    

raw = LOAD 'hbase://esimporttest' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('esinfo:a', '-loadKey true -limit 5') AS (id:bytearray, a:chararray);
keys = FOREACH raw GENERATE id, CONCAT(a, '1');

keys = LIMIT keys 1;

STORE keys INTO 'hbase://esimporttest' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('esinfo:id esinfo:a');    

Running this script works fine: the data is read from HBase and stored back into HBase.

I then tried to modify the script to load the data from a JSON file instead of from HBase:

REGISTER hbase-common-1.1.1.jar
REGISTER /tmp/udfs/json-simple-1.1.1.jar
REGISTER /tmp/udfs/elephant-bird-hadoop-compat-4.9.jar
REGISTER /tmp/udfs/elephant-bird-pig-4.9.jar
REGISTER /user/hdfs/share/libs/guava-11.0.jar
REGISTER /user/hdfs/share/libs/zookeeper-3.4.6.2.2.4.2-2.jar

set hbase.zookeeper.quorum 'list of servers';

raw_data = LOAD '/user/hdfs/input/EsImports/2014-04-22.json' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]); 
keys = FOREACH raw_data GENERATE
    json#'sid' as id:bytearray,
    json#'userAgent' as a:chararray;

limit_keys = LIMIT keys 1;

STORE limit_keys INTO 'hbase://esimporttest' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('esinfo:id esinfo:a');

This is the script that fails. My feeling is that it has something to do with the schema of the data being loaded, but when I DESCRIBE and DUMP the data from both pipelines, they appear to have exactly the same structure.
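(For reference, this is the kind of check meant here, using the "keys" alias from the scripts above:)

DESCRIBE keys;  -- prints the inferred schema, e.g. keys: {id: bytearray, a: chararray}
DUMP keys;      -- prints the actual tuples so the values can be compared by eye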

In addition, the error I get when the script fails is the following:

ERROR 2244: Job job_1439978375936_0215 failed, hadoop does not return any error message

Full error log:

Log Type: syslog
Log Upload Time: Mon Aug 24 13:28:43 +0200 2015
Log Length: 4121
2015-08-24 13:28:35,504 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for application appattempt_1439978375936_0238_000001
2015-08-24 13:28:35,910 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-08-24 13:28:35,921 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Executing with tokens:
2015-08-24 13:28:35,921 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: YARN_AM_RM_TOKEN, Service: , Ident: (appAttemptId  application_id  id: 238 cluster_timestamp: 1439978375936  attemptId: 1  keyId: 176959833)
2015-08-24 13:28:36,056 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: mapreduce.job, Service: job_1439978375936_0236, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@331fef77)
2015-08-24 13:28:36,057 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Kind: RM_DELEGATION_TOKEN, Service: ip removed, Ident: (owner=darryn, renewer=mr token, realUser=hcat, issueDate=1440415651774, maxDate=1441020451774, sequenceNumber=176, masterKeyId=149)
2015-08-24 13:28:36,070 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Using mapred newApiCommitter.
2015-08-24 13:28:36,699 WARN [main] org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2015-08-24 13:28:36,804 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: OutputCommitter set in config null
2015-08-24 13:28:36,950 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/mapreduce/TableInputFormat
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:270)
    at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:657)
    at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:726)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.getStoreFunc(POStore.java:251)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.getCommitters(PigOutputCommitter.java:88)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.<init>(PigOutputCommitter.java:71)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getOutputCommitter(PigOutputFormat.java:289)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.call(MRAppMaster.java:470)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.call(MRAppMaster.java:452)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1541)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:452)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:371)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1499)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1496)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1429)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.TableInputFormat
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 20 more
2015-08-24 13:28:36,954 INFO [main] org.apache.hadoop.util.ExitUtil: Exiting with status 1
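(An aside on this stack trace: the missing class, org.apache.hadoop.hbase.mapreduce.TableInputFormat, is packaged in the hbase-server jar on HBase 1.x, not in the hbase-common jar the scripts register. One plausible reading of why the HBase-to-HBase script still worked is that HBaseStorage on the load side ships the HBase jars with the job, while a store-only job does not. A fix worth trying before the workaround below is to register the missing jars; the paths here are assumptions and must be adjusted to the actual install:)

-- Hypothetical paths; only hbase-common is registered in the scripts above.
REGISTER /tmp/udfs/hbase-client-1.1.1.jar
REGISTER /tmp/udfs/hbase-server-1.1.1.jar
REGISTER /tmp/udfs/hbase-protocol-1.1.1.jar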

Edit:

I have since noticed some interesting behaviour: if I store the data held in the alias to a file using PigStorage with the -schema option, and then in a separate script load that file back up (still using PigStorage), I can insert straight into HBase. This makes me suspect the problem is related to how the schema is stored.

Answer 1:

The solution I ended up using is by no means optimal, but it works well.

What you have to do, after reading the data from the JSON file and generating the schema, is store it back out to a file with PigStorage and then load that file back in:

fs -rm -r /tmp/estest2
STORE test INTO '/tmp/estest2' USING PigStorage('\t', '-schema');

processed_data = LOAD '/tmp/estest2' USING PigStorage('\t');

EXEC; -- used to sync the script, forcing everything up to this point to finish

My suspicion is that the types the elephant-bird JsonLoader produces are misinterpreted by HBaseStorage, whereas HBaseStorage does understand PigStorage's types, which is what allows the data to be loaded into HBase.
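(If that suspicion is correct, an untested alternative to the file round-trip might be to cast the map values explicitly in the FOREACH, so that HBaseStorage only ever sees plain Pig types; this is a sketch, not something verified in this answer:)

keys = FOREACH raw_data GENERATE
    (bytearray) json#'sid'       AS id,
    (chararray) json#'userAgent' AS a;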

While doing this I also discovered a couple of other things: you need an 'id' field in the data alias, but you must not list that field in the column arguments passed to HBaseStorage (it is consumed as the row key).

A simplified working script using this solution looks like this:

REGISTER hbase-common-1.1.1.jar
REGISTER /tmp/udfs/json-simple-1.1.1.jar
REGISTER /tmp/udfs/elephant-bird-hadoop-compat-4.9.jar
REGISTER /tmp/udfs/elephant-bird-pig-4.9.jar
REGISTER /user/hdfs/share/libs/guava-11.0.jar
REGISTER /user/hdfs/share/libs/zookeeper-3.4.6.2.2.4.2-2.jar

set hbase.zookeeper.quorum 'list of servers';

raw_data = LOAD '/user/hdfs/input/EsImports/2014-04-22.json' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') as (json:map[]); 
keys = FOREACH raw_data GENERATE
    json#'sid' as id:bytearray, -- the id field is not listed in the HBaseStorage column spec, but it is used as the row key
    json#'userAgent' as a:chararray;

limit_keys = LIMIT keys 1;

-- This is super hacky but it works (note: the fs -rm fails if the directory does not exist)
fs -rm -r /tmp/estest2
STORE limit_keys INTO '/tmp/estest2' USING PigStorage('\t', '-schema');
processed_data = LOAD '/tmp/estest2' USING PigStorage('\t'); 

EXEC; -- used to sync the script: the STORE above must finish before the HBase insert starts

STORE processed_data INTO 'hbase://esimporttest' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('esinfo:a');
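(Why the round-trip helps at all: PigStorage's '-schema' option writes a hidden .pig_schema file alongside the data, and the subsequent LOAD through PigStorage picks that file up automatically, so processed_data comes back with a clean, PigStorage-native schema that HBaseStorage accepts without an AS clause.)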
