Error while loading a Hive table using HCatalog in Pig
Posted: 2017-02-14 13:45:30

Question: I am trying to load my Hive table in Pig using HCatalog, for which I wrote the code below, but I am running into an error. I am using pig -useHCatalog to open my Pig shell.
Code:
A = LOAD 'patient_info' USING org.apache.hive.hcatalog.pig.HCatLoader();
Error:
ERROR hive.ql.metadata.Table - Unable to get field from serde: com.ibm.spss.hive.serde2.xml.XmlSerDe
java.lang.RuntimeException: MetaException(message: java.lang.ClassNotFoundException Class com.ibm.spss.hive.serde2.xml.XmlSerDe not found)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:275)
    at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:255)
    at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:602)
    at org.apache.hive.hcatalog.common.HCatUtil.getTableSchemaWithPtnCols(HCatUtil.java:184)
    at org.apache.hive.hcatalog.pig.HCatLoader.getSchema(HCatLoader.java:216)
    at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:175)
    at org.apache.pig.newplan.logical.relational.LOLoad.<init>(LOLoad.java:89)
    at org.apache.pig.parser.LogicalPlanBuilder.buildLoadOp(LogicalPlanBuilder.java:866)
    at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3568)
    at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
    at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
    at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
    at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1688)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1635)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:587)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
    at org.apache.pig.Main.run(Main.java:547)
    at org.apache.pig.Main.main(Main.java:158)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: MetaException(message: java.lang.ClassNotFoundException Class com.ibm.spss.hive.serde2.xml.XmlSerDe not found)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:400)
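The root cause in the trace is the ClassNotFoundException: the process that resolves the table schema cannot see the hivexmlserde jar. A common first check is to make the SerDe jar visible to the Pig session before the LOAD; this is a sketch using the jar path shown later in the question, which may differ on your system:

```
-- Sketch: register the XML SerDe jar in the Grunt shell before loading.
REGISTER /home/cloudera/hivexmlserde-1.0.5.3.jar;
A = LOAD 'patient_info' USING org.apache.hive.hcatalog.pig.HCatLoader();
```

Equivalently, the jar can be supplied when starting Pig, e.g. `pig -useHCatalog -Dpig.additional.jars=/home/cloudera/hivexmlserde-1.0.5.3.jar`.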
Update:
The commands I used to store the data in Hive are below.
add jar /home/cloudera/hivexmlserde-1.0.5.3.jar;
CREATE EXTERNAL TABLE patient_info (
statusCode string,
title string,
startTime string,
endTime string,
frequencyValue string,
frequencyUnits string
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.statusCode"="medicationsInfo/entryInfo/statusCode/text()",
"column.xpath.title"="medications/code/code/text()",
"column.xpath.startTime"="medications/xxx/startTime/text()",
"column.xpath.endTime"="medications/xxx/endTime/text()",
"column.xpath.frequencyValue"="medications/xxx/frequencyValue/text()",
"column.xpath.frequencyUnits"="medications/xxx/frequencyUnits/text()",
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<medicationsInfo",
"xmlinput.end"="</medicationsInfo>");
load data inpath '/user/cloudera/xml' into table patient_info ;
Sample:
<Document>
<ProductCode>
<code>10160-0</code>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20110729</startTime>
<endTime>20110822</endTime>
<strengthValue>24</strengthValue>
<strengthUnits>h</strengthUnits>
</entryInfo>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20120130</startTime>
<endTime>20120326</endTime>
<strengthValue>12</strengthValue>
<strengthUnits>h</strengthUnits>
</entryInfo>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20100412</startTime>
<endTime>20110822</endTime>
<strengthValue>8</strengthValue>
<strengthUnits>d</strengthUnits>
</entryInfo>
</ProductCode>
<ProductCode>
<code>10160-0</code>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20110729</startTime>
<endTime>20110822</endTime>
<strengthValue>24</strengthValue>
<strengthUnits>h</strengthUnits>
</entryInfo>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20120130</startTime>
<endTime>20120326</endTime>
<strengthValue>12</strengthValue>
<strengthUnits>h</strengthUnits>
</entryInfo>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20100412</startTime>
<endTime>20110822</endTime>
<strengthValue>8</strengthValue>
<strengthUnits>d</strengthUnits>
</entryInfo>
</ProductCode>
<Medicationsinfo>
<code>10160-0</code>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20110729</startTime>
<endTime>20110822</endTime>
<strengthValue>24</strengthValue>
<strengthUnits>h</strengthUnits>
</entryInfo>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20120130</startTime>
<endTime>20120326</endTime>
<strengthValue>12</strengthValue>
<strengthUnits>h</strengthUnits>
</entryInfo>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20100412</startTime>
<endTime>20110822</endTime>
<strengthValue>8</strengthValue>
<strengthUnits>d</strengthUnits>
</entryInfo>
</Medicationsinfo>
<Medicationsinfo>
<code>10160-0</code>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20110729</startTime>
<endTime>20110822</endTime>
<strengthValue>24</strengthValue>
<strengthUnits>h</strengthUnits>
</entryInfo>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20120130</startTime>
<endTime>20120326</endTime>
<strengthValue>12</strengthValue>
<strengthUnits>h</strengthUnits>
</entryInfo>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20100412</startTime>
<endTime>20110822</endTime>
<strengthValue>8</strengthValue>
<strengthUnits>d</strengthUnits>
</entryInfo>
</Medicationsinfo>
</Document>
Comments:

- It looks like a Hive error; try selecting a few rows from the table in Hive to verify.
- I tried fetching the data from Hive and I was able to get it.
- Could you add the table definition to your question?
- It seems the XmlSerDe is not known to Pig. Care to share how you stored the data?
- @DuduMarkovitz I used Pig to parse the files in HDFS, and then stored the parsed data in Hive with a CREATE command.

Answer 1:

The definition of your external table is not valid. Here are a few options:
Option 1
create external table patient_info
(
code string
,entryInfo string
)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties
(
"column.xpath.code" = "/Medicationsinfo/code/text()"
,"column.xpath.entryInfo" = "/Medicationsinfo/entryInfo"
)
stored as
inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location '/user/hive/warehouse/patient_info'
tblproperties
(
"xmlinput.start" = "<Medicationsinfo"
,"xmlinput.end" = "</Medicationsinfo>"
)
;
select * from patient_info
;
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| patient_info.code | patient_info.entryinfo |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 10160-0 | <string><entryInfo><statusCode>completed</statusCode><startTime>20110729</startTime><endTime>20110822</endTime><strengthValue>24</strengthValue><strengthUnits>h</strengthUnits></entryInfo><entryInfo><statusCode>completed</statusCode><startTime>20120130</startTime><endTime>20120326</endTime><strengthValue>12</strengthValue><strengthUnits>h</strengthUnits></entryInfo><entryInfo><statusCode>completed</statusCode><startTime>20100412</startTime><endTime>20110822</endTime><strengthValue>8</strengthValue><strengthUnits>d</strengthUnits></entryInfo></string> |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 10160-0 | <string><entryInfo><statusCode>completed</statusCode><startTime>20110729</startTime><endTime>20110822</endTime><strengthValue>24</strengthValue><strengthUnits>h</strengthUnits></entryInfo><entryInfo><statusCode>completed</statusCode><startTime>20120130</startTime><endTime>20120326</endTime><strengthValue>12</strengthValue><strengthUnits>h</strengthUnits></entryInfo><entryInfo><statusCode>completed</statusCode><startTime>20100412</startTime><endTime>20110822</endTime><strengthValue>8</strengthValue><strengthUnits>d</strengthUnits></entryInfo></string> |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
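With option 1 the entryInfo column is raw XML text, so individual fields can still be pulled out later, for example with Hive's built-in xpath UDFs. This is an illustrative sketch, not part of the original answer; the XPath expressions assume the `<string>` wrapper visible in the output above:

```sql
-- Hypothetical follow-up: extract the first entry's fields from the XML string.
select code
      ,xpath_string(entryInfo, 'string/entryInfo[1]/startTime')  as start_time
      ,xpath_string(entryInfo, 'string/entryInfo[1]/statusCode') as status
from patient_info;
```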
Option 2
create external table patient_info
(
code string
,entryInfo array<map<string,map<string,string>>>
)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties
(
"column.xpath.code" = "/Medicationsinfo/code/text()"
,"column.xpath.entryInfo" = "/Medicationsinfo/entryInfo"
)
stored as
inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location '/user/hive/warehouse/patient_info'
tblproperties
(
"xmlinput.start" = "<Medicationsinfo"
,"xmlinput.end" = "</Medicationsinfo>"
)
;
select * from patient_info
;
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| patient_info.code | patient_info.entryinfo |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 10160-0 | [{"entryInfo":{"statusCode":"completed","startTime":"20110729","strengthUnits":"h","endTime":"20110822","strengthValue":"24"}},{"entryInfo":{"statusCode":"completed","startTime":"20120130","strengthUnits":"h","endTime":"20120326","strengthValue":"12"}},{"entryInfo":{"statusCode":"completed","startTime":"20100412","strengthUnits":"d","endTime":"20110822","strengthValue":"8"}}] |
| 10160-0 | [{"entryInfo":{"statusCode":"completed","startTime":"20110729","strengthUnits":"h","endTime":"20110822","strengthValue":"24"}},{"entryInfo":{"statusCode":"completed","startTime":"20120130","strengthUnits":"h","endTime":"20120326","strengthValue":"12"}},{"entryInfo":{"statusCode":"completed","startTime":"20100412","strengthUnits":"d","endTime":"20110822","strengthValue":"8"}}] |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
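Since option 2 types entryInfo as array<map<string,map<string,string>>>, individual values can be addressed directly by array index and map key. A sketch (not from the original answer):

```sql
-- Hypothetical query: the first entry's startTime and statusCode via index/key access.
select code
      ,entryInfo[0]['entryInfo']['startTime']  as first_start
      ,entryInfo[0]['entryInfo']['statusCode'] as first_status
from patient_info;
```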
Option 3
create external table patient_info
(
code string
,entryInfo array<map<string,struct<statusCode:string,startTime:string,endTime:string,strengthValue:int,strengthUnits:string>>>
)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties
(
"column.xpath.code" = "/Medicationsinfo/code/text()"
,"column.xpath.entryInfo" = "/Medicationsinfo/entryInfo"
)
stored as
inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location '/user/hive/warehouse/patient_info'
tblproperties
(
"xmlinput.start" = "<Medicationsinfo"
,"xmlinput.end" = "</Medicationsinfo>"
)
;
select * from patient_info
;
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| patient_info.code | patient_info.entryinfo |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 10160-0 | [{"entryInfo":{"statuscode":"completed","starttime":"20110729","endtime":"20110822","strengthvalue":24,"strengthunits":"h"}},{"entryInfo":{"statuscode":"completed","starttime":"20120130","endtime":"20120326","strengthvalue":12,"strengthunits":"h"}},{"entryInfo":{"statuscode":"completed","starttime":"20100412","endtime":"20110822","strengthvalue":8,"strengthunits":"d"}}] |
| 10160-0 | [{"entryInfo":{"statuscode":"completed","starttime":"20110729","endtime":"20110822","strengthvalue":24,"strengthunits":"h"}},{"entryInfo":{"statuscode":"completed","starttime":"20120130","endtime":"20120326","strengthvalue":12,"strengthunits":"h"}},{"entryInfo":{"statuscode":"completed","starttime":"20100412","endtime":"20110822","strengthvalue":8,"strengthunits":"d"}}] |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Exploding option 3
select pi.code
,ei.i + 1 as i
,ei.entryInfo["entryInfo"].statusCode
,ei.entryInfo["entryInfo"].startTime
,ei.entryInfo["entryInfo"].endTime
,ei.entryInfo["entryInfo"].strengthValue
,ei.entryInfo["entryInfo"].strengthUnits
from patient_info pi
lateral view posexplode (entryInfo) ei as i,entryInfo
;
+---------+---+------------+-----------+----------+---------------+---------------+
| pi.code | i | statuscode | starttime | endtime | strengthvalue | strengthunits |
+---------+---+------------+-----------+----------+---------------+---------------+
| 10160-0 | 1 | completed | 20110729 | 20110822 | 24 | h |
+---------+---+------------+-----------+----------+---------------+---------------+
| 10160-0 | 2 | completed | 20120130 | 20120326 | 12 | h |
+---------+---+------------+-----------+----------+---------------+---------------+
| 10160-0 | 3 | completed | 20100412 | 20110822 | 8 | d |
+---------+---+------------+-----------+----------+---------------+---------------+
| 10160-0 | 1 | completed | 20110729 | 20110822 | 24 | h |
+---------+---+------------+-----------+----------+---------------+---------------+
| 10160-0 | 2 | completed | 20120130 | 20120326 | 12 | h |
+---------+---+------------+-----------+----------+---------------+---------------+
| 10160-0 | 3 | completed | 20100412 | 20110822 | 8 | d |
+---------+---+------------+-----------+----------+---------------+---------------+
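Once exploded, a single entry can be picked out by simply filtering the result; a sketch along the same lines as the query above:

```sql
-- Hypothetical filter: keep only entries with strengthValue = 24.
select pi.code
      ,ei.entryInfo["entryInfo"].startTime
      ,ei.entryInfo["entryInfo"].strengthValue
from patient_info pi
lateral view posexplode (entryInfo) ei as i,entryInfo
where ei.entryInfo["entryInfo"].strengthValue = 24
;
```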
Comments:
- I need the result in column format, but in these cases I get everything in a single row. How can I select the specific strengthValue I need? It would be great if you could help me with this.
- Thank you, I am waiting for it. I accept your answer; you have helped me a lot.
- Can you explain why you used ei.i + 1?
- posexplode starts at 0; I thought a sequence of 1,2,3,... would be easier to read than 0,1,2,...
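The zero-based index is easy to see with a minimal query (illustrative only, not part of the original thread):

```sql
-- posexplode emits positions 0,1,2,... alongside the array elements.
select i, x
from (select array('a','b','c') as arr) t
lateral view posexplode(arr) ei as i, x;
-- rows: (0,'a'), (1,'b'), (2,'c') -- hence the "+ 1" for 1-based display
```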
- Got it, thanks for the quick reply.