Nested Parsing in XPath using Pig
Posted: 2017-01-17 14:40:07
Question: I am trying to parse an XML file with nested tags using Pig. I have the following sample XML.
<Document>
<medicationsInfo>
<code>10160-0</code>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20110729</startTime>
<endTime>20110822</endTime>
<strengthValue>24</strengthValue>
<strengthUnits>h</strengthUnits>
</entryInfo>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20120130</startTime>
<endTime>20120326</endTime>
<strengthValue>12</strengthValue>
<strengthUnits>h</strengthUnits>
</entryInfo>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20100412</startTime>
<endTime>20110822</endTime>
<strengthValue>8</strengthValue>
<strengthUnits>d</strengthUnits>
</entryInfo>
</medicationsInfo>
<ProductInfo>
<code>10160-0</code>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20110729</startTime>
<endTime>20110822</endTime>
<strengthValue>24</strengthValue>
<strengthUnits>h</strengthUnits>
</entryInfo>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20120130</startTime>
<endTime>20120326</endTime>
<strengthValue>12</strengthValue>
<strengthUnits>h</strengthUnits>
</entryInfo>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20100412</startTime>
<endTime>20110822</endTime>
<strengthValue>8</strengthValue>
<strengthUnits>d</strengthUnits>
</entryInfo>
</ProductInfo>
</Document>
I wrote the following code to get the entryInfo results under medicationsInfo, but I am getting an error.
Code:
Register /home/cloudera/piggybank-0.16.0.jar;
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
A = LOAD '/home/cloudera/Parsed_CCD.xml' using org.apache.pig.piggybank.storage.XMLLoader('medicationsInfo/entryInfo') as (x:chararray);
B = FOREACH A GENERATE XPathAll(x, 'statusCode',false,true), XPathAll(x, 'medicationsInfo/code/code',false,true).$0, XPathAll(x,'strengthValue',false,true).$1;
DUMP B;
Error:
[main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias B
Expected output:
completed 20110729 20110822 24 h
completed 20120130 20120326 12 h
completed 20100412 20110822 8 d
Answer 1: The following code produces the expected output:
Register /home/cloudera/piggybank-0.16.0.jar;
DEFINE XPathAll org.apache.pig.piggybank.evaluation.xml.XPathAll();
--DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
A = LOAD '/home/cloudera/Parsed_CCD.xml'
using org.apache.pig.piggybank.storage.XMLLoader('medicationsInfo') as (x:chararray);
B = FOREACH A GENERATE
XPathAll(x, 'medicationsInfo/entryInfo/statusCode').$0,
XPathAll(x, 'medicationsInfo/entryInfo/startTime').$0,
XPathAll(x, 'medicationsInfo/entryInfo/endTime').$0,
XPathAll(x, 'medicationsInfo/entryInfo/strengthValue').$0,
XPathAll(x, 'medicationsInfo/entryInfo/strengthUnits').$0;
C = FOREACH A GENERATE
XPathAll(x, 'medicationsInfo/entryInfo/statusCode').$1,
XPathAll(x, 'medicationsInfo/entryInfo/startTime').$1,
XPathAll(x, 'medicationsInfo/entryInfo/endTime').$1,
XPathAll(x, 'medicationsInfo/entryInfo/strengthValue').$1,
XPathAll(x, 'medicationsInfo/entryInfo/strengthUnits').$1;
D = FOREACH A GENERATE
XPathAll(x, 'medicationsInfo/entryInfo/statusCode').$2,
XPathAll(x, 'medicationsInfo/entryInfo/startTime').$2,
XPathAll(x, 'medicationsInfo/entryInfo/endTime').$2,
XPathAll(x, 'medicationsInfo/entryInfo/strengthValue').$2,
XPathAll(x, 'medicationsInfo/entryInfo/strengthUnits').$2;
BCD = UNION B,C,D;
DUMP BCD;
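Comments:
What if I don't know how many times that section repeats? My actual file has multiple entryInfo elements inside medicationsInfo. How can I implement this logic there?
Then you have to write a Python or Java UDF to parse your XML data and then feed the result to Pig.
I think this problem is quite common. Can you suggest the best approach?
I am using A = LOAD '/home/cloudera/Parsed_CCD.xml' using org.apache.pig.piggybank.storage.XMLLoader('medicationsInfo/entryInfo') as (x:chararray). Can I load medicationsInfo first and then load entryInfo from the loaded medicationsInfo?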
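The Python route suggested in the comments could look roughly like the sketch below. It is only an illustration of the parsing step, not the answerer's code: the names entry_rows, FIELDS and SAMPLE are made up for this example, the sample is a shortened version of the XML from the question, and wiring the function into Pig (as a UDF or through streaming) is not shown. The point is that it emits one row per entryInfo, however many there are, instead of hardcoding $0, $1, $2.

```python
import xml.etree.ElementTree as ET

# Child tags to pull out of each <entryInfo>, in output order (taken from the question).
FIELDS = ("statusCode", "startTime", "endTime", "strengthValue", "strengthUnits")

# Shortened sample: two entries under medicationsInfo, one under ProductInfo.
SAMPLE = """<Document>
<medicationsInfo>
<code>10160-0</code>
<entryInfo><statusCode>completed</statusCode><startTime>20110729</startTime>
<endTime>20110822</endTime><strengthValue>24</strengthValue><strengthUnits>h</strengthUnits></entryInfo>
<entryInfo><statusCode>completed</statusCode><startTime>20120130</startTime>
<endTime>20120326</endTime><strengthValue>12</strengthValue><strengthUnits>h</strengthUnits></entryInfo>
</medicationsInfo>
<ProductInfo>
<code>10160-0</code>
<entryInfo><statusCode>completed</statusCode><startTime>20100412</startTime>
<endTime>20110822</endTime><strengthValue>8</strengthValue><strengthUnits>d</strengthUnits></entryInfo>
</ProductInfo>
</Document>"""


def entry_rows(xml_text, parent_tag="medicationsInfo"):
    """Yield one tuple of FIELDS values per <entryInfo> directly under each <parent_tag>."""
    root = ET.fromstring(xml_text)
    # iter() also matches the root itself, so a bare <medicationsInfo> fragment works too.
    for parent in root.iter(parent_tag):
        for entry in parent.findall("entryInfo"):
            # A missing child tag becomes an empty string rather than None.
            yield tuple(entry.findtext(name, default="") for name in FIELDS)


# Print tab-separated rows, one per entryInfo under medicationsInfo.
for row in entry_rows(SAMPLE):
    print("\t".join(row))
```

Because the function accepts either the whole document or a single medicationsInfo fragment, it could in principle be applied to the chararray that XMLLoader('medicationsInfo') produces, but that integration is an assumption and has not been tested here.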