Load XML using PIG Latin
[Title]: Load XML using PIG Latin [Posted]: 2014-04-23 09:33:00 [Question]: I have a multi-level XML file, but I cannot find any example of how to load it.
XML file:
<?xml version="1.0" encoding="UTF-8" ?>
<Feed xmlns="http://www.xx.com/PRR/ProductFeed/1.0"
name="xx"
incremental="false"
extractDate="2014-04-22T11:00:00.000000"><Categories><Category> <ExternalId>2_5</ExternalId><ParentExternalId></ParentExternalId><Name><![CDATA[Baby]]></Name><CategoryPageUrl>http://www.xx.com/en-US/Clearance/Baby-0-3yrs-Clothing.html</CategoryPageUrl></Category><Category><ExternalId>2_3</ExternalId><ParentExternalId></ParentExternalId><Name><![CDATA[Boys 1½-12yrs]]></Name><CategoryPageUrl>http://www.xx.com/en-US/Clearance/Boys-1H-12yrs-Clothing.html</CategoryPageUrl></Category></Categories>
<Products><Product><ExternalId>78094</ExternalId><Name><![CDATA[Sleep Bag]]></Name><Description><![CDATA[A cover they can't throw off in the night. Pure cotton with one of our uniquely lovely prints. In its own gift box. An ultra thoughtful, luxurious present.]]></Description><Brand>xx</Brand><CategoryExternalId>1_5_1</CategoryExternalId><ProductPageUrl>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094/Baby-0-3yrs-Sleep-Bag.html</ProductPageUrl><ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</ImageUrl><SwatchImageUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchImageUrl><Price>54.0000</Price><Wasprice>54.0000</Wasprice><ManufacturerPartNumber></ManufacturerPartNumber><EAN></EAN><Colours><Variation><Tier2>MUL</Tier2><Tier2Descr><![CDATA[Multi Elephant Party]]></Tier2Descr><Tier2Url>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094-MUL/Baby-0-3yrs-Multi-Elephant-Party-Sleep-Bag.html</Tier2Url><Tier2ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</Tier2ImageUrl><Tier3>03 06</Tier3><Tier3Descr><![CDATA[3-6m]]></Tier3Descr><StockStatus>-2</StockStatus><SwatchUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchUrl></Variation><Variation><Tier2>MUL</Tier2><Tier2Descr><![CDATA[Multi Elephant Party]]></Tier2Descr><Tier2Url>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094-MUL/Baby-0-3yrs-Multi-Elephant-Party-Sleep-Bag.html</Tier2Url><Tier2ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</Tier2ImageUrl><Tier3>06 18</Tier3><Tier3Descr><![CDATA[6-18m]]></Tier3Descr> <StockStatus>-2</StockStatus> <SwatchUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchUrl> </Variation></Colours></Product>
</Products>
</Feed>
I tried the following, but it returns empty rows. I also need the products, not only the categories.
REGISTER 'lib/pig/piggybank.jar'
-- load raw
raw = load '$Input' using org.apache.pig.piggybank.storage.XMLLoader('Category')
as (x:chararray);
raw_flatten = foreach raw GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,
'<Category>\\n\\s*<ExternalId>(.*)</ExternalId>\\n\\s*<ParentExternalId>(.*)</ParentExternalId>\\n\\s*<Name>(.*)</Name>\\n\\s*<CategoryPageUrl>(.*)</CategoryPageUrl>\\n\\s*</Category>'))
as (external_id:chararray, parent_external_id:chararray, name:chararray, categorypageurl:chararray);
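How can I load the XML above?
Thanks in advance.
Update: if I put a line break after every field, I can read the data. How can I work around this? Other tools do not need the line breaks, and I cannot change the source data.
Formatted XML: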
<?xml version="1.0" encoding="UTF-8" ?>
<Feed xmlns="http://www.xx.com/PRR/ProductFeed/1.0"
name="xx"
incremental="false"
extractDate="2014-04-22T11:00:00.000000">
<Categories>
<Category>
<ExternalId>2_5</ExternalId>
<ParentExternalId></ParentExternalId>
<Name>Baby</Name>
<CategoryPageUrl>http://www.xx.com/en-US/Clearance/Baby-0-3yrs-Clothing.html</CategoryPageUrl>
</Category>
<Category>
<ExternalId>2_3</ExternalId>
<ParentExternalId></ParentExternalId>
<Name>Boys 1½-12yrs</Name>
<CategoryPageUrl>http://www.xx.com/en-US/Clearance/Boys-1H-12yrs-Clothing.html</CategoryPageUrl>
</Category>
</Categories>
<Products>
<Product>
<ExternalId>78094</ExternalId>
<Name>Sleep Bag</Name>
<Description>A cover they can't throw off in the night. Pure cotton with one of our uniquely lovely prints. In its own gift box. An ultra thoughtful, luxurious present.</Description>
<Brand>xx</Brand>
<CategoryExternalId>1_5_1</CategoryExternalId>
<ProductPageUrl>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094/Baby-0-3yrs-Sleep-Bag.html</ProductPageUrl>
<ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</ImageUrl>
<SwatchImageUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchImageUrl>
<Price>54.0000</Price>
<Wasprice>54.0000</Wasprice>
<ManufacturerPartNumber></ManufacturerPartNumber>
<EAN></EAN>
<Colours>
<Variation>
<Tier2>MUL</Tier2>
<Tier2Descr>Multi Elephant Party</Tier2Descr>
<Tier2Url>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094-MUL/Baby-0-3yrs-Multi-Elephant-Party-Sleep-Bag.html</Tier2Url>
<Tier2ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</Tier2ImageUrl>
<Tier3>03 06</Tier3>
<Tier3Descr>3-6m</Tier3Descr>
<StockStatus>-2</StockStatus>
<SwatchUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchUrl>
</Variation>
<Variation>
<Tier2>MUL</Tier2>
<Tier2Descr>Multi Elephant Party</Tier2Descr>
<Tier2Url>http://www.xx.com/en-US/Baby-0-3yrs-Accessories/78094-MUL/Baby-0-3yrs-Multi-Elephant-Party-Sleep-Bag.html</Tier2Url>
<Tier2ImageUrl>http://www.xx.com/productimages/productThumb160x207/14USPR_78094_MUL.jpg</Tier2ImageUrl>
<Tier3>06 18</Tier3>
<Tier3Descr>6-18m</Tier3Descr>
<StockStatus>-2</StockStatus>
<SwatchUrl>http://www.xx.com/productimages/grsw/14USPR_78094_MUL_s.jpg</SwatchUrl>
</Variation>
</Colours>
</Product>
</Products>
</Feed>
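[Comments]:
I was able to format the XML and can now read the categories, but I cannot read the products because they contain embedded variations. How can I load this XML?
[Answer 1]:
Your regular expression string appears to require a line break between fields: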
\\n\\s*
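Change it to [\n\s]* and it will work (inside the Pig string literal the backslashes are doubled, as in the rest of the pattern: [\\n\\s]*). Because \s already matches newlines, \\s* alone is equivalent. The original \\n\\s* demands at least one newline before every tag, so a feed whose elements all sit on one line never matches and every extracted field comes back empty.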
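The effect of that change can be checked outside Pig. The sketch below is an illustration in Python, not part of the original answer. It assumes that REGEX_EXTRACT_ALL only yields fields when the whole string matches, which `re.fullmatch` approximates here, and that `\s` and `.` behave the same way for this input in both regex engines. The helper names `build_pattern` and `extract` are hypothetical.

```python
import re

# First <Category> element as it appears in the unformatted feed:
# a single space after <Category>, no newlines between the fields.
compact = (
    "<Category> <ExternalId>2_5</ExternalId><ParentExternalId></ParentExternalId>"
    "<Name><![CDATA[Baby]]></Name>"
    "<CategoryPageUrl>http://www.xx.com/en-US/Clearance/Baby-0-3yrs-Clothing.html</CategoryPageUrl>"
    "</Category>"
)

# The same element after manual formatting (one field per line).
formatted = (
    "<Category>\n<ExternalId>2_5</ExternalId>\n<ParentExternalId></ParentExternalId>\n"
    "<Name>Baby</Name>\n"
    "<CategoryPageUrl>http://www.xx.com/en-US/Clearance/Baby-0-3yrs-Clothing.html</CategoryPageUrl>\n"
    "</Category>"
)

FIELDS = ["ExternalId", "ParentExternalId", "Name", "CategoryPageUrl"]


def build_pattern(separator):
    """Build the Category pattern with `separator` as the regex between tags."""
    body = separator.join(f"<{tag}>(.*)</{tag}>" for tag in FIELDS)
    return re.compile(f"<Category>{separator}{body}{separator}</Category>")


def extract(pattern, text):
    """Return the captured fields, or None when the whole text does not match."""
    match = pattern.fullmatch(text)
    return match.groups() if match else None


strict = build_pattern(r"\n\s*")  # pattern from the question: newline required
relaxed = build_pattern(r"\s*")   # suggested fix: whitespace is optional

print(extract(strict, compact))     # no match on the single-line feed
print(extract(relaxed, compact))    # fields are extracted
print(extract(relaxed, formatted))  # the formatted feed still matches
```

With the relaxed separator, the single-line element yields its four fields. The name is captured together with its `<![CDATA[...]]>` wrapper, so a further cleanup step would be needed if only the text is wanted.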