How to parse XML element nodes using a Pig script?
[Title]: How to parse XML element nodes using a Pig script?
[Posted]: 2014-12-28 21:27:40
[Question]: I am using Pig Latin to process a large XML dump. I am trying to get the values of XML nodes such as location and temp_c in Pig Latin. The file looks like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet href="latest_ob.xsl" type="text/xsl"?>
<current_observation version="1.0"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="http://www.weather.gov/view/current_observation.xsd">
<credit>NOAA's National Weather Service</credit>
<credit_URL>http://weather.gov/</credit_URL>
<image>
<url>http://weather.gov/images/xml_logo.gif</url>
<title>NOAA's National Weather Service</title>
<link>http://weather.gov</link>
</image>
<suggested_pickup>15 minutes after the hour</suggested_pickup>
<suggested_pickup_period>60</suggested_pickup_period>
<location>Unknown Station</location>
<station_id>51WH0</station_id>
<observation_time>Last Updated on Dec 23 2014, 11:00 pm LST</observation_time>
<observation_time_rfc822>Tue, 23 Dec 2014 23:00:00 +1000</observation_time_rfc822>
<temperature_string>71.4 F (21.9 C)</temperature_string>
<temp_f>71.4</temp_f>
<temp_c>21.9</temp_c>
<water_temp_f>75.9</water_temp_f>
<water_temp_c>24.4</water_temp_c>
<wind_string>North at 24.6 MPH (21.38 KT)</wind_string>
<wind_dir>North</wind_dir>
<wind_degrees>20</wind_degrees>
<wind_mph>24.6</wind_mph>
<wind_gust_mph>0.0</wind_gust_mph>
<wind_kt>21.38</wind_kt>
<pressure_string>1015.0 mb</pressure_string>
<pressure_mb>1015.0</pressure_mb>
<dewpoint_string>58.1 F (14.5 C)</dewpoint_string>
<dewpoint_f>58.1</dewpoint_f>
<dewpoint_c>14.5</dewpoint_c>
</current_observation>
[Comments on the question]:
[Answer 1]: This may help; try the following.
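-- Register piggybank, which provides XMLLoader and the XPath UDF
REGISTER piggybank.jar;
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
-- Load each <current_observation> element as one chararray record
A = LOAD 'xmls/your_file.xml' USING org.apache.pig.piggybank.storage.XMLLoader('current_observation') AS (x:chararray);
-- Extract the text of the location and temp_c child nodes
B = FOREACH A GENERATE XPath(x, 'current_observation/location'), XPath(x, 'current_observation/temp_c');
DUMP B;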
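If the XPath UDF returns nothing for a file like this, one possible fallback is to pull the values out of the loaded record with Pig's built-in REGEX_EXTRACT instead of parsing the XML. This is only a sketch, not a tested fix for this data set: it assumes that each record contains exactly one <location> and one <temp_c> element, that neither contains nested markup, and the aliases location and temp_c are names chosen here for illustration.

```pig
-- Sketch only: a regex-based alternative to the XPath UDF.
-- Assumes one <location> and one <temp_c> per record, with plain-text content.
REGISTER piggybank.jar;
A = LOAD 'xmls/your_file.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('current_observation')
    AS (x:chararray);
-- REGEX_EXTRACT(string, regex, index) returns the capture group at the given index
B = FOREACH A GENERATE
    REGEX_EXTRACT(x, '<location>([^<]*)</location>', 1) AS location,
    REGEX_EXTRACT(x, '<temp_c>([^<]*)</temp_c>', 1) AS temp_c;
DUMP B;
```

Because the pattern matches the raw text, it ignores the attributes on the root element, but it also breaks if an element carries attributes of its own or appears more than once in a record.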
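[Comments]:
Hi ravi, I tried this, but the root element is followed by some attributes, and because of that it fails to dump the result. Every XML file has the same format.
[Answer 2]: Use this: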
data = LOAD '/path/your_file.xml'
USING org.apache.pig.piggybank.storage.StreamingXMLLoader(
    'current_observation',
    'credit, credit_URL, image, suggested_pickup, suggested_pickup_period, location, station_id, observation_time, temp_f, temp_c, water_temp_f, water_temp_c, wind_string, wind_dir, wind_degrees, wind_mph, wind_gust_mph, wind_kt, pressure_string, pressure_mb, dewpoint_string, dewpoint_f, dewpoint_c'
) AS (
    credit: (attr:map[], content:chararray),
    credit_URL: (attr:map[], content:chararray),
    .
    .
    .
);
DUMP data;
[Comments]:
I get these errors:
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve org.apache.pig.piggybank.storage.StreamingXMLLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:682)
at org.apache.pig.parser.LogicalPlanBuilder.validateFuncSpec(LogicalPlanBuilder.java:1320)
... 26 more
2014-12-29 12:26:40,739 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve org.apache.pig.piggybank.storage.StreamingXMLLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /home/hduser/Desktop/pig_1419836199018.log
Which version of Pig are you using now?
If you want to use StreamingXMLLoader, the Pig version should be 0.13 or above.
Hi ravi, I am using pig-0.14.0.