Can Impala query XML files stored in Hadoop/HDFS?

Posted 2023-03-22


【Posted】: 2014-03-24 16:44:49 【Question】:

I am investigating whether a Hadoop/Impala combination can meet my requirements for archiving, batch processing, and real-time ad hoc queries.

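We will persist XML files (well-formed and conforming to our own XSD schema) into Hadoop and use MapReduce for end-of-day batch queries and the like. For ad hoc user queries and application queries that need low latency and relatively high performance, we are considering Impala.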

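What I can't figure out is how Impala would understand the structure of the XML files so that it can query them efficiently. Can Impala be used to query across XML documents in a meaningful way?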

Thanks in advance.


【Answer 1】:

Hive and Impala don't really have a mechanism for working with XML files (which is odd, considering that most databases support XML).

That said, if I were facing this problem, I would use Pig to import the data into HCatalog. At that point, Hive and Impala can both use it.

Here is a quick and dirty example of using Pig to import some XML data into HCatalog:

--rss.pig

REGISTER piggybank.jar

items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS  (item:chararray);

data = FOREACH items GENERATE REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS  link:chararray, 
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS  title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>',  1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS  pubdate:chararray;

STORE data into 'rss_items' USING org.apache.hcatalog.pig.HCatStorer();


validate = LOAD 'default.rss_items' USING org.apache.hcatalog.pig.HCatLoader();
dump validate;

--Results

(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)

--Impala query

select * from rss_items

--Impala results

    link    title   description pubdate
0   http://www.hannonhill.com/news/item1.html   News Item 1 Description of news item 1 here.    03 Jun 2003 09:39:21
1   http://www.hannonhill.com/news/item2.html   News Item 2 Description of news item 2 here.    30 May 2003 11:06:42
2   http://www.hannonhill.com/news/item3.html   News Item 3 Description of news item 3 here.    20 May 2003 08:56:02

--rss.txt data file

<rss version="2.0">
   <channel>
      <title>News</title>
      <link>http://www.hannonhill.com</link>
      <description>Hannon Hill News</description>
      <language>en-us</language>
      <pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
      <generator>Cascade Server</generator>
      <webMaster>webmaster@hannonhill.com</webMaster>
      <item>
         <title>News Item 1</title>
         <link>http://www.hannonhill.com/news/item1.html</link>
         <description>Description of news item 1 here.</description>
         <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item1.html</guid>
      </item>
      <item>
         <title>News Item 2</title>
         <link>http://www.hannonhill.com/news/item2.html</link>
         <description>Description of news item 2 here.</description>
         <pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item2.html</guid>
      </item>
      <item>
         <title>News Item 3</title>
         <link>http://www.hannonhill.com/news/item3.html</link>
         <description>Description of news item 3 here.</description>
         <pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item3.html</guid>
      </item>
   </channel>
</rss>
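
To see what a pubDate pattern with explicit repetition counts ({2}, {3}, {4}) captures from the data above, the expression can be tried outside Pig. This is a minimal Python sketch, not part of the original answer: the helper name `extract_pubdate` is made up for illustration, and it assumes Python's `re` module treats this simple pattern the same way as the Java regex engine behind Pig's REGEX_EXTRACT.

```python
import re

# Same pattern as the pubDate extraction, written with single backslashes
# because a Python raw string does not need the doubled escaping Pig uses.
PUBDATE_RE = re.compile(
    r"<pubDate>.*(\d{2}\s[a-zA-Z]{3}\s\d{4}\s\d{2}:\d{2}:\d{2}).*</pubDate>"
)


def extract_pubdate(item_xml):
    """Return the 'dd Mon yyyy hh:mm:ss' part of an item's pubDate, or None."""
    match = PUBDATE_RE.search(item_xml)
    return match.group(1) if match else None


# One <item> taken from the rss.txt sample.
sample_item = """<item>
   <title>News Item 1</title>
   <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
</item>"""

print(extract_pubdate(sample_item))
```

The capture group drops the weekday prefix and the trailing time zone, which matches the pubdate values shown in the results.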


【Answer 2】:

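For now, it looks like you are out of luck with Impala and XML. Impala uses the Hive metastore, but it does not support custom InputFormats and SerDes. You can see the formats it supports natively here.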

You can use Hive instead, and newer versions (0.12+) should be noticeably faster.


【Answer 3】:

Another approach is to quickly convert a batch of XML to Avro and use the Avro files to back tables defined in Hive or Impala.

XMLSlurper can be used to parse the records out of the XML files.
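
The conversion step can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses Python's standard `xml.etree.ElementTree` in place of XMLSlurper, and it writes tab-delimited text instead of Avro (writing Avro needs a third-party library), since delimited text is also a format Impala can read. The function names `items_to_rows` and `rows_to_tsv` and the column list are hypothetical, modeled on the RSS sample from answer 1.

```python
import xml.etree.ElementTree as ET

# Assumed column list, taken from the child elements of <item> in the sample.
FIELDS = ("link", "title", "description", "pubDate")


def items_to_rows(xml_text, record_tag="item", fields=FIELDS):
    """Yield one tuple per record element, using '' for missing fields."""
    root = ET.fromstring(xml_text)
    for record in root.iter(record_tag):
        yield tuple((record.findtext(name) or "").strip() for name in fields)


def rows_to_tsv(rows):
    """Join rows into tab-delimited lines, one record per line.

    Runs of whitespace inside a value (including tabs and newlines) are
    collapsed to a single space so they cannot be mistaken for delimiters.
    """
    lines = []
    for row in rows:
        cleaned = [" ".join(value.split()) for value in row]
        lines.append("\t".join(cleaned))
    return "".join(line + "\n" for line in lines)


sample = """<rss version="2.0"><channel>
  <title>News</title>
  <item>
    <title>News Item 1</title>
    <link>http://www.hannonhill.com/news/item1.html</link>
    <description>Description of news item 1 here.</description>
    <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
  </item>
</channel></rss>"""

rows = list(items_to_rows(sample))
print(rows_to_tsv(rows), end="")
```

A file produced this way could be placed in HDFS behind a table declared with `ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'`; for large archives, a columnar or Avro target as suggested above is likely the better fit.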


【Answer 4】:

You can try the XML SerDe for Hive here

