Impala external table with tab separated values and field names
[Posted]: 2014-03-03 15:34:52
[Question]: I have some data in HDFS, and I want to create an external table over it and query it through Impala. The data is tab-separated, but each value is also prefixed with its field name. Sample data:
state:IL city:chicago population:2714856
state:NY city:New York population:8336697
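I know how to create the table and specify that the data is tab-delimited, but is there a way to handle the field names embedded in the data?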
[Answer 1]:
Impala
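The Impala solution uses the same regular-expression extraction logic (REGEXP_EXTRACT) as the Pig example I posted earlier, which is kept in the Pig section of this answer.
--csp.txt (input file, located in /user/cloudera/csp)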
state:New York city:New York population:8336697
state:California city:Los Angeles population:3857799
state:Illinois city:Chicago population:2714856
state:Texas city:Houston population:2160821
state:Pennsylvania city:Philadelphia population:1547607
state:Arizona city:Phoenix population:1488750
state:Texas city:San Antonio population:1382951
state:California city:San Diego population:1338348
state:Texas city:Dallas population:1241162
state:California city:San Jose population:982765
state:Texas city:Austin population:842592
Create the database and the external table
CREATE DATABASE IF NOT EXISTS CSP COMMENT 'City, State, Population';
DROP TABLE IF EXISTS CSP.original;
CREATE EXTERNAL TABLE IF NOT EXISTS CSP.original
(
st STRING COMMENT 'State',
ct STRING COMMENT 'City',
po STRING COMMENT 'Population'
)
COMMENT 'Original Table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/cloudera/csp';
SELECT statement that uses a regular expression to strip the "state:", "city:", and "population:" prefixes
SELECT
regexp_extract(st, '.*:(\\w.*)', 1) AS state,
regexp_extract(ct, '.*:(\\w.*)', 1) AS city,
regexp_extract(po, '.*:(\\w.*)', 1) AS population
FROM original;
Query result
[localhost.localdomain:21000] > select regexp_extract(st, '.*:(\\w.*)', 1) AS state, regexp_extract(ct, '.*:(\\w.*)', 1) AS city, regexp_extract(po, '.*:(\\w.*)', 1) AS population FROM original limit 11;
Query: select regexp_extract(st, '.*:(\\w.*)', 1) AS state, regexp_extract(ct, '.*:(\\w.*)', 1) AS city, regexp_extract(po, '.*:(\\w.*)', 1) AS population FROM original limit 11
+--------------+--------------+------------+
| state | city | population |
+--------------+--------------+------------+
| New York | New York | 8336697 |
| California | Los Angeles | 3857799 |
| Illinois | Chicago | 2714856 |
| Texas | Houston | 2160821 |
| Pennsylvania | Philadelphia | 1547607 |
| Arizona | Phoenix | 1488750 |
| Texas | San Antonio | 1382951 |
| California | San Diego | 1338348 |
| Texas | Dallas | 1241162 |
| California | San Jose | 982765 |
| Texas | Austin | 842592 |
+--------------+--------------+------------+
Returned 11 row(s) in 0.22s
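To make the pattern in the SELECT above easier to reason about, here is a minimal sketch of what `.*:(\w.*)` captures. It uses Python's `re` module purely as an illustration; it is not the regex engine Impala runs, and the helper name `strip_field_name` and the `note:a:b` input are made up for this example.

```python
import re

# Same pattern as in the Impala query. Python's re module is used here only
# to illustrate the matching; it is not the engine Impala uses.
PATTERN = re.compile(r'.*:(\w.*)')

def strip_field_name(field):
    """Return what the capture group matches, or None if the pattern fails."""
    match = PATTERN.match(field)
    return match.group(1) if match else None

print(strip_field_name('state:New York'))      # New York
print(strip_field_name('population:8336697'))  # 8336697
# The leading .* is greedy, so only the text after the LAST colon survives.
print(strip_field_name('note:a:b'))            # b
```

The last call shows the one caveat worth keeping in mind: if a value can itself contain a colon, this pattern keeps only the part after the final colon, so a stricter pattern anchored on the known field name may be a safer choice.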
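Pig
For me, the easiest way to conceptualize the process was actually to do it in Pig first, so I mocked up a data file using your syntax and wrote the program in Pig. The program's output is a CSV file that you can use to create the Impala external table if you like.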
--csp.pig
REGISTER piggybank.jar
A = LOAD 'csp.txt' USING PigStorage('\t') AS (st:chararray,ct:chararray,po:chararray);
data = FOREACH A GENERATE
REGEX_EXTRACT(st, '.*:(\\w.*)', 1) AS (state:chararray),
REGEX_EXTRACT(ct, '.*:(\\w.*)', 1) AS (city:chararray),
REGEX_EXTRACT(po, '.*:(\\w.*)', 1) AS (population:int);
STORE data INTO 'csp' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE');
--csp.txt (input)
state:New York city:New York population:8336697
state:California city:Los Angeles population:3857799
state:Illinois city:Chicago population:2714856
state:Texas city:Houston population:2160821
state:Pennsylvania city:Philadelphia population:1547607
state:Arizona city:Phoenix population:1488750
state:Texas city:San Antonio population:1382951
state:California city:San Diego population:1338348
state:Texas city:Dallas population:1241162
state:California city:San Jose population:982765
state:Texas city:Austin population:842592
--csp (output)
New York,New York,8336697
California,Los Angeles,3857799
Illinois,Chicago,2714856
Texas,Houston,2160821
Pennsylvania,Philadelphia,1547607
Arizona,Phoenix,1488750
Texas,San Antonio,1382951
California,San Diego,1338348
Texas,Dallas,1241162
California,San Jose,982765
Texas,Austin,842592
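For a small file, the same input-to-output transformation can be sketched without a cluster. The following is a rough Python equivalent of the conversion shown above, not a replacement for the Pig script; the function name `to_csv` and the two-line sample are assumptions made for illustration, and the field names are dropped by splitting at the first colon rather than with the regular expression.

```python
import csv
import io

def to_csv(lines):
    """Convert tab-separated 'name:value' lines to CSV text without the field names."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        # partition(':') splits at the first colon and keeps the rest as the value
        writer.writerow([f.partition(':')[2] for f in fields])
    return out.getvalue()

# Two lines in the same layout as csp.txt, with explicit tab separators.
sample = [
    'state:New York\tcity:New York\tpopulation:8336697\n',
    'state:Illinois\tcity:Chicago\tpopulation:2714856\n',
]
print(to_csv(sample), end='')
```

Because `csv.writer` quotes any value that contains a comma, the result should stay loadable as comma-delimited data even if a city or state name includes one.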