Impala external table with tab separated values and field names

Posted: 2014-03-03 15:34:52

Question:

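I have some data in HDFS that I want to create an external table over and query through Impala. The data is tab-separated, but each value also carries its field name. Sample data: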

state:IL     city:chicago     population:2714856
state:NY     city:New York     population:8336697

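I know how to create the table and specify that the data is tab-delimited, but is there a way to handle the field names embedded in the data?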

Answer 1:

Impala

The Impala solution uses the same regular-expression logic (regexp_extract) as the Pig example I posted earlier, which is included further down.

--csp.txt (input file, located in /user/cloudera/csp)

state:New York  city:New York   population:8336697
state:California        city:Los Angeles        population:3857799
state:Illinois  city:Chicago    population:2714856
state:Texas     city:Houston    population:2160821
state:Pennsylvania      city:Philadelphia       population:1547607
state:Arizona   city:Phoenix    population:1488750
state:Texas     city:San Antonio        population:1382951
state:California        city:San Diego  population:1338348
state:Texas     city:Dallas     population:1241162
state:California        city:San Jose   population:982765
state:Texas     city:Austin     population:842592

Create the database and the external table

CREATE DATABASE IF NOT EXISTS CSP COMMENT 'City, State, Population';

DROP TABLE IF EXISTS CSP.original;

CREATE EXTERNAL TABLE IF NOT EXISTS CSP.original 
(
    st STRING COMMENT 'State', 
    ct STRING COMMENT 'City', 
    po STRING COMMENT 'Population'
) 
COMMENT 'Original Table' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
LOCATION '/user/cloudera/csp';

SELECT statement that uses a regular expression to strip the "state:", "city:" and "population:" labels from the output

SELECT 
  regexp_extract(st, '.*:(\\w.*)', 1) AS state, 
  regexp_extract(ct, '.*:(\\w.*)', 1) AS city, 
  regexp_extract(po, '.*:(\\w.*)', 1) AS population 
FROM CSP.original;

Query results

[localhost.localdomain:21000] > select regexp_extract(st, '.*:(\\w.*)', 1) AS state, regexp_extract(ct, '.*:(\\w.*)', 1) AS city, regexp_extract(po, '.*:(\\w.*)', 1) AS population FROM original limit 11;
Query: select regexp_extract(st, '.*:(\\w.*)', 1) AS state, regexp_extract(ct, '.*:(\\w.*)', 1) AS city, regexp_extract(po, '.*:(\\w.*)', 1) AS population FROM original limit 11
+--------------+--------------+------------+
| state        | city         | population |
+--------------+--------------+------------+
| New York     | New York     | 8336697    |
| California   | Los Angeles  | 3857799    |
| Illinois     | Chicago      | 2714856    |
| Texas        | Houston      | 2160821    |
| Pennsylvania | Philadelphia | 1547607    |
| Arizona      | Phoenix      | 1488750    |
| Texas        | San Antonio  | 1382951    |
| California   | San Diego    | 1338348    |
| Texas        | Dallas       | 1241162    |
| California   | San Jose     | 982765     |
| Texas        | Austin       | 842592     |
+--------------+--------------+------------+
Returned 11 row(s) in 0.22s
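
Because every query would otherwise have to repeat the three regexp_extract calls, one option is to wrap them in a view. The following is a minimal sketch and not part of the original answer: the view name CSP.cities is hypothetical, it assumes an Impala release that supports views, and it casts the population to BIGINT so the column can be sorted and aggregated numerically. The pattern is the same one used in the SELECT above.

```sql
-- Hypothetical view name; assumes CSP.original exists as created above.
CREATE VIEW IF NOT EXISTS CSP.cities AS
SELECT
  regexp_extract(st, '.*:(\\w.*)', 1) AS state,
  regexp_extract(ct, '.*:(\\w.*)', 1) AS city,
  -- regexp_extract returns a STRING, so cast it to get a numeric column
  CAST(regexp_extract(po, '.*:(\\w.*)', 1) AS BIGINT) AS population
FROM CSP.original;

-- Example use: total population per state, largest first.
-- LIMIT is included because older Impala releases require it with ORDER BY.
SELECT state, SUM(population) AS total_population
FROM CSP.cities
GROUP BY state
ORDER BY total_population DESC
LIMIT 10;
```

The view stores no data; the regular expressions still run on every query, so for large files it may be cheaper to materialize the cleaned rows once.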


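For me, the easiest way to conceptualize the process was actually to do it in Pig first, so I mocked up a data file using your syntax and wrote the program in Pig. The program's output is a CSV file that you can use to create the Impala external table if you like.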

--csp.pig

REGISTER piggybank.jar;

A = LOAD 'csp.txt' USING PigStorage('\t') AS (st:chararray,ct:chararray,po:chararray);

data = FOREACH A GENERATE 
REGEX_EXTRACT(st, '.*:(\\w.*)', 1) AS (state:chararray),
REGEX_EXTRACT(ct, '.*:(\\w.*)', 1) AS (city:chararray),
REGEX_EXTRACT(po, '.*:(\\w.*)', 1) AS (population:int);

STORE data INTO 'csp' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE');

--csp.txt (input)

state:New York  city:New York   population:8336697
state:California        city:Los Angeles        population:3857799
state:Illinois  city:Chicago    population:2714856
state:Texas     city:Houston    population:2160821
state:Pennsylvania      city:Philadelphia       population:1547607
state:Arizona   city:Phoenix    population:1488750
state:Texas     city:San Antonio        population:1382951
state:California        city:San Diego  population:1338348
state:Texas     city:Dallas     population:1241162
state:California        city:San Jose   population:982765
state:Texas     city:Austin     population:842592

--csp (output)

New York,New York,8336697
California,Los Angeles,3857799
Illinois,Chicago,2714856
Texas,Houston,2160821
Pennsylvania,Philadelphia,1547607
Arizona,Phoenix,1488750
Texas,San Antonio,1382951
California,San Diego,1338348
Texas,Dallas,1241162
California,San Jose,982765
Texas,Austin,842592
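
The comma-separated output above can then back a plain external table with no regular expressions at query time. This is a sketch under stated assumptions rather than something from the original answer: the table name CSP.cleaned and the LOCATION path are hypothetical, and LOCATION should be the HDFS directory that actually holds the Pig output part files.

```sql
-- Hypothetical table name and location; adjust LOCATION to the directory
-- containing the files written by the Pig STORE statement.
CREATE EXTERNAL TABLE IF NOT EXISTS CSP.cleaned
(
    state STRING COMMENT 'State',
    city STRING COMMENT 'City',
    population BIGINT COMMENT 'Population'
)
COMMENT 'Cleaned table built from the Pig output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/csp_clean';

-- The labels are already gone, so the columns can be queried directly.
SELECT state, city, population
FROM CSP.cleaned
ORDER BY population DESC
LIMIT 5;
```

One caveat: CSVExcelStorage wraps a field in double quotes when it contains a comma, and Impala's delimited text format does not interpret quotes. None of the sample values contain commas, but data that does would need a different delimiter.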
