Create table as select percentage subquery in Impala DB

【Title】: Create table as select percentage subquery in Impala DB 【Posted】: 2020-07-27 13:27:53 【Question】:

I am new to Impala and need to create a table from the result set of a SELECT. The SQL is run from Java through JDBC; see the query below:

create table if not exists my_temp_table as select 
41 as rule_id,49 as record_id,
(select count(1) as val from dirty_table where msg regexp '^[1]([3-9])[0-9]{9}$' )/(select count(1) from dirty_table);

I need to create the table my_temp_table and insert the data into it, all in a single SQL statement. But it fails with the following error:

[HY000][500051] [Cloudera][ImpalaJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: TStatus(statusCode:ERROR_STATUS, sqlState:HY000, errorMessage:ParseException: Syntax error

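After checking, I learned that Impala does not support subqueries in the SELECT list; subqueries can only be used in the FROM or WHERE clause. See the Impala documentation: https://impala.apache.org/docs/build/html/topics/impala_subqueries.html.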

So how can I solve this problem?

My ideas:

    Rewrite the SQL so that it can run. I tried WITH, as in the SQL below; it works on its own but not inside CREATE TABLE ... AS ...
    WITH q1 AS (
      select count(1) as val from dirty_table where msg regexp '^[1]([3-9])[0-9]{9}$'
    ),
    q2 AS (
      select count(1) val2 from dirty_table
    )
    SELECT 100 * q1.val / q2.val2  result
    FROM q1, q2
    Alternatively, is there a statement like BEGIN ... END in MySQL or Oracle, so that I can run this SQL on its own?

【Comments】:

【Answer 1】:

Based on your example, I would try the following approaches, which I believe work. I verified the solution with Impala.

CREATE TABLE dirty_table (
 id INT,
 msg STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED  BY ','
STORED AS TEXTFILE;


[localhost.localdomain:21000] > SELECT * FROM dirty_table;
Query: SELECT * FROM dirty_table
Query submitted at: 2020-07-28 17:05:24 (Coordinator: http://localhost.localdomain:25000)
Query progress can be monitored at: http://localhost.localdomain:25000/query_plan?query_id=5441d6a46ce61e7b:8e49432600000000
+----+-------------+
| id | msg         |
+----+-------------+
| 1  | 13321512121 |
| 2  | 13121212121 |
| 3  | 03121212121 |
| 4  | 13321512121 |
| 5  | 13121212121 |
| 6  | 03121212121 |
| 7  | 13121212121 |
+----+-------------+
Fetched 7 row(s) in 0.14s

First example

CREATE TABLE IF NOT EXISTS my_temp_table AS
SELECT 41 AS rule_id, 49 AS record_id, val1 / val2 AS result
FROM (SELECT COUNT(1) AS val1 FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$' ) a,
     (SELECT COUNT(1) AS val2 FROM dirty_table) b;

[localhost.localdomain:21000] > CREATE TABLE IF NOT EXISTS my_temp_table AS
                              > SELECT 41 AS rule_id, 49 AS record_id, val1 / val2 AS result
                              > FROM (SELECT COUNT(1) AS val1 FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$' ) a,
                              >      (SELECT COUNT(1) AS val2 FROM dirty_table) b;
Query: CREATE TABLE IF NOT EXISTS my_temp_table AS
SELECT 41 AS rule_id, 49 AS record_id, val1 / val2 AS result
FROM (SELECT COUNT(1) AS val1 FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$' ) a,
     (SELECT COUNT(1) AS val2 FROM dirty_table) b
+-------------------+
| summary           |
+-------------------+
| Inserted 0 row(s) |
+-------------------+
Fetched 1 row(s) in 0.21s

[localhost.localdomain:21000] > invalidate metadata;

[localhost.localdomain:21000] > SELECT * FROM my_temp_table;
Query: select * from my_temp_table
Query submitted at: 2020-07-28 17:03:44 (Coordinator: http://localhost.localdomain:25000)
Query progress can be monitored at: http://localhost.localdomain:25000/query_plan?query_id=47370bf793a09b:29c4dfa000000000
+---------+-----------+--------------------+
| rule_id | record_id | result             |
+---------+-----------+--------------------+
| 41      | 49        | 0.7142857142857143 |
+---------+-----------+--------------------+
Fetched 1 row(s) in 0.13s
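
The derived-table rewrite in the first example can be checked away from a cluster. The following is a minimal sketch, not the answerer's method: it uses Python's built-in sqlite3 module as a stand-in for Impala, so several details are assumptions. SQLite has no built-in REGEXP, so a Python function is registered for it; SQLite divides integers as integers, so the ratio is multiplied by 1.0 (in the Impala session above, plain `/` already returned a fraction); and the `{9}` quantifier in the pattern is assumed, since that is what reproduces a 5-of-7 match on the sample rows. The helper names `build_db` and `create_ratio_table` are hypothetical.

```python
import re
import sqlite3

# Sample rows copied from the dirty_table listing shown earlier.
ROWS = [
    (1, "13321512121"), (2, "13121212121"), (3, "03121212121"),
    (4, "13321512121"), (5, "13121212121"), (6, "03121212121"),
    (7, "13121212121"),
]


def build_db():
    """Create an in-memory SQLite database holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    # SQLite evaluates "X REGEXP Y" by calling regexp(Y, X), so the
    # pattern arrives as the first argument and the value as the second.
    conn.create_function(
        "REGEXP", 2,
        lambda pattern, value: 1 if re.search(pattern, value) else 0,
    )
    conn.execute("CREATE TABLE dirty_table (id INTEGER, msg TEXT)")
    conn.executemany("INSERT INTO dirty_table VALUES (?, ?)", ROWS)
    return conn


def create_ratio_table(conn):
    """Run the CTAS with both counts moved into FROM-clause subqueries."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS my_temp_table AS
        SELECT 41 AS rule_id, 49 AS record_id,
               1.0 * val1 / val2 AS result  -- 1.0 * is a SQLite-only adjustment
        FROM (SELECT COUNT(1) AS val1 FROM dirty_table
              WHERE msg REGEXP '^[1]([3-9])[0-9]{9}$') a,
             (SELECT COUNT(1) AS val2 FROM dirty_table) b
        """
    )
    return conn.execute(
        "SELECT rule_id, record_id, result FROM my_temp_table"
    ).fetchone()


if __name__ == "__main__":
    print(create_ratio_table(build_db()))
```

Each derived table returns exactly one row, so the comma join yields a single row and the select list only has to divide two columns.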

Second example

DROP TABLE my_temp_table;

CREATE TABLE IF NOT EXISTS my_temp_table AS 
SELECT result FROM
    (WITH q1 AS (
      SELECT COUNT(1) AS val FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$'
    ),
    q2 AS (
      SELECT COUNT(1) val2 FROM dirty_table
    )
    SELECT 100 * q1.val / q2.val2 AS result
    FROM q1, q2) t;

[localhost.localdomain:21000] > CREATE TABLE IF NOT EXISTS my_temp_table AS 
                              > SELECT result FROM
                              >     (WITH q1 AS (
                              >       SELECT COUNT(1) AS val FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$'
                              >     ),
                              >     q2 AS (
                              >       SELECT COUNT(1) val2 FROM dirty_table
                              >     )
                              >     SELECT 100 * q1.val / q2.val2 AS result
                              >     FROM q1, q2) t;
Query: CREATE TABLE IF NOT EXISTS my_temp_table AS
SELECT result FROM
    (WITH q1 AS (
      SELECT COUNT(1) AS val FROM dirty_table WHERE msg regexp '^[1]([3-9])[0-9]{9}$'
    ),
    q2 AS (
      SELECT COUNT(1) val2 FROM dirty_table
    )
    SELECT 100 * q1.val / q2.val2 AS result
    FROM q1, q2) t
+-------------------+
| summary           |
+-------------------+
| Inserted 1 row(s) |
+-------------------+
Fetched 1 row(s) in 0.40s

[localhost.localdomain:21000] > invalidate metadata;

[localhost.localdomain:21000] > SELECT * FROM my_temp_table;
Query: SELECT * FROM my_temp_table
Query submitted at: 2020-07-28 17:08:17 (Coordinator: http://localhost.localdomain:25000)
Query progress can be monitored at: http://localhost.localdomain:25000/query_plan?query_id=3447684ef59d0c4:f70779200000000
+-------------------+
| result            |
+-------------------+
| 71.42857142857143 |
+-------------------+
Fetched 1 row(s) in 0.74s
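
The second example's shape, a WITH query wrapped in a FROM-clause subquery so that CREATE TABLE ... AS is followed directly by SELECT, can be tried the same way. This is again only a sketch under assumptions: sqlite3 stands in for Impala, the REGEXP function is registered by hand, `100.0` replaces `100` because SQLite would otherwise do integer division, and the `{9}` quantifier is assumed. `build_db` and `create_percentage_table` are hypothetical names.

```python
import re
import sqlite3

# Sample msg values copied from the dirty_table listing shown earlier.
MSGS = [
    "13321512121", "13121212121", "03121212121", "13321512121",
    "13121212121", "03121212121", "13121212121",
]


def build_db():
    """Create an in-memory SQLite database holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    # "X REGEXP Y" is dispatched to regexp(Y, X): pattern first, value second.
    conn.create_function(
        "REGEXP", 2,
        lambda pattern, value: 1 if re.search(pattern, value) else 0,
    )
    conn.execute("CREATE TABLE dirty_table (id INTEGER, msg TEXT)")
    conn.executemany(
        "INSERT INTO dirty_table VALUES (?, ?)",
        list(enumerate(MSGS, start=1)),
    )
    return conn


def create_percentage_table(conn):
    """Run the CTAS whose source is a WITH query nested inside FROM (...)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS my_temp_table AS
        SELECT result FROM
            (WITH q1 AS (
               SELECT COUNT(1) AS val FROM dirty_table
               WHERE msg REGEXP '^[1]([3-9])[0-9]{9}$'
             ),
             q2 AS (
               SELECT COUNT(1) AS val2 FROM dirty_table
             )
             SELECT 100.0 * q1.val / q2.val2 AS result  -- 100.0: SQLite-only
             FROM q1, q2) t
        """
    )
    return conn.execute("SELECT result FROM my_temp_table").fetchone()[0]


if __name__ == "__main__":
    print(create_percentage_table(build_db()))
```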

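【Comments】:

Hi @Chema, as with the other answer, these two SQL statements only work as plain SELECT statements; once CREATE TABLE is added, the SQL fails. Adding the exception log (both report the same error): [HY000][500051] [Cloudera][ImpalaJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: TStatus(statusCode:ERROR_STATUS, sqlState:HY000, errorMessage:ParseException: Syntax error in line 1: CREATE TABLE IF NOT EXISTS my_temp_table AS ^ Encountered: EOF Expected: SELECT, VALUES, W ...

Hi @KD Final, I changed the solution; please check it now.

Hi @KD Final, I checked the solution with the Cloudera distribution and Impala, and it works fine. I updated the post with a step-by-step solution. Maybe you are facing some other problem. Regards.

This general approach is correct: put the subqueries in the FROM clause, then reference them in the select list. In the upcoming Impala 4.0 (and in other Cloudera releases of Impala), we do support select-list subqueries. Internally, they are rewritten into queries exactly like this one.

【Answer 2】: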

I think a conditional average does what you want simply and efficiently, with a single table scan:

select avg(case when msg regexp '^[1]([3-9])[0-9]{9}$' then 100.0 else 0 end) result
from dirty_table

You can turn this into a create table statement:

create table my_temp_table as
select avg(case when msg regexp '^[1]([3-9])[0-9]{9}$' then 100.0 else 0 end) result
from dirty_table
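
The conditional average works because every matching row contributes 100.0 and every other row contributes 0, so the mean over all rows equals 100 times the matching fraction. Below is a minimal sketch of that idea, assuming Python's sqlite3 as a stand-in for Impala, a hand-registered REGEXP function, and the `{9}` quantifier in the pattern. The constant `rule_id` and `record_id` columns are carried over from the question and are not part of this answer's statement; `build_db` and `create_avg_table` are hypothetical names.

```python
import re
import sqlite3

# Sample msg values copied from the dirty_table listing in the first answer.
MSGS = [
    "13321512121", "13121212121", "03121212121", "13321512121",
    "13121212121", "03121212121", "13121212121",
]


def build_db():
    """Create an in-memory SQLite database holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    # "X REGEXP Y" is dispatched to regexp(Y, X): pattern first, value second.
    conn.create_function(
        "REGEXP", 2,
        lambda pattern, value: 1 if re.search(pattern, value) else 0,
    )
    conn.execute("CREATE TABLE dirty_table (id INTEGER, msg TEXT)")
    conn.executemany(
        "INSERT INTO dirty_table VALUES (?, ?)",
        list(enumerate(MSGS, start=1)),
    )
    return conn


def create_avg_table(conn):
    """Run the CTAS that computes the percentage in one pass over the table."""
    conn.execute(
        """
        CREATE TABLE my_temp_table AS
        SELECT 41 AS rule_id, 49 AS record_id,  -- constants from the question
               AVG(CASE WHEN msg REGEXP '^[1]([3-9])[0-9]{9}$'
                        THEN 100.0 ELSE 0 END) AS result
        FROM dirty_table
        """
    )
    return conn.execute(
        "SELECT rule_id, record_id, result FROM my_temp_table"
    ).fetchone()


if __name__ == "__main__":
    print(create_avg_table(build_db()))
```

Unlike the two-subquery form, this version reads dirty_table once, and the select list contains no subquery at all.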

【Comments】:

Hi @GMB, I tested your SQL: the first SELECT statement works, but the second CREATE TABLE statement does not. Cloudera Impala has many limitations that are not detailed in the official documentation. Adding the exception log: [HY000][500051] [Cloudera][ImpalaJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: TStatus(statusCode:ERROR_STATUS, sqlState:HY000, errorMessage:ParseException: Syntax error in line 1: create table if not exists my_temp_table as ^ Encountered: EOF Expected: SELECT, VALUES, W ...

@KDFinal: it looks like Impala does not support if not exists in create table... I changed the query.

I tested if not exists in Impala and it works; I used this simple SQL: create table if not exists my_temp_table as select * from dirty_table;, but when there is a subquery in the SELECT clause, it fails.
