Unable to query complex SQL statements from a Hive table using pyspark

Posted 2023-04-15

【Posted】: 2019-07-08 15:13:15 【Question】:

Hello, I am trying to query a Hive table from a Spark context.

My code:

from pyspark.sql import HiveContext

hive_context = HiveContext(sc)
bank = hive_context.table('select * from db.table_name')
bank.show()

A simple query like this works without any errors. But when I try the following query:

query = """with table1         as  (   select      distinct a,b
                            from    db_first.table_first
                            order by b )
--select * from table1 order by b
,c      as  (   select      * 
                            from    db_first.table_two)
--select * from c 
,d      as  (   select      *
                            from    c
                            where   upper(e) = 'Y')
--select * from d 
,f            as  (   select      table1.b
                                    ,cast(regexp_extract(g,'(\\d+)-(A|B)-(\\d+)(.*)',1) as Int) aid1
                                    ,regexp_extract(g,'(\\d+)-(A|B)-(\\d+)(.*)',2) aid2
                                    ,cast(regexp_extract(g,'(\\d+)-(A|B)-(\\d+)(.*)',3) as Int) aid3
                                    ,from_unixtime(cast(substr(lastdbupdatedts,1,10) as int),"yyyy-MM-dd HH:mm:ss") lastupdts
                                    ,d.*
                            from    d
                            left outer join table1
                                on          d.hiba = table1.a)
select * from f order by b,aid1,aid2,aid3 limit 100"""

I get the following error. Please help.

ParseExceptionTraceback (most recent call last)
<ipython-input-27-cedb6fad210d> in <module>()
      3 hive_context = HiveContext(sc)
      4 #bank = hive_context.table("bdalab.test_prodapt_inv")
----> 5 bank = hive_context.table(first)

ParseException: u"\nmismatched input '*' expecting <EOF>(line 1, pos 7)\n\n== SQL ==\nselect *

【Answer 1】:

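To run a SQL query, you need to use the .sql method instead of the .table method. The .table method only accepts a table name, which is why the parser rejects the '*' in "select *".

1. With the .table method, provide a table name: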

>>> hive_context.table("<db_name>.<table_name>").show()

2. With the .sql method, provide the full query, such as your WITH (CTE) expression:

>>> first ="with cte..."
>>> hive_context.sql(first).show()
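
The distinction between the two methods can be sketched as a small dispatcher. This is only an illustration, not part of pyspark: `read_hive` and the `_TABLE_NAME` pattern are hypothetical names, and the pattern assumes that a table name is made of dot-separated identifiers, optionally wrapped in backticks. Anything else is treated as a SQL statement.

```python
import re

# Assumption: a table name is one or more dot-separated identifiers,
# each optionally wrapped in backticks (e.g. db.table_name or `db`.`tbl`).
_TABLE_NAME = re.compile(r"^\s*`?\w+`?(\.`?\w+`?)*\s*$")


def read_hive(ctx, text):
    """Hypothetical helper: pick .table() or .sql() based on the input.

    ctx is any object exposing .table(name) and .sql(query),
    such as a HiveContext.
    """
    if _TABLE_NAME.match(text):
        # A bare name such as "db.table_name" goes to .table()
        return ctx.table(text.strip())
    # Everything else (SELECT ..., WITH ... AS (...), etc.) goes to .sql()
    return ctx.sql(text)
```

With the session from the question, `read_hive(hive_context, "db.table_name").show()` would go through .table, while `read_hive(hive_context, first).show()` would go through .sql because the CTE text is not a plain table name.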

【Comments】:

@PA, you can read Impala tables via JDBC (***.com/questions/50990540/…). Alternatively, I believe all Impala tables are accessible from Hive; if that is the case for you, you can read the table with hive_context.table().
