Connecting to AWS Athena through JDBC with PyCharm - fetchSize issue

Posted 2023-03-31

【Title】Connecting to AWS Athena through JDBC with PyCharm - fetchSize issue 【Posted】2017-11-19 12:37:49 【Question】:

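I have connected my PyCharm Pro edition to AWS Athena. The connection succeeds, but whenever I run a query I get:

The requested fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try again. Refer to the Athena documentation for valid fetchSize values.

I downloaded the Athena JDBC driver from the AWS Athena JDBC documentation.

What could be the problem?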

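【Question comments】:

【Answer 1】:

One thing to consider regarding fetch size, JDBC, and AWS Athena: there appears to be a semi-documented but well-known limit of 1000 rows per fetch. I know that the popular PyAthenaJDBC library sets this as its default fetch size. So this may be part of your problem.

I can reproduce the fetch size error when I try to fetch more than 1000 rows at a time.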

from pyathenajdbc import connect

conn = connect(s3_staging_dir='s3://SOMEBUCKET/',
               region_name='us-east-1')
cur = conn.cursor()
cur.execute('SELECT * FROM SOMEDATABASE.big_table LIMIT 5000')
results = cur.fetchall()
print(len(results))
# Note: the cursor class actually has a setter method to
#       keep users from setting illegal fetch sizes, so the
#       private attribute is written directly to bypass it.
cur._arraysize = 1001  # Set array size one greater than the default
cur.execute('SELECT * FROM athena_test.big_table LIMIT 5000')
results = cur.fetchall()  # Raises the fetchSize error below

java.sql.SQLExceptionPyRaisable: java.sql.SQLException: The requested fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try again. Refer to the Athena documentation for valid fetchSize values.
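
To make the limit concrete, here is a minimal sketch of reading a large result set in chunks that never exceed 1000 rows, using the standard DB-API `fetchmany` call. The helper name `iter_rows` and the constant `MAX_FETCH_SIZE` are made up for illustration, the 1000-row figure is the limit described in this answer rather than an official constant, and an in-memory SQLite table stands in for Athena so the example runs without AWS credentials.

```python
import sqlite3

# Per-fetch row limit described in this answer (an assumption, not an official constant).
MAX_FETCH_SIZE = 1000


def iter_rows(cursor, chunk_size=MAX_FETCH_SIZE):
    """Yield rows from a DB-API cursor, requesting at most chunk_size rows per fetch."""
    if not 1 <= chunk_size <= MAX_FETCH_SIZE:
        raise ValueError("chunk_size must be between 1 and %d" % MAX_FETCH_SIZE)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        for row in rows:
            yield row


# Stand-in data source: an in-memory SQLite table with 2500 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER)")
conn.executemany("INSERT INTO big_table VALUES (?)", [(i,) for i in range(2500)])

cur = conn.cursor()
cur.execute("SELECT id FROM big_table ORDER BY id")
rows = list(iter_rows(cur))
print(len(rows))
```

The same helper should work with any DB-API cursor that implements `fetchmany`, which is why the chunk size is validated up front instead of being left to the driver to reject.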

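Possible solutions include:

1. Run your query in the web GUI and then download the result set manually.
2. Develop your query in the editor/IDE of your choice (DataGrip, the Athena web GUI, etc.) and pass the query string to Athena through the Python SDK. You can then wait for the query to finish and fetch the result set.
3. Execute your query and page through the results. If you are calling your SQL from Python (which I infer from the PyCharm tag), you can use a library like PyAthenaJDBC, which takes care of the page size for you (see the example above).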
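
As a rough illustration of paging through the results with the Python SDK, the sketch below uses the boto3 paginator for `get_query_results`. The function name `fetch_all_rows` is hypothetical, and it assumes `client` is a `boto3.client('athena')` object and that the query identified by `execution_id` has already finished. For SELECT queries the first returned row is typically the column header, so callers may want to drop it.

```python
def fetch_all_rows(client, execution_id, page_size=1000):
    """Collect every row of a finished Athena query, one page at a time.

    `client` is assumed to be boto3.client('athena'). The page size is capped
    at 1000, the largest MaxResults value GetQueryResults accepts.
    """
    paginator = client.get_paginator('get_query_results')
    pages = paginator.paginate(
        QueryExecutionId=execution_id,
        PaginationConfig={'PageSize': min(page_size, 1000)},
    )
    rows = []
    for page in pages:
        for row in page['ResultSet']['Rows']:
            # Each cell is a dict; NULL cells carry no 'VarCharValue' key.
            rows.append([cell.get('VarCharValue') for cell in row['Data']])
    return rows
```

Taking the client as a parameter keeps the function easy to exercise with a stub object, and every value comes back as a string (or `None`), so type conversion is left to the caller.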

For many of my Python scripts, I use a workflow similar to the following.

import boto3
import time

sql = 'SELECT * from athena_test.big_table'

database = 'SOMEDATABASE'
bucket_name = 'SOMEBUCKET'
output_path = '/home/zerodf/temp/somedata.csv'

client = boto3.client('athena')
# Athena writes the result set as <QueryExecutionId>.csv under OutputLocation
config = {'OutputLocation': 's3://' + bucket_name + '/',
          'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}}

execution_results = client.start_query_execution(
    QueryString=sql,
    QueryExecutionContext={'Database': database},
    ResultConfiguration=config)

execution_id = str(execution_results['QueryExecutionId'])
remote_file = execution_id + '.csv'

# Poll until the query reaches a terminal state
while True:
    query_execution_results = client.get_query_execution(
        QueryExecutionId=execution_id)
    state = query_execution_results['QueryExecution']['Status']['State']
    if state == 'SUCCEEDED':
        break
    elif state in ('FAILED', 'CANCELLED'):
        # Without this check a failed query would loop forever
        raise RuntimeError('Athena query ended in state ' + state)
    else:
        time.sleep(60)

# Download the CSV that Athena wrote to the staging bucket
s3 = boto3.resource('s3')
s3.Bucket(bucket_name).download_file(remote_file, output_path)

Obviously, production code is more complex.
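
One piece that more careful code usually needs is a bounded wait that also reports failures. The sketch below is only one possible shape for that, not the author's production code: `wait_for_query` and its `poll_seconds` and `timeout_seconds` parameters are invented for this example, and `client` is again assumed to be a `boto3.client('athena')` object.

```python
import time


def wait_for_query(client, execution_id, poll_seconds=5, timeout_seconds=600):
    """Poll get_query_execution until the query succeeds, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_query_execution(QueryExecutionId=execution_id)
        status = response['QueryExecution']['Status']
        state = status['State']
        if state == 'SUCCEEDED':
            return response['QueryExecution']
        if state in ('FAILED', 'CANCELLED'):
            # StateChangeReason is optional, so fall back to a generic message.
            reason = status.get('StateChangeReason', 'no reason given')
            raise RuntimeError('Query %s %s: %s' % (execution_id, state, reason))
        if time.monotonic() >= deadline:
            raise TimeoutError('Query %s still %s after %s seconds'
                               % (execution_id, state, timeout_seconds))
        time.sleep(poll_seconds)
```

A call such as `wait_for_query(client, execution_id)` would then take the place of a hand-written polling loop before the result file is downloaded.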

【Comments】:

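What if the driver's parameters can't be changed?

Based on the tags on your post, I assume you are a Python developer. If you want to run a query and pull down the results, you can use either of the examples provided above. For developing complex queries, I usually add LIMIT 100 to the end of my query in the IDE itself. That way I don't have to worry about fetching large amounts of temporary data or slowing down my IDE.

【Answer 2】: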

I think you should set an appropriate value for this setting in DataGrip.

【Comments】:
