Snowflake python connector not working on larger data set in AWS Lambda
Posted: 2018-07-18 13:32:16

Question:
I'm using the Snowflake python connector to try to retrieve a set of data from our data warehouse for processing. The job runs in an AWS Lambda function, and it starts having trouble once the returned row count reaches roughly 20. With limit 10 or limit 20 on the query I can get the data set back; with the limit removed, it struggles to return a result set of only 65 rows.
Memory and timeout on my Lambda are already maxed out, and the same data set exported to CSV is only about 300 KB. Running the query locally returns fine, so it could be memory related, but the data being returned really isn't that large.
import os

import snowflake.connector
from snowflake.connector import DictCursor

connector = snowflake.connector.connect(
    account=os.environ['SNOWFLAKE_ACCOUNT'],
    user=os.environ['SNOWFLAKE_USER'],
    password=os.environ['SNOWFLAKE_PASSWORD'],
    role="MY_ROLE",
    ocsp_response_cache_filename="/tmp/.cache/snowflake/"
                                 "ocsp_response_cache",
)
print("Connected to snowflake")

cursor = connector.cursor(DictCursor)
cursor.execute('USE DATA.INFORMATION_SCHEMA')

query = "SELECT * FROM TABLE WHERE X=Y"  # FAKE QUERY
print("Execute query: \n\t{0}".format(query))
cursor.execute(query)
print("Execute query done!")

posts = []
processed = 0
for rec in cursor:
    processed += 1
    print("Processed count: {0}".format(processed))
    posts.append(rec)
# These attempts also didn't work.
# posts = cursor.fetchmany(size=cursor.rowcount)
# posts = cursor.fetchall()
cursor.close()
The processed counter gets to at most 17 records and then stops. My logs print a lot of messages about chunks not yet being ready to consume, and eventually the Lambda times out:
[1531919679073] [DEBUG] 2018-07-18T13:14:39.72Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 Chunk Downloader in memory
[1531919679073] Execute query done!
[1531919679073] [DEBUG] 2018-07-18T13:14:39.73Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 chunk index: 0, chunk_count: 2
[1531919679073] [DEBUG] 2018-07-18T13:14:39.73Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 next_chunk_to_consume=1, next_chunk_to_download=3, total_chunks=2
[1531919679073] [DEBUG] 2018-07-18T13:14:39.73Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 waiting for chunk 1/2 in 1/10 download attempt
[1531919679073] [DEBUG] 2018-07-18T13:14:39.73Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 chunk 1/2 is NOT ready to consume in 10/3600(s)
[1531919679073] [DEBUG] 2018-07-18T13:14:39.73Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 downloading chunk 1/2
[1531919679074] [DEBUG] 2018-07-18T13:14:39.73Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 use chunk headers from result
[1531919679074] [DEBUG] 2018-07-18T13:14:39.74Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 started getting the result set 1: https://sfc-va-ds1-customer-stage.s3.amazo
naws.com/fwoi-s-vass0007/results/7b9cf772-a061-47ab-8e9f-43dbfcd923c9_0/main/data_0_0_0?x-amz-server-side-encryption-customer-algorithm=AES256&response-content-e
ncoding=gzip&AWSAccessKeyId=AKIAJKHCJ73YL7MD6ZRA&Expires=1531941279&Signature=VvGOkLNvE%2FHVMaUXoeQMn6cFUOY%3D
[1531919679074] [DEBUG] 2018-07-18T13:14:39.74Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 Active requests sessions: 1, idle: 0
[1531919679074] [DEBUG] 2018-07-18T13:14:39.74Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 remaining request timeout: 3600, retry cnt: 1
[1531919679074] [DEBUG] 2018-07-18T13:14:39.74Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 socket timeout: 60
[1531919679075] [INFO] 2018-07-18T13:14:39.75Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 Starting new HTTPS connection (1): sfc-va-ds1-customer-stage.s3.amazonaws.com
[1531919679078] [DEBUG] 2018-07-18T13:14:39.75Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 downloading chunk 2/2
[1531919679078] [DEBUG] 2018-07-18T13:14:39.76Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 use chunk headers from result
[1531919679078] [DEBUG] 2018-07-18T13:14:39.76Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 started getting the result set 2: https://sfc-va-ds1-customer-stage.s3.amazo
naws.com/fwoi-s-vass0007/results/7b9cf772-a061-47ab-8e9f-43dbfcd923c9_0/main/data_0_0_1?x-amz-server-side-encryption-customer-algorithm=AES256&response-content-e
ncoding=gzip&AWSAccessKeyId=AKIAJKHCJ73YL7MD6ZRA&Expires=1531941279&Signature=F5ix8FcsLO1dM8sWsZXZYx4uHM8%3D
[1531919679078] [DEBUG] 2018-07-18T13:14:39.76Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 Converted retries value: 1 -> Retry(total=1, connect=None, read=None, redire
ct=None)
[1531919679078] [DEBUG] 2018-07-18T13:14:39.76Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 Converted retries value: 1 -> Retry(total=1, connect=None, read=None, redire
ct=None)
[1531919679078] [DEBUG] 2018-07-18T13:14:39.76Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 Active requests sessions: 2, idle: 0
[1531919679078] [DEBUG] 2018-07-18T13:14:39.76Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 remaining request timeout: 3600, retry cnt: 1
[1531919679078] [DEBUG] 2018-07-18T13:14:39.76Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 socket timeout: 60
[1531919679078] [INFO] 2018-07-18T13:14:39.77Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 Starting new HTTPS connection (1): sfc-va-ds1-customer-stage.s3.amazonaws.com
[1531919681581] [DEBUG] 2018-07-18T13:14:41.580Z 26284dc8-8a8c-11e8-95ac-3ff42bd28642 chunk 1/2 is NOT ready to consume in 160/3600(s)
[1531919689074] [DEBUG] 2018-07-18T13:14:49.73Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 chunk 1/2 is NOT ready to consume in 20/3600(s)
[1531919691581] [DEBUG] 2018-07-18T13:14:51.581Z 26284dc8-8a8c-11e8-95ac-3ff42bd28642 chunk 1/2 is NOT ready to consume in 170/3600(s)
[1531919699074] [DEBUG] 2018-07-18T13:14:59.74Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 chunk 1/2 is NOT ready to consume in 30/3600(s)
[1531919701581] [DEBUG] 2018-07-18T13:15:01.581Z 26284dc8-8a8c-11e8-95ac-3ff42bd28642 chunk 1/2 is NOT ready to consume in 180/3600(s)
[1531919709074] [DEBUG] 2018-07-18T13:15:09.74Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 chunk 1/2 is NOT ready to consume in 40/3600(s)
[1531919711582] [DEBUG] 2018-07-18T13:15:11.581Z 26284dc8-8a8c-11e8-95ac-3ff42bd28642 chunk 1/2 is NOT ready to consume in 190/3600(s)
[1531919712739] [DEBUG] 2018-07-18T13:15:12.738Z 26284dc8-8a8c-11e8-95ac-3ff42bd28642 Incremented Retry for (url='/fwoi-s-vass0007/results/7b9cf772-a061-47ab-8e9
f-43dbfcd923c9_0/main/data_0_0_0?x-amz-server-side-encryption-customer-algorithm=AES256&response-content-encoding=gzip&AWSAccessKeyId=AKIAJKHCJ73YL7MD6ZRA&Expire
s=1531941131&Signature=mW6nXerwYHhnfwfPdRF0So1tpIQ%3D'): Retry(total=0, connect=None, read=None, redirect=None)
[1531919719075] [DEBUG] 2018-07-18T13:15:19.75Z 7e3420c6-8a8c-11e8-a97e-c53a2c591430 chunk 1/2 is NOT ready to consume in 50/3600(s)
Comments:
Does a limit of 100 work? — It does not. It fails with a limit of 40.

Answer 1:
I ran into a similar issue, but with the Snowflake JDBC connector.
Select * from table: fetches the first chunk of data (600 records), then hits a "connection timeout" while fetching the next chunk.
If I instead run Select * from table limit 1200, it works fine without any timeout.
So, break the whole thing into two steps (a minimal sketch in Python follows below):
1. rowcount = select count(*) from table
2. Select * from table limit rowcount
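A rough sketch of that two-step workaround, reusing the connector object from the question's code; the table name and WHERE clause are placeholders, as in the original post:

    # Step 1: ask for the total row count first (hypothetical table/filter names).
    count_cursor = connector.cursor()
    count_cursor.execute("SELECT COUNT(*) FROM TABLE WHERE X=Y")
    rowcount = count_cursor.fetchone()[0]
    count_cursor.close()

    # Step 2: select with an explicit LIMIT matching that count.
    cursor = connector.cursor(DictCursor)
    cursor.execute("SELECT * FROM TABLE WHERE X=Y LIMIT {0}".format(rowcount))
    posts = cursor.fetchall()
    cursor.close()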
Answer 2:
From the logs it looks like the python connector keeps retrying the download of the results from s3. That is expected behaviour if your query produces a lot of data. I would suggest making sure your lambda environment can reach the s3 bucket. A simple curl command should verify it.
curl -v https://sfc-va-ds1-customer-stage.s3.amazonaws.com
If you get back some http code (e.g. 403), it means the connection was established. Otherwise, if it hangs, something is not configured correctly in your environment.
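If running curl from inside the Lambda isn't convenient, a similar connectivity check can be done in Python using only the standard library. The stage bucket hostname below is taken from the logs above and may differ for your account:

    import urllib.request
    import urllib.error

    # Hypothetical connectivity check: any HTTP status (even 403) proves the
    # Lambda can reach the Snowflake result stage bucket; a timeout means it can't.
    url = "https://sfc-va-ds1-customer-stage.s3.amazonaws.com"
    try:
        urllib.request.urlopen(url, timeout=10)
        print("Reached S3 stage bucket")
    except urllib.error.HTTPError as err:
        print("Reached S3 stage bucket, got HTTP {0}".format(err.code))
    except Exception as err:
        print("Could not reach S3 stage bucket: {0}".format(err))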