Start CloudSQL Proxy on Python Dataflow / Apache Beam
【Posted】: 2018-11-15 00:54:21
【Question】:
I am currently working on an ETL Dataflow job (using the Apache Beam Python SDK) that queries data from CloudSQL (using psycopg2 and a custom ParDo) and writes it to BigQuery. My goal is to create a Dataflow template that I can start from AppEngine using a Cron job.
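I have a version that works locally with the DirectRunner. For that I use the CloudSQL (Postgres) proxy client, so that I can connect to the database on 127.0.0.1.
When I use the DataflowRunner with a custom command in the setup.py script to start the proxy, the job does not execute. It gets stuck repeating this log message: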
Setting node annotation to enable volume controller attach/detach
Part of my setup.py looks like this:
CUSTOM_COMMANDS = [
    ['echo', 'Custom command worked!'],
    ['wget', 'https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64', '-O', 'cloud_sql_proxy'],
    ['echo', 'Proxy downloaded'],
    ['chmod', '+x', 'cloud_sql_proxy']]


class CustomCommands(setuptools.Command):
    """A setuptools Command class able to run arbitrary commands."""

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def RunCustomCommand(self, command_list):
        print('Running command: %s' % command_list)
        logging.info("Running custom commands")
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        # Can use communicate(input='y\n'.encode()) if the command run requires
        # some confirmation.
        stdout_data, _ = p.communicate()
        print('Command output: %s' % stdout_data)
        if p.returncode != 0:
            raise RuntimeError(
                'Command %s failed: exit code: %s' % (command_list, p.returncode))

    def run(self):
        for command in CUSTOM_COMMANDS:
            self.RunCustomCommand(command)
        subprocess.Popen(['./cloud_sql_proxy', '-instances=bi-test-1:europe-west1:test-animal=tcp:5432'])
After reading this issue on GitHub from sthomp and this discussion on Stack Overflow, I added the last line in run() as a separate subprocess.Popen(). I also tried varying some of the parameters of subprocess.Popen.
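When the proxy is started in the background like this, one thing that may be worth checking is whether it is actually listening before anything tries to connect. The following is only a minimal sketch: wait_for_port is a hypothetical helper (not part of Beam or the proxy), and the address 127.0.0.1:5432 is taken from the command above. It does not by itself explain why the job gets stuck.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet (or the connect timed out); try again.
            time.sleep(interval)
    return False
```

It could be called right after the subprocess.Popen() line, for example as wait_for_port('127.0.0.1', 5432), and the result logged.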
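Another solution, mentioned by brodin, is to allow access from every IP address and connect with username and password. As I understand it, he does not consider this best practice.
Thanks in advance for your help.
!!! Workaround at the bottom of this post !!!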
Update - log files
These are the error-level logs that occurred during the job:
E EXT4-fs (dm-0): couldn't mount as ext3 due to feature incompatibilities
E Image garbage collection failed once. Stats initialization may not have completed yet: unable to find data for container /
E Failed to check if disk space is available for the runtime: failed to get fs info for "runtime": unable to find data for container /
E Failed to check if disk space is available on the root partition: failed to get fs info for "root": unable to find data for container /
E [ContainerManager]: Fail to get rootfs information unable to find data for container /
E Could not find capacity information for resource storage.kubernetes.io/scratch
E debconf: delaying package configuration, since apt-utils is not installed
E % Total % Received % Xferd Average Speed Time Time Time Current
E Dload Upload Total Spent Left Speed
E
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 3698 100 3698 0 0 25674 0 --:--:-- --:--:-- --:--:-- 25860
#-- HERE IS WHEN setup.py FOR MY JOB IS EXECUTED ---
E debconf: delaying package configuration, since apt-utils is not installed
E insserv: warning: current start runlevel(s) (empty) of script `stackdriver-extractor' overrides LSB defaults (2 3 4 5).
E insserv: warning: current stop runlevel(s) (0 1 2 3 4 5 6) of script `stackdriver-extractor' overrides LSB defaults (0 1 6).
E option = Interval; value = 60.000000;
E option = FQDNLookup; value = false;
E Created new plugin context.
E option = PIDFile; value = /var/run/stackdriver-agent.pid;
E option = Interval; value = 60.000000;
E option = FQDNLookup; value = false;
E Created new plugin context.
Here you can find all logs from after my custom setup.py was started (log level: any; all logs):
https://jpst.it/1gk2Z
Update - log files 2
Job logs (I cancelled the job manually after it had been stuck for a while):
2018-06-08 (08:02:20) Autoscaling is enabled for job 2018-06-07_23_02_20-5917188751755240698. The number of workers will b...
2018-06-08 (08:02:20) Autoscaling was automatically enabled for job 2018-06-07_23_02_20-5917188751755240698.
2018-06-08 (08:02:24) Checking required Cloud APIs are enabled.
2018-06-08 (08:02:24) Checking permissions granted to controller Service Account.
2018-06-08 (08:02:25) Worker configuration: n1-standard-1 in europe-west1-b.
2018-06-08 (08:02:25) Expanding CoGroupByKey operations into optimizable parts.
2018-06-08 (08:02:25) Combiner lifting skipped for step Save new watermarks/Write/WriteImpl/GroupByKey: GroupByKey not fol...
2018-06-08 (08:02:25) Combiner lifting skipped for step Group watermarks: GroupByKey not followed by a combiner.
2018-06-08 (08:02:25) Expanding GroupByKey operations into optimizable parts.
2018-06-08 (08:02:26) Lifting ValueCombiningMappingFns into MergeBucketsMappingFns
2018-06-08 (08:02:26) Annotating graph with Autotuner information.
2018-06-08 (08:02:26) Fusing adjacent ParDo, Read, Write, and Flatten operations
2018-06-08 (08:02:26) Fusing consumer Get rows from CloudSQL tables into Begin pipeline with watermarks/Read
2018-06-08 (08:02:26) Fusing consumer Group watermarks/Write into Group watermarks/Reify
2018-06-08 (08:02:26) Fusing consumer Group watermarks/GroupByWindow into Group watermarks/Read
2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/WriteBundles/WriteBundles into Save new watermar...
2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/GroupByKey/GroupByWindow into Save new watermark...
2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/GroupByKey/Reify into Save new watermarks/Write/...
2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/GroupByKey/Write into Save new watermarks/Write/...
2018-06-08 (08:02:26) Fusing consumer Write to BQ into Get rows from CloudSQL tables
2018-06-08 (08:02:26) Fusing consumer Group watermarks/Reify into Write to BQ
2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/Map(<lambda at iobase.py:926>) into Convert dict...
2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/WindowInto(WindowIntoFn) into Save new watermark...
2018-06-08 (08:02:26) Fusing consumer Convert dictionary list to single dictionary and json into Remove "watermark" label
2018-06-08 (08:02:26) Fusing consumer Remove "watermark" label into Group watermarks/GroupByWindow
2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/InitializeWrite into Save new watermarks/Write/W...
2018-06-08 (08:02:26) Workflow config is missing a default resource spec.
2018-06-08 (08:02:26) Adding StepResource setup and teardown to workflow graph.
2018-06-08 (08:02:26) Adding workflow start and stop steps.
2018-06-08 (08:02:26) Assigning stage ids.
2018-06-08 (08:02:26) Executing wait step start25
2018-06-08 (08:02:26) Executing operation Save new watermarks/Write/WriteImpl/DoOnce/Read+Save new watermarks/Write/WriteI...
2018-06-08 (08:02:26) Executing operation Save new watermarks/Write/WriteImpl/GroupByKey/Create
2018-06-08 (08:02:26) Starting worker pool setup.
2018-06-08 (08:02:26) Executing operation Group watermarks/Create
2018-06-08 (08:02:26) Starting 1 workers in europe-west1-b...
2018-06-08 (08:02:27) Value "Group watermarks/Session" materialized.
2018-06-08 (08:02:27) Value "Save new watermarks/Write/WriteImpl/GroupByKey/Session" materialized.
2018-06-08 (08:02:27) Executing operation Begin pipeline with watermarks/Read+Get rows from CloudSQL tables+Write to BQ+Gr...
2018-06-08 (08:02:36) Autoscaling: Raised the number of workers to 0 based on the rate of progress in the currently runnin...
2018-06-08 (08:02:46) Autoscaling: Raised the number of workers to 1 based on the rate of progress in the currently runnin...
2018-06-08 (08:03:05) Workers have started successfully.
2018-06-08 (08:11:37) Cancel request is committed for workflow job: 2018-06-07_23_02_20-5917188751755240698.
2018-06-08 (08:11:38) Cleaning up.
2018-06-08 (08:11:38) Starting worker pool teardown.
2018-06-08 (08:11:38) Stopping worker pool...
2018-06-08 (08:12:30) Autoscaling: Reduced the number of workers to 0 based on the rate of progress in the currently runni...
Stack traces:
No errors have been received in this time period.
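Update: the workaround can be found in my answer below.
【Comments】:
Could you provide us with the full logs and tell us what the actual error is? From Setting node annotation to enable volume controller attach/detach alone, we cannot see what is happening or why.
@komarkovich Thanks for your comment! Is there a proper way to provide you with the log files? The worker itself does not show any logs yet (probably because it has not started). The logs for system, kubelet, etc. are too long to post here.
I need you to provide me with the logs of the failing Dataflow job. You can find them in the job logs at https://console.cloud.google.com/dataflow?jobsDetail/locations/<ZONE>/jobs/<JOB_ID>?project=<PROJECT_NAME>. There should be some errors that tell us what is going on. You don't have to post all the logs (just the most relevant ones). If there are too many, you can use the [justPasteIt](justpaste.it) tool to share them here.
Updated the post with the log files (thanks for the justpaste.it tip). I copied the logs from the Logs Viewer. Unfortunately, when I use the link above with my own values, I always end up on the job list.
Thank you, but that is not really what I was looking for. Please post the Dataflow logs. Sorry about that link, this one should be correct: https://console.cloud.google.com/dataflow/jobsDetail/locations/<ZONE>/jobs/<JOB_ID>?project=<PROJECT_NAME>. Look up the logs for the job there and provide the stack trace.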
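【Answer 1】:
Workaround:
I finally found a workaround. The idea is to connect through the public IP of the CloudSQL instance. For that you need to allow connections to your CloudSQL instance from every IP:
- Go to the overview page of your CloudSQL instance in GCP
- Click the Authorization tab
- Click Add network and add 0.0.0.0/0 (!! this allows every IP address to connect to your instance !!)
To make the process more secure, I used SSL keys and only allowed SSL connections to the instance:
- Click the SSL tab
- Click Create a new certificate to create an SSL certificate for your server
- Click Create a client certificate to create an SSL certificate for your client
- Click Allow only SSL connections to reject all non-SSL connection attempts
After that, I stored the certificates in a Google Cloud Storage bucket and load them inside the Dataflow job before connecting, i.e.: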
import os
import select
import stat

import psycopg2
import psycopg2.extensions
from google.cloud import storage


# Function to wait for an open connection when processing in parallel
def wait(conn):
    while True:
        state = conn.poll()
        if state == psycopg2.extensions.POLL_OK:
            break
        elif state == psycopg2.extensions.POLL_WRITE:
            # Block until the socket is writable
            select.select([], [conn.fileno()], [])
        elif state == psycopg2.extensions.POLL_READ:
            # Block until the socket is readable
            select.select([conn.fileno()], [], [])
        else:
            raise psycopg2.OperationalError("poll() returned %s" % state)


# Function which returns a connection that can be used for queries
def connect_to_db(host, hostaddr, dbname, user, password, sslmode='verify-full'):
    # Get keys from GCS (replace the bucket name and PATH_TO with your own values)
    client = storage.Client()
    bucket = client.get_bucket('<YOUR_BUCKET_NAME>')
    bucket.get_blob('PATH_TO/server-ca.pem').download_to_filename('server-ca.pem')
    bucket.get_blob('PATH_TO/client-key.pem').download_to_filename('client-key.pem')
    # Restrict the private key to the owner
    os.chmod("client-key.pem", stat.S_IRWXU)
    bucket.get_blob('PATH_TO/client-cert.pem').download_to_filename('client-cert.pem')

    sslrootcert = 'server-ca.pem'
    sslkey = 'client-key.pem'
    sslcert = 'client-cert.pem'

    con = psycopg2.connect(
        host=host,
        hostaddr=hostaddr,
        dbname=dbname,
        user=user,
        password=password,
        sslmode=sslmode,
        sslrootcert=sslrootcert,
        sslcert=sslcert,
        sslkey=sslkey)
    return con
I then use these functions in a custom ParDo to run the queries.
Minimal example:
import apache_beam as beam
from psycopg2.extras import RealDictCursor


class ReadSQLTableNames(beam.DoFn):
    '''
    ParDo class to get all table names of a given CloudSQL database.
    It will return each table name.
    '''

    def __init__(self, host, hostaddr, dbname, username, password):
        super(ReadSQLTableNames, self).__init__()
        self.host = host
        self.hostaddr = hostaddr
        self.dbname = dbname
        self.username = username
        self.password = password

    def process(self, element):
        # Connect to database
        con = connect_to_db(host=self.host,
                            hostaddr=self.hostaddr,
                            dbname=self.dbname,
                            user=self.username,
                            password=self.password)
        # Wait for free connection (uses the wait() function defined above)
        wait(con)
        # Create cursor to query data; rows are returned as dictionaries
        cur = con.cursor(cursor_factory=RealDictCursor)
        # Get all table names
        cur.execute(
            """
            SELECT
            tablename as table
            FROM pg_tables
            WHERE schemaname = 'public'
            """
        )
        table_names = cur.fetchall()
        cur.close()
        con.close()
        for table_name in table_names:
            yield table_name["table"]
Part of the pipeline could then look like this:
# Current workaround to query all tables:
# Create a dummy initiator PCollection with one element
init = p | 'Begin pipeline with initiator' >> beam.Create(['All tables initializer'])
tables = init | 'Get table names' >> beam.ParDo(ReadSQLTableNames(
    host=known_args.host,
    hostaddr=known_args.hostaddr,
    dbname=known_args.db_name,
    username=known_args.user,
    password=known_args.password))
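The pipeline fragment above reads its connection settings from known_args, which the answer does not show. A minimal sketch of how they might be parsed with argparse follows; parse_db_args is a hypothetical helper, and the flag names are only inferred from the attribute names used above.

```python
import argparse


def parse_db_args(argv=None):
    """Split the command line into database options and remaining pipeline options.

    Hypothetical helper; the flag names mirror the known_args attributes above.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('--host', required=True)
    parser.add_argument('--hostaddr', required=True)
    parser.add_argument('--db_name', required=True)
    parser.add_argument('--user', required=True)
    parser.add_argument('--password', required=True)
    # parse_known_args leaves unrecognised flags (e.g. runner options) untouched.
    return parser.parse_known_args(argv)
```

The second return value holds everything argparse did not recognise, so it can be handed on to the pipeline options, e.g. known_args, pipeline_args = parse_db_args().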
I hope this solution helps others with similar problems.
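A comment on this answer reports that the downloaded certificate files are sometimes empty and that checks and retries are needed. One possible shape for such a check is sketched below; download_with_retry is a hypothetical helper, and the attempt count and delay are arbitrary assumptions.

```python
import os
import time


def download_with_retry(download, path, attempts=3, delay=1.0):
    """Call download(path) until the file exists and is non-empty.

    `download` is any callable that writes to `path`. Hypothetical helper.
    """
    for attempt in range(1, attempts + 1):
        download(path)
        if os.path.exists(path) and os.path.getsize(path) > 0:
            return path
        if attempt < attempts:
            # Wait a little before trying again.
            time.sleep(delay)
    raise IOError('%s is still empty after %d attempts' % (path, attempts))
```

With the code above it could be used as download_with_retry(bucket.get_blob('PATH_TO/server-ca.pem').download_to_filename, 'server-ca.pem').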
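【Comments】:
Does this method ensure that the default GCS encryption is preserved when the certificates are downloaded to the Dataflow job?
@komarkovich So it cannot be done with the setup.py file and the proxy configuration?
@IoT I have not found a solution for the proxy yet. I hope there will be a nice way in the future, because I have recently run into some problems with the job. Sometimes the downloaded files are empty, and I need to add some checks and retries.
Thanks @ThomasSchmidt. I hope Google tries harder, because it is too far behind the other two major cloud companies.
【Answer 2】:
I managed to find a better, or at least simpler, solution. Start the Cloud SQL proxy in the DoFn's setup function so that the connection is available beforehand: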
import os
import apache_beam as beam

class MyDoFn(beam.DoFn):
    def setup(self):
        # self.sql_args is expected to be set elsewhere (e.g. in __init__)
        os.system("wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy")
        os.system("chmod +x cloud_sql_proxy")
        os.system(f"./cloud_sql_proxy -instances={self.sql_args['cloud_sql_connection_name']}=tcp:3306 &")
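As a variation on the last os.system call, the proxy could also be launched with subprocess and an argument list, which avoids shell quoting of the connection name. This is only a sketch: proxy_command and start_proxy are hypothetical helpers, and the -instances=...=tcp:PORT form is simply the one used in this thread.

```python
import subprocess


def proxy_command(connection_name, port, binary='./cloud_sql_proxy'):
    """Build the argument list for the proxy listening on a local TCP port.

    Hypothetical helper; the flag format is copied from the commands in this thread.
    """
    return [binary, '-instances=%s=tcp:%d' % (connection_name, port)]


def start_proxy(connection_name, port):
    """Start the proxy in the background and return the Popen handle."""
    # Popen returns immediately, so the proxy keeps running alongside the worker.
    return subprocess.Popen(proxy_command(connection_name, port))
```

Keeping the returned handle around makes it possible to terminate the proxy later, for example in the DoFn's teardown method.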
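【Comments】:
The job throws the error RuntimeError: mysql.connector.errors.InterfaceError: 2003: Can't connect to MySQL server on 'localhost:3306', "even though it can access the tables."
For Dataflow with private IPs, I guess one might need to put the proxy file in Cloud Storage.
@sernle Cloud NAT would allow the above solution to work with private-IP Dataflow, but if Cloud NAT is not an option, then I agree that a proxy file in Cloud Storage is a reasonable workaround.
This helped me a lot. But on the last line I added: "-dir=/cloudsql". Thanks!
【Answer 3】:
The simple and correct thing to do in 2022 is to use the Cloud SQL connector, which works with Postgres, SQL Server and MySQL running on Cloud SQL.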
https://cloud.google.com/sql/docs/mysql/connect-connectors#python_1
https://pypi.org/project/cloud-sql-python-connector/
There is no need to whitelist IPs or leave your database wide open. You use this format for the host: "project:region:instance"
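A minimal sketch of what a connection through the connector might look like for Postgres is shown below. It assumes the cloud-sql-python-connector package is installed with the pg8000 driver; instance_connection_name and open_connection are hypothetical helpers, and all argument values are supplied by the caller.

```python
def instance_connection_name(project, region, instance):
    """Build the "project:region:instance" string used as the host."""
    return '%s:%s:%s' % (project, region, instance)


def open_connection(project, region, instance, user, password, db):
    """Open a DB-API connection through the Cloud SQL Python Connector.

    Hypothetical helper. Assumes the package is installed together with the
    pg8000 driver for Postgres.
    """
    # Imported here so the module can be loaded where the package is missing.
    from google.cloud.sql.connector import Connector

    connector = Connector()
    conn = connector.connect(
        instance_connection_name(project, region, instance),
        'pg8000',
        user=user,
        password=password,
        db=db,
    )
    # Return the connector too so that the caller can close() it when done.
    return connector, conn
```

In a Beam job, a natural place to call such a helper would be the DoFn's setup method, with the connection and connector closed again in teardown.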
【Comments】: