Error when trying to use CustomPythonPackageTrainingJobRunOp in VertexAI pipeline
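[Posted]: 2021-10-18 01:10:08
[Question]: I am using the Google Cloud Pipeline Components op CustomPythonPackageTrainingJobRunOp in a Vertex AI pipeline. I have previously been able to run this same package successfully as a CustomTrainingJob. I can see several (11) error messages in the logs, but the only one that seems meaningful to me is "ValueError: too many values to unpack (expected 2)", and I cannot work out a fix for it. I can add all the other error messages if needed. I log some messages at the start of the training code, so I know the error occurs before the training code is executed. I am completely stuck on this. A link to an example of someone using CustomPythonPackageTrainingJobRunOp in a pipeline would also be very helpful. Below is the pipeline code I am trying to execute: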
import kfp
from kfp.v2 import compiler
from kfp.v2.google.client import AIPlatformClient
from google_cloud_pipeline_components import aiplatform as gcc_aip

@kfp.dsl.pipeline(name=pipeline_name)
def pipeline(
    project: str = "adsfafs-321118",
    location: str = "us-central1",
    display_name: str = "vertex_pipeline",
    python_package_gcs_uri: str = "gs://vertex/training/training-package-3.0.tar.gz",
    python_module_name: str = "trainer.task",
    container_uri: str = "us-docker.pkg.dev/vertex-ai/training/scikit-learn-cpu.0-23:latest",
    staging_bucket: str = "vertex_bucket",
    base_output_dir: str = "gs://vertex_artifacts/custom_training/"
):
    gcc_aip.CustomPythonPackageTrainingJobRunOp(
        display_name=display_name,
        python_package_gcs_uri=python_package_gcs_uri,
        python_module=python_module_name,
        container_uri=container_uri,
        project=project,
        location=location,
        staging_bucket=staging_bucket,
        base_output_dir=base_output_dir,
        args=["--arg1=val1", "--arg2=val2", ...]
    )

compiler.Compiler().compile(
    pipeline_func=pipeline, package_path=package_path
)

api_client = AIPlatformClient(project_id=project_id, region=region)

response = api_client.create_run_from_job_spec(
    package_path,
    pipeline_root=pipeline_root_path
)
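In the documentation for CustomPythonPackageTrainingJobRunOp, the type of the parameter "python_module" appears to be "google.cloud.aiplatform.training_jobs.CustomPythonPackageTrainingJob" rather than a string, which looks odd. Nevertheless, I tried redefining the pipeline so that python_module in CustomPythonPackageTrainingJobRunOp receives a CustomPythonPackageTrainingJob object instead of a string, as shown below, but I still get the same error: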
# Needed for aiplatform.CustomPythonPackageTrainingJob below
from google.cloud import aiplatform

def pipeline(
    project: str = "...",
    location: str = "...",
    display_name: str = "...",
    python_package_gcs_uri: str = "...",
    python_module_name: str = "...",
    container_uri: str = "...",
    staging_bucket: str = "...",
    base_output_dir: str = "...",
):
    job = aiplatform.CustomPythonPackageTrainingJob(
        display_name=display_name,
        python_package_gcs_uri=python_package_gcs_uri,
        python_module_name=python_module_name,
        container_uri=container_uri,
        staging_bucket=staging_bucket
    )

    gcc_aip.CustomPythonPackageTrainingJobRunOp(
        display_name=display_name,
        python_package_gcs_uri=python_package_gcs_uri,
        python_module=job,
        container_uri=container_uri,
        project=project,
        location=location,
        base_output_dir=base_output_dir,
        args=["--arg1=val1", "--arg2=val2", ...]
    )
Edit:
Added the args that I am passing but had forgotten to include here.
[Answer 1]: It turned out that the way I was passing args to the Python module was incorrect. You need to specify args = ["--arg1", val1, "--arg2", val2, ...] instead of args = ["--arg1=val1", "--arg2=val2", ...]
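To illustrate the difference between the two forms, here is a minimal sketch that rewrites a list of "--flag=value" strings into separate flag and value items. The helper name split_equals_args is hypothetical and not part of any Google library. The argparse parser at the end is only a stand-in for whatever the training module actually uses to read its arguments. The answer does not say why the "=" form fails inside the pipeline component, so this only shows the shape of the list that reportedly works.

```python
import argparse
from typing import List


def split_equals_args(args: List[str]) -> List[str]:
    """Turn ["--k=v", ...] into ["--k", "v", ...]; other items pass through unchanged."""
    result: List[str] = []
    for item in args:
        if item.startswith("--") and "=" in item:
            # Split on the first '=' only, so values may themselves contain '='.
            flag, value = item.split("=", 1)
            result.extend([flag, value])
        else:
            result.append(item)
    return result


# Form from the question, converted to the form from the answer.
converted = split_equals_args(["--arg1=val1", "--arg2=val2"])
print(converted)

# Stand-in for the trainer's own argument parsing (an assumption, not the asker's code).
parser = argparse.ArgumentParser()
parser.add_argument("--arg1")
parser.add_argument("--arg2")
parsed = parser.parse_args(converted)
print(parsed.arg1, parsed.arg2)
```

The converted list could then be passed as the args parameter of CustomPythonPackageTrainingJobRunOp in place of the original one.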