VertexAI Pipeline: How to use an output from a custom kfp component as input for google_cloud_pipeline_components?
【Posted】: 2021-12-28 13:27:17
【Question】: I am trying to write Python code for a pipeline in Vertex AI using kfp components. In one step I create a system.Dataset object as follows:
from kfp.v2.dsl import component, Output, Dataset

@component(base_image="python:3.9", packages_to_install=["google-cloud-bigquery", "pandas", "pyarrow", "fsspec", "gcsfs"])
def create_dataframe(
    project: str,
    region: str,
    destination_dataset: str,
    destination_table_name: str,
    dataset: Output[Dataset],
):
    from google.cloud import bigquery

    # Read the source table from BigQuery into a pandas DataFrame
    client = bigquery.Client(project=project, location=region)
    dataset_ref = bigquery.DatasetReference(project, destination_dataset)
    table_ref = dataset_ref.table(destination_table_name)
    table = client.get_table(table_ref)
    train = client.list_rows(table).to_dataframe()

    # Reshape the data and write it to the output artifact's GCS URI
    # (fsspec/gcsfs let pandas write directly to a gs:// path)
    train.drop("<list_of_columns>", axis=1, inplace=True)
    train['class'] = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1]
    train.to_csv(dataset.uri)
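As a side note, on Vertex AI Pipelines a kfp artifact also exposes a local path (the gs:// URI remapped to /gcs/... via Cloud Storage FUSE), so the final write could be done without gcsfs. A sketch of that variation (mine, not the original code):

    # Equivalent write via the artifact's local path; on Vertex AI this maps
    # gs://bucket/... to /gcs/bucket/... through Cloud Storage FUSE
    train.to_csv(dataset.path)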
Then I use the dataset as input for AutoMLTabularTrainingJobRunOp:
from google_cloud_pipeline_components import aiplatform as gcc_aip

df = create_dataframe(
    project=project,
    region=region,
    destination_dataset=destination_dataset,
    destination_table_name=destination_table_name,
)

# Training with AutoML
training_op = gcc_aip.AutoMLTabularTrainingJobRunOp(
    project=project,
    display_name="train-automl-task",
    optimization_prediction_type="classification",
    column_transformations=[
        "<nested_dict>",
    ],
    dataset=df.outputs["dataset"],
    target_column="class",
    budget_milli_node_hours=1000,
)
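For completeness, these two steps sit inside a kfp v2 pipeline definition that is compiled into a job spec for Vertex AI Pipelines. A minimal sketch of the wrapping (the pipeline name, root bucket, and package path are placeholders, not values from the original code):

from kfp.v2 import dsl, compiler
from google_cloud_pipeline_components import aiplatform as gcc_aip

@dsl.pipeline(
    name="automl-tabular-pipeline",                # placeholder name
    pipeline_root="gs://my_bucket/pipeline_root",  # placeholder bucket
)
def pipeline(project: str, region: str,
             destination_dataset: str, destination_table_name: str):
    df = create_dataframe(
        project=project,
        region=region,
        destination_dataset=destination_dataset,
        destination_table_name=destination_table_name,
    )
    gcc_aip.AutoMLTabularTrainingJobRunOp(
        project=project,
        display_name="train-automl-task",
        optimization_prediction_type="classification",
        dataset=df.outputs["dataset"],
        target_column="class",
        budget_milli_node_hours=1000,
    )

# Compile to a job spec that can be submitted to Vertex AI Pipelines
compiler.Compiler().compile(pipeline_func=pipeline, package_path="pipeline.json")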
Looking at the logs, I found this error:
"Traceback (most recent call last): "
" File "/opt/python3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main "
" "__main__", mod_spec) "
" File "/opt/python3.7/lib/python3.7/runpy.py", line 85, in _run_code "
" exec(code, run_globals) "
" File "/opt/python3.7/lib/python3.7/site-packages/google_cloud_pipeline_components/remote/aiplatform/remote_runner.py", line 284, in <module> "
" main() "
" File "/opt/python3.7/lib/python3.7/site-packages/google_cloud_pipeline_components/remote/aiplatform/remote_runner.py", line 280, in main "
" print(runner(args.cls_name, args.method_name, executor_input, kwargs)) "
" File "/opt/python3.7/lib/python3.7/site-packages/google_cloud_pipeline_components/remote/aiplatform/remote_runner.py", line 236, in runner "
" prepare_parameters(serialized_args[METHOD_KEY], method, is_init=False) "
" File "/opt/python3.7/lib/python3.7/site-packages/google_cloud_pipeline_components/remote/aiplatform/remote_runner.py", line 205, in prepare_parameters "
" value = cast(value, param_type) "
" File "/opt/python3.7/lib/python3.7/site-packages/google_cloud_pipeline_components/remote/aiplatform/remote_runner.py", line 176, in cast "
" return annotation_type(value) "
" File "/opt/python3.7/lib/python3.7/site-packages/google/cloud/aiplatform/datasets/dataset.py", line 81, in __init__ "
" self._gca_resource = self._get_gca_resource(resource_name=dataset_name) "
" File "/opt/python3.7/lib/python3.7/site-packages/google/cloud/aiplatform/base.py", line 532, in _get_gca_resource "
" location=self.location, "
" File "/opt/python3.7/lib/python3.7/site-packages/google/cloud/aiplatform/utils/__init__.py", line 192, in full_resource_name "
" raise ValueError(f"Please provide a valid resource_noun[:-1] name or ID") "
"ValueError: Please provide a valid dataset name or ID "
So I looked at the source code of google/cloud/aiplatform/utils/__init__.py at line 192 and found that the resource name should be "projects/.../locations/.../datasets/12345" or "projects/.../locations/.../metadataStores/.../contexts/12345".
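To illustrate the check that fails, the cast performed by remote_runner can be reproduced outside the pipeline; a sketch with placeholder project, region, and IDs:

from google.cloud import aiplatform

# Accepted: a managed Vertex AI dataset resource name (or its bare ID)
ds = aiplatform.TabularDataset(
    "projects/my_project/locations/us-central1/datasets/1234567890"
)

# Rejected with the same ValueError: an ML Metadata artifact name,
# which is what the kfp Dataset artifact carries
ds = aiplatform.TabularDataset(
    "projects/my_project/locations/us-central1/metadataStores/default/artifacts/1299"
)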
Opening the executor_output.json file created in my bucket after running create_dataframe, the artifact name seems to have the right format:

{"artifacts": {"dataset": {"artifacts": [{"name": "projects/my_project/locations/my_region/metadataStores/default/artifacts/1299...", "uri": "my_bucket/object_folder", "metadata": {"name": "reshaped-training-dataset"}}]}}}

I also tried setting a human-readable name for the dataset in the metadata, but it didn't work. Any suggestion would be really helpful.
【Answer 1】: You can add the parameter dataset: Input[Dataset], as shown in the example below:
df = create_dataframe(project=project,
                      region=region,
                      destination_dataset=destination_dataset,
                      destination_table_name=destination_table_name,
                      dataset: Input[Dataset],
                      )
You can also check the documentation on pipelines and pipelines with kfp for more details.
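For what it's worth, Input[Dataset] is a type annotation that belongs in a component's signature rather than at a call site. A minimal sketch of a component consuming the artifact (consume_dataset is a hypothetical name, not from the answer):

from kfp.v2.dsl import component, Input, Dataset

@component(base_image="python:3.9")
def consume_dataset(dataset: Input[Dataset]):
    # The upstream artifact's GCS location is exposed on the parameter
    print(dataset.uri)

# Wired up in the pipeline as: consume_dataset(dataset=df.outputs["dataset"])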
【Discussion】:
Do you mean adding it in the component definition? Or in the pipeline?
In the input code, before "# Training with AutoML".
Did that answer your question?
If I use :, I get an invalid syntax error; with = I get that dataset is not an input of the component.
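For anyone landing here: per the traceback above, AutoMLTabularTrainingJobRunOp casts its dataset input to a managed Vertex AI dataset, which a generic kfp Dataset artifact is not. A pattern used in Google's Vertex AI pipeline samples is to first register the data with TabularDatasetCreateOp and pass its output to the training op; a minimal sketch (the GCS path and display name are assumptions):

from google_cloud_pipeline_components import aiplatform as gcc_aip

# Register the CSV written by create_dataframe as a managed Vertex AI dataset
dataset_create_op = gcc_aip.TabularDatasetCreateOp(
    project=project,
    display_name="reshaped-training-dataset",            # assumed display name
    gcs_source="gs://my_bucket/object_folder/data.csv",  # assumed path of the CSV
)

# The managed dataset artifact is what the AutoML training op can cast
training_op = gcc_aip.AutoMLTabularTrainingJobRunOp(
    project=project,
    display_name="train-automl-task",
    optimization_prediction_type="classification",
    dataset=dataset_create_op.outputs["dataset"],
    target_column="class",
    budget_milli_node_hours=1000,
)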