Google Cloud Platform Vertex AI logs not showing in custom job

【Posted】2021-12-24 01:12:36 【Question】:

I wrote a Python package that trains a neural network, then packaged it with the following command.

python3 setup.py sdist --formats=gztar

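When I run this job through the GCP console, clicking through all the options by hand, I get the logs from my program as expected (see the example below).

Example of successful logs:

However, when I launch exactly the same job programmatically, no logs appear. Only the final error is shown (if one occurs):

Example of missing logs:

In both cases the program is running; I just cannot see any output. What could be the cause? For reference, here is the code I use to launch the training process programmatically: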

import datetime
import shutil
import subprocess
from pathlib import Path

from google.cloud import aiplatform

ENTRY_POINT = "projects.yaw_correction.yaw_correction"
TIMESTAMP = datetime.datetime.now().strftime("%y%m%d_%H%M%S")
PROJECT = "yaw_correction"
GCP_PROJECT = "our_gcp_project_name"
LOCATION = "europe-west1"
BUCKET_NAME = "our_bucket_name"
DISPLAY_NAME = "Training_Job_" + TIMESTAMP
CONTAINER_URI = "europe-docker.pkg.dev/vertex-ai/training/pytorch-xla.1-9:latest"
MODEL_NAME = "Model_" + TIMESTAMP
ARGS = ["/gcs/fotokite-training-data/yaw_correction/", "--cloud", "--gpu"]
TENSORBOARD = f"projects/{GCP_PROJECT}/locations/europe-west4/tensorboards/yaw_correction"

MACHINE_TYPE = "n1-standard-4"
REPLICA_COUNT = 1
ACCELERATOR_TYPE = "ACCELERATOR_TYPE_UNSPECIFIED"
ACCELERATOR_COUNT = 0
SYNC = False

# Delete existing source distributions
def deleteDist():
    dirpath = Path('dist')
    if dirpath.exists() and dirpath.is_dir():
        shutil.rmtree(dirpath)

# Build the source distribution and copy it to the Cloud Storage bucket
deleteDist()
subprocess.run("python3 setup.py sdist --formats=gztar", shell=True)
filename = [x for x in Path('dist').glob('*')]
if len(filename) != 1:
    raise Exception(f"Expected exactly one distribution in dist/, found {len(filename)}")
print(str(filename[0]))
PACKAGE_URI = f"gs://{BUCKET_NAME}/distributions/"
subprocess.run(f"gsutil cp {filename[0]} {PACKAGE_URI}", shell=True)
PACKAGE_URI += str(filename[0].name)
deleteDist()

# Initialise the compute instance
aiplatform.init(project=GCP_PROJECT, location=LOCATION, staging_bucket=BUCKET_NAME)

# Schedule the job
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name=DISPLAY_NAME,
    #script_path="trainer/test.py",
    python_package_gcs_uri=PACKAGE_URI,
    python_module_name=ENTRY_POINT,
    #requirements=['tensorflow_datasets~=4.2.0', 'SQLAlchemy~=1.4.26', 'google-cloud-secret-manager~=2.7.2', 'cloud-sql-python-connector==0.4.2', 'Pymysql==1.0.2'],
    container_uri=CONTAINER_URI,
)

model = job.run(
    dataset=None,
    #base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/Train_{TIMESTAMP}",
    base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/",
    service_account="vertex-ai-fotokite-service-acc@fotokite-cv-gcp-exploration.iam.gserviceaccount.com",
    environment_variables=None,
    args=ARGS,
    replica_count=REPLICA_COUNT,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,  # was ACCELERATOR_TYPE, which passed a string as the count
    #tensorboard=TENSORBOARD,
    sync=SYNC
)
print(model)
print("JOB SUBMITTED")

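【Question comments】:

【Answer 1】:

In general, the error "The replica workerpool0-0 exited with a non-zero status of 1" means that something went wrong either while packaging the Python files or in the code itself.

You can try the following options.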

You can check that every file is in the package (the training files and their dependencies), for example:

setup.py
demo/PKG-INFO
demo/SOURCES.txt
demo/dependency_links.txt
demo/requires.txt
demo/top_level.txt
trainer/__init__.py
trainer/metadata.py
trainer/model.py
trainer/task.py
trainer/utils.py
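
The check above can be scripted rather than done by eye. The following is a minimal sketch, assuming the package was built with `python3 setup.py sdist --formats=gztar` as in the question; `sdist_files` and `missing_files` are hypothetical helper names, and the archive name in the usage comment is only an illustration.

```python
import tarfile
from pathlib import PurePosixPath


def sdist_files(archive_path):
    """Return the files inside a .tar.gz sdist, relative to its top-level folder."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = [member.name for member in archive.getmembers() if member.isfile()]
    relative = []
    for name in names:
        parts = PurePosixPath(name).parts
        # An sdist wraps everything in a single "<name>-<version>/" directory; drop it.
        if len(parts) > 1:
            relative.append("/".join(parts[1:]))
    return sorted(relative)


def missing_files(archive_path, required):
    """Return the entries of `required` that are not present in the sdist."""
    present = set(sdist_files(archive_path))
    return [path for path in required if path not in present]


# Hypothetical usage, run before uploading the archive to the bucket:
# print(missing_files("dist/demo-0.1.tar.gz", ["setup.py", "trainer/task.py"]))
```

An empty list means every expected file made it into the archive; anything else points at a packaging problem (for example a missing `__init__.py`) before the job is ever submitted.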

See the official troubleshooting guide from Google Cloud for this kind of error and for how to get more information about it.
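
To get more information about such an error, the job's log entries can also be queried directly in Cloud Logging. The sketch below only builds a filter string. It assumes that Vertex AI custom training jobs log under the `ml_job` resource type with a `job_id` label, which is worth confirming in the Logs Explorer; `custom_job_log_filter` is a hypothetical helper name, not part of any SDK.

```python
def custom_job_log_filter(job_id, min_severity=None):
    """Build a Cloud Logging filter for a single custom training job.

    Assumption: custom jobs log under resource.type="ml_job" and carry a
    resource.labels.job_id label. Check this against your own log entries.
    """
    clauses = [
        'resource.type="ml_job"',
        f'resource.labels.job_id="{job_id}"',
    ]
    if min_severity:
        # For example "ERROR" to hide INFO and DEBUG entries.
        clauses.append(f"severity>={min_severity}")
    return " AND ".join(clauses)
```

The resulting string can be pasted into the Logs Explorer query box or passed to `gcloud logging read`. If the programmatic job shows entries there that the job page does not, the program is logging and the issue lies in how the output is displayed.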

See also the official documentation about packaging.
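
One quick way to test the packaging locally is to install the built archive into a clean virtual environment and check that the entry-point module resolves. This is a minimal sketch using only the standard library; `module_is_importable` is a hypothetical helper, and the assumption is that the job starts the module named in `python_module_name` much as `python -m` would.

```python
import importlib.util


def module_is_importable(dotted_name):
    """Return True if `dotted_name` can be located on the current Python path."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of the dotted name does not exist.
        return False


# Hypothetical usage, with the question's entry point, inside the clean environment:
# print(module_is_importable("projects.yaw_correction.yaw_correction"))
```

If this returns `False` after installing the sdist, the package layout (missing `__init__.py` files or packages not listed in `setup.py`) is the likely cause of the non-zero exit status.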

【Comments】:
