Google Cloud Platform Vertex AI logs not showing in custom job

Posted: 2021-12-24 01:12:36

Question: I wrote a Python package that trains a neural network. I then packaged it using the following command.
python3 setup.py sdist --formats=gztar
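For reference, the sdist build assumes a setup.py at the project root. A minimal sketch is shown below; the distribution name, version, and use of find_packages are illustrative assumptions (guided by the ENTRY_POINT projects.yaw_correction.yaw_correction used further down), not the actual file from the project.

from setuptools import setup, find_packages

# Minimal illustrative setup.py (assumed layout: the training code lives in the
# projects/yaw_correction package; name and version are placeholders).
setup(
    name="yaw_correction_trainer",
    version="0.1",
    packages=find_packages(),   # should pick up projects/ and its sub-packages
    install_requires=[],        # runtime dependencies of the training code go here
)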
When I run this job through the GCP console and click through all the options manually, I get the logs from my program as expected (see the example below).

Example of the logs when they show up correctly:

However, when I run exactly the same job programmatically, no logs appear at all. Only the final error shows up (if one occurs):

Example when the logs are missing:

In both cases the program is running; I just cannot see any of its output. What could be causing this? For reference, here is the code I use to launch the training process programmatically:
import datetime
import shutil
import subprocess
from pathlib import Path

from google.cloud import aiplatform

ENTRY_POINT = "projects.yaw_correction.yaw_correction"
TIMESTAMP = datetime.datetime.strftime(datetime.datetime.now(), "%y%m%d_%H%M%S")
PROJECT = "yaw_correction"
GCP_PROJECT = "our_gcp_project_name"
LOCATION = "europe-west1"
BUCKET_NAME = "our_bucket_name"
DISPLAY_NAME = "Training_Job_" + TIMESTAMP
CONTAINER_URI = "europe-docker.pkg.dev/vertex-ai/training/pytorch-xla.1-9:latest"
MODEL_NAME = "Model_" + TIMESTAMP
ARGS = ["/gcs/fotokite-training-data/yaw_correction/", "--cloud", "--gpu"]
TENSORBOARD = "projects/our_gcp_project_name/locations/europe-west4/tensorboards/yaw_correction"
MACHINE_TYPE = "n1-standard-4"
REPLICA_COUNT = 1
ACCELERATOR_TYPE = "ACCELERATOR_TYPE_UNSPECIFIED"
ACCELERATOR_COUNT = 0
SYNC = False
# Delete existing source distributions
def deleteDist():
    dirpath = Path('dist')
    if dirpath.exists() and dirpath.is_dir():
        shutil.rmtree(dirpath)

# Build the distribution and copy it to the cloud bucket storage
deleteDist()
subprocess.run("python3 setup.py sdist --formats=gztar", shell=True)
filename = [x for x in Path('dist').glob('*')]
if len(filename) != 1:
    raise Exception("More than one distribution was found")
print(str(filename[0]))
PACKAGE_URI = f"gs://{BUCKET_NAME}/distributions/"
subprocess.run(f"gsutil cp {str(filename[0])} {PACKAGE_URI}", shell=True)
PACKAGE_URI += str(filename[0].name)
deleteDist()
# Initialise the Vertex AI SDK
aiplatform.init(project=GCP_PROJECT, location=LOCATION, staging_bucket=BUCKET_NAME)

# Define the training job
job = aiplatform.CustomPythonPackageTrainingJob(
    display_name=DISPLAY_NAME,
    #script_path="trainer/test.py",
    python_package_gcs_uri=PACKAGE_URI,
    python_module_name=ENTRY_POINT,
    #requirements=['tensorflow_datasets~=4.2.0', 'SQLAlchemy~=1.4.26', 'google-cloud-secret-manager~=2.7.2', 'cloud-sql-python-connector==0.4.2', 'Pymysql==1.0.2'],
    container_uri=CONTAINER_URI,
)
# Submit the training job
model = job.run(
    dataset=None,
    #base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/Train_{TIMESTAMP}",
    base_output_dir=f"gs://{BUCKET_NAME}/{PROJECT}/",
    service_account="vertex-ai-fotokite-service-acc@fotokite-cv-gcp-exploration.iam.gserviceaccount.com",
    environment_variables=None,
    args=ARGS,
    replica_count=REPLICA_COUNT,
    machine_type=MACHINE_TYPE,
    accelerator_type=ACCELERATOR_TYPE,
    accelerator_count=ACCELERATOR_COUNT,
    #tensorboard=TENSORBOARD,
    sync=SYNC
)
print(model)
print("JOB SUBMITTED")
Answer 1:

Generally, the error "The replica workerpool0-0 exited with a non-zero status of 1" appears because something went wrong either while packaging the Python files or in the code itself.
You can consider the following options:

You can check that all the files (the training files and their dependencies) are included in the package, for example: setup.py, demo/PKG, demo/SOURCES.txt, demo/dependency_links.txt, demo/requires.txt, demo/level.txt, trainer/__init__.py, trainer/metadata.py, trainer/model.py, trainer/task.py, trainer/utils.py (a sketch for listing the archive contents follows this list).

You can consult the official troubleshooting guide from Google Cloud for this kind of error and for how to get more information about it.

You can read the official documentation about packaging.
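As a quick way to perform the package check from the first option above, one can list the contents of the generated source distribution before uploading it. A minimal sketch, assuming a single .tar.gz was just built into dist/ as in the question:

import tarfile
from pathlib import Path

# Sketch: print every file inside the sdist so you can confirm the training module
# and its dependencies were actually packaged (assumes a single .tar.gz in dist/).
sdist = next(Path("dist").glob("*.tar.gz"))
with tarfile.open(sdist, "r:gz") as archive:
    for member in archive.getnames():
        print(member)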