Read/Mount a csv file inside train.py of Azure ML Pipeline

Posted: 2020-05-28 19:57:27

[Question]:

We are collecting data from Event Hub and App Insights and storing it in an Azure blob. Using an Azure ML pipeline, I want to pass my dataset to train.py so it can be processed by two different pieces of logic (one for ML, the other for fraud analysis).

But I am unable to read the csv file from inside train.py for further processing.

Here is my train.py, which runs as a PythonScriptStep in the Azure Machine Learning pipeline:

import argparse
import os
import pandas as pd

print("In train.py")

parser = argparse.ArgumentParser("train")

parser.add_argument("--input_data", type=str, help="input data")
parser.add_argument("--output_train", type=str, help="output_train directory")

args = parser.parse_args()

print("Argument 1: %s" % args.input_data)
df = pd.read_csv(args.input_data)
print(df.head())

print("Argument 2: %s" % args.output_train)

if not (args.output_train is None):
    os.makedirs(args.output_train, exist_ok=True)
    print("%s created" % args.output_train)

Here is the code that builds and submits the pipeline (imports and the processed_data1 / run_config definitions, which the original post omits, are filled in below with minimal assumed versions):

from azureml.core import Workspace, Experiment, Datastore
from azureml.core.compute import AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import RunConfiguration
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()
def_blob_store = Datastore(ws, "basic_data_store")
aml_compute_target = "test-cluster"
try:
    aml_compute = AmlCompute(ws, aml_compute_target)
    print("found existing compute target.")
except ComputeTargetException:
    print("Error")

source_directory = './train'

# assumed: a default run configuration (how run_config was built is not shown in the post)
run_config = RunConfiguration()

blob_input_data = DataReference(
    datastore=def_blob_store,
    data_reference_name="device_data",
    path_on_datastore="_fraud_data/test.csv")

# assumed: the intermediate output referenced below, declared as PipelineData
processed_data1 = PipelineData("processed_data1", datastore=def_blob_store)

trainStep = PythonScriptStep(
    script_name="train.py",
    arguments=["--input_data", blob_input_data, "--output_train", processed_data1],
    inputs=[blob_input_data],
    outputs=[processed_data1],
    compute_target=aml_compute,
    source_directory=source_directory,
    runconfig=run_config
)

pipeline1 = Pipeline(workspace=ws, steps=[trainStep])
pipeline_run1 = Experiment(ws, 'Data_dependency').submit(pipeline1)

In the output trace below you can see that Argument 1 prints the path of the file:

Argument 1: /mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/_fraud_data/test.csv

So the dataset is being passed successfully, but the file cannot be read inside train.py at the pd.read_csv(args.input_data) line. It fails with:

FileNotFoundError: [Errno 2] File b'/mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/_fraud_data/test.csv'

Here is the full trace from 70_driver_log.txt, downloaded from the azureml logs:

Preparing to call script [ train.py ] with arguments: ['--input_data', '/mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/_fraud_data/test.csv', '--output_train', '/mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/processed_data1']
After variable expansion, calling script [ train.py ] with arguments: ['--input_data', '/mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/_fraud_data/test.csv', '--output_train', '/mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/processed_data1']

In train.py
Argument 1: /mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/_fraud_data/test.csv


The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
1 items cleaning up...
Cleanup took 0.001172780990600586 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 136
Traceback (most recent call last):
  File "train.py", line 18, in <module>
    df = pd.read_csv(args.input_data) #str()
  File "/azureml-envs/azureml_eb042e80b9a6abdb5821a78683153a38/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/azureml-envs/azureml_eb042e80b9a6abdb5821a78683153a38/lib/python3.6/site-packages/pandas/io/parsers.py", line 457, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/azureml-envs/azureml_eb042e80b9a6abdb5821a78683153a38/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/azureml-envs/azureml_eb042e80b9a6abdb5821a78683153a38/lib/python3.6/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/azureml-envs/azureml_eb042e80b9a6abdb5821a78683153a38/lib/python3.6/site-packages/pandas/io/parsers.py", line 1917, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 689, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'/mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/_fraud_data/test.csv' does not exist: b'/mnt/batch/tasks/shared/LS_root/jobs/pipeline-shohoz/azureml/d92be2ab-e63f-4883-a14b-a64fa5bb431d/mounts/basic_data_store/_fraud_data/test.csv'

I have tried the relative path

azureml/8d2b7bee-6cc5-4c8c-a685-1300a240de8f/mounts/basic_data_store/_fraud_data/test.csv

and also the URI

wasbs://shohoz-container@shohozds.blob.core.windows.net/azureml/azureml/8d2b7bee-6cc5-4c8c-a685-1300a240de8f/mounts/basic_data_store/_fraud_data/test.csv

but both end with the same FileNotFoundError. I have been banging my head against the wall for the last 3-4 days. Any help will save my brain.
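For reference, a small diagnostic sketch (not from the original post) that can be dropped into train.py just before the failing pd.read_csv call, to check whether the mounted path exists and what the closest existing parent directory actually contains:

import os

path = args.input_data
print("exists:", os.path.exists(path), "is file:", os.path.isfile(path))
# walk up to the nearest directory that does exist and list it
parent = os.path.dirname(path)
while parent and not os.path.exists(parent):
    parent = os.path.dirname(parent)
print("closest existing parent:", parent)
if os.path.isdir(parent):
    print("contents:", os.listdir(parent))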

[Comments]:

Do you remember how you solved this? I have the same problem. Explicitly passing the mount path (in the arguments field) as Ram-msft described does not work.

[Answer 1]:

You can use a PipelineDataset object to include a registered dataset in a PythonScriptStep - see https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.pipelinedataset?view=azure-ml-py for more details and examples.
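A minimal sketch of that approach (assumed code, not taken from the answer itself): register test.csv as a tabular dataset and pass it to the step as a named input, then read it back inside train.py. The dataset name device_data is an assumption, and def_blob_store, processed_data1, run_config, aml_compute and source_directory are the variables already defined in the question.

from azureml.core import Dataset

# pipeline-definition side: register the csv as a tabular dataset
device_dataset = Dataset.Tabular.from_delimited_files(
    path=(def_blob_store, "_fraud_data/test.csv"))
device_dataset = device_dataset.register(ws, name="device_data", create_new_version=True)

trainStep = PythonScriptStep(
    script_name="train.py",
    arguments=["--output_train", processed_data1],
    inputs=[device_dataset.as_named_input("device_data")],
    outputs=[processed_data1],
    compute_target=aml_compute,
    source_directory=source_directory,
    runconfig=run_config
)

Inside train.py the named input can then be read back without dealing with mount paths:

from azureml.core import Run

run = Run.get_context()
df = run.input_datasets["device_data"].to_pandas_dataframe()
print(df.head())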

[Discussion]:

Where should I pass this script_params?
