subprocess.run() fails at the second iteration

Posted 2023-03-15


【Posted】: 2021-12-14 10:18:12 【Question】:

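I want to do active (online) learning of a statistical model.

This means I have an initial training dataset (x-y pairs) that is known at compile time.

However, because the learning is active (online), more data arrives at run time from a third-party program (a C++ simulation code).

I am doing this in Python with GPyTorch, and I call the third-party program through Python's subprocess module.

My problem is a programming one rather than a GPyTorch or statistics one, which is why I am asking here.

The workflow is: Python decides at which input parameters to run the C++ code, creates a new folder named after those parameters, enters the folder, runs the C++ code, collects the data that appears in the folder, and updates the statistical model. Then the cycle repeats (say, 100 times).

In a WSL1 terminal I normally run the C++ code with $ mpirun -n 1 smilei namelist.py, from a folder that contains both the smilei executable and a Python file named namelist.py.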
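One way to structure the per-iteration step described above is to skip os.chdir() and instead pass the working directory to subprocess.run() through its cwd parameter, so that the parent process's own state does not change between iterations. The following is only a minimal sketch: run_in_fresh_dir is a hypothetical helper name, and the command in the demo is a stand-in for the real mpirun line.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_fresh_dir(top_dir, dir_name, command):
    """Create top_dir/dir_name and run command inside it without os.chdir()."""
    run_dir = Path(top_dir) / dir_name
    run_dir.mkdir()  # raises FileExistsError if the folder already exists
    # cwd= only affects the child process; the parent stays where it is.
    completed = subprocess.run(command, cwd=run_dir)
    return run_dir, completed.returncode


if __name__ == "__main__":
    # Stand-in for ["mpirun", "-n", "1", "/abs/path/to/smilei", "namelist.py"]:
    # a tiny Python child that writes a file into its working directory.
    stand_in = [sys.executable, "-c", "open('result.txt', 'w').write('done')"]
    with tempfile.TemporaryDirectory() as top:
        for a0 in (1.0, 2.0):
            run_dir, code = run_in_fresh_dir(top, "a0_%.13f" % a0, stand_in)
            print(run_dir.name, code, (run_dir / "result.txt").exists())
```

With this layout the executable should be given as an absolute path, because a relative path such as ../smilei is resolved against the child's working directory.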

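The Python workflow returns exit code 0 (and the required data) in the first iteration of my active-learning loop, but fails with exit code 1 in the second iteration.

I tried both subprocess.run() and os.system() (see the code below, including my earlier attempts left in as comments), passing the same command I normally type in the Bash terminal of Windows Subsystem for Linux 1 to run the third-party C++ program.

I cannot debug why it fails the second time.

I tried printing the subprocess's stdout and stderr. Both come back empty for the second iteration of the active-learning loop: no standard output and no standard error at all.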
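When a child process seems to print nothing, it can help to capture both streams explicitly and save them next to the simulation output for each iteration. This is a sketch only, assuming Python 3.7 or later (for capture_output and text); run_and_log is a hypothetical helper, and the failing command in the demo is a stand-in rather than the real mpirun call.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_log(command, run_dir):
    """Run command in run_dir and save its stdout/stderr to log files there."""
    completed = subprocess.run(
        command,
        cwd=run_dir,
        stdin=subprocess.DEVNULL,  # the child cannot block waiting for input
        capture_output=True,       # same as stdout=PIPE, stderr=PIPE
        text=True,                 # decode bytes to str
    )
    Path(run_dir, "stdout.log").write_text(completed.stdout)
    Path(run_dir, "stderr.log").write_text(completed.stderr)
    return completed


if __name__ == "__main__":
    # Stand-in child: prints to stdout, then exits with status 1 and a message.
    failing = [sys.executable, "-c",
               "import sys; print('starting'); sys.exit('boom')"]
    with tempfile.TemporaryDirectory() as run_dir:
        cp = run_and_log(failing, run_dir)
        print("return code:", cp.returncode)
        print("stdout:", cp.stdout.strip())
        print("stderr:", cp.stderr.strip())
```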

I know the code below may look complicated, but it is not. It simply follows the workflow I described above.

import os
import shutil
import subprocess

import gpytorch
import happi  # Smilei's post-processing module
import numpy as np
import torch

# Defined elsewhere in the asker's script (not shown in the post):
# top_folder_path, general_namelist_name, particular_namelist_name,
# a0_from_IntensityWcm2, ExactGPModel, lhs, range_transform, xnp1search.


def SMILEI(I):
    os.chdir(top_folder_path)
    # create a new folder called a0_942.782348987103 (example value)
    a0 = "%.13f" % a0_from_IntensityWcm2(I)
    dirname = "a0_" + a0
    os.mkdir(dirname)
    # enter the created folder
    os.chdir(top_folder_path + "/" + dirname)
    print("We changed the directory and entered the newly created one!")
    # copy the general namelist into this newly created folder
    shutil.copy(top_folder_path + "/" + general_namelist_name, ".")
    print("We copied the general namelist!")
    # add the a0 value to the general namelist, i.e. insert the line
    # a0 = 942.782348987103 at row 8 (an empty row) of the general namelist
    with open(general_namelist_name, 'r+') as fd:
        contents = fd.readlines()
        contents.insert(8, "a0 = {}\n".format(a0))  # the new line must end in a newline
        fd.seek(0)  # readlines() moved the file position to the end, so rewind
        fd.writelines(contents)  # no need to truncate as we are increasing the file size
    print("We modified the general namelist to contain the line a0 = ..., at line 8")
    # rename the modified namelist
    os.rename(general_namelist_name, particular_namelist_name)
    print("We renamed the general namelist to namelist_Xe_GPtrial_noOAM_a0included.py")
    # run the simulation
    print("We'll be now running the SMILEI command inside the folder: ")
    print(os.getcwd())
    print("The smilei executable's absolute path as dictated by os is: ")
    print(os.path.abspath("../smilei"))
    cp = subprocess.run(
        ["mpirun", "-n", "1", os.path.abspath("../smilei"), particular_namelist_name],
        # earlier trials:
        # stdin=subprocess.DEVNULL, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        # stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        # capture_output=True)
    )
    print("The return code is: ")
    print(cp.returncode)
    # os.system("mpirun -n 1 ../smilei {}".format(particular_namelist_name))
    # subprocess.run("mpirun -n 1 ../smilei {}".format(particular_namelist_name), shell=True)
    # print(cp.stdout)
    # print(cp.stderr)
    # print(cp.returncode)
    # get the results of the simulation
    # os.chdir(top_folder_path + "/" + dirname)
    # print("We changed the directory again and entered again the newly created one!")
    S = happi.Open(".")
    pbb = S.ParticleBinning(0).get()
    results_dict = dict()
    for z in range(len(pbb['data'][-1])):
        results_dict['c_%d' % z] = pbb['data'][-1][z]
    return np.asarray(list(results_dict.values()))


if __name__ == '__main__':
    # Initial Train Dataset:
    x_train = torch.from_numpy(np.array([0.1, 0.3, 0.5, 0.6, 0.8]))
    y_train = torch.from_numpy(np.array([0.1, 0.2, 0.3, 0.4, 0.5]))

    # initialize likelihood and model
    likelihood = gpytorch.likelihoods.GaussianLikelihood()
    model = ExactGPModel(x_train, y_train, likelihood)

    model.train()
    likelihood.train()

    # "Loss" for GPs - the marginal log likelihood
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)  

    training_iters = 10
    for i in range(training_iters):
        optimizer.zero_grad()
        output = model(x_train)
        loss   = - mll(output, y_train)
        loss.backward()
        print('Iter %d/%d' % (i+1, training_iters))
        optimizer.step()

    Xn = x_train
    Yn = y_train
    ######################################################################################
    # The Active-Learning (AL) loop:
    budget_value = 100
    for i in range(budget_value):
        OldValues = lhs(1, samples=100)
        Xref = range_transform(OldValues, 10.0**20, 10.0**25)
        x_nplus1 = xnp1search(model, Xn, Xref) # x_nplus1 is Intensity in W/cm2 at which to run SMILEI next for Active-Learning the GP fit
        y_nplus1 = SMILEI(x_nplus1.detach().numpy())[53] # SMILEI(x_nplus1.detach().numpy()) returns an ndarray of shape (55,)
        Xn = torch.cat(   ( Xn, torch.reshape(x_nplus1, (1,)) )   )
        Yn = torch.cat(   ( Yn, torch.reshape(torch.from_numpy(np.reshape(y_nplus1, (1,))), (1,)) )   )
        model.set_train_data(Xn, Yn, strict=False)
        for j in range(training_iters):
            optimizer.zero_grad()
            output = model(Xn)
            loss = -mll(output, Yn)
            loss.backward()
            print('Iter %d/%d' % (j+1, training_iters) + ' inside AL step number %d/%d' % (i+1, budget_value))
            optimizer.step()

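Why does it fail the second time?

I simply cannot see it, and I cannot debug it: I get no error message or anything else. It just does not run the simulation in the second folder it creates. When the Python script ends, that folder contains only namelist_Xe_GPtrial_noOAM_a0included.py, with the a0 value included (as it should).

Thanks!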


【Answer 1】:

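Two options come to mind. One is to wrap the subprocess call in try: ... except subprocess.CalledProcessError as e: print(e). That will give you the error. The other is to print out the command and run it on the command line to see any errors. It could be that a variable is missing the second time the code executes.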
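Note that subprocess.run() only raises CalledProcessError when check=True is passed; without it, a failure is visible only through returncode. The two suggestions above can be combined in one place. This is a sketch, not the answerer's code: run_checked is a hypothetical helper, shlex.join needs Python 3.8 or later, and the demo commands are stand-ins for the mpirun call.

```python
import shlex
import subprocess
import sys


def run_checked(command, cwd=None):
    """Run command; on failure, print everything needed to reproduce it."""
    try:
        return subprocess.run(command, cwd=cwd, check=True,
                              capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        # check=True is what makes run() raise on a non-zero exit status.
        print("exit status:", e.returncode)
        print("re-run by hand with:", shlex.join(e.cmd))
        print("stdout:", e.stdout)
        print("stderr:", e.stderr)
        return None


if __name__ == "__main__":
    ok = run_checked([sys.executable, "-c", "print('fine')"])
    print(ok.stdout.strip())
    bad = run_checked([sys.executable, "-c", "raise SystemExit(1)"])
    print(bad)
```

Printing the command in shell-quoted form makes it easy to paste into a terminal from the same working directory and compare the behaviour.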

【Comments】:

Thanks. All I get is:

Command '['mpirun', '-n', '1', '/home/velenos14/PICsims/github/SMILEI_correctTunnelBSIrate/Smilei/SIMRESULTS/GPs_trial_Xenon_noOAM/smilei', 'namelist_Xe_GPtrial_noOAM_a0included.py']' returned non-zero exit status 1

with the try/except wrapper. I don't see how this helps... (I was already getting this with print(cp.returncode).) The command that fails when Python executes it, i.e.

mpirun -n 1 /home/velenos14/PICsims/github/SMILEI_correctTunnelBSIrate/Smilei/SIMRESULTS/GPs_trial_Xenon_noOAM/smilei namelist_Xe_GPtrial_noOAM_a0included.py

works perfectly when I run it by hand in the second folder.

Is the same line executed the first time the code runs?

mpirun -n 1 /home/velenos14/PICsims/github/SMILEI_correctTunnelBSIrate/Smilei/SIMRESULTS/GPs_trial_Xenon_noOAM/smilei namelist_Xe_GPtrial_noOAM_a0included.py

Something must be changing during the loop. What is the value of particular_namelist_name inside the loop?

In the second iteration of the loop it no longer runs the command. It just passes through the os.system() or subprocess.run() call without actually "spawning" the command (mpirun -n 1 etc.), and the Python script then chokes when it tries to use the data in the new folder (the data does not exist because the simulation did not run, i.e. mpirun -n 1 smilei was never invoked).

【Answer 2】:

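I solved this myself using the plumbum module.

My code stayed the same; all of it was fine.

However, I replaced the subprocess.run() call (and the many variants of it I had tried) with smi = local.cmd.mpirun followed by smi("-n", "1", "../smilei", particular_namelist_name), and with that I can run it in every iteration of the loop!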
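The thread does not establish why plumbum succeeded where subprocess.run() did not. One way to investigate, following the comment that something must be changing during the loop, is to snapshot what a child process would inherit (working directory and environment variables) before each call and compare the snapshots. This is only a diagnostic sketch: snapshot and diff_snapshots are hypothetical helpers, and the variable set in the demo is made up for illustration.

```python
import os


def snapshot():
    """Record the state a child process would inherit from this process."""
    return {"cwd": os.getcwd(), "env": dict(os.environ)}


def diff_snapshots(before, after):
    """Return a list of human-readable differences between two snapshots."""
    changes = []
    if before["cwd"] != after["cwd"]:
        changes.append("cwd: %s -> %s" % (before["cwd"], after["cwd"]))
    for key in sorted(set(before["env"]) | set(after["env"])):
        old = before["env"].get(key)
        new = after["env"].get(key)
        if old != new:
            changes.append("env %s: %r -> %r" % (key, old, new))
    return changes


if __name__ == "__main__":
    first = snapshot()
    # Hypothetical side effect of the first iteration (for example a library
    # import or a post-processing step that sets an environment variable).
    os.environ["DEMO_SET_BY_FIRST_ITERATION"] = "1"
    second = snapshot()
    for line in diff_snapshots(first, second):
        print(line)
```

If a difference does show up, subprocess.run() accepts an env argument, so a copy of the environment taken before the first iteration can be passed to every later call.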
