OpenMPI: MPI.Init hanging in Java - how to debug?

Posted: 2019-03-19 19:17:55

I have compiled OpenMPI with Java support locally, following https://www.open-mpi.org/faq/?category=java. On my local machine with Oracle Java 8 this works fine, but on a cluster with OpenJDK 8 the same approach makes MPI.Init hang. Do you have any pointers on how to proceed from here? Tracing? Trying other Java versions? I could not find any documentation on which Java versions this interface supports.

package com.acme.hello;
import mpi.*;

public class HelloMpi {
    public static void main(String[] args) throws Exception {
        int me, size;
        System.out.println("attempting MPI init");
        args = MPI.Init(args);
        System.out.println("MPI init done");
    }
}


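For reference, a Java-enabled Open MPI build ships the mpijavac wrapper compiler described in the FAQ linked above; a typical compile-and-run cycle looks roughly like the sketch below (source/class paths are illustrative, and the snippet skips itself when the Java-enabled build is not on PATH):

```shell
# Sketch: compile and launch the example with the Java-enabled Open MPI build.
# mpijavac is the wrapper compiler from the Open MPI Java bindings; the
# source path is an assumption, not taken from the question.
if command -v mpijavac >/dev/null 2>&1 && [ -f com/acme/hello/HelloMpi.java ]; then
    mpijavac com/acme/hello/HelloMpi.java           # compiles against mpi.jar
    mpirun -np 2 java -classpath . com.acme.hello.HelloMpi
else
    echo "mpijavac build not available here"
fi
```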
> java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

> ~/NQSIM/java$ mpirun -version
mpirun (Open MPI) 3.1.2

> ~/NQSIM/java$ mpirun -np 2 java -classpath "./target/test-classes/" com.acme.hello.HelloMpi
attempting MPI init
attempting MPI init
(hangs here forever)
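One way to see where the hang actually sits on the JVM side is a thread dump of the stuck rank; jps and jstack ship with the JDK, and the dump shows which native frame MPI.Init is blocked in. A minimal sketch (the class name is taken from the program above):

```shell
# Find the hung HelloMpi JVM and write its thread dump to a file.
pid=$(jps 2>/dev/null | awk '/HelloMpi/ {print $1}' | head -n1)
if [ -n "$pid" ]; then
    jstack "$pid" > mpi_init_hang.txt   # look for the main thread inside MPI.Init
    echo "thread dump written to mpi_init_hang.txt"
else
    echo "no HelloMpi JVM found"
fi
```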

Edit: examples/hello_c shows the same behavior, so this is not Java-related. I guess it must be something in the transport. I had to build/install OpenMPI with user permissions only. There is an existing OpenMPI on the system, but without Java support. Any ideas on how to proceed?

Edit2: Switching to a different BTL, e.g. with --mca btl vader,self, works. Below is the --mca btl_base_verbose output up to the point where it hangs:

[fdr4:33013] mca: base: components_register: registering framework btl components
[fdr4:33013] mca: base: components_register: found loaded component sm
[fdr4:33014] mca: base: components_register: registering framework btl components
[fdr4:33014] mca: base: components_register: found loaded component sm
[fdr4:33013] mca: base: components_register: component sm register function successful
[fdr4:33013] mca: base: components_register: found loaded component self
[fdr4:33014] mca: base: components_register: component sm register function successful
[fdr4:33013] mca: base: components_register: component self register function successful
[fdr4:33014] mca: base: components_register: found loaded component self
[fdr4:33013] mca: base: components_register: found loaded component tcp
[fdr4:33014] mca: base: components_register: component self register function successful
[fdr4:33013] mca: base: components_register: component tcp register function successful
[fdr4:33014] mca: base: components_register: found loaded component tcp
[fdr4:33013] mca: base: components_register: found loaded component vader
[fdr4:33013] mca: base: components_register: component vader register function successful
[fdr4:33013] mca: base: components_register: found loaded component openib
[fdr4:33014] mca: base: components_register: component tcp register function successful
[fdr4:33014] mca: base: components_register: found loaded component vader
[fdr4:33014] mca: base: components_register: component vader register function successful
[fdr4:33014] mca: base: components_register: found loaded component openib
[fdr4:33013] mca: base: components_register: component openib register function successful
[fdr4:33013] mca: base: components_open: opening btl components
[fdr4:33013] mca: base: components_open: found loaded component sm
[fdr4:33013] mca: base: components_open: component sm open function successful
[fdr4:33013] mca: base: components_open: found loaded component self
[fdr4:33013] mca: base: components_open: component self open function successful
[fdr4:33013] mca: base: components_open: found loaded component tcp
[fdr4:33013] mca: base: components_open: component tcp open function successful
[fdr4:33013] mca: base: components_open: found loaded component vader
[fdr4:33013] mca: base: components_open: component vader open function successful
[fdr4:33013] mca: base: components_open: found loaded component openib
[fdr4:33013] mca: base: components_open: component openib open function successful
[fdr4:33013] select: initializing btl component sm
[fdr4:33014] mca: base: components_register: component openib register function successful
[fdr4:33014] mca: base: components_open: opening btl components
[fdr4:33014] mca: base: components_open: found loaded component sm
[fdr4:33014] mca: base: components_open: component sm open function successful
[fdr4:33014] mca: base: components_open: found loaded component self
[fdr4:33014] mca: base: components_open: component self open function successful
[fdr4:33014] mca: base: components_open: found loaded component tcp
[fdr4:33014] mca: base: components_open: component tcp open function successful
[fdr4:33014] mca: base: components_open: found loaded component vader
[fdr4:33014] mca: base: components_open: component vader open function successful
[fdr4:33014] mca: base: components_open: found loaded component openib
[fdr4:33014] mca: base: components_open: component openib open function successful
[fdr4:33014] select: initializing btl component sm
[fdr4:33014] select: init of component sm returned success
[fdr4:33014] select: initializing btl component self
[fdr4:33014] select: init of component self returned success
[fdr4:33014] select: initializing btl component tcp
[fdr4:33013] select: init of component sm returned success
[fdr4:33013] select: initializing btl component self
[fdr4:33013] select: init of component self returned success
[fdr4:33013] select: initializing btl component tcp
[fdr4:33014] select: init of component tcp returned success
[fdr4:33014] select: initializing btl component vader
[fdr4:33013] select: init of component tcp returned success
[fdr4:33013] select: initializing btl component vader
[fdr4:33014] select: init of component vader returned success
[fdr4:33014] select: initializing btl component openib
[fdr4:33013] select: init of component vader returned success
[fdr4:33013] select: initializing btl component openib
[fdr4:33014] Checking distance from this process to device=mlx4_0
[fdr4:33013] Checking distance from this process to device=mlx4_0
[fdr4:33013] hwloc_distances->nbobjs=4
[fdr4:33013] hwloc_distances->latency[0]=1.000000
[fdr4:33013] hwloc_distances->latency[1]=2.000000
[fdr4:33013] hwloc_distances->latency[2]=3.000000
[fdr4:33014] hwloc_distances->nbobjs=4
[fdr4:33014] hwloc_distances->latency[0]=1.000000
[fdr4:33014] hwloc_distances->latency[1]=2.000000
[fdr4:33014] hwloc_distances->latency[2]=3.000000
[fdr4:33013] hwloc_distances->latency[3]=2.000000
[fdr4:33013] hwloc_distances->latency[4]=2.000000
[fdr4:33013] hwloc_distances->latency[5]=1.000000
[fdr4:33013] hwloc_distances->latency[6]=2.000000
[fdr4:33013] hwloc_distances->latency[7]=3.000000
[fdr4:33013] ibv_obj->logical_index=1
[fdr4:33014] hwloc_distances->latency[3]=2.000000
[fdr4:33014] hwloc_distances->latency[4]=2.000000
[fdr4:33014] hwloc_distances->latency[5]=1.000000
[fdr4:33014] hwloc_distances->latency[6]=2.000000
[fdr4:33014] hwloc_distances->latency[7]=3.000000
[fdr4:33014] ibv_obj->logical_index=1
[fdr4:33013] my_obj->logical_index=0
[fdr4:33013] Process is bound: distance to device is 2.000000
[fdr4:33014] my_obj->logical_index=0
[fdr4:33014] Process is bound: distance to device is 2.000000
[fdr4:33013] [rank=0] openib: using port mlx4_0:1
[fdr4:33013] select: init of component openib returned success
[fdr4:33014] [rank=1] openib: using port mlx4_0:1
[fdr4:33014] select: init of component openib returned success
[fdr4:33013] mca: bml: Using self btl for send to [[59315,1],0] on node fdr4
[fdr4:33014] mca: bml: Using self btl for send to [[59315,1],1] on node fdr4
[fdr4:33013] mca: bml: Using vader btl for send to [[59315,1],1] on node fdr4
[fdr4:33014] mca: bml: Using vader btl for send to [[59315,1],0] on node fdr4
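Since the log above ends right after openib initializes, one way to confirm which BTL is at fault is to whitelist components one set at a time and see which combination hangs. A sketch (the hello_c path is illustrative; the snippet skips itself where mpirun or the example binary is unavailable):

```shell
# Whitelist BTL components one set at a time; the set that hangs points at
# the faulty transport. "self" is always needed for loopback sends.
if command -v mpirun >/dev/null 2>&1 && [ -x examples/hello_c ]; then
    mpirun -np 2 --mca btl self,vader  examples/hello_c   # shared memory only
    mpirun -np 2 --mca btl self,tcp    examples/hello_c   # TCP only
    mpirun -np 2 --mca btl self,openib examples/hello_c   # InfiniBand only
else
    echo "mpirun or examples/hello_c not available here"
fi
```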

Comments:

First, you should try a C program, e.g. examples/hello_c.c. If that still does not work, try mpirun -np 2 hostname, then mpirun -np 2 --mca pml ob1 --mca btl vader,self hello_c, and then mpirun --mca pml ob1 --mca btl tcp,self hello_c. I guess mpirun --mca pml ob1 -np 2 hello_c would also work.

That strongly suggests something is wrong with your hardware/environment/system stack. You can now run mpirun --mca pml_base_verbose 10 --mca mtl_base_verbose 10 -np 2 hello_c to get a deeper look at what is going on.

Since this is clearly an Open MPI issue, you are more likely to get an answer by asking directly on the users@lists.open-mpi.org mailing list.

If you are running on a single node, forcing the vader component is unlikely to affect your application's performance.

Answer 1:

Solved. In this case the problem was one of the limits imposed on the user. The server was configured with default settings, but after changing the following in /etc/security/limits.conf it started working with the default BTLs (since I could not test it directly myself, I unfortunately do not know which of the two settings was the culprit):

*               -       memlock         unlimited
*               -       nofile          16384
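Note that limits.conf changes only apply to new login sessions; a quick way to confirm what a fresh shell actually sees is:

```shell
# Check the effective limits in the current session; after the fix above,
# memlock should report "unlimited" and nofile at least 16384.
memlock=$(ulimit -l)
nofile=$(ulimit -n)
echo "memlock=$memlock nofile=$nofile"
```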

Comments:

openib (InfiniBand) IIRC (basically) requires unlimited memlock, and if that limit is too low you normally get a warning/error message, so there may be yet another issue here.

You mean that because we did not see a warning, there must be another problem?

Possibly, I will look into it.
