How can I understand my valgrind error message?

Posted: 2019-07-30 13:08:01

Question:

I get the following error messages from valgrind:

==1808== 0 bytes in 1 blocks are still reachable in loss record 1 of 1,734
==1808==    at 0x4A05E7D: malloc (vg_replace_malloc.c:309)
==1808==    by 0x4CC2BA9: hwloc_build_level_from_list (topology.c:1603)
==1808==    by 0x4CC2BA9: hwloc_connect_levels (topology.c:1774)
==1808==    by 0x4CC2F25: hwloc_discover (topology.c:2091)
==1808==    by 0x4CC2F25: opal_hwloc132_hwloc_topology_load (topology.c:2596)
==1808==    by 0x4C60957: orte_odls_base_open (odls_base_open.c:205)
==1808==    by 0x632FDB3: ???
==1808==    by 0x4C3B6B9: orte_init (orte_init.c:127)
==1808==    by 0x403E0E: orterun (orterun.c:693)
==1808==    by 0x4035E3: main (main.c:13)
==1808==
==1808== 0 bytes in 1 blocks are still reachable in loss record 2 of 1,734
==1808==    at 0x4A05E7D: malloc (vg_replace_malloc.c:309)
==1808==    by 0x4CC2BD5: hwloc_build_level_from_list (topology.c:1603)
==1808==    by 0x4CC2BD5: hwloc_connect_levels (topology.c:1775)
==1808==    by 0x4CC2F25: hwloc_discover (topology.c:2091)
==1808==    by 0x4CC2F25: opal_hwloc132_hwloc_topology_load (topology.c:2596)
==1808==    by 0x4C60957: orte_odls_base_open (odls_base_open.c:205)
==1808==    by 0x632FDB3: ???
==1808==    by 0x4C3B6B9: orte_init (orte_init.c:127)
==1808==    by 0x403E0E: orterun (orterun.c:693)
==1808==    by 0x4035E3: main (main.c:13)

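I cannot tell what kind of problem valgrind is reporting. Would someone be willing to explain?

I have checked every allocation made with new; all of them are correctly deleted.

When the code ends, I get the valgrind error messages above and a further error from MPI: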

---------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 1811 on node laki.pi.ingv.it exited on signal 11 (Segmentation fault).
----------------------------------------------------------------------

This is the error message related to MPI_Init:

==31198== 0 bytes in 1 blocks are still reachable in loss record 1 of 368
==31198==    at 0x4A05E7D: malloc (vg_replace_malloc.c:309)
==31198==    by 0xC66DE49: hwloc_build_level_from_list (topology.c:1603)
==31198==    by 0xC66DE49: hwloc_connect_levels (topology.c:1774)
==31198==    by 0xC66E1C5: hwloc_discover (topology.c:2091)
==31198==    by 0xC66E1C5: opal_hwloc132_hwloc_topology_load (topology.c:2596)
==31198==    by 0xC62B473: opal_hwloc_unpack (hwloc_base_dt.c:83)
==31198==    by 0xC6270AB: opal_dss_unpack_buffer (dss_unpack.c:120)
==31198==    by 0xC62815F: opal_dss_unpack (dss_unpack.c:84)
==31198==    by 0xC5F2349: orte_util_nidmap_init (nidmap.c:146)
==31198==    by 0xED98608: ???
==31198==    by 0xC5DC0B9: orte_init (orte_init.c:127)
==31198==    by 0xC59DBAE: ompi_mpi_init (ompi_mpi_init.c:357)
==31198==    by 0xC5B443F: PMPI_Init (pinit.c:84)
==31198==    by 0x55FA53: main (solver_2d.hpp:22)

where line solver_2d.hpp:22 contains exactly:

MPI_Init(&argc, &argv);

In addition, the error message related to MPI_Finalize() is:

==31198== 1 errors in context 1 of 58:
==31198== Syscall param write(buf) points to uninitialised byte(s)
==31198==    at 0x38EF00E6FD: ??? (in /lib64/libpthread-2.12.so)
==31198==    by 0x11F1F548: ???
==31198==    by 0x11F1E03F: ???
==31198==    by 0x11CD7FBA: ???
==31198==    by 0x11CE519A: ???
==31198==    by 0x11CE3C37: ???
==31198==    by 0x11CD90C1: ???
==31198==    by 0x11AC2E36: ???
==31198==    by 0xC59ECC4: ompi_mpi_finalize (ompi_mpi_finalize.c:285)
==31198==    by 0x562185: main (solver_2d.hpp:171)
==31198==  Address 0x1ffeffda24 is on thread 1's stack
==31198==  Uninitialised value was created by a stack allocation
==31198==    at 0x11CCE050: ???

==31197== Syscall param write(buf) points to uninitialised byte(s)
==31197==    at 0x38EF00E6FD: ??? (in /lib64/libpthread-2.12.so)
==31197==    by 0x11F1F548: ipath_cmd_write (in /usr/lib64/libinfinipath.so.4.0)
==31197==    by 0x11F1E03F: ipath_poll_type (in /usr/lib64/libinfinipath.so.4.0)
==31197==    by 0x11CD7FBA: psmi_context_interrupt_set (in /usr/lib64/libpsm_infinipath.so.1.15)
==31197==    by 0x11CE519A: ips_ptl_rcvthread_fini (in /usr/lib64/libpsm_infinipath.so.1.15)
==31197==    by 0x11CE3C37: ??? (in /usr/lib64/libpsm_infinipath.so.1.15)
==31197==    by 0x11CD90C1: psm_ep_close (in /usr/lib64/libpsm_infinipath.so.1.15)
==31197==    by 0x11AC2E36: ompi_mtl_psm_finalize (mtl_psm.c:200)
==31197==    by 0xC59ECC4: ompi_mpi_finalize (ompi_mpi_finalize.c:285)
==31197==    by 0x562185: main (solver_2d.hpp:171)
==31197==  Address 0x1ffeffda24 is on thread 1's stack
==31197==  in frame #2, created by ipath_poll_type (???:)
==31197==  Uninitialised value was created by a stack allocation
==31197==    at 0x11CCE050: ??? (in /usr/lib64/libpsm_infinipath.so.1.15)

where line solver_2d.hpp:171 corresponds to:

MPI_Finalize();

Finally, the error message corresponding to MPI_write, or more precisely to MPI_File_open, is the following:

==31198== 48 bytes in 1 blocks are still reachable in loss record 104 of 368
==31198==    at 0x4A05E7D: malloc (vg_replace_malloc.c:309)
==31198==    by 0xC58C750: opal_obj_new (opal_object.h:469)
==31198==    by 0xC58C750: ompi_attr_set_c (attribute.c:761)
==31198==    by 0xC5AA0BE: PMPI_Attr_put (pattr_put.c:58)
==31198==    by 0x118501AB: ???
==31198==    by 0x11843159: ???
==31198==    by 0x1185657D: ???
==31198==    by 0xC5CEFB5: module_init (io_base_file_select.c:442)
==31198==    by 0xC5CEFB5: mca_io_base_file_select (io_base_file_select.c:214)
==31198==    by 0xC5977A5: ompi_file_open (file.c:128)
==31198==    by 0xC5C6557: PMPI_File_open (pfile_open.c:96)
==31198==    by 0x5638A1: p_fstream (p_fstream.hpp:86)

Line p_fstream.hpp:86 is:

MPI_File_open(MPI_COMM_WORLD, const_cast<char*>(fname.c_str()), flags, MPI_INFO_NULL, &mpi_file);

Comments:

Possible duplicate of Openmpi and vargrind

Answer 1:

The valgrind messages report memory leaks inside mpirun itself, which you probably do not need to care much about.

I assume you ran

valgrind mpirun a.out

but what you really want is to look for incorrect memory accesses and leaks in the MPI application itself. In that case, you should run

mpirun valgrind a.out

Note that the output of all tasks will be interleaved. Since you are using Open MPI, you can run

mpirun --tag-output valgrind a.out

to prefix each task's output with its rank.
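For context on the "still reachable" category that appears in the reports above, here is a minimal sketch, independent of MPI, of how it differs from a real leak. The names `g_cache`, `init_cache`, and `leak_block` are hypothetical and exist only for this illustration; they are not taken from Open MPI or from the asker's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical global: it still points at the block when the program exits.
static char* g_cache = nullptr;

// Allocates a block that is never freed but stays referenced by a global.
// Memcheck would typically classify this as "still reachable".
std::size_t init_cache(std::size_t n) {
    assert(n > 0);
    g_cache = static_cast<char*>(std::malloc(n));
    if (g_cache == nullptr) return 0;
    std::memset(g_cache, 0, n);
    return n;
}

// Allocates a block and then drops the only pointer to it.
// Memcheck would typically classify this as "definitely lost".
std::size_t leak_block(std::size_t n) {
    assert(n > 0);
    char* p = static_cast<char*>(std::malloc(n));
    if (p == nullptr) return 0;
    std::memset(p, 0, n);
    p = nullptr;  // the last reference is gone; the block can no longer be freed
    return n;
}
```

If both functions are called from a `main` built without optimization (for example with `-g -O0`) and the program is run under `valgrind --leak-check=full --show-leak-kinds=all`, the first block would be expected to show up as still reachable and the second as definitely lost. Blocks of the first kind, such as one-time library setup that is never torn down, are usually harmless, which matches the traces above where the allocation happens inside hwloc and Open MPI initialization.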

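Comments:

Running valgrind with Open MPI using the command you suggested produces many more specific errors that I need to debug. There are still errors that I believe are caused by Open MPI. One occurs at MPI initialization, at the line MPI_Init(&argc, &argv);. Another occurs at MPI shutdown, at MPI_Finalize();. A third occurs at the MPI command MPI_write. Are these minor errors? Do they need to be fixed? If so, how? Do they affect the behavior of the code?

Please post the output if you would like me to take a look.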
