Intermittent SIGSEGV (segfault), SIGABRT and process hangs in C# code using Mono

Posted: 2015-11-23 17:47:44

[Question]:

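We are seeing intermittent segfaults and process hangs in a C# Mono project running on Ubuntu. I have spent a lot of time trying to debug the problem, including following the instructions at: http://www.mono-project.com/docs/debug+profile/debug/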

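Data points:

How often this happens varies greatly between environments. In our UAT environment it happens rarely. In production it happens every few hours. On our development machines the process is lucky to run for 20 minutes without failing.

We upgraded Mono to version 4.03, with no improvement.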

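Symptoms:

Either the process hangs and does not respond to SIGQUIT or SIGTERM, or it fails with a SIGSEGV or SIGABRT.

Here is a sample dump, although the dumps vary and most of them do not contain the assertion failure shown below.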

* Assertion: should not be reached at sgen-scan-object.h:101

Native stacktrace:

        /usr/bin/mono() [0x4b23ac]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7fbaa5e50340]
        /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7fbaa5ab1cc9]
        /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7fbaa5ab50d8]
        /usr/bin/mono() [0x629839]
        /usr/bin/mono() [0x629a47]
        /usr/bin/mono() [0x629b96]
        /usr/bin/mono() [0x5d85a8]
        /usr/bin/mono() [0x5cbd56]
        /usr/bin/mono() [0x5cd458]
        /usr/bin/mono() [0x5cdaab]
        /usr/bin/mono() [0x5d0d32]
        /usr/bin/mono(mono_gc_collect+0x28) [0x5d1458]
        /usr/bin/mono() [0x59c18a]
        /usr/bin/mono() [0x623a06]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7fbaa5e48182]
        /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fbaa5b7547d]

Debug info from gdb:

Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No threads.

=================================================================
Got a SIGABRT while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================
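The "Could not attach to process" lines in the dump above point at the Yama ptrace restriction named in the message, which is why the crash handler's gdb output is empty. A minimal sketch for checking the current setting before the next crash follows; describe_ptrace_scope is a hypothetical helper name, and the descriptions are paraphrased from the kernel's Yama documentation rather than taken from the original post.

```shell
#!/bin/sh
# Sketch: report the current Yama ptrace restriction before attaching gdb.
# describe_ptrace_scope is a hypothetical helper, not part of Mono or gdb.
describe_ptrace_scope() {
    case "$1" in
        0) echo "classic: any process with the same uid can be attached" ;;
        1) echo "restricted: only descendants can be attached (run gdb as root)" ;;
        2) echo "admin-only: attaching requires CAP_SYS_PTRACE" ;;
        3) echo "disabled: no attaching until reboot" ;;
        *) echo "unknown value: $1" ;;
    esac
}

scope_file=/proc/sys/kernel/yama/ptrace_scope
if [ -r "$scope_file" ]; then
    # Read the live setting and print what it means for gdb.
    describe_ptrace_scope "$(cat "$scope_file")"
else
    echo "Yama ptrace_scope not available on this kernel"
fi
```

If the value is 1, running gdb as root, or temporarily relaxing the setting with `sudo sysctl -w kernel.yama.ptrace_scope=0` on a test machine, should let the attach succeed.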

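I can't be 100% sure that the hangs, the segfaults and the SIGABRTs are all caused by the same problem, but I suspect they are. The hang does not feel like an ordinary deadlock, because the process does not respond to SIGQUIT or SIGTERM.

I have tried attaching gdb as described at http://www.mono-project.com/docs/debug+profile/debug/, but the results were not very helpful.

Here is my .gdbinit: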

less ~/.gdbinit
handle SIGXCPU SIG33 SIG35 SIGPWR nostop noprint
define mono_stack
 set $mono_thread = mono_thread_current ()
 if ($mono_thread == 0x00)
   printf "No mono thread associated with this thread\n"
 else
   set $ucp = malloc (sizeof (ucontext_t))
   call (void) getcontext ($ucp)
   call (void) mono_print_thread_dump ($ucp)
   call (void) free ($ucp)
 end
end
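For a hung process, a non-interactive attach can be easier to repeat than a manual session. The sketch below is one possible way to capture native backtraces for every thread; dump_threads is a hypothetical helper name, and it only uses gdb's standard `-p`, `-batch` and `-ex` options. It does not call the mono_stack command defined above, so managed frames will still show up as `??`.

```shell
#!/bin/sh
# Sketch: dump native backtraces of all threads of a running process.
# dump_threads is a hypothetical helper; $1 is the pid to attach to.
dump_threads() {
    # -batch exits gdb (and detaches) once the -ex commands have run.
    gdb -p "$1" -batch -ex "thread apply all bt"
}

if [ -n "${1:-}" ]; then
    # Keep the output in a file so that several hangs can be compared later.
    dump_threads "$1" > "gdb-threads-$1.txt" 2>&1
else
    echo "usage: pass the pid of the hung mono process as the first argument"
fi
```

Attaching will usually need root when ptrace is restricted, as the gdb message in the crash dump indicates.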

Here is the output of one of my gdb debugging sessions (on a hung process):

(gdb) where
#0  0x00007f2bbba05062 in do_sigsuspend (set=0x945300) at ../sysdeps/unix/sysv/linux/sigsuspend.c:31
#1  __GI___sigsuspend (set=0x945300) at ../sysdeps/unix/sysv/linux/sigsuspend.c:45
#2  0x00000000005c8ccc in ?? ()
#3  <signal handler called>
#4  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#5  0x00000000005fdda7 in ?? ()
#6  0x0000000000610aac in ?? ()
#7  0x0000000000585f6e in ?? ()
#8  0x0000000000586ee9 in ?? ()
#9  0x00000000403eb416 in ?? ()
#10 0x000000000290e8b0 in ?? ()
#11 0x00007fff29bfacb0 in ?? ()
#12 0x0000000000000000 in ?? ()

(gdb) p mono_pmip (0x00000000005fdda7)
$1 = 0

(doesn’t seem to print anything either to gdb console or process stdout)

(gdb) call mono_locks_dump (0)
$2 = 0

Total locks (in 10 array(s)): 16368, used: 399, on freelist: 213, to recycle: 15752

(gdb) mono_stack()
"<unnamed thread>" tid=0x0x7f2bbc8d47c0 this=0x0x7f2bbc858140 thread handle 0x403 state : waiting on 0x41a : Event  owns ()
  at <unknown> <0xffffffff>
  at (wrapper managed-to-native) System.Threading.WaitHandle.WaitOne_internal (System.Threading.WaitHandle,intptr,int,bool) <IL 0x0001c, 0xffffffff>
  at System.Threading.WaitHandle.WaitOne (System.TimeSpan,bool) <0x0009b>
  at System.Threading.WaitHandle.WaitOne (System.TimeSpan) <0x0001d>
  at COG.PonteDeiSospiri.PdSDaemon.CuratorDaemon.RunUntilSignaled () [0x00073] in /home/ubuntu/jenkins/sharedspace/bridge-shared-workspace/app/PdS-Daemon/CuratorDaemon.cs:184
  at COG.PonteDeiSospiri.PdSDaemon.CuratorDaemon.Run (string[]) [0x00019] in /home/ubuntu/jenkins/sharedspace/bridge-shared-workspace/app/PdS-Daemon/CuratorDaemon.cs:35
  at COG.PonteDeiSospiri.PdSDaemon.CuratorDaemon.Main (string[]) [0x00000] in /home/ubuntu/jenkins/sharedspace/bridge-shared-workspace/app/PdS-Daemon/CuratorDaemon.cs:24
  at (wrapper runtime-invoke) <Module>.runtime_invoke_int_object (object,intptr,intptr,intptr) <IL 0x0006c, 0xffffffff>


"<unnamed thread>" tid=0x0x7f2bbc8d47c0 this=0x0x7f2bbc858140 thread handle 0x403 state : waiting on 0x41a : Event  owns ()
  at <unknown> <0xffffffff>
  at (wrapper managed-to-native) System.Threading.WaitHandle.WaitOne_internal (System.Threading.WaitHandle,intptr,int,bool) <IL 0x0001c, 0xffffffff>
  at System.Threading.WaitHandle.WaitOne (System.TimeSpan,bool) <0x0009b>
  at System.Threading.WaitHandle.WaitOne (System.TimeSpan) <0x0001d>
  at COG.PonteDeiSospiri.PdSDaemon.CuratorDaemon.RunUntilSignaled () [0x00073] in /home/ubuntu/jenkins/sharedspace/bridge-shared-workspace/app/PdS-Daemon/CuratorDaemon.cs:184
  at COG.PonteDeiSospiri.PdSDaemon.CuratorDaemon.Run (string[]) [0x00019] in /home/ubuntu/jenkins/sharedspace/bridge-shared-workspace/app/PdS-Daemon/CuratorDaemon.cs:35
  at COG.PonteDeiSospiri.PdSDaemon.CuratorDaemon.Main (string[]) [0x00000] in /home/ubuntu/jenkins/sharedspace/bridge-shared-workspace/app/PdS-Daemon/CuratorDaemon.cs:24
  at (wrapper runtime-invoke) <Module>.runtime_invoke_int_object (object,intptr,intptr,intptr) <IL 0x0006c, 0xffffffff>

call mono_locks_dump (0)
$1 = 51700864
(gdb) call mono_locks_dump (1)
$2 = 56715296

Total locks (in 10 array(s)): 16368, used: 399, on freelist: 213, to recycle: 15752
Lock 0x29d68d0 in object 0x7f2ba8d13590 untaken
Lock 0x29d68f8 in object 0x7f2b7482c2c0 untaken
Lock 0x29d6920 in object 0x7f2b7482cd00 untaken
Lock 0x29d6948 in object 0x7f2b7482cb70 untaken
Lock 0x29d6970 in object 0x7f2b7482c760 untaken
Lock 0x29d6998 in object 0x7f2b7482d380 untaken
Lock 0x29d69c0 in object 0x7f2b7482c540 untaken
Lock 0x29d69e8 in object 0x7f2b7482c240 untaken
…...
times lots


(gdb) call mono_object_describe (0x41a)

The following is printed to the gdb console. 

Program received signal SIGSEGV, Segmentation fault.
0x000000000052c1a2 in mono_object_describe ()
The program being debugged was signaled while in a function called from GDB.
GDB remains in the frame where the signal was received.
To change this behavior use "set unwindonsignal on".
Evaluation of the expression containing the function
(mono_object_describe) will be abandoned.
When the function is done executing, GDB will silently stop.
(gdb) quit
A debugging session is active.

        Inferior 1 [process 7763] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/bin/mono-sgen, process 7763

As soon as gdb finishes, the process writes remaining log messages to gdb console and then restarts (possibly by upstart)

ubuntu@shim-megastore-prod:/var/log/upstart$ 2015-08-20 01:48:20,124  INFO   (  1) .PonteDeiSospiri.PdSDaemon.CuratorDaemon  ::  Service check complete.
2015-08-20 01:48:22,641  INFO   (  5) iri.PdSDaemon.Services.CloudWatchService  ::  936 metrics averaged...
2015-08-20 01:48:22,716  INFO   (  5) iri.PdSDaemon.Services.CloudWatchService  ::  4 metrics posted to CloudWatch.
2015-08-20 01:48:29,568  INFO   (ker) piri.PdSDaemon.Services.PriceSyncService  ::  98.8% synchronised (15.1/sec)
2015-08-20 01:48:39,820  DEBUG  (  4) ri.PdSDaemon.Services.ProductSyncService  ::  Zzzz

Process restarts, or is restarted by Upstart

2015-08-20 06:51:20,163  INFO   (  1) .PonteDeiSospiri.PdSDaemon.CuratorDaemon  ::  Ponte dei Sospiri Daemon Version 1.0.5695.31695
2015-08-20 06:51:20,172  INFO   (  1) .PonteDeiSospiri.PdSDaemon.CuratorDaemon  ::  Process ID: 12625
2015-08-20 06:51:20,172  INFO   (  1) .PonteDeiSospiri.PdSDaemon.CuratorDaemon  ::
2015-08-20 06:51:20,182  INFO   (  1) .PonteDeiSospiri.PdSDaemon.CuratorDaemon  ::  ProductSyncService is not running, firing it up...
2015-08-20 06:51:20,183  INFO   (  1) .PonteDeiSospiri.PdSDaemon.CuratorDaemon  ::  CloudWatchService is not running, firing it up...
2015-08-20 06:51:20,185  INFO   (  1) .PonteDeiSospiri.PdSDaemon.CuratorDaemon  ::  OrderProcessingService is not running, firing it up...

The above is all written to the gdb console window. From then on, the output goes to the upstart console log.

Here is the project's dependency list:

  <package id="AWSSDK" version="2.3.20.0" targetFramework="net40" />
  <package id="CsvHelper" version="2.10.0" targetFramework="net40" />
  <package id="FluentMigrator" version="1.4.0.0" targetFramework="net40" />
  <package id="Mono.Options" version="1.1" targetFramework="net40" />
  <package id="Npgsql" version="2.2.5" targetFramework="net40" />
  <package id="ServiceStack.Common" version="3.9.71" targetFramework="net40" />
  <package id="ServiceStack.OrmLite.PostgreSQL" version="3.9.71" targetFramework="net40" />
  <package id="ServiceStack.OrmLite.Sqlite.Mono" version="3.9.71" targetFramework="net40" />
  <package id="ServiceStack.Text" version="3.9.71" targetFramework="net40" />
targetFramework="net40" />
  <package id="log4net" version="2.0.3" targetFramework="net40" />

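Any ideas or suggestions on how to get more specific information about what is causing this? It looks like it could be a bug in Mono or in one of the native libraries (since we have no unsafe code), but I can't work out where the problem lies.

Any help is much appreciated!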

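[Comments]:

Can you try your program on a Debian machine? Ubuntu has given me a lot of problems with Mono and threads, so maybe you are hitting the same issue.

Thanks Gusman. Which version of Ubuntu/Debian are you using?

Actually I'm using Debian 7 and 8 with Mono 4.0.1 installed from the Xamarin repositories. We use it on our production servers and it is 100% stable. We have our own REST server written from scratch, so we use a lot of threads and serve thousands of simultaneous users, and we have no problems.

Interesting. Changing the OS is possible, but it would probably be a last resort for us. Since I first posted this question I have narrowed the problem down to the point where we call AppDomain.Unload on our AppDomains, each of which runs in its own thread. I think there is a race condition somewhere in the unload, because it hangs sometimes (and only) when we call Unload on two threads in quick succession. I haven't completely ruled out a deadlock in our own code, but my understanding of how Unload works should rule that out. I will try to put together a simple repro case.

[Answer 1]: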

OK, this is a known bug in the Ubuntu kernel.

There is a bug report at Xamarin: https://bugzilla.xamarin.com/show_bug.cgi?id=29827

So if you update the kernel on those machines, the bug should go away (hopefully).

Cheers!

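[Comments]:

Awesome, thanks! Not sure how I missed this when searching Google, as it's the top result for "sigsegv ubuntu mono"! I've upgraded the kernel on the dev box and am monitoring. Fairly optimistic that this is it. If it hasn't recurred overnight, I'll accept this answer. Thanks again :).

Actually, we are still seeing these problems, although not as frequently. Latest kernel version: 4.0.5-040005-generic #201506061639 SMP Sat Jun 6 16:40:45 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux. The error is: * Assertion at mini-exceptions.c:2007, condition 'tls->signal_stack' not met

I'm going to (re)accept this answer and open a new question for the AppDomain.Unload problem, because I'm fairly sure there are two problems and updating the kernel has fixed one of them. Thanks!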
