gRPC 异步服务死锁/永远卡住

Posted

技术标签:

【中文标题】gRPC 异步服务死锁/永远卡住【英文标题】:gRPC asynchronous service deadlocks/stuck forever 【发布时间】:2020-08-28 18:31:22 【问题描述】:

我实现了一个基于https://github.com/grpc/grpc/blob/master/examples/cpp/helloworld/greeter_async_client2.cc的多线程异步服务。我有 64 个线程执行一些操作,然后以异步方式联系远程服务器。但是,当我运行我的代码时,通常它会卡在某些线程的 pthread_join 上,有时我的两个节点都可以成功地在我的所有工作线程上执行 pthread_join,有时只有一个节点可以这样做。后来我在卡住的地方运行了信息线程,我从中得到了结果。

* 1    Thread 0x7ffff7fe2100 (LWP 10567) "rundb" 0x00007ffff64f4d2d in __GI___pthread_timedjoin_ex (threadid=140732532782848, thread_return=0x0, abstime=0x0, 
    block=<optimized out>) at pthread_join_common.c:89
  2    Thread 0x7ffff55ff700 (LWP 10568) "rundb" 0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7ffff4134118)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  3    Thread 0x7ffff4dfe700 (LWP 10569) "rundb" 0x00007ffff621cbb7 in epoll_wait (epfd=25, events=0x7ffff3c446b0, maxevents=100, timeout=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  4    Thread 0x7ffff3bff700 (LWP 10570) "default-executo" 0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x7ffff4108064)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  5    Thread 0x7ffff33fe700 (LWP 10571) "resolver-execut" 0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x7ffff410b860)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  6    Thread 0x7ffff17ff700 (LWP 10572) "grpc_global_tim" 0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0xd5a668 <g_cv_wait+40>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  7    Thread 0x7ffff07fd700 (LWP 10574) "grpc_health_che" 0x00007ffff64f99f3 in---Type <return> to continue, or q <return> to quit---ret
 futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x7ffff07fb1e8)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  8    Thread 0x7ffff0ffe700 (LWP 10573) "grpc_health_che" 0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x7ffff0ffc1e8)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  9    Thread 0x7fffef7fb700 (LWP 10575) "grpcpp_sync_ser" 0x00007ffff64f9ed9 in futex_reltimed_wait_cancelable (private=<optimized out>, 
    reltime=0x7fffef7f8f60, expected=0, futex_word=0x7fffef7f90d8)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:142
  27   Thread 0x7fffcdbff700 (LWP 10594) "grpc_global_tim" 0x00007ffff64f9ed9 in futex_reltimed_wait_cancelable (private=<optimized out>, 
    reltime=0x7fffcdbfd2d0, expected=0, futex_word=0xd5a668 <g_cv_wait+40>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:142
  41   Thread 0x7ffed89ff700 (LWP 10608) "rundb" 0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7ffed89fcfb0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  81   Thread 0x7ffec49d7700 (LWP 10648) "rundb" 0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7ffec49d4fb0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  83   Thread 0x7ffec39d5700 (LWP 10650) "rundb" 0x00007ffff621cbb7 in epoll_wait (epfd=3, events=0x7ffff404c2b0, maxevents=100, timeout=-1)
---Type <return> to continue, or q <return> to quit---ret
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  91   Thread 0x7ffebf9cd700 (LWP 10658) "rundb" 0x0000000000429e7f in LogManager::run (this=0x7ffff5a3a1e0) at storage/log.cpp:79
  152  Thread 0x7ffeb81fd700 (LWP 10719) "default-executo" 0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x7ffff410810c)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  545  Thread 0x7ffeb9fff700 (LWP 11112) "default-executo" 0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x7ffff41081b0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  682  Thread 0x7fffefffc700 (LWP 11249) "default-executo" 0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x7ffff4108258)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  1397 Thread 0x7fffccbff700 (LWP 11964) "default-executo" 0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x7ffff4108300)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  3470 Thread 0x7ffeb89fe700 (LWP 14041) "default-executo" 0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x7ffff41083a8)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
---Type <return> to continue, or q <return> to quit---re
  4766 Thread 0x7ffeb91ff700 (LWP 15337) "default-executo" 0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x7ffff4108450)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  7122 Thread 0x7ffe9f8ff700 (LWP 17693) "default-executo" 0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0x7ffff41084f8)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  7132 Thread 0x7ffeacdff700 (LWP 17703) "grpcpp_sync_ser" 0x00007ffff64f9ed9 in futex_reltimed_wait_cancelable (private=<optimized out>, 
    reltime=0x7ffeacdfce90, expected=0, futex_word=0x7ffeacdfd050)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:142
  7135 Thread 0x7ffe651ff700 (LWP 17706) "grpcpp_sync_ser" 0x00007ffff621cbb7 in epoll_wait (epfd=14, events=0x7ffedb88a0b0, maxevents=100, timeout=9992)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30

我和成功加入程序的结果比较,发现主要问题是41、81和83。然后我做了thread 41thread 81thread 83,得到以下结果。

线程 41

#0  0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, 
    expected=0, futex_word=0x7ffed89fcfb0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7ffff404c278, 
    cond=0x7ffed89fcf88) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7ffed89fcf88, mutex=0x7ffff404c278)
    at pthread_cond_wait.c:655
#3  0x000000000072f05a in gpr_cv_wait ()
#4  0x00000000006a12b0 in begin_worker ()
#5  0x00000000006a1757 in pollset_work ()
#6  0x000000000067dd5a in pollset_work(grpc_pollset*, grpc_pollset_worker**, long) ()
#7  0x00000000005219ea in grpc_pollset_work(grpc_pollset*, grpc_pollset_worker**, long) ()
#8  0x000000000054fa23 in cq_next(grpc_completion_queue*, gpr_timespec, void*)
    ()
#9  0x000000000054ff18 in grpc_completion_queue_next ()
#10 0x0000000000470676 in grpc_impl::CompletionQueue::AsyncNextInternal(void**, bool*, gpr_timespec) ()
#11 0x00000000004434f3 in grpc_impl::CompletionQueue::Next (
    this=0x7ffebe0370c0, tag=0x7ffed89fd2d8, ok=0x7ffed89fd2d3)
    at /local/include/grpcpp/impl/codegen/completion_queue_impl.h:179
#12 0x0000000000443152 in Sundial_Async_Client::contactRemoteDone (
---Type <return> to continue, or q <return> to quit---ret
    this=0x7ffff5a08240, cq=0x7ffebe0370c0, txn=0x7ffebe00a380, node_id=1, 
    response=0x0, count=1) at grpc/grpc_async_client.cpp:70
#13 0x000000000043b2dc in TxnManager::process_2pc_phase2 (this=0x7ffebe00a380, 
    rc=ABORT, cq=0x7ffebe0370c0) at system/txn.cpp:462
#14 0x000000000043a38c in TxnManager::start (this=0x7ffebe00a380)
    at system/txn.cpp:166
#15 0x000000000043f8c2 in WorkerThread::run (this=0x7fffdb133440)
    at system/worker_thread.cpp:92
#16 0x0000000000441ff9 in start_thread (thread=0x7fffdb133440)
    at system/main.cpp:204
#17 0x00007ffff64f36db in start_thread (arg=0x7ffed89ff700)
    at pthread_create.c:463
#18 0x00007ffff621c88f in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

线程 81

#0  0x00007ffff64f99f3 in futex_wait_cancelable (private=<optimized out>, 
    expected=0, futex_word=0x7ffec49d4fb0)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7ffff404c278, 
    cond=0x7ffec49d4f88) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7ffec49d4f88, mutex=0x7ffff404c278)
    at pthread_cond_wait.c:655
#3  0x000000000072f05a in gpr_cv_wait ()
#4  0x00000000006a12b0 in begin_worker ()
#5  0x00000000006a1757 in pollset_work ()
#6  0x000000000067dd5a in pollset_work(grpc_pollset*, grpc_pollset_worker**, long) ()
#7  0x00000000005219ea in grpc_pollset_work(grpc_pollset*, grpc_pollset_worker**, long) ()
#8  0x000000000054fa23 in cq_next(grpc_completion_queue*, gpr_timespec, void*)
    ()
#9  0x000000000054ff18 in grpc_completion_queue_next ()
#10 0x0000000000470676 in grpc_impl::CompletionQueue::AsyncNextInternal(void**, bool*, gpr_timespec) ()
#11 0x00000000004434f3 in grpc_impl::CompletionQueue::Next (
    this=0x7ffebb632000, tag=0x7ffec49d52d8, ok=0x7ffec49d52d3)
    at /local/include/grpcpp/impl/codegen/completion_queue_impl.h:179
#12 0x0000000000443152 in Sundial_Async_Client::contactRemoteDone (
---Type <return> to continue, or q <return> to quit---ret
    this=0x7ffff5a08240, cq=0x7ffebb632000, txn=0x7ffebb608000, node_id=1, 
    response=0x0, count=1) at grpc/grpc_async_client.cpp:70
#13 0x000000000043b2dc in TxnManager::process_2pc_phase2 (this=0x7ffebb608000, 
    rc=ABORT, cq=0x7ffebb632000) at system/txn.cpp:462
#14 0x000000000043a38c in TxnManager::start (this=0x7ffebb608000)
    at system/txn.cpp:166
#15 0x000000000043f8c2 in WorkerThread::run (this=0x7fffdb133e40)
    at system/worker_thread.cpp:92
#16 0x0000000000441ff9 in start_thread (thread=0x7fffdb133e40)
    at system/main.cpp:204
#17 0x00007ffff64f36db in start_thread (arg=0x7ffec49d7700)
    at pthread_create.c:463
#18 0x00007ffff621c88f in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

线程 83

#0  0x00007ffff621cbb7 in epoll_wait (epfd=3, events=0x7ffff404c2b0, 
    maxevents=100, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00000000006a0c7e in pollable_epoll(pollable*, long) ()
#2  0x00000000006a17c5 in pollset_work ()
#3  0x000000000067dd5a in pollset_work(grpc_pollset*, grpc_pollset_worker**, long) ()
#4  0x00000000005219ea in grpc_pollset_work(grpc_pollset*, grpc_pollset_worker**, long) ()
#5  0x000000000054fa23 in cq_next(grpc_completion_queue*, gpr_timespec, void*)
    ()
#6  0x000000000054ff18 in grpc_completion_queue_next ()
#7  0x0000000000470676 in grpc_impl::CompletionQueue::AsyncNextInternal(void**, bool*, gpr_timespec) ()
#8  0x00000000004434f3 in grpc_impl::CompletionQueue::Next (
    this=0x7ffebbe000c0, tag=0x7ffec39d32d8, ok=0x7ffec39d32d3)
    at /local/include/grpcpp/impl/codegen/completion_queue_impl.h:179
#9  0x0000000000443152 in Sundial_Async_Client::contactRemoteDone (
    this=0x7ffff5a08240, cq=0x7ffebbe000c0, txn=0x7ffebbe037e0, node_id=1, 
    response=0x0, count=1) at grpc/grpc_async_client.cpp:70
#10 0x000000000043b2dc in TxnManager::process_2pc_phase2 (this=0x7ffebbe037e0, 
    rc=ABORT, cq=0x7ffebbe000c0) at system/txn.cpp:462
#11 0x000000000043a38c in TxnManager::start (this=0x7ffebbe037e0)
    at system/txn.cpp:166
---Type <return> to continue, or q <return> to quit---ret
#12 0x000000000043f8c2 in WorkerThread::run (this=0x7fffdb133ec0)
    at system/worker_thread.cpp:92
#13 0x0000000000441ff9 in start_thread (thread=0x7fffdb133ec0)
    at system/main.cpp:204
#14 0x00007ffff64f36db in start_thread (arg=0x7ffec39d5700)
    at pthread_create.c:463
#15 0x00007ffff621c88f in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

似乎异步客户端偶尔会卡在cq-&gt;Next(&amp;got_tag, &amp;ok),我运行了这个执行超过 10,000 次,所以现在我怀疑 RPC 在联系期间丢失了。因此,服务器永远不会处理它,并且 rpc 也永远不会返回。我正在考虑是否需要找到一种保证交货的方法。以下是我的发送请求功能和检查响应功能。我想知道在这个卡点上我是否可以得到任何帮助。

Status Sundial_Async_Client:: contactRemote(CompletionQueue* cq, TxnManager * txn,uint64_t node_id,SundialRequest& request, SundialResponse** response)

        // Call object to store rpc data
        AsyncClientCall* call = new AsyncClientCall;

        // stub_->PrepareAsyncSayHello() creates an RPC object, returning
        // an instance to store in "call" but does not actually start the RPC
        // Because we are using the asynchronous API, we need to hold on to
        // the "call" instance in order to get updates on the ongoing RPC.
        call->response_reader =
            stub_->PrepareAsynccontactRemote(&call->context, request, cq);

        // StartCall initiates the RPC call
        call->response_reader->StartCall();
        call->reply=*response;
        // Request that, upon completion of the RPC, "reply" be updated with the
        // server's response; "status" with the indication of whether the operation
        // was successful. Tag the request with the memory address of the call object.
        call->response_reader->Finish(call->reply, &call->status, (void*)call);
  
    return Status::OK;


Status Sundial_Async_Client::contactRemoteDone(CompletionQueue* cq, TxnManager * txn,uint64_t node_id, SundialResponse* response, int count)
    void* got_tag;
        bool ok = false;
        int local_count=0;
        // Block until the next result is available in the completion queue "cq".
        while (cq->Next(&got_tag, &ok)) 
            local_count++;
            // The tag in this example is the memory location of the call object
            AsyncClientCall* call = static_cast<AsyncClientCall*>(got_tag);

            // Verify that the request was completed successfully. Note that "ok"
            // corresponds solely to the request for updates introduced by Finish().
            GPR_ASSERT(ok);


            // Once we're complete, deallocate the call object.
             //doing the cleaning
    glob_stats->_stats[GET_THD_ID]->_resp_msg_count[ call->reply->response_type() ] ++;
    glob_stats->_stats[GET_THD_ID]->_resp_msg_size[ call->reply->response_type() ] += call->reply->SpaceUsedLong();
            delete call;
            if(local_count==count)
            break;
        
        
        
     
    //txn->rpc_semaphore->decr();
      return Status::OK;
   

【问题讨论】:

【参考方案1】:

我认为这更有可能是一个逻辑问题。您确定每次都在检查响应之前发送请求吗?

【讨论】:

以上是关于gRPC 异步服务死锁/永远卡住的主要内容,如果未能解决你的问题,请参考以下文章

调度组等待永远卡住

运行命令 fastlane init 时,Fastlane 永远卡住

C# 异步命名管道永远等待

更新到 Android Studio 3.2 后,Gradle 构建卡住并且永远不会完成

死锁原理及代码

为啥 HTTP 请求永远不会通过异步等待返回?