Thread synchronization with interlocked ops hangs in Visual Studio 2013 C++ native code

Posted: 2014-05-04 18:13:12

Question:

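I am observing strange behavior with thread synchronization in C++ (64-bit Windows 8.1, Visual Studio 2013, native C++).

The goal is to obtain read access to an in-memory data structure (the "table"). The counter tableRIP tracks how many threads currently hold read access (there are 32 threads). A single thread may also hold write access to the table. While a thread holds write access, no thread may obtain read access. The bit CacheLock_WriterWaiting (=2) in cacheLocks is set while a thread has write access.

The code is as follows: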

volatile long cacheLocks; // bits below
enum CacheLockBit { CacheLock_Table,
                    CacheLock_LRUQ,
                    CacheLock_WriterWaiting,
                    CacheLock_Part
                  };
volatile short tableRIP; // # of readers now in process

Restart:
// Get read access to the table. If we need to write it, it will be changed to write access later.

InterlockedIncrement16(&tableRIP); // assume we will get read access
if(cacheLocks & (1<<CacheLock_WriterWaiting)) // non-zero if a writer is waiting or active
{
    InterlockedDecrement16(&tableRIP); // oops, a writer got in, so we're forbidden
    InterlockedIncrement64(&fc_Wait[0]); // counter for diagnostic purposes
    Wait(waitMs); // waitMs is a constant 1 (msec)
    goto Restart;
}
// Now we're a valid reader, and writer can't proceed till we've finished
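
For readers who want to experiment with the protocol outside of the Win32 Interlocked API, here is a minimal portable sketch of the same idea using std::atomic. The class name TableLock and its member names are hypothetical, not taken from the original code, and the sketch only mirrors the "increment the reader count, then check the writer flag" handshake described above. It is not the poster's implementation and does not by itself explain the hang.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the tableRIP / CacheLock_WriterWaiting pair.
// All operations use the default sequentially consistent ordering, which
// roughly corresponds to the full barriers implied by the Interlocked calls.
class TableLock {
public:
    // Reader side: announce ourselves first, then check for a writer.
    bool try_read_lock() {
        readers_.fetch_add(1);        // assume we will get read access
        if (writer_.load()) {         // a writer is waiting or active
            readers_.fetch_sub(1);    // back out; the caller sleeps and retries
            return false;
        }
        return true;                  // valid reader; a writer must wait for us
    }
    void read_unlock() { readers_.fetch_sub(1); }

    // Writer side: claim the flag, then wait until readers_drained() is true.
    bool try_write_lock() { return !writer_.exchange(true); }
    bool readers_drained() const { return readers_.load() == 0; }
    void write_unlock() { writer_.store(false); }

private:
    std::atomic<int>  readers_{0};    // plays the role of tableRIP
    std::atomic<bool> writer_{false}; // plays the role of the WriterWaiting bit
};
```

A caller would loop on try_read_lock() with a short sleep between attempts, as the original code does with Wait(waitMs) and goto Restart.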

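The inexplicable behavior is that the program hangs in this loop. When I break in with the debugger and single-step through the loop (details below), it exits immediately. It behaves AS IF the variable cacheLocks were not volatile (but it is, as you can see from the assembly code below).

At the time I looked, only one thread was active (this one). The other 31 were waiting for this one to finish, and a UI thread was also active, which does not access this data structure.

Since this is a release build, I am debugging at the assembly level and inspecting memory directly. Here is the code again, this time with the assembly as shown in the debugger: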

Restart:
// Get read access to the table. If we need to write it, it will be changed to write access later.

InterlockedIncrement16(&tableRIP); // assume we will get read access
00007FF789DF9970  lock inc    word ptr [rbx+2A4h] // (1) before 0, after 1 
if(cacheLocks & (1<<CacheLock_WriterWaiting)) // non-zero if a writer is waiting or active
00007FF789DF9978  mov         eax,dword ptr [rbx+2A0h]  // (5) eax -> 0
00007FF789DF997E  test        al,4  
00007FF789DF9980  je          $Restart+2Fh (7FF789DF999Fh)  
{
    InterlockedDecrement16(&tableRIP); // oops, a writer got in, so we're forbidden
00007FF789DF9982  lock dec    word ptr [rbx+2A4h]  
    InterlockedIncrement64(&fc_Wait[0]);
00007FF789DF998A  lock inc    qword ptr [rbx+1E0h]  
    Wait(waitMs);
00007FF789DF9992  mov         ecx,dword ptr [rbx+290h]  
00007FF789DF9998  call        Concurrency::wait (7FF789FB1000h) // (3) debugger breaks here  
    goto Restart; // (4)
00007FF789DF999D  jmp         FileCache::CacheInsureLoaded+0A6h (7FF789DF9966h)  
}
// Now we're a valid reader, and writer can't proceed till we've finished

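When I "break" the program with the debugger, the thread is inside the system routine Concurrency::wait. I step out of the system routines until I reach (4) in my code. I then examine the memory at rbx+2A4h (i.e. tableRIP), and it is 0. After single-stepping the inc, it is 1, as expected. Examining the memory at rbx+2A0h (i.e. cacheLocks), it is 0 at location (5) (i.e. no writer is active). One more step and we jump to $Restart+2Fh, exiting the loop.

The program spins in the loop for hours until the debugger is used to single-step through the assembly code. As the code above shows, the C++ compiler has correctly treated the variables tableRIP and cacheLocks as volatile: it loads them from memory every time. I note that the two variables are adjacent in memory. Is there some hardware feature I need to take into account? The processor is an Intel Core i7-4771.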
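
On the adjacency question: two variables sharing a cache line is normally a performance concern (false sharing) rather than a correctness one, because cache coherence keeps both values visible to every core. If one wanted to rule it out anyway, a sketch like the following places each variable on its own line. The struct name PaddedLockState, the helper separation(), and the 64-byte line size are assumptions made for illustration; they do not come from the original code.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical layout: each variable starts on its own 64-byte boundary.
// 64 bytes is an assumed cache-line size, typical for recent Intel parts.
struct PaddedLockState {
    alignas(64) std::atomic<long>  cacheLocks{0}; // lock bits
    alignas(64) std::atomic<short> tableRIP{0};   // readers in progress
};

// Distance in bytes between the two members of one instance.
inline std::size_t separation(const PaddedLockState& s) {
    const char* a = reinterpret_cast<const char*>(&s.cacheLocks);
    const char* b = reinterpret_cast<const char*>(&s.tableRIP);
    return static_cast<std::size_t>(b - a);
}
```

In the listing above the two variables sit at offsets 2A0h and 2A4h, so they do share a line; with this layout they would not.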

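EDIT: In response to questions on my post, here is more detailed code showing all operations on cacheLocks. It also shows the use of the cachePart[iPart] locks, which are fine-grained locks on the buffers; these are unrelated to the locking of the table, and not all uses of the buffer locks are shown.

Portions of the code not relevant to locking have been replaced with // PROCESS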

// Data members of class FileCache:
volatile long cacheLocks; // bits below
enum CacheLockBit { CacheLock_Table,
                    CacheLock_LRUQ,
                    CacheLock_WriterWaiting,
                    CacheLock_Part
                  };
volatile short tableRIP; // # of readers now in process

// Code from class FileCache:
Restart:
    // Get read access to the table. If we need to write it, it will be changed to write access later.

    InterlockedIncrement16(&tableRIP); // assume we will get read access
    if(cacheLocks & (1<<CacheLock_WriterWaiting)) // non-zero if a writer is waiting or active
    {
        InterlockedDecrement16(&tableRIP); // oops, a writer got in, so we're forbidden
        InterlockedIncrement64(&fc_Wait[0]);
        Wait(waitMs);
        goto Restart;
    }
    // Now we're a valid reader, and writer can't proceed till we've finished


// PROCESS
    if(iPart!=bs_NotInCache && iPart!=bs_Writing) // i.e., it's in cache and not in process of being written
    {
        if(cachePart[iPart].nLocks[lt_FileRead]==rl_FileReadLock)
        {
            // Another thread is setting a file read lock on this part, for unknown ix. Must wait, in case it's for this ix.
            InterlockedDecrement16(&tableRIP); // reader is no longer in progress
            InterlockedIncrement64(&fc_Wait[1]);
            Wait(waitMs);
            goto Restart;
        }

        // Lock this cache part
        while(InterlockedBitTestAndSet(&cachePart[iPart].partLocks, CacheLock_Part)) // returns 1 if bit (lock) was already set
        {
            InterlockedIncrement64(&fc_Wait[9]);
            Wait(waitMs);
        }

        while(cachePart[iPart].nLocks[lt_FileRead]!=0)
        {
            // Another thread is reading the desired block. Must wait till that is complete, then start over.
            InterlockedBitTestAndReset(&cachePart[iPart].partLocks, CacheLock_Part); // release the mutex
            InterlockedDecrement16(&tableRIP); // reader is no longer in progress
            InterlockedIncrement64(&fc_Wait[2]);
            Wait(waitMs);
            goto Restart;
        }

// PROCESS
        InterlockedDecrement16(&tableRIP); // reader is no longer in progress

        return iPart;
    }

    else if(partFromBlock[block]==bs_Writing)
    {
        // Another thread is writing this block--must wait till it's finished, then try again
        if(debugPartFromBlock)
            PartFromBlockCheck(workerThreadNum);
        InterlockedDecrement16(&tableRIP); // reader is no longer in progress
        InterlockedIncrement64(&fc_Wait[6]);
        Wait(waitMs);
        goto Restart;
    }

    // Desired block isn't in cache; must read it from file.
    // Now we need a write lock.
    InterlockedDecrement16(&tableRIP); // we're no longer a reader
    if(InterlockedBitTestAndSet(&cacheLocks, CacheLock_WriterWaiting)) // get 'writer active' status
    {
        InterlockedIncrement64(&fc_Wait[7]);
        Wait(waitMs);
        goto Restart;
    }

    // We have 'writer active' set, but we need to wait for all readers to finish
    while(tableRIP > 0)
    {
        InterlockedIncrement64(&fc_Wait[8]);
        Wait(waitMs);
    }

    // Now this thread is the only one accessing the table
    iPart=CacheFill(workerThreadNum, clt, ix, block, lType);
    if(iPart<0)
    {
        // CacheFill was unable to lock the part
        unsigned char locks=InterlockedBitTestAndReset(&cacheLocks, CacheLock_WriterWaiting); // no longer writer active
        InterlockedIncrement64(&fc_Wait[3]);
        Wait(waitMs);
        goto Restart;
    }

    // Convert the file read lock to the desired lock type. Current lock should be exactly 1 file read.
    long long locks=InterlockedCompareExchange64(&cachePart[iPart].allLocks,1LL<<(lType*16),LOCKPARTS(0,0,1,0));

    return iPart;


// Read a block which contains the desired bit into a cache part. Table is 'writer active'.
// If the cache has been modified, write it first.
// Returns the part #, and turns off 'writer active'.
int FileCache::CacheFill(const int workerThreadNum, const CacheLocType clt, const DBIndex ix, const unsigned long long block, const LockType lType)
{
    int retries=0;
Restart:
// PROCESS
    // Found an eligible part--try to lock it. To succeed, there must be no locks of any kind on the cache part
    long long locks=InterlockedCompareExchange64(&cachePart[iPartLRU].allLocks,LOCKPARTS(0,0,rl_FileReadLock,0),0);
    if(locks!=0)
    {
        if(retries>10)
            return -2; // unable to find an available buffer; wait till one becomes available
        InterlockedIncrement64(&fc_Wait[4]);
        Wait(waitMs);
        retries++;
        goto Restart; // try for another
    }

// PROCESS
    locks=InterlockedBitTestAndReset(&cacheLocks, CacheLock_WriterWaiting); // no longer writer active

// PROCESS
    // Now the old block (if any) is gone, so we can remove it from the table
    if(oldBase!=NoIndex)
    {
        // Lock table again. It was unlocked so other threads could run while we were writing.
        // However, another writer is not allowed to remove the 'oldBase' part.
        while(InterlockedBitTestAndSet(&cacheLocks, CacheLock_WriterWaiting)) // returns 1 if bit (lock) was already set
        {
            InterlockedIncrement64(&fc_Wait[5]);
            Wait(waitMs);
        }
        // No other threads can access the table, except readers in progress. We have to wait for those to finish.
        while(tableRIP > 0)
        {
            InterlockedIncrement64(&fc_Wait[10]);
            Wait(waitMs);
        }
        unsigned char locks=InterlockedBitTestAndReset(&cacheLocks, CacheLock_WriterWaiting); // no longer writer active
    }

// PROCESS
    return iPart;
}

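Comments:

So is the problem that it never exits the loop, or that it never loops? In any case, the atomic ops seem to be irrelevant, right?

Okay, upon rereading, it seems that it never exits when run for real, but does exit when run in the debugger. Anyway, I think the problem lies in the writing of cacheLocks. Note that you have to pay close attention to the memory consistency model. x86 is pretty strict, but not fully sequential.

Guaranteeing the absence of deadlock in a program using 32 threads is very hard to prove. A starting point is to use a tested reader-writer lock implementation. Your question provides the obvious candidate.

It looks like cacheLocks is not being cleared by the writing thread. Can you log the setting and clearing of the write lock? Could there be a race on the write lock? Is it atomic?

kec: The program sometimes hangs (does not exit the loop). It always runs under the debugger.

Answer 1: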

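Additional information:

Referring to the initial loop, it looks as if what is happening is that Wait(1 msec) never returns. What should happen is that, after a brief wait, the thread checks cacheLocks again.

When the program is "broken" by the debugger and then continued, the hung thread proceeds, finds cacheLocks = 0, and exits the loop.

The Wait function is simply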

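void Wait(const int msec)
{
    Concurrency::wait(msec);
    return;
}

All worker threads run at "lowest" priority, except the main UI thread, which runs at "normal". I cannot find any detailed information about how Concurrency::wait(msec) interacts with the Windows thread scheduler.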
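
One way to test whether Concurrency::wait itself plays a role would be to swap in a plain OS-level sleep and see whether the hang persists. The sketch below uses std::this_thread::sleep_for from the C++ standard library; the name WaitPortable and its return value are hypothetical additions for illustration. This is a diagnostic experiment under those assumptions, not a confirmed fix.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical replacement for Wait(): blocks the calling OS thread directly
// instead of going through the Concurrency Runtime.
// Returns the number of milliseconds actually requested (negative clamps to 0).
inline int WaitPortable(int msec) {
    if (msec < 0)
        msec = 0;
    std::this_thread::sleep_for(std::chrono::milliseconds(msec));
    return msec;
}
```

If the program no longer hangs with this substitution, that would point toward the wait call or its scheduler rather than the lock protocol.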

Perhaps someone can explain why this happens?
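
One of the comments on the question recommends starting from a tested reader-writer lock rather than a hand-rolled one. As a rough illustration of what that could look like, here is a sketch using std::shared_mutex. Note the assumptions: std::shared_mutex requires C++17, which Visual Studio 2013 does not provide (on that toolchain the Windows SRWLOCK API would be the analogous choice), and the class GuardedTable with its lookup and insert members is hypothetical, loosely modeled on the partFromBlock table in the question.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <shared_mutex>

// Hypothetical table wrapper: many concurrent readers, one exclusive writer.
class GuardedTable {
public:
    // Shared (read) access: any number of threads may hold this at once.
    bool lookup(unsigned long long block, int& part) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        auto it = partFromBlock_.find(block);
        if (it == partFromBlock_.end())
            return false;
        part = it->second;
        return true;
    }

    // Exclusive (write) access: blocks until all readers have released.
    void insert(unsigned long long block, int part) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        partFromBlock_[block] = part;
    }

private:
    mutable std::shared_mutex mutex_;
    std::map<unsigned long long, int> partFromBlock_;
};
```

With this shape there is no reader counter or writer bit to maintain by hand, and no polling loop around Wait.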
