Reading "RDMA vs. RPC for Implementing Distributed Data Structures"

Posted in 布鲁斯的读书圈 (Bruce's reading circle)

I have recently been studying RDMA applications and testing methods, and found some papers on SCI-HUB to read. As a newcomer to this area, I decided to start by translating a few of the most relevant papers in full. Doing so has at least two benefits: (1) translating a paper demands, and produces, a deeper understanding than simply reading it; (2) when writing in English later, I can search these translations in Chinese to recover fairly standard English phrasing.

Since the paper is long, it is published here in several installments; this part covers the introduction and background.


RDMA vs. RPC for Implementing Distributed Data Structures

Abstract

Distributed data structures are key to implementing scalable applications for scientific simulations and data analysis. In this paper we look at two implementation styles for distributed data structures: remote direct memory access (RDMA) and remote procedure call (RPC). We focus on operations that require individual accesses to remote portions of a distributed data structure, e.g., accessing a hash table bucket or distributed queue, rather than global operations in which all processors collectively exchange information. We look at the trade-offs between the two styles through microbenchmarks and a performance model that approximates the cost of each. The RDMA operations have direct hardware support in the network and therefore lower latency and overhead, while the RPC operations are more expressive but higher cost and can suffer from lack of attentiveness from the remote side. We also run experiments to compare the real-world performance of RDMA- and RPC-based data structure operations with the predicted performance to evaluate the accuracy of our model, and show that while the model does not always precisely predict running time, it allows us to choose the best implementation in the examples shown. We believe this analysis will assist developers in designing data structures that will perform well on current network architectures, as well as network architects in providing better support for this class of distributed data structures.

Index Terms—distributed data structures, remote procedure call (RPC), remote direct memory access (RDMA)

I. INTRODUCTION

Many complex programs need to perform operations on abstract data structures, such as hash tables, queues, and arrays. While many mature, high quality libraries exist that provide implementations of abstract data structures for serial and multi-threaded programs, the development of techniques for high-level data structures for distributed programs is still an active area of research [1]–[3]. Of particular interest are distributed data structures for irregular applications, where data access patterns and volumes are not known in advance.

These applications commonly use data structures which may be complex to implement using traditional message passing methods in a distributed memory setting, including graphs, trees, hash tables, and distributed queues. Some recently developed distributed data structure libraries are founded on remote direct memory access (RDMA), meaning that all essential data structure operations will be executed using one-sided remote put, get, and atomic operations [2], [3]. These data structure operations have the potential to be very efficient and to offer low latency, since they operate directly on remote data structure elements and can be executed directly by the network interface card (NIC) on most modern supercomputer and datacenter systems.

Other high-level programming environments encourage users to use remote procedure call (RPC) software primitives to build distributed data structures [4]. While RPCs require the attention of a remote CPU, which leads to higher latency, they have the potential to be much more expressive than the RDMA operations available on today's interconnects, potentially leading to fewer round trips.

In this paper, we evaluate the efficacy of RDMA- and RPC-based manipulation of distributed data structures with a set of systematic benchmarks. We perform two sets of experiments. First, we perform microbenchmarks to gather the costs of the component operations that make up both RDMA- and RPC-based data structure implementations. This includes the cost of various RDMA operations along with the cost of an RPC. Second, we measure the actual costs of various data structure operations, such as queue or hash table insertions, at various levels of concurrency requirements using both RDMA- and RPC-based implementations. We then compare the observed results with an analytical cost model and determine where it is more desirable to use RDMA and where it is more desirable to use RPC.

We break down the cost of RDMA-based data structure operations in terms of an analytical cost model, which we use to predict the cost of RDMA-based data structure operations based on the real measured cost of the component operations. This type of analysis helps us to determine why RDMA-based operations are expensive, when they are expensive, and to focus on (1) where data structures could be improved to run on current-generation network hardware (e.g., by avoiding expensive operations), and (2) which operations hardware designers might focus on optimizing in order to better support distributed data structures. Our paper has three main contributions.

• We present a set of microbenchmarks that determine the cost of various component operations for RDMA-based distributed data structures on a modern supercomputer network.

• We present an analytical cost model which can be used to estimate the cost of RDMA-based distributed data structure operations, based on the component costs.

• We provide a comparison of RDMA- and RPC-based distributed data structure performance for queue and hash table data structures at variable levels of concurrency requirements.

II. BACKGROUND

A. Remote Direct Memory Access

Remote direct memory access (RDMA) provides an interface to manipulate remote data in a one-sided manner, meaning that an origin process can perform operations on the remote memory of a target process without any explicit coordination with the target. This is commonly executed by having the target’s network interface card (NIC) directly communicate with its on-node memory, resulting in very low round-trip latency on the order of a microsecond. Low-latency RDMA primitives are now available on a number of supercomputer interconnects, including Cray Aries and Infiniband. RDMA is also increasingly available on datacenter commodity hardware through RDMA over converged Ethernet (RoCE).

For the purposes of this paper, we consider a common set of RDMA operations available in most modern supercomputer and datacenter systems. This set includes remote put and remote get, which can be of variable size, along with the fixed-size 32 and 64-bit atomic memory operations (AMOs) compare-and-swap and fetch-and-op. Some have proposed an expanded set of RDMA operations to support various types of remote and distributed data structures, such as the Infiniband extended atomics API [5]. In addition, there are recently proposed API extensions to RDMA which would allow for more expressive RDMA operations [1]. These APIs are outside the scope of this paper.
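
[Translator's note: to make these component operations concrete, here is a minimal sketch of the same operation set expressed through MPI's one-sided (RMA) interface, one of several possible backends (BCL, discussed below, can run over MPI, OpenSHMEM, or GASNet-EX). Window setup and error handling are elided, and the displacements are purely illustrative.]

```cpp
#include <mpi.h>
#include <cstdint>

// A sketch of the RDMA operation set discussed above, via MPI RMA. Any
// backend exposing remote put/get and 32/64-bit AMOs would look similar.
void rdma_primitives_demo(MPI_Win win, int target_rank) {
  MPI_Win_lock_all(0, win);  // passive-target epoch: no remote CPU involvement

  // Variable-size remote put and get.
  double buf[8] = {};
  MPI_Put(buf, 8, MPI_DOUBLE, target_rank, /*target_disp=*/0,
          8, MPI_DOUBLE, win);
  MPI_Get(buf, 8, MPI_DOUBLE, target_rank, 0, 8, MPI_DOUBLE, win);

  // Fixed-size 64-bit AMOs: compare-and-swap and fetch-and-op.
  int64_t compare = 0, swap = 1, result = 0;
  MPI_Compare_and_swap(&swap, &compare, &result, MPI_INT64_T,
                       target_rank, /*target_disp=*/64, win);

  int64_t one = 1, old = 0;
  MPI_Fetch_and_op(&one, &old, MPI_INT64_T, target_rank,
                   /*target_disp=*/72, MPI_SUM, win);  // fetch-and-add

  MPI_Win_flush(target_rank, win);  // wait for remote completion
  MPI_Win_unlock_all(win);
}
```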

RDMA-Based Data Structures: Distributed data structures can be directly built on top of one-sided RDMA operations, so that all major data structure operations will be executed with RDMA. Examples of such partitioned global address space (PGAS) distributed data structure libraries include BCL, DASH, and Multipol [2], [3], [6]. Similar to shared memory concurrent data structures, these libraries are built to use a shared global memory space, with synchronization using atomics when necessary, to operate upon shared data. However, unlike shared memory data structures, the component costs and synchronization models of distributed programming frameworks can be quite different, so care must be taken to design data structures accordingly.

As shown in Figure 1, libraries can use RDMA operations, which will be directly executed by the target process' NIC, to operate on remote data. There are two remote memory operations in this code example: CAS, which is a remote compare-and-swap operation, and RPUT, which is a remote put operation. In the best case, our inserting process will perform a remote compare-and-swap, succeed in reserving the first hash table slot, and then perform a remote put operation. This would have a cost of A_CAS + W, that is, the cost of a compare-and-swap operation and a write. However, in the case of a hash table collision, the algorithm will move on to the next available slot, and multiple round trips may be required to perform the insert operation. The particular hash table shown here is a hash table with open addressing and linear probing.

Observant readers of Figure 1 will also notice that the listed code is not fully atomic. While the code is atomic with respect to concurrent insert operations, there is no guarantee that the remote put operation will finish before a remote find operation reads the half-written value. If we wish for our insert operation to be atomic with respect to concurrent find operations, we will require a second fetch-and-op operation to mark the slot as ready for reading after the remote put operation has finalized. This would increase the best-case cost of the remote insert operation to A_CAS + W + A_FAO. So, depending on an application's atomicity requirements, data structure operations over RDMA may have different best-case costs. Also, depending on a particular execution of the application, the observed cost of a method may vary due to the number of round trips caused by contention.
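
[Translator's note: Figure 1 is not reproduced in this post, so the following is a hedged reconstruction of the insert it describes: open addressing with linear probing, a remote CAS to reserve a slot, a remote put for the element, and an optional fetch-and-op that publishes the slot to concurrent finds. The remote_cas/remote_put/remote_fao_add wrappers, the slot states, and owner_of are illustrative assumptions, not BCL's actual API.]

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical blocking one-sided wrappers over the primitives above;
// none of these names come from the paper or from BCL.
int64_t remote_cas(int rank, std::size_t slot, int64_t expected, int64_t desired);
void    remote_put(int rank, std::size_t slot, const void* data, std::size_t size);
void    remote_fao_add(int rank, std::size_t slot, int64_t value);
int     owner_of(std::size_t slot);  // assumed slot-to-process mapping

constexpr int64_t EMPTY = 0, RESERVED = 1, READY = 2;

// Reconstruction of the Figure 1 insert: claim a slot with a remote CAS,
// then write the element with a remote put (best case A_CAS + W). The
// optional final fetch-and-op publishes the slot to concurrent finds,
// raising the best case to A_CAS + W + A_FAO.
bool hash_insert(uint64_t key, const void* elem, std::size_t elem_size,
                 std::size_t table_size, bool concurrent_finds) {
  for (std::size_t i = 0; i < table_size; ++i) {
    std::size_t slot = (key + i) % table_size;  // open addressing, linear probing
    int rank = owner_of(slot);
    // Each failed CAS (a collision) costs an extra network round trip.
    if (remote_cas(rank, slot, EMPTY, RESERVED) == EMPTY) {
      remote_put(rank, slot, elem, elem_size);          // W
      if (concurrent_finds)
        remote_fao_add(rank, slot, READY - RESERVED);   // A_FAO
      return true;
    }
  }
  return false;  // table full
}
```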

B. Remote Procedure Calls

Remote procedure calls (RPCs), in contrast to RDMA operations, allow an origin process to remotely trigger the execution of a procedure on a target process. RPCs have the advantage of being more expressive than RDMA. While control flow in individual RDMA operations is limited to single-instruction atomics like compare-and-swap and fetch-and-op, RPCs can include complex control flow and arbitrary computation. This allows more complex data structure operations, such as inserting into a hash table, pushing a value onto a queue, or even modifying a dynamically sized data structure, to be performed with a single communication event. However, this added expressivity comes at a greater latency cost, since an RPC operation must wait for the target process to enter a progress function or interrupt the processor and make a function call to execute the procedure. It also changes load balancing across processors, moving away from the clearly defined SPMD model of execution in ways that can shift computational workload, intentionally or not.

In this paper, we consider a restricted type of RPC called an active message (AM) [7]. For the purposes of this paper, AMs have the following restrictions: (1) active message handlers may not send additional active messages, except for a single response to the origin process and (2) active message handlers may not perform network communication. These restrictions allow for a high-performance, low-latency implementation of active messages with bounded buffer space [8], [9].

RPC Data Structures: An implementation of a distributed data structure operation with RPCs requires two parts: (1) a handler function, which is the procedure that will be executed on the target process, and (2) a wrapper function, which is the function directly called by the user on the origin process and the code that will issue the RPC request. RPC data structure implementations can be quite simple, as shown in Figure 2. The wrapper function insert uses a hash function to map data to the appropriate nodes, then issues an RPC with the handler function insert_handler. The handler function in this case simply inserts the key and value into a local hash table.

In contrast to the hash table implementation based on RDMA communication, this implementation will typically only require a single round trip over the network, since the origin node can push the RPC request onto the target node's RPC queue in a single network operation, then the target node can execute the necessary control flow to unpack and store the data. However, crucially, the number of network operations is unrelated to the control flow logic inside the data structure operation, which takes place on the target side inside the RPC function. Depending on the specific manner in which the handler function will be called (either serially or simultaneously with other threads), the handler function may require local atomic operations or other mechanisms for synchronization. However, these local mechanisms are significantly cheaper than remote memory operations.
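
[Translator's note: Figure 2 is likewise not reproduced here; the sketch below reconstructs the wrapper/handler pair it describes. The rpc launcher, the nprocs helper, and the local table are assumptions standing in for whatever RPC runtime the paper uses, not its actual code.]

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <mutex>

// Assumed RPC runtime: enqueue fn(args...) for execution on `target`,
// returning once the request has been handed to the network.
template <typename Fn, typename... Args>
void rpc(int target, Fn&& fn, Args&&... args);
int nprocs();  // assumed: total number of processes

// Local shard of the distributed hash table, one per process. A mutex is
// one possible local synchronization mechanism if handlers may run
// concurrently; local synchronization is far cheaper than remote AMOs.
std::unordered_map<uint64_t, std::string> local_table;
std::mutex local_table_mutex;

// (1) Handler: runs on the target process, inside its RPC progress engine.
void insert_handler(uint64_t key, std::string value) {
  std::lock_guard<std::mutex> guard(local_table_mutex);
  local_table.emplace(key, std::move(value));
}

// (2) Wrapper: called by the user on the origin process. One network
// operation pushes the request onto the target's RPC queue, regardless
// of how much control flow the handler then runs remotely.
void insert(uint64_t key, std::string value) {
  int target = key % nprocs();  // hash-based placement of the bucket
  rpc(target, insert_handler, key, std::move(value));
}
```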

One important detail not directly illustrated by the above code listing is that the execution of the handler function is dependent on the attentiveness of the target process, which must enter a progress function in order for its RPC queue to be serviced. While the liveness of RDMA operations is guaranteed by the network interface card, which will be constantly servicing instructions regardless of CPU state, RPC-based systems must either dedicate specific resources, such as a progress thread, to ensure attentiveness, or else pay the possible latency cost associated with waiting until the target process finishes its computation and enters a call to the RPC progress function.
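
[Translator's note: as an illustration of the "dedicated resources" option, here is a hedged sketch of a progress thread that keeps a process's RPC queue serviced; the queue type and request interface are assumed, not taken from the paper.]

```cpp
#include <atomic>
#include <thread>
#include <functional>

// Assumed thread-safe queue of pending RPC requests (one per process),
// fed by the communication layer in a real runtime.
struct RpcQueue {
  bool try_pop(std::function<void()>& request);  // assumed non-blocking pop
};

// One way to guarantee attentiveness: dedicate a core to a progress thread
// so incoming RPCs are serviced even while the main thread computes. The
// alternative is calling a progress function periodically from the main
// thread and accepting the extra latency between calls.
void progress_loop(RpcQueue& queue, std::atomic<bool>& running) {
  std::function<void()> request;
  while (running.load(std::memory_order_relaxed)) {
    while (queue.try_pop(request)) request();  // execute handler locally
  }
}

// Usage sketch: std::thread t(progress_loop, std::ref(queue), std::ref(running));
```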

C. The Berkeley Container Library

In this paper, we compare the performance of RDMA-based implementations of distributed data structures to RPC-based implementations. For the RDMA-based implementations, we will benchmark data structures provided in the Berkeley Container Library (BCL). BCL is a cross-platform library of distributed data structures that supports running on top of MPI, OpenSHMEM, and GASNet-EX. BCL is a header-only library and is designed to offer high-level interfaces without any runtime cost for abstraction. BCL data structures are built using remote put, remote get, and remote atomic operations such as atomic compare-and-swap and fetch-and-op.

Performance Model: Data structure operations in BCL can be characterized in terms of an analytical cost model, which characterizes the best-case costs of data structure operations in terms of the component RDMA operations. The component costs include remote get, remote put, compare-and-swap, and fetch-and-op operations. We do not distinguish different fetch-and-op operations in this performance model, since the operations involved are typically simple binary functions such as fetch-and-add or fetch-and-XOR, which have very low cost compared to the inherent network latency. A summary of these operations and the associated notation are shown in Table I.
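
[Translator's note: Table I is not reproduced here, but its notation can be reconstructed from the text; the symbols below match the A_CAS + W expressions used in Section II-A.]

```latex
% Component costs (Table I notation, reconstructed from the text):
%   R        cost of a remote get (read)
%   W        cost of a remote put (write)
%   A_CAS    cost of a remote atomic compare-and-swap
%   A_FAO    cost of a remote atomic fetch-and-op (add, XOR, ...)
% Best-case hash table insert costs quoted in Section II-A:
T_{\mathrm{insert}}^{\text{no concurrent finds}} = A_{\mathrm{CAS}} + W
\qquad
T_{\mathrm{insert}}^{\text{concurrent finds}} = A_{\mathrm{CAS}} + W + A_{\mathrm{FAO}}
```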

Alternate Implementations: As discussed in Section II-A, there are different levels of concurrency requirements with which RDMA-based data structure operations can be implemented, depending on the specific needs of an application. BCL exposes multiple implementations of data structure operations using a mechanism called concurrency promises, which allows users to optionally specify the operations that could occur concurrently with the operation being issued.

To illustrate the different levels of concurrency requirements with which a data structure operation could be implemented, consider the case of a hash table insertion with arbitrarily large keys and values. Inserting an element into such a hash table will, in the general case, require at least two atomic memory operations and a write. The first atomic memory operation requests a lock on the bucket into which the element will be inserted, the write actually writes the value into the distributed hash table, and a final unlock operation signals that the bucket is ready to be read after the write has completed. In this hash table implementation, without the final atomic memory operation, concurrent find operations might read halfway-written data, resulting in an incorrect program execution. However, in the guaranteed absence of concurrent find operations within a barrier region, we can elide the final atomic memory operation, since the following barrier will ensure that the write completes before any find operations may be issued.
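
[Translator's note: a hedged sketch of how a concurrency-promise argument might select between the two variants just described, using the CW/CR/CRW notation defined in the next paragraph. The interface and helper names are illustrative assumptions, not BCL's documented API.]

```cpp
#include <cstddef>

// Illustrative concurrency-promise flags: the caller promises which other
// operations may run concurrently with this one.
enum ConcurrencySet : unsigned { CW = 1, CR = 2, CRW = CW | CR };

// Hypothetical one-sided helpers, as in the earlier sketches.
bool remote_try_lock_bucket(int rank, std::size_t bucket);   // AMO (lock)
void remote_put_kv(int rank, std::size_t bucket, const void* kv, std::size_t n);
void remote_signal_ready(int rank, std::size_t bucket);      // AMO (unlock/ready)

// Insert with arbitrarily large keys and values: lock, write, and, only when
// concurrent finds are promised, the final AMO that marks the bucket readable.
void insert(int rank, std::size_t bucket, const void* kv, std::size_t n,
            ConcurrencySet promise) {
  while (!remote_try_lock_bucket(rank, bucket)) { /* bucket contended: retry */ }
  remote_put_kv(rank, bucket, kv, n);
  if (promise & CR) {
    remote_signal_ready(rank, bucket);  // finds may race with this write
  }
  // Under a CW-only promise the final AMO is elided: the enclosing barrier
  // guarantees the write completes before any find can be issued.
}
```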

Similar levels of concurrency requirements exist for both hash table insert and find operations, as well as operations on queues. Tables II and III show some of the data structure implementations available in BCL’s hash table and queue implementations, along with the associated best case costs. In the notation used in this paper, CW indicates that an operation is allowable with concurrent writes (pushes or inserts), while CR indicates that an operation is allowable with concurrent reads (pops or finds) and CRW indicates the operation is allowable with either.

(To be continued)
