DPDK Programmer's Guide (Translation), Part 6

Posted 2023-04-23

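The Mbuf library provides the ability to allocate and free buffers (mbufs), which DPDK applications use to store message buffers. The message buffers are stored in a mempool, using the Mempool Library.

The rte_mbuf data structure can carry network packet buffers or generic control message buffers (indicated by CTRL_MBUF_FLAG), and it can be extended to other types. The rte_mbuf header structure is kept as small as possible: it currently uses just two cache lines, with the most frequently used fields on the first one.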

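To store the packet data (including protocol headers), two approaches were considered:

1. Embed the metadata within a single memory buffer: the structure is followed by a fixed-size area for the packet data.
2. Use separate memory buffers for the metadata structure and for the packet data.

The advantage of the first approach is that it needs only one operation to allocate or free the whole memory representation of a packet. The second approach, however, is more flexible and allows the allocation of metadata structures to be completely separated from the allocation of packet data buffers.

DPDK chose the first approach. The metadata contains control information such as message type, length, and offset to the start of the data, as well as a pointer to an additional mbuf structure that allows buffer chaining.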

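Message buffers that carry network packets can handle buffer chaining, where multiple buffers are required to hold the complete packet. This is the case for jumbo frames, which are composed of many mbufs linked together through their next field.

For a newly allocated mbuf, the data begins RTE_PKTMBUF_HEADROOM bytes after the beginning of the buffer, which is cache aligned. Message buffers may carry control information, packets, events, and so on between different entities in the system. Message buffers may also use their buffer pointers to point to the data sections of other message buffers or to other structures.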

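The Buffer Manager implements a fairly standard set of buffer access operations to manipulate network packets.

The Buffer Manager uses the Mempool Library to allocate buffers. This ensures that packet headers are spread evenly across memory channels, which benefits L3 processing. An mbuf contains a field indicating the pool it was allocated from. When rte_ctrlmbuf_free(m) or rte_pktmbuf_free(m) is called, the mbuf returns to its original pool.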

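Packet and control mbuf constructors are provided by the API. The rte_pktmbuf_init() and rte_ctrlmbuf_init() functions initialize the fields of the mbuf structure that the user does not modify once the mbuf is created (mbuf type, origin pool, buffer start address, and so on). The constructor is passed as a callback to rte_mempool_create() at pool creation time.

Allocating a new mbuf requires the user to specify the mempool from which it should be taken. Any newly allocated mbuf contains one segment with a length of 0. The offset to the data is initialized so that the buffer has some bytes (RTE_PKTMBUF_HEADROOM) of headroom.

Freeing an mbuf means returning it to its original mempool. The content of an mbuf is not modified while it is stored in a pool (as a free mbuf). Fields initialized by the constructor do not need to be re-initialized at mbuf allocation.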

When a packet mbuf that contains several segments is freed, all of its segments are freed and returned to their original mempool.
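Chaining and freeing can be pictured with a small toy model in plain C. This is only a sketch: `toy_pool`, `toy_seg`, and the helper functions are hypothetical names invented for illustration and are not the DPDK API. The model assumes that only the first segment keeps the packet-wide totals, and that every segment remembers the pool it came from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy pool: only counts how many buffers are currently free. */
struct toy_pool {
    unsigned free_count;
};

/* Toy segment: a stand-in for the handful of mbuf fields used here. */
struct toy_seg {
    struct toy_seg  *next;     /* next segment, NULL at the end of the chain */
    struct toy_pool *pool;     /* pool this segment was taken from */
    uint16_t         data_len; /* bytes held by this segment */
    uint16_t         nb_segs;  /* segments in the chain (first segment only) */
    uint32_t         pkt_len;  /* total packet bytes (first segment only) */
};

/* Take one segment from the pool and describe a single-segment packet. */
static void toy_seg_take(struct toy_seg *s, struct toy_pool *p, uint16_t len)
{
    assert(s != NULL && p != NULL && p->free_count > 0);
    p->free_count--;
    s->next = NULL;
    s->pool = p;
    s->data_len = len;
    s->nb_segs = 1;
    s->pkt_len = len;
}

/* Append the chain starting at tail to the chain starting at head. */
static void toy_chain(struct toy_seg *head, struct toy_seg *tail)
{
    struct toy_seg *last = head;

    assert(head != NULL && tail != NULL);
    while (last->next != NULL)
        last = last->next;
    last->next = tail;
    /* Packet-wide totals live in the first segment only. */
    head->nb_segs = (uint16_t)(head->nb_segs + tail->nb_segs);
    head->pkt_len += tail->pkt_len;
}

/* Free every segment of the packet back to the pool it came from. */
static void toy_free_chain(struct toy_seg *head)
{
    while (head != NULL) {
        struct toy_seg *next = head->next;

        head->next = NULL;
        head->pool->free_count++;
        head = next;
    }
}
```

Taking three segments of 1500, 1500, and 1000 bytes from a pool of four, chaining them, and then freeing the head leaves the first segment reporting 3 segments and 4000 bytes before the free, and the pool back at four free buffers afterwards.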

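This library provides some functions for manipulating the data in a packet mbuf. For instance:

- Get the data length
- Get a pointer to the start of the data
- Prepend data before the data
- Append data after the data
- Remove data at the beginning of the buffer (rte_pktmbuf_adj())
- Remove data at the end of the buffer (rte_pktmbuf_trim())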
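A toy model in plain C may help picture how this kind of data manipulation interacts with the headroom. It is a sketch under stated assumptions, not the DPDK implementation: `toy_mbuf`, the `toy_*` helpers, and the two size constants are hypothetical, and the model covers a single segment only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes; the real values come from the DPDK build configuration. */
#define TOY_HEADROOM 128u
#define TOY_BUF_LEN  2048u

/* Toy single-segment mbuf: metadata followed by an embedded data buffer. */
struct toy_mbuf {
    uint16_t data_off;          /* offset from buf[0] to the first data byte */
    uint16_t data_len;          /* number of data bytes */
    uint8_t  buf[TOY_BUF_LEN];  /* headroom + data + tailroom */
};

/* State of a freshly allocated mbuf: no data, headroom reserved. */
static void toy_init(struct toy_mbuf *m)
{
    assert(m != NULL);
    m->data_off = TOY_HEADROOM;
    m->data_len = 0;
}

static uint8_t *toy_data(struct toy_mbuf *m)
{
    return m->buf + m->data_off;
}

/* Grow the data area backwards into the headroom; NULL if it does not fit. */
static uint8_t *toy_prepend(struct toy_mbuf *m, uint16_t len)
{
    if (len > m->data_off)
        return NULL;
    m->data_off = (uint16_t)(m->data_off - len);
    m->data_len = (uint16_t)(m->data_len + len);
    return toy_data(m);
}

/* Grow the data area forwards into the tailroom; returns the old tail. */
static uint8_t *toy_append(struct toy_mbuf *m, uint16_t len)
{
    uint16_t tailroom = (uint16_t)(TOY_BUF_LEN - m->data_off - m->data_len);
    uint8_t *old_tail = toy_data(m) + m->data_len;

    if (len > tailroom)
        return NULL;
    m->data_len = (uint16_t)(m->data_len + len);
    return old_tail;
}

/* Remove len bytes at the beginning of the data; NULL if there are fewer. */
static uint8_t *toy_adj(struct toy_mbuf *m, uint16_t len)
{
    if (len > m->data_len)
        return NULL;
    m->data_off = (uint16_t)(m->data_off + len);
    m->data_len = (uint16_t)(m->data_len - len);
    return toy_data(m);
}

/* Remove len bytes at the end of the data; -1 if there are fewer. */
static int toy_trim(struct toy_mbuf *m, uint16_t len)
{
    if (len > m->data_len)
        return -1;
    m->data_len = (uint16_t)(m->data_len - len);
    return 0;
}
```

The point of reserving headroom shows up in `toy_prepend()`: a header can be pushed in front of existing data without moving that data, as long as the header fits in the bytes left before `data_off`.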

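Some information about a packet is retrieved by the network driver and stored in the mbuf to make processing easier, for instance the VLAN tag, the RSS hash result (see Poll Mode Driver), and a flag indicating that the checksum was computed by hardware.

An mbuf also records the input port the packet came from and the number of segment mbufs in the chain. For chained buffers, only the first mbuf of the chain stores this meta information.

On the RX side, this is the case, for example, for the IEEE1588 packet timestamp mechanism, VLAN tagging, and IP checksum computation.

On the TX side, an application can also delegate some processing to the hardware. For example, the PKT_TX_IP_CKSUM flag allows the computation of the IPv4 checksum to be offloaded.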

Different TX offloads can be configured on a VXLAN-encapsulated TCP packet with the following layout: out_eth/out_ip/out_udp/vxlan/in_eth/in_ip/in_tcp/payload
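For this layout, the header lengths an application would hand to the hardware can be worked out as below. This is a hedged sketch in plain C rather than DPDK code: `toy_tx_meta`, `TOY_TX_IP_CKSUM`, and the helper functions are hypothetical, the header sizes assume no VLAN tag and no IPv4 options, and the convention that the "L2 length" covers everything in front of the IP header being checksummed is an assumption of this example.

```c
#include <assert.h>
#include <stdint.h>

/* Header sizes in bytes, assuming no VLAN tag and no IPv4 options. */
enum {
    ETH_HLEN   = 14,
    IPV4_HLEN  = 20,
    UDP_HLEN   = 8,
    VXLAN_HLEN = 8
};

/* Hypothetical flag bit for this sketch; not the value of PKT_TX_IP_CKSUM. */
#define TOY_TX_IP_CKSUM (1u << 0)

/* Toy stand-in for the offload-related fields of an mbuf. */
struct toy_tx_meta {
    uint16_t l2_len;   /* bytes in front of the IP header to checksum */
    uint16_t l3_len;   /* length of that IP header */
    uint32_t ol_flags; /* requested offloads */
};

/* Request a checksum for the IPv4 header that starts after bytes_before_ip. */
static struct toy_tx_meta toy_request_ip_cksum(uint16_t bytes_before_ip,
                                               uint16_t ip_hdr_len)
{
    struct toy_tx_meta meta;

    assert(ip_hdr_len >= IPV4_HLEN); /* an IPv4 header is at least 20 bytes */
    meta.l2_len = bytes_before_ip;
    meta.l3_len = ip_hdr_len;
    meta.ol_flags = TOY_TX_IP_CKSUM;
    return meta;
}

/* Outer IP checksum: only out_eth precedes out_ip. */
static struct toy_tx_meta toy_outer_ip_cksum(void)
{
    return toy_request_ip_cksum(ETH_HLEN, IPV4_HLEN);
}

/* Inner IP checksum: out_eth/out_ip/out_udp/vxlan/in_eth precede in_ip. */
static struct toy_tx_meta toy_inner_ip_cksum(void)
{
    return toy_request_ip_cksum(
        ETH_HLEN + IPV4_HLEN + UDP_HLEN + VXLAN_HLEN + ETH_HLEN, IPV4_HLEN);
}
```

Under these assumptions the outer case gives an L2 length of 14 bytes, while the inner case gives 64 bytes, because the whole outer encapsulation is skipped before the inner IP header starts.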

The meaning of each flag is described in detail in the mbuf API documentation (rte_mbuf.h). For more details, see also the testpmd source code (in particular csumonly.c).

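A direct buffer is a buffer that is completely separate and self-contained. An indirect buffer behaves like a direct buffer, except that its buffer pointer and data offset refer to data in another direct buffer. This is useful when packets need to be duplicated or fragmented, since indirect buffers provide the means to reuse the same packet data across multiple buffers.

A buffer becomes indirect when it is attached to a direct buffer using the rte_pktmbuf_attach() function. Each buffer has a reference counter field; whenever an indirect buffer is attached to a direct buffer, the reference counter on the direct buffer is incremented. Similarly, whenever the indirect buffer is detached, the reference counter on the direct buffer is decremented. If the resulting reference counter is 0, the direct buffer is freed since it is no longer in use.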

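There are a few things to keep in mind when dealing with indirect buffers. First, an indirect buffer is never attached to another indirect buffer: attempting to attach buffer A to indirect buffer B, which is itself attached to C, makes rte_pktmbuf_attach() automatically attach A to C. Second, for a buffer to become indirect, its reference counter must be equal to 1, that is, it must not already be referenced by another indirect buffer. Finally, it is not possible to reattach an indirect buffer to a direct buffer (unless it is detached first).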
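The reference-counting rules can be sketched with a toy model in plain C. Nothing here is the DPDK implementation: `toy_mbuf` and the `toy_*` functions are hypothetical, and "freed" is just a flag so that the effect of the counter reaching zero is visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy buffer: direct when 'direct' is NULL, indirect otherwise. */
struct toy_mbuf {
    struct toy_mbuf *direct; /* direct buffer whose data is shared, or NULL */
    uint16_t         refcnt; /* references held on this buffer */
    int              freed;  /* set once the counter drops to zero */
};

static void toy_init(struct toy_mbuf *m)
{
    assert(m != NULL);
    m->direct = NULL;
    m->refcnt = 1; /* the owner's own reference */
    m->freed = 0;
}

/* Make mi an indirect buffer sharing the data of m. */
static void toy_attach(struct toy_mbuf *mi, struct toy_mbuf *m)
{
    /* mi must be direct and not referenced by anyone else. */
    assert(mi->direct == NULL && mi->refcnt == 1);

    /* Never attach to an indirect buffer: follow it to its direct buffer. */
    struct toy_mbuf *md = (m->direct != NULL) ? m->direct : m;

    md->refcnt++;
    mi->direct = md;
}

/* Turn mi back into a direct buffer and drop its reference. */
static void toy_detach(struct toy_mbuf *mi)
{
    struct toy_mbuf *md = mi->direct;

    assert(md != NULL);
    mi->direct = NULL;
    if (--md->refcnt == 0)
        md->freed = 1;
}

/* The owner of a direct buffer gives up its own reference. */
static void toy_release(struct toy_mbuf *m)
{
    assert(m->direct == NULL && m->refcnt > 0);
    if (--m->refcnt == 0)
        m->freed = 1;
}
```

With three buffers A, B, and C, attaching B to C and then A to B leaves both A and B pointing at C, with a counter of 3 on C. C is only marked as freed after its owner has released it and both A and B have been detached.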

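Although the attach/detach operations can be invoked directly through rte_pktmbuf_attach() and rte_pktmbuf_detach(), the higher-level rte_pktmbuf_clone() function is recommended, since it takes care of correctly initializing an indirect buffer and can clone buffers with multiple segments.

Since indirect buffers are not supposed to actually hold any data, the memory pool for indirect buffers should be configured to reflect the reduced memory consumption. Examples of initializing a memory pool for indirect buffers (as well as use cases for indirect buffers) can be found in several of the sample applications, for example the IPv4 Multicast sample application.

In debug mode (CONFIG_RTE_MBUF_DEBUG enabled), the functions of the mbuf library perform sanity checks before any operation (such as buffer corruption or bad type).

All networking applications should use mbufs to transport network packets.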

Original link: http://www.jianshu.com/p/94e96c426c4c

[DPDK][Repost] DPDK Programming: lcore

1. API overview

Return type	Function	Description
	RTE_DECLARE_PER_LCORE (unsigned, _lcore_id)
	RTE_DECLARE_PER_LCORE (rte_cpuset_t, _cpuset)
static unsigned	rte_lcore_id (void)	Returns the ID of the currently running lcore
static unsigned	rte_get_master_lcore (void)	Returns the ID of the master lcore
static unsigned	rte_lcore_count (void)	Returns the number of execution units (lcores) on the system
static int	rte_lcore_index (int lcore_id)	Returns the index of the lcore, starting from zero
unsigned	rte_socket_id (void)	Returns the physical socket of the currently running lcore
static unsigned	rte_lcore_to_socket_id (unsigned lcore_id)	Returns the physical socket ID of the specified lcore
static int	rte_lcore_is_enabled (unsigned lcore_id)	Tests whether the lcore is enabled; returns true if it is
static unsigned	rte_get_next_lcore (unsigned i, int skip_master, int wrap)	Returns the ID of the next enabled lcore
int	rte_thread_set_affinity (rte_cpuset_t *cpusetp)	Sets the core affinity of the current thread
void	rte_thread_get_affinity (rte_cpuset_t *cpusetp)	Gets the core affinity of the current thread

2. Header files

#include <rte_per_lcore.h>
#include <rte_eal.h>
#include <rte_launch.h>

struct lcore_config lcore_config[RTE_MAX_LCORE];

struct lcore_config {
    unsigned detected;                     /**< true if lcore was detected */
    pthread_t thread_id;                   /**< pthread identifier */
    int pipe_master2slave[2];              /**< communication pipe with master */
    int pipe_slave2master[2];              /**< communication pipe with master */
    lcore_function_t * volatile f;         /**< function to call */
    void * volatile arg;                   /**< argument of function */
    volatile int ret;                      /**< return value of function */
    volatile enum rte_lcore_state_t state; /**< lcore state */
    unsigned socket_id;                    /**< physical socket id for this lcore */
    unsigned core_id;                      /**< core number on socket for this lcore */
};

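3. Functions

rte_lcore_count(void)

Description: returns the number of execution units (lcores) on the system. This is not the same thing as RTE_MAX_LCORE (a macro, 64 here).

rte_lcore_id(void)

Description: returns the ID of the currently running lcore.

rte_get_master_lcore(void)

Description: returns the ID of the master lcore.

rte_get_next_lcore(unsigned i, int skip_master, int wrap)

Description: returns the ID of the next enabled lcore.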
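The roles of skip_master and wrap are easier to see in a toy version written against a plain array instead of the EAL configuration. This is a sketch of the behaviour as I understand it, not the library source: `toy_lcores`, `toy_get_next_lcore()`, and `TOY_MAX_LCORE` are hypothetical names, and returning `TOY_MAX_LCORE` when no further lcore exists is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical upper bound, standing in for RTE_MAX_LCORE. */
#define TOY_MAX_LCORE 8u

/* Toy configuration: which lcores are enabled and which one is the master. */
struct toy_lcores {
    bool     enabled[TOY_MAX_LCORE];
    unsigned master;
};

/*
 * Return the next enabled lcore after i.
 *  - skip_master: never return the master lcore.
 *  - wrap: continue from 0 after the last slot instead of stopping.
 * Without wrap, TOY_MAX_LCORE is returned when nothing is left.
 * With wrap, the caller must make sure at least one lcore qualifies,
 * otherwise the loop never terminates.
 * Pass (unsigned)-1 as i to start the search at lcore 0.
 */
static unsigned toy_get_next_lcore(const struct toy_lcores *cfg, unsigned i,
                                   int skip_master, int wrap)
{
    assert(cfg != NULL);

    i++;
    if (wrap)
        i %= TOY_MAX_LCORE;

    while (i < TOY_MAX_LCORE) {
        if (cfg->enabled[i] && !(skip_master && i == cfg->master))
            break;
        i++;
        if (wrap)
            i %= TOY_MAX_LCORE;
    }
    return i;
}
```

A typical loop over all worker lcores starts from `(unsigned)-1` with skip_master set and wrap cleared, and stops once the returned value reaches `TOY_MAX_LCORE`.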

rte_lcore_index(int lcore_id)

Description: returns the index of the lcore, starting from zero.
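The difference between an lcore ID, an lcore index, and the lcore count can be shown with a plain enable map. The helpers below are hypothetical and not part of DPDK; returning -1 for a disabled or out-of-range ID is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of enabled lcores in a toy enable map of n entries. */
static unsigned toy_lcore_count(const bool *enabled, size_t n)
{
    unsigned count = 0;

    assert(enabled != NULL);
    for (size_t i = 0; i < n; i++) {
        if (enabled[i])
            count++;
    }
    return count;
}

/*
 * Zero-based position of lcore_id among the enabled lcores,
 * or -1 if that lcore is disabled or out of range.
 */
static int toy_lcore_index(const bool *enabled, size_t n, size_t lcore_id)
{
    int index = 0;

    assert(enabled != NULL);
    if (lcore_id >= n || !enabled[lcore_id])
        return -1;
    for (size_t i = 0; i < lcore_id; i++) {
        if (enabled[i])
            index++;
    }
    return index;
}
```

With lcores 0, 2, and 3 enabled out of 8 slots, the count is 3 even though the upper bound is 8, and lcore ID 3 has index 2 because one lower ID is disabled.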

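rte_lcore_is_enabled(unsigned lcore_id)

Description: tests whether the lcore is enabled; returns true if it is.

rte_lcore_to_socket_id(unsigned lcore_id)

Description: returns the physical socket ID of the specified lcore.

rte_socket_id(void)

Description: returns the physical socket of the currently running lcore.

rte_thread_get_affinity(rte_cpuset_t * cpusetp)

Description: gets the core affinity of the current thread.

rte_thread_set_affinity(rte_cpuset_t * cpusetp)

Description: sets the core affinity of the current thread; returns 0 on success and -1 on failure.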

4. Further background

NUMA

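NUMA (Non-Uniform Memory Access) and SMP (Symmetric Multi-Processor) are two different CPU hardware architectures.

The defining feature of SMP is sharing: all CPUs share all resources, such as memory, buses, and I/O. The CPUs work symmetrically, with no master/slave distinction, and have equal access to the shared resources. This inevitably introduces contention for those resources, which severely limits scalability. The NUMA architecture has long been popular in medium and large systems; it is a high-performance solution and also performs very well in terms of system latency.

Under NUMA, the CPU concepts from largest to smallest are: Socket, Core, Processor. With the development of multi-core technology, multiple CPU cores are packaged together; this package is generally called a Socket, and each core inside it is called a Core. To further increase processing capacity, Intel introduced HT (Hyper-Threading): a Core with HT enabled appears to the OS as two cores. These are logical cores, so they are also called Logical Processors, abbreviated as Processor in this article.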

node

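The NUMA architecture adds the concept of a node, mainly to solve the problem of grouping cores. In the groupings commonly seen today, one socket contains one node. Each node has its own CPUs, bus, and memory, and can also access memory in other nodes. The biggest advantage of NUMA is that it makes it easy to increase the number of CPUs.

# lscpu

# numactl --hardware

Note: if the output reports "available: 1 nodes (0)", the machine has 1 NUMA node.

Note: if the output reports "available: 2 nodes (0-1)", the machine has 2 NUMA nodes.

# ls /sys/devices/system/node/node0

Note: node0 contains processors 0 to 11.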

socket (physical id)

A socket corresponds to a CPU socket on the motherboard; the physical id field in /proc/cpuinfo is the socket ID.

# grep 'physical id' /proc/cpuinfo | awk -F: '{print $2 | "sort -un"}'

Note: from this output we can tell that the machine has 2 sockets, numbered 0 and 1.

# grep 'physical id' /proc/cpuinfo | awk -F: '{print $2}' | sort | uniq -c

Note: each socket has 6 processors.

# cat /proc/cpuinfo | grep core | sort -u

Note: each socket has 6 cores, with IDs 0 to 5.

processor

# grep '^processor' /proc/cpuinfo | wc -l

Note: the machine has 12 processors in total.

# grep 'siblings' /proc/cpuinfo | sort -u

Note: the number of processors in each socket can also be read from the siblings field.

cpu.sh

#!/bin/bash
# Simple print cpu topology
# Author: kodango

# Number of logical processors
function get_nr_processor()
{
    grep '^processor' /proc/cpuinfo | wc -l
}

# Number of physical sockets
function get_nr_socket()
{
    grep 'physical id' /proc/cpuinfo | awk -F: '{
        print $2 | "sort -un"}' | wc -l
}

# Number of logical processors per socket
function get_nr_siblings()
{
    grep 'siblings' /proc/cpuinfo | awk -F: '{
        print $2 | "sort -un"}'
}

# Number of physical cores per socket
function get_nr_cores_of_socket()
{
    grep 'cpu cores' /proc/cpuinfo | awk -F: '{
        print $2 | "sort -un"}'
}

echo '===== CPU Topology Table ====='
echo

echo '+--------------+---------+-----------+'
echo '| Processor ID | Core ID | Socket ID |'
echo '+--------------+---------+-----------+'

# Each processor block in /proc/cpuinfo ends with an empty line:
# print one table row whenever such a line is reached.
while read -r line; do
    if [ -z "$line" ]; then
        printf '| %-12s | %-7s | %-9s |\n' "$p_id" "$c_id" "$s_id"
        echo '+--------------+---------+-----------+'
        continue
    fi

    if echo "$line" | grep -q "^processor"; then
        p_id=$(echo "$line" | awk -F: '{print $2}' | tr -d ' ')
    fi

    if echo "$line" | grep -q "^core id"; then
        c_id=$(echo "$line" | awk -F: '{print $2}' | tr -d ' ')
    fi

    if echo "$line" | grep -q "^physical id"; then
        s_id=$(echo "$line" | awk -F: '{print $2}' | tr -d ' ')
    fi
done < /proc/cpuinfo

echo

# List the processors that belong to each socket
awk -F: '{
    if ($1 ~ /^processor/) {
        gsub(/ /,"",$2);
        p_id=$2;
    } else if ($1 ~ /physical id/) {
        gsub(/ /,"",$2);
        s_id=$2;
        arr[s_id]=arr[s_id] " " p_id
    }
}
END{
    for (i in arr)
        printf "Socket %s:%s\n", i, arr[i];
}' /proc/cpuinfo

echo
echo '===== CPU Info Summary ====='
echo

nr_processor=$(get_nr_processor)
echo "Logical processors: $nr_processor"

nr_socket=$(get_nr_socket)
echo "Physical socket: $nr_socket"

nr_siblings=$(get_nr_siblings)
echo "Siblings in one socket: $nr_siblings"

nr_cores=$(get_nr_cores_of_socket)
echo "Cores in one socket: $nr_cores"

nr_cores=$((nr_cores * nr_socket))
echo "Cores in total: $nr_cores"

# More logical processors than physical cores means Hyper-Threading is on
if [ "$nr_cores" = "$nr_processor" ]; then
    echo "Hyper-Threading: off"
else
    echo "Hyper-Threading: on"
fi

echo
echo '===== END ====='

5. Common commands

lscpu

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12         // 12 logical CPUs in total
On-line CPU(s) list:   0-11
Thread(s) per core:    1          // 1 thread per core
Core(s) per socket:    6          // 6 cores per socket
CPU socket(s):         2          // 2 sockets in total
NUMA node(s):          2          // 2 NUMA nodes in total
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 45
Stepping:              7
CPU MHz:               2294.387   // clock frequency
BogoMIPS:              4588.30
Virtualization:        VT-x
L1d cache:             32K        // L1 data cache
L1i cache:             32K        // L1 instruction cache
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-5
NUMA node1 CPU(s):     6-11
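CPU lists such as 0-5 and 6-11 in this output are ranges, so counting the processors of a node means expanding them. The helper below is a minimal sketch in plain C; `count_cpulist()` is a hypothetical name, and the accepted format (comma-separated numbers or lo-hi ranges, optionally ending in a newline) is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Count the CPUs named by a list such as "0-5" or "0-5,12-17".
 * Returns -1 if the text cannot be parsed.
 */
static int count_cpulist(const char *list)
{
    const char *p = list;
    int count = 0;

    assert(list != NULL);
    while (*p != '\0' && *p != '\n') {
        char *end;
        unsigned long lo = strtoul(p, &end, 10);
        unsigned long hi = lo;

        if (end == p)
            return -1; /* no digits where a number was expected */

        if (*end == '-') { /* a range "lo-hi" */
            p = end + 1;
            hi = strtoul(p, &end, 10);
            if (end == p || hi < lo)
                return -1;
        }
        count += (int)(hi - lo + 1);

        if (*end == ',')
            end++; /* move on to the next entry */
        else if (*end != '\0' && *end != '\n')
            return -1; /* unexpected character */
        p = end;
    }
    return count;
}
```

For the two nodes listed above, `count_cpulist("0-5")` and `count_cpulist("6-11")` each give 6, which matches 6 cores per socket with 1 thread per core.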

numactl

#numactl --hardware

/proc/cpuinfo

# cat /proc/cpuinfo | grep 'physical id' | awk -F: '{print $2}' | sort | uniq -c

Note: this shows 2 sockets, each with 12 processors.

# cat /proc/cpuinfo | grep core | sort -u

Note: this shows 6 cores per socket; combined with the previous result, Hyper-Threading must be enabled.

6. References

lcore:

http://www.dpdk.org/doc/api/rte__lcore_8h.html

CPU Topology:

http://kodango.com/cpu-topology

SMP VS NUMA VS MPP:

http://xasun.com/article/4d/2076.html

http://www.ibm.com/developerworks/cn/linux/l-numa/index.html

http://blog.csdn.net/ustc_dylan/article/details/45667227
