Is CPU to GPU data transfer slow in TensorFlow?

Posted 2023-02-16

Asked: 2022-01-01 23:42:43

Question:

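I have measured CPU-to-GPU data transfer throughput in TensorFlow, and it appears to be significantly lower than in PyTorch: between 2x and 5x slower for large tensors. In TF, I reach the maximum speed with 25 MB tensors (~4 GB/s), and it drops to 2 GB/s as the tensor size increases. PyTorch transfer speed grows with tensor size and saturates at 9 GB/s (25 MB tensors). The behavior is consistent across an RTX 2080 Ti and a GTX 1080 Ti, and across TF 2.4 and 2.6.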
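
To make the tensor sizes concrete, the arithmetic behind the "25 MB" figure can be sketched as follows. It assumes uint8 images of 256x256x3, as in the benchmark, and decimal megabytes (10^6 bytes), which matches the MB/S column; the helper name `batch_megabytes` is illustrative only.

```python
def batch_megabytes(batch_size, height=256, width=256, channels=3, bytes_per_element=1):
    """Size of one uint8 image batch in decimal megabytes (10**6 bytes)."""
    return batch_size * height * width * channels * bytes_per_element / 1_000_000


for batch_size in (1, 128, 256, 512):
    print(f'Batch size {batch_size}: {batch_megabytes(batch_size):.1f} MB')
```

Under these assumptions, batch size 128 corresponds to the roughly 25 MB tensor mentioned above, and batch size 256 to roughly 50 MB.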

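Am I doing something wrong? Is there any way to match PyTorch's data throughput? I am not just looking to hide latency, e.g. with async queues; I want to get the full data bandwidth.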

Results for batches of 256x256x3 images in TF (averaged over 100 transfers):

Code: tf.cast(x, dtype=tf.float32)[0, 0]
Batch size 1; Batch time 0.0005; BPS 1851.8; FPS 1851.8; MB/S 364.1
Batch size 2; Batch time 0.0004; BPS 2223.5; FPS 4447.1; MB/S 874.3
Batch size 4; Batch time 0.0006; BPS 1555.2; FPS 6220.6; MB/S 1223.0
Batch size 8; Batch time 0.0006; BPS 1784.8; FPS 14278.7; MB/S 2807.3
Batch size 16; Batch time 0.0013; BPS 755.3; FPS 12084.7; MB/S 2376.0
Batch size 32; Batch time 0.0023; BPS 443.8; FPS 14201.3; MB/S 2792.1
Batch size 64; Batch time 0.0035; BPS 282.5; FPS 18079.5; MB/S 3554.6
Batch size 128; Batch time 0.0061; BPS 163.4; FPS 20916.4; MB/S 4112.3
Batch size 256; Batch time 0.0241; BPS 41.5; FPS 10623.0; MB/S 2088.6
Batch size 512; Batch time 0.0460; BPS 21.7; FPS 11135.8; MB/S 2189.4

The same measurement with PyTorch:

Code: torch.from_numpy(x).to(self.device).type(torch.float32)[0, 0].cpu()
Batch size 1; Batch time 0.0001; BPS 10756.6; FPS 10756.6; MB/S 2114.8
Batch size 1; Batch time 0.0001; BPS 12914.7; FPS 12914.7; MB/S 2539.1
Batch size 2; Batch time 0.0001; BPS 10204.4; FPS 20408.7; MB/S 4012.5
Batch size 4; Batch time 0.0002; BPS 5841.1; FPS 23364.3; MB/S 4593.6
Batch size 8; Batch time 0.0003; BPS 3994.4; FPS 31955.4; MB/S 6282.7
Batch size 16; Batch time 0.0004; BPS 2713.8; FPS 43421.3; MB/S 8537.0
Batch size 32; Batch time 0.0007; BPS 1486.3; FPS 47562.7; MB/S 9351.2
Batch size 64; Batch time 0.0015; BPS 679.3; FPS 43475.9; MB/S 8547.7
Batch size 128; Batch time 0.0028; BPS 359.5; FPS 46017.7; MB/S 9047.5
Batch size 256; Batch time 0.0054; BPS 185.2; FPS 47404.1; MB/S 9320.0
Batch size 512; Batch time 0.0108; BPS 92.9; FPS 47564.5; MB/S 9351.6
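
The "2x to 5x" gap can be read directly off the two tables. Below is a small sketch that uses the MB/S values posted above for the larger batch sizes; the numbers are copied from the tables, and the dictionary and function names are illustrative.

```python
# MB/S values copied from the two result tables above (batch size -> MB/s).
tf_mbps = {32: 2792.1, 64: 3554.6, 128: 4112.3, 256: 2088.6, 512: 2189.4}
torch_mbps = {32: 9351.2, 64: 8547.7, 128: 9047.5, 256: 9320.0, 512: 9351.6}


def slowdown(batch_size):
    """How many times slower the TF transfer is than the PyTorch one."""
    return torch_mbps[batch_size] / tf_mbps[batch_size]


for batch_size in sorted(tf_mbps):
    print(f'Batch size {batch_size}: TF is {slowdown(batch_size):.1f}x slower')
```

For these rows the ratio ranges from about 2.2x (batch size 128) to about 4.5x (batch size 256).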

The full code to reproduce the measurements is:

import time
import numpy as np
import tensorflow as tf
import torch
import argparse


def parseargs():
    parser = argparse.ArgumentParser(usage='Test GPU transfer speed in TensorFlow(default) and Pytorch.')
    parser.add_argument('--pytorch', action='store_true', help='Use PyTorch instead of TensorFlow')
    args = parser.parse_args()
    return args


class TimingModelTF(tf.keras.Model):
    def __init__(self, ):
        super(TimingModelTF, self).__init__()

    @tf.function
    def call(self, x):
        return tf.cast(x, dtype=tf.float32)[0, 0]


class TimingModelTorch(torch.nn.Module):
    def __init__(self, ):
        super(TimingModelTorch, self).__init__()
        self.device = torch.device('cuda')

    def forward(self, x):
        with torch.no_grad():
            return torch.from_numpy(x).to(self.device).type(torch.float32)[0, 0].cpu()


if __name__ == '__main__':
    args = parseargs()
    width = 256
    height = 256
    channels = 3
    iterations = 100
    model = TimingModelTorch() if args.pytorch else TimingModelTF()

    for batch_size in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]:
        img = np.random.randint(5, size=(batch_size, height, width, channels), dtype=np.uint8)

        result = model(img)
        result.numpy()

        start = time.time()
        for i in range(iterations):
            result = model(img)
            result.numpy()
        batch_time = (time.time() - start) / iterations
        print(f'Batch size {batch_size}; Batch time {batch_time:.4f}; BPS {1 / batch_time:.1f}; FPS {(1 / batch_time) * batch_size:.1f}; MB/S {(((1 / batch_time) * batch_size) * 256 * 256 * 3) / 1000000:.1f}')
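
The measurement loop above can also be factored into a small framework-independent helper. This is only a sketch of the same idea (one warm-up call, then an average over several iterations); the function name `measure_throughput` and the stand-in workload are assumptions made so that it runs without a GPU, and `time.perf_counter` is used in place of `time.time` because it is intended for interval timing.

```python
import time


def measure_throughput(transfer, batch_bytes, iterations=100):
    """Time `transfer()` and return (seconds per call, MB/s).

    `transfer` must block until the data has really arrived,
    e.g. by copying a result back to the host as the benchmark does.
    """
    transfer()  # warm-up call: keeps one-time setup cost out of the timing
    start = time.perf_counter()
    for _ in range(iterations):
        transfer()
    batch_time = (time.perf_counter() - start) / iterations
    return batch_time, batch_bytes / batch_time / 1_000_000


# Stand-in workload (allocating 1 MB on the host) so the sketch runs anywhere.
batch_time, mb_per_s = measure_throughput(lambda: bytes(1_000_000), batch_bytes=1_000_000, iterations=10)
print(f'Batch time {batch_time:.6f}; MB/S {mb_per_s:.1f}')
```

With the benchmark itself, `transfer` would be something like `lambda: model(img).numpy()` and `batch_bytes` would be `img.nbytes`.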

Comments:

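Probably PyTorch is using pinned buffers, and TensorFlow can still pipeline multiple operations to get close to pinned-buffer performance.

I'm not sure I understand. The code does not use pinned (host) memory; it is a NumPy array, which is definitely pageable. And how would pipelining improve CPU-to-GPU throughput? My understanding of pinned memory comes from developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc

Pinning the array to the GPU, not the CPU, should reduce [...] in TF. For PyTorch, .cpu() returns the original object without a copy if it is already on the CPU.

OK. Pinning to the GPU means copying all the data to the GPU, keeping it there, and using only that data. That does not help by itself, because the data does not fit into GPU memory. The question remains: can I get data to the GPU faster than in the posted code? In the code, .cpu() is used to bring data back from the device to the host, so I don't understand the related comment.

Answer 1: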

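If the TensorFlow function is JIT compiled, throughput increases because some operations are fused and intermediate values are not written out to memory, which reduces memory traffic. To highlight the relevant snippet from the documentation: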

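Fusion is XLA's single most important optimization. Memory bandwidth is typically the scarcest resource on hardware accelerators, so removing memory operations is one of the best ways to improve performance.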

In your example, we can do this by adding jit_compile=True to the tf.function decorator applied to the call method.

class TimingModelTF(tf.keras.Model):
    def __init__(self, ):
        super(TimingModelTF, self).__init__()

    @tf.function(jit_compile=True)
    def call(self, x):
        return tf.cast(x, dtype=tf.float32)[0, 0]

Note: for TensorFlow 2.4 and lower, change this to experimental_compile=True. Details on the deprecated keyword argument can be found in the tf.function documentation.
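
If the same script has to run on both older and newer TensorFlow versions, the keyword could be chosen from the version string. This is a minimal sketch: the helper name `xla_compile_kwargs` is hypothetical, and the 2.5 cut-off is inferred from the note above ("2.4 and lower"), so treat it as an assumption.

```python
def xla_compile_kwargs(tf_version):
    """Return the tf.function keyword that enables XLA for a TF version string."""
    major, minor = (int(part) for part in tf_version.split('.')[:2])
    if (major, minor) >= (2, 5):  # assumed cut-off, per the note above
        return {'jit_compile': True}
    return {'experimental_compile': True}


print(xla_compile_kwargs('2.4.1'))
print(xla_compile_kwargs('2.6.0'))

# Usage with TensorFlow installed (not executed here):
#   import tensorflow as tf
#
#   @tf.function(**xla_compile_kwargs(tf.__version__))
#   def call(x):
#       return tf.cast(x, dtype=tf.float32)[0, 0]
```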

On a GTX 1060, the results of the original test:

Batch size 1; Batch time 0.0005; BPS 2040.5; FPS 2040.5; MB/S 401.2
Batch size 2; Batch time 0.0007; BPS 1521.3; FPS 3042.5; MB/S 598.2
Batch size 4; Batch time 0.0006; BPS 1602.7; FPS 6410.8; MB/S 1260.4
Batch size 8; Batch time 0.0009; BPS 1112.5; FPS 8900.0; MB/S 1749.8
Batch size 16; Batch time 0.0013; BPS 760.9; FPS 12174.9; MB/S 2393.7
Batch size 32; Batch time 0.0020; BPS 498.8; FPS 15962.6; MB/S 3138.4
Batch size 64; Batch time 0.0034; BPS 290.2; FPS 18575.1; MB/S 3652.0
Batch size 128; Batch time 0.0063; BPS 158.0; FPS 20222.4; MB/S 3975.9
Batch size 256; Batch time 0.0297; BPS 33.6; FPS 8607.2; MB/S 1692.3
Batch size 512; Batch time 0.0595; BPS 16.8; FPS 8609.1; MB/S 1692.6

The peak is about 4 GB/s. Results with the function JIT compiled:

Batch size 1; Batch time 0.0006; BPS 1610.8; FPS 1610.8; MB/S 316.7
Batch size 2; Batch time 0.0007; BPS 1500.6; FPS 3001.1; MB/S 590.0
Batch size 4; Batch time 0.0006; BPS 1744.3; FPS 6977.1; MB/S 1371.8
Batch size 8; Batch time 0.0009; BPS 1114.2; FPS 8913.9; MB/S 1752.5
Batch size 16; Batch time 0.0013; BPS 788.1; FPS 12609.8; MB/S 2479.2
Batch size 32; Batch time 0.0018; BPS 556.9; FPS 17820.8; MB/S 3503.7
Batch size 64; Batch time 0.0019; BPS 518.5; FPS 33184.4; MB/S 6524.3
Batch size 128; Batch time 0.0054; BPS 186.1; FPS 23818.1; MB/S 4682.8
Batch size 256; Batch time 0.0291; BPS 34.4; FPS 8806.2; MB/S 1731.4
Batch size 512; Batch time 0.0567; BPS 17.6; FPS 9034.3; MB/S 1776.2

The peak is about 6.5 GB/s. On larger or newer GPUs, this rate would probably be higher.
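
Peaks like the ones quoted here can be pulled out of the benchmark output with a few lines of parsing. The sketch below assumes the exact output format printed by the benchmark; the names `LINE_RE` and `peak_throughput` are illustrative.

```python
import re

# Matches one benchmark row and captures the batch size and the MB/S value.
LINE_RE = re.compile(r'Batch size (\d+); .*MB/S ([\d.]+)')


def peak_throughput(output):
    """Return (batch_size, MB/s) of the fastest row in the benchmark output."""
    rows = [(int(m.group(1)), float(m.group(2))) for m in LINE_RE.finditer(output)]
    return max(rows, key=lambda row: row[1])


# Three rows copied from the JIT-compiled results above.
sample = """\
Batch size 32; Batch time 0.0018; BPS 556.9; FPS 17820.8; MB/S 3503.7
Batch size 64; Batch time 0.0019; BPS 518.5; FPS 33184.4; MB/S 6524.3
Batch size 128; Batch time 0.0054; BPS 186.1; FPS 23818.1; MB/S 4682.8
"""
print(peak_throughput(sample))  # (64, 6524.3)
```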

For reference, when running the Torch test, the rate peaks at about 7 GB/s:

Batch size 1; Batch time 0.0001; BPS 13396.1; FPS 13396.1; MB/S 2633.8
Batch size 2; Batch time 0.0001; BPS 9231.2; FPS 18462.5; MB/S 3629.9
Batch size 4; Batch time 0.0002; BPS 5752.5; FPS 23009.9; MB/S 4523.9
Batch size 8; Batch time 0.0003; BPS 3463.8; FPS 27710.1; MB/S 5448.0
Batch size 16; Batch time 0.0005; BPS 2027.8; FPS 32444.5; MB/S 6378.8
Batch size 32; Batch time 0.0010; BPS 1040.9; FPS 33308.6; MB/S 6548.7
Batch size 64; Batch time 0.0019; BPS 533.7; FPS 34155.2; MB/S 6715.2
Batch size 128; Batch time 0.0036; BPS 274.0; FPS 35069.0; MB/S 6894.8
Batch size 256; Batch time 0.0072; BPS 138.4; FPS 35425.8; MB/S 6965.0
Batch size 512; Batch time 0.0145; BPS 69.1; FPS 35391.0; MB/S 6958.2

Comments:

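That is interesting. I would not have expected it to have any effect in this case. I will check it on my machine and also verify that it works when the network does something useful. It is interesting that the transfer rate still drops for large batches (3.6x compared to the peak). Does that mean I have to optimize tensor sizes? Do I need to split larger batches? Batch size 256 is only 50 MB!

There are also other optimizations available through the arguments to tf.function (tensorflow.org/api_docs/python/tf/function#args) that may further improve performance for certain use cases, but I don't know whether they are relevant here. For example, providing input_signature with the known shapes of the tensors passed to the function can reduce tracing, but this mainly helps if you pass multiple tensors with different shapes. If these options do not help, you may need to perform additional optimization on your end.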
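
One way to experiment with the "split larger batches" idea from the comments is to feed a large batch in fixed-size pieces. This is a sketch only: the helper name `chunk_ranges` and the chunk size of 128 (the best-performing size in the TF tables) are assumptions, and the thread does not establish whether splitting actually recovers the peak rate.

```python
def chunk_ranges(total, chunk_size):
    """Yield (start, stop) index pairs covering range(total) in chunk_size steps."""
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)


print(list(chunk_ranges(512, 128)))
# [(0, 128), (128, 256), (256, 384), (384, 512)]

# Hypothetical use with the benchmark's `model` and `img` (not executed here):
#   for start, stop in chunk_ranges(len(img), 128):
#       model(img[start:stop])
```

If the total is not a multiple of the chunk size, the last piece is smaller, which gives the tf.function a new input shape and can trigger an extra trace.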
