Saving images in a loop faster than multithreading / multiprocessing

Posted 2023-02-16

Asked 2022-01-21 16:27:03

Question:

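Here is a timed example in which several image arrays of different sizes are saved in a loop and also concurrently, using threads and processes: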

import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from pathlib import Path
from time import perf_counter

import numpy as np
import cv2


def save_img(idx, image, dst):
    cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)


if __name__ == '__main__':
    l1 = np.random.randint(0, 255, (100, 50, 50, 1))
    l2 = np.random.randint(0, 255, (1000, 50, 50, 1))
    l3 = np.random.randint(0, 255, (10000, 50, 50, 1))
    temp_dir = tempfile.mkdtemp()
    workers = 4
    t1 = perf_counter()
    for ll in l1, l2, l3:
        t = perf_counter()
        for i, img in enumerate(ll):
            save_img(i, img, temp_dir)
        print(f'Time for {len(ll)}: {perf_counter() - t} seconds')
        for executor in ThreadPoolExecutor, ProcessPoolExecutor:
            with executor(workers) as ex:
                futures = [
                    ex.submit(save_img, i, img, temp_dir) for (i, img) in enumerate(ll)
                ]
                for f in as_completed(futures):
                    f.result()
            print(
                f'Time for {len(ll)} ({executor.__name__}): {perf_counter() - t} seconds'
            )

I get these durations on an i5 MacBook Pro:

Time for 100: 0.09495482999999982 seconds
Time for 100 (ThreadPoolExecutor): 0.14151873999999998 seconds
Time for 100 (ProcessPoolExecutor): 1.5136184309999998 seconds
Time for 1000: 0.36972280300000016 seconds
Time for 1000 (ThreadPoolExecutor): 0.619205703 seconds
Time for 1000 (ProcessPoolExecutor): 2.016624468 seconds
Time for 10000: 4.232915643999999 seconds
Time for 10000 (ThreadPoolExecutor): 7.251599262 seconds
Time for 10000 (ProcessPoolExecutor): 13.963426469999998 seconds

Shouldn't threads / processes need less time to accomplish the same thing? Why don't they in this case?

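Comments:

Would it be the same with ex.map instead of submit?

The process and thread durations improve to exactly the same as the for loop duration, which is pretty much the same problem.

I/O-bound computations are generally not sped up by multithreading. Threads offer the potential for more than one CPU to provide cycles at the same time, but it takes very little CPU effort to keep the I/O channels completely full. Consequently, the potential for more CPU power does not help.

So in this particular use case, do you mean that neither multithreading nor multiprocessing is necessary and it is better to use a for loop? If so, what is the proper way to speed things up, concurrently or not?

100, 1000, and 10000 images are being written; you are confusing the first dimension with the image size. I use the same logic in some text generation code that renders text to images and saves them. The example is just a simplified version. I mentioned that the example runs on an i5 MacBook Pro.

Answer 1: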

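The timing in the code is wrong because the timer t is not reset before the pools are tested. The relative order of the timings is nevertheless correct. A possible version with the timer reset is: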

import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from pathlib import Path
from time import perf_counter

import numpy as np
import cv2


def save_img(idx, image, dst):
    cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)

if __name__ == '__main__':
    l1 = np.random.randint(0, 255, (100, 50, 50, 1))
    l2 = np.random.randint(0, 255, (1000, 50, 50, 1))
    l3 = np.random.randint(0, 255, (10000, 50, 50, 1))
    temp_dir = tempfile.mkdtemp()
    workers = 4

    for ll in l1, l2, l3:
        t = perf_counter()
        for i, img in enumerate(ll):
            save_img(i, img, temp_dir)
        print(f'Time for {len(ll)}: {perf_counter() - t} seconds')
        for executor in ThreadPoolExecutor, ProcessPoolExecutor:
            t = perf_counter()
            with executor(workers) as ex:
                futures = [
                    ex.submit(save_img, i, img, temp_dir) for (i, img) in enumerate(ll)
                ]
                for f in as_completed(futures):
                    f.result()
            print(
                f'Time for {len(ll)} ({executor.__name__}): {perf_counter() - t} seconds'
            )
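
One way to make a missed timer reset less likely is to wrap each measurement in a small context manager, so every measured block gets its own start time. The sketch below is only an illustration: `timed` and the `results` dictionary are hypothetical names that do not appear in the original code, and `sleep` stands in for the real work.

```python
from contextlib import contextmanager
from time import perf_counter, sleep


@contextmanager
def timed(label, results):
    """Time the enclosed block and store the elapsed seconds under `label`."""
    start = perf_counter()  # a fresh start time for every measured block
    try:
        yield
    finally:
        results[label] = perf_counter() - start


if __name__ == '__main__':
    results = {}
    with timed('loop', results):
        sleep(0.01)  # stand-in for the plain for loop
    with timed('pool', results):
        sleep(0.01)  # stand-in for the executor run
    for label, seconds in results.items():
        print(f'Time for {label}: {seconds} seconds')
```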

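Multithreading is usually faster, especially for I/O-bound work. Here, though, compressing the images is CPU-intensive, so depending on how OpenCV and its Python wrapper are implemented, multithreading can be much slower. In many cases the culprit is CPython's GIL, but I am not sure that is the case here (I do not know whether the GIL is released during the imwrite call). On my setup (8th-gen i7), threading is as fast as the loop for 100 images and barely faster for 1000 and 10000 images. If ThreadPoolExecutor reuses threads, there is an overhead in assigning new tasks to existing threads; if it does not reuse threads, there is an overhead in launching new ones.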
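
A rough way to probe whether a given call releases the GIL is to time it sequentially and again in a thread pool: if the threaded run is clearly shorter, the call spends at least part of its time with the GIL released. The sketch below uses `time.sleep` as a stand-in for a call that does release the GIL; `blocking_call` and `compare` are hypothetical helpers, and the sketch does not establish how `cv2.imwrite` behaves.

```python
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter, sleep


def blocking_call(_):
    # Stand-in for a call that releases the GIL while it waits.
    # time.sleep does; whether cv2.imwrite does is not established here.
    sleep(0.05)


def compare(func, n_tasks=8, workers=4):
    """Return (sequential_seconds, threaded_seconds) for n_tasks calls of func."""
    t = perf_counter()
    for i in range(n_tasks):
        func(i)
    sequential = perf_counter() - t

    t = perf_counter()
    with ThreadPoolExecutor(workers) as ex:
        list(ex.map(func, range(n_tasks)))  # consume results to wait for all tasks
    threaded = perf_counter() - t
    return sequential, threaded


if __name__ == '__main__':
    seq, thr = compare(blocking_call)
    print(f'sequential: {seq} seconds, threaded: {thr} seconds')
```

Swapping `blocking_call` for a function that wraps the real save call would give a comparable pair of numbers for that call.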

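Multiprocessing circumvents the GIL issue but has some problems of its own. First, pickling the data to pass it between processes takes time, and for images it can be very expensive. Second, on Windows, spawning a new process takes a lot of time. A simple test to see the overhead (for both processes and threads) is to change the save_img function to one that does nothing but still requires pickling, etc.: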

def save_img(idx, image, dst):
    if idx != idx:
        print("impossible!")

Run a similar test with a function that takes no arguments to see the overhead of spawning processes, etc.

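The timings on my setup show that 2.3 seconds are needed just to dispatch the 10000 tasks to the worker processes, plus another 0.6 seconds for pickling, which is far more than the time needed for the actual processing.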
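
The pickling share of that overhead can also be estimated in isolation, without starting any processes. The following is a minimal sketch under stated assumptions: `pickling_cost` is a hypothetical helper, and plain `bytes` objects of 50 * 50 * 8 bytes stand in for the 50x50 arrays (assuming 8-byte integers). Real NumPy arrays pickle with some extra metadata, so the numbers are only indicative.

```python
import pickle
from time import perf_counter


def pickling_cost(payloads):
    """Return (seconds, total_bytes) for a pickle round trip of every payload."""
    total = 0
    t = perf_counter()
    for payload in payloads:
        data = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
        total += len(data)
        pickle.loads(data)  # the receiving process has to unpickle as well
    return perf_counter() - t, total


if __name__ == '__main__':
    # 10000 stand-in "images" of 50 * 50 * 8 bytes each (assumed element size)
    payloads = [bytes(50 * 50 * 8) for _ in range(10000)]
    seconds, total = pickling_cost(payloads)
    print(f'Pickled {total} bytes in {seconds} seconds')
```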

One way to improve throughput while keeping the overhead to a minimum is to break the work into chunks and submit each chunk to a worker:

import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
from pathlib import Path
from time import perf_counter

import numpy as np
import cv2


def save_img(idx, image, dst):
    cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)

def multi_save_img(idx_start, images, dst):
    for idx, image in zip(range(idx_start, idx_start + len(images)), images):
        cv2.imwrite((Path(dst) / f'{idx}.jpg').as_posix(), image)


if __name__ == '__main__':
    l1 = np.random.randint(0, 255, (100, 50, 50, 1))
    l2 = np.random.randint(0, 255, (1000, 50, 50, 1))
    l3 = np.random.randint(0, 255, (10000, 50, 50, 1))
    temp_dir = tempfile.mkdtemp()
    workers = 4

    for ll in l1, l2, l3:
        t = perf_counter()
        for i, img in enumerate(ll):
            save_img(i, img, temp_dir)
        print(f'Time for {len(ll)}: {perf_counter() - t} seconds')
        chunk_size = len(ll)//workers 
        ends = [chunk_size * (_+1)  for _ in range(workers)]
        ends[-1] += len(ll) % workers
        starts = [chunk_size * _  for _ in range(workers)]
        for executor in ThreadPoolExecutor, ProcessPoolExecutor:
            t = perf_counter()
            with executor(workers) as ex:
                futures = [
                    ex.submit(multi_save_img, start, ll[start:end], temp_dir) for (start, end) in zip(starts, ends)
                ]
                for f in as_completed(futures):
                    f.result()
            print(
                f'Time for {len(ll)} ({executor.__name__}): {perf_counter() - t} seconds'
            )

For both the multiprocessing and the multithreading approach, this should give you a significant boost over a simple for loop.
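
The chunk boundaries can also be computed by a small helper that spreads the remainder over the first chunks instead of adding it all to the last one. This is only a sketch of one possible variant: `chunk_bounds` is a hypothetical name, and it also skips empty chunks when there are fewer items than workers.

```python
def chunk_bounds(n_items, n_chunks):
    """Split range(n_items) into at most n_chunks contiguous (start, end) pairs."""
    base, extra = divmod(n_items, n_chunks)
    bounds = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + base + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when n_items < n_chunks
            bounds.append((start, end))
        start = end
    return bounds


if __name__ == '__main__':
    for start, end in chunk_bounds(10, 4):
        print(start, end)
```

Each `(start, end)` pair can then be passed on as `multi_save_img, start, ll[start:end], temp_dir` in place of the `starts` and `ends` lists.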

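The map function offers the same functionality and can perform better. With ProcessPoolExecutor, its chunksize argument batches the tasks sent to each worker (ThreadPoolExecutor ignores chunksize). If you change the inner loop to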

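with executor(workers) as ex:
    # map takes one iterable per positional argument of save_img
    rv = ex.map(save_img, range(len(ll)), ll, [temp_dir] * len(ll), chunksize=len(ll)//workers+1)
    list(rv)  # consume the results so that worker exceptions are raised here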

you should get the best timings.
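
Two details of Executor.map are easy to miss: it expects one iterable per parameter of the target function, not a list of argument tuples, and exceptions raised in workers only surface when the returned iterator is consumed. The sketch below illustrates both with a thread pool. `save_blob` and `save_all` are hypothetical stand-ins that write raw bytes rather than JPEGs, so the example runs without OpenCV.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat
from pathlib import Path


def save_blob(idx, blob, dst):
    # Hypothetical stand-in for save_img: writes raw bytes instead of a JPEG.
    (Path(dst) / f'{idx}.bin').write_bytes(blob)
    return idx


def save_all(blobs, dst, workers=4):
    with ThreadPoolExecutor(workers) as ex:
        # One iterable per parameter of save_blob: idx, blob, dst.
        # repeat(dst) is infinite; map stops at the shortest iterable.
        results = ex.map(save_blob, range(len(blobs)), blobs, repeat(dst))
        return list(results)  # iterating re-raises any worker exception


if __name__ == '__main__':
    temp_dir = tempfile.mkdtemp()
    saved = save_all([bytes(10)] * 100, temp_dir)
    print(f'Saved {len(saved)} files to {temp_dir}')
```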

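Comments:

However, multiprocessing has the worst timing. So what do you suggest to speed up this operation? I don't think it is OS-specific: I tried it on my MacBook Pro and on a different Ubuntu machine and got similar results. It seems to me that processes and threads are equally useless.

See my updated answer. The last piece of code shows a solution to your problem.

My mistake, I think I misplaced the perf_counter() call. I tried your updated solution, and I think it does the trick. Why do threads have the best time? For example: 7.372398026000001, 2.9415655140000005 and 6.112366614999999 seconds for the for loop, ThreadPoolExecutor and ProcessPoolExecutor respectively (n = 10000). Does this mean the GIL is released during the cv2.imwrite call?

It means the GIL is not held during the whole call but is released at some point (I am sure it is released during the I/O calls, but I do not know whether it is released while the underlying OpenCV functions are called).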
