Speeding up TFLite inference in Python with a multiprocessing pool

Posted 2023-02-16

【Title】speedup TFLite inference in python with multiprocessing pool 【Posted】2020-03-03 19:47:42 【Question】:

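I am playing around with tflite and noticed that my multicore CPU is not heavily loaded during inference. I eliminated the IO bottleneck by creating random input data with numpy beforehand (random matrices resembling images), but tflite still does not use the full potential of the CPU.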

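The documentation mentions the possibility of tweaking the number of threads used. However, I could not find out how to do that in the Python API. Since I have seen people using multiple interpreter instances for different models, I thought one could use multiple instances of the same model and run them in different threads/processes. I wrote the following short script: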

import numpy as np
import os, time
import tflite_runtime.interpreter as tflite
from multiprocessing import Pool


# global, but for each process the module is loaded, so only one global var per process
interpreter = None
input_details = None
output_details = None
def init_interpreter(model_path):
    global interpreter
    global input_details
    global output_details
    interpreter = tflite.Interpreter(model_path=model_path)
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    interpreter.allocate_tensors()
    print('done init')

def do_inference(img_idx, img):
    print('Processing image %d'%img_idx)
    print('interpreter: %r' % (hex(id(interpreter)),))
    print('input_details: %r' % (hex(id(input_details)),))
    print('output_details: %r' % (hex(id(output_details)),))

    tstart = time.time()

    img = np.stack([img]*3, axis=2) # replicate the single channel three times for RGB
    img = np.array([img]) # create batch dimension
    interpreter.set_tensor(input_details[0]['index'], img )
    interpreter.invoke()

    logit= interpreter.get_tensor(output_details[0]['index'])
    pred = np.argmax(logit, axis=1)[0]
    logit = list(logit[0])
    duration = time.time() - tstart 

    return logit, pred, duration

def main_par():
    optimized_graph_def_file = r'./optimized_graph.lite'

    # init model once to find out input dimensions
    interpreter_main = tflite.Interpreter(model_path=optimized_graph_def_file)
    input_details = interpreter_main.get_input_details()
    # shape is assumed to be NHWC, so indices 1 and 2 are height and width
    input_h, input_w = tuple(input_details[0]['shape'][1:3])

    num_test_imgs=1000
    # pregenerate random images with values in [0, 1)
    test_imgs = np.random.rand(num_test_imgs, input_h, input_w).astype(input_details[0]['dtype'])

    scores = []
    predictions = []
    it_times = []

    tstart = time.time()
    with Pool(processes=4, initializer=init_interpreter, initargs=(optimized_graph_def_file,)) as pool:         # start 4 worker processes

        results = pool.starmap(do_inference, enumerate(test_imgs))
        scores, predictions, it_times = list(zip(*results))
    duration =time.time() - tstart

    print('Parent process time for %d images: %.2fs'%(num_test_imgs, duration))
    print('Inference time for %d images: %.2fs'%(num_test_imgs, sum(it_times)))
    print('mean time per image: %.3fs +- %.3f' % (np.mean(it_times), np.std(it_times)) )



if __name__ == '__main__':
    # main_seq()
    main_par()

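However, the memory address of the interpreter instance, printed via hex(id(interpreter)), is the same for every process. The memory addresses of the input/output details, on the other hand, differ. So I am wondering whether this approach might be wrong even though I can observe a speedup. If so, how can parallel inference be achieved with TFLite and Python?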

tflite_runtime version: 1.14.0 from here (the x86-64 Python 3.5 build)

Python version: 3.5
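The observation that hex(id(interpreter)) prints the same value in every process does not by itself mean the interpreter is shared. In CPython, id() is the object's address inside the virtual address space of the current process, and worker processes that start identically can easily place their objects at the same address. The following is a minimal sketch of that effect, not the question's script: a plain dict stands in for the tflite interpreter (an assumption, so the sketch runs without a model file), and the names state, init_worker and describe_worker are hypothetical.

```python
import os
from multiprocessing import Pool

# One object per worker process, created by the pool initializer.
# A plain dict stands in for the tflite interpreter (assumption).
state = None


def init_worker():
    """Create this process's private state object."""
    global state
    state = {"pid": os.getpid(), "calls": 0}


def describe_worker(task_idx):
    """Mutate the local state and report which process handled the task."""
    state["calls"] += 1
    return task_idx, os.getpid(), hex(id(state)), state["calls"]


def main():
    with Pool(processes=2, initializer=init_worker) as pool:
        for task_idx, pid, addr, calls in pool.map(describe_worker, range(6)):
            print("task %d: pid=%d id(state)=%s calls=%d"
                  % (task_idx, pid, addr, calls))


if __name__ == "__main__":
    main()
```

If the printed addresses match while the pids differ and each pid keeps its own call counter, the objects are separate copies that merely live at the same virtual address. Whether the addresses actually coincide depends on the platform and on how the worker processes are started.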

【Comments】:

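I think you are trying to solve the same problem I need to solve. FYI, I asked this question: ***.com/questions/61263640/….

@mherzog I used the approach above successfully, and as far as I can tell from the results of some test inferences, the separate tflite interpreter instances work correctly and independently. I think the memory addresses are the same because the processes are started identically, so the variables have the same memory layout. However, this is only a guess and I did not look into the issue any further.

I tried running something similar, but for comparison I also ran it in a simple loop, and the speedup I get for 50 data points using 5 workers (versus running these 50 images in a for loop) is […]

@VikramMurthy In my case the speedup from single core to quad core is not exactly 4x, but it is measurably faster, at about 3.5x. So the code above was working at the time of writing. I do not know whether this changes with later tf versions (although I highly doubt it). Maybe you should make sure that the speed bottleneck is the model inference and not some IO process? Also, starting more workers than there are CPU cores available may cause a slowdown.

【Answer 1】: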

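I did not set an initializer. Instead, I load the model and run inference in the same function, using the following code, which solved this problem.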

import multiprocessing  # for cpu_count(); Pool is imported as in the question's script

with Pool(processes=multiprocessing.cpu_count()) as pool:
    results = pool.starmap(inference, enumerate(test_imgs))
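The answer does not show the inference function itself. The following is a rough sketch of what such a function could look like, assuming it builds the interpreter inside the worker rather than in a pool initializer. The names MODEL_PATH, interpreter_cache, load_tflite, run_parallel and the loader parameter are hypothetical. Caching the interpreter per process is an added assumption, meant to avoid reloading the model for every image; it is not something the answer states.

```python
import multiprocessing
from multiprocessing import Pool

MODEL_PATH = "./optimized_graph.lite"  # path taken from the question's script
interpreter_cache = {}  # per-process cache: each worker has its own copy


def load_tflite(model_path):
    """Build a TFLite interpreter; imported lazily, inside the worker."""
    import tflite_runtime.interpreter as tflite
    interpreter = tflite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    return interpreter


def inference(img_idx, img, loader=load_tflite):
    """Load the model on first use in this process, then run one inference.

    `img` must already match the model's input shape and dtype.
    `loader` is a hypothetical hook that lets the model loading be swapped.
    """
    if MODEL_PATH not in interpreter_cache:
        interpreter_cache[MODEL_PATH] = loader(MODEL_PATH)
    interpreter = interpreter_cache[MODEL_PATH]
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]
    interpreter.set_tensor(input_index, img)
    interpreter.invoke()
    return img_idx, interpreter.get_tensor(output_index)


def run_parallel(test_imgs):
    """Mirror the answer's call: one worker per CPU core, no initializer."""
    with Pool(processes=multiprocessing.cpu_count()) as pool:
        return pool.starmap(inference, enumerate(test_imgs))
```

run_parallel should be called from under an `if __name__ == '__main__':` guard, as in the question's script. Because the cache is a module-level dict, every worker process fills its own copy on its first task, so no interpreter object is ever passed between processes.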
