Parallelize slow functions that need to run only every X iterations so they don't slow down the loop

Posted 2023-02-16

【Title】: Parallelize slow functions that need to be run only every X iterations in order to not slow the loop 【Posted】: 2021-12-23 13:43:56 【Question】:

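Project

I am working on a project where I need to detect faces (bounding boxes and landmarks) and perform face recognition (identifying whose face it is). Detection is very fast (less than a few milliseconds on my laptop), but recognition can be very slow (about 0.4 seconds on my laptop). I am using the face_recognition Python library for this. After a few tests, I found that computing the image embedding is the slow part.

Here is some sample code you can try yourself: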

# Source : https://pypi.org/project/face-recognition/
import face_recognition

known_image = face_recognition.load_image_file("biden.jpg")
biden_encoding = face_recognition.face_encodings(known_image)[0]

image = face_recognition.load_image_file("your_file.jpg")
face_locations = face_recognition.face_locations(image)
face_landmarks_list = face_recognition.face_landmarks(image)

unknown_encoding = face_recognition.face_encodings(image)[0]
results = face_recognition.compare_faces([biden_encoding], unknown_encoding)
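
To see which of these calls dominates, each step can be timed separately. The following is only a minimal sketch: `time_call`, `fast_step` and `slow_step` are hypothetical names, and the two stand-in functions merely imitate a cheap and an expensive step so that the example runs without the library or any image files.

```python
import time

def time_call(func, *args, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` calls of func(*args)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

def fast_step(n):
    # Hypothetical stand-in for a cheap step such as detection
    return sum(range(n))

def slow_step(n):
    # Hypothetical stand-in for an expensive step such as encoding
    time.sleep(0.05)
    return n

for name, step in [("fast_step", fast_step), ("slow_step", slow_step)]:
    print("{}: {:.1f} ms".format(name, 1000 * time_call(step, 1000)))
```

In the real program the stand-ins would be replaced by the calls above, for example `time_call(face_recognition.face_encodings, image)`.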

Problem

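What I need to do is process video (30 FPS), so 0.4 s of computation per frame is not acceptable: at 30 FPS each frame has a budget of about 33 ms. My idea is that recognition only needs to run a few times rather than on every frame, because from one frame to the next, if there are no cuts in the video, a given head will be close to its previous position. So the first time a head appears we run the slow recognition, but for the next X frames we don't have to, because we detect that its position is close to the previous one, so it must be the same person who has moved. Of course this approach is not perfect, but it seems like a good compromise and I would like to try it.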
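
The position-based shortcut described above could look roughly like the sketch below. It assumes boxes in the `(top, right, bottom, left)` order returned by `face_recognition.face_locations`; the helper names `box_center` and `reuse_name`, the 40-pixel threshold and the sample boxes are all made up for illustration.

```python
import math

def box_center(box):
    """Center (x, y) of a box given as (top, right, bottom, left)."""
    top, right, bottom, left = box
    return ((left + right) / 2.0, (top + bottom) / 2.0)

def reuse_name(box, previous, max_distance=40.0):
    """Return the name of the closest previous face within max_distance, or None.

    previous: list of (box, name) pairs from the last frame.
    """
    cx, cy = box_center(box)
    best_name, best_dist = None, max_distance
    for prev_box, name in previous:
        px, py = box_center(prev_box)
        dist = math.hypot(cx - px, cy - py)
        if dist <= best_dist:
            best_name, best_dist = name, dist
    return best_name

# Faces known from the previous frame (illustrative values)
previous = [((50, 150, 150, 50), "biden"), ((60, 400, 160, 300), "obama")]

print(reuse_name((54, 156, 154, 56), previous))     # moved a few pixels: name is reused
print(reuse_name((300, 700, 400, 600), previous))   # far from every known face: None
```

A `None` result is the signal that the slow recognition has to run for that face.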

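The only problem is that, done this way, the video is smooth until a head appears, then freezes while recognition runs, then becomes smooth again. This is where I would like to introduce multiprocessing: I want to compute the recognition while continuing to loop over the video frames. If I can do that, I only need to process a few frames ahead, so that by the time a face is shown its recognition was already computed a little earlier, spread over several frames, and we see no drop in frame rate.

Simple formulation

So this is what I have (in Python pseudocode for clarity):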
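
One way to read "processing a few frames ahead" is to display each frame a fixed number of frames late, so that work started on the newest frame has time to finish before that frame is shown. The generator below is a small sketch of that buffering only; `delayed` and its `delay` parameter are hypothetical names, and integers stand in for frames.

```python
from collections import deque

def delayed(frames, delay):
    """Yield (frame_to_display, newest_frame) pairs, display lagging by `delay` frames.

    Once the input is exhausted, the buffered frames are flushed with
    newest_frame set to None.
    """
    buffer = deque()
    for frame in frames:
        buffer.append(frame)
        if len(buffer) > delay:
            yield buffer.popleft(), frame
    while buffer:
        yield buffer.popleft(), None

for shown, newest in delayed(range(5), 2):
    print("display", shown, "| start recognition on", newest)
```

The cost of this design is a constant latency of `delay` frames, which is usually acceptable for a recorded video but may not be for a live feed.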

def slow_function(image):
    # This function takes a lot of time to compute and would normally slow down the loop
    return Recognize(image)
    
# Loop that we need to maintain at a given speed
person_name = "unknown"
frame_index = -1
while True:
    frame_index += 1
    frame = new_frame() # how frames are obtained is not important, so it is not detailed
    
    # Every ten frames, we run a heavy function
    if frame_index % 10 == 0:
        person_name = slow_function(frame)

    # each frame we use the person_name even if we only compute it every so often
    frame.drawText(person_name)

I would like to do something like this:

def slow_function(image):
    # This function takes a lot of time to compute and would normally slow down the loop
    return Recognize(image)
    
# Loop that we need to maintain at a given speed
person_name = "unknown"
frame_index = -1
while True:
    frame_index += 1
    frame = new_frame() # how frames are obtained is not important, so it is not detailed
    
    # Every ten frames, we run a heavy function
    if frame_index % 10 == 0:
        DO slow_function(frame) IN parallel WITH CALLBACK(person_name = result)

    # each frame we use the person_name even if we only compute it every so often
    frame.drawText(person_name)

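The goal is to spread the computation of a slow function over several iterations of the loop.

What I have tried

I looked at multiprocessing and Ray, but I did not find an example of what I want to do. Most of the time, people use multiprocessing to compute the result of a function for different inputs at the same time. That is not what I want. I want a loop and, in parallel, a process that takes data from the loop (a frame), does some computation, and returns a value to the loop without interrupting or slowing it down (or at least spreads the slowdown out, instead of having one very slow iteration followed by 9 fast ones).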
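
As a possible reading of the pseudocode above, the slow call can be submitted to an executor from the standard `concurrent.futures` module and the loop can pick up the result with a non-blocking `done()` check. This is a sketch under assumptions, not the only way to do it: `run_loop` and `frame_time` are hypothetical, `slow_function` just sleeps, and a new computation is skipped while the previous one is still running so that work cannot pile up.

```python
import time
from concurrent.futures import ProcessPoolExecutor

def slow_function(frame):
    """Stand-in for the recognition step (hypothetical; sleeps instead of computing)."""
    time.sleep(0.2)
    return "person seen in frame {}".format(frame)

def run_loop(executor, frames, func, every=10, frame_time=0.02):
    """Label every frame, refreshing the label in the background every `every` frames."""
    person_name = "unknown"
    pending = None  # Future of the computation currently in flight, if any
    labels = []
    for frame_index, frame in enumerate(frames):
        # Non-blocking check: pick up the result only if it is ready
        if pending is not None and pending.done():
            person_name = pending.result()
            pending = None
        # Start a new background computation, unless one is still running
        if frame_index % every == 0 and pending is None:
            pending = executor.submit(func, frame)
        labels.append(person_name)  # stands in for frame.drawText(person_name)
        time.sleep(frame_time)      # stands in for the rest of the per-frame work
    return labels

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as executor:
        labels = run_loop(executor, range(30), slow_function)
    print(labels[0], "->", labels[-1])
```

The function passed to a process-based executor has to be defined at module level so that it can be pickled, and the `if __name__ == "__main__":` guard is needed on platforms that start worker processes by re-importing the main module.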

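【Comments】:

I don't understand what you are trying to do or how it differs from the multiprocessing examples you mention. As I see it, you also run the same function with different inputs, except that you start a new process every 10 loops, which is still similar to the examples that run functions at the same time. Every 10 loops you can start a new Process() with a different input.

Could you give me an example of that? I can't find one.

【Answer 1】:

I think I have almost found how to do what I want. Here is an example: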

from multiprocessing import Pool
import time

# This seems to me more precise than time.sleep()
def sleep(duration, get_now=time.perf_counter):
    now = get_now()
    end = now + duration
    while now < end:
        now = get_now()
        

def myfunc(x):
    time.sleep(1)
    return x
 
def mycallback(x):
    print('Callback for i = {}'.format(x))

if __name__ == '__main__':
    pool=Pool()
    
    # Approx of 5s in total
    # Without parallelization, this should take 15s
    t0 = time.time()
    titer = time.time()
    for i in range(100):
        if i% 10 == 0: pool.apply_async(myfunc, (i,), callback=mycallback)
        sleep(0.05) # 50ms
        print("- i =", i, "/ Time iteration:", 1000*(time.time()-titer), "ms")
        titer = time.time()
        
    print("\n\nTotal time:", (time.time()-t0), "s")
    
    t0 = time.time()
    for i in range(100):
        sleep(0.05)
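    print("\n\nBenchmark sleep time per iteration:", 10*(time.time()-t0), "ms")

    # Stop accepting new tasks and wait for the outstanding ones to finish
    pool.close()
    pool.join()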

Of course, I still need to add flags so that the callback does not write the value at the same moment the loop is reading it.

【Comments】:
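
Instead of hand-written flags, a lock can guard the shared value. With `multiprocessing.Pool`, the `apply_async` callback runs in a result-handling thread of the parent process, so a `threading.Lock` is enough. The sketch below assumes a small holder class, `LatestResult`, which is a hypothetical name; the sleeps stand in for the real work.

```python
import threading
import time
from multiprocessing import Pool

class LatestResult:
    """Thread-safe holder for the most recent result and an update counter."""

    def __init__(self, value="unknown"):
        self._lock = threading.Lock()
        self._value = value
        self._updates = 0

    def set(self, value):
        # Called from the pool's result-handling thread
        with self._lock:
            self._value = value
            self._updates += 1

    def get(self):
        # Called from the main loop
        with self._lock:
            return self._value, self._updates

def myfunc(x):
    time.sleep(0.2)  # stand-in for the slow computation
    return x

if __name__ == "__main__":
    latest = LatestResult()
    with Pool(processes=2) as pool:
        for i in range(30):
            if i % 10 == 0:
                pool.apply_async(myfunc, (i,), callback=latest.set)
            value, updates = latest.get()  # consistent snapshot for this frame
            time.sleep(0.02)               # stand-in for per-frame work
        pool.close()
        pool.join()
    print(latest.get())
```

The lock matters most when several fields have to change together, as with the value and the counter here; both critical sections are short, so the loop is never held up for long.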

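You should use Process rather than Pool, because a Pool usually creates only 4 processes (if you have a 4-core CPU), and when you try to run a 5th task it has to wait until one of the others has finished. You may need a Queue to send values from the process to the main thread. With a Queue you can check the queue in the main for-loop to get the current value, so it won't be received while you are reading other elements.

Thanks, I tried it. At first it didn't work and was even slower, but that was on Windows; after switching to Ubuntu it works well! Thanks again.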
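
The Process and Queue suggestion could be sketched as follows: one long-lived worker process receives frames on a task queue and sends names back on a result queue, and the main loop drains the result queue without blocking. The names `worker` and `poll_latest` are hypothetical, integers stand in for frames, and the sleep stands in for recognition.

```python
import multiprocessing as mp
import queue
import time

def worker(tasks, results):
    """Run in a separate process: handle frames until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        frame_index, frame = item
        time.sleep(0.2)  # stand-in for the slow recognition step
        results.put((frame_index, "person in frame {}".format(frame)))

def poll_latest(results, current):
    """Drain the result queue without blocking; return the newest value seen."""
    while True:
        try:
            _, current = results.get_nowait()
        except queue.Empty:
            return current

if __name__ == "__main__":
    tasks, results = mp.Queue(), mp.Queue()
    proc = mp.Process(target=worker, args=(tasks, results))
    proc.start()
    person_name = "unknown"
    for frame_index in range(30):
        if frame_index % 10 == 0:
            tasks.put((frame_index, frame_index))  # the frame itself would go here
        person_name = poll_latest(results, person_name)
        time.sleep(0.02)  # stand-in for per-frame work
    tasks.put(None)  # ask the worker to stop
    proc.join()      # safe here because the queued results are tiny
    person_name = poll_latest(results, person_name)
    print(person_name)
```

Starting the worker once avoids paying the process start-up cost every ten frames. With large results, the result queue should be drained before `join()`, since a process that still has buffered queue data may not exit.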
