Processing Data with Multiple Processes in Python

Posted 2022-05-23 JasonLiu1919


Background

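Suppose there are 10 million task records and each one takes 1 s to process. How can the whole job be sped up? One option is to spread the work across multiple processes.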
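For orientation, here is a minimal sketch of the same idea using only the standard library's `multiprocessing.Pool`. It is not the approach this post goes on to use. The `run_parallel` helper, the `"task0"` style inputs, and the shortened 0.2 s delay are all assumptions made for illustration.

```python
import time
from multiprocessing import Pool


def text_processing(text):
    """Stand-in for a slow per-record operation (hypothetical workload)."""
    time.sleep(0.2)  # shortened delay so the demo finishes quickly
    return text + " HUST"


def run_parallel(items, nb_workers=5):
    """Map text_processing over items using a pool of worker processes."""
    with Pool(processes=nb_workers) as pool:
        # pool.map returns results in the same order as the inputs
        return pool.map(text_processing, items)


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this section on
    # platforms that start workers by re-importing the main module.
    items = ["task" + str(i) for i in range(10)]
    start_time = time.time()
    results = run_parallel(items)
    print("pool.map cost=", time.time() - start_time)
    print(results[:3])
```

The worker function has to be defined at module level so that it can be pickled and sent to the worker processes.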

Solution

pandas + pandarallel
Install pandarallel: pip install pandarallel

Example

# -*- coding: utf-8 -*-
# @Time    : 2022/5/21 6:14 PM
# @Author  : JasonLiu
# @FileName: test.py
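import time
import pandas as pd
from pandarallel import pandarallel

# Configure pandarallel to use 5 worker processes
pandarallel.initialize(nb_workers=5)
text1 = ["华中科技大学"+str(i) for i in range(10)]

# Build a one-column DataFrame holding the 10 task strings
task_df = pd.DataFrame({"text1": text1})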


def text_processing(text):
    """
    Do some text processing; this is for demonstration only.
    """
    text += " HUST"
    time.sleep(2)
    return text


# Baseline: plain pandas apply, which processes the rows one at a time
start_time = time.time()
task_df["new_text1"] = task_df["text1"].apply(text_processing)
end_time = time.time()
print("raw apply cost=", end_time-start_time)
print(task_df)

# Same work with parallel_apply, which splits the rows across the workers
start_time = time.time()
task_df["new_text2"] = task_df["text1"].parallel_apply(text_processing)
end_time = time.time()
print("parallel_apply cost=", end_time-start_time)
print(task_df)

The output is as follows:

INFO: Pandarallel will run on 5 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
raw apply cost= 20.01844358444214
     text1     new_text1
0  华中科技大学0  华中科技大学0 HUST
1  华中科技大学1  华中科技大学1 HUST
2  华中科技大学2  华中科技大学2 HUST
3  华中科技大学3  华中科技大学3 HUST
4  华中科技大学4  华中科技大学4 HUST
5  华中科技大学5  华中科技大学5 HUST
6  华中科技大学6  华中科技大学6 HUST
7  华中科技大学7  华中科技大学7 HUST
8  华中科技大学8  华中科技大学8 HUST
9  华中科技大学9  华中科技大学9 HUST
parallel_apply cost= 4.040616035461426
     text1     new_text1     new_text2
0  华中科技大学0  华中科技大学0 HUST  华中科技大学0 HUST
1  华中科技大学1  华中科技大学1 HUST  华中科技大学1 HUST
2  华中科技大学2  华中科技大学2 HUST  华中科技大学2 HUST
3  华中科技大学3  华中科技大学3 HUST  华中科技大学3 HUST
4  华中科技大学4  华中科技大学4 HUST  华中科技大学4 HUST
5  华中科技大学5  华中科技大学5 HUST  华中科技大学5 HUST
6  华中科技大学6  华中科技大学6 HUST  华中科技大学6 HUST
7  华中科技大学7  华中科技大学7 HUST  华中科技大学7 HUST
8  华中科技大学8  华中科技大学8 HUST  华中科技大学8 HUST
9  华中科技大学9  华中科技大学9 HUST  华中科技大学9 HUST

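As the output shows, using Pandarallel with 5 workers cuts the total processing time from about 20 s with plain apply to about 4 s with parallel_apply.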
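These timings match a simple back-of-the-envelope estimate: each worker handles its share of the rows one after another, so the ideal wall time is the per-worker row count times the per-row cost. The sketch below uses a hypothetical helper, `ideal_parallel_seconds`, and assumes an even split with no process start-up or data-transfer overhead, which is why the measured 4.04 s sits slightly above the estimate.

```python
import math


def ideal_parallel_seconds(n_tasks, seconds_per_task, nb_workers):
    """Lower bound on wall time when tasks are split evenly across workers."""
    tasks_per_worker = math.ceil(n_tasks / nb_workers)
    return tasks_per_worker * seconds_per_task


# The example above: 10 rows, 2 s each
print(ideal_parallel_seconds(10, 2, 1))  # one process: 20
print(ideal_parallel_seconds(10, 2, 5))  # five workers: 4

# The scenario from the background section: 10 million records at 1 s each
serial = ideal_parallel_seconds(10_000_000, 1, 1)
parallel = ideal_parallel_seconds(10_000_000, 1, 5)
print(serial / 86400, parallel / 86400)  # roughly 115.7 and 23.1 days
```

Under these assumptions the speedup is bounded by the number of workers, so a job of that size benefits from as many workers as the machine has cores to run them.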
