Sharing a variable (data from a file) among multiple Python scripts without loading duplicates

Posted 2023-02-16

【Title】: share variable (data from file) among multiple python scripts with not loaded duplicates
【Posted】: 2016-01-15 21:10:07
【Question】:

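I want to load a large matrix stored in matrix_file.mtx. The load must happen only once. Once the variable matrix is in memory, I want many Python scripts to share it without duplicating it, so that I get a memory-efficient multi-script program driven from bash (or from Python itself). I imagine something like this pseudocode: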

# Loading and sharing script:
import share
matrix = open("matrix_file.mtx","r")
share.send_to_shared_ram(matrix, as_variable('matrix'))

# Shared matrix variable processing script_1
import share
pointer_to_matrix = share.share_variable_from_ram('matrix')
type(pointer_to_matrix)
# output: <type 'numpy.ndarray'>

# Shared matrix variable processing script_2
import share
pointer_to_matrix = share.share_variable_from_ram('matrix')
type(pointer_to_matrix)
# output: <type 'numpy.ndarray'>
...

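The idea is that pointer_to_matrix points to the matrix in RAM, which is loaded only once for the n scripts (not n times). The scripts are called separately from a bash script (or, if possible, from a Python main script):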

$ python Load_and_share.py
$ python script_1.py -args string &
$ python script_2.py -args string &
$ ...
$ python script_n.py -args string &

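I am also interested in solutions that go through the hard disk, i.e. matrix could be stored on disk and the share object could access it as needed. Either way, the object in RAM (a kind of pointer) should be usable as if it were the whole matrix.

Thank you for your help.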

【Comments】:

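Which operating system do your scripts run under?

They run on Ubuntu 14.04, and my scripts are written in Python 2.7.

I found this ***.com/questions/19289171/…, but I don't know whether the serialized variable is really shared or is actually loaded n times.

Look up mmap, the memory-mapped file support. You can open the saved matrix file in each separate Python script as needed. If you can open it read-only (PROT_READ but not PROT_WRITE), copying will be minimized or eliminated.

【Answer 1】: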

Between the mmap module and numpy.frombuffer, this is fairly simple:

import mmap
import numpy as np

with open("matrix_file.mtx","rb") as matfile:
    mm = mmap.mmap(matfile.fileno(), 0, access=mmap.ACCESS_READ)
    # Optionally, on UNIX-like systems in Py3.3+, add:
    # os.posix_fadvise(matfile.fileno(), 0, len(mm), os.POSIX_FADV_WILLNEED)
    # to trigger background read in of the file to the system cache,
    # minimizing page faults when you use it

matrix = np.frombuffer(mm, np.uint8)

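Each process performs this work separately and gets a read-only view of the same memory. Change the dtype to something other than uint8 as needed. Switching to ACCESS_WRITE would allow modifications to the shared data, but it would require synchronization and possibly explicit calls to mm.flush to ensure the changes are actually reflected in other processes.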
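To make the dtype point concrete, here is a minimal sketch. It assumes Python 3 and that the matrix has first been written out once as headerless raw bytes with a known dtype and shape; if matrix_file.mtx is actually a text format, it would need a one-time conversion like this before mapping. The names save_raw and open_shared, the file name matrix.bin, and the 3x4 float64 matrix are hypothetical, chosen only for illustration.

```python
import mmap
import os
import tempfile

import numpy as np


def save_raw(matrix, path):
    """Write the matrix once as headerless, C-ordered raw bytes."""
    np.ascontiguousarray(matrix).tofile(path)


def open_shared(path, dtype, shape):
    """Map the file read-only and view it as an ndarray without copying."""
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # frombuffer wraps the mapped pages; reshape only changes the view.
    return np.frombuffer(mm, dtype=dtype).reshape(shape)


if __name__ == "__main__":
    # Hypothetical 3x4 float64 matrix standing in for the real data.
    path = os.path.join(tempfile.mkdtemp(), "matrix.bin")
    save_raw(np.arange(12, dtype=np.float64).reshape(3, 4), path)
    view = open_shared(path, np.float64, (3, 4))
    print(view.shape, view.flags.writeable)
```

Each script would call open_shared with the same path, dtype and shape. numpy.memmap with mode="r" should give a similar read-only view in a single call, if you prefer not to handle the mmap object yourself.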

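A more complex solution that follows your initial design more closely might be to use multiprocessing.SyncManager to create a connectable shared "server" for the data, allowing a single common store of data to be registered with the manager and returned to as many users as desired. Creating an Array (based on ctypes types) with the correct type on the manager, then register-ing a function that returns the same shared Array to all callers, would work too (each caller would then convert the returned Array via numpy.frombuffer as before). It's much more involved (it would be easier to have a single Python process initialize an Array, then launch Processes that share it automatically thanks to fork semantics), but it's the closest to the concept you describe.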
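The simpler variant mentioned in parentheses (one parent process fills a shared array, then starts workers) could look roughly like the sketch below. This is not the SyncManager design, and it does not cover scripts launched separately from bash. It assumes Python 3 on a POSIX system where the "fork" start method is available; the functions as_ndarray, column_sum and run, and the 4x3 example data, are hypothetical.

```python
import ctypes
import multiprocessing as mp

import numpy as np


def as_ndarray(shared, shape):
    """View a shared ctypes array as an ndarray without copying."""
    return np.frombuffer(shared, dtype=np.float64).reshape(shape)


def column_sum(shared, shape, col, out_queue):
    """Worker: read one column from the shared matrix and report its sum."""
    matrix = as_ndarray(shared, shape)
    out_queue.put((col, float(matrix[:, col].sum())))


def run(shape=(4, 3)):
    ctx = mp.get_context("fork")  # assumption: POSIX system with fork
    # RawArray has no lock, which is enough when workers only read.
    shared = ctx.RawArray(ctypes.c_double, shape[0] * shape[1])
    # Fill the shared buffer once, in the parent (stand-in for the real load).
    as_ndarray(shared, shape)[:] = np.arange(shape[0] * shape[1]).reshape(shape)

    queue = ctx.Queue()
    procs = [
        ctx.Process(target=column_sum, args=(shared, shape, col, queue))
        for col in range(shape[1])
    ]
    for p in procs:
        p.start()
    # Drain the queue before joining so no worker blocks on a full pipe.
    results = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    return results


if __name__ == "__main__":
    print(run())
```

The workers here take the place of script_1 ... script_n: each one would import its processing function and receive the same shared array, so the matrix is loaded only once in the parent.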
