记忆到磁盘 - python - 持久记忆

Posted 2023-02-23

技术标签:

【中文标题】记忆到磁盘 - python - 持久记忆【英文标题】：memoize to disk - python - persistent memoization 【发布时间】：2013-05-04 00:51:16 【问题描述】：

有没有办法将函数的输出记忆到磁盘？

我有一个函数

def gethtmlOfUrl(url):
    ... # expensive computation

并且想做类似的事情：

def getHtmlMemoized(url) = memoizeToFile(getHtmlOfUrl, "file.dat")

然后调用getHtmlMemoized(url)，这样每个url只做一次昂贵的计算。

【问题讨论】：

只需腌制（或使用 json）缓存字典。谢谢，但我是 python 新手（第二天）。我不知道你的意思是什么...... 好吧，作为新手，你要做的就是在 Google 中查找“pickle python”，如果您有任何其他问题，请回复我们。与其尝试重新发明这个***，这里有一个库，它做得很好，并且对你在很久以后才会预料到的各种极端情况（并发、磁盘使用、雷鸣般的羊群）具有鲁棒性): bitbucket.org/zzzeek/dogpile.cache 您可以使用库 redis-simple-cache 正是这样做的，函数调用的持久记忆。看看：github.com/vivekn/redis-simple-cache 【参考方案1】：

由 Python 的 Shelve module 提供支持的更简洁的解决方案。优点是缓存通过众所周知的dict 语法实时更新，而且它是异常证明（无需处理烦人的KeyError）。

import shelve
def shelve_it(file_name):
    d = shelve.open(file_name)

    def decorator(func):
        def new_func(param):
            if param not in d:
                d[param] = func(param)
            return d[param]

        return new_func

    return decorator

@shelve_it('cache.shelve')
def expensive_funcion(param):
    pass

这将有助于函数只计算一次。接下来的后续调用将返回存储的结果。

【讨论】：

【参考方案2】：

查看joblib.Memory。这是一个可以做到这一点的库。

from joblib import Memory
memory = Memory("cachedir")
@memory.cache
def f(x):
    print('Running f(%s)' % x)
    return x

【讨论】：

优点：对我有用，而且速度超级快。缺点：文件夹名称是字母和数字，如@987654324@，输出是.pkl文件，所以我无法轻易看到缓存数据的样子。【参考方案3】：

大多数答案都是以装饰者的方式。但也许我不想每次调用函数时都缓存结果。

我使用上下文管理器做了一个解决方案，因此该函数可以称为

with DiskCacher('cache_id', myfunc) as myfunc2:
    res=myfunc2(...)

当您需要缓存功能时。

'cache_id'字符串用于区分数据文件，命名为[calling_script]_[cache_id].dat。所以如果你在循环中这样做，需要将循环变量合并到这个cache_id中，否则数据将被覆盖。

或者：

myfunc2=DiskCacher('cache_id')(myfunc)
res=myfunc2(...)

或者（这可能不太有用，因为始终使用相同的 id）：

@DiskCacher('cache_id')
def myfunc(*args):
    ...

带有示例的完整代码（我使用pickle 来保存/加载，但可以更改为任何保存/读取方法。请注意，这也是假设所讨论的函数仅返回 1 个返回值）：

from __future__ import print_function
import sys, os
import functools

def formFilename(folder, varid):
    '''Compose abspath for cache file

    Args:
        folder (str): cache folder path.
        varid (str): variable id to form file name and used as variable id.
    Returns:
        abpath (str): abspath for cache file, which is using the <folder>
            as folder. The file name is the format:
                [script_file]_[varid].dat
    '''
    script_file=os.path.splitext(sys.argv[0])[0]
    name='[%s]_[%s].nc' %(script_file, varid)
    abpath=os.path.join(folder, name)

    return abpath


def readCache(folder, varid, verbose=True):
    '''Read cached data

    Args:
        folder (str): cache folder path.
        varid (str): variable id.
    Keyword Args:
        verbose (bool): whether to print some text info.
    Returns:
        results (tuple): a tuple containing data read in from cached file(s).
    '''
    import pickle
    abpath_in=formFilename(folder, varid)
    if os.path.exists(abpath_in):
        if verbose:
            print('\n# <readCache>: Read in variable', varid,
                    'from disk cache:\n', abpath_in)
        with open(abpath_in, 'rb') as fin:
            results=pickle.load(fin)

    return results


def writeCache(results, folder, varid, verbose=True):
    '''Write data to disk cache

    Args:
        results (tuple): a tuple containing data read to cache.
        folder (str): cache folder path.
        varid (str): variable id.
    Keyword Args:
        verbose (bool): whether to print some text info.
    '''
    import pickle
    abpath_out=formFilename(folder, varid)
    if verbose:
        print('\n# <writeCache>: Saving output to:\n',abpath_out)
    with open(abpath_out, 'wb') as fout:
        pickle.dump(results, fout)

    return


class DiskCacher(object):
    def __init__(self, varid, func=None, folder=None, overwrite=False,
            verbose=True):
        '''Disk cache context manager

        Args:
            varid (str): string id used to save cache.
                function <func> is assumed to return only 1 return value.
        Keyword Args:
            func (callable): function object whose return values are to be
                cached.
            folder (str or None): cache folder path. If None, use a default.
            overwrite (bool): whether to force a new computation or not.
            verbose (bool): whether to print some text info.
        '''

        if folder is None:
            self.folder='/tmp/cache/'
        else:
            self.folder=folder

        self.func=func
        self.varid=varid
        self.overwrite=overwrite
        self.verbose=verbose

    def __enter__(self):
        if self.func is None:
            raise Exception("Need to provide a callable function to __init__() when used as context manager.")

        return _Cache2Disk(self.func, self.varid, self.folder,
                self.overwrite, self.verbose)

    def __exit__(self, type, value, traceback):
        return

    def __call__(self, func=None):
        _func=func or self.func
        return _Cache2Disk(_func, self.varid, self.folder, self.overwrite,
                self.verbose)



def _Cache2Disk(func, varid, folder, overwrite, verbose):
    '''Inner decorator function

    Args:
        func (callable): function object whose return values are to be
            cached.
        varid (str): variable id.
        folder (str): cache folder path.
        overwrite (bool): whether to force a new computation or not.
        verbose (bool): whether to print some text info.
    Returns:
        decorated function: if cache exists, the function is <readCache>
            which will read cached data from disk. If needs to recompute,
            the function is wrapped that the return values are saved to disk
            before returning.
    '''

    def decorator_func(func):
        abpath_in=formFilename(folder, varid)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if os.path.exists(abpath_in) and not overwrite:
                results=readCache(folder, varid, verbose)
            else:
                results=func(*args, **kwargs)
                if not os.path.exists(folder):
                    os.makedirs(folder)
                writeCache(results, folder, varid, verbose)
            return results
        return wrapper

    return decorator_func(func)



if __name__=='__main__':

    data=range(10)  # dummy data

    #--------------Use as context manager--------------
    def func1(data, n):
        '''dummy function'''
        results=[i*n for i in data]
        return results

    print('\n### Context manager, 1st time call')
    with DiskCacher('context_mananger', func1) as func1b:
        res=func1b(data, 10)
        print('res =', res)

    print('\n### Context manager, 2nd time call')
    with DiskCacher('context_mananger', func1) as func1b:
        res=func1b(data, 10)
        print('res =', res)

    print('\n### Context manager, 3rd time call with overwrite=True')
    with DiskCacher('context_mananger', func1, overwrite=True) as func1b:
        res=func1b(data, 10)
        print('res =', res)

    #--------------Return a new function--------------
    def func2(data, n):
        results=[i*n for i in data]
        return results

    print('\n### Wrap a new function, 1st time call')
    func2b=DiskCacher('new_func')(func2)
    res=func2b(data, 10)
    print('res =', res)

    print('\n### Wrap a new function, 2nd time call')
    res=func2b(data, 10)
    print('res =', res)

    #----Decorate a function using the syntax sugar----
    @DiskCacher('pie_dec')
    def func3(data, n):
        results=[i*n for i in data]
        return results

    print('\n### pie decorator, 1st time call')
    res=func3(data, 10)
    print('res =', res)

    print('\n### pie decorator, 2nd time call.')
    res=func3(data, 10)
    print('res =', res)

输出：

### Context manager, 1st time call

# <writeCache>: Saving output to:
 /tmp/cache/[diskcache]_[context_mananger].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### Context manager, 2nd time call

# <readCache>: Read in variable context_mananger from disk cache:
 /tmp/cache/[diskcache]_[context_mananger].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### Context manager, 3rd time call with overwrite=True

# <writeCache>: Saving output to:
 /tmp/cache/[diskcache]_[context_mananger].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### Wrap a new function, 1st time call

# <writeCache>: Saving output to:
 /tmp/cache/[diskcache]_[new_func].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### Wrap a new function, 2nd time call

# <readCache>: Read in variable new_func from disk cache:
 /tmp/cache/[diskcache]_[new_func].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### pie decorator, 1st time call

# <writeCache>: Saving output to:
 /tmp/cache/[diskcache]_[pie_dec].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

### pie decorator, 2nd time call.

# <readCache>: Read in variable pie_dec from disk cache:
 /tmp/cache/[diskcache]_[pie_dec].nc
res = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

【讨论】：

【参考方案4】：

还有diskcache。

from diskcache import Cache

cache = Cache("cachedir")

@cache.memoize()
def f(x, y):
    print('Running f(, )'.format(x, y))
    return x, y

【讨论】：

如果要记忆的函数是一个对象的方法，而cache变量是在那个对象中初始化的，怎么用呢？它使用简单，效果很好。它将所有缓存数据存储在您指定的缓存目录中名为cache.db 的文件中。【参考方案5】：

你可以使用cache_to_disk包：

    from cache_to_disk import cache_to_disk

    @cache_to_disk(3)
    def my_func(a, b, c, d=None):
        results = ...
        return results

这会将结果缓存 3 天，具体到参数 a、b、c 和 d。结果存储在您机器上的 pickle 文件中，并在下次调用该函数时取消腌制并返回。 3天后，pickle文件被删除，直到函数重新运行。每当使用新参数调用该函数时，该函数将重新运行。更多信息在这里：https://github.com/sarenehan/cache_to_disk

【讨论】：

【参考方案6】：

Artemis library 有一个用于此目的的模块。（你需要pip install artemis-ml）

你装饰你的功能：

from artemis.fileman.disk_memoize import memoize_to_disk

@memoize_to_disk
def fcn(a, b, c = None):
    results = ...
    return results

在内部，它对输入参数进行散列，并通过该散列保存备忘录文件。

【讨论】：

【参考方案7】：

假设你的数据是 json 可序列化的，这段代码应该可以工作

import os, json

def json_file(fname):
    def decorator(function):
        def wrapper(*args, **kwargs):
            if os.path.isfile(fname):
                with open(fname, 'r') as f:
                    ret = json.load(f)
            else:
                with open(fname, 'w') as f:
                    ret = function(*args, **kwargs)
                    json.dump(ret, f)
            return ret
        return wrapper
    return decorator

装饰getHtmlOfUrl然后简单地调用它，如果之前已经运行过，你会得到你的缓存数据。

用 python 2.x 和 python 3.x 检查

【讨论】：

【参考方案8】：

Python 提供了一种非常优雅的方法——装饰器。基本上，装饰器是包装另一个函数以提供附加功能而不更改函数源代码的函数。你的装饰器可以这样写：

import json

def persist_to_file(file_name):

    def decorator(original_func):

        try:
            cache = json.load(open(file_name, 'r'))
        except (IOError, ValueError):
            cache = 

        def new_func(param):
            if param not in cache:
                cache[param] = original_func(param)
                json.dump(cache, open(file_name, 'w'))
            return cache[param]

        return new_func

    return decorator

完成后，使用@-syntax '装饰'函数就可以了。

@persist_to_file('cache.dat')
def html_of_url(url):
    your function code...

请注意，此装饰器是有意简化的，可能不适用于所有情况，例如，当源函数接受或返回无法进行 json 序列化的数据时。

更多关于装饰器的信息：How to make a chain of function decorators?

下面是如何让装饰器在退出时只保存一次缓存：

import json, atexit

def persist_to_file(file_name):

    try:
        cache = json.load(open(file_name, 'r'))
    except (IOError, ValueError):
        cache = 

    atexit.register(lambda: json.dump(cache, open(file_name, 'w')))

    def decorator(func):
        def new_func(param):
            if param not in cache:
                cache[param] = func(param)
            return cache[param]
        return new_func

    return decorator

【讨论】：

这将在每次更新缓存时写入一个新文件 - 取决于用例，这可能（或可能不会）破坏您从记忆中获得的加速...... 它也包含一个非常好的竞争条件，如果这个装饰器被同时使用，或者（更有可能）以可重入的方式使用。如果a() 和b() 都被记忆，并且a() 调用b()，则可以读取a() 的缓存，然后再读取b()，第一个b 的结果被记忆，然后是陈旧的缓存从对 a 的调用覆盖它，b 对缓存的贡献就丢失了。 @root: 当然，atexit 可能是刷新缓存的更好地方。另一方面，添加过早的优化可能会破坏此代码的教育目的。当然，在一般情况下，腌制而不是序列化为 json 会更好。标准库也包含这个docs.python.org/2/library/shelve.html，它将酸洗包装在一个dict风格的界面中......几乎是一个现成的缓存系统，只需添加装饰器我创建了一个名为 simpler 的包，带有一个注释 simpler.files.disk_cache，它完全符合此目的，包括缓存生命周期和使用注释参数的参数感知缓存。【参考方案9】：

应该这样做：

import json

class Memoize(object):
    def __init__(self, func):
        self.func = func
        self.memo = 

    def load_memo(filename):
        with open(filename) as f:
            self.memo.update(json.load(f))

    def save_memo(filename):
        with open(filename, 'w') as f:
            json.dump(self.memo, f)

    def __call__(self, *args):
        if not args in self.memo:
            self.memo[args] = self.func(*args)
        return self.memo[args]

基本用法：

your_mem_func = Memoize(your_func)
your_mem_func.load_memo('yourdata.json')
#  do your stuff with your_mem_func

如果你想在使用后将你的 "cache" 写入一个文件 -- 以便将来再次加载：

your_mem_func.save_memo('yournewdata.json')

【讨论】：

@seguso - 更新了使用情况。更多关于记忆：***.com/questions/1988804/… @Merlin - 这是一个典型的记忆化案例，一个类应该（或者也可以）被使用...... 谢谢，但这不是我所说的 memoize。我不想手动管理加载和保存到磁盘。这应该是透明的。 @seguso - 然后使用 thg435 变体，但请注意，每次更新缓存时它都会写入文件 - 这可能最终会破坏记忆的目的。（在这种情况下，您可以管理写入，而您仍然只需要加载文件一次）

以上是关于记忆到磁盘 - python - 持久记忆的主要内容，如果未能解决你的问题，请参考以下文章