How to use cache filter?

Posted: 2021-11-30 17:43:33

Question:

I have a problem with my cache filter.

The idea is to not cache responses that contain "incomplete_results":true.

Here is my filter function:

import requests
import requests_cache

def phrase_filter(response: requests.models.Response)->bool:
    if '"incomplete_results":true' in response.text:
        return False
    return True

But when I test it with this code:

requests_cache.install_cache('demo_cache',expired_after=600,filter_fn=phrase_filter)
requests_cache.clear()

url1 = 'https://raw.githubusercontent.com/KienTrann/requests-cache-testing/main/should_be_cached.txt'
url2 = 'https://raw.githubusercontent.com/KienTrann/requests-cache-testing/main/should_not_be_cached.txt'

with requests_cache.enabled():
    r = requests.get(url1) # First request
    r = requests.get(url1) # Second request
    print(f'Text from url1:\n{r.text}')
    assert r.from_cache==True
    #
    r1 = requests.get(url2) # First request
    r1 = requests.get(url2) # Second request
    print('---')
    print(f'Text from url2:\n{r1.text}')
    assert r1.from_cache==False

requests_cache.disabled()

Here is the result:

Text from url1:
abc
xyz
"incomplete_results":false

---
Text from url2:
abc
xyz
"incomplete_results":true

Traceback (most recent call last):
  File "C:\Users\ADMIN\source\repos\LearningPython\py_2\py_2.py", line 25, in <module>
    assert r1.from_cache==False
AssertionError

I don't understand why r1 was cached.

What is wrong, and how can I fix it?

Thanks for taking the time to answer.

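Answer 1:

Patching

Looks like you're almost there! requests_cache.enabled() and disabled() are context-manager alternatives to install_cache() and uninstall_cache(). enabled() installs its own cache using the arguments it is given, so calling it with no arguments replaces your earlier install_cache() settings with the defaults, and your filter is never applied. Just pass your settings to enabled() instead of install_cache():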

with requests_cache.enabled('demo_cache', expire_after=600, filter_fn=phrase_filter):
    # ... make requests

This is essentially the same as:

requests_cache.install_cache('demo_cache', expire_after=600, filter_fn=phrase_filter)
# ... make requests
requests_cache.uninstall_cache()

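Sessions

I'd personally recommend using requests_cache.CachedSession instead of the patching methods, because it makes it more explicit what is being cached, and if you want to make a non-cached request you can simply use the regular requests functions. More info in the docs here: https://requests-cache.readthedocs.io/en/stable/user_guide/general.html

Example: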

from requests import Response
from requests_cache import CachedSession

def phrase_filter(response: Response) -> bool:
    return '"incomplete_results":true' not in response.text

url1 = 'https://raw.githubusercontent.com/KienTrann/requests-cache-testing/main/should_be_cached.txt'
url2 = 'https://raw.githubusercontent.com/KienTrann/requests-cache-testing/main/should_not_be_cached.txt'
session = CachedSession('demo_cache', expire_after=600, filter_fn=phrase_filter)
session.cache.clear()

nonfiltered_response = session.get(url1)
nonfiltered_response = session.get(url1)
assert nonfiltered_response.from_cache is True

filtered_response = session.get(url2)
filtered_response = session.get(url2)
assert filtered_response.from_cache is False
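
The substring check above only matches when the body contains exactly `"incomplete_results":true` with no whitespace. If the real API returns JSON (an assumption; the two test files used here are plain text, so the substring check is the right tool for them), a filter that parses the body may be more robust. The following is a minimal sketch: `incomplete_results_filter` and the sample payloads are hypothetical, and stand-in objects exposing only `.text` are used in place of real responses.

```python
import json
from types import SimpleNamespace


def incomplete_results_filter(response) -> bool:
    """Return True if the response should be cached, False otherwise."""
    try:
        body = json.loads(response.text)
    except ValueError:
        # Body is not valid JSON: nothing to inspect, so allow caching (assumption).
        return True
    if not isinstance(body, dict):
        return True
    # Skip caching only when the flag is present and set to JSON true.
    return body.get("incomplete_results") is not True


# Stand-ins for real responses (hypothetical payloads; only `.text` is read).
complete = SimpleNamespace(text='{"incomplete_results": false, "items": []}')
incomplete = SimpleNamespace(text='{"incomplete_results": true, "items": []}')

print(incomplete_results_filter(complete))
print(incomplete_results_filter(incomplete))
```

A function like this would be passed as filter_fn in the same way as phrase_filter.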

Debugging

If you run into similar problems in the future and aren't sure why a response is or isn't being cached, you can enable debug logging:

import logging
logging.basicConfig(level='DEBUG')

You will then get cache information for each response, like this:

DEBUG:requests_cache.session: Pre-cache checks for response from https://raw.githubusercontent.com/KienTrann/requests-cache-testing/main/should_not_be_cached.txt: 

    'disabled cache': False,
    'disabled method': False,
    'disabled status': False,
    'disabled by filter': True,
    'disabled by headers or expiration params': False,

More info in the docs here: https://requests-cache.readthedocs.io/en/stable/user_guide/troubleshooting.html
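
Setting the root logger to DEBUG also turns on debug output from every other library. A possible variation, sketched here on the assumption that only the cache messages are of interest, is to raise the level for the `requests_cache` logger alone. That name is inferred from the `requests_cache.session` prefix in the log line above.

```python
import logging

# Attach a handler to the root logger but keep other libraries at WARNING.
logging.basicConfig(level=logging.WARNING)

# Loggers are hierarchical, so 'requests_cache.session' inherits this level.
logging.getLogger('requests_cache').setLevel(logging.DEBUG)
```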

Answer 2:

I tried this as well, but couldn't get it to work:

# Added by Eurico Covas
# see https://requests-cache.readthedocs.io/en/stable/user_guide/filtering.html
@staticmethod
def filter_by_error(response: requests.models.Response) -> bool:
    """Don't cache responses with ErrMsg"""
    if response is None:
        return True
    if not response.ok:
        return True
    if len(response.json()['GDSSDKResponse']) == 1:
        if len(response.json()['GDSSDKResponse'][0]) >= 1:
            if "ErrMsg" in response.json()['GDSSDKResponse'][0].keys():
                if response.json()['GDSSDKResponse'][0]['ErrMsg'] is not None and response.json()['GDSSDKResponse'][0]['ErrMsg'] != '':
                    return True
    return False

def __init__(self, username, password, verify=True, debug=False, request_caching_enabled=False):
    assert username is not None
    assert password is not None
    assert verify is not None
    assert debug is not None
    assert request_caching_enabled is not None
    self._username = username
    self._password = password
    self._verify = verify
    self._debug = debug
    self._request_caching_enabled=request_caching_enabled
    if self._request_caching_enabled:
        self.request_count = self.get_cached_request_count()
    if not self._verify:
        requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
    if self._debug:
        self.enable_request_debugging()
    else:
        self.enable_error_logging()
    # cache requests for 30*24 hours = 1 month!
    if self._request_caching_enabled:
        requests_cache.install_cache('capiq_cache', backend='sqlite', expire_after=30*86400, allowable_methods=('POST',), filter_fn=self.filter_by_error)

However, this call:

    response = requests.post(self._endpoint, headers=self._headers, data=json.dumps(req),
                             auth=HTTPBasicAuth(self._username, self._password), verify=self._verify)

never invokes filter_by_error()...
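
Separately from the question of why the filter is not invoked, the return convention may be worth double-checking. In the question and the first answer, the filter returns False for responses that should not be cached, whereas filter_by_error returns True when ErrMsg is present. The following is a minimal sketch of the "don't cache responses with ErrMsg" intent under that convention. The names `should_cache` and `fake_response`, the "Rows" key, and the error text are hypothetical. Unlike the original, it inspects every entry rather than only single-entry lists, and it skips caching for failed responses; both are assumptions about the desired behavior.

```python
from types import SimpleNamespace


def should_cache(response) -> bool:
    """Return True to cache the response, False to skip caching."""
    if response is None or not response.ok:
        return False  # assumption: failed requests should not be cached
    try:
        payload = response.json()  # parse the body once
    except ValueError:
        return False
    if not isinstance(payload, dict):
        return False
    entries = payload.get("GDSSDKResponse") or []
    # Skip caching if any entry carries a non-empty ErrMsg.
    return not any(isinstance(e, dict) and e.get("ErrMsg") for e in entries)


def fake_response(payload, ok=True):
    """Stand-in for a real response, exposing only `.ok` and `.json()`."""
    return SimpleNamespace(ok=ok, json=lambda: payload)


good = fake_response({"GDSSDKResponse": [{"Rows": []}]})
bad = fake_response({"GDSSDKResponse": [{"ErrMsg": "Invalid identifier"}]})

print(should_cache(good))
print(should_cache(bad))
```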

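Comments:

This does not really answer the question. If you have a different question, you can ask it by clicking Ask Question. To get notified when this question gets new answers, you can follow this question. Once you have enough reputation, you can also add a bounty to draw more attention to this question. - From Review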
