Memory leak while parsing HTML page source with BeautifulSoup and Requests

Posted 2023-02-23

【Title】: Memory Leak while parsing html page source with BeautifulSoup & Requests
【Published】: 2019-01-24 11:08:58
【Question】:

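The basic idea is to make GET requests to a list of URLs and extract the text from each page source, using BeautifulSoup to strip out the HTML tags and scripts. This is on Python 2.7.

The problem is that the memory used by the parser function keeps growing with every request. The size increases gradually.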

import requests
from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    soup = BeautifulSoup(open(page_source),'html.parser')
#     soup = BeautifulSoup(page_source,"lxml")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()    # rip it out
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    # print text
    return text

The memory leak occurs even when parsing a local text file. For example:

#request 1
response = requests.get(url,timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content) #100 MB

#request 2
response = requests.get(url,timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content) #150 MB
#request 3
response = requests.get(url,timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content) #300 MB
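
To narrow down where the growth comes from, it can help to measure it per call instead of watching the size of the whole process. The following is a minimal sketch, assuming Python 3 (the standard-library `tracemalloc` module does not exist in the Python 2.7 used in the question). `measure_growth` and its parameters are hypothetical names, not part of Requests or BeautifulSoup.

```python
import tracemalloc


def measure_growth(func, arg, repeats=3):
    """Call func(arg) repeatedly and return traced memory (bytes) after each call.

    Hypothetical helper: only allocations made through Python's allocator
    while tracing is active are counted, not the whole process size.
    """
    tracemalloc.start()
    try:
        readings = []
        for _ in range(repeats):
            func(arg)  # the result is discarded, so it should be freed
            current, _peak = tracemalloc.get_traced_memory()
            readings.append(current)
        return readings
    finally:
        tracemalloc.stop()
```

A call such as `measure_growth(get_text_from_page_source, response.content)` would then return one reading per call. Readings that keep climbing suggest that something is still holding on to objects from earlier calls.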

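【Comments】:

You could store the response in a temporary file, then read that file line by line and process it.

I'm curious how you are running this code. Is it through an IDE? If so, which one?

@serbia99 Yes, I tried it both ways. First, parsing directly in memory. Second, saving the page source to a text file and then parsing that file. The same problem occurs.

@roganjosh Not in an IDE, just in the terminal.

【Answer 1】: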

You can try calling soup.decompose() before returning from the get_text_from_page_source function, to destroy the tree.

If you are opening a text file rather than passing the request content directly, as can be seen here:

soup = BeautifulSoup(open(page_source),'html.parser')

remember to close it when you are done. To keep it short, you can change that line to:

with open(page_source, 'r') as html_file:
    soup = BeautifulSoup(html_file.read(),'html.parser')
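
Putting the two suggestions together, a minimal sketch might look like the following. It assumes the input is a path to a saved HTML file, as in the `open(page_source)` line quoted above. The name `get_text_from_html_file` is hypothetical, and the thread does not confirm that this removes the growth reported in the question.

```python
from bs4 import BeautifulSoup


def get_text_from_html_file(path):
    """Extract visible text from a saved HTML file, then destroy the tree."""
    # The with-block closes the file even if reading fails.
    with open(path, "r") as html_file:
        soup = BeautifulSoup(html_file.read(), "html.parser")
    try:
        # Remove script and style elements before extracting text.
        for tag in soup(["script", "style"]):
            tag.decompose()
        lines = (line.strip() for line in soup.get_text().splitlines())
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        return "\n".join(chunk for chunk in chunks if chunk)
    finally:
        # Break the links inside the tree once the text has been built.
        soup.decompose()
```

The `try`/`finally` is a design choice rather than a requirement: it makes sure the tree is torn down even if text extraction raises.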

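【Comments】:

Tried it and added '.close()', but still nothing changed.

Did you try calling soup.decompose() once parsing is finished?

Yes, I did add soup.decompose(), but nothing changed.

Can you run the 3 requests in reverse order (3, 2, 1) and share the memory chart you get?

That doesn't make any sense, Pablo.

【Answer 2】: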

You can try calling the garbage collector explicitly:

import gc
response.close()
response = None
gc.collect()

This may also help you: Python high memory usage with BeautifulSoup
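
As background on why an explicit `gc.collect()` can matter: objects that reference each other in a cycle, such as a tree whose nodes point at both their parent and their children, are not freed by reference counting alone and have to wait for the cyclic collector. The sketch below illustrates this on CPython with a hypothetical `Node` class standing in for a parse tree. It is not BeautifulSoup's actual implementation.

```python
import gc
import weakref


class Node(object):
    """Hypothetical tree node with links in both directions (a cycle)."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def cycle_survives_until_collect():
    """Return (alive_before, alive_after) around an explicit gc.collect()."""
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic collector from running mid-demo
    try:
        root = Node()
        Node(parent=root)          # child <-> root form a reference cycle
        probe = weakref.ref(root)  # lets us observe whether root was freed
        del root                   # drop our only direct reference
        alive_before = probe() is not None
        gc.collect()               # the cyclic collector reclaims the pair
        alive_after = probe() is not None
        return alive_before, alive_after
    finally:
        if was_enabled:
            gc.enable()
```

Normally the collector runs automatically from time to time, so an explicit call mainly changes when the memory is reclaimed, not whether it is.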
