How can I get all the plain text from a website with Scrapy?

Posted 2023-02-23

【Title】: How can I get all the plain text from a website with Scrapy? 【Posted】: 2014-06-03 02:53:06 【Question】:

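I would like to get all the text that is visible on a website after the HTML is rendered. I am working in Python with the Scrapy framework. With xpath('//body//text()') I can get it, but it comes with the HTML tags, and I only want the text. Is there a solution for this?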

【Answer 1】:

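xpath('//body//text()') may not always go deeper into the nodes nested under the tag you named last (body, in your case). If you run xpath('//body/node()/text()').extract(), you will see the text that sits directly inside the child nodes of your HTML body. You can also try xpath('//body/descendant::text()').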

【Answer 2】:

The easiest option is to extract //body//text() and join everything that is found:

''.join(sel.xpath("//body//text()").extract()).strip()

where sel is a Selector instance.

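Another option is to use nltk's clean_html() (only in older nltk releases; later versions raise NotImplementedError and point to BeautifulSoup's get_text() instead):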

>>> import nltk
>>> html = """
... <div class="post-text" itemprop="description">
... 
...         <p>I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
... With <code>xpath('//body//text()')</code> I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !</p>
... 
...     </div>"""
>>> nltk.clean_html(html)
"I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.\nWith xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !"

Another option is to use BeautifulSoup's get_text():

get_text()

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.get_text().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
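As a variation on the session above, get_text() also accepts a separator and a strip flag, and elements such as script and style can be removed before the text is collected. The following is a minimal sketch under those assumptions; SAMPLE_HTML and soup_text are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment containing a script that should not appear in the text.
SAMPLE_HTML = (
    "<div><p>Hello <b>world</b></p>"
    "<script>var hidden = 1;</script>"
    "<p>Bye</p></div>"
)


def soup_text(html):
    """Return the text of an HTML string without script/style contents."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are never displayed as page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # separator=" " joins the text fragments with a space;
    # strip=True trims each fragment and skips empty ones.
    return soup.get_text(separator=" ", strip=True)


if __name__ == "__main__":
    print(soup_text(SAMPLE_HTML))
```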

Another option is to use lxml.html's text_content():

.text_content()

Returns the text content of the element, including the text content of its children, with no markup.

>>> import lxml.html
>>> tree = lxml.html.fromstring(html)
>>> print(tree.text_content().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
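Note that lxml.html.fromstring() expects a single HTML string (or bytes), so a list of selector results would have to be turned into a string before being passed in. A minimal sketch of the string-based usage, which also drops script and style elements, could look like this; SAMPLE_HTML and lxml_text are hypothetical names introduced for the example.

```python
import lxml.etree
import lxml.html

# Hypothetical fragment; in a spider the string could come from response.text.
SAMPLE_HTML = (
    "<div>\n"
    "  <p>Hello <b>world</b></p>\n"
    "  <script>var hidden = 1;</script>\n"
    "  <p>Bye</p>\n"
    "</div>"
)


def lxml_text(html):
    """Return whitespace-normalized text of an HTML string using lxml."""
    # fromstring() needs a str or bytes document, not a list of results.
    tree = lxml.html.fromstring(html)
    # Drop <script>/<style> elements; with_tail=False keeps the text after them.
    lxml.etree.strip_elements(tree, "script", "style", with_tail=False)
    # text_content() concatenates all text; split/join collapses whitespace runs.
    return " ".join(tree.text_content().split())


if __name__ == "__main__":
    print(lxml_text(SAMPLE_HTML))
```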

【Comments】:

I have deleted my question. I used the code below:

html = sel.select("//body//text()")
tree = lxml.html.fromstring(html)
item['description'] = tree.text_content().strip()

but I get an exception at is_full_html = _looks_like_full_html_unicode(html): TypeError: expected string or buffer. What went wrong?

As an update, nltk has deprecated its clean_html method and now recommends: NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function

【Answer 3】:

Have you tried this?

xpath('//body//text()').re('(\w+)')

or

 xpath('//body//text()').extract()

【Comments】:

This actually works quite well, but it still returns some HTML tags and other markup.