Is it possible for Scrapy to get plain text from raw HTML data?

Posted: 2013-07-17 07:18:16

【Problem description】:

For example:

scrapy shell http://scrapy.org/
content = hxs.select('//*[@id="content"]').extract()[0]
print content

Then I get the following raw HTML code:

<div id="content">


  <h2>Welcome to Scrapy</h2>

  <h3>What is Scrapy?</h3>

  <p>Scrapy is a fast high-level screen scraping and web crawling
    framework, used to crawl websites and extract structured data from their
    pages. It can be used for a wide range of purposes, from data mining to
    monitoring and automated testing.</p>

  <h3>Features</h3>

  <dl>

    <dt>Simple</dt>
    <dt>
    </dt>
    <dd>Scrapy was designed with simplicity in mind, by providing the features
      you need without getting in your way
    </dd>

    <dt>Productive</dt>
    <dd>Just write the rules to extract the data from web pages and let Scrapy
      crawl the entire web site for you
    </dd>

    <dt>Fast</dt>
    <dd>Scrapy is used in production crawlers to completely scrape more than
      500 retailer sites daily, all in one server
    </dd>

    <dt>Extensible</dt>
    <dd>Scrapy was designed with extensibility in mind and so it provides
      several mechanisms to plug new code without having to touch the framework
      core

    </dd>
    <dt>Portable, open-source, 100% Python</dt>
    <dd>Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD</dd>

    <dt>Batteries included</dt>
    <dd>Scrapy comes with lots of functionality built in. Check <a
        href="http://doc.scrapy.org/en/latest/intro/overview.html#what-else">this
      section</a> of the documentation for a list of them.
    </dd>

    <dt>Well-documented &amp; well-tested</dt>
    <dd>Scrapy is <a href="/doc/">extensively documented</a> and has an comprehensive test suite
      with <a href="http://static.scrapy.org/coverage-report/">very good code
        coverage</a></dd>

    <dt><a href="/community">Healthy community</a></dt>
    <dd>
      1,500 watchers, 350 forks on Github (<a href="https://github.com/scrapy/scrapy">link</a>)<br>
      700 followers on Twitter (<a href="http://twitter.com/ScrapyProject">link</a>)<br>
      850 questions on *** (<a href="http://***.com/tags/scrapy/info">link</a>)<br>
      200 messages per month on mailing list (<a
        href="https://groups.google.com/forum/?fromgroups#!aboutgroup/scrapy-users">link</a>)<br>
      40-50 users always connected to IRC channel (<a href="http://webchat.freenode.net/?channels=scrapy">link</a>)
    </dd>

    <dt><a href="/support">Commercial support</a></dt>
    <dd>A few companies provide Scrapy consulting and support</dd>

    <p>Still not sure if Scrapy is what you're looking for?. Check out <a
        href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a
      glance</a>.

    </p>
    <h3>Companies using Scrapy</h3>

    <p>Scrapy is being used in large production environments, to crawl
      thousands of sites daily. Here is a list of <a href="/companies/">Companies
        using Scrapy</a>.</p>

    <h3>Where to start?</h3>

    <p>Start by reading <a href="http://doc.scrapy.org/en/latest/intro/overview.html">Scrapy at a glance</a>,
      then <a href="/download/">download Scrapy</a> and follow the <a
          href="http://doc.scrapy.org/en/latest/intro/tutorial.html">Tutorial</a>.


    </p></dl>
</div>

But I want the plain text directly from Scrapy.

I don't want to use any XPath selector to extract the p, h2, h3, ... tags, because I am crawling a site whose main content is embedded in table and similar tags, recursively; finding the right XPath can be a tedious task.

Can this be done with a built-in function in Scrapy, or do I need an external tool to do the conversion? I have read all of Scrapy's documentation and found nothing.

Here is an example site that can convert raw HTML to plain text: http://beaker.mailchimp.com/html-to-text

【Question discussion】:

Using a built-in class or function?

【Solution 1】:

Scrapy does not have such functionality built in. html2text is what you are looking for.

Here is a sample spider that scrapes Wikipedia's Python page, gets the first paragraph using XPath, and converts the HTML into plain text using html2text:

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
import html2text


class WikiSpider(BaseSpider):
    name = "wiki_spider"
    allowed_domains = ["www.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sample = hxs.select("//div[@id='mw-content-text']/p[1]").extract()[0]

        converter = html2text.HTML2Text()
        converter.ignore_links = True
        print(converter.handle(sample)) #Python 3 print syntax

Prints:

**Python** is a widely used general-purpose, high-level programming language.[11][12][13] Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C.[14][15] The language provides constructs intended to enable clear programs on both a small and large scale.[16]
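(For reference, html2text is a third-party package, usually installed with pip install html2text. Independent of Scrapy, a minimal standalone sketch of the converter could look like the following; the sample HTML string is made up for illustration:)

import html2text

# Made-up snippet of raw HTML, standing in for whatever a spider extracted.
raw_html = '<div><h2>Title</h2><p>Some <a href="/x">linked</a> text.</p></div>'

converter = html2text.HTML2Text()
converter.ignore_links = True   # drop link targets, keep the anchor text
converter.ignore_images = True  # skip image references entirely
converter.body_width = 0        # do not hard-wrap the output lines

print(converter.handle(raw_html))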

【Discussion】:

【Solution 2】:

Another solution is to use lxml.html's tostring() with the parameter method="text". lxml is used internally by Scrapy. (The parameter encoding=unicode is usually what you want.)

See http://lxml.de/api/lxml.html-module.html for details.

from scrapy.spider import BaseSpider
import lxml.etree
import lxml.html

class WikiSpider(BaseSpider):
    name = "wiki_spider"
    allowed_domains = ["www.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        root = lxml.html.fromstring(response.body)

        # optionally remove tags that are not usually rendered in browsers
        # javascript, HTML/HEAD, comments; add the tag names you don't want at the end
        lxml.etree.strip_elements(root, lxml.etree.Comment, "script", "head")

        # complete text
        print lxml.html.tostring(root, method="text", encoding=unicode)

        # or same as in alecxe's example spider,
        # pinpoint a part of the document using XPath
        #for p in root.xpath("//div[@id='mw-content-text']/p[1]"):
        #   print lxml.html.tostring(p, method="text")
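(As a standalone illustration without Scrapy: lxml.html can parse an HTML string and give you its text either via tostring(..., method="text") or via the element's text_content() method. This is just a sketch; the sample string is made up:)

import lxml.html

# Made-up HTML string for illustration only.
raw_html = '<div id="content"><h2>Title</h2><p>Some <b>bold</b> text.</p></div>'

root = lxml.html.fromstring(raw_html)

# Whole tree as plain text: tags dropped, text nodes concatenated.
print(lxml.html.tostring(root, method="text", encoding="unicode"))

# Equivalent convenience method available on any element.
print(root.text_content())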

【Discussion】:

I have read the lxml documentation; it really is a powerful tool, thank you very much.
You're welcome. Indeed, lxml can do all of this, and it is not hard to learn.
Thank you sir, it helped me a lot.

【Solution 3】:

At this point, I don't think you need to install any third-party library. Scrapy provides this functionality using selectors. Assume this complex selector:

from scrapy.selector import Selector

sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')

We can extract the entire text using:

text_content = sel.xpath("//a[1]//text()").extract()
# results in: [u'Click here to go to the ', u'Next Page']

Then you can easily join them together:

' '.join(text_content)
# Click here to go to the Next Page
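(Applied to the page from the question, the same idea would look roughly like the sketch below. It assumes a Scrapy version where response.xpath() is available in the shell; with the older API, hxs.select(...) behaves the same way.)

# In the scrapy shell, after: scrapy shell http://scrapy.org/
# Grab every descendant text node of div#content and collapse the whitespace.
text_nodes = response.xpath('//*[@id="content"]//text()').extract()
plain_text = ' '.join(t.strip() for t in text_nodes if t.strip())
print(plain_text)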

【Discussion】:
