Reading start_urls from a csv file

Posted: 2015-05-23 23:44:31

So I'm working on a scraper using the Scrapy library, and as an ease-of-use feature I want it to pull its start URLs from a .csv file. I've done some research on the topic and I believe it's grabbing the URLs from the .csv correctly, but I'm getting some strange errors. If anyone could take a look and let me know what I'm doing wrong, that would be great. My spider looks like the code below, and my Item is very basic since I'm not really using it for anything yet. Eventually I'll store information back into the Item so it can be written back out to a .csv, but for now I just want the crawl to work.

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.markup import remove_tags
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from tutorial.items import BSiteItem
import csv
import sys

class BsiteSpider(CrawlSpider):
    name = "Bsite"
    # Read the CSV at class-definition time; the first row holds the URLs.
    l = []
    my_file = open("aerospace.csv", "rb")
    reader = csv.reader(my_file)
    for row in reader:
        l.append(row)
    print l[0]
    start_urls = l[0]
    download_delay = 1
    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]

    def parse_item(self, response):
        # Grab every word of visible body text from the page.
        text = Selector(response).xpath("//body//text()").re('(\w+)')

        for word in text:
            newtext = word.encode('utf8')
            # hxs and item aren't used yet; the Item will be filled in later.
            hxs = HtmlXPathSelector(response)
            item = BSiteItem()
            if newtext == 'aerospace' or newtext == 'Aerospace' or newtext == 'AEROSPACE':
                print 'True'
                test = response.url
                print test

The csv is a test file containing a single URL, http://www.ballaerospace.com.
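In other words, aerospace.csv is assumed to be nothing but the single line below, so csv.reader yields exactly one row, ['http://www.ballaerospace.com'], and start_urls = l[0] picks that row up:

http://www.ballaerospace.com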

My output looks like this:

scrapy crawl Bsite
['http://www.ballaerospace.com']
2015-03-20 10:03:15-0400 [scrapy] INFO: Scrapy 0.24.5 started (bot: tutorial)
2015-03-20 10:03:15-0400 [scrapy] INFO: Optional features available: ssl, http11
2015-03-20 10:03:15-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2015-03-20 10:03:15-0400 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-20 10:03:15-0400 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-20 10:03:15-0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-20 10:03:15-0400 [scrapy] INFO: Enabled item pipelines: 
2015-03-20 10:03:15-0400 [Bsite] INFO: Spider opened
2015-03-20 10:03:15-0400 [Bsite] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-03-20 10:03:15-0400 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-03-20 10:03:15-0400 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-03-20 10:03:15-0400 [Bsite] DEBUG: Crawled (200) <GET http://www.ballaerospace.com> (referer: None)
2015-03-20 10:03:15-0400 [Bsite] ERROR: Spider error processing <GET http://www.ballaerospace.com>
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/task.py", line 602, in _tick
        taskObj._oneWorkUnit()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/task.py", line 479, in _oneWorkUnit
        result = self._iterator.next()
      File "/Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg/scrapy/utils/defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "/Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg/scrapy/utils/defer.py", line 96, in iter_errback
        yield next(it)
      File "/Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
        for x in result:
      File "/Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg/scrapy/contrib/spiders/crawl.py", line 73, in _parse_response
        for request_or_item in self._requests_to_follow(response):
      File "/Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg/scrapy/contrib/spiders/crawl.py", line 52, in _requests_to_follow
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
      File "/Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg/scrapy/contrib/linkextractors/sgml.py", line 119, in extract_links
        links = self._extract_links(body, response.url, response.encoding, base_url)
      File "/Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg/scrapy/linkextractor.py", line 94, in _extract_links
        return self.link_extractor._extract_links(*args, **kwargs)
      File "/Library/Python/2.7/site-packages/Scrapy-0.24.5-py2.7.egg/scrapy/contrib/linkextractors/sgml.py", line 28, in _extract_links
        self.feed(response_text)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 104, in feed
        self.goahead(0)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 174, in goahead
        k = self.parse_declaration(i)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/markupbase.py", line 96, in parse_declaration
        return self.parse_marked_section(i)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/markupbase.py", line 160, in parse_marked_section
        self.error('unknown status keyword %r in marked section' % rawdata[i+3:j])
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sgmllib.py", line 111, in error
        raise SGMLParseError(message)
    sgmllib.SGMLParseError: unknown status keyword 'Flash ' in marked section

2015-03-20 10:03:15-0400 [Bsite] INFO: Closing spider (finished)
2015-03-20 10:03:15-0400 [Bsite] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 220,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 13700,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 3, 20, 14, 3, 15, 791776),
     'log_count/DEBUG': 3,
     'log_count/ERROR': 1,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'spider_exceptions/SGMLParseError': 1,
     'start_time': datetime.datetime(2015, 3, 20, 14, 3, 15, 691908)}
2015-03-20 10:03:15-0400 [Bsite] INFO: Spider closed (finished)

Any ideas on what might be going wrong?

Answer 1:

The problem happens at the "extract links" step: SgmlLinkExtractor is built on Python's legacy sgmllib parser, and as the traceback shows (sgmllib.SGMLParseError: unknown status keyword 'Flash ' in marked section), it trips over a non-standard <![ ... ]> marked section in the page's HTML instead of skipping it.
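You can reproduce the failure outside Scrapy with a couple of lines of Python 2. The <![Flash detection]> input here is a hypothetical stand-in for whatever marked section the real page contains; only the 'Flash' keyword is taken from the traceback:

import sgmllib  # Python 2 only; this module was removed in Python 3

parser = sgmllib.SGMLParser()
# sgmllib only accepts a handful of marked-section keywords (CDATA, IGNORE,
# INCLUDE, ...); anything else raises SGMLParseError, as in the crawl log.
parser.feed('<![Flash detection]>')  # sgmllib.SGMLParseError: unknown status keyword 'Flash '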

Replace:

rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]

with:

rules = [Rule(LinkExtractor(), follow=True, callback='parse_item')]

And don't forget to import LinkExtractor:

from scrapy.contrib.linkextractors import LinkExtractor
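For completeness, here is a minimal sketch of the whole spider with that fix applied (untested; Scrapy 0.24 / Python 2, and it assumes aerospace.csv holds one URL per row in the first column):

import csv

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import Selector

class BsiteSpider(CrawlSpider):
    name = "Bsite"
    download_delay = 1

    # Read one URL per row from the first column of the CSV.
    with open("aerospace.csv", "rb") as f:
        start_urls = [row[0] for row in csv.reader(f) if row]

    # The lxml-based LinkExtractor replaces the fragile SgmlLinkExtractor.
    rules = [Rule(LinkExtractor(), follow=True, callback='parse_item')]

    def parse_item(self, response):
        # Flag pages whose body text contains "aerospace" in any casing.
        words = Selector(response).xpath("//body//text()").re(r'(\w+)')
        if any(w.lower() == 'aerospace' for w in words):
            print 'True'
            print response.url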

Comments:

This seems to work like a charm. Any chance you could point me in the right direction for moving on to the next start_url once the aerospace keyword is found?

@sylverfyst Could you write that up as a separate question and give me a link to it here? Thanks!
