Scrapy spider does not return all elements


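I wrote a spider in Scrapy, and my JSON file contains only some of the values. I believe the XPath syntax is correct, but I cannot find the error.

This question was no help at all, because the problem it describes is not the one in my code snippet.

Code and output: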

# -*- coding: utf-8 -*-
import scrapy


class OfriSpider(scrapy.Spider):
    name = "ofri"
    allowed_domains = ["www.ofri.ch/firmen"]
    start_urls = ['http://www.ofri.ch/firmen/Abbruchunternehmen/']

    def parse(self, response):
        entries_description = response.xpath('//div[@class="directory-entry"]//div[@class="directory-entry-description"]')
        for entry_description in entries_description:
            company_name = entry_description.xpath('.//h2/a/text()').extract()
            address_street = entry_description.xpath('.//p[@itemprop="address"]/span[@itemprop="streetAddress"]/text()').extract()
            zip_locality = entry_description.xpath('.//p[@itemprop="address"]/span[@itemprop="addressLocality"]/text()').extract()
            contact_data = entry_description.xpath('.//*[@id="business_directory_contact_data"]/div/ul').extract_first()
            tel = entry_description.xpath('.//span[@itemprop="telephone"]//text()').extract()
            company_url = entry_description.xpath('.//a[@itemprop="url"]/@href').extract()
            yield{'name':company_name,
                  'street':address_street,
                  'zip_locality':zip_locality,
                  'tel':tel,
                  'url':company_url}

Output:

[
,
{"name": ["Eugen Schnabel Transport & Transfer"], "street": ["Talstrasse 24"], "zip_locality": ["8885 Mols"], "tel": ["0041767198236"], "url": []},
,
,
{"name": ["Hanna-reinigung"], "street": ["Schlierenstrasse 48"], "zip_locality": ["8902 Urdorf"], "tel": ["0799481971"], "url": []},
,
{"name": ["Malergesch\u00e4ft Peic"], "street": ["Affolternstrasse 60"], "zip_locality": ["8105 Regensdorf"], "tel": ["0764824149"], "url": []},
,
,
,
,
{"name": ["Einzelfirma, Michael Heidelberger"], "street": ["Dorfstrasse 151"], "zip_locality": ["8424 Embrach"], "tel": ["0787907234"], "url": []},
{"name": ["ASS Immo + Bau GmbH"], "street": ["Tretteliweg 3b"], "zip_locality": ["8305 Dietlikon"], "tel": ["0764029370"], "url": []},
,
,
{"name": ["Baumontagen R. Schneiter"], "street": ["Jonenbachstrasse 7"], "zip_locality": ["8911 Rifferswil"], "tel": ["0796552188"], "url": []},
{"name": ["Plus Bau & GU GmbH"], "street": ["Wiesentrasse 83"], "zip_locality": ["3014 Bern"], "tel": ["0763462153"], "url": []},
,
,

]

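Does anyone see what is wrong here?

Answer

I took your code and made a standalone version that uses no other files from a project, and it gets all the data without any problem.

If you created a project, the problem may be in the settings or in one of the other project files.

In some places I changed extract() to extract_first(), because there is always a single element.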

#!/usr/bin/env python3

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    allowed_domains = ['www.ofri.ch']  # domain only: Scrapy expects host names here, not URL paths
    start_urls = ['http://www.ofri.ch/firmen/Abbruchunternehmen/']

    def parse(self, response):
        print('url:', response.url)

        entries_description = response.xpath('//div[@class="directory-entry"]//div[@class="directory-entry-description"]')

        for entry_description in entries_description:
            company_name = entry_description.xpath('.//h2/a/text()').extract_first()
            address_street = entry_description.xpath('.//p[@itemprop="address"]/span[@itemprop="streetAddress"]/text()').extract_first()
            zip_locality = entry_description.xpath('.//p[@itemprop="address"]/span[@itemprop="addressLocality"]/text()').extract_first()
            contact_data = entry_description.xpath('.//*[@id="business_directory_contact_data"]/div/ul').extract_first()
            tel = entry_description.xpath('.//span[@itemprop="telephone"]//text()').extract_first()
            company_url = entry_description.xpath('.//a[@itemprop="url"]/@href').extract_first()

            item = {
                'name': company_name,
                'street': address_street,
                'zip_locality': zip_locality,
                'tel': tel,
                'url':company_url
            }    

            print(item)

            yield item


# --- it runs without a project and saves the items in `output.json` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',

    # save in file as CSV, JSON or XML
    # (newer Scrapy versions deprecate these two settings in favor of FEEDS)
    'FEED_FORMAT': 'json',
    'FEED_URI': 'output.json',
})
c.crawl(MySpider)
c.start()

Result:

[
{"name": "SZ.AC Team", "street": "Aadorferstrasse 19", "zip_locality": "8362 Balterswil", "tel": "0719601976", "url": "http://allesuntereinemdach.ch"},
{"name": "Eugen Schnabel Transport & Transfer", "street": "Talstrasse 24", "zip_locality": "8885 Mols", "tel": "0041767198236", "url": null},
{"name": "Glarner Rueckbau", "street": "Reimen 1", "zip_locality": "8775 H\u00e4tzingen", "tel": "0794017852", "url": "http://glarner-rueckbau-klg.ch"},
{"name": "Poldeschtrans", "street": "Talgasse 21", "zip_locality": "5503 Schafisheim", "tel": "0764414732", "url": "http://poldesch.ch"},
{"name": "Hanna-reinigung", "street": "Schlierenstrasse 48", "zip_locality": "8902 Urdorf", "tel": "0799481971", "url": null},
{"name": "FREUDENBERG HANDWERK UND INSTALLATIONSSERVICE", "street": "Bahnhofstrasse 2", "zip_locality": "4543 Deitingen", "tel": "0762872322", "url": "http://freudenberg-handwerk.ch"},
{"name": "Malergesch\u00e4ft Peic", "street": "Affolternstrasse 60", "zip_locality": "8105 Regensdorf", "tel": "0764824149", "url": null},
{"name": "Linsin Bau", "street": "M\u00fcllerweg 1", "zip_locality": "4633 Hauenstein", "tel": "0041793390620", "url": "http://Linsin-Bau.ch"},
{"name": "Sanimpex GmbH", "street": "Badenerstrasse 549", "zip_locality": "8048 Z\u00fcrich", "tel": "0432109670", "url": "http://sanimpex.ch"},
{"name": "M.Spangenberg", "street": "Kronenweg.3", "zip_locality": "8165 Oberweningen", "tel": "0434228168", "url": "http://spangenberg-kundenmaurer.ch"},
{"name": "M. Johann Tiefbau", "street": "Bahnhofstrasse 2", "zip_locality": "6162 Entlebuch", "tel": "0415303406", "url": "http://johann-tiefbau.ch"},
{"name": "Einzelfirma, Michael Heidelberger", "street": "Dorfstrasse 151", "zip_locality": "8424 Embrach", "tel": "0787907234", "url": null},
{"name": "ASS Immo + Bau GmbH", "street": "Tretteliweg 3b", "zip_locality": "8305 Dietlikon", "tel": "0764029370", "url": null},
{"name": "tm rent Gmbh", "street": "Im Ebnet 66", "zip_locality": "8700 K\u00fcsnacht", "tel": "0764148714", "url": "http://www.tm-rent.ch"},
{"name": "HBS Bau GmbH", "street": "Eisenbahnstrasse 18", "zip_locality": "8730 Uznach", "tel": "0797343583", "url": "http://hbs-bau.ch"},
{"name": "Baumontagen R. Schneiter", "street": "Jonenbachstrasse 7", "zip_locality": "8911 Rifferswil", "tel": "0796552188", "url": null},
{"name": "Plus Bau & GU GmbH", "street": "Wiesentrasse 83", "zip_locality": "3014 Bern", "tel": "0763462153", "url": null},
{"name": "Ettlin Mont-& Demontagen", "street": "Untere Gr\u00fcndlistrasse 20", "zip_locality": "6055 Alpnach Dorf", "tel": "0794121162", "url": "http://ettlin-montagen.ch"},
{"name": "ISISERVICE", "street": "Tiefenaustrasse 131", "zip_locality": "3004 Bern", "tel": "0313815488", "url": "http://isiservice.ch"},
{"name": "S & G Services GmbH", "street": "Mutschellenstrasse 85", "zip_locality": "8038 Z\u00fcrich", "tel": "0762502222", "url": "http://sg-reinigung.ch"}
]
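
To check whether a crawl really returned every entry, the feed can be inspected after the run. The sketch below uses only the standard library and two records copied from the result above. The helper name missing_fields and the EXPECTED_FIELDS tuple are illustrative assumptions, not part of Scrapy.

```python
import json

# Two records copied from the result above; in practice the text would
# be read from the feed file written by the crawl.
FEED_TEXT = """[
{"name": "SZ.AC Team", "street": "Aadorferstrasse 19", "zip_locality": "8362 Balterswil", "tel": "0719601976", "url": "http://allesuntereinemdach.ch"},
{"name": "Eugen Schnabel Transport & Transfer", "street": "Talstrasse 24", "zip_locality": "8885 Mols", "tel": "0041767198236", "url": null}
]"""

# Field names yielded by the spider.
EXPECTED_FIELDS = ('name', 'street', 'zip_locality', 'tel', 'url')


def missing_fields(record):
    """Return the expected fields that are absent, None, or empty."""
    return [field for field in EXPECTED_FIELDS if not record.get(field)]


records = json.loads(FEED_TEXT)
report = {record.get('name'): missing_fields(record) for record in records}

print(len(records), 'records')
for name, missing in report.items():
    print(name, '-> missing:', missing or 'nothing')
```

A record count lower than the number of entries on the page points to dropped items, while a non-empty list of missing fields points to an XPath that did not match for that entry.
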
Another answer

As far as I can tell, your XPath is correct. You can verify the main XPath in the Chrome DevTools console:

$x('//div[@class="directory-entry"]//div[@class="directory-entry-description"]');

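So the problem must be somewhere else, possibly in divs whose content differs.

The one exception is the URL: in Chrome this query returns attribute objects rather than the URL values. For example, check the result of:

$x('//div[@class="directory-entry"]//div[@class="directory-entry-description"]//a[@itemprop="url"]/@href');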
