Making a Scrapy spider follow links from a given start URL

Question

I'm trying to build a simple spider with Scrapy that follows the links found on the page given in start_urls and, on each linked page, scrapes two items.

The goal: this is my starting page. It shows a list of amulets. I want to visit each amulet's page and, on that page, scrape the flavor text and the item name.

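I first built a working prototype that, given a single amulet page, scraped its data. Now I want to extend it so that it does this for all of them in one run, but I'm struggling to figure out how.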

This is the code so far:

import scrapy
from PoExtractor.items import PoextractorItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ArakaaliSpider(scrapy.Spider):
    name = "arakaali"
    allowed_domains = ['pathofexile.gamepedia.com']
    start_urls = ['https://pathofexile.gamepedia.com/List_of_unique_accessories']

    rules = (Rule(LinkExtractor(restrict_xpaths=(unique=True), callback='parse', follow=True))


    def parse(self, response):
        for link in LinkExtractor(allow=(), deny=()).extract_links(response):
          item = PoextractorItem()
          item["item_name"] = response.xpath("//*[@id='mw-content-text']/span/span[1]/span[1]/text()[1]").extract()
          item["flavor_text"] = response.xpath("//*[@id='mw-content-text']/span/span[1]/span[2]/span[3]/text()").extract()
          yield item

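The item_name and flavor_text XPaths work fine; I got them with Chrome's "Inspect Element" feature. But something in the rules or in the loop inside parse isn't working, because this is the first output I get: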

2018-08-30 09:23:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pathofexile.gamepedia.com/List_of_unique_accessories>
{'flavor_text': [], 'item_name': []}
2018-08-30 09:23:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pathofexile.gamepedia.com/List_of_unique_accessories>
{'flavor_text': [], 'item_name': []}
2018-08-30 09:23:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pathofexile.gamepedia.com/List_of_unique_accessories>
{'flavor_text': [], 'item_name': []}
2018-08-30 09:23:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pathofexile.gamepedia.com/List_of_unique_accessories>
{'flavor_text': [], 'item_name': []}
2018-08-30 09:23:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pathofexile.gamepedia.com/List_of_unique_accessories>
{'flavor_text': [], 'item_name': []}
2018-08-30 09:23:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pathofexile.gamepedia.com/List_of_unique_accessories>
{'flavor_text': [], 'item_name': []}
2018-08-30 09:23:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pathofexile.gamepedia.com/List_of_unique_accessories>
{'flavor_text': [], 'item_name': []}
2018-08-30 09:23:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://pathofexile.gamepedia.com/List_of_unique_accessories>
{'flavor_text': [], 'item_name': []}

This goes on for a while, and then the file that should contain the names and flavor text shows:

flavor_text,item_name

,

,

,

,

,

,

It goes on like this for more than 300 lines.

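Other useful information: not every link on the page leads to a page that has an item name and flavor text, so some blank rows are to be expected. My question is: why are they all blank? Isn't the spider following the links to the item pages?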

Thanks in advance for any replies.

Answer 1

Another answer