Scrapy with Selenium crawling but not scraping

Posted: 2015-02-19 12:17:35

Question:

I have read all the threads on handling AJAX pages with Scrapy and installed Selenium WebDriver to simplify the task. My spider partially crawls, but it never gets any data into my items.

My goals are:

    Crawl from this page to this page

    Scrape the following from every item (post):

    author_name (xpath:/html/body/div[8]/div/div[1]/div[3]/div[3]/ul/li[2]/div[2]/span[2]/ul/li[3]/a/text())
    author_page_url (xpath:/html/body/div[8]/div/div[1]/div[3]/div[3]/ul/li[2]/div[2]/span[2]/ul/li[3]/a/@href)
    post_title (xpath://a[@class="title_txt"])
    post_page_url (xpath://a[@class="title_txt"]/@href)
    post_text (xpath on a separate post page: //div[@id="a_NMContent"]/text())
    

Here is my monkey code (as an aspiring natural-language-processing student who majored in linguistics, I am only taking my first steps in Python):

import scrapy
import time
from selenium import webdriver
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector

class ItalkiSpider(CrawlSpider):
    name = "italki"
    allowed_domains = ['italki.com']
    start_urls = ['http://www.italki.com/entries/korean']
    # not sure if the rule is set correctly
    rules = (Rule(LxmlLinkExtractor(allow="\entry"), callback = "parse_post", follow = True),)
    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # adding necessary search parameters to the URL
        self.driver.get(response.url+"#language=korean&author-language=russian&marks-min=-5&sort=1&page=1")
        # pressing the "Show More" button at the bottom of the search results page to show the next 15 posts, when all results are loaded to the page, the button disappears
        more_btn = self.driver.find_element_by_xpath('//a[@id="a_show_more"]')

        while more_btn:
            more_btn.click()
            # sometimes waiting for 5 sec made spider close prematurely so keeping it long in case the server is slow
            time.sleep(10)

        # here is where the problem begins: I am making a list of links to all the posts
        # on the big page, but I am afraid links will contain only the first link, because
        # selenium doesn't do the multiple selection one would expect from this xpath...
        # how can I grab all the links and put them in the links list (and should I?)
        links=self.driver.find_elements_by_xpath('/html/body/div[8]/div/div[1]/div[3]/div[3]/ul/li/div[2]/a')
        for link in links:
            link.click()
            time.sleep(3)

    # this is the function for parsing individual posts, called back by the *parse* method
    # as specified in the rule of the spider; if it is correct, it should have saved at
    # least one post into an item... I don't really understand how and where this callback
    # function gets the response from the new page (the page of the post in this case)...
    # is it automatically loaded to the driver and then passed on to the callback function
    # as soon as selenium has clicked on the link (link.click())? or is it all total nonsense...
    def parse_post(self, response):
        hxs = Selector(response)
        item = ItalkiItem()
        item["post_item"] = hxs.xpath('//div [@id="a_NMContent"]/text()').extract()
        return item
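
A note on ItalkiItem: its definition is not shown in the question, but a minimal items.py along the lines of the sketch below would make the spider above importable. Only the post_item field is actually referenced in parse_post; the other field names are assumptions based on the goals listed above.

import scrapy

class ItalkiItem(scrapy.Item):
    # referenced in parse_post above
    post_item = scrapy.Field()
    # assumed fields for the remaining goals (author name/page, post title/url)
    author_name = scrapy.Field()
    author_page_url = scrapy.Field()
    post_title = scrapy.Field()
    post_page_url = scrapy.Field()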

Comments:

Answer 1:

Let's think about it a bit differently:

- open the page in the browser and click "Show More" until you reach the desired page
- initialize a Scrapy TextResponse with the current page source (with all the necessary posts loaded)
- for every post, initialize an Item, yield a Request to the post page, and pass the item instance from the request to the response in the meta dictionary

Notes and changes I am going to introduce:

- use a normal Spider class
- use Selenium Waits to wait for the "Show More" button to become visible
- close the driver instance in the spider_closed signal handler

The code:

import scrapy
from scrapy import signals
from scrapy.http import TextResponse 
from scrapy.xlib.pydispatch import dispatcher

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class ItalkiItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    text = scrapy.Field()


class ItalkiSpider(scrapy.Spider):
    name = "italki"
    allowed_domains = ['italki.com']
    start_urls = ['http://www.italki.com/entries/korean']

    def __init__(self):
        self.driver = webdriver.Firefox()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.close()

    def parse(self, response):
        # selenium part of the job
        self.driver.get('http://www.italki.com/entries/korean')
        while True:
            more_btn = WebDriverWait(self.driver, 10).until(
                EC.visibility_of_element_located((By.ID, "a_show_more"))
            )

            more_btn.click()

            # stop when we reach the desired page
            if self.driver.current_url.endswith('page=52'):
                break

        # now scrapy should do the job
        response = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        for post in response.xpath('//ul[@id="content"]/li'):
            item = ItalkiItem()
            item['title'] = post.xpath('.//a[@class="title_txt"]/text()').extract()[0]
            item['url'] = post.xpath('.//a[@class="title_txt"]/@href').extract()[0]

            yield scrapy.Request(item['url'], meta={'item': item}, callback=self.parse_post)

    def parse_post(self, response):
        item = response.meta['item']
        item["text"] = response.xpath('//div[@id="a_NMContent"]/text()').extract()
        return item

Use this as the base code and improve it to fill in all the other fields, such as author and author_url (see the sketch below). Hope that helps.
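
For example, here is a sketch of one way to extend the loop in parse to also capture the author fields. The relative xpath is adapted from the absolute one given in the question, so the exact div/span/li positions are an assumption and may need adjusting against the live page; it also assumes two extra fields on ItalkiItem (author = scrapy.Field() and author_url = scrapy.Field()).

        for post in response.xpath('//ul[@id="content"]/li'):
            item = ItalkiItem()
            item['title'] = post.xpath('.//a[@class="title_txt"]/text()').extract()[0]
            item['url'] = post.xpath('.//a[@class="title_txt"]/@href').extract()[0]

            # relative version of the absolute author xpath from the question;
            # the positions here are a guess and should be verified in the browser
            author_link = post.xpath('./div[2]/span[2]/ul/li[3]/a')
            item['author'] = author_link.xpath('text()').extract()
            item['author_url'] = author_link.xpath('@href').extract()

            yield scrapy.Request(item['url'], meta={'item': item}, callback=self.parse_post)

With the extra fields defined, scrapy crawl italki -o items.json dumps everything to a JSON file for a quick sanity check.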

Comments:

Thanks, but unfortunately I now get an exception (wondering what I am doing wrong): raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__), i.e. TypeError: Request url must be str or unicode, got list

@YuryKim Sure, fixed.
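
For context, that TypeError comes from .extract() returning a list of strings while scrapy.Request expects a single URL string; the fix already reflected in the answer code above is simply to take the first element, roughly:

# .extract() always returns a list of strings, so passing item['url'] straight to
# scrapy.Request fails if it holds the whole list; index into it first
# (or use .extract_first() on newer Scrapy versions):
item['url'] = post.xpath('.//a[@class="title_txt"]/@href').extract()[0]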
