Crawling JavaScript-rendered pages with Scrapy and Selenium RC

Preface: This post, put together by the editor at cha138.com, introduces how to crawl JavaScript-rendered pages with Scrapy and Selenium RC; hopefully it is of some reference value to you.

Many times when crawling we run into pages whose content is generated with JavaScript, so Scrapy cannot see it (e.g. AJAX requests, jQuery craziness). However, if you use Scrapy together with the web testing framework Selenium, you can crawl anything that a normal web browser can display.

Some things to note: you must have the Python version of Selenium RC installed for this to work, and Selenium itself must be set up properly. Also, this is just a template crawler; you could get much crazier and more advanced, but I just wanted to show the basic idea. As the code stands, every URL is requested twice: once by Scrapy and once by Selenium. There are ways around this so that Selenium makes the one and only request (a middleware sketch for that appears after the snippet), but I did not bother to implement it, and by doing two requests you also get to crawl the page with Scrapy.

This is quite powerful, because you now have the entire rendered DOM available to crawl while still using all of Scrapy's nice crawling features. Crawling will of course be slower, but depending on how much you need the rendered DOM it may be worth the wait.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.http import Request

from selenium import selenium
import time


class SeleniumSpider(CrawlSpider):
    name = "SeleniumSpider"
    start_urls = ["http://www.domain.com"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('.html', )), callback='parse_page', follow=True),
    )

    def __init__(self):
        CrawlSpider.__init__(self)
        self.verificationErrors = []
        # Connect to a Selenium RC server on localhost:4444, driving a browser
        # session rooted at the site being crawled.
        self.selenium = selenium("localhost", 4444, "*chrome", "http://www.domain.com")
        self.selenium.start()

    def __del__(self):
        # Shut the browser session down when the spider goes away.
        self.selenium.stop()
        print self.verificationErrors

    def parse_page(self, response):
        item = Item()

        # Do some XPath selection with Scrapy on the raw (un-rendered) response
        hxs = HtmlXPathSelector(response)
        hxs.select('//div').extract()

        # Have Selenium open the same url in a real browser
        sel = self.selenium
        sel.open(response.url)

        # Wait for the javascript to load in Selenium
        # (a fixed sleep; a wait_for_condition sketch follows the snippet)
        time.sleep(2.5)

        # Do some crawling of javascript-created content with Selenium
        sel.get_text("//div")
        yield item

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: wynbennett
# date  : Jun 21, 2011
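One easy improvement on the snippet: the fixed time.sleep(2.5) wastes time on fast pages and can be too short on slow ones. The Selenium RC client also has wait_for_condition, which polls a JavaScript expression in the browser until it is true or a timeout expires. A minimal sketch, assuming the JavaScript-generated content ends up in div elements (the condition, URL and timeout are placeholders, not from the original snippet):

from selenium import selenium

sel = selenium("localhost", 4444, "*chrome", "http://www.domain.com")
sel.start()
sel.open("http://www.domain.com/some-page.html")
# Poll a javascript expression instead of sleeping a fixed 2.5 seconds.
# selenium.browserbot.getCurrentWindow() is how an RC condition script
# reaches the page under test.
sel.wait_for_condition(
    "selenium.browserbot.getCurrentWindow().document"
    ".getElementsByTagName('div').length > 0",
    "10000")  # timeout in milliseconds
print sel.get_text("//div")
sel.stop()

Inside parse_page the same wait_for_condition call would simply replace the time.sleep(2.5) line.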

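As for making Selenium do the one and only request: one way is a Scrapy downloader middleware that opens every URL in the Selenium RC browser and hands the rendered HTML back to Scrapy, so Scrapy's own downloader never fetches the page. The sketch below is not from the original post; the class name is made up and it keeps the same crude fixed sleep as the spider above:

from scrapy.http import HtmlResponse
from selenium import selenium
import time


class SeleniumRCMiddleware(object):
    """Downloader middleware sketch: render every request with Selenium RC."""

    def __init__(self):
        self.selenium = selenium("localhost", 4444, "*chrome", "http://www.domain.com")
        self.selenium.start()

    def process_request(self, request, spider):
        # Let the browser fetch and render the page.
        self.selenium.open(request.url)
        time.sleep(2.5)  # crude wait for javascript, as in the spider above
        # Returning a Response from process_request short-circuits Scrapy's
        # downloader, so this is the only request made for the url.
        return HtmlResponse(url=request.url,
                            body=self.selenium.get_html_source(),
                            encoding='utf-8')

It would be enabled in settings.py with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SeleniumRCMiddleware': 543} (the module path and priority are placeholders); the spider's parse_page then runs its XPath against the already-rendered DOM.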