Ruby: a web crawler helper class based on Poltergeist (PhantomJS). Capybara is a convenient framework for building web crawlers.


require 'capybara/poltergeist'
require 'capybara/dsl'
require 'nokogiri' # needed by PoltergeistCrawler#doc

class PoltergeistCrawler
  include Capybara::DSL 

  def initialize
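    Capybara.register_driver :poltergeist_crawler do |app|
      Capybara::Poltergeist::Driver.new(app, {
        js_errors: false, # do not raise Ruby exceptions on JavaScript errors
        inspector: false, # keep the remote debugger off
        # discard PhantomJS output if you don't care about JS errors/console.logs;
        # the logger is written to, so the file must be opened for writing
        phantomjs_logger: File.open(File::NULL, 'w')
      })
    end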
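    # seconds to wait for elements; this setting was called default_wait_time before Capybara 2.5
    Capybara.default_max_wait_time = 3
    Capybara.run_server = false # crawling remote sites, so no local Rack app is needed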
    Capybara.default_driver = :poltergeist_crawler
    page.driver.headers = {
      "DNT" => "1", # header values are strings
      "User-Agent" => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0"
    }
  end

  # Handy for peeking at what the browser is doing right now.
  def screenshot(name = "screenshot")
    page.driver.render("public/#{name}.jpg", full: true)
  end
  
  # find("path") and all("path") work fine in most cases. Sometimes more control
  # is needed, such as finding hidden fields, so expose the page as a Nokogiri document.
  def doc
    Nokogiri::HTML(page.body)
  end
end

# Example usage: subclass the helper and drive the browser with the Capybara DSL.
class ExampleCrawler < PoltergeistCrawler
  def crawl
    visit "https://news.ycombinator.com/"
    click_on "More"
    # execute_script runs JavaScript without returning a value, which is all a redirect needs
    page.execute_script("window.location = '/'")
  end
end
ExampleCrawler.new.crawl
