ruby: A hacky crawler using Mechanize

This post walks through a quick-and-dirty Ruby crawler built on Mechanize; the full script follows, with a few notes afterwards.
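
Given a seed URL, the script fetches pages with Mechanize, follows every link it finds, deduplicates visits by a simplified scheme-host-path address, and tallies successes and failures by exception class. Pressing Ctrl-C at any point prints the summary report and exits.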

#!/usr/bin/env ruby

require 'uri'
require 'fileutils'
require 'nokogiri'
require 'mechanize'
require 'logger'

# On Ctrl-C, print whatever has been collected so far and exit cleanly
trap('INT') { @crawler.report; exit }

class Crawler
  attr_reader :url, :failures, :pages_crawled

  def initialize(url)
    @url      = url
    # The shared [] default is only ever read, never mutated in place
    # (we use += below), so Hash.new([]) is safe here
    @failures = Hash.new([])
  end

  def report
    puts
    puts "Successful hits: #{pages_crawled.length}"
    puts "Errors: #{failures.values.map(&:length).reduce(:+) || 0}"
    puts

    failures.each do |error, urls|
      puts error
      puts urls.map { |url| "  #{url}" }
    end

    puts
  end

  def run
    crawl_page(url)
  end

  def pages_crawled
    @pages_crawled ||= []
  end

  private

  def good(message)
    puts "\e[32m==>\e[0m #{message}"
  end

  def bad(message)
    puts "\e[31m==>\e[0m #{message}"
  end

  def agent
    @agent ||= Mechanize.new do |a|
      FileUtils.mkdir_p('log') # Logger.new raises if the directory is missing
      a.log = Logger.new('log/crawler.log')
      a.user_agent_alias = 'Mac Safari'
    end
  end

  def crawl_page(url)
    address = simple_address_from_url(url)
    return if pages_crawled.include?(address)
    pages_crawled.push(address)

    begin
      page = agent.get(url)
      good "GET #{url}"

      # compact drops links with no href, which would otherwise crash sub
      page.links.map(&:href).compact.each do |href|
        href = href.sub(/#.*\z/, '') # Remove anchors
        href = href.sub(%r{/\z}, '') # Remove trailing slashes
        next if href.empty? || href.start_with?('mailto:', 'javascript:')
        crawl_page(href)
      end
    rescue StandardError => e
      failures[e.class.to_s] += [url]
      bad "GET #{url}"
    end
  end

  # Collapse a URL to "scheme://host...path", dropping the query string
  # and fragment, so the same page is only visited once
  def simple_address_from_url(url)
    uri_parts = URI.split(url)
    "#{uri_parts[0]}://#{uri_parts[1..-4].join}"
  end
end
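
The deduplication key comes from simple_address_from_url, which leans on URI.split. Here is a quick sketch of what that call returns, using a made-up example URL:

require 'uri'

# URI.split breaks a URL into nine parts:
# [scheme, userinfo, host, port, registry, path, opaque, query, fragment]
parts = URI.split('https://example.com/docs/page?x=1#top')
p parts
# => ["https", nil, "example.com", nil, nil, "/docs/page", nil, "x=1", "top"]

# parts[1..-4] keeps userinfo through path; the nils disappear in join,
# so the query string and fragment are stripped from the key
puts "#{parts[0]}://#{parts[1..-4].join}"
# => https://example.com/docs/page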

url = ARGV[0]
abort "Usage: #{$PROGRAM_NAME} URL" if url.nil?

@crawler = Crawler.new(url)

@crawler.run
@crawler.report
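
To try it out, pass a seed URL on the command line, e.g. ruby crawler.rb https://example.com (assuming the script is saved as crawler.rb; the filename is illustrative). A finished or interrupted run prints a summary along these lines (the counts and error classes here are made up):

Successful hits: 42
Errors: 3

Mechanize::ResponseCodeError
  http://example.com/missing
  http://example.com/gone
Net::OpenTimeout
  http://slow.example.com

One caveat: crawl_page follows every link it sees, so the crawl happily wanders off-site. A minimal guard to keep it on the starting host could look like the sketch below; same_host? is a hypothetical helper, not part of the script above.

require 'uri'

# Hypothetical helper: true when href stays on the seed URL's host.
# Relative links carry no host of their own, so they count as same-host.
def same_host?(seed_url, href)
  seed_host = URI.parse(seed_url).host
  link_host = URI.parse(href).host
  link_host.nil? || link_host == seed_host
rescue URI::InvalidURIError
  false # hrefs that do not parse as URIs are skipped
end

# Inside crawl_page's link loop, one could then write:
#   next unless same_host?(url, href)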
