gxmatmars / Scraping website content with wget and processing the data with Ruby

Posted phpnetc


pachong.rb
 
URL = 'bangumi.tv/character/'

# IDs that already have a file in download/, fail/ or error/ are skipped.
READY = []
%w[download fail error].each do |dir|
  Dir.glob("#{dir}/*").each do |f|
    READY << $1.to_i if f =~ /#{dir}\/(\d+)/
  end
end

READY.uniq!
 
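# Fetches the character page for ID i, then its cover picture.
# Returns the log text for this ID ('' if the page was not saved).
def download(i)
  log = ''
  fn = i.to_s
  system "wget #{URL}#{fn}"

  lines = []

  # wget saves the page under the ID as its file name.
  return '' unless File.exist?(fn)

  File.open(fn, 'r') do |f|
    lines = f.readlines
  end

  find = false
  lines.each do |l|
    # The title has the form "name | description".
    if l =~ /<title>(.+)<\/title>/
      name, description = $1.split('|').collect { |e| e.strip }
      log << "#{i}: #{name}, #{description}\n"
    end
    # The cover link is protocol-relative; drop its query string.
    if l =~ /href="(.+)" class="cover thickbox"/
      url = 'http:' + $1
      url.slice!(/\?.+$/)
      log << url + "\n"
      system "wget #{url}"
      system "rm #{fn}"
      find = true
      break
    end
  end

  # No cover link found: keep the page in fail/ for inspection.
  unless find
    system "mv #{fn} fail\\"
    log << "\n"
  end

  log
end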
 
# Usage: ruby pachong.rb <first id> <number of ids>
i = ARGV[0].to_i
n = ARGV[1].to_i

log = ''

n.times do
  log << download(i) unless READY.include?(i)
  i += 1
end

system "mv *.jpg download\\"

File.open('pachong.txt', 'a') do |f|
  f << log
end

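
The extraction step inside download can be tried without any network access by isolating it in a pure function. The following is only a sketch: `parse_character` is a hypothetical helper that is not part of the original script, and the sample lines are made-up markup, not real bangumi.tv output. It reuses the two regular expressions from pachong.rb.

```ruby
# Hypothetical helper: pulls the title parts and the cover URL out of page lines.
def parse_character(lines)
  info = { name: nil, description: nil, cover: nil }
  lines.each do |line|
    if line =~ /<title>(.+)<\/title>/
      info[:name], info[:description] = $1.split('|').map(&:strip)
    end
    if line =~ /href="(.+)" class="cover thickbox"/
      # Same clean-up as pachong.rb: add the scheme, drop the query string.
      info[:cover] = ('http:' + $1).sub(/\?.+$/, '')
      break
    end
  end
  info
end

# Made-up sample input for illustration only.
sample = [
  "<title>Example Name | Example Site</title>\n",
  "<a href=\"//img.example.com/pic/1.jpg?r=123\" class=\"cover thickbox\">\n"
]
p parse_character(sample)
```

Keeping the parsing separate from the wget calls makes it possible to check the regular expressions against a saved page before starting a long run.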
readme.md
 

before running

  1. Install wget and Ruby.
  2. Create the folders download and fail.
  3. Edit forloop.bat:
    • line 5: set (start, step, end), for example step = 50 and end = start + 1000 (20 threads).
    • line 7: the second parameter passed to pachong.rb should be >= step.
  4. Run forloop.bat.
  5. When almost all pictures have been downloaded, run ruby run.rb 50.
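
To see which processes a given (start, step, end) setting would launch, the loop can be mirrored in Ruby. This is a sketch under assumptions: `batches` is a hypothetical helper, not part of the repository, and it only imitates the inclusive counting of the batch file's `for /l` loop.

```ruby
# Hypothetical helper: lists the [start, count] pairs that a
# `for /l %%i in (first, step, last)` loop would pass to pachong.rb.
def batches(first, step, last, count)
  # The readme asks for count >= step so that consecutive ranges leave no gap.
  raise ArgumentError, 'count should be >= step' if count < step

  first.step(last, step).map { |start| [start, count] }
end

# Values taken from the forloop.bat listed on this page.
plan = batches(30001, 500, 40000, 500)
puts "#{plan.size} processes"
plan.first(3).each { |start, count| puts "ruby pachong.rb #{start} #{count}" }
```

With the values from forloop.bat this yields 20 start IDs, from 30001 to 39501.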

tips

  1. This script may miss some pictures. Just run it again; pictures already in the folders are skipped.
  2. If a cmd window gets stuck, press Enter to skip the current wget command.

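
Re-running is cheap because IDs that already have a file in one of the result folders are skipped. A minimal sketch of that bookkeeping, assuming (as the scripts' regular expressions do) that each file name starts with the numeric ID; `done_ids` is a hypothetical helper and the demo files are created in a temporary directory:

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: collects the numeric IDs that prefix file names
# in the given folders. Missing folders are simply ignored.
def done_ids(dirs)
  dirs.flat_map { |dir| Dir.glob("#{dir}/*") }
      .map { |path| File.basename(path)[/\A\d+/] }
      .compact
      .map(&:to_i)
      .uniq
end

Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, 'download'))
  FileUtils.mkdir_p(File.join(root, 'fail'))
  FileUtils.touch(File.join(root, 'download', '30001_demo.jpg'))
  FileUtils.touch(File.join(root, 'fail', '30002'))
  FileUtils.touch(File.join(root, 'fail', 'notes.txt')) # no ID prefix, ignored
  dirs = %w[download fail error].map { |d| File.join(root, d) }
  p done_ids(dirs).sort
end
```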
forloop.bat
 
@echo off
mkdir download
mkdir fail
mkdir error
for /l %%i in (30001,500,40000) do (
    @ping 127.0.0.1 -n 1 >nul
    start /min cmd /c ruby pachong.rb %%i 500
)

run.rb
 
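# Move every file whose name starts with digits to error/,
# then any remaining .jpg files to download/.
Dir.glob('*').each do |f|
  system "mv #{f} error\\" if f =~ /^\d+/
end
system "mv *.jpg download\\"

# Only gaps longer than Limit IDs are retried; defaults to 50.
Limit = ARGV[0] ? ARGV[0].to_i : 50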
 
# IDs that already have a file in download/, fail/ or error/.
READY = []
%w[download fail error].each do |dir|
  Dir.glob("#{dir}/*").each do |f|
    READY << $1.to_i if f =~ /#{dir}\/(\d+)/
  end
end

r = READY.sort
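show = true   # true while scanning handled IDs, false while inside a gap
j = 0         # first ID of the current gap

start = []    # first missing ID of each gap
step = []     # length of each gap

for i in 20001..40000
  if show
    if !r.include?(i)
      start << i
      show = !show
      j = i
    end
  else
    if r.include?(i)
      step << i - j
      print "#{j} -> #{i} : #{i - j}\n"
      show = !show
    end
  end
end

# Close a gap that runs to the end of the range so start and step stay aligned.
step << 40001 - j unless show

print "total: #{step.sum}\n"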
 
# Relaunch pachong.rb for each gap longer than Limit. Gaps longer than
# 2 * Limit are cut, and the remainder is queued as a new gap.
n = 0
i = 0
while start[i]
  if step[i] > Limit
    if step[i] > 2 * Limit
      start << start[i] + 2 * Limit
      step << step[i] - 2 * Limit
      step[i] = 2 * Limit
    end
    start[i] += 1
    puts "#{start[i]} + #{step[i]}"
    system "start /min cmd /c ruby pachong.rb #{start[i]} #{step[i]}"
    sleep(1)
    n += 1
    break if n > 20
  end
  i += 1
end
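
The gap search in run.rb can also be written as two small pure functions, which makes it easier to check on a toy range. This is a simplified sketch and not the author's code: `missing_runs` and `chunk_runs` are hypothetical helpers, and unlike run.rb they do not skip short gaps, shift the start ID, or cap the number of launched processes.

```ruby
require 'set'

# Hypothetical helper: returns [first_missing_id, length] for each run of
# consecutive IDs in `range` that is absent from `done`.
def missing_runs(done, range)
  done = done.to_set
  range.reject { |id| done.include?(id) }
       .slice_when { |a, b| b != a + 1 }
       .map { |run| [run.first, run.size] }
end

# Hypothetical helper: splits each run into chunks of at most `max` IDs.
def chunk_runs(runs, max)
  runs.flat_map do |first, length|
    (0...length).step(max).map do |offset|
      [first + offset, [max, length - offset].min]
    end
  end
end

runs = missing_runs([1, 2, 5, 6, 10], 1..10)
p runs                 # [[3, 2], [7, 3]]
p chunk_runs(runs, 2)  # [[3, 2], [7, 2], [9, 1]]
```

Each resulting pair could then be passed to pachong.rb as its two arguments. Using a Set for the lookup avoids the repeated Array#include? scan that run.rb performs for every ID.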
