python Quick script to fetch all linked pages from archive.org and download the files in the uploads folder
The script below walks the Wayback Machine copy of a site starting from a single snapshot, follows every archived link it finds, and saves anything under the site's /uploads/ path into a local folder.
import re, urllib2

site = 'dont ask me ill never tell'            # original site name withheld by the author
start = '/web/20090130021320/' + site + '/'    # Wayback Machine snapshot path to start from
base = 'http://web.archive.org'
dest_folder = '/Users/jyan/Desktop/OMG'        # local folder for the downloaded uploads
keep_substr = '/' + site + '/uploads/'         # archived URLs containing this are saved to disk
# matches archived links of the form /web/<timestamp>/<site>/... up to a quote or '#'
url_pattern = re.compile(r"(/web/[^\s]+?/" + site + "/[^\s]*?)[\"'#]")


def find_urls(markup):
    # Extract the set of archived URLs referenced in a page.
    return set(url_pattern.findall(markup))


def fetch_url(url):
    reader = urllib2.urlopen(url)
    markup = reader.read()
    reader.close()
    return markup


def save_file(url, data):
    # Save an uploads file under its name relative to the uploads folder.
    name = url.split(keep_substr)[1]
    f = open(dest_folder + '/' + name, "wb")   # binary mode so images etc. survive intact
    f.write(data)
    f.close()
    print "\tSaved %s as %s" % (url, name)


def unique_key(url):
    # Strip the /web/<timestamp>/<site> prefix so the same page from
    # different snapshots is only visited once.
    return url.split(site)[1]


def fetch_graph(start):
    # Depth-first crawl of the archived site, starting from `start`.
    stack = [start]
    added = set()
    while len(stack) > 0:
        path = stack.pop()
        added.add(unique_key(path))
        print "Processing %s... (%d URLs)" % (path, len(added))
        try:
            data = fetch_url(base + path)
            print "Fetched %s: %d bytes" % (base + path, len(data))
            if path.find(keep_substr) >= 0:
                # an uploads file: download it instead of crawling it
                save_file(path, data)
            else:
                urls = find_urls(data)
                print "\tFound %d URLs" % (len(urls))
                for url in urls:
                    if unique_key(url) not in added:
                        stack.append(url)
        except Exception:
            print "\tFailed to fetch %s" % (base + path)


if __name__ == "__main__":
    print "Starting"
    fetch_graph(start)
    print "Done"