python quick script to fetch all the linked pages from archive.org and download the files in the uploads folder

# Python 2 script: uses urllib2 and print-statement syntax.
import re, urllib2
 
site = 'dont ask me ill never tell'   # target hostname (redacted by the author)
 
start = '/web/20090130021320/' + site + '/'   # Wayback snapshot path to start crawling from
base = 'http://web.archive.org'               # Wayback Machine host
dest_folder = '/Users/jyan/Desktop/OMG'       # local folder the downloads go into
 
keep_substr = '/' + site + '/uploads/'        # only URLs containing this are saved to disk
 
# capture archived links of the form /web/<timestamp>/<site>/... up to a quote or '#'
url_pattern = re.compile(r"(/web/[^\s]+?/" + site + "/[^\s]*?)[\"'#]")
 
def find_urls(markup):
  # return the set of archived URLs referenced in a page's HTML
  return set(url_pattern.findall(markup))
 
def fetch_url(url):
  # download a URL and return the raw response body
  reader = urllib2.urlopen(url)
  markup = reader.read()
  reader.close()
  return markup
  
def save_file(url, data):
  # save the file under the part of the URL that follows .../uploads/
  # (assumes a flat uploads folder; nested paths would need their directories created first)
  name = url.split(keep_substr)[1]
  f = open(dest_folder + '/' + name, "wb")  # binary mode so images etc. are written intact
  f.write(data)
  f.close()
  print "\tSuccess saving %s into %s" % (url, name)
  
def unique_key(url):
  # strip the changing /web/<timestamp>/ prefix so the same page is only visited once
  return url.split(site)[1]
  
def fetch_graph(start):
  # depth-first crawl: pop a path, fetch it, save it if it lives under
  # uploads/, otherwise scan it for further archived links to visit
  stack = [start]
  added = set()
  while len(stack) > 0:
    current = stack.pop()
    added.add(unique_key(current))
    
    print "Processing %s... (%d URLs)" % (current, len(added))
    
    try:
      data = fetch_url(base + current)
      
      print "Fetched %s: %d bytes" % (base + current, len(data))
      
      if current.find(keep_substr) >= 0:
        # this one lives under uploads/, so download it instead of crawling it
        save_file(current, data)
        
      else:
        urls = find_urls(data)
 
        print "\tFound %d URLs" % (len(urls))
        
        for url in urls:
          if unique_key(url) not in added:
            stack.append(url)
    
    except Exception:
      print "\tFailed to fetch %s" % (base + current)
 
if __name__ == "__main__":
  print "Starting"
  fetch_graph(start)
  print "Done"

The above is the main content on this quick Python script for fetching all the linked pages from archive.org and downloading the files in the uploads folder. If it did not solve your problem, the following articles may help:

json Get all archive.org snapshots as a list (a minimal sketch follows this list)

Fetching album art from the Cover Art Archive (archive.org) API causes CORS errors because of redirects

sh A bash script that bulk-downloads Internet Archive (archive.org) items by the record IDs listed in todo.txt

Get the latest version of a file saved on archive.org

Get geometry and names from a map design website

sh Download from archive.org's Wayback Machine
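
The snapshot timestamp hard-coded into start above (20090130021320) has to come from somewhere, and the first related post touches the same question. A minimal sketch of listing the available snapshots, assuming the public Wayback CDX endpoint at http://web.archive.org/cdx/search/cdx with its url= and output=json parameters, and using example.com as a placeholder domain:

import json
import urllib.request

# output=json yields a JSON array: a header row of field names, then one row per capture
query = 'http://web.archive.org/cdx/search/cdx?url=example.com&output=json&limit=10'
with urllib.request.urlopen(query) as reader:
    rows = json.loads(reader.read().decode('utf-8'))

if rows:
    header, captures = rows[0], rows[1:]
    for capture in captures:
        fields = dict(zip(header, capture))
        # the timestamp is what goes into a /web/<timestamp>/<url> path
        print(fields['timestamp'], fields['original'])

Any of the returned timestamps could be dropped into the start path used by the crawl script above.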