Python: get links to all pages on a website


Editor's note: this article, compiled by cha138.com, shows how to use Python to collect the links to every page on a website; hopefully it is of some reference value to you.

import mechanize
import urllib2
import urlparse
import argparse
import json

p = argparse.ArgumentParser(description="Collect the links to every page on a website")
p.add_argument("-s", "--site", required=True, help="start URL, e.g. http://example.com")
p.add_argument("-d", "--domain_limit", help="only follow links in this domain, e.g. example.com")

def get_all_links(br, links, visited=None, recursion=0, domain_limit=None):
	# Use visited=None rather than a mutable default (visited=set()),
	# which would persist across top-level calls.
	if visited is None:
		visited = set()
	if recursion:
		print "***** RECURSION %d *****" % recursion
	new_links = set()
	for link in links:
		if link not in visited:
			if domain_limit:
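				# Keep only the last two labels of the host name
				# (e.g. "www.example.com" -> "example.com") for the comparison.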
				link_parsed = urlparse.urlparse(link)
				dom = ".".join(link_parsed.netloc.split(".")[-2:])
				if dom != domain_limit:
					print "Skipping %s because it's not in the %s domain" % (link, domain_limit)
					continue
			print "Getting page: %s" % link
			visited.add(link)
			try:
				br.open(link)
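				# Only HTML pages can contain links; skip images, PDFs, etc.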
				if not br.viewing_html():
					continue
			except urllib2.HTTPError, e:
				if e.getcode() == 403:
					print "Skipping %s because it's in robots.txt" % link
				else:
					print "HTTPError %s on %s" % (e.getcode(), link)
				# Either way the page was not fetched, so there are
				# no links to harvest from it.
				continue
			except urllib2.URLError, e:
				print "URLError: %s" % e
				continue
			for l in br.links():
				if l.absolute_url not in links and l.absolute_url not in new_links:
					new_links.add(l.absolute_url)
	if new_links:
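		# Recurse on the links discovered in this pass; everything crawled
		# so far is passed along as visited.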
		recursion += 1
		links = links.union(get_all_links(br, new_links, links.union(visited), recursion, domain_limit))
	return links

if __name__ == "__main__":
	args = p.parse_args()
	br = mechanize.Browser()
	links = set()
	try:
		links = get_all_links(br, set([args.site]), domain_limit=args.domain_limit)
	except Exception, e:
		print e
	if links:
		print "Found %d links!" % len(links)
		url = urlparse.urlparse(args.site)
		with open("%s.json" % url.netloc, "w") as f:
			f.write(json.dumps(list(links), indent=2))
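
The script above is written for Python 2 (mechanize's old API, urllib2, urlparse, and print statements). Assuming it is saved as get_links.py (a file name chosen here for illustration), a crawl restricted to one domain might look like:

	python get_links.py -s http://example.com -d example.com

For Python 3, below is a rough equivalent sketch using only the standard library: urllib.request and html.parser in place of mechanize, and an explicit queue in place of the recursion. The names crawl, LinkParser, and MAX_PAGES, and the page cap itself, are my own additions, not part of the original script; note also that unlike mechanize, plain urllib does not honor robots.txt.

import json
import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

MAX_PAGES = 500  # safety cap so the crawl always terminates

class LinkParser(HTMLParser):
	# Collects the href attribute of every <a> tag on a page.
	def __init__(self):
		super().__init__()
		self.hrefs = []

	def handle_starttag(self, tag, attrs):
		if tag == "a":
			for name, value in attrs:
				if name == "href" and value:
					self.hrefs.append(value)

def crawl(start, domain_limit=None):
	visited, queue = set(), [start]
	while queue and len(visited) < MAX_PAGES:
		link = queue.pop()
		if link in visited:
			continue
		if domain_limit:
			# Same check as the original: compare the last two host labels.
			dom = ".".join(urlparse(link).netloc.split(".")[-2:])
			if dom != domain_limit:
				continue
		visited.add(link)
		try:
			with urllib.request.urlopen(link, timeout=10) as resp:
				if "html" not in resp.headers.get("Content-Type", ""):
					continue  # skip images, PDFs, and other non-HTML content
				html = resp.read().decode("utf-8", errors="replace")
		except (urllib.error.URLError, ValueError) as e:
			print("Error on %s: %s" % (link, e))
			continue
		parser = LinkParser()
		parser.feed(html)
		# Resolve relative hrefs against the page they were found on.
		queue.extend(urljoin(link, href) for href in parser.hrefs)
	return visited

if __name__ == "__main__":
	links = crawl("http://example.com", domain_limit="example.com")
	print("Found %d links!" % len(links))
	print(json.dumps(sorted(links), indent=2))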

That covers the main content on getting links to all pages on a website with Python.
