A Python Implementation of Two Crawler Traversal Strategies: Breadth-First and Depth-First
Breadth-first traversal (BFS) is naturally implemented with a queue: newly discovered URLs are appended to the tail of the queue, while the next URL to crawl is taken from the head, i.e. first in, first out (FIFO).
Depth-first traversal (DFS) can be implemented recursively, returning to the previous level once a page has no further child links. It can also be implemented non-recursively with a stack, i.e. last in, first out (LIFO).
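As a minimal illustration of the recursive formulation (not part of the program below), the following sketch assumes a hypothetical get_links(url) helper that returns the outgoing links of a page, plus a depth limit to keep the recursion bounded:

def dfs(url, visited, get_links, max_depth=3, depth=0):
    # Recursive depth-first traversal sketch; get_links() is a hypothetical helper.
    if url in visited or depth > max_depth:
        return
    visited.add(url)                      # mark this page as visited
    for link in get_links(url):
        dfs(link, visited, get_links, max_depth, depth + 1)  # descend before moving on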
As this shows, DFS and BFS are highly symmetric, so in Python we do not need two separate data structures; a single one is enough, namely the UrlSequence class in this program.
The unvisited list self.unvisited maintained by UrlSequence supports both the dequeue and the stack-pop operation: the former via pop(0), the latter simply via pop().
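A quick illustration of the difference between the two calls (not part of the crawler itself):

urls = ["a", "b", "c"]
urls.append("d")        # new links always go to the tail

print(urls.pop(0))      # "a" -> FIFO / queue behaviour, used for BFS
print(urls.pop())       # "d" -> LIFO / stack behaviour, used for DFS

Note that list.pop(0) is O(n) because the remaining elements are shifted; for a very large frontier, collections.deque with popleft() would be more efficient, but a plain list keeps the program simple.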
The UrlSequence class collects all functions related to the URL sequences. It maintains two lists, visited and unvisited, and implements adding, removing and counting URL elements. Its definition follows, with comments throughout.
class UrlSequence:
    # Initialization
    def __init__(self):
        # the set of visited URLs
        self.visited = []
        # the set of unvisited URLs
        self.unvisited = []

    # get "visited" list
    def getVisitedUrl(self):
        return self.visited

    # get "unvisited" list
    def getUnvisitedUrl(self):
        return self.unvisited

    # Add to visited URLs: once a URL has been crawled, append it to "visited"
    # to mark it as processed.
    def Visited_Add(self, url):
        self.visited.append(url)

    # Remove from visited URLs
    def Visited_Remove(self, url):
        self.visited.remove(url)

    # Dequeue from unvisited URLs. By default pop() returns the last element,
    # so pop(0) is used here to take the element at the head of the queue (BFS).
    def Unvisited_Dequeue(self):
        try:
            return self.unvisited.pop(0)
        except IndexError:
            return None

    # Pop from unvisited URLs (stack pop), suitable for depth-first traversal (DFS)
    def Unvisited_Pop(self):
        try:
            return self.unvisited.pop()
        except IndexError:
            return None

    # Add new URLs: newly parsed URLs are added to the unvisited list, but only
    # if they appear in neither "visited" nor "unvisited", to avoid crawling duplicates.
    def Unvisited_Add(self, url):
        if url != "" and url not in self.visited and url not in self.unvisited:
            self.unvisited.append(url)

    # The count of visited URLs
    def Visited_Count(self):
        return len(self.visited)

    # The count of unvisited URLs
    def Unvisited_Count(self):
        return len(self.unvisited)

    # Determine whether the "unvisited" queue is empty
    def UnvisitedIsEmpty(self):
        return len(self.unvisited) == 0
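A short usage example of the class (illustrative only; the example.com URLs are placeholders):

seq = UrlSequence()
seq.Unvisited_Add("http://www.example.com/a")
seq.Unvisited_Add("http://www.example.com/b")
seq.Unvisited_Add("http://www.example.com/a")   # duplicate, silently ignored

print(seq.Unvisited_Count())        # 2
print(seq.Unvisited_Dequeue())      # ".../a" -> queue behaviour (BFS)
print(seq.Unvisited_Pop())          # ".../b" -> stack behaviour (DFS)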
Next is the Crawler class, which initializes the URL sequence, drives the crawling process, and provides functions for fetching page source code and extracting links. It contains four functions: the initializer, the crawling function, the link-extraction function, and the page-source function. Its definition follows, with comments; the imports required by the program are listed first.
import re
import socket
import urllib2
from bs4 import BeautifulSoup   # the older BeautifulSoup 3 import ("from BeautifulSoup import BeautifulSoup") also works here

class Crawler:
    def __init__(self, base_url):
        # Use base_url to initialize the URL sequence
        self.UrlSequence = UrlSequence()
        # Add base_url to the "unvisited" list
        self.UrlSequence.Unvisited_Add(base_url)
        print("Add the base url \"%s\" to the unvisited url list" % base_url)

    def crawling(self, base_url, max_count, flag):
        # Loop condition: the "unvisited" list is not empty and the number of
        # visited URLs does not exceed max_count
        while self.UrlSequence.UnvisitedIsEmpty() is False and self.UrlSequence.Visited_Count() <= max_count:
            # Dequeue or pop, according to the flag
            if flag == 1:  # using BFS
                visitUrl = self.UrlSequence.Unvisited_Dequeue()
            else:  # using DFS
                visitUrl = self.UrlSequence.Unvisited_Pop()
            print("Pop out one url \"%s\" from unvisited url list" % visitUrl)
            if visitUrl in self.UrlSequence.visited or visitUrl is None or visitUrl == "":
                continue
            # Get the links
            links = self.getLinks(visitUrl)
            print("Get %d new links" % len(links))
            # Now "visitUrl" has been visited. Add it to the "visited" list
            self.UrlSequence.Visited_Add(visitUrl)
            print("Visited url count: " + str(self.UrlSequence.Visited_Count()))
            # Add the new URLs to the "unvisited" list
            for link in links:
                self.UrlSequence.Unvisited_Add(link)
            print("%d unvisited links:" % len(self.UrlSequence.getUnvisitedUrl()))

    # Get links from the page source code
    def getLinks(self, url):
        links = []
        data = self.getPageSource(url)
        if data[0] == "200":
            # Create a BeautifulSoup object from the page source
            soup = BeautifulSoup(data[1])
            # Find all <a> tags whose href attribute matches the regular expression ".*",
            # i.e. any href value; only absolute http(s) links are kept below
            a = soup.findAll("a", {"href": re.compile(".*")})
            for i in a:
                if i["href"].find("http://") != -1 or i["href"].find("https://") != -1:
                    links.append(i["href"])
        return links

    # Get the page source code
    def getPageSource(self, url, timeout=100, coding=None):
        try:
            socket.setdefaulttimeout(timeout)
            req = urllib2.Request(url)
            # Add an HTTP User-Agent header
            req.add_header('User-agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)')
            response = urllib2.urlopen(req)
            if coding is None:
                coding = response.headers.getparam("charset")
            if coding is None:
                page = response.read()
            else:
                page = response.read()
                page = page.decode(coding).encode('utf-8')
            # HTTP status code 200 means the request was accepted by the server
            return ["200", page]
        except Exception as e:
            print(str(e))
            return [str(e), None]
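The code above targets Python 2 (urllib2 and the old-style print). For readers on Python 3, a rough standard-library equivalent of getPageSource might look like the sketch below; this is an assumption-based adaptation, not part of the original program:

import urllib.request

def get_page_source(url, timeout=100, coding=None):
    # Python 3 sketch of the same fetch logic (illustrative only).
    try:
        req = urllib.request.Request(
            url, headers={"User-Agent": "Mozilla/5.0 (compatible; crawler-demo)"})
        with urllib.request.urlopen(req, timeout=timeout) as response:
            if coding is None:
                coding = response.headers.get_content_charset() or "utf-8"
            page = response.read().decode(coding, errors="replace")
        return ["200", page]
    except Exception as e:
        print(str(e))
        return [str(e), None]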
Finally, the main function is straightforward:
def main(base_url, max_count, flag):
    craw = Crawler(base_url)
    craw.crawling(base_url, max_count, flag)
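For example, to crawl at most 100 pages starting from a seed URL (the URL below is just a placeholder), flag = 1 selects breadth-first traversal and any other value selects depth-first:

if __name__ == "__main__":
    # Seed URL and page limit are illustrative placeholders.
    main("http://www.example.com", 100, 1)    # breadth-first crawl (BFS)
    # main("http://www.example.com", 100, 0)  # depth-first crawl (DFS)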
The principles behind the various crawler techniques (crawler kernels, general-purpose crawlers, topical crawlers, Deep Web crawlers, dynamic web page techniques, anti-crawling measures, and so on) are covered in Chapter 2 of 《互联网大数据处理技术与应用》.