Web爬虫递归BeautifulSoup

Posted 2021-04-10

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了Web爬虫递归BeautifulSoup相关的知识，希望对你有一定的参考价值。

我试图递归抓取所有英文文章链接的维基百科网址。我想执行n的深度优先遍历，但由于某种原因，我的代码不会为每次传递重复。知道为什么吗？

def crawler(url, depth):
    if depth == 0:
        return None
    links = bs.find("div",{"id" : "bodyContent"}).findAll("a" , href=re.compile("(/wiki/)+([A-Za-z0-9_:()])+"))

    print ("Level ",depth," ",url)
    for link in links:
        if ':' not in link['href']:
            crawler("https://en.wikipedia.org"+link['href'], depth - 1)

这是对爬虫的调用

url = "https://en.wikipedia.org/wiki/Harry_Potter"
html = urlopen(url)
bs = BeautifulSoup(html, "html.parser")
crawler(url,3)

答案

您需要获取每个不同URL的页面源（向页面发送请求）。您在crawler()函数中缺少该部分。在函数外添加这些行不会递归调用它们。

def crawler(url, depth):
    if depth == 0:
        return None

    html = urlopen(url)                        # You were missing 
    soup = BeautifulSoup(html, 'html.parser')  # these lines.

    links = soup.find("div",{"id" : "bodyContent"}).findAll("a", href=re.compile("(/wiki/)+([A-Za-z0-9_:()])+"))

    print("Level ", depth, url)
    for link in links:
        if ':' not in link['href']:
            crawler("https://en.wikipedia.org"+link['href'], depth - 1)

url = "https://en.wikipedia.org/wiki/Big_data"
crawler(url, 3)

部分输出：

Level  3 https://en.wikipedia.org/wiki/Big_data
Level  2 https://en.wikipedia.org/wiki/Big_Data_(band)
Level  1 https://en.wikipedia.org/wiki/Brooklyn
Level  1 https://en.wikipedia.org/wiki/Electropop
Level  1 https://en.wikipedia.org/wiki/Alternative_dance
Level  1 https://en.wikipedia.org/wiki/Indietronica
Level  1 https://en.wikipedia.org/wiki/Indie_rock
Level  1 https://en.wikipedia.org/wiki/Warner_Bros._Records
Level  1 https://en.wikipedia.org/wiki/Joywave
Level  1 https://en.wikipedia.org/wiki/Electronic_music
Level  1 https://en.wikipedia.org/wiki/Dangerous_(Big_Data_song)
Level  1 https://en.wikipedia.org/wiki/Joywave
Level  1 https://en.wikipedia.org/wiki/Billboard_(magazine)
Level  1 https://en.wikipedia.org/wiki/Alternative_Songs
Level  1 https://en.wikipedia.org/wiki/2.0_(Big_Data_album)

以上是关于Web爬虫递归BeautifulSoup的主要内容，如果未能解决你的问题，请参考以下文章

web爬虫，BeautifulSoup

python 爬虫 requests+BeautifulSoup 爬取简单网页代码示例

爬虫BeautifulSoup库基本使用，案例解析（附源代码）

python 爬虫 requests+BeautifulSoup 爬取巨潮资讯公司概况代码实例

Python 爬虫-BeautifulSoup

python爬虫之beautifulsoup的使用