Python: crawl Wikipedia starting from a random article. Follow the first link in each article and see where we end up! Spoiler alert: probably at t



import time
import urllib.parse

import bs4
import requests


start_url = "https://en.wikipedia.org/wiki/Special:Random"
target_url = "https://en.wikipedia.org/wiki/Philosophy"

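def find_first_link(url):
    """Return the absolute URL of the first article link at url, or None."""
    # A timeout keeps the crawl from hanging on a stalled connection
    response = requests.get(url, timeout=10)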
    html = response.text
    soup = bs4.BeautifulSoup(html, "html.parser")

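    # This div contains the article's body
    content_div = soup.find(id="mw-content-text")
    if content_div is None:
        return None

    # Newer Wikipedia markup nests the body paragraphs one level deeper, in a
    # div with class "mw-parser-output"; use it when present so the
    # direct-children search below still finds paragraphs.
    parser_output = content_div.find("div", class_="mw-parser-output")
    if parser_output is not None:
        content_div = parser_output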

    # stores the first link found in the article, if the article contains no
    # links this value will remain None
    article_link = None

    # Find all the direct children of content_div that are paragraphs
    for element in content_div.find_all("p", recursive=False):
        # Find the first anchor tag that's a direct child of a paragraph.
        # It's important to only look at direct children, because other types
        # of link, e.g. footnotes and pronunciation, could come before the
        # first link to an article. Those other link types aren't direct
        # children though, they're in divs of various classes.
        anchor = element.find("a", recursive=False)
        if anchor is not None:
            article_link = anchor.get("href")
            break

    if not article_link:
        return None

    # Build a full url from the relative article_link url
    first_link = urllib.parse.urljoin('https://en.wikipedia.org/', article_link)

    return first_link

def continue_crawl(search_history, target_url, max_steps=25):
    if search_history[-1] == target_url:
        print("We've found the target article!")
        return False
    elif len(search_history) > max_steps:
        print("The search has gone on suspiciously long, aborting search!")
        return False
    elif search_history[-1] in search_history[:-1]:
        print("We've arrived at an article we've already seen, aborting search!")
        return False
    else:
        return True

article_chain = [start_url]

while continue_crawl(article_chain, target_url):
    print(article_chain[-1])

    first_link = find_first_link(article_chain[-1])
    if not first_link:
        print("We've arrived at an article with no links, aborting search!")
        break

    article_chain.append(first_link)

    time.sleep(2)  # Slow things down so as to not hammer Wikipedia's servers
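
The stopping rules in continue_crawl can be exercised without touching the network. The following is a minimal sketch, not part of the original script: the crawl function, its next_link parameter, the reason strings, and the fake_links graph are all hypothetical names introduced for illustration. It folds the same four outcomes (target reached, step limit exceeded, loop detected, page with no links) into one function that takes the link finder as an argument, so a plain dictionary can stand in for Wikipedia.

```python
def crawl(start, target, next_link, max_steps=25):
    """Follow next_link from start and report why the walk stopped.

    next_link is any callable mapping a URL to the next URL (or None).
    Returns (chain, reason), where reason is one of
    "target", "max_steps", "loop", or "dead_end".
    """
    chain = [start]
    while True:
        current = chain[-1]
        if current == target:
            return chain, "target"
        if len(chain) > max_steps:
            return chain, "max_steps"
        if current in chain[:-1]:
            return chain, "loop"
        following = next_link(current)
        if not following:
            return chain, "dead_end"
        chain.append(following)


# Hypothetical link graph standing in for Wikipedia pages (assumption:
# each key is a page and each value is that page's first link).
fake_links = {
    "A": "B",
    "B": "C",
    "C": "Philosophy",
    "X": "Y",
    "Y": "X",
    "Z": None,
}

for start in ("A", "X", "Z"):
    chain, reason = crawl(start, "Philosophy", fake_links.get)
    print(start, "->", reason, chain)
```

In the real script, find_first_link would be passed as next_link, with the time.sleep(2) delay kept between requests.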
