Python: a link extractor written in Python

Posted 2021-05-10

Preface: this post was compiled by the editors of 小常识网 (cha138.com). It presents a small link extractor written in Python.

def extract_links(url):
    # extracts all links from a URL and returns them as a list
    # by: Cody Kochmann
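    def curl(link):
        # Downloads the page and returns its body as text.
        try:
            from urllib.request import urlopen  # Python 3
        except ImportError:
            from urllib2 import urlopen  # Python 2 only
        response = urlopen(link)
        # Decode so the caller iterates over characters rather than byte values.
        # UTF-8 is an assumption here; undecodable bytes are replaced.
        return(response.read().decode("utf-8", "replace"))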

    def check_if_link(s,req_http=True):
        # Checks that the input looks like a legitimate link.
        allowed_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=%"
        if req_http and "http" not in s:
            return(False)
        if "://" in s:
            for i in s:
                if i not in allowed_chars:
                    return(False)
            return(True)
        return(False)
    
    collected_links = []
    link_being_built = ""
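    # Unlike the set in check_if_link, this one omits the apostrophe,
    # so a single-quoted value such as href='...' ends the current token.
    allowed_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&()*+,;=%"
    collected_html = curl(url)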
    for i in collected_html:
        if i in allowed_chars:
            link_being_built+=i
        else:
            if link_being_built not in collected_links:
                if check_if_link(link_being_built):
                    collected_links.append(link_being_built)
            link_being_built=""
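    # Flush the last candidate in case the page ends without a delimiter.
    if link_being_built not in collected_links and check_if_link(link_being_built):
        collected_links.append(link_being_built)
    return(collected_links)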
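
The character-by-character scan above can be tried without a network request. The following is a minimal sketch, not the author's code: the helper name links_in_text and the sample string are hypothetical, and the two checks from check_if_link are folded into a single condition. It reuses the same allowed-character set as the outer loop and appends a space as a sentinel so that a link at the very end of the input is still examined.

```python
# Characters treated as part of a candidate link (same set as the outer loop above).
ALLOWED_CHARS = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "0123456789-._~:/?#[]@!$&()*+,;=%"
)


def links_in_text(text, req_http=True):
    """Hypothetical helper: scan a string and return unique link-like tokens in order."""
    found = []
    token = ""
    # The trailing space is a sentinel, so the final token is also checked.
    for ch in text + " ":
        if ch in ALLOWED_CHARS:
            token += ch
            continue
        # Same loose test as check_if_link: "://" must appear, and "http" if required.
        looks_like_link = "://" in token and (not req_http or "http" in token)
        if looks_like_link and token not in found:
            found.append(token)
        token = ""
    return found


# Made-up sample: one double-quoted link, one single-quoted link, one bare duplicate.
sample = (
    '<a href="http://example.com/a?x=1">A</a> '
    "<a href='https://example.org/'>B</a> "
    "http://example.com/a?x=1"
)
print(links_in_text(sample))
```

Because quotes, angle brackets, and whitespace are outside the allowed set, they act as delimiters, which is what lets the scan pick URLs out of attribute values without parsing the HTML. The same property means an unquoted attribute such as href=http://example.com would be collected together with its "href=" prefix.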
