python: A Python-based link extractor

Introduction: this article was compiled by the editors of 小常识网 (cha138.com). It presents a Python-based link extractor and is intended as a reference.

def extract_links(url):
    # Extracts all links from a URL and returns them as a list.
    # by: Cody Kochmann
    def curl(link):
        # Python 3: urlopen lives in urllib.request (urllib2 exists only in Python 2).
        from urllib.request import urlopen
        response = urlopen(link)
        # Decode the bytes so the scan below iterates over characters, not integers.
        return response.read().decode("utf-8", errors="replace")

    def check_if_link(s, req_http=True):
        # Checks that the input is a legitimate link.
        allowed_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=%"
        if req_http and "http" not in s:
            return False
        if "://" in s:
            for i in s:
                if i not in allowed_chars:
                    return False
            return True
        return False

    collected_links = []
    link_being_built = ""
    # The apostrophe is left out here, so a link inside single quotes ends at the closing quote.
    allowed_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&()*+,;=%"
    # The trailing space flushes a link that ends exactly at the end of the page.
    collected_html = curl(url) + " "
    for i in collected_html:
        if i in allowed_chars:
            link_being_built += i
        else:
            # A disallowed character ends the current candidate; keep it if it is a new link.
            if link_being_built not in collected_links:
                if check_if_link(link_being_built):
                    collected_links.append(link_being_built)
            link_being_built = ""
    return collected_links
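
The character-by-character scan in extract_links can also be tried without a network request. The following is a minimal sketch, assuming Python 3, that applies the same scan to a string already in memory. The names extract_links_from_text and URL_CHARS are hypothetical and not part of the original snippet.

```python
# Hypothetical helper (not from the original snippet): same scan, no network access.
URL_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "0123456789-._~:/?#[]@!$&()*+,;=%"
)


def extract_links_from_text(text, req_http=True):
    """Return the unique URL-like tokens in text, in order of first appearance."""
    links = []
    token = ""
    # The trailing space guarantees that the final token is flushed.
    for ch in text + " ":
        if ch in URL_CHARS:
            token += ch
            continue
        # A disallowed character ends the token; keep it if it looks like a link.
        is_link = "://" in token and (not req_http or "http" in token)
        if is_link and token not in links:
            links.append(token)
        token = ""
    return links


if __name__ == "__main__":
    sample = (
        '<a href="http://example.com/a">A</a> '
        "<a href='https://example.org/b?x=1'>B</a>"
    )
    print(extract_links_from_text(sample))
    # ['http://example.com/a', 'https://example.org/b?x=1']
```

Because "=" is an allowed character, an unquoted attribute such as href=http://example.com comes back with the "href=" prefix attached. For anything beyond quick experiments, a real HTML parser is likely the safer choice.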
