Extracting a URL in Python

Posted 2023-02-22

【Title】Extracting a URL in Python 【Posted】2010-10-24 19:26:17 【Question】

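Regarding: Find Hyperlinks in Text using Python (twitter related)

How can I extract just the URL so that I can put it into a list/array?

Edit

Let me clarify: I don't want to parse the URL into pieces. I want to extract the URL from the text of a string and put it into an array. Thanks!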

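【Question comments】

What is wrong with the answers in the other post? It uses a regular expression to find URLs in text. What doesn't work? What's broken? Why duplicate the question?

What's wrong with the answers at ***.com/questions/720113/…?

【Answer 1】

I misunderstood the question at first: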

>>> from urllib.parse import urlparse
>>> urlparse('http://www.ggogle.com/test?t')
ParseResult(scheme='http', netloc='www.ggogle.com', path='/test',
        params='', query='t', fragment='')

Or the Python 2.x version:

>>> from urlparse import urlparse
>>> urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
        params='', query='', fragment='')

ETA: a regex is indeed the best option here:

>>> import re
>>> s = 'This is my tweet check it out http://tinyurl.com/blah and http://blabla.com'
>>> re.findall(r'(https?://\S+)', s)
['http://tinyurl.com/blah', 'http://blabla.com']
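
Because `\S+` runs to the next whitespace character, a URL that ends a sentence keeps its trailing period or comma. Here is a minimal sketch of one way to trim that. The helper name `extract_urls` and the set of punctuation characters are assumptions for illustration, not part of the answer:

```python
import re

# Same pattern as in the answer, compiled once for reuse.
URL_PATTERN = re.compile(r'https?://\S+')

# Characters trimmed from the end of each match (an illustrative choice).
TRAILING = '.,;:!?)\'"'


def extract_urls(text):
    """Return every http(s) URL in text, without trailing punctuation."""
    return [url.rstrip(TRAILING) for url in URL_PATTERN.findall(text)]


tweet = 'Check http://tinyurl.com/blah, then see http://blabla.com.'
print(extract_urls(tweet))
# ['http://tinyurl.com/blah', 'http://blabla.com']
```

The trade-off is that a URL which legitimately ends in one of those characters, such as a closing parenthesis, gets shortened too.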

【Comments】

I like this solution best because it allows extracting multiple URLs.

【Answer 2】

In response to the OP's edit, I hijacked Find Hyperlinks in Text using Python (twitter related) and came up with this:

import re

myString = "This is my tweet check it out http://example.com/blah"

print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))

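【Comments】

I get "invalid syntax" on the last line.

OK, for some reason it works without the print statement.

Good point. I just copied/pasted the original regex. I fixed it to be more robust and included your suggestion. Thanks!

If you get a syntax error on the print statement, you are probably using Python 3.0, which removed the print statement and provides a print("Hello, world.") function instead.

Modified the above to account for the trailing quotes that most URLs have, especially when parsing HTML: re.search("(?P<url>https?://[^\s'\"]+)", myString).group("url")

【Answer 3】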

About this:

import re
myString = "This is my tweet check it out http://tinyurl.com/blah"
print(re.search(r"(?P<url>https?://[^\s]+)", myString).group("url"))

It won't work properly if there are multiple URLs in the string. If the string looks like:

myString = "This is my tweet check it out http://tinyurl.com/blah and http://blabla.com"

you can do this:

for item in myString.split(" "):
    # re.search returns None for words that contain no URL
    match = re.search(r"(?P<url>https?://[^\s]+)", item)
    if match is not None:
        print(match.group("url"))

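【Comments】

I fixed your post; please don't mess it up again.

Or you can do this: print(re.findall(r"(?P<url>https?://[^\s]+)", myString))

【Answer 4】

Don't forget to check whether the search returns None. I found the posts above helpful, but I wasted time dealing with a None result.

See Python Regex "object has no attribute".

That is: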

import re
myString = "This is my tweet check it out http://tinyurl.com/blah"
match = re.search(r"(?P<url>https?://[^\s]+)", myString)
if match is not None:
    print(match.group("url"))
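
The None check can be folded into a small helper so callers never touch the match object directly. This is only a sketch; the function name `first_url` is hypothetical and not from the answer:

```python
import re

URL_PATTERN = re.compile(r'(?P<url>https?://\S+)')


def first_url(text):
    """Return the first http(s) URL in text, or None if there is none."""
    match = URL_PATTERN.search(text)
    # search() returns None when nothing matches, so guard before .group()
    return match.group('url') if match else None


print(first_url('check it out http://tinyurl.com/blah'))  # http://tinyurl.com/blah
print(first_url('no link in this one'))                   # None
```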

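【Comments】

【Answer 5】

[Note: assuming you are using this on Twitter data (as the question suggests), the simplest approach is to use their API, which returns the URLs extracted from a tweet as a field.]

【Comments】

【Answer 6】

Here is a file containing one huge regex: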

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
the web url matching regex used by markdown
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
https://gist.github.com/gruber/8891611
"""
URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\];:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""

I call that file urlmarker.py, and when I need it I just import it, e.g.:

import urlmarker
import re
print(re.findall(urlmarker.URL_REGEX, 'some text news.yahoo.com more text'))

See http://daringfireball.net/2010/07/improved_regex_for_matching_urls and What's the cleanest way to extract URLs from a string using Python?

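【Comments】

Useful regex, but rather obscure. For example, say I want to drop support for the TLD .ni. I see two instances of .ni in the regex (I expected only one). Why the repetition? Should I delete both or only the first occurrence? A few notes on how to edit it for our own needs would be useful to all of us.

It doesn't match URLs with a port, e.g. yahoo.com.br:8080/path

【Answer 7】

If you want to extract URLs from any text, you can use my urlextract. It finds URLs based on the TLDs it finds in the text: starting from the TLD position, it expands to both sides to get the whole URL. It is easy to use. Check it out: https://github.com/lipoja/URLExtract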

    from urlextract import URLExtract

    extractor = URLExtract()
    urls = extractor.find_urls("Text with URLs: ***.com.")

【Comments】

【Answer 8】

You can use the following monstrous regex:

\b((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)\b

Demo regex101

This regex accepts URLs in the following formats:

Input:

add1 http://mit.edu.com abc
add2 https://facebook.jp.com.2. abc
add3 www.google.be. uvw
add4 https://www.google.be. 123
add5 www.website.gov.us test2
Hey bob on www.test.com. 
another test with ipv4 http://192.168.1.1/test.jpg. toto2
website with different port number www.test.com:8080/test.jpg not port 80
www.website.gov.us/login.html
test with ipv4 192.168.1.1/test.jpg.
search at google.co.jp/maps.
test with ipv6 2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg.

Output:

http://mit.edu.com
https://facebook.jp.com
www.google.be
https://www.google.be
www.website.gov.us
www.test.com
http://192.168.1.1/test.jpg
www.test.com:8080/test.jpg
www.website.gov.us/login.html
192.168.1.1/test.jpg
google.co.jp/maps
2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg

Explanation:

\b is a word boundary, used to separate the URL from the rest of the text.

(?:https?://)? matches http:// or https:// if present.

(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6}) matches a standard URL, possibly starting with www. (call it STANDARD_URL).

(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) matches a standard IPv4 address (call it IPv4).

The following matches an IPv6 URL (call it IPv6):

(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9]))

(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5]) matches the port part, if present (call it PORT).

(?:/[\w\.-]*)*/? matches the target-object part of the URL (HTML file, jpg, ...) (call it RESOURCE_PATH).

This gives the following regex:

\b((?:https?://)?(?:STANDARD_URL|IPv4|IPv6)(?:PORT)?(?:RESOURCE_PATH))\b
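
The same composition can be written directly in Python by building the pattern from named pieces. The sketch below uses deliberately simplified stand-ins for the sub-patterns (no IPv6, loose IPv4 and port ranges), so it illustrates the structure rather than reproducing the answer's regex. All constant names are assumptions:

```python
import re

# Simplified stand-ins (assumptions), not the full sub-patterns from the answer.
STANDARD_URL = r'(?:www\.)?[\da-z.-]+\.[a-z]{2,6}'
IPV4 = r'(?:\d{1,3}\.){3}\d{1,3}'
PORT = r':\d{1,5}'
RESOURCE_PATH = r'(?:/[\w.-]*)*/?'

# scheme (optional) + host alternatives + port (optional) + path
URL_RE = re.compile(
    r'\b((?:https?://)?'
    r'(?:' + STANDARD_URL + r'|' + IPV4 + r')'
    r'(?:' + PORT + r')?'
    r'(?:' + RESOURCE_PATH + r'))\b'
)

sample = 'see www.test.com:8080/test.jpg or http://192.168.1.1/a.png now'
print(URL_RE.findall(sample))
# ['www.test.com:8080/test.jpg', 'http://192.168.1.1/a.png']
```

Keeping each piece in its own constant makes it possible to test or replace one part, for example the port pattern, without re-reading the whole expression.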

Sources:

IPv6: Regular expression that matches valid IPv6 addresses

IPv4: https://www.safaribooksonline.com/library/view/regular-expressions-cookbook/9780596802837/ch07s16.html

Port: https://***.com/a/12968117/8794221

Other sources: https://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149

$ more url.py

import re

inputString = """add1 http://mit.edu.com abc
add2 https://facebook.jp.com.2. abc
add3 www.google.be. uvw
add4 https://www.google.be. 123
add5 www.website.gov.us test2
Hey bob on www.test.com. 
another test with ipv4 http://192.168.1.1/test.jpg. toto2
website with different port number www.test.com:8080/test.jpg not port 80
www.website.gov.us/login.html
test with ipv4 (192.168.1.1/test.jpg).
search at google.co.jp/maps.
test with ipv6 2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg."""

regex = r"\b((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)\b"

matches = re.findall(regex, inputString)
print(matches)

Output:

$ python url.py 
['http://mit.edu.com', 'https://facebook.jp.com', 'www.google.be', 'https://www.google.be', 'www.website.gov.us', 'www.test.com', 'http://192.168.1.1/test.jpg', 'www.test.com:8080/test.jpg', 'www.website.gov.us/login.html', '192.168.1.1/test.jpg', 'google.co.jp/maps', '2001:0db8:0000:85a3:0000:0000:ac1f:8001/test.jpg']

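【Comments】

Please don't post identical answers to multiple questions. Post one good answer, then vote/flag to close the other questions as duplicates. If the question is not a duplicate, tailor your answer to the question.

The second character of this part, (?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])), raises an invalid syntax error.

@CarlosOliveira regex=ur"..." should be regex = r"...", at least in Python 3.

【Answer 9】

If you are extracting from HTML source: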

from urlextract import URLExtract
import requests

url = "http://sample.com/samplepage/"  # requests needs an explicit scheme
req = requests.get(url)
text = req.text
# or if you already have the html source:
# text2 = "This is html for ex <a href='http://google.com/'>Google</a> <a href='http://yahoo.com/'>Yahoo</a>"
text = text.replace(' ', '').replace('=','')
extractor = URLExtract()
print(extractor.find_urls(text))

Output (for text2):

['http://google.com/', 'http://yahoo.com/']
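
When the links of interest all sit in `<a href=...>` attributes, the standard library's `html.parser` can read them without any string munging. This is an alternative sketch, not the approach used in the answer. The class name `LinkCollector` is hypothetical and the sample HTML is the `text2` string from above:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


html = ("This is html for ex <a href='http://google.com/'>Google</a> "
        "<a href='http://yahoo.com/'>Yahoo</a>")
collector = LinkCollector()
collector.feed(html)
print(collector.links)
# ['http://google.com/', 'http://yahoo.com/']
```

Unlike a text-based extractor, this only sees URLs inside anchor tags and misses any that appear as plain text in the page.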

【Comments】

【Answer 10】

Just follow the code below and enjoy!

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "your url"  # Any URL that you want to fetch.
r = requests.get(url)
htmlContent = r.content
soup = BeautifulSoup(htmlContent, 'html.parser')

anchors = soup.find_all('a')
all_links = set()
for link in anchors:
    href = link.get('href')
    # Skip anchors without an href and bare "#" placeholders
    if href and href != '#':
        # urljoin resolves relative hrefs against the page URL
        linkText = urljoin(url, href)
        all_links.add(linkText)
        print(linkText)
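
The href values on a page may be relative, root-relative, or already absolute, so each one has to be resolved against the page URL. The sketch below shows how `urllib.parse.urljoin` from the standard library handles each case. The base URL and the href values are made-up placeholders:

```python
from urllib.parse import urljoin

base = 'http://example.com/docs/index.html'  # placeholder page URL (assumption)

# Relative, root-relative, absolute, and fragment-only hrefs
for href in ['page2.html', '/about', 'http://other.example/x', '#top']:
    print(href, '->', urljoin(base, href))
# page2.html -> http://example.com/docs/page2.html
# /about -> http://example.com/about
# http://other.example/x -> http://other.example/x
# #top -> http://example.com/docs/index.html#top
```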

【Comments】
