How to extract the top-level domain name (TLD) from a URL

Posted 2023-02-22

[Question] (asked 2010-11-07 04:56:06):

How can I extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:

'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

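This works for http://www.foo.com, but not for http://www.foo.com.au. Is there a way to do this properly without using special knowledge about valid TLDs (top-level domains) or country codes (because they change)?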
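To make the failure concrete, here is a minimal sketch of the attempt above, written for Python 3 (so `urllib.parse` replaces the `urlparse` module). The function name `naive_domain` is made up for illustration.

```python
from urllib.parse import urlparse


def naive_domain(url):
    """Keep only the last two dot-separated labels of the host name."""
    return ".".join(urlparse(url).netloc.split(".")[-2:])


print(naive_domain("http://www.foo.com"))     # foo.com
print(naive_domain("http://www.foo.com.au"))  # com.au, not the wanted foo.com.au
```

Counting labels cannot tell `com.au` apart from an ordinary registered name, which is why the second call loses `foo`.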

Thanks

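[Comments on the question]:

A related earlier question on Stack Overflow: stackoverflow.com/questions/569137/…

+1: The "simplistic attempt" from this question works well for me, even if, ironically, it didn't work for the author.

A similar question: stackoverflow.com/questions/14406300/…

[Answer 1]: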

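No, there is no "intrinsic" way of knowing that (for example) zap.co.it is a subdomain (because Italy's registrar does sell domains such as co.it) while zap.co.uk is not (because the UK's registrar does not sell domains such as co.uk, only ones like zap.co.uk).

You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly, like the UK's and Australia's. There is no way to work that out from the string alone without such extra semantic knowledge. (Of course the table can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!)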
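The auxiliary-table idea can be sketched as below. The table `KNOWN_SUFFIXES` and the function `registered_domain` are hypothetical, and the entries are a tiny hand-written sample that follows the answer's own examples, not authoritative registry data.

```python
# Hypothetical, hand-maintained table of suffixes under which names are registered.
# It mirrors the examples in the answer: co.uk and com.au are listed, co.it is not.
KNOWN_SUFFIXES = {"com", "it", "uk", "co.uk", "au", "com.au"}


def registered_domain(host):
    """Return the longest known suffix plus the one label to its left."""
    labels = host.lower().split(".")
    # Start at index 1 so that at least one label remains in front of the suffix;
    # longer candidate suffixes are tried before shorter ones.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in KNOWN_SUFFIXES:
            return ".".join(labels[i - 1:])
    raise ValueError("No known suffix in " + host)


print(registered_domain("zap.co.uk"))       # zap.co.uk
print(registered_domain("zap.co.it"))       # co.it
print(registered_domain("www.foo.com.au"))  # foo.com.au
```

The function is only as good as the table behind it, which is the point the answer makes: the knowledge has to come from outside the string.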

[Comments]:

[Answer 2]:

There are many, many TLDs. Here's the list:

http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Here's another list:

http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

Here's another list:

http://www.iana.org/domains/root/db/

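[Comments]:

This doesn't help, because it doesn't tell you which ones have an "extra level", like co.uk.

Lennart: It does help; you can wrap them as optional parts inside a regular expression.

[Answer 3]:

Using this file of effective TLDs, which someone else found on Mozilla's website: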

from urllib.parse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt", encoding="utf8") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds:
        if (exception_candidate in tlds):
            # an exception rule ("!name") marks the candidate itself as the domain
            return ".".join(url_elements[i:])
        if (candidate in tlds or wildcard_candidate in tlds):
            # the candidate is a suffix, so keep one more label to its left
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print(get_domain("http://abcde.co.uk", tlds))

Result:

abcde.co.uk

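I'd appreciate it if someone let me know which parts of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know whether ValueError is the best thing to raise. Comments?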
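One possible tidier variant of the lookup above, offered as a sketch rather than a definitive rewrite: it walks the host labels left to right and keeps the rules in a set for constant-time membership tests. The names `load_suffix_rules` and `registrable_domain` are hypothetical, and the rule lines are passed in directly so that the example does not depend on the downloaded file.

```python
def load_suffix_rules(lines):
    """Collect rules from Public Suffix List style lines, skipping comments and blanks."""
    return {line.strip() for line in lines
            if line.strip() and not line.startswith("//")}


def registrable_domain(host, rules):
    """Return the matching suffix plus one label to its left, e.g. abcde.co.uk."""
    labels = host.lower().split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is the registrable domain.
            return candidate
        if candidate in rules or wildcard in rules:
            if i == 0:
                raise ValueError("Host is itself a public suffix: " + host)
            return ".".join(labels[i - 1:])
    raise ValueError("Domain not in list of suffixes: " + host)


# Tiny illustrative sample, not the real list.
sample_lines = ["// comment line\n", "\n", "uk\n", "co.uk\n", "*.ck\n", "!www.ck\n"]
rules = load_suffix_rules(sample_lines)
print(registrable_domain("www.abcde.co.uk", rules))  # abcde.co.uk
```

Unlike the original, this sketch raises ValueError when the host is itself a suffix instead of returning it; whether that is the better behaviour depends on the caller.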

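[Comments]:

If you need to call get_domain() a lot in practice, for example to extract domains from a large log file, I recommend making tlds a set, e.g. tlds = set([line.strip() for line in tld_file if line[0] not in "/\n"]). This gives you constant-time lookups for each check of whether an item is in tlds. I saw a speedup of about 1500 times for the lookups (set vs. list), and of about 60 times for my entire operation of extracting domains from a log file of about 20 million lines (6 minutes, down from 6 hours).

This is awesome! Just one more question: is the effective_tld_names.dat file also updated for new domains such as .amsterdam, .vodka and .wtf?

Yes, the Mozilla Public Suffix List is maintained regularly, and there are now multiple Python libraries that include it. See publicsuffix.org and the other answers on this page.

Some updates to get this working in 2021: the file is now called public_suffix_list.dat, and Python will complain if you don't specify that the file should be read as UTF8. Specify the encoding explicitly: with open("public_suffix_list.dat", encoding="utf8") as tld_file

[Answer 4]: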

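After seeing this question, someone wrote a great Python module to solve this problem: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, which is maintained by Mozilla volunteers.

Quote:

tldextract on the other hand knows what all gTLDs [generic top-level domains] and ccTLDs [country code top-level domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.

[Comments]:

This worked for me where tld failed (it marked a valid URL as invalid).

I wasted too much time thinking about this problem; I should have known about this and used it from the start.

[Answer 5]:

This is how I handled it: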

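import re
import sys
from urllib.parse import urlparse

# Assume http:// when the scheme is missing, so that urlparse fills in the host.
if not url.startswith('http'):
    url = 'http://' + url
website = urlparse(url)[1]
# Keep only the last two labels, then sanity-check them with a regex.
domain = '.'.join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match:
    sys.exit(2)
elif not match.group(0):
    sys.exit(2)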

[Comments]:

There is a TLD called .travel. The code above won't work with it.

[Answer 6]:

Use the Python tld package:

https://pypi.python.org/pypi/tld

Install

pip install tld

Get the TLD name as a string from the given URL

from tld import get_tld
print(get_tld("http://www.google.co.uk"))

co.uk

Or without the protocol

from tld import get_tld

get_tld("www.google.co.uk", fix_protocol=True)

co.uk

Get the TLD as an object

from tld import get_tld

res = get_tld("http://some.subdomain.google.co.uk", as_object=True)

res
# 'co.uk'

res.subdomain
# 'some.subdomain'

res.domain
# 'google'

res.tld
# 'co.uk'

res.fld
# 'google.co.uk'

res.parsed_url
# SplitResult(
#     scheme='http',
#     netloc='some.subdomain.google.co.uk',
#     path='',
#     query='',
#     fragment=''
# )

Get the first-level domain name as a string from the given URL

from tld import get_fld

get_fld("http://www.google.co.uk")
# 'google.co.uk'

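[Comments]:

New gTLDs are going to make this less reliable.

Hey, thanks for pointing that out. I guess that once new gTLDs are actually in use, a proper fix could come into the tld package.

Thank you @ArturBarseghyan! It's very easy to use with Python. But I'm now using it for an enterprise-grade product; is it a good idea to keep using it even though gTLDs are not supported? If yes, when do you think gTLDs will be supported? Thank you again.

@Akshay Patil: As stated above, once gTLDs are heavily used, a proper fix (if possible) will arrive in the package. Meanwhile, if you are very concerned about gTLDs, you can always catch the tld.exceptions.TldDomainNotFound exception and carry on with whatever you were doing, even if the domain has not been found.

Is it just me, or does tld.get_tld() actually return a fully qualified domain name rather than a top-level domain?

[Answer 7]:

Until get_tld is updated for all the new ones, I pull the TLD from the error. Sure, this is bad code, but it works.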

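import re

# Written as a method, since it reads self.content_url; the inner get_tld call is
# assumed to resolve to the module-level function imported from the tld package.
def get_tld(self):
    try:
        return get_tld(self.content_url)
    except Exception as e:
        # Pull the domain out of the library's error message.
        re_domain = re.compile("Domain ([^ ]+) didn't match any existing TLD name!")
        matchObj = re_domain.findall(str(e))
        if matchObj:
            for m in matchObj:
                return m
        raise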

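[Comments]:

[Answer 8]:

In Python I used tldextract until it failed on a URL like www.mybrand.sa.com, parsing it as subdomain='order.mybrand', domain='sa', suffix='com'!!

So finally, I decided to write this method.

Important note: this only works with URLs that contain a subdomain. It is not meant to replace more advanced libraries such as tldextract.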

def urlextract(url):
    # Split on dots: first label is the subdomain, second the domain, the rest the suffix.
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full url required with subdomain:", url)
    return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}

[Comments]:
