Python + BeautifulSoup: How to get 'href' attribute of 'a' element?

Posted 2023-02-23

【Title】: Python + BeautifulSoup: How to get 'href' attribute of 'a' element? 【Posted】: 2017-10-04 12:20:53 【Question】:

I have the following:

  html =
  '''<div class=“file-one”>
    <a href=“/file-one/additional” class=“file-link">
      <h3 class=“file-name”>File One</h3>
    </a>
    <div class=“location”>
      Down
    </div>
  </div>'''

and I want to get only the text of the href, i.e. /file-one/additional. So I did:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

link_text = “”

for a in soup.find_all(‘a’, href=True, text=True):
    link_text = a[‘href’]

print “Link: “ + link_text

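But it just prints a blank, nothing. Just Link:. So I tested it on another site with different HTML, and it worked fine.

What am I doing wrong? Or is it possible that the site is deliberately programmed not to return the href?

Thanks in advance, and I will be sure to upvote/accept an answer!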

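【Comments】:

Do you really have curly quotes in your HTML? For that matter, why are there curly quotes in your code? What are you coding in? You need to use a text editor.

Your code works for me if you remove the text=True argument.

If you need more information about the quotes, see this article: blogs.msdn.microsoft.com/oldnewthing/20090225-00/?p=19033

@downshift What does text=True do? I thought it returned the result as text.

【Answer 1】: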

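The 'a' tag in your html does not directly contain any text; it contains an 'h3' tag that has the text. This means the tag's own text is None, so .find_all() cannot select the tag. In general, do not use the text parameter if a tag contains any other html elements besides text content.

You can resolve this by selecting elements using only the tag's name (and the href keyword argument), then adding a condition in the loop to check whether they contain text.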
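To see this concretely, here is a minimal sketch using a straight-quoted copy of the HTML from the question. It assumes a reasonably recent bs4 release, where the text filter is also available under the name string; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = '''<div class="file-one">
    <a href="/file-one/additional" class="file-link">
      <h3 class="file-name">File One</h3>
    </a>
    <div class="location">
      Down
    </div>
  </div>'''

soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')

# The <a> tag has three children (whitespace, <h3>, whitespace),
# so it has no single string of its own.
direct_string = a.string

# Filtering on the tag's own string therefore matches nothing.
# (string=True is the newer spelling of text=True.)
filtered = soup.find_all('a', href=True, string=True)

# Without the string filter the tag is found, and its nested text
# is still reachable through get_text().
unfiltered = soup.find_all('a', href=True)
nested_text = a.get_text(strip=True)
```

The point of the sketch is only that the filter inspects the tag's own string, not the text of its descendants.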

soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True): 
    if a.text: 
        links_with_text.append(a['href'])

If you prefer a one-liner, you can also use a list comprehension.

links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]

Or you can pass a lambda to .find_all().

tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
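Note that the function form returns tags rather than href strings, so one more step is needed to get the links. The sketch below uses a named function instead of a lambda; the sample HTML and the helper name is_link_with_text are made up for illustration. It also deviates slightly from the snippet above by stripping whitespace before testing for text, on the assumption that a whitespace-only link should not count as having text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one link with nested text, one empty link,
# and one anchor that has no href at all.
html = """
<a href="/with-text"><h3>Title</h3></a>
<a href="/empty"> </a>
<a name="no-href">Anchor only</a>
"""
soup = BeautifulSoup(html, "html.parser")


def is_link_with_text(tag):
    """Return True for <a> tags that have an href and non-blank text."""
    return (
        tag.name == "a"
        and tag.has_attr("href")
        and bool(tag.get_text(strip=True))
    )


tags = soup.find_all(is_link_with_text)

# find_all() gives back Tag objects; pull the attribute out separately.
hrefs = [tag["href"] for tag in tags]
```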

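If you want to collect all links whether or not they have text, just select all 'a' tags that have an 'href' attribute. Anchor tags usually have links, but that is not a requirement, so I think it is best to use the href argument.

Using .find_all():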

links = [a['href'] for a in soup.find_all('a', href=True)]

Using .select() with a CSS selector:

links = [a['href'] for a in soup.select('a[href]')]
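As a rough illustration of why filtering on href matters, the sketch below uses made-up HTML that includes an anchor without an href. Indexing such a tag with a['href'] would raise KeyError, whereas .get('href') returns None; both filtered approaches skip it.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first anchor is a named target with no href.
html = """
<a name="top"></a>
<a href="/first">First</a>
<a href="/second"><span>Second</span></a>
"""
soup = BeautifulSoup(html, "html.parser")

# Unfiltered: every <a> is returned, so use .get() to avoid KeyError.
hrefs_or_none = [a.get("href") for a in soup.find_all("a")]

# Filtered: only anchors that actually carry an href attribute.
with_find_all = [a["href"] for a in soup.find_all("a", href=True)]
with_select = [a["href"] for a in soup.select("a[href]")]
```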

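【Comments】:

I wanted to tell you about a problem I cannot figure out myself. I would be very glad if you gave this post a try. Thanks.

【Answer 2】:

You can also use attrs to get the href attribute, combined with a regular-expression search: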

import re
soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
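A side note on the different ways of reading the attribute once the tag is found. The sketch below assumes a small made-up HTML fragment. In bs4, dotted access such as a.href is shorthand for looking up a child tag with that name, so it returns None here rather than the attribute value.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical one-link fragment for illustration.
html = '<p><a href="/file-one/additional" class="file-link">File One</a></p>'
soup = BeautifulSoup(html, "html.parser")

a = soup.find("a", href=re.compile(r"[/]([a-z]|[A-Z])\w+"))

by_attrs = a.attrs["href"]  # explicit attribute dictionary
by_index = a["href"]        # shorthand for the same lookup
by_get = a.get("href")      # like dict.get: None if the attribute is missing

# Dotted access searches for a child <href> tag, not an attribute.
dotted = a.href
```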

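【Comments】:

Do you know why calling .href directly does not work, but .attrs['href'] works fine? I just spent 15 minutes debugging this :(

【Answer 3】:

First, use a different text editor, one that does not insert curly quotes.

Second, remove the text=True flag from soup.find_all.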
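Applying both suggestions to the code from the question, a corrected version might look like the sketch below. It assumes Python 3, so the print statement from the question becomes a print() call.

```python
from bs4 import BeautifulSoup

# Same markup as in the question, with straight quotes throughout.
html = '''<div class="file-one">
    <a href="/file-one/additional" class="file-link">
      <h3 class="file-name">File One</h3>
    </a>
    <div class="location">
      Down
    </div>
  </div>'''

soup = BeautifulSoup(html, 'html.parser')

link_text = ""

# No text=True here: the <a> tag wraps an <h3>, so it has no string of its own.
for a in soup.find_all('a', href=True):
    link_text = a['href']

print("Link: " + link_text)
```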

【Comments】:

【Answer 4】:

You can solve this with a few lines of gazpacho:


from gazpacho import Soup

html = """\
<div class="file-one">
    <a href="/file-one/additional" class="file-link">
      <h3 class="file-name">File One</h3>
    </a>
    <div class="location">
      Down
    </div>
  </div>
"""

soup = Soup(html)
soup.find("a", {"class": "file-link"}).attrs['href']

Which would output:

'/file-one/additional'

【Comments】: