Extract 'href' from tag using Beautiful Soup

Posted 2023-03-05

【Title】: Extract 'href' from tag using beautiful soup 【Posted】: 2020-03-17 18:37:37 【Question】:

I have the following tags in my HTML and I want to extract only the href content, i.e. Quatermass_2_Vintage_Movie_Poster-61-10782 and Hard Day's Night

<span class="small">
                                Ref.No:10782<br/>
<a href="Quatermass_2_Vintage_Movie_Poster-61-10782" title="Click for more details and a larger picture of Quatermass 2">
                                Click for more details and a larger picture of <b>Quatermass 2</b>
</a>
</span>, <span class="small">
                                Ref.No:10781<br/>
<a href="Hard_Day__039_s_Night_Vintage_Movie_Poster-61-10781" title="Click for more details and a larger picture of Hard Day's Night">
                                Click for more details and a larger picture of <b>Hard Day's Night</b>
</a>
</span>

The Python code below only lets me find the whole tag

from bs4 import BeautifulSoup

html = ['table2.html']

with open("table2.html", "r") as f:
    contents = f.read()


soup = BeautifulSoup(contents, "lxml")

# Prints each matching <span> element in full
for name in soup.find_all("span", {"class": "small"}):
    print(name)

But I can't select just the href. I tried

for name in soup.find_all("span", {"class": "small"}.get(href)):
    print(name)

I also tried putting the href reference in the print statement

for name in soup.find_all("span", {"class": "small"}):
    print(name.get('href'))

Can anyone help?

【Answer 1】:

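The href attribute belongs to the a tag, not the span. After getting each span tag, you need to find the a tag inside it and then read its href attribute.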

Something like this will work:

for name in soup.find_all("span", {"class": "small"}):
    print(name.find("a").get("href"))
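For reference, here is a minimal self-contained sketch of the same idea. It is not the asker's actual file: the markup is a trimmed copy of the snippet from the question (titles shortened to "..."), the third span without a link and the helper name `extract_hrefs` are assumptions added for illustration, and it uses the built-in "html.parser" instead of "lxml" so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question; the last <span> is an
# added, hypothetical case with no <a> tag inside it.
html = """
<span class="small">Ref.No:10782<br/>
<a href="Quatermass_2_Vintage_Movie_Poster-61-10782" title="...">Click for more details</a>
</span>
<span class="small">Ref.No:10781<br/>
<a href="Hard_Day__039_s_Night_Vintage_Movie_Poster-61-10781" title="...">Click for more details</a>
</span>
<span class="small">No link here</span>
"""

# "html.parser" ships with Python; "lxml" (as in the question) also works if installed.
soup = BeautifulSoup(html, "html.parser")


def extract_hrefs(soup):
    """Collect the href of the first <a> inside each <span class="small">."""
    hrefs = []
    for span in soup.find_all("span", {"class": "small"}):
        link = span.find("a")  # None when the span contains no <a>
        if link is not None and link.get("href"):
            hrefs.append(link.get("href"))
    return hrefs


hrefs = extract_hrefs(soup)

# Alternative: a CSS selector that matches only <a> tags carrying an href.
hrefs_css = [a["href"] for a in soup.select("span.small a[href]")]

print(hrefs)
```

The `None` check matters because `span.find("a")` returns `None` for a span without a link, and calling `.get("href")` on `None` raises an `AttributeError`.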

【Answer 2】:

You can use a regular expression to extract the values, like this:

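import re

text = 'adde <a href="coedd.com" > algo</a>'

# Match href="..." where the value is made of letters, digits, "_", "-" and "."
patt = r'href="[a-zA-Z0-9_\-.]+"'

# re.I makes the match case-insensitive
search = re.findall(patt, text, re.I)

print(search)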

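This returns a list containing all matches. Each match includes the surrounding href="..." text, not just the value.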
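If only the attribute value is wanted, one possible variation is to wrap the character class in a capturing group, since `re.findall` returns the group contents when the pattern has exactly one group. This is a sketch under that assumption; the second link in the sample string is added here for illustration and reuses an href from the question.

```python
import re

# Sample input: the string from the answer plus one added link with an
# upper-case attribute name to show the effect of re.IGNORECASE.
text = (
    'adde <a href="coedd.com" > algo</a> '
    '<a HREF="Quatermass_2_Vintage_Movie_Poster-61-10782">x</a>'
)

# The parentheses form a capturing group, so findall returns only the
# text between the quotes instead of the whole href="..." match.
pattern = re.compile(r'href="([a-zA-Z0-9_\-.]+)"', re.IGNORECASE)

values = pattern.findall(text)
print(values)
```

Note that this character class only accepts letters, digits, "_", "-" and ".", so URLs containing characters such as "/" or "?" would not match without widening it.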

Hope this helps.

Regards.
