Extract src attribute from img tag using BeautifulSoup

Posted 2023-02-23

【Title】: Extract src attribute from img tag using BeautifulSoup 【Asked】: 2017-10-14 09:16:58 【Question】:

<div class="someClass">
    <a href="href">
        <img  src="some"/>
    </a>
</div>

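I want to use BeautifulSoup to extract the src attribute from the img tag. I am using bs4. I cannot get src with a.attrs['src'], although I can get href. What should I do?

【Comments】:

Hi, your post is a bit hard to read; some punctuation and line breaks would help. It would also help to report the exact error message you get and what you expect or want to happen.

@patrick I have edited the question.

Why would you expect a.attrs['src'] to work? The snippet you show has no <a> tag with a src attribute. This is also a completely different question from before, and the title now makes no sense.

@patrick I got the src with a regular expression. What other problems are there?

【Answer 1】:

The link (the a tag) has no src attribute; you have to target the actual img tag.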

import bs4

html = """<div class="someClass">
    <a href="href">
        <img  src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Look up the src attribute of the img tag nested inside the first 'a' tag
print(soup.a.img['src'])
# prints: some

# If there is more than one 'a' tag, check each one for a nested img
for a in soup.find_all('a'):
    if a.img:
        print(a.img['src'])
# prints: some
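
To make the failure in the question concrete, here is a minimal sketch on the same snippet. It shows that looking up 'src' on the a tag raises KeyError, and it uses the CSS selector "a img" as one possible alternative to the loop. The helper name img_sources is hypothetical and not part of bs4.

```python
import bs4

html = """<div class="someClass">
    <a href="href">
        <img  src="some"/>
    </a>
</div>"""

soup = bs4.BeautifulSoup(html, "html.parser")

# The <a> tag only carries href, so looking up 'src' on it raises KeyError.
try:
    soup.a.attrs['src']
    lookup_failed = False
except KeyError:
    lookup_failed = True


def img_sources(soup):
    """Hypothetical helper: collect src values of img tags nested in 'a' tags."""
    return [img['src'] for img in soup.select('a img') if img.has_attr('src')]


print(lookup_failed)      # True
print(img_sources(soup))  # ['some']
```

The selector form is convenient when the img may sit several levels below the link, since "a img" matches any descendant rather than only a direct child.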

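【Comments】:

【Answer 2】:

You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tags themselves, but the same approach also works for a URL together with urllib2.

For a URL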

from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2
page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
    #print image source
    print image['src']
    #print alternate text
    print image['alt']

For text containing img tags

from BeautifulSoup import BeautifulSoup as BSHTML
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
soup = BSHTML(htmlText)
images = soup.findAll('img')
for image in images:
    print image['src']

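【Comments】:

How do I extract the image title from the img tag with id="my_img", for just that one specific image?

Since ID is not a default attribute of the image tag, you can't get anything like image['id']. However, if you print the image value, you will see all of its attributes and values. Perhaps you can then tokenize that and find the image you want by the id you are looking for.

【Answer 3】:

You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tags themselves, but the same approach also works for a URL.

The solution in the highest-rated answer no longer works with Python 3. Here is the correct implementation:

For a URL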

from bs4 import BeautifulSoup as BSHTML
import urllib3

http = urllib3.PoolManager()
url = 'your_url'

response = http.request('GET', url)
soup = BSHTML(response.data, "html.parser")
images = soup.findAll('img')

for image in images:
    #print image source
    print(image['src'])
    #print alternate text
    print(image['alt'])

For text containing img tags

from bs4 import BeautifulSoup as BSHTML
htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
# Name the parser explicitly so bs4 does not have to guess one
soup = BSHTML(htmlText, "html.parser")
images = soup.findAll('img')
for image in images:
    print(image['src'])
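
The URL loop above indexes image['alt'] directly, which raises KeyError for any img tag that has no alt attribute. A minimal sketch of one way around that uses Tag.get, which returns a default instead of raising. The sample markup and the function name describe_images are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the second image has no alt attribute.
html_text = """<img src="https://src1.com/" alt="first" /> <img src="https://src2.com/" />"""


def describe_images(markup):
    """Return (src, alt) pairs; a missing attribute becomes an empty string."""
    soup = BeautifulSoup(markup, "html.parser")
    return [(img.get("src", ""), img.get("alt", "")) for img in soup.find_all("img")]


for src, alt in describe_images(html_text):
    print(src, repr(alt))
```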

【Comments】:

【Answer 4】:

Here is a solution that does not raise a KeyError when an img tag has no src attribute:

from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "[insert name of the site]"
html = urlopen(site)
bs = BeautifulSoup(html, 'html.parser')

images = bs.find_all('img')
for img in images:
    if img.has_attr('src'):
        print(img['src'])
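
As a possible variation, the same check can be pushed into find_all itself: passing src=True matches only tags that actually have a src attribute. The sketch below runs on inline HTML so it needs no network request; the sample markup is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the middle img has no src attribute.
html = '<img src="a.png"><img alt="no source"><img src="b.png">'

bs = BeautifulSoup(html, "html.parser")

# src=True keeps only img tags where the src attribute is present.
sources = [img["src"] for img in bs.find_all("img", src=True)]
print(sources)  # ['a.png', 'b.png']
```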

【Comments】:
