Beautifulsoup: parsing html – get part of href

Posted 2023-02-23

【Posted】: 2017-06-02 22:02:57 【Question】:

I am trying to parse

<td  class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">76561198134729239</a></td>

to get 76561198134729239, but I can't figure out how to do it. Here is what I tried:

import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
content = r.content
soup = BeautifulSoup(content, "html.parser")
# Attempt: match the class and the target attribute in a single find() call
element = soup.find("td",
{
    "class":"listtable_1",
    "target":"_blank"
})
print(element.text)

【Answer 1】:

That HTML contains many such entries. To get all of them, you could use the following:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
soup = BeautifulSoup(r.content, "html.parser")

# Visit every matching cell, then every target="_blank" link inside it
for td in soup.find_all("td", class_="listtable_1"):
    for a in td.find_all("a", href=True, target="_blank"):
        print(a.text)

This returns:

76561198143466239
76561198094114508
76561198053422590
76561198066478249
76561198107353289
76561198043513442
76561198128253254
76561198134729239
76561198003749039
76561198091968935
76561198071376804
76561198068375438
76561198039625269
76561198135115106
76561198096243060
76561198067255227
76561198036439360
76561198026089333
76561198126749681
76561198008927797
76561198091421170
76561198122328638
76561198104586244
76561198056032796
76561198059683068
76561197995961306
76561198102013044

【Answer 2】:

"target":"_blank" is an attribute of the anchor tag a inside the td tag. It is not an attribute of the td tag itself.

You can get it like this:

from bs4 import BeautifulSoup

html="""
<td  class="listtable_1">
    <a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">
        76561198134729239
    </a>
</td>"""

soup = BeautifulSoup(html, 'html.parser')

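# Find the td first, then the a inside it; strip() drops the whitespace around the text
print(soup.find('td', {'class': "listtable_1"}).find('a', {"target": "_blank"}).text.strip())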

Output:

76561198134729239
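
To make the difference concrete, here is a minimal sketch that runs both lookups against the one-line snippet from the question. The variable names (`combined`, `cell`, `link`, `steam_id`) are chosen for illustration only, and the `None` checks are an added precaution rather than part of the answer above.

```python
from bs4 import BeautifulSoup

html = (
    '<td class="listtable_1">'
    '<a href="http://steamcommunity.com/profiles/76561198134729239" '
    'target="_blank">76561198134729239</a></td>'
)
soup = BeautifulSoup(html, "html.parser")

# Both conditions are tested against the same <td>, which has no
# target attribute, so nothing matches.
combined = soup.find("td", {"class": "listtable_1", "target": "_blank"})
print(combined)

# Match the cell first, then search inside it for the link.
cell = soup.find("td", {"class": "listtable_1"})
link = cell.find("a", {"target": "_blank"}) if cell is not None else None
steam_id = link.get_text(strip=True) if link is not None else None
print(steam_id)
```

Because find() returns None when nothing matches, calling .text directly on the result of the combined lookup raises an AttributeError; checking for None first avoids that.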

【Answer 3】:

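As others have mentioned, you are trying to check attributes of different elements in a single find(). Instead, you can chain find() calls as MYGz suggested, or use a single CSS selector: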

soup.select_one("td.listtable_1 a[target=_blank]").get_text()

If you need to locate multiple elements this way, use select():

for elm in soup.select("td.listtable_1 a[target=_blank]"):
    print(elm.get_text())
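
The two selector calls can be tried offline with a small self-contained sketch. The table markup below is an assumed, simplified stand-in for the real page, and the names `sample`, `ids`, `first`, and `first_id` are illustrative. The two IDs are taken from the output listed in the first answer.

```python
from bs4 import BeautifulSoup

# Assumed, simplified markup; the real page may be laid out differently.
sample = """
<table>
  <tr>
    <td class="listtable_1">Steam Community</td>
    <td class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198143466239" target="_blank">76561198143466239</a></td>
  </tr>
  <tr>
    <td class="listtable_1">Steam Community</td>
    <td class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198094114508" target="_blank">76561198094114508</a></td>
  </tr>
</table>
"""
soup = BeautifulSoup(sample, "html.parser")

# select() returns every match in document order.
ids = [a.get_text(strip=True)
       for a in soup.select('td.listtable_1 a[target="_blank"]')]
print(ids)

# select_one() returns the first match, or None when nothing matches.
first = soup.select_one('td.listtable_1 a[target="_blank"]')
first_id = first.get_text(strip=True) if first is not None else None
print(first_id)
```

Cells without a matching link, such as the label cells here, are skipped by the selector, so no extra filtering is needed.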

【Answer 4】:

"class":"listtable_1" belongs to the td tag and target="_blank" belongs to the a tag, so they cannot be used together in one find() call.

You should use the "Steam Community" text as an anchor and read the number that follows it.

Or use the URL: it contains the information you need and is easy to locate. Find the URL and split it on /:

import re
# Match links whose href mentions steamcommunity, then keep the part after the last /
for a in soup.find_all('a', href=re.compile(r'steamcommunity')):
    num = a['href'].split('/')[-1]
    print(num)
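
Splitting on / and taking the last piece returns an empty string if the href ends with a slash, and it keeps any query string. A slightly more defensive sketch, assuming the ID is always the last non-empty path segment, could use urllib.parse instead. The helper name `profile_id` and the sample markup (including the trailing-slash, query-string, and example.com links) are hypothetical and exist only for this illustration.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hypothetical sample links; only the first href comes from the question.
html = """
<a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">profile</a>
<a href="http://steamcommunity.com/profiles/76561198143466239/?l=english" target="_blank">profile</a>
<a href="http://example.com/about">About</a>
"""
soup = BeautifulSoup(html, "html.parser")


def profile_id(href):
    """Return the last non-empty path segment of a URL, or None."""
    segments = [s for s in urlparse(href).path.split("/") if s]
    return segments[-1] if segments else None


# The *= attribute selector matches any href containing the given substring.
ids = [profile_id(a["href"])
       for a in soup.select('a[href*="steamcommunity.com/profiles/"]')]
print(ids)
```

Parsing the URL first means the query string and a trailing slash are ignored, at the cost of a few extra lines compared with a plain split.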

Code:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
soup = BeautifulSoup(r.content, "html.parser")

# Use each "Steam Community" label cell as an anchor and read the td that follows it
for td in soup.find_all('td', string="Steam Community"):
    num = td.find_next_sibling('td').text
    print(num)

Out:

76561198143466239
76561198094114508
76561198053422590
76561198066478249
76561198107353289
76561198043513442
76561198128253254
76561198134729239
76561198003749039
76561198091968935
76561198071376804
76561198068375438
76561198039625269
76561198135115106
76561198096243060
76561198067255227
76561198036439360
76561198026089333
76561198126749681
76561198008927797
76561198091421170
76561198122328638
76561198104586244
76561198056032796
76561198059683068
76561197995961306
76561198102013044
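
The sibling lookup in this answer can also be tried without network access. The sketch below assumes each row holds a label cell whose only content is the text "Steam Community", followed by the cell with the ID. That layout is inferred from the answer's code, not confirmed against the live page, and the names `sample`, `ids`, and `value_cell` are illustrative.

```python
from bs4 import BeautifulSoup

# Assumed row layout: a label cell followed by the value cell.
sample = """
<table>
  <tr>
    <td class="listtable_1">Steam Community</td>
    <td class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198143466239" target="_blank">76561198143466239</a></td>
  </tr>
  <tr>
    <td class="listtable_1">Steam Community</td>
    <td class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198094114508" target="_blank">76561198094114508</a></td>
  </tr>
</table>
"""
soup = BeautifulSoup(sample, "html.parser")

ids = []
for label in soup.find_all("td", string="Steam Community"):
    # find_next_sibling("td") skips the whitespace between the two cells.
    value_cell = label.find_next_sibling("td")
    if value_cell is not None:
        ids.append(value_cell.get_text(strip=True))
print(ids)
```

The string= filter compares the cell text exactly, so a label padded with spaces or line breaks would not match; in that case a compiled regular expression can be passed instead of a plain string.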

【Answer 5】:

You can solve this in gazpacho by chaining two find() calls:

from gazpacho import Soup

html = """<td  class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">76561198134729239</a></td>"""
soup = Soup(html)
soup.find("td", {"class": "listtable_1"}).find("a", {"target": "_blank"}).text

Which outputs:

'76561198134729239'
