Beautifulsoup4 not returning all links on the page

Posted 2023-02-24


[Title]: Beautifulsoup4 not returning all links on the page [Posted]: 2016-01-27 16:07:13 [Question]:

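I'm building a web crawler with Python 3.5, using requests and Beautifulsoup4. I'm trying to get the links to all the topics on the first page of a forum and add them to a list.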

I have 2 problems:

1) I'm not sure how to get the links with Beautifulsoup. I can't reach the link itself, only the div.
2) Beautifulsoup seems to return only a few of the topics, not all of them.

import requests
from bs4 import BeautifulSoup

def getTopics():
    topics = []
    url = 'http://forum.jogos.uol.com.br/pc_f_40'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')

    # Loop over every element whose class attribute is exactly "topicos"
    for link in soup.select('[class="topicos"]'):
        a = link.find_all('a href')
        print(a)

getTopics()

[Comments]:

[Answer 1]:

First of all, it actually does loop over all 38 topics rendered on the page.

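The actual problem is how you extract the link for each topic: link.find_all('a href') will not find anything, because find_all treats its argument as a tag name and there is no element named "a href" on the page. Replace it with link.select('a[href]'), which finds all a elements that have an href attribute.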
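To make the difference concrete, here is a minimal sketch that runs both calls against a small inline HTML snippet. The markup is a made-up stand-in for the forum page (the real page structure may differ), and the names missed and found are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the forum page; the real structure may differ.
html = """
<div class="topicos"><a href="/topic-1">Topic 1</a></div>
<div class="topicos"><a href="/topic-2">Topic 2</a><a name="anchor">no href</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

missed = []  # results of find_all('a href')
found = []   # href values collected through select('a[href]')

for div in soup.select('[class="topicos"]'):
    # find_all treats the string as a tag name; no tag is called "a href"
    missed.extend(div.find_all("a href"))
    # select takes a CSS selector: <a> elements that carry an href attribute
    found.extend(a["href"] for a in div.select("a[href]"))

print(missed)
print(found)
```

If you would rather stay with find_all, link.find_all('a', href=True) should select the same elements.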

Well, you can even solve it with a single list comprehension:

topics = [a["href"] for a in soup.select('.topicos a[href]')]
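Putting it together, the function from the question might end up looking roughly like the sketch below. The split into extract_topic_links and get_topics, the timeout value, and the raise_for_status call are assumptions added for illustration; they are not part of the original answer.

```python
import requests
from bs4 import BeautifulSoup


def extract_topic_links(html):
    """Return the href of every link inside an element with class "topicos"."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select(".topicos a[href]")]


def get_topics(url="http://forum.jogos.uol.com.br/pc_f_40"):
    """Download the forum page and return its topic links as a list."""
    response = requests.get(url, timeout=10)  # timeout is an assumed value
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_topic_links(response.text)
```

Keeping the parsing in its own function means it can be checked against a saved or hand-written HTML string without touching the network, for example extract_topic_links('<div class="topicos"><a href="/t/1">x</a></div>').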

[Comments]:
