How to scrape href with Python 3.5 and BeautifulSoup [duplicate]

Posted 2023-02-23

【Title】How to scrape href with Python 3.5 and BeautifulSoup [duplicate] 【Asked】2016-11-28 22:51:44 【Question】:

I want to scrape the href of every project from the website https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1 with Python 3.5 and BeautifulSoup.

Here is my code:

#Loading Libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup

#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)

#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")


#Scraping "Link" (href)
project_ref = soup.findAll('h6', {'class': 'project-title'})
project_href = [project.findChildren('a')[0].href for project in project_ref if project.findChildren('a')]
print(project_href)

I get [None, None, .... None, None] back. I need a list with all the hrefs from the class.

Any ideas?

【Comments】:

【Answer 1】:

Try something like this:

import urllib.request
from bs4 import BeautifulSoup

theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)

soup = BeautifulSoup(thepage, "html.parser")

project_href = [i['href'] for i in soup.find_all('a', href=True)]
print(project_href)

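This returns every href on the page. Looking at your link, many of the href attributes contain nothing but #. You can filter those out with a simple regular expression, or just skip the # values directly: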

project_href = [i['href'] for i in soup.find_all('a', href=True) if i['href'] != "#"]

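This will still give you some junk links such as /discover?ref=nav, so if you want to narrow the result down, use a regular expression that matches only the links you need.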
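As a rough sketch of that narrowing step, the snippet below filters a list of hrefs with a compiled pattern. The sample list and the assumption that project links start with /projects/ are illustrative only and were not checked against the real page.

```python
import re

# Sample hrefs: "#" and "/discover?ref=nav" come from the answer above,
# the two /projects/ entries are made-up placeholders.
hrefs = [
    "#",
    "/discover?ref=nav",
    "/projects/example/one",
    "/projects/example/two?ref=category",
]

# Assumption: the links of interest start with /projects/.
PROJECT_LINK = re.compile(r"^/projects/")

# Keep only the hrefs that match the pattern.
project_hrefs = [h for h in hrefs if PROJECT_LINK.search(h)]
print(project_hrefs)
```

The same compiled pattern can also be passed straight to BeautifulSoup, as in soup.find_all('a', href=PROJECT_LINK), so that the filtering happens during the search.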

EDIT:

To address the issue you mentioned in the comments:

soup = BeautifulSoup(thepage, "html.parser")
for i in soup.find_all('div', attrs={'class': 'project-card-content'}):
    print(i.a['href'])
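For context on the [None, None, ...] result in the question: on a BeautifulSoup tag, dotted access such as tag.href looks for a child tag named href rather than reading the attribute. The sketch below shows the difference on a small hand-written snippet; the markup is hypothetical and only mimics the class names used above, it is not the real Kickstarter HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for two project cards.
html = """
<div class="project-card-content">
  <h6 class="project-title"><a href="/projects/example/one">One</a></h6>
</div>
<div class="project-card-content">
  <h6 class="project-title">No link here</h6>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
first_link = soup.find("a")

# Dotted access searches for a child tag called <href>, so it yields None.
dot_access = first_link.href

# Subscript access reads the HTML attribute itself.
item_access = first_link["href"]

# Skip cards without an <a>, otherwise card.a is None and ['href'] raises TypeError.
cards = soup.find_all("div", attrs={"class": "project-card-content"})
hrefs = [card.a["href"] for card in cards if card.a is not None]

print(dot_access, item_access, hrefs)
```

The `if card.a is not None` guard is an addition to the loop above; it only matters if some cards on the page have no link.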

【Discussion】:

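Oh yes, that works. Thx... Is it possible to get the hrefs from the class only? OK, I'll edit my post as soon as I'm at work.

Please update the code. Thank you...

Thank you. Now I get a list with the correct hrefs. That's good. Do you know what code I have to write to get them as strings? I mean a result like this: ['href1', 'href2', 'href3',...., 'href10']. My other data looks like that, and I want to export the data to csv and split it into separate rows. Thanks a lot.

The code I provided gives you the links line by line. You can use [i.a['href'] for i in soup.find_all('div', attrs={'class': 'project-card-content'})] to get them back as a list.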
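For the CSV export asked about in the comments, one possible approach is to write each href as its own row with the standard csv module. This is only a sketch: the href values are placeholders, and write_rows and the "href" header are hypothetical names chosen for illustration.

```python
import csv
import io

# Placeholder values standing in for the scraped list.
project_href = ["href1", "href2", "href3"]

def write_rows(hrefs, handle):
    """Write a header line, then one href per CSV row."""
    writer = csv.writer(handle)
    writer.writerow(["href"])
    writer.writerows([h] for h in hrefs)

# An in-memory buffer keeps the sketch self-contained.
buffer = io.StringIO()
write_rows(project_href, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, the same function can be given a handle from open("projects.csv", "w", newline=""), where the file name is again just an example.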
