Beautiful Soup to parse url to get another urls data

Posted 2023-02-23


【Title】: Beautiful Soup to parse url to get another urls data
【Posted】: 2011-05-26 14:41:59
【Question】:

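I need to parse a URL to get a list of URLs that link to detail pages. Then, from each of those pages, I need to get all the details. I have to do it this way because the detail page URLs are not regularly incremented and they change, but the event list page stays the same.

Basically: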

example.com/events/
    <a href="http://example.com/events/1">Event 1</a>
    <a href="http://example.com/events/2">Event 2</a>

example.com/events/1
    ...some detail stuff I need

example.com/events/2
    ...some detail stuff I need

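【Question comments】:

What have you tried, and why didn't it work?
I haven't tried anything yet. I know how to parse the detail page, but not how to get at the data through the list page.
Yeah, well, if you read the documentation and at least make an effort first, that's a nice touch, IMO.

【Answer 1】: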

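Use urllib2 to get the page, then use Beautiful Soup to get the list of links. You could also try scraperwiki.com.

Edit:

Recent discovery: using BeautifulSoup through lxml with

from lxml.html.soupparser import fromstring

is miles better than using BeautifulSoup on its own. It lets you do dom.cssselect('your selector'), which is a lifesaver. Just make sure you have a good version of BeautifulSoup installed; 3.2.1 works a treat.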

dom = fromstring('<html... ...')
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]

【Discussion】:

【Answer 2】:

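# Python 2 with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

# Download the page and parse it
page = urllib2.urlopen('http://yahoo.com').read()
soup = BeautifulSoup(page)
soup.prettify()
# Print the href of every anchor tag that has one
for anchor in soup.findAll('a', href=True):
    print anchor['href']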

This gives you the list of URLs. You can now iterate over those URLs and parse the data.

inner_div = soup.findAll("div", {"id": "y-shade"}) This is an example. You can go through the BeautifulSoup tutorial.
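
The "iterate over those URLs and parse the data" step can be sketched as follows. This is an illustration under assumptions, not the answerer's code: it is written for Python 3 with `bs4` rather than Python 2 with BeautifulSoup 3, the function names (`fetch_page`, `event_links`, `scrape_events`) and the `events` container id are hypothetical, and it extracts only each detail page's title as a stand-in for "the details". The page fetcher is passed in as a parameter so the demonstration can run on canned HTML instead of the network.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_page(url):
    """Download a page and return its body as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8')


def event_links(listing_html, container_id):
    """Return the hrefs of all anchors inside the element with the given id."""
    soup = BeautifulSoup(listing_html, 'html.parser')
    container = soup.find(id=container_id)
    if container is None:
        return []
    return [a['href'] for a in container.find_all('a', href=True)]


def scrape_events(listing_url, container_id, fetch=fetch_page):
    """Map each detail URL found on the listing page to that page's title."""
    details = {}
    for url in event_links(fetch(listing_url), container_id):
        detail_soup = BeautifulSoup(fetch(url), 'html.parser')
        details[url] = detail_soup.title.string if detail_soup.title else None
    return details


# Offline demonstration: canned pages stand in for real HTTP responses.
PAGES = {
    'http://example.com/events/': (
        '<div id="events">'
        '<a href="http://example.com/events/1">Event 1</a>'
        '<a href="http://example.com/events/2">Event 2</a>'
        '</div>'
        '<a href="http://example.com/about">About</a>'
    ),
    'http://example.com/events/1': '<html><head><title>Event 1 details</title></head></html>',
    'http://example.com/events/2': '<html><head><title>Event 2 details</title></head></html>',
}

result = scrape_events('http://example.com/events/', 'events', fetch=PAGES.__getitem__)
print(result)
```

Against a live site, calling `scrape_events(listing_url, container_id)` without the `fetch` argument would use `fetch_page`. Restricting the search to one container keeps unrelated links, such as the "About" link here, out of the crawl.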

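【Discussion】:

This looks like it will work. Can you narrow the results down to a specific div or ul?
Yes, you can specify the div. For that you can use the class. I'll update my answer in a while.
Can you elaborate on what anchor is in for anchor in soup.findAll('a', href=True): ?
It finds all tags that have an href. Actually, soup.findAll('a', href=True) is the same as soup.findAll(href=True) in the case above.
In the latest version, findAll has been replaced by find_all.
For Python 3 it is: from urllib.request import urlopen and from bs4 import BeautifulSoup

【Answer 3】: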

For the next group of people who run into this: BeautifulSoup has been upgraded to v4, since v3 is no longer being updated.

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

To use it in Python...

from bs4 import BeautifulSoup
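
As a rough sketch of what v4 usage looks like once the package is installed, the snippet below collects links with `find_all` (the v4 spelling of v3's `findAll`) and narrows the search to a single div or ul. The HTML is invented for illustration; the `y-shade` id is borrowed from the earlier answer, and the `events` class is hypothetical.

```python
from bs4 import BeautifulSoup

# Invented markup, inlined so the example needs no network access.
HTML = """
<div id="y-shade">
  <a href="/inside">Inside</a>
  <a name="no-href">No link</a>
</div>
<ul class="events">
  <li><a href="/events/1">Event 1</a></li>
</ul>
<a href="/outside">Outside</a>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# Every anchor in the document that carries an href attribute.
all_hrefs = [a['href'] for a in soup.find_all('a', href=True)]

# Narrow by id: search only inside <div id="y-shade">.
shade = soup.find('div', {'id': 'y-shade'})
shade_hrefs = [a['href'] for a in shade.find_all('a', href=True)]

# Narrow by class: the keyword is class_ because class is reserved in Python.
events = soup.find('ul', class_='events')
event_hrefs = [a['href'] for a in events.find_all('a', href=True)]

print(all_hrefs, shade_hrefs, event_hrefs)
```

Passing `href=True` filters out anchors without an href, so indexing `a['href']` cannot raise a `KeyError`.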

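【Discussion】:

I would now also recommend using Python requests instead of urllib2. Yes, it is a non-core module, but using it will save you a lot of trouble. It was proposed for inclusion in the core, but the decision ultimately went against it. Quick intro - gist.github.com/bradmontgomery/1872970 Docs - docs.python-requests.org/en/master

【Answer 4】: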

Full Python 3 example

Packages

# urllib ships with Python 3 as part of the standard library, so there is nothing to install for it
# pip3 install beautifulsoup4

Example:

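import urllib.request
from bs4 import BeautifulSoup

# Download the page and decode the response bytes to text
with urllib.request.urlopen('https://www.wikipedia.org/') as f:
    data = f.read().decode('utf-8')

# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
d = BeautifulSoup(data, 'html.parser')

print(d.title.string)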

The above should print out 'Wikipedia'.

【Discussion】:

Hi - thank you so much for sharing this - it's great! I'm glad you provided the Python 3 version here!
