How to extract link from <a> inside the <h2 class=section-heading>: BeautifulSoup [duplicate]

Posted 2023-02-23


【Title】: How to extract link from <a> inside the <h2 class=section-heading>: BeautifulSoup [duplicate] 【Posted】: 2016-05-23 05:05:47 【Question】:

I am trying to extract a link that is written like this:

<h2 class="section-heading">
    <a href="http://www.nytimes.com/pages/arts/index.html">Arts »</a>
</h2>

My code is:

from bs4 import BeautifulSoup
import requests, re

def get_data():
    url='http://www.nytimes.com/'
    s_code=requests.get(url)
    plain_text = s_code.text
    soup = BeautifulSoup(plain_text)
    head_links=soup.findAll('h2', {'class':'section-heading'})

    for n in head_links :
       a = n.find('a')
       print a
       print n.get['href'] 
       #print a['href']
       #print n.get('href')
       #headings=n.text
       #links = n.get('href')
       #print headings, links

get_data()

Here, "print a" just prints out the whole <a> line inside the <h2 class=section-heading>, i.e.

<a href="http://www.nytimes.com/pages/world/index.html">World »</a>

But when I execute "print n.get['href']", it throws an error:

print n.get['href'] 
TypeError: 'instancemethod' object has no attribute '__getitem__'

Am I doing something wrong here? Please help.

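I couldn't find a similar question here. My case is a bit unusual: I am trying to extract the links that sit inside section headings with a specific class name.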

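【Comments】:

Also, I think you meant a.get('href') rather than n.get

@cricket_007 the duplicate question doesn't answer this exact error, although it is useful; it also targets an earlier version of the library.

@AnttiHaapala - I was addressing the end goal of the question, not the error, but yes, I see what you're saying

【Answer 1】: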

First, you want the href of the a element, so on that line you should access a, not n. Second, it should be either

a.get('href')

or

a['href']

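The latter form raises a KeyError if no such attribute is found, whereas the former returns None, just like the usual dictionary/mapping interface. Since .get is a method, it has to be called (.get(...)); indexing/element access does not work on it (.get[...]), and that is what the error in this question is about.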
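
A minimal sketch of that difference, run against an inline snippet instead of the live page. The snippet and the variable names are made up for illustration, and 'html.parser' (the parser bundled with Python) is an assumption about which parser you want:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the real page.
html = ('<h2 class="section-heading">'
        '<a href="http://www.nytimes.com/pages/arts/index.html">Arts</a>'
        '</h2>')
soup = BeautifulSoup(html, 'html.parser')

h2 = soup.find('h2', {'class': 'section-heading'})
a = h2.find('a')

# The href lives on the <a>, not on the <h2>.
link_via_get = a.get('href')     # method call: returns the attribute value
link_via_index = a['href']       # item access: same value when the attribute exists
missing_on_h2 = h2.get('href')   # None, because the <h2> itself has no href

# Item access on a missing attribute raises KeyError.
try:
    h2['href']
    index_error = None
except KeyError as exc:
    index_error = type(exc).__name__

# Subscripting the method itself (the bug in the question) raises TypeError.
try:
    a.get['href']
    subscript_error = None
except TypeError as exc:
    subscript_error = type(exc).__name__

print(link_via_get, missing_on_h2, index_error, subscript_error)
```

This also accounts for n.get('href') returning None on every iteration: n is the <h2>, which has no href of its own.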

Note that find might just as well fail there and return None; perhaps you want to iterate over n.find_all('a', href=True) instead:

for n in head_links:
   for a in n.find_all('a', href=True):
       print(a['href'])
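
To see why the nested loop is the safer choice, here is a small sketch on made-up markup in which one heading has no link at all and another has an <a> without an href. The HTML and the variable names are assumptions for illustration only:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first heading carries a usable link.
html = """
<h2 class="section-heading"><a href="http://www.nytimes.com/pages/world/index.html">World</a></h2>
<h2 class="section-heading">No link here</h2>
<h2 class="section-heading"><a name="anchor-only">Anchor without href</a></h2>
"""
soup = BeautifulSoup(html, 'html.parser')
head_links = soup.find_all('h2', {'class': 'section-heading'})

# find() returns None for the second heading, so a['href'] would fail there.
first_anchors = [n.find('a') for n in head_links]

# find_all(..., href=True) only yields <a> tags that actually carry an href.
links = [a['href'] for n in head_links for a in n.find_all('a', href=True)]
print(links)
```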

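Even easier than find_all is the select method, which takes a CSS selector. Here, in a single operation, we get only those <a> elements with an href attribute that sit inside an <h2 class="section-heading">, as easily as with jQuery.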

soup = BeautifulSoup(plain_text, 'html.parser')  # name the parser explicitly
for a in soup.select('h2.section-heading a[href]'):
    print(a['href'])
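
The same selector can be tried without touching the network. The sketch below uses invented markup (the example.com URLs and the other-heading class are placeholders, not from the real page) to show what the selector keeps and what it skips:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first link matches 'h2.section-heading a[href]'.
html = """
<h2 class="section-heading"><a href="http://www.nytimes.com/pages/arts/index.html">Arts</a></h2>
<h2 class="other-heading"><a href="http://example.com/ignored">Wrong class</a></h2>
<p><a href="http://example.com/also-ignored">Not inside an h2</a></p>
<h2 class="section-heading"><a>No href</a></h2>
"""
soup = BeautifulSoup(html, 'html.parser')

# select() returns a list of matching tags, so a comprehension is enough.
selected = [a['href'] for a in soup.select('h2.section-heading a[href]')]
print(selected)
```

Because the [href] part of the selector already filters out anchors without the attribute, a['href'] cannot raise KeyError inside this loop.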

(Also, please use lower-case method names in any new code that you write, i.e. find_all rather than findAll.)

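【Comments】:

Yes, I tried that too. If you look at the commented-out line "links = n.get('href')" in my code, done that way it returns "None" on every loop iteration... 0.o ! And n['href'] gives me the error ' return self.attrs[key] KeyError: 'href' '

@AnumSheraz read my updated answer carefully

Thanks Antti, that helped (y)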
