Removing specific <h2 class> from beautifulsoup4 web crawling results

【Posted】: 2022-01-21 20:44:30 【Question】:

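I am currently trying to scrape the headlines of news articles from https://7news.com.au/news/coronavirus-sa.

After finding that all the headlines sit in <h2> elements, I wrote the following code: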

import requests
from bs4 import BeautifulSoup


url = 'https://7news.com.au/news/coronavirus-sa'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
titles = soup.find('body').find_all('h2')

for i in titles:
    print(i.text.strip())

The output of this code is:

News
Discover
Connect
SA COVID cases surge into triple digit figures for first time
Massive headaches at South Australian testing clinics as COVID cases surge
Revellers forced into isolation after SA teen goes clubbing while infectious with COVID
COVID scare hits Ashes Test in Adelaide after two media members test positive
SA to ease restrictions despite record number of COVID cases
‘We’re going to have cases every day’: SA records biggest COVID spike in 18 MONTHS
Fears for Adelaide nursing homes after COVID infections creep detected
Families in pre-Christmas quarantine after COVID alert for Adelaide school
South Australia records a JUMP in new COVID-19 cases - including infections in children
‘LOCK IT IN’: Mark McGowan to reveal date of WA’s long-awaited reopening to Australia
BOOSTER BOOST-UP: Australia makes change to COVID-19 vaccinations amid Omicron concern
Frydenberg calls for Aussies to ‘keep calm and carry on’ in the face of COVID-19 Omicron strain
News Just In
Our Network
Our Partners
Connect with 7NEWS

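This includes unwanted text such as "News", "Discover", and "News Just In".

This happens because those texts are also in <h2> elements. So I added the following code to remove them from the results: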

soup.find('h2', id='css-1oh2gv-StyledHeading.e1fp214b7').decompose()

This raised an AttributeError:

AttributeError: 'NoneType' object has no attribute 'decompose'

I also tried the clear() method, but it did not give the result I wanted.

Is there another way to remove the unwanted text?

【Answer 1】:

What happens?

Your selection is too general because it selects every <h2> on the page. You do not need .decompose() to solve the problem.
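As for the AttributeError itself: find() returns None when nothing matches, and calling .decompose() on None fails. The sketch below reproduces this on a small hypothetical HTML snippet (not the real page). It assumes that the string from the question is really two class names joined by a dot, not an id; that is a guess based on how the string looks.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the page in the question.
html = """
<body>
  <h2 class="css-1oh2gv-StyledHeading e1fp214b7">News</h2>
  <h2 class="Card-Headline">SA to ease restrictions despite record number of COVID cases</h2>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# No element has this id, so find() returns None;
# calling .decompose() on it would raise AttributeError.
by_id = soup.find("h2", id="css-1oh2gv-StyledHeading.e1fp214b7")
print(by_id)

# Matching on the class (assumed, see above) finds the tag.
# Guard against None before removing it.
by_class = soup.find("h2", class_="css-1oh2gv-StyledHeading")
if by_class is not None:
    by_class.decompose()

remaining = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(remaining)
```

Generated class names like these tend to change between site builds, which is one more reason to select the headlines directly.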

How to fix it?

Select the headlines more specifically:

soup.select('h2.Card-Headline')

Example

import requests
from bs4 import BeautifulSoup


url = 'https://7news.com.au/news/coronavirus-sa'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
for h2 in soup.select('h2.Card-Headline'):
    print(h2.text)

Output

SA COVID cases surge into triple digit figures for first time 
Massive headaches at South Australian testing clinics as COVID cases surge
Revellers forced into isolation after SA teen goes clubbing while infectious with COVID
COVID scare hits Ashes Test in Adelaide after two media members test positive
SA to ease restrictions despite record number of COVID cases
‘We’re going to have cases every day’: SA records biggest COVID spike in 18 MONTHS
Fears for Adelaide nursing homes after COVID infections creep detected
Families in pre-Christmas quarantine after COVID alert for Adelaide school
South Australia records a JUMP in new COVID-19 cases - including infections in children
‘LOCK IT IN’: Mark McGowan to reveal date of WA’s long-awaited reopening to Australia
BOOSTER BOOST-UP: Australia makes change to COVID-19 vaccinations amid Omicron concern
Frydenberg calls for Aussies to ‘keep calm and carry on’ in the face of COVID-19 Omicron strain
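To compare the two selections without hitting the network, here is a minimal sketch on a hypothetical HTML snippet. Only the Card-Headline class comes from the example above; the other class names and the structure are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: navigation/footer headings plus two article cards.
html = """
<body>
  <nav><h2 class="nav-heading">News</h2></nav>
  <main>
    <h2 class="Card-Headline">SA to ease restrictions despite record number of COVID cases</h2>
    <h2 class="Card-Headline">Families in pre-Christmas quarantine after COVID alert for Adelaide school</h2>
  </main>
  <footer><h2 class="footer-heading">Our Network</h2></footer>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# Too general: every <h2>, including navigation and footer headings.
all_h2 = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# Specific: only <h2> elements that carry the Card-Headline class.
headlines = [h2.get_text(strip=True) for h2 in soup.select("h2.Card-Headline")]

print(len(all_h2), len(headlines))
for title in headlines:
    print(title)
```

Using get_text(strip=True) also trims stray whitespace around each headline.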

Just to answer the question as asked

You can also decompose() elements chosen with a more specific selection, but as mentioned, it is not necessary:

for i in titles:
    # .get() avoids a KeyError for an <h2> without a class attribute
    if 'Heading' in ' '.join(i.get('class', [])):
        i.decompose()
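One detail worth noting: decompose() removes tags from the tree, so the headings should be queried again afterwards. A list collected before the removal still holds the destroyed tags. The following is a minimal sketch on hypothetical markup; the class names other than Card-Headline are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the last <h2> deliberately has no class attribute.
html = """
<body>
  <h2 class="css-abc-StyledHeading">News</h2>
  <h2 class="Card-Headline">SA to ease restrictions despite record number of COVID cases</h2>
  <h2>Our Network</h2>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

for h2 in soup.find_all("h2"):
    # A missing class attribute yields an empty list here.
    classes = h2.get("class", [])
    if any("Heading" in name for name in classes):
        h2.decompose()  # remove the tag from the tree

# Query again after the removal to get the surviving headings.
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(titles)
```

The class-less heading survives this filter, which shows why selecting the wanted elements is usually more robust than removing the unwanted ones.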
