How to Extract Multiple H2 Tags Using BeautifulSoup

Posted 2023-03-31

【Asked】: 2021-10-08 13:03:32

【Question】:

import requests
from bs4 import BeautifulSoup
import pandas as pd

articlelist = []
url = 'https://www.angrybirds.com/blog/get-ready-angry-birds-movie-2-premiere-new-game-events/'

r = requests.get(url)
#print(r.status_code)

soup = BeautifulSoup(r.content, features='lxml')
articles = soup.find_all('div', class_ = 'post-body__container')
#print(articles)

for item in articles:
  #h2_headings = item.find('h2').text
  h2_headings = item.find_all('h2')

  article = {
    'H2_Heading': h2_headings,
  }

  print('Added article:', article)
  articlelist.append(article)

df = pd.DataFrame(articlelist)
#df.to_csv('articlelist.csv', index=False)
#print('Saved to csv')

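The web page used in the script has several H2 heading tags that I want to scrape.

I am looking for a simple way to scrape the text of all the H2 headings, like this:

Angry Birds 2, Angry Birds Dream Blast, Angry Birds Friends, Angry Birds Match, Angry Birds Blast, Angry Birds POP

Problem

When I use h2_headings = item.find('h2').text, it returns the text of the first h2 heading, as expected.

However, I need to capture every instance of the H2 tag. When I use h2_headings = item.find_all('h2'), it returns the following result: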

'H2_Heading': [<h2>Angry Birds 2</h2>, <h2>Angry Birds Dream Blast</h2>, <h2>Angry Birds Friends</h2>, <h2>Angry Birds Match</h2>, <h2>Angry Birds Blast</h2>, <h2>Angry Birds POP</h2>]

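Changing the statement to h2_headings = item.find_all('h2').text.strip() raises the following error:

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

Any help would be greatly appreciated.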

【Answer 1】:

You can do it like this:

import requests
from bs4 import BeautifulSoup
import pandas as pd

articlelist = []
url = 'https://www.angrybirds.com/blog/get-ready-angry-birds-movie-2-premiere-new-game-events/'

r = requests.get(url)
#print(r.status_code)

soup = BeautifulSoup(r.content, features='lxml')
articles = soup.find_all('div', class_ = 'post-body__container')


for item in articles:
    h2=', '.join([x.get_text() for x in item.find_all('h2')])
    print(h2)
  

#   print('Added article:', article)
#   articlelist.append(article)

# df = pd.DataFrame(articlelist)
#df.to_csv('articlelist.csv', index=False)
#print('Saved to csv')

Output:

Angry Birds 2, Angry Birds Dream Blast, Angry Birds Friends, Angry Birds Match, Angry Birds Blast, Angry Birds POP
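
To get the joined headings back into the articlelist structure from the question, the same idea can be sketched offline. The inline HTML below is a made-up stand-in for the blog page (an assumption, so the example runs without a network request), and html.parser is used instead of lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page, so no network access is needed.
html = """
<div class="post-body__container">
  <h2>Angry Birds 2</h2>
  <p>Some text</p>
  <h2> Angry Birds Dream Blast </h2>
  <h2>Angry Birds Friends</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
articles = soup.find_all("div", class_="post-body__container")

articlelist = []
for item in articles:
    # find_all() returns a ResultSet (a list of Tag objects), so the text
    # has to be read from each Tag individually.
    headings = [h2.get_text(strip=True) for h2 in item.find_all("h2")]
    articlelist.append({"H2_Heading": ", ".join(headings)})

print(articlelist)
```

The resulting articlelist can then be passed to pd.DataFrame(articlelist) and written out with to_csv, as in the commented-out lines of the question.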

【Comments】:

Perfect, exactly what I was looking for. Thank you.

【Answer 2】:

See this answer: How to remove h2 tag from html data using beautifulsoup4?

Hope this helps.

for item in articles:
  #h2_headings = item.find('h2').text
  h2_headings = item.find_all('h2')

  for h in h2_headings:
    articlelist.append(h.string)
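
One caveat with .string, shown here as a small offline sketch with made-up HTML (not the real page): it returns None when a tag has more than one child, whereas get_text() joins all the nested text.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second heading contains nested markup.
html = "<div><h2>Angry Birds 2</h2><h2>Angry Birds <em>POP</em></h2></div>"
soup = BeautifulSoup(html, "html.parser")

h2_tags = soup.find_all("h2")

# .string is None when a tag has more than one child node.
strings = [h.string for h in h2_tags]

# get_text() concatenates every text node inside the tag.
texts = [h.get_text() for h in h2_tags]

print(strings)
print(texts)
```

If the headings on the target page are plain text, as in the output shown in the question, both approaches give the same strings.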
