BeautifulSoup: Getting Text, Creating a Dictionary

Posted 2021-04-05

I am scraping information on central bank research publications. So far, for the Federal Reserve, I have the following Python code:

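import requests
from bs4 import BeautifulSoup

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')
# Each publication is an <li> with this exact class attribute
for paper in soup.find_all("li", class_="list-group-item downfree"):
    print(paper.text)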

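For the first of the many publications, this prints the following:

2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model Solutions: An Algorithm with Error Formulas by Gary S. Anderson

I now want to convert this into a Python dictionary, which will eventually hold a large number of papers: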

Papers = {
  'Date': '2018-070',
  'Title': 'Reliably Computing Nonlinear Dynamic Stochastic Model Solutions: An Algorithm with Error Formulas',
  'Author/s': 'Gary S. Anderson'
  }

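Answer

I got good results by extracting all the descendants of each list item and keeping only those that are NavigableStrings. Make sure to import NavigableString from bs4. I use a list comprehension here, but a for loop works just as well.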

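import requests
from bs4 import BeautifulSoup, NavigableString

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')

papers = []
for paper in soup.find_all("li", class_="list-group-item downfree"):
    # Keep only the plain text nodes among the descendants, stripped of whitespace
    info = [desc.strip() for desc in paper.descendants if type(desc) == NavigableString]
    # info[2] is skipped; presumably it is the "by" separator before the authors
    papers.append({'Date': info[0], 'Title': info[1], 'Author': info[3]})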

print(papers[1])

{'Date': '2018-069',
 'Title': 'The Effect of Common Ownership on Profits : Evidence From the U.S. Banking Industry',
 'Author': 'Jacob P. Gramlich & Serafin J. Grundl'}
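
To see why the author ends up at index 3, the same filter can be tried offline on a small hand-written snippet. This is only a sketch: the markup in SAMPLE_HTML is an assumption about how a list item on the page might be structured (it is not copied from the site), chosen so that the text nodes come out as date, title, "by", and author.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup, assumed for illustration; the real page may differ.
SAMPLE_HTML = (
    '<ul>'
    '<li class="list-group-item downfree">'
    '<b>2018-070 <a href="#">Reliably Computing Nonlinear Dynamic Stochastic '
    'Model Solutions: An Algorithm with Error Formulas</a></b>'
    '<br>by <i>Gary S. Anderson</i>'
    '</li>'
    '</ul>'
)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
paper = soup.find("li", class_="list-group-item downfree")

# Same filter as in the answer: keep only plain text nodes, stripped of whitespace.
info = [desc.strip() for desc in paper.descendants if type(desc) == NavigableString]
print(info)

# Index 2 holds the "by" separator, so the author is at index 3.
record = {'Date': info[0], 'Title': info[1], 'Author': info[3]}
print(record)
```

If the real markup contains whitespace-only text nodes, the indices shift. Dropping empty strings (for example by adding `and desc.strip()` to the filter) makes the positional indexing less fragile.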

Another answer

You can use regular expressions to match each part of the string.

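[-\d]+ matches a string made up only of digits and hyphens (the date)
(?<=\s).*?(?=by) matches the text that follows a whitespace character and ends just before "by" (where the authors begin), i.e. the title
(?<=by\s).* matches the authors: the entire rest of the string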
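
The three patterns can be checked against the example entry from the question without touching the network. The sketch below assumes the entry text is a single line with the date, title, and authors separated by single spaces; `entry` is just that sample string.

```python
import re

# Sample entry from the question, assumed to be on a single line.
entry = ("2018-070 Reliably Computing Nonlinear Dynamic Stochastic Model "
         "Solutions: An Algorithm with Error Formulas by Gary S. Anderson")

date = re.findall(r"[-\d]+", entry)[0]              # first run of digits and hyphens
title = re.findall(r"(?<=\s).*?(?=by)", entry)[0]   # lazy match, stops before the first "by"
author = re.findall(r"(?<=by\s).*", entry)[0]       # everything after "by "

print(repr(date))
print(repr(title))   # note the trailing space left before "by"
print(repr(author))
```

Because the title pattern stops at the first "by", a title that itself contains those two letters would be cut short. The match also keeps a trailing space, so calling `.strip()` on the result is a reasonable precaution.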

Full code

import requests
from bs4 import BeautifulSoup
import re

START_URL = 'https://ideas.repec.org/s/fip/fedgfe.html'
page = requests.get(START_URL)
soup = BeautifulSoup(page.text, 'html.parser')
datas = []
for paper in soup.find_all("li", class_="list-group-item downfree"):
    data = dict()
    # Date: first run of digits and hyphens
    data["date"] = re.findall(r"[-\d]+", paper.text)[0]
    # Title: text after the first whitespace, up to (not including) "by"
    data["Title"] = re.findall(r"(?<=\s).*?(?=by)", paper.text)[0].strip()
    # Authors: everything after "by" and one whitespace character
    data["Author(s)"] = re.findall(r"(?<=by\s).*", paper.text)[0]
    print(data)
    datas.append(data)
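
One possible refinement, not part of the answer above: the three separate searches can be folded into a single pattern with named groups. The sketch below assumes each entry has the shape `<year>-<number> <title> by <authors>`; `ENTRY_RE` and `parse_entry` are hypothetical names, and the second sample string is invented to show a title that contains the word "by".

```python
import re

# Assumed entry shape: "<year>-<number> <title> by <authors>".
# The greedy title group makes the LAST " by " the title/author boundary.
ENTRY_RE = re.compile(r"(?P<date>\d{4}-\d+)\s+(?P<title>.+)\s+by\s+(?P<authors>.+)")

def parse_entry(text):
    """Return a dict for one entry, or None if the text does not fit the shape."""
    normalized = " ".join(text.split())  # collapse newlines and repeated spaces
    match = ENTRY_RE.fullmatch(normalized)
    if match is None:
        return None
    return {
        "date": match["date"],
        "Title": match["title"],
        "Author(s)": match["authors"],
    }

# Second entry shown in the first answer's output.
print(parse_entry("2018-069 The Effect of Common Ownership on Profits : "
                  "Evidence From the U.S. Banking Industry "
                  "by Jacob P. Gramlich & Serafin J. Grundl"))
# Invented entry whose title contains "by".
print(parse_entry("2020-001 Policy by Committee by Jane Doe"))
```

Inside the scraping loop this would be called as `parse_entry(paper.text)`. Returning None for text that does not fit lets the caller skip unexpected entries instead of raising an IndexError.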
