Scraping a webpage that has JavaScript with BeautifulSoup

Posted 2023-02-23


[Title]: Scraping a webpage that has JavaScript with BeautifulSoup [Posted]: 2017-08-24 12:44:37 [Question]:

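Hi everyone! I'm back with another question. I can scrape simple websites by their tags, but I recently ran into a much more complex site that relies on JavaScript. I want to get all the estimates at the bottom of the page as a table (CSV), with fields such as "User", "Revenue Estimate", and "EPS Estimate".

I hoped to work it out on my own, but I've had no luck.

Here is my code: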

from urllib import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://www.estimize.com/jpm/fq3-2016?sort=rank&direction=asc&estimates_per_page=142&show_confirm=false")
soup = BeautifulSoup(html.read(), "html.parser")
print(soup.findAll('script')[11].string.encode('utf8'))

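The output comes out in a strange format, and I don't know how to extract the data into a proper form. Any help would be much appreciated!

[Comments]:

Use selenium to scrape pages that are rendered with JavaScript.

[Answer 1]:

This is how I solved it, using some of the hints above: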

# Python 2 code: on Python 3, urlopen lives in urllib.request
from bs4 import BeautifulSoup
from urllib import urlopen
import json
import csv

# Append to the CSV file and write the header row
f = csv.writer(open("estimize.csv", "a"))
f.writerow(["User Name", "Revenue Estimate", "EPS Estimate"])

url = "https://www.estimize.com/jpm/fq3-2016?sort=rank&direction=asc&estimates_per_page=142&show_confirm=false"
html = urlopen(url)
# Serialize the parsed page back to UTF-8 text (a byte string)
soup = BeautifulSoup(html.read(), "html.parser").encode('utf8')
# Keep only the JSON array between "allEstimateRows": and the next key
data_string = soup.split("\"allEstimateRows\":")[1]
data_string = data_string.split(",\"tableSortDirection")[0]
data = json.loads(data_string)

# One CSV row per estimate
for item in data:
    f.writerow([item["userName"], item["revenue"], item["eps"]])
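
On Python 3 the snippet above needs two changes: `urlopen` lives in `urllib.request`, and the page source has to stay a `str`, because calling `.encode('utf8')` yields `bytes`, which cannot be split on a `str` separator. Below is a minimal sketch of the same split-based extraction, run against a small made-up page string rather than the live site. The helper name `extract_rows` and the sample data are assumptions for illustration only.

```python
import json

def extract_rows(page_source):
    """Cut the JSON array that follows "allEstimateRows": out of the page text."""
    # Everything after the key, up to the next known key, is the JSON array.
    data_string = page_source.split('"allEstimateRows":')[1]
    data_string = data_string.split(',"tableSortDirection')[0]
    return json.loads(data_string)

# Hypothetical sample standing in for the decoded page source, e.g.
# urllib.request.urlopen(url).read().decode("utf8") on the real page.
sample = ('<script>Model({"allEstimateRows":'
          '[{"userName":"alice","revenue":23972,"eps":1.4},'
          '{"userName":"bob","revenue":24100,"eps":1.38}]'
          ',"tableSortDirection":"asc"});</script>')

rows = extract_rows(sample)
for row in rows:
    print(row["userName"], row["revenue"], row["eps"])
```

The split markers only work while the page keeps those two keys adjacent in that order, so this approach is fragile if the site's markup changes.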

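[Comments]:

Hi, is this answer still valid? I'm trying to write code with a similar structure, but I get an error at data_string = soup.split(insert my split condition here)[1], because the soup variable is a bytes object and Python expects split to receive a bytes-like object rather than a str. I used str() to convert the bytes object to a str so I could parse it, and everything works now, though that may not be the right way to do it?

[Answer 2]: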

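It looks like the data you're trying to extract is in the data model, which means it is in JSON. You can get at it with a little parsing, like this: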

import json
import re

# `soup` is the BeautifulSoup object built in the question's code.
# Keep the script text as a string so the string operations below work.
data_string = soup.findAll('script')[11].string
data_string = data_string.split("DataModel.parse(")[1]
data_string = data_string.split(");")[0]

# strip out stray HTML tags embedded in the script text
tag = re.search(r'<[^>]*>', data_string)
while tag:
    data_string = data_string.replace(tag.group(0), '')
    tag = re.search(r'<[^>]*>', data_string)

# drop the other function arguments, leaving you with the JSON
data_you_want = json.loads(data_string.split(re.search(r'\}[^",\}\]]+,', data_string).group(0))[0] + '}')

print(data_you_want["estimate"])
>>> {'shares': {'shares_hash': {'twitter': None, 'stocktwits': None, 'linkedin': None}}, 'lastRevised': None, 'id': None, 'revenue_points': None, 'sector': 'financials', 'persisted': False, 'points': None, 'instrumentSlug': 'jpm', 'wallstreetRevenue': 23972, 'revenue': 23972, 'createdAt': None, 'username': None, 'isBlind': False, 'releaseSlug': 'fq3-2016', 'statement': '', 'errorRanges': {'revenue': {'low': 21247.3532016398, 'high': 26820.423240734}, 'eps': {'low': 1.02460526459765, 'high': 1.81359679579922}}, 'eps_points': None, 'rank': None, 'instrumentId': 981, 'eps': 1.4, 'season': '2016-fall', 'releaseId': 52773}

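DataModel.parse is a JavaScript method call, which means it ends with a closing parenthesis and a semicolon. The function's argument is the JSON object you want. Once you load it with json.loads, you can access it like a dictionary.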
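
Because the argument of `DataModel.parse(...)` is a JSON object, one alternative to trimming with regular expressions is `json.JSONDecoder.raw_decode`, which parses a single JSON value from the start of a string and reports where it ended, so any trailing arguments are simply left alone. This is a sketch of that swapped-in technique, not the approach used in the answer. The script string is made up, and `first_json_argument` is a hypothetical helper.

```python
import json

def first_json_argument(script_text, call="DataModel.parse("):
    """Return the first JSON argument of `call` found in script_text."""
    start = script_text.index(call) + len(call)
    # raw_decode does not skip leading whitespace, so strip it first.
    remainder = script_text[start:].lstrip()
    # raw_decode parses one JSON value and returns (value, end_index);
    # whatever follows the value (other arguments, ");") is ignored.
    value, _end = json.JSONDecoder().raw_decode(remainder)
    return value

# Hypothetical script content for illustration only.
script = 'DataModel.parse({"estimate": {"eps": 1.4, "revenue": 23972}}, "other");'
model = first_json_argument(script)
print(model["estimate"]["eps"])
```

This only helps when the argument is strict JSON. A JavaScript object literal with unquoted keys or embedded markup would still need cleaning first.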

From there, remap the data into whatever form you want it to take in the CSV.
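
The remapping step could look roughly like this, assuming each estimate is a dict carrying the `username`, `revenue`, and `eps` keys that appear in the printed output. The sample rows and the in-memory buffer are illustrative stand-ins, not real data from the site.

```python
import csv
import io

# Hypothetical estimates, shaped like the parsed JSON objects.
estimates = [
    {"username": "alice", "revenue": 23972, "eps": 1.4, "rank": 1},
    {"username": "bob", "revenue": 24100, "eps": 1.38, "rank": 2},
]

columns = ["username", "revenue", "eps"]
# In-memory buffer for the demo; swap for
# open("estimize.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
# extrasaction="ignore" drops keys that are not listed in `columns`.
writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
writer.writeheader()
writer.writerows(estimates)

csv_text = buffer.getvalue()
print(csv_text)
```

`csv.DictWriter` keeps the column order explicit and tolerates extra keys, which is convenient when the JSON objects carry many fields you do not need.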

[Comments]:
