Extracting a var from a <script> tag in HTML
【Posted】: 2022-01-06 22:52:04
【Question】:
I am trying to scrape product reviews from a page, but I am not sure how to extract a var that is defined inside a <script> tag.
Here is my Python code:
import csv
import requests
from bs4 import BeautifulSoup
# newline="" stops the csv module from writing blank rows on Windows
a_file = open("ProductReviews.csv", "a", newline="")
writer = csv.writer(a_file)
# Write the titles of the columns to the CSV file
writer.writerow(["created_at", "reviewer_name", "rating", "content", "source"])
url = 'https://www.lazada.com.my/products/iron-gym-total-upper-body-workout-bar-i467342383.html'
# Connect to the URL
response = requests.get(url)
# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.content, "html.parser")
# Hard-coded position of the <script> tag expected to hold the data
data = soup.find_all('script')[123]
# data.string is None when the tag has no single text child, so guard against it
if data.string and 'var __moduleData__' in data.string:
    print("Yes")
a_file.close()
Here is the page source (I removed the unnecessary code):
<html>
<head>
<title></title>
</head>
<body>
<script>
var __moduleData__ = {
  "data": {
    "root": {
      "fields": {
        "review": {
          "reviews": [
            {
              "rating": 5,
              "reviewContent": "tq barang dah sampai",
              "reviewTime": "24 May 2021",
              "reviewer": "Jaharinbaharin"
            },
            {
              "rating": 5,
              "reviewContent": "Beautiful quality????????????",
              "reviewTime": "08 Sep 2021",
              "reviewer": "M***."
            },
            {
              "rating": 5,
              "reviewContent": "the box was badly dented but the item was intact...just that my door frame is shallow and slippery....I can't pull up without worrying of falling down",
              "reviewTime": "25 Aug 2021",
              "reviewer": "David S."
            },
            {
              "rating": 5,
              "reviewContent": "Haven’t really opened it yet but please put some effort on the packaging for future improvement thanks it was really fast",
              "reviewTime": "14 Dec 2020",
              "reviewer": "Yasir A."
            },
            {
              "rating": 5,
              "reviewContent": "Seems to be ok, good quality.. No weight restriction mentioned on the box.. I'm about 90kg, it could handle my weight so far..",
              "reviewTime": "22 May 2020",
              "reviewer": "Kevin"
            }
          ]
        }
      }
    }
  }
};
</script>
</body>
</html>
I only want the review data, so I would like to know how to extract the value of var __moduleData__.
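【Comments】:
Which variable do you want?
@anarchy Oh, my bad, I want __moduleData__. I will edit the question.
Hmm, I don't see the same part that you do.
@anarchy Do you mean in the website's source code? Or did you try my Python code?
I tried your Python code to get your soup object, but I don't see the same thing.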
【Answer 1】:
You can use a regular expression to select your variable and parse the captured text as JSON:
json.loads(re.search(r'var __moduleData__ = (.*)', response.text).group(1))
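A note on the one-liner above: `.` does not match newlines, so `(.*)` captures only the rest of the line that contains the assignment. If the object literal spans several lines, or the line ends with a trailing `;`, `json.loads` would reject the captured text. Whether that applies to the live page is not something this sketch can confirm. As a hedged alternative, `json.JSONDecoder().raw_decode` parses a single JSON value starting at a given index and ignores whatever follows. The sample HTML and the helper name `extract_js_object` below are hypothetical, and the approach assumes the assigned value is strict JSON rather than an arbitrary JavaScript literal.

```python
import json
import re

# Hypothetical stand-in for response.text; the real page is much larger.
SAMPLE_HTML = """
<script>
var __moduleData__ = {
  "data": {"root": {"fields": {"review": {"reviews": [
    {"rating": 5, "reviewContent": "tq barang dah sampai",
     "reviewTime": "24 May 2021", "reviewer": "Jaharinbaharin"}
  ]}}}}
};
var somethingElse = 1;
</script>
"""


def extract_js_object(html, var_name):
    """Return the JSON value assigned to `var <var_name> = ...` in html."""
    # Find the assignment and consume the whitespace around "=".
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"{var_name} not found")
    # raw_decode parses one JSON value starting at match.end() and
    # ignores the text after it (here: ";" and the rest of the script).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


module_data = extract_js_object(SAMPLE_HTML, "__moduleData__")
reviews = module_data["data"]["root"]["fields"]["review"]["reviews"]
print(reviews[0]["reviewer"])
```

With a live response, the same helper would be called as `extract_js_object(response.text, "__moduleData__")`.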
Example
import json
import re
import requests
url = 'https://www.lazada.com.my/products/iron-gym-total-upper-body-workout-bar-i467342383.html'
response = requests.get(url)
# Capture the rest of the line after "var __moduleData__ = " and parse it as JSON
d = json.loads(re.search(r'var __moduleData__ = (.*)', response.text).group(1))
# Inspect one branch of the parsed data, here the seller details
d['data']['root']['fields']['seller']
Output
{'chatResponsiveRate': {'labelText': 'Chat Response', 'value': '100%'},
 'chatUrl': 'https://pages.lazada.com.my/wow/i/my/im/chat?brandId=21411',
 'hideAllMetrics': False,
 'imEnable': True,
 'imUserId': '100285367',
 'name': 'MR SIX PACK',
 'newSeller': False,
 'percentRate': '96%',
 'positiveSellerRating': {'labelText': 'Seller Ratings', 'value': '96%'},
 'rate': 0.96,
 'rateLevel': 3,
 'sellerId': '1000052649',
 'shipOnTime': {'labelText': 'Ship On Time', 'value': '97%'},
 'shopId': 255007,
 'size': 5,
 'time': 2,
 'type': '4',
 'unit': 'years',
 'url': '//www.lazada.com.my/shop/mr-six-pack/?itemId=467342383&channelSource=pdp'}
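The question asks for the reviews rather than the seller details. Once the variable has been parsed into a dict, the reviews sit under `data → root → fields → review → reviews` in the page source shown in the question. The following is a minimal sketch of turning them into rows for the CSV columns used in the question. The mapping of fields to columns (`reviewTime` to `created_at`, and so on), the helper name `reviews_to_rows`, and the `"lazada"` source label are assumptions made for illustration; the sample dict is a hand-written stand-in for the parsed data, and an in-memory buffer replaces the real file so the sketch runs offline.

```python
import csv
import io

# Hypothetical parsed data, shaped like the snippet in the question.
module_data = {"data": {"root": {"fields": {"review": {"reviews": [
    {"rating": 5, "reviewContent": "tq barang dah sampai",
     "reviewTime": "24 May 2021", "reviewer": "Jaharinbaharin"},
    {"rating": 5, "reviewContent": "Seems to be ok, good quality..",
     "reviewTime": "22 May 2020", "reviewer": "Kevin"},
]}}}}}


def reviews_to_rows(parsed, source):
    """Map each review dict to a row in the question's column order."""
    reviews = parsed["data"]["root"]["fields"]["review"]["reviews"]
    return [
        [
            review.get("reviewTime"),     # created_at (assumed mapping)
            review.get("reviewer"),       # reviewer_name
            review.get("rating"),         # rating
            review.get("reviewContent"),  # content
            source,                       # source (hypothetical label)
        ]
        for review in reviews
    ]


# Write to an in-memory buffer; swap in open("ProductReviews.csv", "a",
# newline="") to append to the file from the question instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["created_at", "reviewer_name", "rating", "content", "source"])
rows = reviews_to_rows(module_data, "lazada")
writer.writerows(rows)
print(buffer.getvalue())
```

Using `.get` keeps a row from failing when a review lacks one of the fields; the missing cell is simply written as empty.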