How to extract JSON from a script tag using Beautiful Soup?

【Posted】: 2020-07-27 18:17:47

【Question】:

I want to extract reviewCount from a script tag using Beautiful Soup. I have tried several approaches without success.

<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>

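【Comments】:

"I have tried several approaches without success." Could you share those attempts? From the tag you posted, it looks like you only need to get the tag's contents and parse the result. If you are struggling to extract the content from the element, this is a duplicate of "Extract content within a tag with BeautifulSoup". If the problem is parsing the JSON, this is a duplicate of "How to parse JSON in Python?".

【Answer 1】: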

This should work, though I am sure there are more elegant ways to do it:

import json
from bs4 import BeautifulSoup

html = '''
<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>
'''

soup = BeautifulSoup(html, 'html.parser')
# Grab the first <script> tag and parse its text content as JSON
res = soup.find('script')
json_object = json.loads(res.contents[0])

# Print each language's display name with its review count
for language in json_object['languages']:
    print('{}: {}'.format(language['displayName'], language['reviewCount']))

Output:

Toutes les langues: 573
français: 567
English: 6
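
The counts come back as strings, because the JSON stores them quoted. A minimal sketch of converting them to integers and indexing them by language code, assuming the script contents have already been extracted as in the answer above. The names `script_text` and `counts_by_language` are illustrative and do not appear in the original post:

```python
import json

# Contents of the <script> tag, as extracted with Beautiful Soup in the answer above.
script_text = (
    '{"languages":['
    '{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},'
    '{"isoCode":"fr","displayName":"français","reviewCount":"567"},'
    '{"isoCode":"en","displayName":"English","reviewCount":"6"}],'
    '"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}'
)

data = json.loads(script_text)

# Map each isoCode to its review count as an int (the JSON stores them as strings).
counts_by_language = {
    lang["isoCode"]: int(lang["reviewCount"]) for lang in data["languages"]
}

print(counts_by_language["all"])
```

Keying by `isoCode` rather than `displayName` is a design choice: the display names are localized, while the codes are more likely to stay stable.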

【Comments】:

Thanks James. I tried the approach you described above. My main problem is getting the reviewCount number. TypeError: object of type 'Response' has no len()

【Answer 2】:

Import json, load the script contents with json.loads, then iterate over the languages to get every reviewCount:

import json
from bs4 import BeautifulSoup

html = '''<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>'''

soup = BeautifulSoup(html, "html.parser")
# Select the script tag by its data-initial-state attribute and read its raw contents
script_text = soup.select_one('script[data-initial-state="review-filter"]').string
jsondata = json.loads(script_text)
for language in jsondata['languages']:
    print(language['reviewCount'])

Output:

573
567
6
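
The `TypeError: object of type 'Response' has no len()` mentioned in the comments usually means that the response object itself was passed to `BeautifulSoup` instead of its text. A hedged sketch, assuming the page is fetched with `requests` (the original post does not show the fetching code, and `extract_review_counts` is a hypothetical helper name):

```python
import json
from bs4 import BeautifulSoup


def extract_review_counts(html):
    """Return the reviewCount values from the review-filter script tag in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one('script[data-initial-state="review-filter"]')
    if tag is None or not tag.string:
        return []  # tag missing or empty, e.g. the page layout changed
    data = json.loads(tag.string)
    return [lang["reviewCount"] for lang in data["languages"]]


# With requests (an assumption, not shown in the question), pass the decoded body:
#     response = requests.get(url)
#     counts = extract_review_counts(response.text)  # not the Response object itself

sample = '''<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"}]}
</script>'''
print(extract_review_counts(sample))
```

Returning an empty list when the tag is absent is one possible choice; raising an exception may suit a scraper better if a missing tag should stop the run.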

【Answer 3】:
import re

html = '''<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>'''

# Capture the quoted value that follows every "reviewCount" key
match = [item.group(1) for item in re.finditer(r'reviewCount":"(.+?)"', html)]

print(match)

Output:

['573', '567', '6']
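
The pattern above scans the whole page, so it would also pick up a `reviewCount` that occurs anywhere else in the HTML. One possible refinement, sketched here with only the standard library, is to capture the body of the `review-filter` script tag first and then parse it as JSON. The `SCRIPT_RE` pattern is an assumption about how the tag is written (attribute quoting as in the question) and is not a general HTML parser:

```python
import json
import re

html = '''<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>'''

# Match only the script tag carrying data-initial-state="review-filter"
# and capture everything between its opening and closing tags.
SCRIPT_RE = re.compile(
    r'<script[^>]*data-initial-state="review-filter"[^>]*>(.*?)</script>',
    re.DOTALL,
)

found = SCRIPT_RE.search(html)
counts = []
if found:
    data = json.loads(found.group(1))
    counts = [int(lang["reviewCount"]) for lang in data["languages"]]

print(counts)
```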
