How to extract JSON from a script tag using Beautiful Soup?
Question (posted 2020-07-27 18:17:47): I want to extract reviewCount from a script tag using Beautiful Soup. I tried different approaches, but none of them worked.
<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>
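Comments:
"I tried different approaches, but none of them worked." Could you share those attempts? From the tag you shared, it looks like you only need to get the tag's contents and parse the result. If you are struggling to extract the content from the element, this is a duplicate of "Extract content within a tag with BeautifulSoup". If the problem is parsing the JSON, this is a duplicate of "How to parse JSON in Python?".
Answer 1: This should work, though I'm sure there are more elegant ways to do it: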
import json
from bs4 import BeautifulSoup

html = '''
<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>
'''

soup = BeautifulSoup(html, 'html.parser')
# Take the first <script> tag in the document
res = soup.find('script')
# The tag's only child is the JSON text; parse it into a dict
json_object = json.loads(res.contents[0])
for language in json_object['languages']:
    print('{}: {}'.format(language['displayName'], language['reviewCount']))
Output:
Toutes les langues: 573
français: 567
English: 6
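On a real page there is usually more than one script tag, so soup.find('script') may return a different one than intended. Here is a minimal sketch of narrowing the search, assuming the data-initial-state="review-filter" attribute from the question identifies the right tag. The extra script tag in the sample and the helper name load_review_filter are made up for illustration.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page: an unrelated script comes before the one we want
html = '''
<script>window.dataLayer = [];</script>
<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"}]}
</script>
'''

def load_review_filter(markup):
    """Return the parsed JSON of the review-filter script tag, or None if absent."""
    soup = BeautifulSoup(markup, 'html.parser')
    # Match on the attribute instead of taking the first <script> on the page
    tag = soup.find('script', attrs={'data-initial-state': 'review-filter'})
    if tag is None or tag.string is None:
        return None
    return json.loads(tag.string)

data = load_review_filter(html)
first = data['languages'][0]
print(first['displayName'], first['reviewCount'])
```

Returning None for a missing tag lets the caller decide what to do, rather than failing with an AttributeError on a page that has a different layout.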
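Comments:
Thanks, James. I tried the approach you described above. My main problem is getting the reviewCount number. I get: TypeError: object of type 'Response' has no len()
Answer 2: Import json, load the data with json.loads, then iterate over the result to get every reviewCount.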
import json
from bs4 import BeautifulSoup

html = '''<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>'''
soup = BeautifulSoup(html, "html.parser")
# Select the script tag by its data-initial-state attribute and take its text
item = soup.select_one('script[data-initial-state="review-filter"]').text
jsondata = json.loads(item)
for language in jsondata['languages']:
    print(language['reviewCount'])
Output:
573
567
6
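The TypeError quoted in the comment on Answer 1 usually means a requests Response object was handed to BeautifulSoup directly; the parser expects the markup itself (response.text or response.content). Below is a rough sketch of a helper that makes this explicit. The function name review_counts and the URL in the comment are placeholders, not something taken from the question.

```python
import json
from bs4 import BeautifulSoup

def review_counts(page_html):
    """Return the reviewCount strings from a page's review-filter script tag."""
    # BeautifulSoup needs markup (str or bytes), not a Response object
    if not isinstance(page_html, (str, bytes)):
        raise TypeError('pass response.text or response.content, not the response itself')
    soup = BeautifulSoup(page_html, 'html.parser')
    tag = soup.select_one('script[data-initial-state="review-filter"]')
    if tag is None or tag.string is None:
        return []
    return [entry['reviewCount'] for entry in json.loads(tag.string)['languages']]

# With requests, the call would look roughly like this (placeholder URL):
#   response = requests.get('https://example.com/reviews')
#   counts = review_counts(response.text)
sample = '''<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}]}
</script>'''
print(review_counts(sample))  # ['573', '567', '6']
```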
Answer 3:
import re

html = '''<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>'''
# Non-greedy capture of each quoted value that follows "reviewCount":
match = [item.group(1) for item in re.finditer(r'reviewCount":"(.+?)"', html)]
print(match)
Output:
['573', '567', '6']
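The regex above returns bare strings and drops which language each count belongs to. If that association matters, one option is to use the regex only to cut out the script body and let json.loads do the rest. This is a standard-library-only sketch; the pattern assumes the data-initial-state="review-filter" attribute from the question and that the JSON contains no literal </script>.

```python
import json
import re

html = '''<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>'''

# Capture everything between the review-filter script tag and its closing tag
pattern = re.compile(
    r'<script[^>]*data-initial-state="review-filter"[^>]*>(.*?)</script>',
    re.DOTALL,
)
found = pattern.search(html)
if found:
    data = json.loads(found.group(1))
    # Map each language code to its review count as an integer
    counts = {lang['isoCode']: int(lang['reviewCount']) for lang in data['languages']}
    print(counts)  # {'all': 573, 'fr': 567, 'en': 6}
```

Regular expressions are brittle against HTML in general, so this is best kept for quick one-off scripts where pulling in a parser is not worth it.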