How to extract JSON from a BeautifulSoup object?
Posted: 2021-06-18 17:07:11
Question: I have downloaded the HTML of a web page using python-requests. I now need to extract a JSON object from this content. I located the JSON object with some BS4 methods, but I don't know how to extract it from the BS4 object. Here is my code:
from bs4 import BeautifulSoup
import requests
import json
url = "https://matmatch.com/materials?materialPath=mitf1194-astm-b196-grade-c17200-tb00"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content,features="html.parser")
body = soup.find('body')
the_contents_of_body_without_body_tags = body.findChildren(recursive=False)
#print(the_contents_of_body_without_body_tags)
element = soup.find_all("script",type="application/ld+json")
print(element[2])
#print(type(soup.find_all("script", {"type":"application/ld+json"})[2]))
js = json.loads(element[2])
Here is the output of this code:
<script type="application/ld+json">
{
"@context": ["https://schema.org", {"csvw": "http://www.w3.org/ns/csvw#"}],
"@type": "Dataset",
"name":"ASTM B196 Grade C17200 TB00",
"description": "Chemical composition and material properties of ASTM B196 Grade C17200 TB00. Also available for download in XLSX and PDF. Data provided by MakeItFrom.com,Matmatch,Materion Brush GmbH",
"license": "https://matmatch.com/imprint",
"publisher": {
"@type": "Organization",
"name": "Matmatch"
},
"mainEntity" : {
"@type" : "csvw:Table",
"csvw:tableSchema": {
"csvw:columns": [
{
"csvw:name": "Property Name",
"csvw:datatype": "string",
"csvw:cells": [{"csvw:value":"Density","csvw:primaryKey":"Density"},{"csvw:value":"Outside diameter","csvw:primaryKey":"Outside diameter"},{"csvw:value":"Thickness","csvw:primaryKey":"Thickness"},{"csvw:value":"Width","csvw:primaryKey":"Width"},{"csvw:value":"Bendability 90°, bw","csvw:primaryKey":"Bendability 90°, bw"},{"csvw:value":"Bendability 90°, gw","csvw:primaryKey":"Bendability 90°, gw"},{"csvw:value":"Elastic modulus","csvw:primaryKey":"Elastic modulus"},{"csvw:value":"Elongation","csvw:primaryKey":"Elongation"},{"csvw:value":"Hardness, Rockwell C","csvw:primaryKey":"Hardness, Rockwell C"},{"csvw:value":"Hardness, Vickers","csvw:primaryKey":"Hardness, Vickers"},{"csvw:value":"Shear modulus","csvw:primaryKey":"Shear modulus"},{"csvw:value":"Tensile strength","csvw:primaryKey":"Tensile strength"},{"csvw:value":"Yield strength","csvw:primaryKey":"Yield strength"},{"csvw:value":"Yield strength Rp0.2","csvw:primaryKey":"Yield strength Rp0.2"},{"csvw:value":"Coefficient of thermal expansion","csvw:primaryKey":"Coefficient of thermal expansion"},{"csvw:value":"Melting point","csvw:primaryKey":"Melting point"},{"csvw:value":"Specific heat capacity","csvw:primaryKey":"Specific heat capacity"},{"csvw:value":"Thermal conductivity","csvw:primaryKey":"Thermal conductivity"},{"csvw:value":"Electrical resistivity","csvw:primaryKey":"Electrical resistivity"},{"csvw:value":"Specific Electrical conductivity","csvw:primaryKey":"Specific Electrical conductivity"},{"csvw:value":"Relative magnetic permeability","csvw:primaryKey":"Relative magnetic permeability"}]
},
{
"csvw:name": "Value",
"csvw:datatype": "string",
"csvw:cells": [{"csvw:value":8.26,"csvw:primaryKey":"Density"},{"csvw:value":19.1,"csvw:primaryKey":"Outside diameter"},{"csvw:value":0.05,"csvw:primaryKey":"Thickness"},{"csvw:value":1.27,"csvw:primaryKey":"Width"},{"csvw:value":0,"csvw:primaryKey":"Bendability 90°, bw"},{"csvw:value":0,"csvw:primaryKey":"Bendability 90°, gw"},{"csvw:value":130,"csvw:primaryKey":"Elastic modulus"},{"csvw:value":1,"csvw:primaryKey":"Elongation"},{"csvw:value":36,"csvw:primaryKey":"Hardness, Rockwell C"},{"csvw:value":210,"csvw:primaryKey":"Hardness, Vickers"},{"csvw:value":50,"csvw:primaryKey":"Shear modulus"},{"csvw:value":410,"csvw:primaryKey":"Tensile strength"},{"csvw:value":220,"csvw:primaryKey":"Yield strength"},{"csvw:value":130,"csvw:primaryKey":"Yield strength Rp0.2"},{"csvw:value":0.0000175,"csvw:primaryKey":"Coefficient of thermal expansion"},{"csvw:value":870,"csvw:primaryKey":"Melting point"},{"csvw:value":360,"csvw:primaryKey":"Specific heat capacity"},{"csvw:value":84,"csvw:primaryKey":"Thermal conductivity"},{"csvw:value":6.2e-8,"csvw:primaryKey":"Electrical resistivity"},{"csvw:value":17,"csvw:primaryKey":"Specific Electrical conductivity"},{"csvw:value":1.0006,"csvw:primaryKey":"Relative magnetic permeability"}]
},
{
"csvw:name": "Unit",
"csvw:datatype": "string",
"csvw:cells": [{"csvw:value":"g/cm³","csvw:primaryKey":"Density"},{"csvw:value":"mm","csvw:primaryKey":"Outside diameter"},{"csvw:value":"mm","csvw:primaryKey":"Thickness"},{"csvw:value":"mm","csvw:primaryKey":"Width"},{"csvw:value":"[-]","csvw:primaryKey":"Bendability 90°, bw"},{"csvw:value":"[-]","csvw:primaryKey":"Bendability 90°, gw"},{"csvw:value":"GPa","csvw:primaryKey":"Elastic modulus"},{"csvw:value":"%","csvw:primaryKey":"Elongation"},{"csvw:value":"[-]","csvw:primaryKey":"Hardness, Rockwell C"},{"csvw:value":"[-]","csvw:primaryKey":"Hardness, Vickers"},{"csvw:value":"GPa","csvw:primaryKey":"Shear modulus"},{"csvw:value":"MPa","csvw:primaryKey":"Tensile strength"},{"csvw:value":"MPa","csvw:primaryKey":"Yield strength"},{"csvw:value":"MPa","csvw:primaryKey":"Yield strength Rp0.2"},{"csvw:value":"1/K","csvw:primaryKey":"Coefficient of thermal expansion"},{"csvw:value":"°C","csvw:primaryKey":"Melting point"},{"csvw:value":"J/(kg·K)","csvw:primaryKey":"Specific heat capacity"},{"csvw:value":"W/(m·K)","csvw:primaryKey":"Thermal conductivity"},{"csvw:value":"Ω·m","csvw:primaryKey":"Electrical resistivity"},{"csvw:value":" % IACS","csvw:primaryKey":"Specific Electrical conductivity"},{"csvw:value":"[-]","csvw:primaryKey":"Relative magnetic permeability"}]
}
]
}
}
}
</script>
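The last line of the code raises this error:
TypeError: the JSON object must be str, bytes or bytearray, not 'Tag'
I have tried using .text and .content on the BS4 object, but this also results in an error.
How do I extract the JSON object from this output?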
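Answer 1: Use the .string attribute (it is an attribute, not a method). json.loads() needs a str, while element[2] is a Tag; .string returns the text inside the <script> tag as a NavigableString, which is a str subclass. As the BeautifulSoup documentation puts it:
If a tag has only one child, and that child is a NavigableString, the child is made available as .string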
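To illustrate the quoted rule without hitting the network, here is a minimal sketch. The HTML snippet and the names `single` and `mixed` are invented for this example and are not taken from the page in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for illustration only.
html = '<p id="one">hello</p><p id="two">hello <b>world</b></p>'
soup = BeautifulSoup(html, "html.parser")

single = soup.find("p", id="one")  # exactly one child, a NavigableString
mixed = soup.find("p", id="two")   # two children: a string and a <b> tag

print(type(single).__name__)         # the element itself is a Tag
print(type(single.string).__name__)  # its only child is a NavigableString
print(isinstance(single.string, str))  # NavigableString subclasses str
print(mixed.string)                  # None, since there is more than one child
```

A <script> tag holding JSON-LD normally has a single text child, which is why .string works in the question's case.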
In your example:
from bs4 import BeautifulSoup
import requests
import json
url = "https://matmatch.com/materials?materialPath=mitf1194-astm-b196-grade-c17200-tb00"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content,features="html.parser")
body = soup.find('body')
the_contents_of_body_without_body_tags = body.findChildren(recursive=False)
element = soup.find_all("script",type="application/ld+json")
js = json.loads(element[2].string) # <- Calling `.string` to get the JSON
print(js)
Sample output (truncated):
{'@context': ['https://schema.org', {'csvw': 'http://www.w3.org/ns/csvw#'}], '@type': 'Dataset', 'name': 'ASTM B196 Grade C17200 TB00', ...., {'csvw:value': '[-]', 'csvw:primaryKey': 'Relative magnetic permeability'}]}]}}}
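Indexing with element[2] depends on the order of the <script> tags on the page, which may change. As a rough sketch of a more defensive variant, the helper below parses every JSON-LD script and lets you pick the one you want by its "@type". The function name `extract_json_ld` and the `sample` HTML are assumptions made up for this example; they are not part of BeautifulSoup or of the page in the question.

```python
import json
from bs4 import BeautifulSoup

def extract_json_ld(html):
    """Return a list of the JSON-LD payloads that parse successfully."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for tag in soup.find_all("script", type="application/ld+json"):
        raw = tag.string  # None when the tag is empty or has several children
        if not raw:
            continue
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip malformed payloads instead of aborting
    return results

# Hypothetical page content, invented for illustration only.
sample = """
<html><body>
<script type="application/ld+json">{"@type": "Dataset", "name": "demo"}</script>
<script type="application/ld+json"></script>
<script type="application/ld+json">{not valid json}</script>
</body></html>
"""

payloads = extract_json_ld(sample)
# Select by "@type" rather than by position in the page.
datasets = [p for p in payloads if isinstance(p, dict) and p.get("@type") == "Dataset"]
print(datasets)
```

With the real page you would pass requests.get(url).text instead of `sample`; whether the payload you need is tagged "Dataset" is something to confirm against the actual response.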