How to extract JSON from a BeautifulSoup object?

Posted: 2021-06-18 17:07:11

Question:

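I have downloaded a web page's HTML with python-requests, and I now need to extract a JSON object from that content. I located the JSON object using a few BS4 methods, but I can't work out how to extract it from the BS4 object. Here is my code: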

from bs4 import BeautifulSoup
import requests
import json

url = "https://matmatch.com/materials?materialPath=mitf1194-astm-b196-grade-c17200-tb00"

html_content = requests.get(url).text
soup = BeautifulSoup(html_content,features="html.parser")
body = soup.find('body')
the_contents_of_body_without_body_tags = body.findChildren(recursive=False)
#print(the_contents_of_body_without_body_tags)


element = soup.find_all("script",type="application/ld+json")
print(element[2])
#print(type(soup.find_all("script", {"type":"application/ld+json"})[2]))
js = json.loads(element[2])

Here is the output of this code:

<script type="application/ld+json">
    {
      "@context": ["https://schema.org", {"csvw": "http://www.w3.org/ns/csvw#"}],
      "@type": "Dataset",
      "name":"ASTM B196 Grade C17200 TB00",
      "description": "Chemical composition and material properties of ASTM B196 Grade C17200 TB00. Also available for download in XLSX and PDF. Data provided by MakeItFrom.com,Matmatch,Materion Brush GmbH",
      "license": "https://matmatch.com/imprint",
      "publisher": {
        "@type": "Organization",
        "name": "Matmatch"
      },
      "mainEntity" : {
        "@type" : "csvw:Table",
        "csvw:tableSchema": {
          "csvw:columns": [
            {
              "csvw:name": "Property Name",
              "csvw:datatype": "string",
              "csvw:cells": [{"csvw:value":"Density","csvw:primaryKey":"Density"},{"csvw:value":"Outside diameter","csvw:primaryKey":"Outside diameter"},{"csvw:value":"Thickness","csvw:primaryKey":"Thickness"},{"csvw:value":"Width","csvw:primaryKey":"Width"},{"csvw:value":"Bendability 90°, bw","csvw:primaryKey":"Bendability 90°, bw"},{"csvw:value":"Bendability 90°, gw","csvw:primaryKey":"Bendability 90°, gw"},{"csvw:value":"Elastic modulus","csvw:primaryKey":"Elastic modulus"},{"csvw:value":"Elongation","csvw:primaryKey":"Elongation"},{"csvw:value":"Hardness, Rockwell C","csvw:primaryKey":"Hardness, Rockwell C"},{"csvw:value":"Hardness, Vickers","csvw:primaryKey":"Hardness, Vickers"},{"csvw:value":"Shear modulus","csvw:primaryKey":"Shear modulus"},{"csvw:value":"Tensile strength","csvw:primaryKey":"Tensile strength"},{"csvw:value":"Yield strength","csvw:primaryKey":"Yield strength"},{"csvw:value":"Yield strength Rp0.2","csvw:primaryKey":"Yield strength Rp0.2"},{"csvw:value":"Coefficient of thermal expansion","csvw:primaryKey":"Coefficient of thermal expansion"},{"csvw:value":"Melting point","csvw:primaryKey":"Melting point"},{"csvw:value":"Specific heat capacity","csvw:primaryKey":"Specific heat capacity"},{"csvw:value":"Thermal conductivity","csvw:primaryKey":"Thermal conductivity"},{"csvw:value":"Electrical resistivity","csvw:primaryKey":"Electrical resistivity"},{"csvw:value":"Specific Electrical conductivity","csvw:primaryKey":"Specific Electrical conductivity"},{"csvw:value":"Relative magnetic permeability","csvw:primaryKey":"Relative magnetic permeability"}]
            },
            {
              "csvw:name": "Value",
              "csvw:datatype": "string",
              "csvw:cells": [{"csvw:value":8.26,"csvw:primaryKey":"Density"},{"csvw:value":19.1,"csvw:primaryKey":"Outside diameter"},{"csvw:value":0.05,"csvw:primaryKey":"Thickness"},{"csvw:value":1.27,"csvw:primaryKey":"Width"},{"csvw:value":0,"csvw:primaryKey":"Bendability 90°, bw"},{"csvw:value":0,"csvw:primaryKey":"Bendability 90°, gw"},{"csvw:value":130,"csvw:primaryKey":"Elastic modulus"},{"csvw:value":1,"csvw:primaryKey":"Elongation"},{"csvw:value":36,"csvw:primaryKey":"Hardness, Rockwell C"},{"csvw:value":210,"csvw:primaryKey":"Hardness, Vickers"},{"csvw:value":50,"csvw:primaryKey":"Shear modulus"},{"csvw:value":410,"csvw:primaryKey":"Tensile strength"},{"csvw:value":220,"csvw:primaryKey":"Yield strength"},{"csvw:value":130,"csvw:primaryKey":"Yield strength Rp0.2"},{"csvw:value":0.0000175,"csvw:primaryKey":"Coefficient of thermal expansion"},{"csvw:value":870,"csvw:primaryKey":"Melting point"},{"csvw:value":360,"csvw:primaryKey":"Specific heat capacity"},{"csvw:value":84,"csvw:primaryKey":"Thermal conductivity"},{"csvw:value":6.2e-8,"csvw:primaryKey":"Electrical resistivity"},{"csvw:value":17,"csvw:primaryKey":"Specific Electrical conductivity"},{"csvw:value":1.0006,"csvw:primaryKey":"Relative magnetic permeability"}]
            },
            {
              "csvw:name": "Unit",
              "csvw:datatype": "string",
              "csvw:cells": [{"csvw:value":"g/cm³","csvw:primaryKey":"Density"},{"csvw:value":"mm","csvw:primaryKey":"Outside diameter"},{"csvw:value":"mm","csvw:primaryKey":"Thickness"},{"csvw:value":"mm","csvw:primaryKey":"Width"},{"csvw:value":"[-]","csvw:primaryKey":"Bendability 90°, bw"},{"csvw:value":"[-]","csvw:primaryKey":"Bendability 90°, gw"},{"csvw:value":"GPa","csvw:primaryKey":"Elastic modulus"},{"csvw:value":"%","csvw:primaryKey":"Elongation"},{"csvw:value":"[-]","csvw:primaryKey":"Hardness, Rockwell C"},{"csvw:value":"[-]","csvw:primaryKey":"Hardness, Vickers"},{"csvw:value":"GPa","csvw:primaryKey":"Shear modulus"},{"csvw:value":"MPa","csvw:primaryKey":"Tensile strength"},{"csvw:value":"MPa","csvw:primaryKey":"Yield strength"},{"csvw:value":"MPa","csvw:primaryKey":"Yield strength Rp0.2"},{"csvw:value":"1/K","csvw:primaryKey":"Coefficient of thermal expansion"},{"csvw:value":"°C","csvw:primaryKey":"Melting point"},{"csvw:value":"J/(kg·K)","csvw:primaryKey":"Specific heat capacity"},{"csvw:value":"W/(m·K)","csvw:primaryKey":"Thermal conductivity"},{"csvw:value":"Ω·m","csvw:primaryKey":"Electrical resistivity"},{"csvw:value":" % IACS","csvw:primaryKey":"Specific Electrical conductivity"},{"csvw:value":"[-]","csvw:primaryKey":"Relative magnetic permeability"}]
            }]
        }
      }
    }
    </script>

The last line of the code raises this error:

TypeError: the JSON object must be str, bytes or bytearray, not 'Tag'

I tried using .text and .content on the BS4 object, but that also results in errors.

How can I extract the JSON object from this output?


Answer 1:

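Use the .string attribute. From the Beautiful Soup documentation:

If a tag has only one child, and that child is a NavigableString, the child is made available as .string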
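To see the difference without hitting the network, here is a minimal offline sketch. The inline HTML and its "Example" payload are made up for illustration and are not taken from the matmatch page; only the .string and json.loads calls mirror the approach described here.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (not the real matmatch HTML).
html = """
<html><body>
<script type="application/ld+json">{"@type": "Dataset", "name": "Example"}</script>
</body></html>
"""

soup = BeautifulSoup(html, features="html.parser")
tag = soup.find("script", type="application/ld+json")

# Passing the Tag itself fails: json.loads only accepts str, bytes or bytearray.
try:
    json.loads(tag)
    tag_was_accepted = True
except TypeError:
    tag_was_accepted = False

# The script tag has a single text child, so .string returns it as a str subclass.
data = json.loads(tag.string)
print(data["name"])
```

The try/except is only there to show that the Tag is rejected while its .string is accepted; in real code you would call json.loads(tag.string) directly.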


In your example:

from bs4 import BeautifulSoup
import requests
import json

url = "https://matmatch.com/materials?materialPath=mitf1194-astm-b196-grade-c17200-tb00"

html_content = requests.get(url).text
soup = BeautifulSoup(html_content,features="html.parser")
body = soup.find('body')
the_contents_of_body_without_body_tags = body.findChildren(recursive=False)

element = soup.find_all("script",type="application/ld+json")

js = json.loads(element[2].string) # <- Calling `.string` to get the JSON
print(js)

Sample output (truncated):

{'@context': ['https://schema.org', {'csvw': 'http://www.w3.org/ns/csvw#'}], '@type': 'Dataset', 'name': 'ASTM B196 Grade C17200 TB00', ...., 'csvw:value': '[-]', 'csvw:primaryKey': 'Relative magnetic permeability'}]}]}}}
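Indexing with element[2] relies on the page always emitting its script tags in the same order. As a possible alternative, the sketch below parses every application/ld+json block and returns the first one whose "@type" matches. The helper name find_json_ld and the sample HTML are hypothetical and introduced only for illustration; the "Dataset" and "Organization" type values come from the output shown in the question.

```python
import json
from bs4 import BeautifulSoup


def find_json_ld(soup, wanted_type):
    """Return the first JSON-LD object whose "@type" equals wanted_type, else None."""
    for tag in soup.find_all("script", type="application/ld+json"):
        if not tag.string:
            # Empty tag, or more than one child node: nothing to parse.
            continue
        try:
            data = json.loads(tag.string)
        except json.JSONDecodeError:
            # Skip blocks that are not valid JSON.
            continue
        if isinstance(data, dict) and data.get("@type") == wanted_type:
            return data
    return None


# Hypothetical sample page with two JSON-LD blocks.
sample_html = """
<script type="application/ld+json">{"@type": "Organization", "name": "Matmatch"}</script>
<script type="application/ld+json">{"@type": "Dataset", "name": "ASTM B196 Grade C17200 TB00"}</script>
"""

sample_soup = BeautifulSoup(sample_html, features="html.parser")
dataset = find_json_ld(sample_soup, "Dataset")
print(dataset["name"])
```

With the soup object from the answer's code, the call would be find_json_ld(soup, "Dataset"), assuming the live page still marks that block with "@type": "Dataset".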
