Extracting JSON object from html <script>

【Posted】: 2015-12-19 23:36:00

【Question】:

I have a JSON object:

{
  "review_body": "Beef noodles realism weathered modem tanto hotdog dolphin long-chain hydrocarbons 8-bit euro-pop tank-traps Tokyo narrative.-space j-pop franchise otaku faded RAF girl artisanal hotdog denim ablative systemic smart-Kowloon. Man construct dome smart-computer pen monofilament beef noodles rain garage geodesic bicycle San Francisco wonton soup dissident nodal point tower. Boat uplink film dead man modem warehouse. Nodal point jeans euro-pop render-farm nano-fetishism semiotics hacker gang. Futurity narrative youtube otaku Kowloon free-market drugs. Fluidity assassin Tokyo bicycle media assault concrete industrial grade ablative lights boat BASE jump A.I. post-stimulate carbon. Physical computer narrative city youtube math-neural assassin modem.",
  "link": "http://www.getlost.com/store/acme/review/10607787#comment10607787",
  "seller_id": "104523",
  "survey_id": "9933447",
  "loggedin_user": 0,
  "store_rating": "8.02",
  "store_thumb": "http://www.getlost.com/store/thumbnail/acme.jpg",
  "store_name": "acme",
  "username": "ronin666",
  "rating": "1",
  "ref": "RR,acme,104523"
}

embedded in:

<script LANGUAGE="javascript">
window.commentShare = $.extend((window.commentShare || {}), {
    10607787: {
        "review_body": "Beef noodles realism weathered modem tanto hotdog dolphin long-chain hydrocarbons 8-bit euro-pop tank-traps Tokyo narrative.-space j-pop franchise otaku faded RAF girl artisanal hotdog denim ablative systemic smart-Kowloon. Man construct dome smart-computer pen monofilament beef noodles rain garage geodesic bicycle San Francisco wonton soup dissident nodal point tower. Boat uplink film dead man modem warehouse. Nodal point jeans euro-pop render-farm nano-fetishism semiotics hacker gang. Futurity narrative youtube otaku Kowloon free-market drugs. Fluidity assassin Tokyo bicycle media assault concrete industrial grade ablative lights boat BASE jump A.I. post-stimulate carbon. Physical computer narrative city youtube math-neural assassin modem.",
        "link": "http:\/\/www.getlost.com\/store\/acme\/review\/10607787#comment10607787",
        "seller_id": "104523",
        "survey_id": "9933447",
        "loggedin_user": 0,
        "store_rating": "8.02",
        "store_thumb": "http:\/\/www.getlost.com\/store\/thumbnail\/acme.jpg",
        "store_name": "acme",
        "username": "ronin666",
        "rating": "1",
        "ref": "RR,acme,104523"
    }
});
</script>

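I want to extract the JSON object above. How can I do this? Should I use a regular expression?

Here is how I obtain this kind of object (via IPython, Python 2.7):

I am scraping an arbitrary store on the review site resellerratings.com with BeautifulSoup. After getting the soup object, I noticed that the page contains some useful JSON objects holding the information for each review of the selected store. However, calling soup.find("script", language = "javascript") still leaves me with the JSON object embedded inside the script tag.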

from mechanize import Browser
from bs4 import BeautifulSoup

br = Browser()
br.set_handle_robots(False)   # do not honor robots.txt
br.set_handle_refresh(False)  # do not follow refresh redirects

example_url = 'http://www.resellerratings.com/store/My_Digital_Palace'

response = br.open(example_url)
soup = BeautifulSoup(response)
# Returns the first script tag whose language attribute is "javascript"
soup.find("script", language = "javascript")

This should return:

<script language="javascript">
window.commentShare = $.extend(
    (window.commentShare || {}), {
        375015: {
            "review_body": "I bought a Kodak LS443 form My Digital Palace in 2004.  I also purchased a 5 year warranty.  Now the camera does not work and I am unable to contact them.  What do I do???  Am I just screwed???<br><br>Margaret Fuller<br>margaret_fuller@sbcglobal.net",
            "link": "http:\/\/www.resellerratings.com\/store\/My_Digital_Palace\/review\/375015#comment375015",
            "seller_id": "6930",
            "survey_id": "385176",
            "loggedin_user": 0,
            "store_rating": "1.00",
            "store_thumb": "http:\/\/www.resellerratings.com\/store\/thumbnail\/My_Digital_Palace.jpg",
            "store_name": "My Digital Palace",
            "username": "maf1059",
            "rating": "1",
            "ref": "RR,My_Digital_Pala,6930"
        }
    }
);
</script>

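【Comments】:

Are you treating this as a regular text file that you want to parse, or is it included in your web application?

@ergonaut This is actually something I am scraping with Beautiful Soup in Python. So it is something I want to parse.

Why not simply access the object through the variable?

More information would help... Do you want to parse this in Python? Or in JavaScript? Is the $.extend call jQuery or something else?

As it stands, this sounds like you are trying to do something very wrong.

【Answer 1】: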

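Simple enough: strip off the wrapper and the extra lines and you are left with the juicy, juicy JSON itself. The following removes the first four lines and the last three lines of your JavaScript snippet (while also putting back the initial { that goes missing):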

import json

# Drop the 4 wrapper lines at the top and the 3 at the bottom, then restore the opening brace
raw = "{" + "\n".join(str(soup.find("script")).split("\n")[4:-3])

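If the <script> objects on the page are not written in a uniform way (that is, the first four and last three lines are not always the extraneous ones), you may have to fall back on a regular expression or some other kind of matching. After that, you can go on to access the JSON.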

json_obj = json.loads(raw)

Your question is really just a regex/splitting question. I think people got a little hung up on the JavaScript. :)
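For pages where the wrapper lines vary, one possible fallback is sketched below. It is only an illustration under stated assumptions: the helper name extract_reviews is hypothetical, and it assumes every entry in the script has the shape "numeric id, colon, JSON object", as in the sample from the question. Rather than matching the whole object with a regular expression, it uses a regex only to locate each id and then lets json.JSONDecoder.raw_decode read exactly one object starting at that position.

```python
import json
import re

# Matches an entry key such as `375015: ` that is immediately followed by `{`.
# The key format is an assumption based on the sample script in the question.
KEY_PATTERN = re.compile(r'(\d+)\s*:\s*(?=\{)')


def extract_reviews(script_text):
    """Return a dict mapping each numeric id (as a string) to its parsed object.

    Hypothetical helper: it scans the text of a script tag for `<digits>: {`
    and decodes the JSON object that starts at the brace.
    """
    decoder = json.JSONDecoder()
    reviews = {}
    pos = 0
    while True:
        match = KEY_PATTERN.search(script_text, pos)
        if match is None:
            break
        # raw_decode parses one JSON value starting at the given index and
        # returns it together with the index just past the end of that value.
        obj, end = decoder.raw_decode(script_text, match.end())
        reviews[match.group(1)] = obj
        # Continue after the decoded object, so text inside it is not rescanned.
        pos = end
    return reviews
```

With the soup object from the question, a call could look like extract_reviews(str(soup.find("script", language="javascript"))). Because the search resumes after each decoded object, braces or digits inside a review body do not confuse the scan.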

【Answer 2】:

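If you have this JSON on your page and want to access it through JavaScript, you can do so by iterating over the objects inside the window.commentShare object.

Here is a small test function you can add to your page to see how it works. It alerts one of your JSON values. For completeness, I have added it to the end of your example.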

<script  LANGUAGE="javascript">
window.commentShare = $.extend((window.commentShare || {}), {
    10607787: {
        "review_body": "Beef noodles realism weathered modem tanto hotdog dolphin long-chain hydrocarbons 8-bit euro-pop tank-traps Tokyo narrative.-space j-pop franchise otaku faded RAF girl artisanal hotdog denim ablative systemic smart-Kowloon. Man construct dome smart-computer pen monofilament beef noodles rain garage geodesic bicycle San Francisco wonton soup dissident nodal point tower. Boat uplink film dead man modem warehouse. Nodal point jeans euro-pop render-farm nano-fetishism semiotics hacker gang. Futurity narrative youtube otaku Kowloon free-market drugs. Fluidity assassin Tokyo bicycle media assault concrete industrial grade ablative lights boat BASE jump A.I. post-stimulate carbon. Physical computer narrative city youtube math-neural assassin modem.",
        "link": "http:\/\/www.getlost.com\/store\/acme\/review\/10607787#comment10607787",
        "seller_id": "104523",
        "survey_id": "9933447",
        "loggedin_user": 0,
        "store_rating": "8.02",
        "store_thumb": "http:\/\/www.getlost.com\/store\/thumbnail\/acme.jpg",
        "store_name": "acme",
        "username": "ronin666",
        "rating": "1",
        "ref": "RR,acme,104523"
    }
});

function test() {
    for (var i in window.commentShare) {
        var myObj = window.commentShare[i];
        alert(myObj.review_body);
    }
}

test();

</script>
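For the scraping case in the question, the same traversal can be done in Python once the entries have been parsed into a dict. The sketch below is only an illustration: the helper name summarize_reviews and the reviews sample are hypothetical, with field names and values copied from the question's data. Note that the ratings arrive as strings, so they are converted explicitly.

```python
def summarize_reviews(reviews):
    """Return a list of (review_id, username, rating) tuples, sorted by id.

    Hypothetical helper: `reviews` is assumed to map review ids to dicts
    shaped like the entries of window.commentShare in the question.
    """
    rows = []
    for review_id, review in sorted(reviews.items()):
        # "rating" is a string in the source data, so convert it to int.
        rows.append((review_id, review["username"], int(review["rating"])))
    return rows


# Sample data shaped like one parsed entry from the question (assumed input).
reviews = {
    "10607787": {"username": "ronin666", "rating": "1", "store_name": "acme"},
}

for review_id, username, rating in summarize_reviews(reviews):
    print(review_id, username, rating)
```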
