How to loop python to read a set of HTML files and dump into JSON

Posted 2023-02-23

【Title】: How to loop python to read a set of HTML files and dump into JSON
【Posted】: 2014-06-11 19:37:36
【Question】:

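I have a program that extracts certain variables from a set of 20 HTML files. Can someone advise me how to loop the program so that it reads all the HTML files in a directory and prints the information into a single JSON document?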

from bs4 import BeautifulSoup

#opens data file
get_data = open("book1.html",'r').read()


#parses the html
soup = BeautifulSoup(get_data)

# finds title and author

title = soup.find("span", id="btAsinTitle")
author = title.find_next("a", href=True)

# finds price
for definition in soup.findAll('span', {"class": 'bb_price'}):
    definition = definition.renderContents()

#finds ISBN, Shipping Weight, Product Dimensions
print(soup.find('b', text='ISBN-10:').next_sibling)
print(soup.find('b', text='Shipping Weight:').next_sibling)


#prints all the information

print(definition)
print(title.get_text())
print(author.get_text())

【Answer 1】:

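You can use glob.iglob to loop over all the HTML files in a directory. For each filename, pass the file-like object to the BeautifulSoup constructor, get the elements you need, and build a dictionary: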

import glob
from bs4 import BeautifulSoup

for filename in glob.iglob('*.html'):
    with open(filename) as f:
        soup = BeautifulSoup(f)

        title = soup.find("span", id="btAsinTitle")
        author = title.find_next("a", href=True)
        isbn = soup.find('b', text='ISBN-10:').next_sibling
        weight = soup.find('b', text='Shipping Weight:').next_sibling

        print({'title': title.get_text(),
               'author': author.get_text(),
               'isbn': isbn,
               'weight': weight})
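
The question asks for a single JSON document, while the loop above prints one dictionary per file. One possible way to bridge that gap is to collect the dictionaries in a list and write the list once at the end. The sketch below is only an illustration: `collect_to_json`, its `extract` parameter, and the extra `"source"` key are hypothetical names that do not appear in the original answer.

```python
import glob
import json
import os


def collect_to_json(directory, extract, out_path):
    """Run extract() on every .html file in directory and write one JSON array.

    `extract` is a hypothetical callable: it receives an open file object
    and returns a dict describing that file.
    """
    records = []
    # sorted() keeps the output order stable between runs
    for filename in sorted(glob.glob(os.path.join(directory, "*.html"))):
        with open(filename) as f:
            record = extract(f)
        # remember which file each record came from (an added, assumed field)
        record["source"] = os.path.basename(filename)
        records.append(record)

    # a single json.dump call produces a single JSON document
    with open(out_path, "w") as out:
        json.dump(records, out, indent=2)
    return records
```

Here `extract` could be a small function that builds `soup = BeautifulSoup(f)` and returns the same title, author, isbn, and weight dictionary as in the loop above. Passing the extraction step in as a callable keeps the file handling separate from the parsing.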

【Comments】:

This is a great answer, but do you know how to export these variables as JSON?
@user2962786 Yes, you can do that with json.dumps() or json.dump().

【Answer 2】:

To process the set of files in a directory:

from glob import glob
fnames = glob("datadir/*.html")
for fname in fnames:
  html2json(fname)

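Now we need the function html2json, which takes the name of an HTML file and writes a JSON string to a file with the same name as the HTML file, with a .json extension appended (so book1.html becomes book1.html.json).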

import json
from bs4 import BeautifulSoup

def html2json(fname):
  resdct = {}
  with open(fname) as f:
    soup = BeautifulSoup(f)

    title = soup.find("span", id="btAsinTitle")
    resdct["title"] = title.get_text()
    resdct["author"] = title.find_next("a", href=True).get_text()
    resdct["isbn"] = soup.find('b', text='ISBN-10:').next_sibling.get_text()
    resdct["weight"] = soup.find('b', text='Shipping Weight:').next_sibling.get_text()

  outfname = fname + ".json"
  with open(outfname, "w") as f:
    json.dump(resdct, f)
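
html2json writes one JSON file per HTML file, next to the HTML file itself. If a single combined document is wanted, as the question asks, the per-file outputs could be merged in a second pass. The following is a minimal sketch under that assumption; `merge_json_files` and its parameters are hypothetical names, and the default pattern relies on the `fname + ".json"` naming used above.

```python
import glob
import json
import os


def merge_json_files(directory, out_path, pattern="*.html.json"):
    """Combine every per-file JSON output in directory into one JSON object."""
    merged = {}
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path) as f:
            # key each record by its file name, e.g. "book1.html.json"
            merged[os.path.basename(path)] = json.load(f)

    with open(out_path, "w") as out:
        json.dump(merged, out, indent=2)
    return merged
```

For example, `merge_json_files("datadir", "datadir/all_books.json")` would read the files produced by the loop above. The output name should not itself match the pattern, otherwise a second run would pick up the merged file as input.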

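【Comments】:

Thanks a lot for the reply, but does it dump the new files into the directory? I can't find the output files.
@user2962786 Good catch, I had a bug: the "w" mode was missing in the last open call. Corrected. Try again.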
