Iterating through multiple html files and converting to csv

Posted 2021-03-14 02:24:43

Question:

I have 32 separate html files, each containing data in table format with 8 columns. Each file covers a specific species of fungus.

I need to convert the 32 html files into 32 csv files of data. I have a script that works for a single file, but I don't know how to do this systematically with a few commands rather than running the command 32 times.

This is the script I'm using while trying to make it iterate through all 32 files:

import os
from bs4 import BeautifulSoup

directory = r'../html/species'
data = []
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
            # rows of the first table, skipping the header row
            HTML_data = soup.find_all("table")[0].find_all("tr")[1:]
            for element in HTML_data:
                sub_data = []
                for sub_element in element:
                    try:
                        sub_data.append(sub_element.get_text())
                    except AttributeError:
                        # skip child nodes without get_text()
                        continue
                data.append(sub_data)
data

Here is some of the output data from the script above, simplified for reproduction purposes:

[['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
 ['Kenya',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Shomari (1996); Ohler (1979); Mniu (1998); Nayar (1998)',
  ''],
 ['Malawi',
  'Present',
  '',
  '',
  '',
  '',
  'Malawi, Ministry of Agriculture (1990)',
  ''],
 ['Mozambique',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Ohler (1979); Shomari (1996); Mniu (1998); CABI (Undated)',
  ''],
 ['Nigeria',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Ohler (1979); Shomari (1996); Nayar (1998); CABI (Undated)',
  ''],
 ['South Africa', 'Present', '', '', '', '', 'Swart (2004)', ''],
 ['Tanzania',
  'Present',
  '',
  '',
  '',
  '',
  'Casulli (1979); Martin et al. (1997)',
  ''],
 ['Zambia',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Ohler (1979); Shomari (1996); Mniu (1998); Nayar (1998)',
  ''],
 ['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
 ['India', 'Present', '', 'Introduced', '', '', 'Intini (1987)', ''],
 ['\n\t\t\t\t\t\tSouth America\n\t\t\t\t\t'],
 ['Brazil', 'Present', '', '', '', '', 'Ponte (1986)', ''],
 ['-Sao Paulo',
  'Present',
  '',
  'Native',
  '',
  '',
  'Waller et al. (1992); Shomari (1996)',
  ''],
 ['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
 ['Egypt',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Ethiopia',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Libya',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Malawi',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Morocco',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Mozambique',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['South Africa',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Sudan',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Tanzania',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Tunisia', 'Present', '', '', '', '', 'Djébali et al. (2009)', ''],
 ['Uganda',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
 ['Afghanistan',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Armenia',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Azerbaijan',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Bhutan', 'Present', '', '', '', '', 'CABI and EPPO (2010)', '']]

I think what I need is for each species to be formatted more like this: [[info_species1], [info_species1], [info_species1]], [[info_species2], [info_species2], [info_species2]]. Or, in terms of my output, I need:

['-Sao Paulo',
  'Present',
  '',
  'Native',
  '',
  '',
  'Waller et al. (1992); Shomari (1996)',
  '']], # AN EXTRA SQUARE BRACKET RIGHT HERE
 ['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
 ['Egypt',
  'Present',

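One way to get the nesting shown above is to open a fresh sub-list for each file and append it to the top-level list only after that file's rows are done. A minimal sketch of that pattern, using hypothetical hard-coded rows to stand in for the parsed table data:

```python
# Hypothetical rows standing in for the table rows parsed from each file.
parsed_files = {
    "species1.html": [["Kenya", "Present"], ["Malawi", "Present"]],
    "species2.html": [["Egypt", "Present"]],
}

data = []
for filename, rows in parsed_files.items():
    file_data = []              # one fresh sub-list per file
    for row in rows:
        file_data.append(row)   # rows accumulate inside this file's list
    data.append(file_data)      # the file's list is closed here, once per file

# data is now [[rows of file 1], [rows of file 2]]
```

The key difference from the original script is where `append` to the outer list happens: per file, not per row.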
Comments:

The extra square bracket is the end of a list. Do you just want to add the lists into another list? Can you show us the raw HTML data and the desired output?

Yes, I want to close off each file so it becomes a single part within the list.

Answer 1:

Have you considered just reading the table tags with pandas?

import pandas as pd
import os

directory = r'../html/species'

for filename in os.listdir(directory):
    if filename.endswith('.html'):
        csv_filename = filename.replace('.html', '.csv')
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            # read_html returns a list of DataFrames; take the first table
            table = pd.read_html(f.read())[0]
            # writes one csv per html file into the current working directory
            table.to_csv(csv_filename, index=False)

Comments:

This gives me the same output as my script: all 32 files in one table. Do you know how to modify your output to give 32 separate outputs (one per html file), or a list of lists for each?
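For one CSV per input file without pandas, the standard-library csv module can write each file's rows out directly. A sketch, again with hypothetical in-memory rows in place of the parsed HTML; it writes to a StringIO so the example is self-contained, whereas the real script would open `csv_filename` for writing instead:

```python
import csv
import io

# Hypothetical rows for one species file; in the real script these would
# come from the BeautifulSoup loop over a single html file's table.
rows = [
    ["Kenya", "Present", "", "Introduced", "", "",
     "Shomari (1996); Ohler (1979)", ""],
    ["Malawi", "Present", "", "", "", "",
     "Malawi, Ministry of Agriculture (1990)", ""],
]

buf = io.StringIO()     # stands in for open('species1.csv', 'w', newline='')
writer = csv.writer(buf)
writer.writerows(rows)  # one writerows call per input file

csv_text = buf.getvalue()
```

Note that csv.writer quotes fields containing commas (such as the Malawi reference) automatically, so citation strings survive the round trip.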
