Iterating through multiple HTML files and converting to CSV


Posted: 2021-03-14 02:24:43

【Question】:

I have 32 separate HTML files, each containing data in table format with 8 columns. Each file covers a specific species of fungus.

I need to convert the 32 HTML files into 32 CSV files of data. I have a script that works for a single file, but I don't know how to do this systematically with a few commands rather than running my one-file command 32 times.

Here is the script I'm using while trying to get it to loop over all 32 files:

import os
from bs4 import BeautifulSoup

directory = r'../html/species'
data = []
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
            # first table on the page, skipping the header row
            HTML_data = soup.find_all("table")[0].find_all("tr")[1:]
            for element in HTML_data:
                sub_data = []
                for sub_element in element:
                    try:
                        sub_data.append(sub_element.get_text())
                    except:
                        continue
                data.append(sub_data)
data

Here is some of the output from the script above, simplified for reproducibility:

[['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
 ['Kenya',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Shomari (1996); Ohler (1979); Mniu (1998); Nayar (1998)',
  ''],
 ['Malawi',
  'Present',
  '',
  '',
  '',
  '',
  'Malawi, Ministry of Agriculture (1990)',
  ''],
 ['Mozambique',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Ohler (1979); Shomari (1996); Mniu (1998); CABI (Undated)',
  ''],
 ['Nigeria',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Ohler (1979); Shomari (1996); Nayar (1998); CABI (Undated)',
  ''],
 ['South Africa', 'Present', '', '', '', '', 'Swart (2004)', ''],
 ['Tanzania',
  'Present',
  '',
  '',
  '',
  '',
  'Casulli (1979); Martin et al. (1997)',
  ''],
 ['Zambia',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Ohler (1979); Shomari (1996); Mniu (1998); Nayar (1998)',
  ''],
 ['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
 ['India', 'Present', '', 'Introduced', '', '', 'Intini (1987)', ''],
 ['\n\t\t\t\t\t\tSouth America\n\t\t\t\t\t'],
 ['Brazil', 'Present', '', '', '', '', 'Ponte (1986)', ''],
 ['-Sao Paulo',
  'Present',
  '',
  'Native',
  '',
  '',
  'Waller et al. (1992); Shomari (1996)',
  ''],
 ['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
 ['Egypt',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Ethiopia',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Libya',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Malawi',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Morocco',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Mozambique',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['South Africa',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Sudan',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Tanzania',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Tunisia', 'Present', '', '', '', '', 'Djébali et al. (2009)', ''],
 ['Uganda',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
 ['Afghanistan',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Armenia',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Azerbaijan',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Bhutan', 'Present', '', '', '', '', 'CABI and EPPO (2010)', '']]

I think what I need is for each species to be formatted more like this: [[info_species1], [info_species1], [info_species1]], [[info_species2], [info_species2], [info_species2]]. In other words, in my output I need:

['-Sao Paulo',
  'Present',
  '',
  'Native',
  '',
  '',
  'Waller et al. (1992); Shomari (1996)',
  '']], # AN EXTRA SQUARE BRACKET RIGHT HERE
 ['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
 ['Egypt',
  'Present',

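The nesting described above can be produced by collecting each file's rows into their own sub-list and appending that sub-list, rather than the individual rows, to the overall list. A minimal sketch of that restructuring, using inline HTML strings as hypothetical stand-ins for the 32 species files (it also pulls cells with `find_all(["td", "th"])` instead of iterating the row tag directly, which skips the whitespace text nodes):

```python
from bs4 import BeautifulSoup

# Hypothetical inline pages standing in for two of the 32 species files.
SAMPLE_PAGES = {
    "species1.html": ("<table><tr><th>Country</th><th>Status</th></tr>"
                      "<tr><td>Kenya</td><td>Present</td></tr>"
                      "<tr><td>Malawi</td><td>Present</td></tr></table>"),
    "species2.html": ("<table><tr><th>Country</th><th>Status</th></tr>"
                      "<tr><td>India</td><td>Present</td></tr></table>"),
}

all_data = []                                   # one sub-list per file
for filename, html in SAMPLE_PAGES.items():
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("table")[0].find_all("tr")[1:]   # skip header row
    file_data = [[cell.get_text() for cell in row.find_all(["td", "th"])]
                 for row in rows]
    all_data.append(file_data)                  # close off this file's list

print(all_data)
```

With the real files, `SAMPLE_PAGES.items()` would be replaced by the `os.listdir` / `open` loop from the question; the key change is appending `file_data` once per file instead of appending each row to one flat list.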
【Comments】:

The extra square bracket closes a list. Do you just want to append each list to another list? Could you show us the raw HTML data and the desired output?

Yes, I want to close off each file so it becomes one element within a list.

【Answer 1】:

Have you considered just reading the table tags with pandas?

import pandas as pd
import os

directory = r'../html/species'

for filename in os.listdir(directory):
    if filename.endswith('.html'):
        csv_filename = filename.replace('.html', '.csv')
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            # read_html returns a list of all tables on the page; take the first
            table = pd.read_html(f.read())[0]
            # note: writes each CSV to the current working directory
            table.to_csv(csv_filename, index=False)

【Discussion】:

This gives me the same output as my script: all 32 files in one table. Do you know how to modify your answer to produce 32 separate outputs (one per HTML file), or a list of lists for each?
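If the goal is 32 separate CSVs, one per species file, each file's parsed rows can also be written out individually with the standard-library csv module, with no pandas involved. A minimal sketch, assuming the rows have already been grouped per species (`species_rows` and the temp output directory are hypothetical stand-ins for the real parsed data and target folder):

```python
import csv
import os
import tempfile

# Hypothetical parsed rows, one entry per species file (stand-in data).
species_rows = {
    "species1": [["Kenya", "Present"], ["Malawi", "Present"]],
    "species2": [["India", "Present"]],
}

out_dir = tempfile.mkdtemp()          # stand-in for your real output directory

for species, rows in species_rows.items():
    csv_path = os.path.join(out_dir, species + ".csv")
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)  # one CSV per species

print(sorted(os.listdir(out_dir)))
```

Pairing this with the parsing loop (one `rows` list per HTML file) yields one CSV per input file instead of one combined table.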
