Iterating through multiple html files and converting to csv
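Posted: 2021-03-14 02:24:43
Question:
I have 32 separate html files, each holding its data in a table with 8 columns. Each file covers one particular species of fungus.
I need to convert the 32 html files into 32 csv files containing that data. I have a script that works for a single file, but I can't work out how to do this systematically with a few commands instead of running the same command 32 times.
Here is the script I am using to try to loop over all 32 files: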
import os
from bs4 import BeautifulSoup

directory = r'../html/species'
data = []  # shared by every file, so all rows end up in one flat list
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        # Rows of the first table, skipping the header row
        HTML_data = soup.find_all("table")[0].find_all("tr")[1:]
        for element in HTML_data:
            sub_data = []
            for sub_element in element:
                try:
                    sub_data.append(sub_element.get_text())
                except AttributeError:
                    # Skip child nodes that have no get_text()
                    continue
            data.append(sub_data)
data
Below is some of the output from the script above, simplified for reproducibility:
[['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
['Kenya',
'Present',
'',
'Introduced',
'',
'',
'Shomari (1996); Ohler (1979); Mniu (1998); Nayar (1998)',
''],
['Malawi',
'Present',
'',
'',
'',
'',
'Malawi, Ministry of Agriculture (1990)',
''],
['Mozambique',
'Present',
'',
'Introduced',
'',
'',
'Ohler (1979); Shomari (1996); Mniu (1998); CABI (Undated)',
''],
['Nigeria',
'Present',
'',
'Introduced',
'',
'',
'Ohler (1979); Shomari (1996); Nayar (1998); CABI (Undated)',
''],
['South Africa', 'Present', '', '', '', '', 'Swart (2004)', ''],
['Tanzania',
'Present',
'',
'',
'',
'',
'Casulli (1979); Martin et al. (1997)',
''],
['Zambia',
'Present',
'',
'Introduced',
'',
'',
'Ohler (1979); Shomari (1996); Mniu (1998); Nayar (1998)',
''],
['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
['India', 'Present', '', 'Introduced', '', '', 'Intini (1987)', ''],
['\n\t\t\t\t\t\tSouth America\n\t\t\t\t\t'],
['Brazil', 'Present', '', '', '', '', 'Ponte (1986)', ''],
['-Sao Paulo',
'Present',
'',
'Native',
'',
'',
'Waller et al. (1992); Shomari (1996)',
''],
['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
['Egypt',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Ethiopia',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Libya',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Malawi',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Morocco',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Mozambique',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['South Africa',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Sudan',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Tanzania',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Tunisia', 'Present', '', '', '', '', 'Djébali et al. (2009)', ''],
['Uganda',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
['Afghanistan',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Armenia',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Azerbaijan',
'Present',
'',
'',
'',
'',
'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
''],
['Bhutan', 'Present', '', '', '', '', 'CABI and EPPO (2010)', '']]
I think what I need is a format more like this for each species: [[info_species1],[info_species1],[info_species1]], [[info_species2],[info_species2],[info_species2]]. In other words, in my output I need:
['-Sao Paulo',
'Present',
'',
'Native',
'',
'',
'Waller et al. (1992); Shomari (1996)',
'']], # AN EXTRA SQUARE BRACKET RIGHT HERE
['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
['Egypt',
'Present',
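The nested structure described above (one inner list of rows per file) could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard-library `html.parser` rather than BeautifulSoup so that it runs without extra dependencies, it assumes the first table is not nested inside another table, it strips whitespace from each cell (the original script does not), and the names `TableRowParser`, `rows_from_html` and `rows_per_file` are hypothetical helpers, not part of the original script.

```python
from html.parser import HTMLParser
from pathlib import Path


class TableRowParser(HTMLParser):
    """Collect the cell text of each <tr> in the first table (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside a <tr>
        self._cell = None   # text chunks of the current cell, or None outside a cell
        self._done = False  # set once the first </table> has been seen

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if not self._done and self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag == "table":
            self._done = True
        elif tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rows_from_html(html_text):
    """Return the data rows of the first table, skipping the header row."""
    parser = TableRowParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows[1:]


def rows_per_file(directory):
    """Return {filename: rows}; each file gets its own fresh list of rows."""
    per_file = {}
    for path in sorted(Path(directory).glob("*.html")):
        per_file[path.name] = rows_from_html(path.read_text(encoding="utf-8"))
    return per_file
```

Calling `list(rows_per_file('../html/species').values())` would then give a list with one inner list per file. The key difference from the script above is that the row list is created per file instead of once before the loop.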
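Comments:
The extra square bracket is the end of a list. Do you just want to add a list to another list? Can you show us the raw HTML data and the desired output?
Yes, I want to close off each file so that it becomes one part of the list.
Answer 1:
Have you considered just reading the table tag with pandas?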
import pandas as pd
import os

directory = r'../html/species'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        # One output file per input file: species.html -> species.csv
        csv_filename = filename.replace('.html', '.csv')
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            # read_html returns a list of DataFrames; take the first table
            table = pd.read_html(f.read())[0]
        table.to_csv(csv_filename, index=False)
Comments:
This gives me the same output as my script: all 32 files in one table. Do you know how to modify your output to give 32 separate outputs (one per html file), or a list of lists?
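One way to get a separate output per html file is to keep the rows grouped by source filename and write each group to its own CSV. The sketch below is a minimal illustration, not the answerer's method: it uses the standard-library `csv` module instead of pandas, and both the function name `write_csv_per_file` and the `tables` mapping (html filename to that file's rows) are assumptions introduced for this example.

```python
import csv
from pathlib import Path


def write_csv_per_file(tables, out_dir):
    """Write each table to its own CSV, named after its source HTML file.

    `tables` maps an HTML filename to that file's list of rows
    (a hypothetical structure for this sketch). Returns the paths written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for html_name, rows in tables.items():
        # species_a.html -> <out_dir>/species_a.csv
        csv_path = out_path / Path(html_name).with_suffix(".csv").name
        # newline="" lets the csv module control line endings itself
        with open(csv_path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
        written.append(csv_path)
    return written
```

Because the output name is derived from each input name inside the loop, 32 input files yield 32 CSV files rather than one combined table.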