Python Pandas - Read csv file containing multiple tables

Posted 2023-02-23

【Title】Python Pandas - Read csv file containing multiple tables 【Posted】2016-03-15 01:58:17 【Question】:

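I have a .csv file that contains multiple tables.

Using Pandas, what is the best strategy for getting the two DataFrames inventory and HPBladeSystemRack out of this one file?

The input .csv looks like this: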

Inventory       
System Name            IP Address    System Status
dg-enc05                             Normal
dg-enc05_vc_domain                   Unknown
dg-enc05-oa1           172.20.0.213  Normal

HP BladeSystem Rack         
System Name               Rack Name   Enclosure Name
dg-enc05                  BU40  
dg-enc05-oa1              BU40        dg-enc05
dg-enc05-oa2              BU40        dg-enc05

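So far, the best approach I have come up with is to convert this .csv file into an Excel workbook (xlsx), split the tables into sheets, and use:

inventory = read_excel('path_to_file.xlsx', 'sheet1', skiprows=1)
HPBladeSystemRack = read_excel('path_to_file.xlsx', 'sheet2', skiprows=2)

However:

This approach requires the xlrd module.

These log files have to be analyzed in real time, so it would be much better to find a way to analyze them as they come out of the logs.

The real logs have many more tables than these two.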

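【Comments on the question】:

Do the tables have a fixed number of rows?

Yes, but the number is different for each table. And I would like to avoid selecting by row number, because the next log file might have more rows...

Is there a blank line between two tables? Something solid that is always there :)

You can use a combination of 'nrows' and 'skiprows' in pd.read_csv() to grab specific tables. You would have to know which row each table starts at and how many rows each table has.

@WoodChopper Yes, there is a blank line between each table.

【Answer 1】: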

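If you know the table names beforehand, then something like this:

import pandas as pd

# names=range(3) forces three columns, so rows of different widths all parse
df = pd.read_csv("jahmyst2.csv", header=None, names=range(3))
table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]
# a new group starts at every row whose first cell is a table name
groups = df[0].isin(table_names).cumsum()
# key: the title cell of each group; value: the remaining rows of that group
tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}

should produce a dictionary with the table names as keys and the sub-tables as values.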

>>> list(tables)
['HP BladeSystem Rack', 'Inventory']
>>> for k,v in tables.items():
...     print("table:", k)
...     print(v)
...     print()
...     
table: HP BladeSystem Rack
              0          1               2
6   System Name  Rack Name  Enclosure Name
7      dg-enc05       BU40             NaN
8  dg-enc05-oa1       BU40        dg-enc05
9  dg-enc05-oa2       BU40        dg-enc05

table: Inventory
                    0             1              2
1         System Name    IP Address  System Status
2            dg-enc05           NaN         Normal
3  dg-enc05_vc_domain           NaN        Unknown
4        dg-enc05-oa1  172.20.0.213         Normal
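
The grouping step is the part that is easiest to misread, so here is a minimal sketch of what isin followed by cumsum produces. The values in col0 are illustrative stand-ins for column 0 of the frame above, not the real file.

```python
import pandas as pd

# Illustrative stand-in for column 0 of the parsed frame (assumed values).
col0 = pd.Series([
    "Inventory", "System Name", "dg-enc05",
    "HP BladeSystem Rack", "System Name", "dg-enc05",
])
table_names = ["Inventory", "HP BladeSystem Rack"]

is_title = col0.isin(table_names)  # True only on rows holding a table title
groups = is_title.cumsum()         # running count of titles seen so far

print(is_title.tolist())  # [True, False, False, True, False, False]
print(groups.tolist())    # [1, 1, 1, 2, 2, 2]
```

Every row inherits the number of the most recent title above it, which is why groupby(groups) yields one chunk per table with the title as its first row.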

Once that is done, you can set the column names from the first row of each sub-table, and so on.
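
As a rough sketch of that clean-up step: the helper below (promote_header is a hypothetical name, not part of pandas) drops the all-NaN padding columns and uses the first row as the header. The sample frame is made up to mimic one value of the tables dictionary.

```python
import pandas as pd

def promote_header(raw):
    """Use the first row of `raw` as column names and drop empty columns."""
    # Drop padding columns that are NaN in every row (left by names=range(n)).
    body = raw.dropna(axis=1, how="all")
    header = body.iloc[0]
    body = body.iloc[1:].reset_index(drop=True)
    body.columns = header.tolist()
    return body

# Illustrative sub-table with one padding column (assumed data).
raw = pd.DataFrame(
    [["System Name", "Rack Name", "Enclosure Name", None],
     ["dg-enc05", "BU40", None, None],
     ["dg-enc05-oa1", "BU40", "dg-enc05", None]],
    index=[6, 7, 8],
)
rack = promote_header(raw)
print(rack.columns.tolist())  # ['System Name', 'Rack Name', 'Enclosure Name']
```

With the dictionary from the answer, this could be applied as {name: promote_header(t) for name, t in tables.items()}.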

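【Comments】:

What does names=range(3) do? Is it the number of columns? If so, that number is different for each table. On my full dataset, which has more tables than this (each with a different number of columns), the code raises an exception.

@JahMyst: When you have a variable number of columns, pandas can get confused about how many columns there are. You can help it by telling it how many; names=range(some_number) does that. You can always drop the extra all-NaN columns afterwards with dropna, so I tend to start with names=range(some_big_number).

@DSM, it works like a charm. I have a hard time understanding the logic, though.

I like this solution. I found that if I modify the code slightly, it automatically generates a unique ID for all my tables (since I don't know their names in advance): tables = {g.iloc[0,0]: g.iloc[1:] for k,g in df.groupby(groups)}

【Answer 2】: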

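I assume you know the names of the tables you want to parse out of the csv file. If so, you can retrieve the index position of each one and select the relevant slices accordingly. As a sketch, this might look like:

import pandas as pd

# table_names: list of the table titles you expect to find in the file
df = pd.read_csv('path_to_file')
index_positions = []
for table in table_names:
    # index label of the first row whose title column equals this table name
    index_positions.append(df[df['col_with_table_names'] == table].index.tolist()[0])

## Include end of table for last slice, omit for iteration below
index_positions.append(df.index.tolist()[-1])

tables = {}
for position in index_positions[:-1]:
    table_no = index_positions.index(position)
    # .loc slices are inclusive, so a slice also contains the next title row
    tables[table_names[table_no]] = df.loc[position:index_positions[table_no + 1]]

There are certainly more elegant solutions, but this should give you a dictionary with the table names as keys and the corresponding tables as values.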

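【Comments】:

I assume you would load the DataFrame with df = read_csv('path_to_file.csv')? There are some problems with the code above: first, DataFrame has no get_loc(), so it needs to be changed to df.index.get_loc(). Second, I get a 'KeyError' when I try to pass "Inventory" or "HP BladeSystem Rack" to get_loc.

Changed so that you filter the relevant column (the one with the table names) and get the corresponding index value, assuming the first match is the one that matters. Then slice using .loc.

【Answer 3】: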

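Pandas does not seem ready to do this easily, so I ended up writing my own split_csv function. It only needs the table names, and it writes one .csv file named after each table.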

import csv
import sys
from os.path import dirname  # gets parent folder in a path
from os.path import join  # concatenate paths

table_names = ["Inventory", "HP BladeSystem Rack", "Network Interface"]


def split_csv(csv_path, table_names):
    tables_infos = detect_tables_from_csv(csv_path, table_names)
    for table_info in tables_infos:
        split_csv_by_indexes(csv_path, table_info)


def split_csv_by_indexes(csv_path, table_info):
    title, start_index, end_index = table_info
    print(title, start_index, end_index)
    dir_ = dirname(csv_path)
    output_path = join(dir_, title) + ".csv"
    # In Python 3 the csv module expects text files opened with newline=''
    with open(output_path, 'w', newline='') as output_file, \
            open(csv_path, newline='') as input_file:
        writer = csv.writer(output_file)
        reader = csv.reader(input_file)
        for i, line in enumerate(reader):
            if i < start_index:
                continue
            if i > end_index:
                break
            writer.writerow(line)


def detect_tables_from_csv(csv_path, table_names):
    output = []
    previous_match = None  # title of the table currently being scanned
    start_index = 0
    idx = -1
    with open(csv_path, newline='') as csv_file:
        reader = csv.reader(csv_file)
        for idx, row in enumerate(reader):
            for col in row:
                match = [title for title in table_names if title in col]
                if match:
                    # a new title closes the previous table one row earlier
                    if previous_match is not None:
                        output.append((previous_match, start_index, idx - 1))
                    print("Found new table", col)
                    start_index = idx
                    previous_match = match[0]  # get the first matching element
                    break

    # the last table runs to EOF
    if previous_match is not None:
        output.append((previous_match, start_index, idx))
    return output


if __name__ == '__main__':
    csv_path = 'switch_records.csv'
    try:
        split_csv(csv_path, table_names)
    except IOError as e:
        print("This file doesn't exist. Aborting.")
        print(e)
        sys.exit(1)
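
If the table names are not known ahead of time, the blank line between tables (confirmed in the comments on the question) can serve as the delimiter instead. The following is only a sketch under stated assumptions: the file is comma-delimited, each table starts with a title line followed by a header row, and read_tables is a hypothetical helper name. It builds the DataFrames in memory rather than writing intermediate files.

```python
import io
import pandas as pd

def read_tables(text):
    """Split `text` on blank lines and return {title: DataFrame}."""
    tables = {}
    block = []
    # The trailing "" acts as a sentinel so the last block is flushed too.
    for line in text.splitlines() + [""]:
        if line.strip(", \t"):          # line has real content
            block.append(line)
            continue
        if len(block) >= 2:             # need a title plus a header row
            title = block[0].strip(", \t")
            tables[title] = pd.read_csv(io.StringIO("\n".join(block[1:])))
        block = []
    return tables

# Illustrative input modelled on the question's sample (assumed comma layout).
sample = """Inventory,,
System Name,IP Address,System Status
dg-enc05,,Normal
dg-enc05-oa1,172.20.0.213,Normal

HP BladeSystem Rack,,
System Name,Rack Name,Enclosure Name
dg-enc05,BU40,
dg-enc05-oa1,BU40,dg-enc05
"""
tables = read_tables(sample)
print(list(tables))  # ['Inventory', 'HP BladeSystem Rack']
```

Lines made up only of commas are treated as blank, since some exports pad empty rows that way. Each table keeps its own column count, so no names=range(n) padding is needed.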
