Avoiding a bulk data export to csv when the headers are dynamic

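I stumbled upon a very simple situation for which I can't seem to find a solution.

What I want to do is simple: write some data to a .csv file that contains:

  • dynamic headers
  • some data

The way I do it now seems to be the only solution I can think of:

  • store the data I need in a list of dictionaries
  • take the keys() of every dict in that list and add them to a set() (this will be the header)
  • write the data to the file using writer.writerows(data)

Basically, a simple MCVE might look like this: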

from csv import DictWriter

RESULT_FILE = 'test_result.csv'


def get_fieldnames(data):
    fieldnames = set()
    for item in data:
        fieldnames.update(item.keys())
    return fieldnames


def main(data):
    fieldnames = get_fieldnames(data)

    with open(RESULT_FILE, 'a', newline='', encoding='utf-8') as f:
        writer = DictWriter(f, fieldnames=fieldnames, delimiter=',')
        writer.writeheader()
        writer.writerows(data)


if __name__ == '__main__': 
    data_ = [
        {
            'a': '1',
            'b': '2',
            'c': '3',
        },
        {
            'a': '6',
            'd': '1',
            'b': '3',
        },
        {
            'c': '2',
            'e': '1',
            'f': '9',
        }
    ]
    main(data_)

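Now, here is what I don't like about this:

  • The list might get very large (~100k dicts, each containing around 10 fields).
  • If the program crashes while adding dict number 66666 to the list, everything is lost and I have no data in the csv either. Because I have to wait until all the data has been added to the list to know every possible header, I can't avoid this.

How can I avoid exporting all the data to the csv in one go when the headers are dynamic?


As requested, the real data looks like this: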

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': 'Exclusive single-piece hub design reduces pad vibration and '
                'ensures smooth performance.',
 'Each': '$ 24.70',
 'Info': '',
 'Line art': '',
 'Name': '(5") Non-Vacuum Disc Pad Vinyl-Face',
 'Product number': '91456106T',
 'Technical specifications': '',
 'image_1': 'https://www.richelieu.com/documents/docsGr/120/107/6/1201076/1419675_700.jpg'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': '',
 'Each': '$ 8.19',
 'Info': '<p><strong>material: </strong>Cork</p>',
 'Line art': '',
 'Name': 'Replacement Plate for MKT9924DB Belt Sander',
 'Product number': 'MKT4230358',
 'Technical specifications': '<p><strong>brand: </strong>Makita</p>',
 'image_1': 'https://www.richelieu.com/documents/docsGr/116/631/4/1166314/1281513_700.jpg',
 'xa0': '$ 257.80'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': '',
 'Each': '$ 8.19',
 'Info': '<p><strong>material: </strong>Graphite</p>',
 'Line art': '',
 'Name': 'Replacement Plate for MKT9924DB Belt Sander',
 'Product number': 'MKT4230366',
 'Technical specifications': '<p><strong>brand: </strong>Makita</p>',
 'image_1': 'https://www.richelieu.com/documents/docsPr/MK/T4/23/03/66/MKT4230366/1281514_700.jpg',
 'xa0': '$ 257.80'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': '- Exclusive single-piece hub design reduces pad vibration and '
                'ensures smooth performance.',
 'Each': '$ 38.47',
 'Info': '',
 'Line art': '',
 'Name': 'Non-Grip Vacuum Pads',
 'Product number': '9154325',
 'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                             'in</p><p><strong>density: '
                             '</strong>Medium</p><p><strong>nap: '
                             '</strong>Short</p>',
 'image_1': 'https://www.richelieu.com/documents/docsPr/91/54/32/5/9154325/1213330_700.jpg',
 'image_2': 'https://www.richelieu.com/documents/docsPr/91/54/32/5/9154325/1213331_700.jpg'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': '- Exclusive single-piece hub design reduces pad vibration and '
                'ensures smooth performance.',
 'Each': '$ 52.92',
 'Info': '',
 'Line art': '',
 'Name': 'Non-Grip Vacuum Pads',
 'Product number': '9154327',
 'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                             'in</p><p><strong>density: '
                             '</strong>Medium</p><p><strong>nap: '
                             '</strong>Short</p>',
 'image_1': 'https://www.richelieu.com/documents/docsGr/105/122/1/1051221/1213328_700.jpg',
 'image_2': 'https://www.richelieu.com/documents/docsPr/91/54/32/7/9154327/1213332_700.jpg'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': '- Unique one-piece hub design reduces pad vibration and '
                'ensures smooth performance.',
 'Each': '$ 26.84',
 'Info': '',
 'Line art': '',
 'Name': 'Stick-on Non-Vacuum Pads',
 'Product number': '9156106',
 'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                             'in</p><p><strong>density: </strong>Medium</p>',
 'image_1': 'https://www.richelieu.com/documents/docsGr/105/122/4/1051224/1213343_700.jpg',
 'image_2': 'https://www.richelieu.com/documents/docsPr/91/56/10/6/9156106/1213345_700.jpg'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': '- Unique one-piece hub design reduces pad vibration and '
                'ensures smooth performance.',
 'Each': '$ 51.70',
 'Info': '',
 'Line art': '',
 'Name': 'Stick-on Non-Vacuum Pads',
 'Product number': '9156107',
 'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                             'in</p><p><strong>density: </strong>Medium</p>',
 'image_1': 'https://www.richelieu.com/documents/docsPr/91/56/10/7/9156107/1213344_700.jpg',
 'image_2': 'https://www.richelieu.com/documents/docsPr/91/56/10/7/9156107/1213346_700.jpg'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': 'Size: 2-1/2" x 14".',
 'Each': '$ 12.36',
 'Info': '',
 'Line art': '',
 'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
 'Product number': 'PC371K060',
 'Technical specifications': '',
 'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/06/0/PC371K060/1263523_700.jpg',
 'xa0': '$ 148.18'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': 'Size: 2-1/2" x 14".',
 'Each': '$ 12.36',
 'Info': '',
 'Line art': '',
 'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
 'Product number': 'PC371K080',
 'Technical specifications': '',
 'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/08/0/PC371K080/1263524_700.jpg',
 'xa0': '$ 148.18'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': 'Size: 2-1/2" x 14".',
 'Each': '$ 12.36',
 'Info': '',
 'Line art': '',
 'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
 'Product number': 'PC371K120',
 'Technical specifications': '',
 'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/12/0/PC371K120/1263526_700.jpg',
 'xa0': '$ 148.18'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': 'Size: 2-1/2" x 14".',
 'Each': '$ 12.36',
 'Info': '',
 'Line art': '',
 'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
 'Product number': 'PC371K100',
 'Technical specifications': '',
 'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/10/0/PC371K100/1263525_700.jpg',
 'xa0': '$ 148.18'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': 'Exclusive single-piece hub design reduces pad vibration and '
                'ensures smooth performance.',
 'Each': '$ 25.22',
 'Info': '',
 'Line art': '',
 'Name': '5" Non-Vacuum Disc Pad Hook-Face',
 'Product number': '91454325T',
 'Technical specifications': '',
 'image_1': 'https://www.richelieu.com/documents/docsGr/120/107/7/1201077/1419678_700.jpg'}

{'Catalog link': '',
 'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
             'Accessories / Sander Accessories',
 'Description': '- Pads mount with screws.',
 'Each': '$ 31.80',
 'Info': '',
 'Line art': '',
 'Name': 'Plates for Non-Vacuum (Grip-On) Dynabug II Disc Pads - 7.62 cm x '
         '10.79 cm (3" x 4-1/4")',
 'Product number': '9156315',
 'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                             'in</p><p><strong>density: </strong>Medium</p>',
 'image_1': 'https://www.richelieu.com/documents/docsGr/116/625/4/1166254/1280825_700.jpg',
 'xa0': '$ 179.95'}
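
Answer

Saving the data temporarily

Since your data comes from scraping, it can be treated as a stream. To mimic a stream, I use data_.pop() to fetch one item at a time. The following solution appends each item from the stream as it arrives. The header and the body of the csv are stored in separate files. The header may grow in length over time. Rows saved before such a growth step naturally cannot know about it, so they may lack some trailing commas for the missing items.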

import csv
import os

class StreamCSV:  # Python 3
    def __init__(self, header_file_name, body_file_name):
        self.header_file_name = header_file_name
        self.fbody = open(body_file_name, 'a', newline='', encoding='utf-8')
        self.csv_body = csv.writer(self.fbody)

    def add_item(self, item):
        if os.path.exists(self.header_file_name):
            with open(self.header_file_name, 'r', newline='', encoding='utf-8') as fobj:
                reader = csv.reader(fobj)
                try:
                    current_header = next(reader)
                except StopIteration:
                    current_header = []
        else:
            current_header = []
        header_set = set(current_header)
        for key in item:
            if key not in header_set:
                current_header.append(key)
        if len(header_set) < len(current_header):
            with open(self.header_file_name, 'w', newline='', encoding='utf-8') as fobj:
                writer = csv.writer(fobj)
                writer.writerow(current_header)
        item_data = [item.get(head, '') for head in current_header]
        self.csv_body.writerow(item_data)
        self.fbody.flush()  # allows peeking into the file


if __name__ == '__main__':

    data_ = [
        {
            'a': '1',
            'b': '2',
            'c': '3',
        },
        {
            'a': '6',
            'd': '1',
            'b': '3',
        },
        {
            'c': '2',
            'e': '1',
            'f': '9',
        }
    ]

    def show_saved(file_names):
        for name in file_names:
            with open(name) as fobj:
                print(name)
                print(fobj.read())

    header_file_name, body_file_name = 'header.csv', 'body.csv'
    stream_writer = StreamCSV(header_file_name, body_file_name)

    for x in range(1, 4):
        print('step:', x)
        stream_writer.add_item(data_.pop())
        show_saved([header_file_name, body_file_name])

The output shows both files growing over time:

step: 1
header.csv
c,e,f

body.csv
2,1,9

step: 2
header.csv
c,e,f,a,d,b

body.csv
2,1,9
,,,6,1,3

step: 3
header.csv
c,e,f,a,d,b

body.csv
2,1,9
,,,6,1,3
3,,,1,,2

Merging the final result

You may want to merge the header and the body in an additional step, adding the missing trailing commas.

def merge_header_body(header_file_name, body_file_name, out_file_name):
    with open(header_file_name, 'r', newline='', encoding='utf-8') as fobj:
        reader = csv.reader(fobj)
        header = next(reader)

    with open(out_file_name, 'w', newline='', encoding='utf-8') as fobj_out, \
            open(body_file_name, 'r', newline='', encoding='utf-8') as fobj_in:
        reader = csv.reader(fobj_in)
        writer = csv.writer(fobj_out)
        writer.writerow(header)
        target_length = len(header)
        for row in reader:
            diff = target_length - len(row)
            row.extend([''] * diff)
            writer.writerow(row)

out_file_name = 'merged.csv'
merge_header_body(header_file_name, body_file_name, out_file_name)

Content of merged.csv:

c,e,f,a,d,b
2,1,9,,,
,,,6,1,3
3,,,1,,2
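
If a separate merged file is not needed, the short rows can also be padded at read time. The following is a minimal sketch, assuming the header.csv / body.csv layout shown above; the function name iter_records is hypothetical and not part of the answer. It relies on the restval parameter of csv.DictReader, which supplies a value for fields missing from rows that are shorter than the field list.

```python
import csv


def iter_records(header_file_name, body_file_name):
    """Yield one dict per body row, padding short rows with ''."""
    with open(header_file_name, 'r', newline='', encoding='utf-8') as fobj:
        header = next(csv.reader(fobj))
    with open(body_file_name, 'r', newline='', encoding='utf-8') as fobj:
        # restval fills the columns that were added after a row was written
        yield from csv.DictReader(fobj, fieldnames=header, restval='')
```

Each yielded dict then has every header field as a key, so downstream code does not need to care when a column first appeared.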

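Crash recovery

If the program crashes in between, it will recover on the next run: the header is re-read from header.csv and the body file is opened in append mode, so nothing already written is lost. Let's take the same data as before and add more rows: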

for x in range(1, 4):
    print('step:', x)
    stream_writer.add_item(data_.pop())
    show_saved([header_file_name, body_file_name])

Output:

step: 1
header.csv
c,e,f,a,d,b

body.csv
2,1,9
,,,6,1,3
3,,,1,,2
2,1,9,,,

step: 2
header.csv
c,e,f,a,d,b

body.csv
2,1,9
,,,6,1,3
3,,,1,,2
2,1,9,,,
,,,6,1,3

step: 3
header.csv
c,e,f,a,d,b

body.csv
2,1,9
,,,6,1,3
3,,,1,,2
2,1,9,,,
,,,6,1,3
3,,,1,,2
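
The recovery behaviour can be checked without actually crashing anything. The sketch below is a condensed, function-based variant of the same idea; append_item and the temporary directory are assumptions made for illustration, not part of the answer's class. Because the header is re-read from disk on every call, a later run continues from whatever the two files contain, provided the header file itself was written completely.

```python
import csv
import os
import tempfile


def append_item(header_path, body_path, item):
    """Append one dict to body_path, growing the header file if needed."""
    header = []
    if os.path.exists(header_path):
        with open(header_path, 'r', newline='', encoding='utf-8') as fobj:
            header = next(csv.reader(fobj), [])
    new_keys = [key for key in item if key not in header]
    if new_keys:
        header.extend(new_keys)
        with open(header_path, 'w', newline='', encoding='utf-8') as fobj:
            csv.writer(fobj).writerow(header)
    # Closing the body file after each item flushes Python's buffer for that row.
    with open(body_path, 'a', newline='', encoding='utf-8') as fobj:
        csv.writer(fobj).writerow([item.get(name, '') for name in header])
    return header


workdir = tempfile.mkdtemp()
header_path = os.path.join(workdir, 'header.csv')
body_path = os.path.join(workdir, 'body.csv')

# "First run": two items are written, then the process is assumed to die.
append_item(header_path, body_path, {'c': '2', 'e': '1', 'f': '9'})
append_item(header_path, body_path, {'a': '6', 'd': '1', 'b': '3'})

# "Second run": no in-memory state survives, only the two files.
final_header = append_item(header_path, body_path, {'a': '1', 'g': '7'})
print(final_header)
```

Reopening the body file for every item trades some speed for simplicity; keeping it open and calling flush(), as the class above does, avoids the repeated open/close.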
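
Another answer

Edit-1 26-Dec: updated the code to use data based on your sample

Based on your requirements, I would suggest the following:

  • Write the headers to a headers.csv file
  • Write the data to a data.csv file
  • If you want to read/send the file, simply merge the two files into one
  • At program start, read the existing headers.csv file and create a field-to-index map
  • When you encounter a new key in the data, update the header map with a new index and update headers.csv
  • When writing the dict data, use the header map to build the row data

Below is a quick and dirty POC, which works fine for me: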

import csv

# newline="" lets the csv module control the line endings
try:
    f = open("headers.csv", mode="r+", newline="", encoding="utf-8")
except FileNotFoundError:
    f = open("headers.csv", mode="w+", newline="", encoding="utf-8")

f2 = open("data.csv", mode="a+", newline="", encoding="utf-8")
f.seek(0)
headers = f.readline().strip().split(",")
if headers == ['']:
    headers = []

headers_map = {}

for index, field in enumerate(headers):
    headers_map[field] = index


def update_header_dict(data):
    updated_headers = False
    for key in data.keys():
        if key not in headers_map:
            new_index = len(headers_map)
            headers_map[key] = new_index
            updated_headers = True

    if updated_headers:
        f.seek(0)
        csv.DictWriter(f, headers_map.keys()).writeheader()
        f.flush()


def get_row_data_dict(data):
    row_data = [""] * len(headers_map)

    for k, v in data.items():
        # if v and v[0] in ('=', '-'):
        #     # Mark the value as text, only needed if you want to display data in excel
        #     # else should be commented out
        #     v = "'" + v
        row_data[headers_map[k]] = v

    return row_data


def main(data):
    data_writer = csv.writer(f2)
    for row in data:
        update_header_dict(row)
        data_writer.writerow(get_row_data_dict(row))


data_ = [
    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': 'Exclusive single-piece hub design reduces pad vibration and '
                    'ensures smooth performance.',
     'Each': '$ 24.70',
     'Info': '',
     'Line art': '',
     'Name': '(5") Non-Vacuum Disc Pad Vinyl-Face',
     'Product number': '91456106T',
     'Technical specifications': '',
     'image_1': 'https://www.richelieu.com/documents/docsGr/120/107/6/1201076/1419675_700.jpg'},

    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': '',
     'Each': '$ 8.19',
     'Info': '<p><strong>material: </strong>Cork</p>',
     'Line art': '',
     'Name': 'Replacement Plate for MKT9924DB Belt Sander',
     'Product number': 'MKT4230358',
     'Technical specifications': '<p><strong>brand: </strong>Makita</p>',
     'image_1': 'https://www.richelieu.com/documents/docsGr/116/631/4/1166314/1281513_700.jpg',
     'xa0': '$ 257.80'},

    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': '',
     'Each': '$ 8.19',
     'Info': '<p><strong>material: </strong>Graphite</p>',
     'Line art': '',
     'Name': 'Replacement Plate for MKT9924DB Belt Sander',
     'Product number': 'MKT4230366',
     'Technical specifications': '<p><strong>brand: </strong>Makita</p>',
     'image_1': 'https://www.richelieu.com/documents/docsPr/MK/T4/23/03/66/MKT4230366/1281514_700.jpg',
     'xa0': '$ 257.80'},

    {'Catalog link': '',
     'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
                 'Accessories / Sander Accessories',
     'Description': '- Exclusive single-piece hub design reduces pad vibration and '
                    'ensures smooth performance.',
     'Each': '$ 38.47',
     'Info': '',
     'Line art': '',
     'Name': 'Non-Grip Vacuum Pads',
     'Product number': '9154325',
     'Technical specifications': '<p><strong>thickness: </strong>3/8 '
                                 'in</p><p><strong>density: '
                                 '</strong>Medium</p><p>&l
