Avoiding bulk data export to CSV when the headers are dynamic
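I've stumbled upon a seemingly simple situation that I can't find a solution to.
What I want to do is simple: write some data to a .csv file that has:
- dynamic headers
- some data
The way I do it now seems to be the only solution I can think of:
- store the data I need in a list of dictionaries
- get the keys() of each dictionary in that list and add them to a set() (this will be the header)
- write the data to the file using writer.writerows(data)
Basically, a simple MCVE might look like this: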
from csv import DictWriter

RESULT_FILE = 'test_result.csv'


def get_fieldnames(data):
    fieldnames = set()
    for item in data:
        fieldnames.update(item.keys())
    return fieldnames


def main(data):
    fieldnames = get_fieldnames(data)
    with open(RESULT_FILE, 'a', newline='', encoding='utf-8') as f:
        writer = DictWriter(f, fieldnames=fieldnames, delimiter=',')
        writer.writeheader()
        writer.writerows(data)


if __name__ == '__main__':
    data_ = [
        {
            'a': '1',
            'b': '2',
            'c': '3',
        },
        {
            'a': '6',
            'd': '1',
            'b': '3',
        },
        {
            'c': '2',
            'e': '1',
            'f': '9',
        }
    ]
    main(data_)
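A side note on the MCVE above: a set has no defined order, so the column order of the header can differ from one run to the next. A minimal sketch of one way to keep the keys in first-seen order, assuming Python 3.7+ (where dict preserves insertion order); the helper name get_fieldnames_ordered is made up for this illustration:

```python
def get_fieldnames_ordered(data):
    """Return every key found in data, in first-seen order, without duplicates."""
    # dict.fromkeys keeps the first occurrence of each key and drops repeats.
    return list(dict.fromkeys(key for item in data for key in item))


sample = [
    {'a': '1', 'b': '2', 'c': '3'},
    {'a': '6', 'd': '1', 'b': '3'},
    {'c': '2', 'e': '1', 'f': '9'},
]
print(get_fieldnames_ordered(sample))  # ['a', 'b', 'c', 'd', 'e', 'f']
```

This still needs the whole list up front, so it does not address the memory or crash concerns; it only makes the header order reproducible.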
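Now, here is what I don't like about this:
- The list can get really large (~100k dicts, each containing about 10 fields).
- If the program crashes while adding the 66,666th dict to the list, everything is lost and I have no data in the csv either. I can't avoid this, because I have to wait until all the data has been added to the list to know every possible header.
How can I avoid exporting all the data to the csv in one go when the headers are dynamic?
As requested, the real data looks like this: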
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': 'Exclusive single-piece hub design reduces pad vibration and '
'ensures smooth performance.',
'Each': '$ 24.70',
'Info': '',
'Line art': '',
'Name': '(5") Non-Vacuum Disc Pad Vinyl-Face',
'Product number': '91456106T',
'Technical specifications': '',
'image_1': 'https://www.richelieu.com/documents/docsGr/120/107/6/1201076/1419675_700.jpg'}
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': '',
'Each': '$ 8.19',
'Info': '<p><strong>material: </strong>Cork</p>',
'Line art': '',
'Name': 'Replacement Plate for MKT9924DB Belt Sander',
'Product number': 'MKT4230358',
'Technical specifications': '<p><strong>brand: </strong>Makita</p>',
'image_1': 'https://www.richelieu.com/documents/docsGr/116/631/4/1166314/1281513_700.jpg',
'xa0': '$ 257.80'}
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': '',
'Each': '$ 8.19',
'Info': '<p><strong>material: </strong>Graphite</p>',
'Line art': '',
'Name': 'Replacement Plate for MKT9924DB Belt Sander',
'Product number': 'MKT4230366',
'Technical specifications': '<p><strong>brand: </strong>Makita</p>',
'image_1': 'https://www.richelieu.com/documents/docsPr/MK/T4/23/03/66/MKT4230366/1281514_700.jpg',
'xa0': '$ 257.80'}
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': '- Exclusive single-piece hub design reduces pad vibration and '
'ensures smooth performance.',
'Each': '$ 38.47',
'Info': '',
'Line art': '',
'Name': 'Non-Grip Vacuum Pads',
'Product number': '9154325',
'Technical specifications': '<p><strong>thickness: </strong>3/8 '
'in</p><p><strong>density: '
'</strong>Medium</p><p><strong>nap: '
'</strong>Short</p>',
'image_1': 'https://www.richelieu.com/documents/docsPr/91/54/32/5/9154325/1213330_700.jpg',
'image_2': 'https://www.richelieu.com/documents/docsPr/91/54/32/5/9154325/1213331_700.jpg'}
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': '- Exclusive single-piece hub design reduces pad vibration and '
'ensures smooth performance.',
'Each': '$ 52.92',
'Info': '',
'Line art': '',
'Name': 'Non-Grip Vacuum Pads',
'Product number': '9154327',
'Technical specifications': '<p><strong>thickness: </strong>3/8 '
'in</p><p><strong>density: '
'</strong>Medium</p><p><strong>nap: '
'</strong>Short</p>',
'image_1': 'https://www.richelieu.com/documents/docsGr/105/122/1/1051221/1213328_700.jpg',
'image_2': 'https://www.richelieu.com/documents/docsPr/91/54/32/7/9154327/1213332_700.jpg'}
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': '- Unique one-piece hub design reduces pad vibration and '
'ensures smooth performance.',
'Each': '$ 26.84',
'Info': '',
'Line art': '',
'Name': 'Stick-on Non-Vacuum Pads',
'Product number': '9156106',
'Technical specifications': '<p><strong>thickness: </strong>3/8 '
'in</p><p><strong>density: </strong>Medium</p>',
'image_1': 'https://www.richelieu.com/documents/docsGr/105/122/4/1051224/1213343_700.jpg',
'image_2': 'https://www.richelieu.com/documents/docsPr/91/56/10/6/9156106/1213345_700.jpg'}
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': '- Unique one-piece hub design reduces pad vibration and '
'ensures smooth performance.',
'Each': '$ 51.70',
'Info': '',
'Line art': '',
'Name': 'Stick-on Non-Vacuum Pads',
'Product number': '9156107',
'Technical specifications': '<p><strong>thickness: </strong>3/8 '
'in</p><p><strong>density: </strong>Medium</p>',
'image_1': 'https://www.richelieu.com/documents/docsPr/91/56/10/7/9156107/1213344_700.jpg',
'image_2': 'https://www.richelieu.com/documents/docsPr/91/56/10/7/9156107/1213346_700.jpg'}
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': 'Size: 2-1/2" x 14".',
'Each': '$ 12.36',
'Info': '',
'Line art': '',
'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
'Product number': 'PC371K060',
'Technical specifications': '',
'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/06/0/PC371K060/1263523_700.jpg',
'xa0': '$ 148.18'}
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': 'Size: 2-1/2" x 14".',
'Each': '$ 12.36',
'Info': '',
'Line art': '',
'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
'Product number': 'PC371K080',
'Technical specifications': '',
'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/08/0/PC371K080/1263524_700.jpg',
'xa0': '$ 148.18'}
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': 'Size: 2-1/2" x 14".',
'Each': '$ 12.36',
'Info': '',
'Line art': '',
'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
'Product number': 'PC371K120',
'Technical specifications': '',
'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/12/0/PC371K120/1263526_700.jpg',
'xa0': '$ 148.18'}
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': 'Size: 2-1/2" x 14".',
'Each': '$ 12.36',
'Info': '',
'Line art': '',
'Name': 'Sandpaper Belt 2½ " x 14" for Compact Belt Sander PC371 or PC371K',
'Product number': 'PC371K100',
'Technical specifications': '',
'image_1': 'https://www.richelieu.com/documents/docsPr/PC/37/1K/10/0/PC371K100/1263525_700.jpg',
'xa0': '$ 148.18'}
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': 'Exclusive single-piece hub design reduces pad vibration and '
'ensures smooth performance.',
'Each': '$ 25.22',
'Info': '',
'Line art': '',
'Name': '5" Non-Vacuum Disc Pad Hook-Face',
'Product number': '91454325T',
'Technical specifications': '',
'image_1': 'https://www.richelieu.com/documents/docsGr/120/107/7/1201077/1419678_700.jpg'}
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': '- Pads mount with screws.',
'Each': '$ 31.80',
'Info': '',
'Line art': '',
'Name': 'Plates for Non-Vacuum (Grip-On) Dynabug II Disc Pads - 7.62 cm x '
'10.79 cm (3" x 4-1/4")',
'Product number': '9156315',
'Technical specifications': '<p><strong>thickness: </strong>3/8 '
'in</p><p><strong>density: </strong>Medium</p>',
'image_1': 'https://www.richelieu.com/documents/docsGr/116/625/4/1166254/1280825_700.jpg',
'xa0': '$ 179.95'}
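Answer
Saving the data temporarily
Since your data comes from scraping, it can be treated as a stream. To mimic a stream, I use data_.pop() to fetch one item at a time. The following solution appends each item from the stream as it arrives. The header and the body of the csv are stored in separate files. The header may grow in length over time. Rows saved before such a growth step naturally cannot know about it, so they may lack some trailing commas for the missing items.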
import csv
import os


class StreamCSV:  # Python 3

    def __init__(self, header_file_name, body_file_name):
        self.header_file_name = header_file_name
        self.fbody = open(body_file_name, 'a', newline='', encoding='utf-8')
        self.csv_body = csv.writer(self.fbody)

    def add_item(self, item):
        # Re-read the current header on every call so a restart picks it up.
        if os.path.exists(self.header_file_name):
            with open(self.header_file_name, 'r', newline='', encoding='utf-8') as fobj:
                reader = csv.reader(fobj)
                try:
                    current_header = next(reader)
                except StopIteration:
                    current_header = []
        else:
            current_header = []
        # header_set is a snapshot taken before any new keys are appended.
        header_set = set(current_header)
        for key in item:
            if key not in header_set:
                current_header.append(key)
        # Rewrite the header file only if the header grew.
        if len(header_set) < len(current_header):
            with open(self.header_file_name, 'w', newline='', encoding='utf-8') as fobj:
                writer = csv.writer(fobj)
                writer.writerow(current_header)
        item_data = [item.get(head, '') for head in current_header]
        self.csv_body.writerow(item_data)
        self.fbody.flush()  # allows peeking into the file


if __name__ == '__main__':
    data_ = [
        {
            'a': '1',
            'b': '2',
            'c': '3',
        },
        {
            'a': '6',
            'd': '1',
            'b': '3',
        },
        {
            'c': '2',
            'e': '1',
            'f': '9',
        }
    ]

    def show_saved(file_names):
        for name in file_names:
            with open(name) as fobj:
                print(name)
                print(fobj.read())

    header_file_name, body_file_name = 'header.csv', 'body.csv'
    stream_writer = StreamCSV(header_file_name, body_file_name)
    for x in range(1, 4):
        print('step:', x)
        stream_writer.add_item(data_.pop())
        show_saved([header_file_name, body_file_name])
This shows the output growing over time:
step: 1
header.csv
c,e,f
body.csv
2,1,9
step: 2
header.csv
c,e,f,a,d,b
body.csv
2,1,9
,,,6,1,3
step: 3
header.csv
c,e,f,a,d,b
body.csv
2,1,9
,,,6,1,3
3,,,1,,2
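Merging the final result
You may want to merge the header and the body in an additional step, which also adds the missing trailing commas.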
def merge_header_body(header_file_name, body_file_name, out_file_name):
    with open(header_file_name, 'r', newline='', encoding='utf-8') as fobj:
        reader = csv.reader(fobj)
        header = next(reader)
    with open(out_file_name, 'w', newline='', encoding='utf-8') as fobj_out, \
            open(body_file_name, 'r', newline='', encoding='utf-8') as fobj_in:
        reader = csv.reader(fobj_in)
        writer = csv.writer(fobj_out)
        writer.writerow(header)
        target_length = len(header)
        for row in reader:
            # Pad rows written before the header grew to its final length.
            diff = target_length - len(row)
            row.extend([''] * diff)
            writer.writerow(row)


out_file_name = 'merged.csv'
merge_header_body(header_file_name, body_file_name, out_file_name)
Contents of merged.csv:
c,e,f,a,d,b
2,1,9,,,
,,,6,1,3
3,,,1,,2
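If the goal is only to read the rows back, the short rows can also be handled at read time instead of by a merge step. The following is a minimal sketch, not part of the answer above: it assumes a helper named read_stream_csv (hypothetical) and uses in-memory text in place of header.csv and body.csv. It relies on csv.DictReader accepting explicit fieldnames and filling missing trailing fields with restval.

```python
import csv
import io


def read_stream_csv(header_fobj, body_fobj):
    """Yield one dict per body row; columns missing from short rows become ''."""
    header = next(csv.reader(header_fobj), [])
    # restval fills in fields that a short row does not have.
    yield from csv.DictReader(body_fobj, fieldnames=header, restval='')


# Stand-ins for header.csv and body.csv after the three steps shown earlier.
header_text = 'c,e,f,a,d,b\r\n'
body_text = '2,1,9\r\n,,,6,1,3\r\n3,,,1,,2\r\n'

rows = list(read_stream_csv(io.StringIO(header_text, newline=''),
                            io.StringIO(body_text, newline='')))
for row in rows:
    print(dict(row))
```

With real files, the two StringIO objects would be replaced by files opened with newline='' and encoding='utf-8', as in the code above.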
Crash recovery
If the program crashes in between, it will recover. Let's take the same data as before and add more rows:
for x in range(1, 4):
    print('step:', x)
    stream_writer.add_item(data_.pop())
    show_saved([header_file_name, body_file_name])
Output:
step: 1
header.csv
c,e,f,a,d,b
body.csv
2,1,9
,,,6,1,3
3,,,1,,2
2,1,9,,,
step: 2
header.csv
c,e,f,a,d,b
body.csv
2,1,9
,,,6,1,3
3,,,1,,2
2,1,9,,,
,,,6,1,3
step: 3
header.csv
c,e,f,a,d,b
body.csv
2,1,9
,,,6,1,3
3,,,1,,2
2,1,9,,,
,,,6,1,3
3,,,1,,2
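One gap in this recovery story is that add_item rewrites header.csv in place with mode 'w', so a crash in the middle of that write could leave the header file empty or truncated. The following is a minimal sketch of a safer rewrite, assuming a helper named write_header_atomically (hypothetical, not part of the answer above). It writes the new header to a temporary file in the same directory and then swaps it into place with os.replace.

```python
import csv
import os
import tempfile


def write_header_atomically(header_file_name, header):
    """Write header to a temp file next to the target, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(header_file_name))
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', newline='', encoding='utf-8') as fobj:
            csv.writer(fobj).writerow(header)
            fobj.flush()
            os.fsync(fobj.fileno())  # push the bytes to disk before the swap
        # Readers see either the old header or the new one, never a partial file.
        os.replace(tmp_name, header_file_name)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

In StreamCSV.add_item, the block that opens the header file with 'w' could call this helper instead. The body file is only ever appended to, so it does not need the same treatment.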
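Another answer
Edit-1, 26 Dec: updated the code to build the data from your sample data.
Based on your requirements, I would suggest the following:
- Write the headers to a headers.csv file
- Write the data to a data.csv file
- If you want to read or send the file, just merge the two files into one
- At the start of the program, read the existing headers.csv file and build a field-to-index map
- When you encounter a new key in the data, update the header map with a new index and update headers.csv
- When writing the dictionary data, use the header map to build the row data
Below is a quick and dirty POC that works fine for me: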
import csv

# newline='' is what the csv module expects for files it reads or writes.
try:
    f = open("headers.csv", mode="r+", newline="", encoding="utf-8")
except FileNotFoundError:
    f = open("headers.csv", mode="w+", newline="", encoding="utf-8")
f2 = open("data.csv", mode="a+", newline="", encoding="utf-8")

f.seek(0)
headers = f.readline().strip().split(",")
if headers == ['']:
    headers = []

# Map each known field name to its column index.
headers_map = {}
for index, field in enumerate(headers):
    headers_map[field] = index


def update_header_dict(data):
    updated_headers = False
    for key in data.keys():
        if key not in headers_map:
            new_index = len(headers_map)
            headers_map[key] = new_index
            updated_headers = True
    if updated_headers:
        # The header only grows, so overwriting from the start is enough.
        f.seek(0)
        csv.DictWriter(f, headers_map.keys()).writeheader()
        f.flush()


def get_row_data_dict(data):
    row_data = [""] * len(headers_map)
    for k, v in data.items():
        # if v and v[0] in ('=', '-'):
        #     # Mark the value as text, only needed if you want to display data in excel
        #     # else should be commented out
        #     v = "'" + v
        row_data[headers_map[k]] = v
    return row_data


def main(data):
    data_writer = csv.writer(f2)
    for row in data:
        update_header_dict(row)
        data_writer.writerow(get_row_data_dict(row))


data_ = [
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': 'Exclusive single-piece hub design reduces pad vibration and '
'ensures smooth performance.',
'Each': '$ 24.70',
'Info': '',
'Line art': '',
'Name': '(5") Non-Vacuum Disc Pad Vinyl-Face',
'Product number': '91456106T',
'Technical specifications': '',
'image_1': 'https://www.richelieu.com/documents/docsGr/120/107/6/1201076/1419675_700.jpg'},
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': '',
'Each': '$ 8.19',
'Info': '<p><strong>material: </strong>Cork</p>',
'Line art': '',
'Name': 'Replacement Plate for MKT9924DB Belt Sander',
'Product number': 'MKT4230358',
'Technical specifications': '<p><strong>brand: </strong>Makita</p>',
'image_1': 'https://www.richelieu.com/documents/docsGr/116/631/4/1166314/1281513_700.jpg',
'xa0': '$ 257.80'},
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': '',
'Each': '$ 8.19',
'Info': '<p><strong>material: </strong>Graphite</p>',
'Line art': '',
'Name': 'Replacement Plate for MKT9924DB Belt Sander',
'Product number': 'MKT4230366',
'Technical specifications': '<p><strong>brand: </strong>Makita</p>',
'image_1': 'https://www.richelieu.com/documents/docsPr/MK/T4/23/03/66/MKT4230366/1281514_700.jpg',
'xa0': '$ 257.80'},
{'Catalog link': '',
'Category': 'Tools and Shop Supplies / Workshop Accessories / Tool '
'Accessories / Sander Accessories',
'Description': '- Exclusive single-piece hub design reduces pad vibration and '
'ensures smooth performance.',
'Each': '$ 38.47',
'Info': '',
'Line art': '',
'Name': 'Non-Grip Vacuum Pads',
'Product number': '9154325',
'Technical specifications': '<p><strong>thickness: </strong>3/8 '
'in</p><p><strong>density: '
'</strong>Medium</p><p>&l