How to read, format, sort, and save a csv file, without pandas
【Posted】: 2021-01-12 22:29:49
【Question】: Given the following sample data in the file test.csv
27-Mar-12,8.25,8.35,8.17,8.19,9801989
26-Mar-12,8.16,8.25,8.12,8.24,8694416
23-Mar-12,8.05,8.12,7.95,8.09,8149170
How can this file be parsed without using pandas?
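1. Open the file
2. Format the date column into a datetime date-formatted string
3. Sort all rows by column 0, the date column
4. Save back to the same file, with a header for the date column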
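With pandas, this can be accomplished with a single (long) line of code, not including the import.
It should be noted that parse_dates can be very slow if date_parser isn't used.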
import pandas as pd

(pd.read_csv('test.csv', header=None, parse_dates=[0], date_parser=lambda t: pd.to_datetime(t, format='%d-%b-%y'))
 .rename(columns={0: 'date'})
 .sort_values('date')
 .to_csv('test.csv', index=False))
Expected form
date,1,2,3,4,5
2012-03-23,8.05,8.12,7.95,8.09,8149170
2012-03-26,8.16,8.25,8.12,8.24,8694416
2012-03-27,8.25,8.35,8.17,8.19,9801989
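This question and answer were written to fill a gap in the knowledge content on Stack Overflow.
Using pandas for this task is very easy.
Without pandas, it was surprisingly difficult to come up with all the pieces needed for a complete solution.
This should be helpful to anyone curious about this task, and to students who are prohibited from using pandas.
I wouldn't mind seeing a solution with numpy, but the main point of the question is to accomplish this task with only packages from the standard library.
All three answers are acceptable solutions to the question.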
【Question discussion】:
【Answer 1】: pandas is by far the easier tool for parsing and cleaning files.
What takes 1 line of pandas took 11 lines of code here, and requires a for-loop.
This requires the following packages and functions:
csv & datetime
Methods of File Objects: .seek & .truncate
Sorting: How To
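Initially, list() was used to unpack the csv.reader object, but that was removed so the date value could be updated while iterating through the reader.
A custom key function can be supplied to sorted to customize the sort order, but I do not see a way to return a value from the lambda expression.
Originally, key=lambda row: datetime.strptime(row[0], '%Y-%m-%d') was used, but it has been removed, because the updated date column no longer contains month names.
If the date column contains month names, it will not sort correctly without a custom sort key.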
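To illustrate the point about month names, here is a minimal sketch. The first three dates come from the question's sample data; "02-Apr-12" is a made-up extra value added only to expose the problem. It assumes an English locale, since %b is locale-dependent.

```python
from datetime import datetime

# First three values are from the question; "02-Apr-12" is a hypothetical addition.
raw = ["27-Mar-12", "26-Mar-12", "23-Mar-12", "02-Apr-12"]


def parse(text):
    """Parse a '%d-%b-%y' string into a datetime (assumes English month names)."""
    return datetime.strptime(text, "%d-%b-%y")


# Plain string sort compares characters, so the April date lands first.
plain = sorted(raw)

# A custom key sorts the original strings chronologically.
by_date = sorted(raw, key=parse)

# ISO strings (YYYY-MM-DD) sort chronologically without any key.
iso = sorted(parse(text).date().isoformat() for text in raw)

print(plain)
print(by_date)
print(iso)
```

This is why the sort in the code that follows can use the plain string in column 0: by the time sorted runs, the column already holds ISO-formatted dates.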
import csv
from datetime import datetime

# open the file for reading and writing
with open('test.csv', mode='r+', newline='') as f:
    # create a reader and writer object
    reader, writer = csv.reader(f), csv.writer(f)
    data = list()
    # iterate through the reader and update column 0 to a datetime date string
    for row in reader:
        # update column 0 to a datetime date string
        row[0] = datetime.strptime(row[0], "%d-%b-%y").date().isoformat()
        # append the row to data
        data.append(row)
    # sort all of the rows, based on date, with a lambda expression
    data = sorted(data, key=lambda row: row[0])
    # change the stream position to the given byte offset
    f.seek(0)
    # truncate the file size
    f.truncate()
    # add a header to data
    data.insert(0, ['date', 1, 2, 3, 4, 5])
    # write data to the file
    writer.writerows(data)
Updated test.csv
date,1,2,3,4,5
2012-03-23,8.05,8.12,7.95,8.09,8149170
2012-03-26,8.16,8.25,8.12,8.24,8694416
2012-03-27,8.25,8.35,8.17,8.19,9801989
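The seek and truncate steps can be tried in isolation. The sketch below uses a temporary file and a hypothetical helper, rewrite_in_place, which is not part of the original answer; it only shows how a single 'r+' handle can read the old contents and then replace them.

```python
import os
import tempfile


def rewrite_in_place(path, new_text):
    """Read a file, then replace its contents through the same 'r+' handle."""
    with open(path, mode="r+", newline="") as f:
        old_text = f.read()  # the stream position is now at the end of the file
        f.seek(0)            # move back to the start
        f.truncate()         # cut the file at the current position (0 bytes)
        f.write(new_text)
    return old_text


# Build a throwaway file whose original contents are longer than the new ones.
fd, demo_path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(demo_path, "w", newline="") as f:
    f.write("a long original line\nanother original line\n")

old = rewrite_in_place(demo_path, "short\n")

with open(demo_path, newline="") as f:
    result = f.read()
os.remove(demo_path)

print(repr(result))
```

Without the truncate call, writing shorter content from position 0 would leave the tail of the old content behind in the file.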
%time test
import pandas as pd
import pandas_datareader as web

# test data with 1M rows; ticker (a stock symbol) is not defined in the original post
df = web.DataReader(ticker, data_source='yahoo', start='1980-01-01', end='2020-09-27').drop(columns=['Adj Close']).reset_index().sort_values('High', ascending=False)
df.Date = df.Date.dt.strftime('%d-%b-%y')
df = pd.concat([df]*100)
df.to_csv('test.csv', index=False, header=False)
Tests
# pandas test with date_parser
%time pandas_test('test.csv')
[out]:
Wall time: 17.9 s
# pandas test without the date_parser parameter
%time pandas_test('test.csv')
[out]:
Wall time: 1min 17s
# from Paddy Alton
%time paddy('test.csv')
[out]:
Wall time: 15.9 s
# from Trenton
%time trenton('test.csv')
[out]:
Wall time: 17.7 s
# from sammywemmy with functions updated to return the correct date format
%time sammy('test.csv')
[out]:
Wall time: 22.2 s
%time sammy2('test.csv')
[out]:
Wall time: 22.2 s
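%time is an IPython magic, so it is not available in a plain script. A roughly comparable wall-clock measurement can be sketched with time.perf_counter; time_call is a hypothetical helper and was not used for the numbers above.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; in the benchmark this would be e.g. time_call(trenton, 'test.csv').
result, seconds = time_call(sorted, ["2012-03-27", "2012-03-23", "2012-03-26"])
print(f"Wall time: {seconds:.6f} s")
```

Because every function under test overwrites its input file with converted dates and a header, the test file presumably has to be regenerated before each measurement.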
Test functions
from operator import itemgetter
import csv
import pandas as pd
from datetime import datetime


def pandas_test(file):
    (pd.read_csv(file, header=None, parse_dates=[0], date_parser=lambda t: pd.to_datetime(t, format='%d-%b-%y'))
     .rename(columns={0: 'date'})
     .sort_values('date')
     .to_csv(file, index=False))


def trenton(file):
    with open(file, mode='r+', newline='') as f:
        reader, writer = csv.reader(f), csv.writer(f)
        data = list()
        for row in reader:
            row[0] = datetime.strptime(row[0], "%d-%b-%y").date().isoformat()
            data.append(row)
        data = sorted(data, key=lambda row: row[0])
        f.seek(0)
        f.truncate()
        data.insert(0, ['date', 1, 2, 3, 4, 5])
        writer.writerows(data)


def paddy(file):
    def format_date(date: str) -> str:
        formatted_date = datetime.strptime(date, "%d-%b-%y").date().isoformat()
        return formatted_date

    with open(file, "r") as f:
        lines = f.readlines()
    records = [[value for value in line.split(",")] for line in lines]
    for record in records:
        record[0] = format_date(record[0])
    sorted_records = sorted(records, key=lambda r: r[0])
    prepared_lines = [",".join(record).strip("\n") for record in sorted_records]
    field_names = "date,1,2,3,4,5"
    prepared_lines.insert(0, field_names)
    prepared_data = "\n".join(prepared_lines)
    with open(file, "w") as f:
        f.write(prepared_data)


def sammy(file):
    # updated with .date().isoformat() to return the correct format
    with open(file) as csvfile:
        fieldnames = ["date", 1, 2, 3, 4, 5]
        reader = csv.DictReader(csvfile, fieldnames=fieldnames)
        mapping = list(reader)
        mapping = [
            {
                key: datetime.strptime(value, "%d-%b-%y").date().isoformat()
                if key == "date" else value
                for key, value in entry.items()
            }
            for entry in mapping
        ]
        mapping = sorted(mapping, key=itemgetter("date"))
    with open(file, mode="w", newline="") as csvfile:
        fieldnames = mapping[0].keys()
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in mapping:
            writer.writerow(row)


def sammy2(file):
    # updated with .date().isoformat() to return the correct format
    with open(file) as csvfile:
        reader = csv.reader(csvfile, delimiter=",")
        mapping = dict(enumerate(reader))
        num_of_cols = len(mapping[0])
        fieldnames = ["date" if n == 0 else n
                      for n in range(num_of_cols)]
        mapping = [
            [datetime.strptime(val, "%d-%b-%y").date().isoformat()
             if ind == 0 else val
             for ind, val in enumerate(value)
             ]
            for key, value in mapping.items()
        ]
        mapping = sorted(mapping, key=itemgetter(0))
    with open(file, mode="w", newline="") as csvfile:
        csvwriter = csv.writer(csvfile, delimiter=",")
        csvwriter.writerow(fieldnames)
        for row in mapping:
            csvwriter.writerow(row)
【Discussion】:
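I like the use of seek and truncate. I've seen a few questions on SO from people who "just want to save" changes to a CSV, but I didn't have a good solution; I always recommended "read-old, write-new, mv/rename old→new". I've favorited this, thanks!
【Answer 2】: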
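As the OP stated, Pandas makes this easy; an alternative is to use the DictReader and DictWriter options. It is still more verbose than using Pandas (the beauty of abstraction here, with Pandas doing the heavy lifting for us).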
import csv
from datetime import datetime
from operator import itemgetter

with open("test.csv") as csvfile:
    fieldnames = ["date", 1, 2, 3, 4, 5]
    reader = csv.DictReader(csvfile, fieldnames=fieldnames)
    mapping = list(reader)
    # .date().isoformat() yields 2012-03-23 rather than 2012-03-23 00:00:00
    mapping = [
        {
            key: datetime.strptime(value, "%d-%b-%y").date().isoformat()
            if key == "date" else value
            for key, value in entry.items()
        }
        for entry in mapping
    ]
    mapping = sorted(mapping, key=itemgetter("date"))

with open("test.csv", mode="w", newline="") as csvfile:
    fieldnames = mapping[0].keys()
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in mapping:
        writer.writerow(row)
If the field names are not known beforehand, we can use the csv.reader and csv.writer options instead:
import csv
from datetime import datetime
from operator import itemgetter

with open("test.csv") as csvfile:
    reader = csv.reader(csvfile, delimiter=",")
    mapping = dict(enumerate(reader))
    num_of_cols = len(mapping[0])
    # name column 0 "date" and number the remaining columns
    fieldnames = ["date" if n == 0 else n
                  for n in range(num_of_cols)]
    # .date().isoformat() yields 2012-03-23 rather than 2012-03-23 00:00:00
    mapping = [
        [datetime.strptime(val, "%d-%b-%y").date().isoformat()
         if ind == 0 else val
         for ind, val in enumerate(value)
         ]
        for key, value in mapping.items()
    ]
    mapping = sorted(mapping, key=itemgetter(0))

with open("test.csv", mode="w", newline="") as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=",")
    csvwriter.writerow(fieldnames)
    for row in mapping:
        csvwriter.writerow(row)
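One way to try the DictReader/DictWriter approach without touching a file is to run it on in-memory text. The sketch below is an assumption-laden illustration, not the answerer's code: it uses io.StringIO in place of test.csv, a shortened three-column sample, and string field names.

```python
import csv
import io
from datetime import datetime
from operator import itemgetter

# Shortened sample (three columns instead of six) standing in for test.csv.
sample = "27-Mar-12,8.25,8.35\n23-Mar-12,8.05,8.12\n"
fieldnames = ["date", "1", "2"]

# Explicit fieldnames tell DictReader that the first line is data, not a header.
reader = csv.DictReader(io.StringIO(sample), fieldnames=fieldnames)
rows = []
for entry in reader:
    entry["date"] = datetime.strptime(entry["date"], "%d-%b-%y").date().isoformat()
    rows.append(entry)
rows.sort(key=itemgetter("date"))

# Write to an in-memory buffer; lineterminator keeps the output easy to compare.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
output = out.getvalue()
print(output)
```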
【Discussion】:
【Answer 3】: Using as few imports as possible:
from datetime import datetime


def format_date(date: str) -> str:
    formatted_date = datetime.strptime(date, "%d-%b-%y").date().isoformat()
    return formatted_date


# read in the CSV
with open("test.csv", "r") as file:
    lines = file.readlines()
    records = [[value for value in line.split(",")] for line in lines]

# reformat the first field in each record
for record in records:
    record[0] = format_date(record[0])

# having formatted the dates, sort records by first (date) field:
sorted_records = sorted(records, key=lambda r: r[0])

# join values with commas once more, removing newline characters
prepared_lines = [",".join(record).strip("\n") for record in sorted_records]

# create a header row
field_names = "date,1,2,3,4,5"

# prepend the header row
prepared_lines.insert(0, field_names)
prepared_data = "\n".join(prepared_lines)

# write out the CSV
with open("test.csv", "w") as file:
    file.write(prepared_data)
【Discussion】:
You can skip .strip("\n") and "\n".join(), and instead use file.write(field_names+"\n") followed by file.writelines(prepared_lines).
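A minimal sketch of that suggestion, using in-memory buffers instead of test.csv and a shortened two-column sample (both are assumptions made for illustration). It relies on every input line ending with a newline, because the newline that split(",") leaves on the last field is what separates the output rows.

```python
import io
from datetime import datetime

# Shortened stand-in for the contents of test.csv.
source = io.StringIO("27-Mar-12,8.25\n23-Mar-12,8.05\n")
lines = source.readlines()

# The last field of each record still carries its trailing "\n".
records = [line.split(",") for line in lines]
for record in records:
    record[0] = datetime.strptime(record[0], "%d-%b-%y").date().isoformat()

# No .strip("\n") here: each joined line keeps its own newline.
prepared_lines = [",".join(record) for record in sorted(records, key=lambda r: r[0])]

field_names = "date,1"
target = io.StringIO()
target.write(field_names + "\n")
target.writelines(prepared_lines)  # writelines adds no separators of its own
written = target.getvalue()
print(written)
```

Unlike the "\n".join version, the output here ends with a trailing newline.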
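There are definitely several ways to handle the read/write parts of the code that would be shorter. Also, in hindsight, format_date should perhaps take a record as input and format its first element, which would allow the more elegant formatted_records = list(map(format_date, records)) in place of the loop.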
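That variant might look like the following sketch. The record-based signature of format_date is an assumption drawn from the comment, and the two-field sample records are shortened for illustration.

```python
from datetime import datetime


def format_date(record):
    """Return a copy of record with its first field converted to an ISO date."""
    first, *rest = record
    return [datetime.strptime(first, "%d-%b-%y").date().isoformat(), *rest]


# Shortened sample records (date plus one value).
records = [["27-Mar-12", "8.25"], ["23-Mar-12", "8.05"]]

# map applies format_date to each record, replacing the explicit for-loop.
formatted_records = list(map(format_date, records))
print(formatted_records)
```

Returning a new list instead of mutating the record in place is a design choice of this sketch; it leaves records untouched.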