What is a faster, more Pythonic way to read a CSV and create a DataFrame from it?
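Input: a CSV with 50,000 rows; each row has 910 columns of 0/1 values. Output: a DataFrame to feed into a CNN.
I wrote code that reads the CSV row by row. For each row, I split the data into two parts: neurons (900 columns) and labels (10 columns). Since these are lists, I convert them to NumPy arrays. I then move to the next row, do the same, and stack the arrays, eventually getting the four usual datasets: x_train, x_test, y_train, y_test.
My code works, since I tested it on a small CSV with only 6 rows. But when I run it on the real 50,000-row dataset, it takes forever at the step right after array initialization, where the rows are converted.
So I would like to know whether there is a faster way to do this conversion, or whether I should simply wait.
Here is my code: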
import numpy as np
import pandas as pd
import time
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
# Read the dataset from the CSV file into a dataframe
df = pd.read_csv("bci_dataset_labelled.csv")
start_init = time.time()
# np.int was removed in NumPy 1.24; the built-in int is the equivalent dtype
xvalues = np.zeros((900,), dtype=int)
yvalues = np.zeros((10,), dtype=int)
print("--- Arrays initialized in %s seconds ---" % (time.time() - start_init))
start_conversion = time.time()
for row in df.itertuples(index=False):
    # separate the neurons from the labels
    x = list(row[:900])
    y = list(row[900:])
    # convert the lists to numpy arrays
    x = np.array(x)
    y = np.array(y)
    xvalues = np.vstack((xvalues, x))
    yvalues = np.vstack((yvalues, y))
print("--- CSV rows converted to dataframe in %s seconds ---" % (time.time() - start_conversion))
start_split = time.time()
x_train, x_test, y_train, y_test = train_test_split(xvalues, yvalues, test_size=0.2)
print("--- Dataframe split into training and testing datasets in %s seconds ---" % (time.time() - start_split))
num_classes = y_test.shape[1]
num_neurons = x_train[0].shape[0]
# define baseline model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(
        num_neurons,
        input_dim = num_neurons,
        kernel_initializer = 'normal',
        activation = 'relu'
    ))
    model.add(Dense(
        num_classes,
        kernel_initializer = 'normal',
        activation = 'softmax'
    ))
    # compile model
    model.compile(
        loss = 'categorical_crossentropy',
        optimizer = 'adam',
        metrics = ['accuracy'])
    return model
# build the model
model = baseline_model()
# fit the model
model.fit(x_train, y_train, validation_data = (x_test, y_test),
epochs = 10, batch_size = 200, verbose = 2)
# final evaluation of the model
scores = model.evaluate(x_test, y_test, verbose=0)
print("Baseline error: %0.2f%%" % (100-scores[1]*100))
It just hangs here:
Rachayitas-MacBook-Pro:bci_hp rachayitagiri$ python3 binarycnn.py
Using TensorFlow backend.
--- Arrays initialized in 2.4080276489257812e-05 seconds ---
Any suggestions would be greatly appreciated. Thanks!
Edit: pasted the console output as text instead of an image. Thanks for the suggestion.
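From what I can see, your problem is not the read_csv function but the way you extract information from the DataFrame. Instead of reading the DataFrame row by row, which is very expensive, you can get xvalues and yvalues directly from it. DataFrames let you do this in a highly optimized way.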
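To illustrate where the cost comes from, here is a minimal sketch (not the asker's actual data; the `rows` list and both function names are made up for the example). Every `np.vstack` call inside the loop copies all rows accumulated so far, so the total work grows roughly quadratically with the number of rows, whereas a single conversion copies each row once. Note also that the question's version starts from a row of zeros, which remains in the result as an extra all-zero sample; the sketch starts from an empty array instead.

```python
import numpy as np

def stack_in_loop(rows):
    """Grow the array one row at a time, as in the question."""
    # Start with zero rows rather than a dummy row of zeros.
    out = np.zeros((0, len(rows[0])), dtype=int)
    for row in rows:
        # Each call allocates a new array and copies every previous row.
        out = np.vstack((out, np.array(row)))
    return out

def stack_once(rows):
    """Convert all rows in one call: one allocation, one copy."""
    return np.array(rows, dtype=int)

# Hypothetical stand-in for the CSV rows: 200 rows of 910 zeros or ones.
rows = [[i % 2] * 910 for i in range(200)]

slow = stack_in_loop(rows)
fast = stack_once(rows)
print(slow.shape, fast.shape)
```

Both functions return the same array; only the amount of copying differs, which is why the loop is fine on 6 rows and impractical on 50,000.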
As I understand it, your X values are in the first 900 columns and the Y values come after them. Here is how I would do it:
import pandas as pd
import numpy as np
import time
start_init = time.time()
df = pd.DataFrame(np.random.randint(0,100,size=(50000, 910)))
print("--- DataFrame initialized in %s seconds ---" % (time.time() - start_init))
start_conversion = time.time()
# Position-based slicing: iloc excludes the end index, whereas label-based
# loc[:, :900] would include column 900 and return 901 columns.
x = df.iloc[:, :900]  # x values: the first 900 columns of each row
y = df.iloc[:, 900:]  # y values: the remaining 10 columns
# All that's left is to convert them to NumPy arrays
xvalues = x.values
yvalues = y.values
print("--- Took data out of DataFrame in %s seconds ---" % (time.time() - start_conversion))
print(x.shape, y.shape)
Running this code prints:
--- DataFrame initialized in 0.6232161521911621 seconds ---
--- Took data out of DataFrame in 0.038640737533569336 seconds ---
(50000, 900) (50000, 10)
You probably can't beat read_csv. It works out of the box and is likely better than any other solution.
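Applied to a CSV rather than random data, the same idea might look like the sketch below. It builds a tiny in-memory file as a stand-in for bci_dataset_labelled.csv and assumes the file has no header row, which the question does not state; if the real file does have one, drop `header=None`. Passing `dtype=np.uint8` is an optional choice that stores each 0/1 value in a single byte.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real file: 6 rows, 910 columns of 0/1.
rng = np.random.default_rng(0)
sample = rng.integers(0, 2, size=(6, 910))
csv_text = "\n".join(",".join(map(str, row)) for row in sample)

# header=None treats the first line as data (assumption: no header row).
df = pd.read_csv(io.StringIO(csv_text), header=None, dtype=np.uint8)

# Split by column position, then convert each part to a NumPy array.
xvalues = df.iloc[:, :900].to_numpy()  # neurons: first 900 columns
yvalues = df.iloc[:, 900:].to_numpy()  # labels: last 10 columns

print(xvalues.shape, yvalues.shape)
```

The resulting `xvalues` and `yvalues` can be passed to `train_test_split` exactly as in the question, with no per-row loop.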