How to stream multi-file (b, t, f)-shaped data into a TensorFlow Dataset
Posted: 2022-01-10 15:36:02

I have a large dataset that I want to load into a TensorFlow Dataset to train an LSTM network. Because of the size of the data, I would like to stream it rather than read it all into memory. I am struggling to read my data so that each sample i is correctly shaped as (ti, m).
Sample code to reproduce:
import os
import numpy as np

# One hundred samples, each with three features.
# The second dim is time-steps for each sample; it is
# randomized in a step below.
x = np.random.randn(100, 10, 3)
# One hundred 0/1 labels
y = (np.random.rand(100) > 0.5) * 1
y = y.reshape((-1, 1))

# Save each sample in its own file, under a subdirectory per label
for i in range(len(x)):
    cat = y[i][0]
    data = x[i]
    # Simulate a random length for each sample
    data = data[:np.random.randint(4, 10), :]
    os.makedirs('tmp_csv/{:.0f}'.format(cat), exist_ok=True)  # ensure the directory exists
    fname = 'tmp_csv/{:.0f}/{:03.0f}.csv'.format(cat, i)
    np.savetxt(fname, data, delimiter=',')
Now I have one hundred csv files, each holding a single sample of size (ti, 3). How can I read these files back into a TensorFlow Dataset while preserving each sample's shape?
I tried serialization (but didn't know how to do it properly), and flattening so that each sample sits in a single row (but didn't know how to handle the variable row sizes and how to reshape afterwards). I also tried vanilla make_csv_dataset. Here is my make_csv_dataset attempt:
ds = tf.data.experimental.make_csv_dataset(
    file_pattern="tmp_csv/*/*.csv",
    batch_size=10, num_epochs=1,
    num_parallel_reads=5,
    shuffle_buffer_size=10,
    header=False,
    column_names=['a', 'b', 'c']
)
for i in ds.take(1):
    print(i)
...but this results in each sample having shape (1, 3).
Answer 1: The problem is that make_csv_dataset interprets every row of every csv file as a separate sample. You could try something like the following, although I am not sure how efficient it will be for your use case:
import os
import numpy as np
import tensorflow as tf

# One hundred samples, each with three features.
# The second dim is time-steps for each sample; it is
# randomized in a step below.
x = np.random.randn(100, 10, 3)
# One hundred 0/1 labels
y = (np.random.rand(100) > 0.5) * 1
y = y.reshape((-1, 1))

# Save each sample in its own file (flat layout this time)
os.makedirs('tmp_csv', exist_ok=True)  # ensure the output directory exists
for i in range(len(x)):
    cat = y[i][0]
    data = x[i]
    # Simulate a random length for each sample
    data = data[:np.random.randint(4, 10), :]
    fname = 'tmp_csv/{:.0f}{:03.0f}.csv'.format(cat, i)
    np.savetxt(fname, data, delimiter=',')
def columns_to_tensor(data_from_one_csv):
    # Collect all rows of one csv file into a single (t_i, 3) tensor
    ta = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)
    for i, t in enumerate(data_from_one_csv):
        ta = ta.write(tf.cast(i, dtype=tf.int32), tf.stack([t[0], t[1], t[2]], axis=0))
    return ta.stack()

files = tf.data.Dataset.list_files("tmp_csv/*.csv")
ds = files.map(lambda file: tf.data.experimental.CsvDataset(
    file, record_defaults=[tf.float32, tf.float32, tf.float32], header=False))
ds = ds.map(columns_to_tensor)

for i, j in enumerate(ds):
    print(i, j.shape)
0 (5, 3)
1 (9, 3)
2 (5, 3)
3 (6, 3)
4 (8, 3)
...
99 (9, 3)
Afterwards, just call ds.batch with your desired batch size (for samples with different numbers of time-steps you will need padding first, as discussed in the comments below).
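Since the samples have different numbers of time-steps, a plain ds.batch cannot stack them directly; padded_batch is the usual way around that. Below is a minimal, self-contained sketch using a hypothetical stand-in dataset (zeros with lengths 4, 7, 9, 5) rather than the csv pipeline above:

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for the variable-length dataset built above:
# four samples with 4, 7, 9, and 5 time-steps and 3 features each.
def gen():
    for t in (4, 7, 9, 5):
        yield np.zeros((t, 3), dtype=np.float32)

ds = tf.data.Dataset.from_generator(
    gen, output_signature=tf.TensorSpec(shape=(None, 3), dtype=tf.float32))

# padded_batch pads every sample up to the longest one in its batch,
# giving each batch a uniform (batch, t_max, 3) shape.
padded = ds.padded_batch(2, padded_shapes=(None, 3))
shapes = [tuple(b.shape) for b in padded]
print(shapes)  # [(2, 7, 3), (2, 9, 3)]
```

Padding to the per-batch maximum (rather than a global maximum) keeps the wasted computation small when lengths vary a lot.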
Comments:
Thanks! Yes, it is a bit slow, but it does work... any ideas on making it more efficient?

Increase the batch size; see ***.com/questions/56714388/… — although in your case it is hard to use a large batch size, because each sample has a different number of time-steps. You could consider padding them, and then you can use whatever batch size you want.

I see, thanks. I can also add prefetching to hide that lag. I think that will do it, thanks @Alone.
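The prefetching idea from the comments can be sketched on a toy pipeline (the range dataset here is just a stand-in for the per-file csv parsing above; AUTOTUNE lets tf.data pick the degree of parallelism at runtime):

```python
import tensorflow as tf

# num_parallel_calls runs the map function on several elements at once,
# and prefetch overlaps input preparation with model training.
ds = tf.data.Dataset.range(8)
ds = ds.map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.prefetch(tf.data.AUTOTUNE)

# map is deterministic by default, so element order is preserved
vals = [int(v) for v in ds]
print(vals)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

The same two calls can be attached to the CsvDataset pipeline in the answer above; they change throughput, not the elements produced.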