How to stream multi-file (b, t, f)-shaped data into a TensorFlow Dataset

Posted 2023-02-16

Asked: 2022-01-10 15:36:02

Question:

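I have a large dataset that I want to load into a TensorFlow Dataset to train an LSTM network. Because of its size, I want to stream the data rather than read all of it into memory. I am struggling to read the data so that each sample i is correctly shaped as (t_i, m).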

Sample code to reproduce:

import os

import numpy as np

# One hundred samples, each with three features.
# The second dimension is the number of time steps per sample;
# it is randomized in a step below.
x = np.random.randn(100, 10, 3)
# One hundred 0/1 labels
y = (np.random.rand(100) > 0.5) * 1
y = y.reshape((-1, 1))

# Save each sample in its own file, in a subdirectory named after its label
for i in range(len(x)):
  cat = y[i][0]
  data = x[i]
  # Simulate a random length (4 to 9 time steps) for each sample
  data = data[:np.random.randint(4, 10), :]
  fname = 'tmp_csv/{:.0f}/{:03.0f}.csv'.format(cat, i)
  os.makedirs(os.path.dirname(fname), exist_ok=True)
  np.savetxt(fname, data, delimiter=',')

Now I have one hundred CSV files, each holding one sample of size (t_i, 3). How can I read these files back into a TensorFlow Dataset while keeping the shape of each sample?

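I tried serialization (but could not work out how to do it properly), flattening so that each sample sits on a single row (but did not know how to handle the variable row sizes or how to reshape afterwards), and plain make_csv_dataset. Here is my make_csv_dataset attempt: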

import tensorflow as tf

ds = tf.data.experimental.make_csv_dataset(
  file_pattern = "tmp_csv/*/*.csv",
  batch_size=10, num_epochs=1,
  num_parallel_reads=5,
  shuffle_buffer_size=10,
  header=False,
  column_names=['a','b','c']
)

for i in ds.take(1):
  print(i)

...but this results in each sample having shape (1, 3).
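To make the symptom concrete, here is a minimal sketch of what make_csv_dataset yields. It assumes two toy files written to a temporary directory (the file names, lengths, and batch size are illustrative and not taken from the question). Each element is a dictionary of per-column vectors, and a batch is filled row by row, so it can mix rows from several files and the per-file (t_i, 3) structure is lost.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Two toy files with 4 and 6 rows of three features each (illustrative sizes).
root = tempfile.mkdtemp()
for i, t in enumerate([4, 6]):
    np.savetxt(os.path.join(root, f"{i:03d}.csv"), np.ones((t, 3)) * i, delimiter=",")

ds = tf.data.experimental.make_csv_dataset(
    file_pattern=os.path.join(root, "*.csv"),
    batch_size=5,
    num_epochs=1,
    shuffle=False,
    header=False,
    column_names=["a", "b", "c"],
)

# Each element is a dict mapping a column name to a vector of batch_size rows.
first = next(iter(ds))
column_shapes = {name: tuple(values.shape) for name, values in first.items()}
print(column_shapes)
```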

Answer 1:

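The problem is that make_csv_dataset interprets every row of every CSV file as one sample. You could try something like the following, though I am not sure how efficient it will be for your use case: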

import os

import tensorflow as tf
import numpy as np

# One hundred samples, each with three features.
# The second dimension is the number of time steps per sample;
# it is randomized in a step below.
x = np.random.randn(100, 10, 3)
# One hundred 0/1 labels
y = (np.random.rand(100) > 0.5) * 1
y = y.reshape((-1, 1))

# Save each sample in its own file; the label is the first character of the file name
os.makedirs('tmp_csv', exist_ok=True)
for i in range(len(x)):
  cat = y[i][0]
  data = x[i]
  # Simulate a random length (4 to 9 time steps) for each sample
  data = data[:np.random.randint(4, 10), :]
  fname = 'tmp_csv/{:.0f}{:03.0f}.csv'.format(cat, i)
  np.savetxt(fname, data, delimiter=',')

def columns_to_tensor(data_from_one_csv):
  # Collect the rows of one file into a single (t_i, 3) tensor
  ta = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)
  for i, t in enumerate(data_from_one_csv):
    ta = ta.write(tf.cast(i, dtype=tf.int32), tf.stack([t[0], t[1], t[2]], axis=0))
  return ta.stack()

files = tf.data.Dataset.list_files("tmp_csv/*.csv")
# Map each file name to its own CsvDataset, then each CsvDataset to one tensor
ds = files.map(lambda file: tf.data.experimental.CsvDataset(file, record_defaults=[tf.float32, tf.float32, tf.float32], header=False))
ds = ds.map(columns_to_tensor)
for i,j in enumerate(ds):
  print(i, j.shape)

0 (5, 3)
1 (9, 3)
2 (5, 3)
3 (6, 3)
4 (8, 3)
5 (7, 3)
6 (6, 3)
7 (8, 3)
8 (8, 3)
9 (7, 3)
10 (9, 3)
11 (9, 3)
12 (7, 3)
13 (9, 3)
14 (4, 3)
15 (5, 3)
16 (6, 3)
17 (6, 3)
18 (8, 3)
19 (8, 3)
20 (8, 3)
21 (9, 3)
22 (9, 3)
23 (7, 3)
24 (8, 3)
25 (8, 3)
26 (5, 3)
27 (7, 3)
28 (5, 3)
29 (8, 3)
30 (9, 3)
31 (6, 3)
32 (6, 3)
33 (7, 3)
34 (6, 3)
35 (9, 3)
36 (9, 3)
37 (5, 3)
38 (9, 3)
39 (9, 3)
40 (7, 3)
41 (7, 3)
42 (7, 3)
43 (6, 3)
44 (9, 3)
45 (4, 3)
46 (9, 3)
47 (6, 3)
48 (9, 3)
49 (8, 3)
50 (7, 3)
51 (4, 3)
52 (4, 3)
53 (6, 3)
54 (7, 3)
55 (7, 3)
56 (9, 3)
57 (7, 3)
58 (5, 3)
59 (7, 3)
60 (8, 3)
61 (8, 3)
62 (5, 3)
63 (5, 3)
64 (7, 3)
65 (6, 3)
66 (6, 3)
67 (7, 3)
68 (6, 3)
69 (9, 3)
70 (5, 3)
71 (4, 3)
72 (8, 3)
73 (8, 3)
74 (6, 3)
75 (7, 3)
76 (9, 3)
77 (6, 3)
78 (5, 3)
79 (7, 3)
80 (6, 3)
81 (5, 3)
82 (4, 3)
83 (5, 3)
84 (4, 3)
85 (5, 3)
86 (4, 3)
87 (4, 3)
88 (7, 3)
89 (5, 3)
90 (4, 3)
91 (7, 3)
92 (4, 3)
93 (7, 3)
94 (4, 3)
95 (5, 3)
96 (6, 3)
97 (6, 3)
98 (7, 3)
99 (9, 3)

After that, just call ds.batch with your desired batch size.
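Because the samples have different numbers of time steps, they cannot be stacked into one tensor as they are. One option is ds.padded_batch, which pads each batch with zeros up to its longest sample. The sketch below is a minimal illustration that assumes a toy generator dataset in place of the CSV files; the sample lengths and variable names are made up for the example.

```python
import numpy as np
import tensorflow as tf

# Toy stand-in for the per-file samples: each has shape (t_i, 3) with a different t_i.
lengths = [4, 9, 6, 7]
samples = [np.ones((t, 3), dtype=np.float32) for t in lengths]

ds = tf.data.Dataset.from_generator(
    lambda: iter(samples),
    output_signature=tf.TensorSpec(shape=(None, 3), dtype=tf.float32),
)

# Pad every sample in a batch with zeros up to the longest sample in that batch.
batched = ds.padded_batch(2)

batch_shapes = [tuple(batch.shape) for batch in batched]
print(batch_shapes)  # [(2, 9, 3), (2, 7, 3)]
```

The padded time steps are zeros, so a model that should ignore them needs some form of masking.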

Comments:

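Thanks! Yes, it is a bit slow, but it does work... Any ideas for making it more efficient?

Increase the batch size; check ***.com/questions/56714388/…. In your case, though, a large batch size is hard to use because each sample has a different number of time steps. You could consider padding them, and then you can use any batch size you want.

I see, thanks. I can also add prefetching to hide the lag. I think that will do, thanks @Alone.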
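The suggestions in these comments (padding, larger batches, prefetching) can be combined in one pipeline. The sketch below is one possible variant and has not been benchmarked against the answer's code. It swaps the nested CsvDataset for tf.io.read_file plus a single tf.io.decode_csv call per file, and it assumes the question's directory layout, where the parent directory name is the label. The temporary directory, the load_sample helper, and the toy lengths are hypothetical, and tf.data.AUTOTUNE assumes TensorFlow 2.4 or later.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Write a few toy CSV files laid out as <root>/<label>/<index>.csv (assumed layout).
root = tempfile.mkdtemp()
lengths = [4, 6, 9, 5]
for i, t in enumerate(lengths):
    label_dir = os.path.join(root, str(i % 2))
    os.makedirs(label_dir, exist_ok=True)
    np.savetxt(os.path.join(label_dir, f"{i:03d}.csv"), np.random.randn(t, 3), delimiter=",")

def load_sample(path):
    # Read the whole file once and parse all of its rows in one decode_csv call.
    lines = tf.strings.split(tf.strings.strip(tf.io.read_file(path)), "\n")
    columns = tf.io.decode_csv(lines, record_defaults=[[0.0]] * 3)
    features = tf.stack(columns, axis=1)  # shape (t_i, 3)
    # The label is the name of the file's parent directory.
    label = tf.strings.to_number(tf.strings.split(path, os.sep)[-2], out_type=tf.int32)
    return features, label

ds = (
    tf.data.Dataset.list_files(os.path.join(root, "*", "*.csv"), shuffle=False)
    .map(load_sample, num_parallel_calls=tf.data.AUTOTUNE)  # parse files in parallel
    .padded_batch(2)                                         # zero-pad to the longest sample
    .prefetch(tf.data.AUTOTUNE)                              # overlap loading with training
)

shapes = [(tuple(features.shape), labels.numpy().tolist()) for features, labels in ds]
print(shapes)
```

Setting shuffle=False only keeps the toy run deterministic; for training, the default shuffled file order is usually what you want.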
