What should I do if I want to use large datasets that can't be loaded into memory with TensorFlow?

Posted 2023-04-18

【Title】: What should I do if I want to use large datasets that can't load into the memory with TensorFlow? 【Posted】: 2017-02-22 11:20:05 【Question】:

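I want to train a model with TensorFlow on a large dataset that cannot be loaded into memory all at once, but I don't know how to go about it.

I have read some great posts about the TFRecords file format, as well as the official documentation, but I still can't figure it out.

Does TensorFlow have a complete solution for this?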

【Comments】:

【Answer 1】:

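Consider using tf.TextLineReader. Combined with tf.train.string_input_producer, it lets you load data from multiple files on disk (useful if your dataset is large enough that it has to be spread across several files).

See https://www.tensorflow.org/programmers_guide/reading_data#reading_from_files

Code snippet from the link above: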

import tensorflow as tf  # TensorFlow 1.x queue-based input API

filename_queue = tf.train.string_input_producer(["file0.csv", "file1.csv"])

reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

# Default values, in case of empty columns. Also specifies the type of the
# decoded result.
record_defaults = [[1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5 = tf.decode_csv(
    value, record_defaults=record_defaults)
features = tf.stack([col1, col2, col3, col4])

with tf.Session() as sess:
  # Start populating the filename queue.
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)

  for i in range(1200):
    # Retrieve a single instance:
    example, label = sess.run([features, col5])

  coord.request_stop()
  coord.join(threads)
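
To make the mechanics concrete without depending on a particular TensorFlow version, the same idea can be sketched in plain Python. This is only an illustration, not the TensorFlow implementation: the function name read_examples, its parameters, and the temporary demo files are all hypothetical. It reads several CSV files one row at a time, substitutes a default for empty columns (the role record_defaults plays in the snippet), and treats the last column as the label. Unlike the queue-based pipeline, it reads the files in the order given and makes a single pass.

```python
import csv
import os
import tempfile


def read_examples(filenames, num_columns=5, default=1):
    """Yield (features, label) one row at a time from several CSV files.

    Only the current row is held in memory. Empty columns fall back to
    `default`, mirroring the role of record_defaults in the snippet.
    (Hypothetical helper, not part of TensorFlow.)
    """
    for filename in filenames:
        with open(filename, newline="") as f:
            for row in csv.reader(f):
                if not row:
                    continue  # skip blank lines
                # Pad short rows, then substitute the default for empty cells.
                row = row + [""] * (num_columns - len(row))
                values = [int(cell) if cell.strip() else default
                          for cell in row[:num_columns]]
                # All columns but the last are features; the last is the label.
                yield values[:-1], values[-1]


# Tiny demo: two temporary files stand in for file0.csv and file1.csv.
tmpdir = tempfile.mkdtemp()
paths = []
for i, text in enumerate(["1,2,3,4,0\n5,6,,8,1\n", "9,10,11,12,0\n"]):
    path = os.path.join(tmpdir, "file%d.csv" % i)
    with open(path, "w") as f:
        f.write(text)
    paths.append(path)

examples = list(read_examples(paths))
```

Because each row is handled as a list, the number of columns is just a parameter, and nothing requires one variable per column.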

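【Comments】:

Thanks for your answer. But what if the CSV file has many columns? Do I have to write out col1, col2, col3, and so on? And how do I read data from binary files?

@secsilm Yes, you need a col1, col2, etc. for each column in your CSV. Keep in mind that col1 is just a variable name, so you can give it a more memorable name, such as price. For binary files, see tensorflow.org/api_docs/python/tf/FixedLengthRecordReader

【Answer 2】: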

Normally you train in batches anyway, so you can load the data on the fly. For example, with images:

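for bid in range(nrBatches):
    # load_data_from_hd is a placeholder for your own loader that reads
    # batch number `bid` from disk.
    batch_x, batch_y = load_data_from_hd(bid)
    train_step.run(feed_dict={x: batch_x, y_: batch_y})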

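This way you load each batch on the fly and only hold in memory the data you need at any given moment. Naturally, training takes longer when the data is read from the hard disk rather than from memory.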
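
One possible way to structure this kind of on-the-fly loading is a generator that groups a lazy stream of records into batches, so that only one batch exists in memory at a time. The sketch below is framework-independent and makes several assumptions: iter_batches and fake_records are hypothetical names, and fake_records merely stands in for records read from disk.

```python
import itertools


def iter_batches(records, batch_size):
    """Group any iterable of records into lists of at most batch_size items."""
    iterator = iter(records)
    while True:
        # islice pulls only the next batch_size records from the stream.
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_records(n):
    """Hypothetical stand-in for records read lazily from disk."""
    for i in range(n):
        yield [i, i + 1], i % 2  # (features, label)


batch_sizes = []
for batch in iter_batches(fake_records(10), batch_size=4):
    batch_x = [features for features, _ in batch]
    batch_y = [label for _, label in batch]
    # A training step would consume batch_x and batch_y here.
    batch_sizes.append(len(batch_x))
```

The final batch may be smaller than batch_size; whether to keep or drop it depends on the model being trained.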

【Comments】:
