Error Training Keras Model on Google Colab using TPU runtime

Posted 2023-02-16

Asked: 2020-12-10 21:19:46

Question:

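I am trying to create and train a CNN model on a TPU in Google Colab. I plan to use it to classify dogs and cats. The model trains fine on the GPU and CPU runtimes, but I cannot get it to run on the TPU runtime. Here is my code.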

I feed in my dataset with the flow_from_directory() function. Here is the code for that:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
    MAIN_DIR,
    target_size = (128,128),
    batch_size = 50,
    class_mode = 'binary'
)

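from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, BatchNormalization, MaxPooling2D,
                                     Dropout, Flatten, Dense)

def create_model():
  model = Sequential()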
  model.add(Conv2D(32,(3,3),activation='relu',input_shape=(128,128,3)))
  model.add(BatchNormalization())
  model.add(MaxPooling2D(pool_size=(2,2)))
  model.add(Dropout(0.25))
  model.add(Conv2D(64,(3,3),activation='relu'))
  model.add(BatchNormalization())
  model.add(MaxPooling2D(pool_size=(2,2)))
  model.add(Dropout(0.25))
  model.add(Conv2D(128,(3,3),activation='relu'))
  model.add(BatchNormalization())
  model.add(MaxPooling2D(pool_size=(2,2)))
  model.add(Dropout(0.25))
  model.add(Flatten())
  model.add(Dense(512,activation='relu'))
  model.add(BatchNormalization())
  model.add(Dropout(0.5))
  model.add(Dense(2,activation='softmax'))
  
  return model

This is the code I use to start the TPU on Google Colab:

import os
import tensorflow as tf

tf.keras.backend.clear_session()

resolver = tf.distribute.cluster_resolver.TPUClusterResolver('grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)

# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))

strategy = tf.distribute.experimental.TPUStrategy(resolver)

with strategy.scope():
  model = create_model()
  model.compile(
      optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
      loss='sparse_categorical_crossentropy',
      metrics=['sparse_categorical_accuracy'])

model.fit(
    train_generator,
    epochs=5,
)

But when I run this code, I get this error:

UnavailableError                          Traceback (most recent call last)
<ipython-input-15-1970b3405ba3> in <module>()
     20 model.fit(
     21     train_generator,
---> 22     epochs = 5,
     23 
     24 )

14 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

UnavailableError: 5 root error(s) found.
  (0) Unavailable: function_node __inference_train_function_42823 failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:"created":"@1598016644.748265484","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":["created":"@1598016644.748262999","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14]
     [[node MultiDeviceIteratorGetNextFromShard]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[cond_11/switch_pred/_107/_78]]
  (1) Unavailable: function_node __inference_train_function_42823 failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:"created":"@1598016644.748265484","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":["created":"@1598016644.748262999","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14]
     [[node MultiDeviceIteratorGetNextFromShard]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[cond_12/switch_pred/_118/_82]]
  (2) Unavailable: function_node __inference_train_function_42823 failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:"created":"@1598016644.748265484","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":["created":"@1598016644.748262999","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14]
     [[node MultiDeviceIteratorGetNextFromShard]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[TPUReplicate/_compile/_7955920754087029306/_4/_266]]
  (3) Unavailable: function_node __inference_train_function_42823 failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:"created":"@1598016644.748265484","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":["created":"@1598016644.748262999","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14]
     [[node MultiDeviceIteratorGetNextFromShard]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[Shape_7/_104]]
  (4) Unavailable: functi ... [truncated]

I really don't know how to fix this, and I don't know what these errors mean either.

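Comments:

Are you using Firebase Functions?

I am sure I am not using Firebase Functions.

The error "failed to connect to all addresses" means that something is blocking the connection, possibly a firewall.

Is MAIN_DIR a local dataset? That will not work on a TPU, because the accelerator runs on a different VM. You have to move the dataset to GCS and load it with tf.data.Dataset for best results. This codelab covers the basics, and this document shows how to convert an existing image classification dataset to TFRecords.

You need to have your data on Google Cloud Storage to use a TPU.

Answer 1: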

You have run into a known issue with TPUs: they do not support PyFunction. Details are here: #38762, #34346, #39099:

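"Sorry about this issue. Dataset.from_generator is not expected to work with TPUs, because it uses py_function, which is incompatible with the Cloud TPU 2VM setup. If you want to read from a large dataset, you could try materializing it on disk and using TFRecordDataset instead."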
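As a rough illustration of the TFRecord route mentioned in that quote, the sketch below writes (JPEG bytes, label) pairs to a TFRecord file and reads them back with tf.data.TFRecordDataset. The feature names "image" and "label" and the helper names are assumptions made for this example, not something taken from the linked issues. The 128x128 size and batch size of 50 mirror the question's code.

```python
import tensorflow as tf

def serialize_example(image_bytes, label):
    """Pack one encoded JPEG and its integer label into a tf.train.Example."""
    feature = {
        # "image" and "label" are hypothetical feature names chosen for this sketch.
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()

def write_tfrecord(examples, out_path):
    """Write an iterable of (jpeg_bytes, int_label) pairs to out_path."""
    with tf.io.TFRecordWriter(out_path) as writer:
        for image_bytes, label in examples:
            writer.write(serialize_example(image_bytes, label))

FEATURES = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(record):
    """Decode one serialized Example into a (float image, label) pair."""
    parsed = tf.io.parse_single_example(record, FEATURES)
    image = tf.io.decode_jpeg(parsed["image"], channels=3)
    # Same preprocessing as the question: 128x128 input, values scaled to [0, 1].
    image = tf.image.resize(image, (128, 128)) / 255.0
    return image, parsed["label"]

def read_tfrecord(path, batch_size=50):
    """Build a batched dataset from a TFRecord file (local path or gs:// URL)."""
    ds = tf.data.TFRecordDataset(path).map(parse_example)
    # Dropping the last partial batch keeps every batch the same shape.
    return ds.batch(batch_size, drop_remainder=True)
```

For TPU training the file would presumably need to live in a GCS bucket (a gs:// path) rather than on the Colab VM's local disk, as the comments on the question point out.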

Since ImageDataGenerator also uses PyFunction under the hood, it is incompatible with TPUs. Instead, you have to use the tf.data API to load your images. This tutorial explains how to do that.
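A minimal sketch of what a tf.data replacement for the generator might look like is shown here. It assumes the images are JPEGs laid out as <root>/cats/*.jpg and <root>/dogs/*.jpg; those directory names, and the helper names load_example and make_dataset, are hypothetical and not taken from the question. tf.data.AUTOTUNE requires TensorFlow 2.4 or later; older versions expose it as tf.data.experimental.AUTOTUNE.

```python
import tensorflow as tf

IMAGE_SIZE = (128, 128)  # matches input_shape=(128,128,3) in create_model()
BATCH_SIZE = 50          # same batch size as the original generator

def load_example(path):
    """Read one JPEG and derive a 0/1 label from its parent directory name."""
    # Assumed layout: .../cats/xxx.jpg -> 0, .../dogs/xxx.jpg -> 1
    class_name = tf.strings.split(path, "/")[-2]
    label = tf.cast(tf.equal(class_name, "dogs"), tf.int32)
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    # Equivalent of target_size=(128,128) and rescale=1./255.
    image = tf.image.resize(image, IMAGE_SIZE) / 255.0
    return image, label

def make_dataset(file_pattern, batch_size=BATCH_SIZE):
    """Build a shuffled, batched dataset from a glob pattern (local or gs://)."""
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    ds = files.map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
    # TPUs work best with static shapes, so the last partial batch is dropped.
    return ds.batch(batch_size, drop_remainder=True).prefetch(tf.data.AUTOTUNE)
```

With the images uploaded to a bucket, training would then look something like model.fit(make_dataset("gs://your-bucket/train/*/*.jpg"), epochs=5), where your-bucket is a placeholder. Everything in the pipeline is a native TensorFlow op, so no py_function is involved.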
