TF.Keras model.predict 比直接 Numpy 慢？

Posted 2023-02-23

技术标签:

【中文标题】TF.Keras model.predict 比直接 Numpy 慢？【英文标题】：TF.Keras model.predict is slower than straight Numpy? 【发布时间】：2020-10-22 03:53:39 【问题描述】：

感谢大家帮助我理解以下问题。我已经更新了问题并生成了 CPU-only 运行和 GPU-only 运行。一般来说，在任何一种情况下，直接numpy 的计算似乎都比model. predict() 快数百倍。希望这可以澄清这似乎不是 CPU 与 GPU 的问题（如果是，我希望得到解释）。

让我们用 keras 创建一个经过训练的模型。

import tensorflow as tf

(X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000,'relu'),
    tf.keras.layers.Dense(100,'relu'),
    tf.keras.layers.Dense(10,'softmax'),
])
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)

现在让我们使用numpy 重新创建model.predict 函数。

import numpy as np

W = model.get_weights()

def predict(X):
    X      = X.reshape((X.shape[0],-1))           #Flatten
    X      = X @ W[0] + W[1]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[2] + W[3]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[4] + W[5]                      #Dense
    X      = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax
    return X

我们可以很容易地验证这些是相同的功能（模块机器实现错误）。

print(model.predict(X[:100]).argmax(1))
print(predict(X[:100]).argmax(1))

我们还可以测试这些函数的运行速度。使用ipython：

%timeit model.predict(X[:10]).argmax(1) # 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # 1000 loops takes 356 µs

我知道predict 在小批量时的运行速度比 model. predict 快大约 10,000 倍，而在大批量时则降低到大约 100 倍。不管怎样，为什么predict 这么快？事实上，predict 甚至没有优化，我们可以使用numba，甚至直接在C 代码中重写predict 并编译它。

考虑到部署目的，为什么手动从模型中提取权重并重新编写函数比keras 内部执行的速度快数千倍？这也意味着编写脚本来利用.h5 文件或类似文件，可能比手动重写预测函数要慢得多。一般来说，这是真的吗？

Ipython 输出（CPU）：

Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.
PyDev console: using IPython 7.19.0
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] on win32
import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"    
import tensorflow as tf
(X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000,'relu'),
    tf.keras.layers.Dense(100,'relu'),
    tf.keras.layers.Dense(10,'softmax'),
])
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)
2021-04-19 15:10:58.323137: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-19 15:11:01.990590: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-04-19 15:11:02.039285: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-04-19 15:11:02.042553: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-G0U8S3P
2021-04-19 15:11:02.043134: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-G0U8S3P
2021-04-19 15:11:02.128834: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:127] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/20
59/59 [==============================] - 4s 60ms/step - loss: 35.3708
Epoch 2/20
59/59 [==============================] - 3s 58ms/step - loss: 0.8671
Epoch 3/20
59/59 [==============================] - 3s 56ms/step - loss: 0.5641
Epoch 4/20
59/59 [==============================] - 3s 56ms/step - loss: 0.4359
Epoch 5/20
59/59 [==============================] - 3s 56ms/step - loss: 0.3447
Epoch 6/20
59/59 [==============================] - 3s 56ms/step - loss: 0.2891
Epoch 7/20
59/59 [==============================] - 3s 56ms/step - loss: 0.2371
Epoch 8/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1977
Epoch 9/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1713
Epoch 10/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1381
Epoch 11/20
59/59 [==============================] - 4s 61ms/step - loss: 0.1203
Epoch 12/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1095
Epoch 13/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0877
Epoch 14/20
59/59 [==============================] - 3s 57ms/step - loss: 0.0793
Epoch 15/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0727
Epoch 16/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0702
Epoch 17/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0701
Epoch 18/20
59/59 [==============================] - 3s 57ms/step - loss: 0.0631
Epoch 19/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0539
Epoch 20/20
59/59 [==============================] - 3s 58ms/step - loss: 0.0493
Out[3]: <tensorflow.python.keras.callbacks.History at 0x143069fdf40>

import numpy as np
W = model.get_weights()
def predict(X):
    X      = X.reshape((X.shape[0],-1))           #Flatten
    X      = X @ W[0] + W[1]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[2] + W[3]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[4] + W[5]                      #Dense
    X      = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax
    return X

%timeit model.predict(X[:10]).argmax(1) # 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # 1000 loops takes 356 µs

52.8 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
640 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Ipython 输出（GPU）：

Python 3.7.7 (default, Mar 26 2020, 15:48:22) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import tensorflow as tf 
   ...:  
   ...: (X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data() 
   ...:  
   ...: model = tf.keras.models.Sequential([ 
   ...:     tf.keras.layers.Flatten(), 
   ...:     tf.keras.layers.Dense(1000,'relu'), 
   ...:     tf.keras.layers.Dense(100,'relu'), 
   ...:     tf.keras.layers.Dense(10,'softmax'), 
   ...: ]) 
   ...: model.compile('adam','sparse_categorical_crossentropy') 
   ...: model.fit(X,Y,epochs=20,batch_size=1024)                                                                                                                                                                   
2020-07-01 15:50:46.008518: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-01 15:50:46.054495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:05:00.0
2020-07-01 15:50:46.059582: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.114562: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-01 15:50:46.142058: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-01 15:50:46.152899: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-01 15:50:46.217725: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-01 15:50:46.260758: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-01 15:50:46.374328: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-01 15:50:46.376747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-01 15:50:46.377688: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX FMA
2020-07-01 15:50:46.433422: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4018875000 Hz
2020-07-01 15:50:46.434383: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x563e4d0d71c0 executing computations on platform Host. Devices:
2020-07-01 15:50:46.435119: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-07-01 15:50:46.596077: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x563e4a9379f0 executing computations on platform CUDA. Devices:
2020-07-01 15:50:46.596119: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-01 15:50:46.597894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:05:00.0
2020-07-01 15:50:46.597961: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.597988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-01 15:50:46.598014: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-01 15:50:46.598040: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-01 15:50:46.598065: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-01 15:50:46.598090: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-01 15:50:46.598115: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-01 15:50:46.599766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-01 15:50:46.600611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.603713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-01 15:50:46.603751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2020-07-01 15:50:46.603763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2020-07-01 15:50:46.605917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10311 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:05:00.0, compute capability: 7.5)
Train on 60000 samples
Epoch 1/20
2020-07-01 15:50:49.995091: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
60000/60000 [==============================] - 2s 26us/sample - loss: 9.9370
Epoch 2/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.6094
Epoch 3/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.3672
Epoch 4/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.2720
Epoch 5/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.2196
Epoch 6/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1673
Epoch 7/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1367
Epoch 8/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1082
Epoch 9/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0895
Epoch 10/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0781
Epoch 11/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0666
Epoch 12/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0537
Epoch 13/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0459
Epoch 14/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0412
Epoch 15/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0401
Epoch 16/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0318
Epoch 17/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0275
Epoch 18/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0237
Epoch 19/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0212
Epoch 20/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0199
Out[1]: <tensorflow.python.keras.callbacks.History at 0x7f7c9000b550>

In [2]: import numpy as np 
   ...:  
   ...: W = model.get_weights() 
   ...:  
   ...: def predict(X): 
   ...:     X      = X.reshape((X.shape[0],-1))           #Flatten 
   ...:     X      = X @ W[0] + W[1]                      #Dense 
   ...:     X[X<0] = 0                                    #Relu 
   ...:     X      = X @ W[2] + W[3]                      #Dense 
   ...:     X[X<0] = 0                                    #Relu 
   ...:     X      = X @ W[4] + W[5]                      #Dense 
   ...:     X      = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax 
   ...:     return X 
   ...:                                                                                                                                                                                                            

In [3]: print(model.predict(X[:100]).argmax(1)) 
   ...: print(predict(X[:100]).argmax(1))                                                                                                                                                                          
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
 0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
 3 0 2 1 1 7 5 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1]
/home/bobbyocean/anaconda3/bin/ipython3:12: RuntimeWarning: overflow encountered in exp
/home/bobbyocean/anaconda3/bin/ipython3:12: RuntimeWarning: invalid value encountered in true_divide
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
 0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
 3 0 2 1 1 7 5 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1]

In [4]: %timeit model.predict(X[:10]).argmax(1)                                                                                                                                                                    
37.7 ms ± 806 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit predict(X[:10]).argmax(1)                                                                                                                                                                          
361 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

【问题讨论】：

【参考方案1】：

正如其他人所指出的，所讨论的 Tensorflow 二进制文件是为 GPU 优化而编译的：虽然 GPU 由于其拥有极高数量的计算核心而在密集的数字处理方面表现出色，但在处理方面却非常缓慢来回移动数据。

当模型在显卡上执行时，所有必要的数据都必须突发到 GPU——它无法访问主机系统的 RAM（主机系统也没有访问视频内存）。一旦 GPU 完成处理，所有结果都必须传送回主机系统。

所有这些数据移动都需要大量时间；此外，据我所知，编译为使用 GPU/CUDA 执行的 Tensorflow 二进制文件不包含在 CPU 上执行的任何标准优化（例如使用更快的扩展指令集，例如 AVX、AVX2等）。

因此，您正在比较一个高度 CPU 优化的科学库，它可以处理数据，甚至不必在一半时间返回 RAM（CPU 寄存器和芯片上的缓存存储）；代码必须在将所有数据发送到显卡并返回之前收集它需要的每一位。我也忽略了在 Tensorflow 引擎盖下进行的所有数据操作。毕竟，它适用于自己的数据结构。

我想，急切执行也是效率低下的另一层。

至于部署 Keras 模型的最佳实践，我认为它就像软件中的其他一切一样：过早的优化是万恶之源。如果您不需要它快速精简，然后让它变得缓慢、模块化、可重用和直观。但是，嘿，如果你需要或想要效率，那就给你力量。 Keras 专为快速开发和研究而设计，而非生产代码。

简而言之，答案是出于同样的原因 C++ 比 Python 快（因为 Python 解释器的开销要大得多，Tensorflow 也是如此）。

【讨论】：

【参考方案2】：

另一个答案在“如何使tf keras 预测更快”方面更有用，但我认为以下内容可以帮助更多“它在做什么需要这么多时间”？即使禁用了 Eager 模式，您也可能想知道执行的样子（例如，提供或不提供 batch_size 等）。

要回答这个问题，您可能会发现跟踪分析器很有用。跟踪执行会增加很多开销（尤其是对于有一堆非常轻量级的 Python 调用的地方），但总的来说应该让您对正在执行的 Python 代码的哪一部分有相当多的了解，因为，好吧，它只是准确记录正在发生的事情。您可以尝试pytracing，因为它会生成 Chrome 浏览器在其内置的chrome://tracing 页面上很好地可视化的文件。使用它，例如在 google colab 中，您可以执行以下操作：

首先，安装pytracing：

!pip install pytracing

然后生成trace：

from pytracing import TraceProfiler
tp = TraceProfiler(output=open('/root/trace.out', 'wt'))
with tp.traced():
  for i in range(2): 
    model.predict(X[:1000], batch_size=1000)

然后下载trace：

from google.colab import files
files.download('/root/trace.out')

之后在Chrome浏览器打开chrome://tracing页面，点击“加载”按钮，选择trace.out文件，就下载好了。

您将看到类似以下内容 - 您可以单击任何元素，查看 python 函数的全名和文件来源 + 所花费的时间（同样，所有这些都高于正常运行，因为跟踪开销）：

您可以看到禁用/启用 Eager Execution 或更改批处理大小将如何改变输出，并且可以亲自查看花费最多的时间。从我目前看到的情况来看（在非急切模式 + 像model.predict(X[:1000], batch_size=1000) 这样的呼叫）花费了相当多的时间：

标准化您的数据（无论是什么意思）：~2.5ms（包括跟踪开销！）：

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training_v1.py:2336:_standardize_user_data

准备回调（即使我们没有设置任何回调）：~2ms（包括跟踪开销）

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/callbacks.py:133:configure_callbacks

关于numpy 版本没有优化的说法——我不同意。这里的 numpy 实现非常优化——python 没有在其中进行任何纯 python 调用（执行predict 只会导致调用 C 中的函数——起初我不敢相信，但似乎是这样），所以Python 的开销非常小。通过优化 ReLU 的方式并消除额外的分配/释放，您可能会有所收获，但这只会带来非常小的性能提升。

【讨论】：

您也可以使用viztracer 来替代pytracing - 与pytracing 相比，viztracer 的开销似乎更低。你也可以像pip install viztracer一样使用pip安装它，用法类似于pytracing。使用viztrace，我可以读取标准化数据和准备回调，每个时间大约为 0.5 毫秒。真正的执行（没有跟踪应该比这更快，但是如果你的批量大小足够小，那么与使用 numpy 的纯算术相比，tf 所做的所有这些额外的事情都会对其产生不利影响）。【参考方案3】：

我们观察到主要问题是Eager Execution 模式的原因。我们根据 CPU 和 GPU 基础对您的代码和相应的结果进行浅显的了解。 numpy 确实不在 GPU 上运行，因此与 tf-gpu 不同，它不会遇到任何数据转移开销。

但也相当明显与model. predict 和np 相比，您定义的predict 方法和np 完成了多少快速计算，而输入测试集是 仅 10 个样本。但是，我们不会进行任何深入的分析，例如您可能喜欢阅读的一件艺术品here。

我的设置如下。我正在使用 Colab 环境并检查 CPU 和 GPU 模式。

TensorFlow 1.15.2
Keras 2.3.1
Numpy 1.19.5

TensorFlow 2.4.1
Keras 2.4.0
Numpy 1.19.5

TF 1.15.2 - CPU

%tensorflow_version 1.x

import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])

TensorFlow 1.x selected.
1.15.2
A:  <function is_built_with_cuda at 0x7f122d58dcb0>
B:  
([], ['/device:CPU:0'])

现在，运行您的代码。

import tensorflow as tf
import keras
print(tf.executing_eagerly()) # False

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([])
model.compile
model.fit

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.07 ms per loop
1000 loops, best of 5: 1.48 ms per loop

我们可以看到执行时间与旧的keras相当。现在，让我们也用 GPU 进行测试。

TF 1.15.2 - GPU

%tensorflow_version 1.x

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])

1.15.2
A:  <function is_built_with_cuda at 0x7f0b5ad46830>
B:  /device:GPU:0
(['/device:GPU:0'], ['/device:CPU:0'])

...
...
%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.02 ms per loop
1000 loops, best of 5: 1.44 ms per loop

现在，这里的执行时间也与旧的keras 相当，并且没有急切模式。现在让我们先看看带有 Eager 模式的新 tf. keras，然后我们观察没有 Eager 模式的情况。

TF 2.4.1 - CPU

热切

import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])

2.4.1
A:  <function is_built_with_cuda at 0x7fed85de3560>
B:  
([], ['/device:CPU:0'])

现在，以 Eager 模式运行代码。

import tensorflow as tf
import keras

print(tf.executing_eagerly())  # True
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()

model = keras.models.Sequential([ ])
model.compile
model.fit

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

10 loops, best of 5: 28 ms per loop
1000 loops, best of 5: 1.73 ms per loop

急切地禁用

现在，如果我们禁用 Eager 模式并运行以下相同的代码，我们将得到：

import tensorflow as tf
import keras

# # Disables eager execution
tf.compat.v1.disable_eager_execution()
# or, 
# Disables eager execution of tf.functions.
# tf.config.run_functions_eagerly(False)
print(tf.executing_eagerly())
False

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([])
model.compile
model.fit

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.37 ms per loop
1000 loops, best of 5: 1.57 ms per loop

现在，我们可以看到在新的tf. keras 中禁用急切模式的执行时间相当。现在，让我们也使用 GPU 模式进行测试。

TF 2.4.1 - GPU

热切

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])

2.4.1
A:  <function is_built_with_cuda at 0x7f16ad88f680>
B:  /device:GPU:0
(['/device:GPU:0'], ['/device:CPU:0'])

import tensorflow as tf
import keras

print(tf.executing_eagerly()) # True
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([ ])
model.compile
model.fit

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

10 loops, best of 5: 26.3 ms per loop
1000 loops, best of 5: 1.48 ms per loop

急切地禁用

最后，如果我们禁用 Eager 模式并运行以下相同的代码，我们将得到：

# Disables eager execution
tf.compat.v1.disable_eager_execution()
# or, 
# Disables eager execution of tf.functions.
# tf.config.run_functions_eagerly(False)
print(tf.executing_eagerly()) # False 

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([ ])
model.compile
model.fit

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.12 ms per loop
1000 loops, best of 5: 1.45 ms per loop

和以前一样，执行时间与新tf. keras 中的非急切模式相当。这就是为什么，Eager 模式 是tf. keras 性能比直接numpy 慢的根本原因。

【讨论】：

以上是关于TF.Keras model.predict 比直接 Numpy 慢？的主要内容，如果未能解决你的问题，请参考以下文章