TF.Keras model.predict is slower than straight Numpy?

Posted: 2020-10-22 03:53:39

Question:

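Thanks to everyone for helping me understand the following issue. I have updated the question and produced a CPU-only run and a GPU-only run. In general, in either case a straight numpy calculation appears to be hundreds of times faster than model.predict(). Hopefully this clarifies that this does not appear to be a CPU-vs-GPU issue (if it is, I would love an explanation).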


Let's create a trained model with keras.

import tensorflow as tf

(X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000,'relu'),
    tf.keras.layers.Dense(100,'relu'),
    tf.keras.layers.Dense(10,'softmax'),
])
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)

Now let's re-create the model.predict function using numpy.

import numpy as np

W = model.get_weights()

def predict(X):
    X      = X.reshape((X.shape[0],-1))           #Flatten
    X      = X @ W[0] + W[1]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[2] + W[3]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[4] + W[5]                      #Dense
    X      = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax
    return X

We can easily verify that these are the same function (modulo machine errors in the implementations).

print(model.predict(X[:100]).argmax(1))
print(predict(X[:100]).argmax(1))
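
One caveat about the hand-written softmax: the MNIST pixels are fed in unnormalized (0 to 255), so the logits can become large enough for np.exp to overflow, and the GPU log further down does show "overflow encountered in exp" warnings. A minimal sketch of the usual remedy, subtracting each row's maximum before exponentiating, is shown here; the name stable_softmax and the sample logits are made up for illustration and are not from the original post.

```python
import numpy as np

def stable_softmax(logits):
    # Subtracting the row-wise max leaves the result mathematically unchanged
    # (the common factor cancels), but keeps every exponent <= 0 so np.exp
    # cannot overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

# Illustrative logits; the first row would overflow a naive float32 np.exp.
logits = np.array([[1000.0, 999.0, 0.0],
                   [1.0, 2.0, 3.0]], dtype=np.float32)
probs = stable_softmax(logits)
print(probs.sum(axis=1))      # each row sums to 1, up to rounding
print(probs.argmax(axis=1))
```

Swapping this in for the last line of predict should not change the argmax comparison, but it would change the timings slightly, so the numbers quoted in this thread still refer to the original version.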

We can also test how fast these functions run. Using ipython:

%timeit model.predict(X[:10]).argmax(1) # 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # 1000 loops takes 356 µs

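I find that predict runs about 10,000 times faster than model.predict at small batch sizes, dropping to around 100 times faster at larger batch sizes. Regardless, why is predict so much faster? In fact, predict isn't even optimized; we could use numba, or even rewrite predict directly in C code and compile it.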
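
To check how the cost depends on batch size outside IPython, the same kind of measurement can be sketched with the standard timeit module. The weights below are random stand-ins with the same shapes as the model above (an assumption, so the snippet runs without TensorFlow or MNIST), and time_per_call is a helper name invented for this example. With the real model you would time model.predict alongside it.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins shaped like model.get_weights() for the 784-1000-100-10 net.
shapes = [(784, 1000), (1000,), (1000, 100), (100,), (100, 10), (10,)]
W = [rng.standard_normal(s).astype(np.float32) * 0.01 for s in shapes]

def predict(X):
    X = X.reshape((X.shape[0], -1))          # Flatten
    X = np.maximum(X @ W[0] + W[1], 0)       # Dense + ReLU
    X = np.maximum(X @ W[2] + W[3], 0)       # Dense + ReLU
    X = X @ W[4] + W[5]                      # Dense
    X = np.exp(X - X.max(1, keepdims=True))  # Softmax (shifted for stability)
    return X / X.sum(1, keepdims=True)

def time_per_call(batch_size, repeats=10):
    # Average wall-clock seconds for one predict() call at this batch size.
    X = rng.random((batch_size, 28, 28), dtype=np.float32)
    total = timeit.timeit(lambda: predict(X), number=repeats)
    return total / repeats

for n in (10, 100, 1000):
    t = time_per_call(n)
    print(f"batch={n:5d}  {t * 1e6:9.1f} µs/call  {t / n * 1e6:7.2f} µs/sample")
```

If a framework adds a roughly fixed overhead per call, it shows up as a per-sample cost that shrinks as the batch grows, which is consistent with the gap narrowing at larger batches.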

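Thinking in terms of deployment, why would manually extracting the weights from the model and rewriting the function be thousands of times faster than what keras does internally? This also means that writing a script that uses a .h5 file or similar may be much slower than manually rewriting the predict function. In general, is this true?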


IPython output (CPU):

Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.
PyDev console: using IPython 7.19.0
Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] on win32
import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"    
import tensorflow as tf
(X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000,'relu'),
    tf.keras.layers.Dense(100,'relu'),
    tf.keras.layers.Dense(10,'softmax'),
])
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)
2021-04-19 15:10:58.323137: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-19 15:11:01.990590: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-04-19 15:11:02.039285: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-04-19 15:11:02.042553: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-G0U8S3P
2021-04-19 15:11:02.043134: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-G0U8S3P
2021-04-19 15:11:02.128834: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:127] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/20
59/59 [==============================] - 4s 60ms/step - loss: 35.3708
Epoch 2/20
59/59 [==============================] - 3s 58ms/step - loss: 0.8671
Epoch 3/20
59/59 [==============================] - 3s 56ms/step - loss: 0.5641
Epoch 4/20
59/59 [==============================] - 3s 56ms/step - loss: 0.4359
Epoch 5/20
59/59 [==============================] - 3s 56ms/step - loss: 0.3447
Epoch 6/20
59/59 [==============================] - 3s 56ms/step - loss: 0.2891
Epoch 7/20
59/59 [==============================] - 3s 56ms/step - loss: 0.2371
Epoch 8/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1977
Epoch 9/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1713
Epoch 10/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1381
Epoch 11/20
59/59 [==============================] - 4s 61ms/step - loss: 0.1203
Epoch 12/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1095
Epoch 13/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0877
Epoch 14/20
59/59 [==============================] - 3s 57ms/step - loss: 0.0793
Epoch 15/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0727
Epoch 16/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0702
Epoch 17/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0701
Epoch 18/20
59/59 [==============================] - 3s 57ms/step - loss: 0.0631
Epoch 19/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0539
Epoch 20/20
59/59 [==============================] - 3s 58ms/step - loss: 0.0493
Out[3]: <tensorflow.python.keras.callbacks.History at 0x143069fdf40>

import numpy as np
W = model.get_weights()
def predict(X):
    X      = X.reshape((X.shape[0],-1))           #Flatten
    X      = X @ W[0] + W[1]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[2] + W[3]                      #Dense
    X[X<0] = 0                                    #Relu
    X      = X @ W[4] + W[5]                      #Dense
    X      = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax
    return X

%timeit model.predict(X[:10]).argmax(1) # 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # 1000 loops takes 356 µs

52.8 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
640 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

IPython output (GPU):

Python 3.7.7 (default, Mar 26 2020, 15:48:22) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import tensorflow as tf 
   ...:  
   ...: (X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data() 
   ...:  
   ...: model = tf.keras.models.Sequential([ 
   ...:     tf.keras.layers.Flatten(), 
   ...:     tf.keras.layers.Dense(1000,'relu'), 
   ...:     tf.keras.layers.Dense(100,'relu'), 
   ...:     tf.keras.layers.Dense(10,'softmax'), 
   ...: ]) 
   ...: model.compile('adam','sparse_categorical_crossentropy') 
   ...: model.fit(X,Y,epochs=20,batch_size=1024)                                                                                                                                                                   
2020-07-01 15:50:46.008518: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-01 15:50:46.054495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:05:00.0
2020-07-01 15:50:46.059582: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.114562: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-01 15:50:46.142058: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-01 15:50:46.152899: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-01 15:50:46.217725: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-01 15:50:46.260758: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-01 15:50:46.374328: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-01 15:50:46.376747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-01 15:50:46.377688: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX FMA
2020-07-01 15:50:46.433422: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4018875000 Hz
2020-07-01 15:50:46.434383: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x563e4d0d71c0 executing computations on platform Host. Devices:
2020-07-01 15:50:46.435119: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-07-01 15:50:46.596077: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x563e4a9379f0 executing computations on platform CUDA. Devices:
2020-07-01 15:50:46.596119: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-01 15:50:46.597894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:05:00.0
2020-07-01 15:50:46.597961: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.597988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-01 15:50:46.598014: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-01 15:50:46.598040: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-01 15:50:46.598065: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-01 15:50:46.598090: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-01 15:50:46.598115: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-01 15:50:46.599766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-01 15:50:46.600611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.603713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-01 15:50:46.603751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2020-07-01 15:50:46.603763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2020-07-01 15:50:46.605917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10311 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:05:00.0, compute capability: 7.5)
Train on 60000 samples
Epoch 1/20
2020-07-01 15:50:49.995091: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
60000/60000 [==============================] - 2s 26us/sample - loss: 9.9370
Epoch 2/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.6094
Epoch 3/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.3672
Epoch 4/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.2720
Epoch 5/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.2196
Epoch 6/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1673
Epoch 7/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1367
Epoch 8/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1082
Epoch 9/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0895
Epoch 10/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0781
Epoch 11/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0666
Epoch 12/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0537
Epoch 13/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0459
Epoch 14/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0412
Epoch 15/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0401
Epoch 16/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0318
Epoch 17/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0275
Epoch 18/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0237
Epoch 19/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0212
Epoch 20/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0199
Out[1]: <tensorflow.python.keras.callbacks.History at 0x7f7c9000b550>

In [2]: import numpy as np 
   ...:  
   ...: W = model.get_weights() 
   ...:  
   ...: def predict(X): 
   ...:     X      = X.reshape((X.shape[0],-1))           #Flatten 
   ...:     X      = X @ W[0] + W[1]                      #Dense 
   ...:     X[X<0] = 0                                    #Relu 
   ...:     X      = X @ W[2] + W[3]                      #Dense 
   ...:     X[X<0] = 0                                    #Relu 
   ...:     X      = X @ W[4] + W[5]                      #Dense 
   ...:     X      = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax 
   ...:     return X 
   ...:                                                                                                                                                                                                            

In [3]: print(model.predict(X[:100]).argmax(1)) 
   ...: print(predict(X[:100]).argmax(1))                                                                                                                                                                          
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
 0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
 3 0 2 1 1 7 5 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1]
/home/bobbyocean/anaconda3/bin/ipython3:12: RuntimeWarning: overflow encountered in exp
/home/bobbyocean/anaconda3/bin/ipython3:12: RuntimeWarning: invalid value encountered in true_divide
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
 0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
 3 0 2 1 1 7 5 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1]

In [4]: %timeit model.predict(X[:10]).argmax(1)                                                                                                                                                                    
37.7 ms ± 806 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: %timeit predict(X[:10]).argmax(1)                                                                                                                                                                          
361 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Answer 1:

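As others have pointed out, the Tensorflow binaries in question were compiled for GPU optimization: while GPUs are very good at intensive number crunching because of their extremely high number of compute cores, they are very slow at moving data back and forth.

When a model is executed on a graphics card, all the necessary data must be sent to the GPU in bursts; it cannot access the host system's RAM (nor does the host system have access to the video memory). Once the GPU finishes processing, all the results have to be transferred back to the host system.

All this data movement takes a lot of time. In addition, as far as I know, Tensorflow binaries compiled for GPU/CUDA execution do not include any of the standard optimizations for executing on the CPU (such as using faster extended instruction sets like AVX, AVX2, etc.).

So you are comparing a highly CPU-optimized scientific library, which gets to work on the data without even having to go back to RAM half of the time (CPU registers and on-chip cache), against code that has to gather every bit it needs before sending it all to the graphics card and back. I am also ignoring all the data manipulation that goes on under Tensorflow's hood. It works with its own data structures, after all.

I guess eager execution is another layer of inefficiency, too.

As for best practices for deploying Keras models, I think it is like everything else in software: premature optimization is the root of all evil. If you don't need it to be fast and lean, then let it be slow, modular, reusable, and intuitive. But hey, if you need or want efficiency, more power to you. Keras was designed for rapid development and research, not production code.

In short, the answer is the same reason C++ is faster than Python (because the Python interpreter has far more overhead, and so does Tensorflow).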

Answer 2:

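The other answer is more useful in terms of "how to make tf keras predict faster", but I think the following may help more with "what is it doing that takes so much time?". Even with eager mode disabled, you may wonder what the execution looks like (e.g. with or without providing batch_size, etc.).

To answer this question, you may find a tracing profiler useful. Tracing the execution adds a lot of overhead (especially in places with lots of very lightweight Python calls), but overall it should give you a fair amount of insight into which parts of the Python code are being executed, because, well, it simply records exactly what is happening. You can try pytracing, since it produces files that the Chrome browser visualizes nicely on its built-in chrome://tracing page. To use it, for example in Google Colab, you can do the following:

First, install pytracing: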

!pip install pytracing

Then generate the trace:

from pytracing import TraceProfiler
tp = TraceProfiler(output=open('/root/trace.out', 'wt'))
with tp.traced():
  for i in range(2): 
    model.predict(X[:1000], batch_size=1000)

Then download the trace:

from google.colab import files
files.download('/root/trace.out') 

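After that, open the chrome://tracing page in the Chrome browser, click the "Load" button, and select the trace.out file you just downloaded.

You will see a timeline in which you can click on any element to see the full name of the Python function and its source file, plus the time taken (again, all of this is higher than in a normal run because of the tracing overhead).

You can see how disabling/enabling eager execution or changing the batch size changes the output, and see for yourself where most of the time goes. From what I have seen so far (in non-eager mode, with a call like model.predict(X[:1000], batch_size=1000)), quite a bit of time is spent on:

Standardizing your data (whatever that means): ~2.5 ms (including tracing overhead!):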

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training_v1.py:2336:_standardize_user_data

Preparing callbacks (even though we haven't set any): ~2 ms (including tracing overhead):

/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/callbacks.py:133:configure_callbacks
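
If installing a tracing profiler is not convenient, the standard library's cProfile gives a coarser, aggregated view of the same question (which Python functions the time goes to). Note that this swaps in a different tool from the pytracing approach described above. The workload here, standin_workload, is a hypothetical stand-in so the snippet runs on its own; in practice you would put the model.predict call inside the profiled region.

```python
import cProfile
import io
import pstats

import numpy as np

def standin_workload(n_calls=50):
    # Hypothetical stand-in for repeated model.predict(...) calls.
    x = np.ones((10, 784), dtype=np.float32)
    w = np.ones((784, 100), dtype=np.float32)
    for _ in range(n_calls):
        (x @ w).argmax(1)

profiler = cProfile.Profile()
profiler.enable()
standin_workload()
profiler.disable()

# Sort by cumulative time and keep the 10 most expensive entries.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(report)
```

Unlike a trace, this report has no timeline, but it is usually enough to tell whether the time is going to framework bookkeeping or to the numerical work itself.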

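As for the claim that the numpy version is not optimized: I disagree. The numpy implementation here is quite optimized. Python makes no pure-Python calls in it (executing predict only results in calls to functions implemented in C; I couldn't believe it at first, but that seems to be the case), so the Python overhead is very small. You might gain something by optimizing the way ReLU is done and eliminating the extra allocations/deallocations, but that would bring only a very small performance improvement.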
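
As a rough illustration of the ReLU and allocation point, here is a sketch that replaces the boolean-mask assignment X[X<0] = 0 with an in-place np.maximum and reuses buffers where possible. The weights are random stand-ins with the model's shapes (an assumption, so this runs without TensorFlow), and predict_masked / predict_inplace are names made up for this example. Whether the in-place version is measurably faster depends on array sizes and the NumPy build, so time it before relying on it.

```python
import numpy as np

rng = np.random.default_rng(0)
shapes = [(784, 1000), (1000,), (1000, 100), (100,), (100, 10), (10,)]
W = [rng.standard_normal(s).astype(np.float32) * 0.01 for s in shapes]

def predict_masked(X):
    # Same structure as the question's predict().
    X = X.reshape((X.shape[0], -1))
    X = X @ W[0] + W[1]
    X[X < 0] = 0
    X = X @ W[2] + W[3]
    X[X < 0] = 0
    X = X @ W[4] + W[5]
    return np.exp(X) / np.exp(X).sum(1)[..., None]

def predict_inplace(X):
    X = X.reshape((X.shape[0], -1))
    X = X @ W[0]                   # fresh array, safe to modify in place
    X += W[1]                      # add bias into the matmul result
    np.maximum(X, 0, out=X)        # ReLU without a temporary mask
    X = X @ W[2]
    X += W[3]
    np.maximum(X, 0, out=X)
    X = X @ W[4]
    X += W[5]
    np.exp(X, out=X)               # exp computed once, reused for the sum
    X /= X.sum(1, keepdims=True)
    return X

X = rng.random((10, 28, 28), dtype=np.float32)
print(np.allclose(predict_masked(X), predict_inplace(X), atol=1e-6))
```

The matrix multiplications dominate the cost in both versions, which is consistent with the expectation above that such tweaks bring only a small gain.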

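Comments:

You can also use viztracer as an alternative to pytracing; viztracer seems to have lower overhead than pytracing. You can install it with pip as well (pip install viztracer), and its usage is similar to pytracing. Using viztracer, I can read that standardizing the data and preparing callbacks each take about 0.5 ms. The real execution without tracing should be faster than that, but if your batch size is small enough, all the extra things tf does will work against it compared to pure arithmetic with numpy.

Answer 3: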

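We observed that the main issue is caused by Eager Execution mode. We take a shallow look at your code and the corresponding results on both CPU and GPU. It is true that numpy does not run on the GPU, so unlike tf-gpu it does not incur any data-transfer overhead.

But it is also quite noticeable how much faster your own predict method with np computes compared to model.predict with tf.keras, even though the input test set is only 10 samples. However, we won't do any in-depth analysis here, like the write-up linked from the original answer, which you may enjoy reading.

My setup is as follows. I am using the Colab environment and checking both CPU and GPU modes.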

TensorFlow 1.15.2
Keras 2.3.1
Numpy 1.19.5

TensorFlow 2.4.1
Keras 2.4.0
Numpy 1.19.5

TF 1.15.2 - CPU

%tensorflow_version 1.x

import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
TensorFlow 1.x selected.
1.15.2
A:  <function is_built_with_cuda at 0x7f122d58dcb0>
B:  
([], ['/device:CPU:0'])

Now, run your code.

import tensorflow as tf
import keras
print(tf.executing_eagerly()) # False

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([])
model.compile
model.fit

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.07 ms per loop
1000 loops, best of 5: 1.48 ms per loop

We can see that the execution times are comparable with the old keras. Now, let's test with the GPU as well.


TF 1.15.2 - GPU

%tensorflow_version 1.x

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
1.15.2
A:  <function is_built_with_cuda at 0x7f0b5ad46830>
B:  /device:GPU:0
(['/device:GPU:0'], ['/device:CPU:0'])
...
...
%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.02 ms per loop
1000 loops, best of 5: 1.44 ms per loop

Here too, the execution times are comparable with the old keras, which has no eager mode. Now let's look at the new tf.keras, first with eager mode and then without it.


TF 2.4.1 - CPU

Eager

import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
2.4.1
A:  <function is_built_with_cuda at 0x7fed85de3560>
B:  
([], ['/device:CPU:0'])

Now, run the code in eager mode.

import tensorflow as tf
import keras

print(tf.executing_eagerly())  # True
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()

model = keras.models.Sequential([ ])
model.compile
model.fit

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

10 loops, best of 5: 28 ms per loop
1000 loops, best of 5: 1.73 ms per loop

Eager disabled

Now, if we disable eager mode and run the same code as follows, we get:

import tensorflow as tf
import keras

# # Disables eager execution
tf.compat.v1.disable_eager_execution()
# or, 
# Disables eager execution of tf.functions.
# tf.config.run_functions_eagerly(False)
print(tf.executing_eagerly()) # False
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([])
model.compile
model.fit

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.37 ms per loop
1000 loops, best of 5: 1.57 ms per loop

Now we can see that with eager mode disabled in the new tf.keras, the execution times are comparable. Now, let's test in GPU mode as well.


TF 2.4.1 - GPU

Eager

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"   

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'], 
 [x.name for x in local_device_protos if x.device_type == 'CPU'])
2.4.1
A:  <function is_built_with_cuda at 0x7f16ad88f680>
B:  /device:GPU:0
(['/device:GPU:0'], ['/device:CPU:0'])
import tensorflow as tf
import keras

print(tf.executing_eagerly()) # True
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([ ])
model.compile
model.fit

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

10 loops, best of 5: 26.3 ms per loop
1000 loops, best of 5: 1.48 ms per loop

Eager disabled

Finally, if we disable eager mode and run the same code as follows, we get:

# Disables eager execution
tf.compat.v1.disable_eager_execution()
# or, 
# Disables eager execution of tf.functions.
# tf.config.run_functions_eagerly(False)
print(tf.executing_eagerly()) # False 

(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([ ])
model.compile
model.fit

%timeit model.predict(X[:10]).argmax(1) # yours: 10   loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1)       # yours: 1000 loops takes 356 µs

1000 loops, best of 5: 1.12 ms per loop
1000 loops, best of 5: 1.45 ms per loop

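As before, the execution times are comparable in non-eager mode with the new tf.keras. That is why eager mode is the root cause of tf.keras performing slower than straight numpy.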
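
If the per-call overhead of model.predict is the bottleneck for small inputs, one commonly suggested alternative to disabling eager mode globally is to call the model object directly, which skips most of the predict machinery (batching, callbacks, data handling). A minimal sketch, assuming TensorFlow 2.x is installed; the small untrained model is only a stand-in for the MNIST model above, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import tensorflow as tf

# Untrained stand-in with the same architecture as the question's model.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

x = np.random.rand(10, 28, 28).astype(np.float32)

via_predict = model.predict(x, verbose=0)      # full predict() pipeline
via_call = model(x, training=False).numpy()    # direct forward pass

# Both paths should produce the same probabilities.
print(np.allclose(via_predict, via_call, atol=1e-5))

n = 20
t_predict = timeit.timeit(lambda: model.predict(x, verbose=0), number=n) / n
t_call = timeit.timeit(lambda: model(x, training=False), number=n) / n
print(f"predict: {t_predict * 1e3:.2f} ms   direct call: {t_call * 1e3:.2f} ms")
```

The direct call returns a tensor rather than a NumPy array, hence the .numpy() conversion when the result is compared or passed on to NumPy code.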
