TF.Keras model.predict is slower than straight Numpy?
Posted: 2020-10-22 03:53:39
Question: Thanks to everyone for helping me understand the issue below. I have updated the question and produced a CPU-only run and a GPU-only run. In general, in either case the straight numpy computation appears to be hundreds of times faster than model.predict(). Hopefully this clarifies that it does not seem to be a CPU vs. GPU issue (and if it is, I would love an explanation).
Let's create a trained model with keras.
import tensorflow as tf
(X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000,'relu'),
    tf.keras.layers.Dense(100,'relu'),
    tf.keras.layers.Dense(10,'softmax'),
])
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)
Now let's recreate the model.predict function using numpy.
import numpy as np
W = model.get_weights()
def predict(X):
    X = X.reshape((X.shape[0],-1)) #Flatten
    X = X @ W[0] + W[1] #Dense
    X[X<0] = 0 #Relu
    X = X @ W[2] + W[3] #Dense
    X[X<0] = 0 #Relu
    X = X @ W[4] + W[5] #Dense
    X = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax
    return X
We can easily verify that these are the same function (modulo machine-precision error).
print(model.predict(X[:100]).argmax(1))
print(predict(X[:100]).argmax(1))
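A slightly stronger check than eyeballing the printed labels (a minimal sketch; the slice size is arbitrary) is to measure how often the two implementations pick the same class. The raw probabilities need not match exactly: besides float32 round-off, the naive softmax can overflow on unscaled uint8 inputs, as the RuntimeWarnings in the GPU run below show.

import numpy as np

# Fraction of samples where both implementations predict the same digit.
agree = (model.predict(X[:1000]).argmax(1) == predict(X[:1000]).argmax(1)).mean()
print(f"label agreement: {agree:.1%}")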
We can also test how fast these functions run. Using ipython:
%timeit model.predict(X[:10]).argmax(1) # 10 loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1) # 1000 loops takes 356 µs
I note that predict runs around 10,000 times faster than model.predict on small batches, dropping to around 100 times on larger batches. Either way, why is predict so much faster? In fact, predict is not even optimized; we could use numba, or even rewrite predict directly in C code and compile it.
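For illustration, here is a rough numba sketch of that idea (my own, untimed; it assumes pip install numba, and numba's matmul support relies on SciPy's BLAS bindings being available). Since softmax is monotonic, taking argmax over the raw logits gives the same labels without the exp/normalization step.

import numba
import numpy as np

W32 = [w.astype(np.float32) for w in model.get_weights()]

@numba.njit(cache=True)
def logits(X, W0, b0, W1, b1, W2, b2):
    H = np.maximum(X @ W0 + b0, np.float32(0.0))  # Dense + ReLU
    H = np.maximum(H @ W1 + b1, np.float32(0.0))  # Dense + ReLU
    return H @ W2 + b2                            # Dense (logits only)

Xf = X[:10].reshape(10, -1).astype(np.float32)    # flatten and cast outside the jitted code
print(logits(Xf, *W32).argmax(1))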
With deployment in mind, why is manually extracting the weights from the model and rewriting the function thousands of times faster than what keras does internally? It would also mean that a deployment script that relies on the .h5 file, or the like, could be much slower than rewriting the prediction function by hand. Is that true in general?
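For what it's worth, a minimal sketch of that "extract the weights once, skip keras at inference time" idea (the file name is my own):

import numpy as np

# After training: dump the weights to a plain .npz archive.
np.savez('mnist_dense_weights.npz', *model.get_weights())

# In a lightweight inference script that never imports TensorFlow:
data = np.load('mnist_dense_weights.npz')
W = [data[f'arr_{i}'] for i in range(len(data.files))]  # arr_0, arr_1, ... in layer order
# ...then reuse the numpy predict() defined above.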
IPython output (CPU):
Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.
PyDev console: using IPython 7.19.0
Python 3.8.5 (default, Sep 3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] on win32
import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"
import tensorflow as tf
(X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()
model = tf.keras.models.Sequential([
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(1000,'relu'),
tf.keras.layers.Dense(100,'relu'),
tf.keras.layers.Dense(10,'softmax'),
])
model.compile('adam','sparse_categorical_crossentropy')
model.fit(X,Y,epochs=20,batch_size=1024)
2021-04-19 15:10:58.323137: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-04-19 15:11:01.990590: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-04-19 15:11:02.039285: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-04-19 15:11:02.042553: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-G0U8S3P
2021-04-19 15:11:02.043134: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-G0U8S3P
2021-04-19 15:11:02.128834: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:127] None of the MLIR optimization passes are enabled (registered 2)
Epoch 1/20
59/59 [==============================] - 4s 60ms/step - loss: 35.3708
Epoch 2/20
59/59 [==============================] - 3s 58ms/step - loss: 0.8671
Epoch 3/20
59/59 [==============================] - 3s 56ms/step - loss: 0.5641
Epoch 4/20
59/59 [==============================] - 3s 56ms/step - loss: 0.4359
Epoch 5/20
59/59 [==============================] - 3s 56ms/step - loss: 0.3447
Epoch 6/20
59/59 [==============================] - 3s 56ms/step - loss: 0.2891
Epoch 7/20
59/59 [==============================] - 3s 56ms/step - loss: 0.2371
Epoch 8/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1977
Epoch 9/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1713
Epoch 10/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1381
Epoch 11/20
59/59 [==============================] - 4s 61ms/step - loss: 0.1203
Epoch 12/20
59/59 [==============================] - 3s 57ms/step - loss: 0.1095
Epoch 13/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0877
Epoch 14/20
59/59 [==============================] - 3s 57ms/step - loss: 0.0793
Epoch 15/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0727
Epoch 16/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0702
Epoch 17/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0701
Epoch 18/20
59/59 [==============================] - 3s 57ms/step - loss: 0.0631
Epoch 19/20
59/59 [==============================] - 3s 56ms/step - loss: 0.0539
Epoch 20/20
59/59 [==============================] - 3s 58ms/step - loss: 0.0493
Out[3]: <tensorflow.python.keras.callbacks.History at 0x143069fdf40>
import numpy as np
W = model.get_weights()
def predict(X):
    X = X.reshape((X.shape[0],-1)) #Flatten
    X = X @ W[0] + W[1] #Dense
    X[X<0] = 0 #Relu
    X = X @ W[2] + W[3] #Dense
    X[X<0] = 0 #Relu
    X = X @ W[4] + W[5] #Dense
    X = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax
    return X
%timeit model.predict(X[:10]).argmax(1) # 10 loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1) # 1000 loops takes 356 µs
52.8 ms ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
640 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
IPython output (GPU):
Python 3.7.7 (default, Mar 26 2020, 15:48:22)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.4.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import tensorflow as tf
...:
...: (X,Y),(Xt,Yt) = tf.keras.datasets.mnist.load_data()
...:
...: model = tf.keras.models.Sequential([
...: tf.keras.layers.Flatten(),
...: tf.keras.layers.Dense(1000,'relu'),
...: tf.keras.layers.Dense(100,'relu'),
...: tf.keras.layers.Dense(10,'softmax'),
...: ])
...: model.compile('adam','sparse_categorical_crossentropy')
...: model.fit(X,Y,epochs=20,batch_size=1024)
2020-07-01 15:50:46.008518: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-01 15:50:46.054495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:05:00.0
2020-07-01 15:50:46.059582: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.114562: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-01 15:50:46.142058: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-01 15:50:46.152899: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-01 15:50:46.217725: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-01 15:50:46.260758: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-01 15:50:46.374328: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-01 15:50:46.376747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-01 15:50:46.377688: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX FMA
2020-07-01 15:50:46.433422: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 4018875000 Hz
2020-07-01 15:50:46.434383: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x563e4d0d71c0 executing computations on platform Host. Devices:
2020-07-01 15:50:46.435119: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
2020-07-01 15:50:46.596077: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x563e4a9379f0 executing computations on platform CUDA. Devices:
2020-07-01 15:50:46.596119: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-01 15:50:46.597894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:05:00.0
2020-07-01 15:50:46.597961: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.597988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-01 15:50:46.598014: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-01 15:50:46.598040: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-01 15:50:46.598065: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-01 15:50:46.598090: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-01 15:50:46.598115: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-01 15:50:46.599766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-07-01 15:50:46.600611: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-01 15:50:46.603713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-01 15:50:46.603751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-07-01 15:50:46.603763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-07-01 15:50:46.605917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10311 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:05:00.0, compute capability: 7.5)
Train on 60000 samples
Epoch 1/20
2020-07-01 15:50:49.995091: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
60000/60000 [==============================] - 2s 26us/sample - loss: 9.9370
Epoch 2/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.6094
Epoch 3/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.3672
Epoch 4/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.2720
Epoch 5/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.2196
Epoch 6/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1673
Epoch 7/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1367
Epoch 8/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.1082
Epoch 9/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0895
Epoch 10/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0781
Epoch 11/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0666
Epoch 12/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0537
Epoch 13/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0459
Epoch 14/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0412
Epoch 15/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0401
Epoch 16/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0318
Epoch 17/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0275
Epoch 18/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0237
Epoch 19/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0212
Epoch 20/20
60000/60000 [==============================] - 0s 4us/sample - loss: 0.0199
Out[1]: <tensorflow.python.keras.callbacks.History at 0x7f7c9000b550>
In [2]: import numpy as np
...:
...: W = model.get_weights()
...:
...: def predict(X):
...:     X = X.reshape((X.shape[0],-1)) #Flatten
...:     X = X @ W[0] + W[1] #Dense
...:     X[X<0] = 0 #Relu
...:     X = X @ W[2] + W[3] #Dense
...:     X[X<0] = 0 #Relu
...:     X = X @ W[4] + W[5] #Dense
...:     X = np.exp(X)/np.exp(X).sum(1)[...,None] #Softmax
...:     return X
...:
In [3]: print(model.predict(X[:100]).argmax(1))
...: print(predict(X[:100]).argmax(1))
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
3 0 2 1 1 7 5 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1]
/home/bobbyocean/anaconda3/bin/ipython3:12: RuntimeWarning: overflow encountered in exp
/home/bobbyocean/anaconda3/bin/ipython3:12: RuntimeWarning: invalid value encountered in true_divide
[5 0 4 1 9 2 1 3 1 4 3 5 3 6 1 7 2 8 6 9 4 0 9 1 1 2 4 3 2 7 3 8 6 9 0 5 6
0 7 6 1 8 7 9 3 9 8 5 9 3 3 0 7 4 9 8 0 9 4 1 4 4 6 0 4 5 6 1 0 0 1 7 1 6
3 0 2 1 1 7 5 0 2 6 7 8 3 9 0 4 6 7 4 6 8 0 7 8 3 1]
In [4]: %timeit model.predict(X[:10]).argmax(1)
37.7 ms ± 806 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [5]: %timeit predict(X[:10]).argmax(1)
361 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Answer 1:

As others have pointed out, the Tensorflow binary in question is compiled with GPU optimizations: while GPUs excel at heavy number crunching thanks to their enormous number of compute cores, they are very slow at moving data back and forth.

When the model executes on the graphics card, all of the necessary data has to be transferred over to the GPU; it cannot access the host system's RAM (and the host system cannot access video memory either). Once the GPU finishes processing, all of the results have to be shipped back to the host system.

All of this data movement takes a lot of time; moreover, as far as I know, Tensorflow binaries compiled for GPU/CUDA execution do not include the standard optimizations for running on a CPU (such as the faster extended instruction sets AVX, AVX2, etc.).

So you are comparing a highly CPU-optimized scientific library, which can crunch data without even having to go back to RAM half the time (thanks to CPU registers and on-chip caches), against code that has to gather every bit of data it needs before shipping it all to the graphics card and back. And that ignores all the data manipulation Tensorflow does under the hood; after all, it works on its own data structures.

I suppose eager execution is yet another layer of inefficiency.

As for best practices for deploying Keras models, I think it is like everything else in software: premature optimization is the root of all evil. If you don't need it fast and lean, make it slow, modular, reusable, and intuitive. But hey, if you need or want efficiency, more power to you. Keras is designed for rapid development and research, not for production code.

In short, the answer is the same reason C++ is faster than Python: the Python interpreter carries far more overhead, and so does Tensorflow.
Answer 2:

The other answer is more helpful with respect to "how to make tf.keras predict faster", but I think the following can help more with "what is it doing that takes so much time?". Even with eager mode disabled, you may want to see what the execution looks like (e.g., with or without a batch_size, and so on).

To answer that, you may find a tracing profiler useful. Tracing the execution adds a lot of overhead (especially wherever there is a bunch of very lightweight Python calls), but overall it should give you a pretty good idea of which part of the Python code is being executed, because, well, it simply records exactly what is happening. You can try pytracing, since it generates files that the Chrome browser visualizes nicely on its built-in chrome://tracing page. Using it, for example in Google Colab, you can do the following:
First, install pytracing:
!pip install pytracing
Then generate the trace:
from pytracing import TraceProfiler

tp = TraceProfiler(output=open('/root/trace.out', 'wt'))
with tp.traced():
    for i in range(2):
        model.predict(X[:1000], batch_size=1000)
Then download the trace:
from google.colab import files
files.download('/root/trace.out')
After that, open the chrome://tracing page in the Chrome browser, click the "Load" button, and pick the trace.out file you just downloaded.
You will see something like the following; you can click on any element to see the full name of the Python function, the file it comes from, and the time it took (again, everything is slower than a normal run because of the tracing overhead):
You can see how disabling/enabling eager execution or changing the batch size changes the output, and see for yourself where the most time is spent. From what I have seen so far (in non-eager mode, with a call like model.predict(X[:1000], batch_size=1000)), a fair amount of time is spent on:
Standardizing your data (whatever that means): ~2.5 ms (including tracing overhead!):
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training_v1.py:2336:_standardize_user_data
Preparing callbacks (even though we didn't set any): ~2 ms (including tracing overhead):
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/callbacks.py:133:configure_callbacks
As for the claim that the numpy version is not optimized, I disagree. The numpy implementation here is already very well optimized: there are no pure-Python calls inside it (executing predict only results in calls into functions written in C; at first I couldn't believe it, but it seems to be the case), so the Python overhead is tiny. You might gain a little by optimizing how the ReLU is done and eliminating extra allocations/deallocations (a sketch follows), but that would only bring a very small performance improvement.
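For completeness, a sketch of those micro-optimizations (my own variant, not from the answer): np.maximum fuses the ReLU into one pass instead of a separate boolean-mask assignment, and shifting the logits by their row maximum avoids the exp overflow RuntimeWarnings visible in the question's GPU run.

import numpy as np

W = model.get_weights()

def predict_v2(X):
    X = X.reshape((X.shape[0], -1)).astype(np.float32)  # Flatten (and avoid uint8 matmul)
    X = np.maximum(X @ W[0] + W[1], 0.0)                 # Dense + ReLU in one pass
    X = np.maximum(X @ W[2] + W[3], 0.0)                 # Dense + ReLU
    X = X @ W[4] + W[5]                                  # Dense (logits)
    X -= X.max(axis=1, keepdims=True)                    # shift so exp cannot overflow
    E = np.exp(X)
    return E / E.sum(axis=1, keepdims=True)              # Softmax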
Comments:
You can also use viztracer as an alternative to pytracing; viztracer seems to have lower overhead than pytracing. You can install it with pip in the same way (pip install viztracer), and the usage is similar to pytracing. With viztracer I read roughly 0.5 ms each for standardizing the data and preparing the callbacks. The real execution (without tracing) should be even faster than that, but if your batch size is small enough, all these extra things tf does will still hurt it compared to pure arithmetic with numpy.
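A minimal sketch of the viztracer variant mentioned in this comment (the output file name is arbitrary; VizTracer also works as a context manager):

from viztracer import VizTracer

# Record the same two predict() calls as in the pytracing example above.
with VizTracer(output_file='predict_trace.json'):
    for _ in range(2):
        model.predict(X[:1000], batch_size=1000)

# Then inspect the result with the bundled viewer:
#   vizviewer predict_trace.json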
Answer 3:

We observe that the main culprit is the Eager Execution mode. Below we take a quick look at your code and the corresponding results on both a CPU and a GPU basis. numpy indeed does not run on the GPU, so unlike tf-gpu it does not incur any data-transfer overhead.
But it is also quite noticeable how much faster your hand-written predict method with np computes compared to model.predict, even though the input test set is only 10 samples. We will not do any in-depth analysis here, though; for that you may enjoy reading the piece of work referenced here.
My setup is as follows. I am using the Colab environment and checking both CPU and GPU modes.
TensorFlow 1.15.2, Keras 2.3.1, Numpy 1.19.5
TensorFlow 2.4.1, Keras 2.4.0, Numpy 1.19.5
TF 1.15.2 - CPU
%tensorflow_version 1.x
import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"
import tensorflow as tf
from tensorflow.python.client import device_lib
print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'],
[x.name for x in local_device_protos if x.device_type == 'CPU'])
TensorFlow 1.x selected.
1.15.2
A: <function is_built_with_cuda at 0x7f122d58dcb0>
B:
([], ['/device:CPU:0'])
Now, run your code.
import tensorflow as tf
import keras
print(tf.executing_eagerly()) # False
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([])
model.compile
model.fit
%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.07 ms per loop
1000 loops, best of 5: 1.48 ms per loop
We can see that the execution times are comparable with the old keras. Now, let's also test with the GPU.
TF 1.15.2 - GPU
%tensorflow_version 1.x
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import tensorflow as tf
from tensorflow.python.client import device_lib
print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'],
[x.name for x in local_device_protos if x.device_type == 'CPU'])
1.15.2
A: <function is_built_with_cuda at 0x7f0b5ad46830>
B: /device:GPU:0
(['/device:GPU:0'], ['/device:CPU:0'])
...
...
%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.02 ms per loop
1000 loops, best of 5: 1.44 ms per loop
Here, too, the execution times are comparable with the old keras, which has no eager mode. Now let's first look at the new tf.keras with Eager mode, and then observe it without Eager mode.
TF 2.4.1 - CPU
Eager
import os
os.environ["CUDA_VISIBLE_DEVICES"]="-1"
import tensorflow as tf
from tensorflow.python.client import device_lib
print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'],
[x.name for x in local_device_protos if x.device_type == 'CPU'])
2.4.1
A: <function is_built_with_cuda at 0x7fed85de3560>
B:
([], ['/device:CPU:0'])
Now, run the code in Eager mode.
import tensorflow as tf
import keras
print(tf.executing_eagerly()) # True
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([ ])
model.compile
model.fit
%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
10 loops, best of 5: 28 ms per loop
1000 loops, best of 5: 1.73 ms per loop
Eager disabled
Now, if we disable Eager mode and run the same code, we get:
import tensorflow as tf
import keras
# # Disables eager execution
tf.compat.v1.disable_eager_execution()
# or,
# Disables eager execution of tf.functions.
# tf.config.run_functions_eagerly(False)
print(tf.executing_eagerly())
False
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([])
model.compile
model.fit
%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.37 ms per loop
1000 loops, best of 5: 1.57 ms per loop
We can see that with Eager mode disabled, the execution times in the new tf.keras become comparable. Now, let's also test in GPU mode.
TF 2.4.1 - GPU
Eager
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import tensorflow as tf
from tensorflow.python.client import device_lib
print(tf.__version__)
print('A: ', tf.test.is_built_with_cuda)
print('B: ', tf.test.gpu_device_name())
local_device_protos = device_lib.list_local_devices()
([x.name for x in local_device_protos if x.device_type == 'GPU'],
[x.name for x in local_device_protos if x.device_type == 'CPU'])
2.4.1
A: <function is_built_with_cuda at 0x7f16ad88f680>
B: /device:GPU:0
(['/device:GPU:0'], ['/device:CPU:0'])
import tensorflow as tf
import keras
print(tf.executing_eagerly()) # True
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([ ])
model.compile
model.fit
%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
10 loops, best of 5: 26.3 ms per loop
1000 loops, best of 5: 1.48 ms per loop
Eager disabled
Finally, if we disable Eager mode and run the same code, we get:
# Disables eager execution
tf.compat.v1.disable_eager_execution()
# or,
# Disables eager execution of tf.functions.
# tf.config.run_functions_eagerly(False)
print(tf.executing_eagerly()) # False
(X,Y),(Xt,Yt) = keras.datasets.mnist.load_data()
model = keras.models.Sequential([ ])
model.compile
model.fit
%timeit model.predict(X[:10]).argmax(1) # yours: 10 loops takes 37.7 ms
%timeit predict(X[:10]).argmax(1) # yours: 1000 loops takes 356 µs
1000 loops, best of 5: 1.12 ms per loop
1000 loops, best of 5: 1.45 ms per loop
As before, the execution times are comparable with the non-eager mode in the new tf.keras. That is why Eager mode is the root cause of tf.keras performing slower than straight numpy.
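As a side note (my addition, not benchmarked in these answers), two commonly used ways to cut that per-call overhead for small batches without disabling eager execution globally are calling the model directly and wrapping the forward pass in a tf.function; both assume eager execution is still on (the TF 2.x default):

import tensorflow as tf

x = tf.convert_to_tensor(X[:10], dtype=tf.float32)  # MNIST arrays are uint8, so cast explicitly

# 1) Skip the predict() machinery and call the model directly.
probs = model(x, training=False).numpy()

# 2) Wrap the forward pass so it runs as a compiled graph.
@tf.function
def fast_predict(inp):
    return model(inp, training=False)

probs2 = fast_predict(x).numpy()
print(probs.argmax(1), probs2.argmax(1))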