Can I measure the execution time of individual operations with TensorFlow?

Posted 2023-02-16

【Title】: Can I measure the execution time of individual operations with TensorFlow? 【Posted】: 2016-03-21 12:12:29 【Question】:

I know I can measure the execution time of a call to sess.run(), but is it possible to get a finer granularity and measure the execution time of individual operations?

【Answer 1】:

Uber's recently released custom-op library SBNet (http://www.github.com/uber/sbnet) includes an implementation of CUDA-event-based timers, which can be used as follows:

with tf.control_dependencies([input1, input2]):
    dt0 = sbnet_module.cuda_timer_start()
with tf.control_dependencies([dt0]):
    input1 = tf.identity(input1)
    input2 = tf.identity(input2)

### portion of subgraph to time goes in here

with tf.control_dependencies([result1, result2, dt0]):
    cuda_time = sbnet_module.cuda_timer_end(dt0)
with tf.control_dependencies([cuda_time]):
    result1 = tf.identity(result1)
    result2 = tf.identity(result2)

py_result1, py_result2, dt = session.run([result1, result2, cuda_time])
print("Milliseconds elapsed =", dt)

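Note that any part of the subgraph can be asynchronous, so you should be very careful to specify all input and output dependencies for the timer ops. Otherwise the timer may be inserted into the graph out of order and you can get erroneous timings. I found both the timeline and time.time() timing of very limited utility for profiling TensorFlow graphs. Also note that the cuda_timer APIs synchronize on the default stream, which is currently by design because TF uses multiple streams.

Having said that, I personally recommend switching to PyTorch :) Development iteration is faster, code runs faster, and everything is a lot less painful.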

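Another somewhat hacky and arcane way to subtract the overhead of tf.Session (which can be enormous) is to replicate the graph N times, run it for varying N, and solve for the unknown fixed overhead. That is, you measure session.run() with N1=10 and N2=20 copies, where the time of one copy is t and the fixed overhead is x. So something like

N1*t + x = t1
N2*t + x = t2

Solve for x and t. The downside is that this may require a lot of memory and is not necessarily accurate :) Also make sure that your inputs are completely different/random/independent, otherwise TF will fold the entire subgraph instead of running it N times... Have fun with TensorFlow :)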
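The two-measurement idea above can be sketched in plain Python. This is only an illustration, assuming the wall-clock time of a run with N copies of the subgraph is N*t + x, where t is the time of one copy and x is the fixed overhead. The function name `solve_overhead` and the sample timings are made up for the example and do not come from any library.

```python
def solve_overhead(n1, t1, n2, t2):
    """Solve n1*t + x = t1 and n2*t + x = t2 for (t, x).

    t is the time of one copy of the subgraph, x is the fixed overhead.
    """
    if n1 == n2:
        raise ValueError("n1 and n2 must differ, otherwise the system is singular")
    # Subtracting the two equations eliminates x.
    t = (t2 - t1) / (n2 - n1)
    # Substitute t back into the first equation to recover x.
    x = t1 - n1 * t
    return t, x


# Hypothetical wall-clock measurements (seconds) for 10 and 20 copies.
per_copy, overhead = solve_overhead(10, 0.25, 20, 0.45)
print("per-copy time:", per_copy, "fixed overhead:", overhead)
```

With noisy timings it is probably worth repeating each measurement several times and using the median before solving.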

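【Comments】:

This example lacks a complete set of variables or any suggestion of how to create them. When I clicked through to the sbnet repo on GitHub, it looked 3-4 years out of date.

【Answer 2】:

TF 2.0-compatible answer: you can use profiling in a Keras callback.

The code is: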

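from datetime import datetime

import tensorflow as tf

# Write TensorBoard logs to a timestamped directory.
log_dir = "logs/profile/" + datetime.now().strftime("%Y%m%d-%H%M%S")

# profile_batch=3 profiles the third training batch.
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1, profile_batch=3)

# model and train_data are assumed to be defined elsewhere.
model.fit(train_data,
          steps_per_epoch=20,
          epochs=5,
          callbacks=[tensorboard_callback])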

For more details on how to do profiling, see this TensorBoard link.

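【Answer 3】:

As of TensorFlow 1.8, there is a really good example of using tf.profile.Profiler here.

【Comments】:

The link is dead. Is there an updated version? (Still for TF 1.x.)

【Answer 4】:

You can retrieve this information using runtime statistics. You will need to do something like this (see the full example at the link mentioned above):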

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(<values_you_want_to_execute>, options=run_options, run_metadata=run_metadata)
your_writer.add_run_metadata(run_metadata, 'step%d' % i)

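Better than just printing it, you can view it in TensorBoard.

Additionally, clicking on a node displays the exact total memory, compute time, and tensor output sizes.

【Comments】:

The link (tensorflow.org/programmers_guide/graph_viz#runtime_statistics) has been updated.

【Answer 5】:

Since this ranks high when googling for "Tensorflow Profiling", note that the current (late 2017, TensorFlow 1.4) way of getting the Timeline is to use a ProfilerHook. This works with the MonitoredSessions in tf.Estimator, where tf.RunOptions are not available.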

estimator = tf.estimator.Estimator(model_fn=...)
hook = tf.train.ProfilerHook(save_steps=10, output_dir='.')
estimator.train(input_fn=..., steps=..., hooks=[hook])

【Answer 6】:

Regarding fat-lobyte's comment under Olivier Moindrot's answer: if you want to gather the timeline over all sessions, you can change "open('timeline.json', 'w')" to "open('timeline.json', 'a')".
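One caveat: appending several traces to the same file yields concatenated JSON objects, which is not a single valid JSON document, so a viewer may refuse to load it. A possible alternative, sketched here in plain Python, is to merge the event lists instead. This assumes each trace string is a JSON object with a top-level "traceEvents" list, as in Chrome's trace format. The helper `merge_traces` and the sample events are hypothetical and not part of TensorFlow.

```python
import json


def merge_traces(trace_strings):
    """Combine several Chrome-trace JSON strings into one JSON string.

    Each input is assumed to be a JSON object with a "traceEvents" list.
    """
    merged_events = []
    for trace in trace_strings:
        merged_events.extend(json.loads(trace).get("traceEvents", []))
    return json.dumps({"traceEvents": merged_events})


# Two tiny hand-written traces standing in for the output of two run steps.
step1 = json.dumps({"traceEvents": [{"name": "MatMul", "ph": "X", "ts": 0, "dur": 120}]})
step2 = json.dumps({"traceEvents": [{"name": "Add", "ph": "X", "ts": 500, "dur": 30}]})

merged = merge_traces([step1, step2])
print(len(json.loads(merged)["traceEvents"]), "events in the merged trace")
```

In a training loop, the strings would be collected from each step's trace output and merged once at the end. Whether the merged view is readable depends on how process and thread IDs line up across steps.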

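【Answer 7】:

I have used the Timeline object to get the execution time of each node in the graph:

You use a classic sess.run(), but also specify the optional arguments options and run_metadata. You then create a Timeline object from the run_metadata.step_stats data.

Here is an example program that measures the performance of a matrix multiplication: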

import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.random_normal([1000, 1000])
y = tf.random_normal([1000, 1000])
res = tf.matmul(x, y)

# Run the graph with full trace option
with tf.Session() as sess:
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(res, options=run_options, run_metadata=run_metadata)

    # Create the Timeline object, and write it to a json
    tl = timeline.Timeline(run_metadata.step_stats)
    ctf = tl.generate_chrome_trace_format()
    with open('timeline.json', 'w') as f:
        f.write(ctf)

You can then open Google Chrome, go to the page chrome://tracing, and load the timeline.json file. You should then see a timeline of the ops that were executed.
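If you prefer numbers to a picture, the trace file can also be summarized in plain Python. The sketch below assumes the file follows Chrome's trace format, where complete events have "ph" equal to "X" and a "dur" field in microseconds. The helper name `total_duration_by_name` and the sample events are made up for illustration.

```python
import json
from collections import defaultdict


def total_duration_by_name(trace_json):
    """Sum the 'dur' field (microseconds) of complete ('X') events per event name."""
    totals = defaultdict(int)
    for event in json.loads(trace_json).get("traceEvents", []):
        # Skip metadata and other non-duration events.
        if event.get("ph") == "X":
            totals[event.get("name", "?")] += event.get("dur", 0)
    return dict(totals)


# Hand-written sample standing in for the contents of timeline.json.
sample = json.dumps({"traceEvents": [
    {"name": "MatMul", "ph": "X", "ts": 0, "dur": 120},
    {"name": "MatMul", "ph": "X", "ts": 200, "dur": 80},
    {"name": "process_name", "ph": "M"},
]})

for name, micros in total_duration_by_name(sample).items():
    print(name, micros, "us")
```

For a real trace, pass the contents of the file, for example `total_duration_by_name(open('timeline.json').read())`.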

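【Comments】:

Hi! I tried creating a Timeline for my network training, but unfortunately doing it the way you showed only produces a timeline for the last call to session.run. Is there a way to aggregate the timeline over all sessions?

Using TensorFlow 0.12.0-rc0, I found that I needed to make sure libcupti.so/libcupti.dylib was in the library path for this to work. For me (on Mac), I added /usr/local/cuda/extras/CUPTI/lib to DYLD_LIBRARY_PATH.

Or LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH on Ubuntu.

Why is there an add operator here?

Because when tf.random_normal is called, TensorFlow first creates a random tensor with mean 0 and variance 1, then multiplies it by the standard deviation (1 here) and adds the mean (0 here).

【Answer 8】:

To update this answer: we do have some functionality for CPU profiling, focused on inference. If you look at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/benchmark you will see a program you can run on a model to get per-op timings.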

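【Comments】:

What about getting GPU benchmarks on raw TensorFlow ops?

【Answer 9】:

There is not yet a way to do this in the public release. We are aware that it is an important feature and we are working on it.

【Comments】:

Is it possible that this answer could be updated? github.com/tensorflow/tensorflow/issues/899 seems to make it possible to compute the FLOPs of individual ops, which could give insight into execution time.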
