Optimizing Python code with many attribute and dictionary lookups

Posted 2023-03-06

Title: Optimizing Python code with many attribute and dictionary lookups
Asked: 2010-04-05 18:22:38

Question:

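I have written a program in Python that spends a large amount of time looking up attributes of objects and values from dictionary keys. I would like to know whether there is any way to optimize these lookup times, potentially with a C extension, to reduce execution time, or whether I simply need to re-implement the program in a compiled language.

The program implements some algorithms using a graph. It runs prohibitively slowly on our data sets, so I profiled the code with cProfile using a reduced data set that could actually complete. The vast majority of the time is spent in one function, and specifically in two statements (generator expressions) within that function: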

The generator expression at line 202 is

    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)

The generator expression at line 204 is

    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)

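The source code of the function, for context, is provided below.

selected_nodes is a set of nodes in interaction_graph, which is a NetworkX Graph instance. node_neighbors is an iterator from Graph.neighbors_iter().

Graph itself uses dictionaries for storing nodes and edges. Its Graph.node attribute is a dictionary that stores nodes and their attributes (e.g., 'weight') in dictionaries belonging to each node.

Each of these lookups ought to be amortized constant time (i.e., O(1)); however, I am still paying a large penalty for them. Is there some way I can speed up these lookups (e.g., by writing parts of this as a C extension), or do I need to move the program to a compiled language?

Below is the full source code of the function that provides the context; the vast majority of execution time is spent within this function.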

def calculate_node_z_prime(
        node,
        interaction_graph,
        selected_nodes
    ):
    """Calculates a z'-score for a given node.

    The z'-score is based on the z-scores (weights) of the neighbors of
    the given node, and proportional to the z-score (weight) of the
    given node. Specifically, we find the maximum z-score of all
    neighbors of the given node that are also members of the given set
    of selected nodes, multiply this z-score by the z-score of the given
    node, and return this value as the z'-score for the given node.

    If the given node has no neighbors in the interaction graph, the
    z'-score is defined as zero.

    Returns the z'-score as zero or a positive floating point value.

    :Parameters:
    - `node`: the node for which to compute the z-prime score
    - `interaction_graph`: graph containing the gene-gene or gene
      product-gene product interactions
    - `selected_nodes`: a `set` of nodes fitting some criterion of
      interest (e.g., annotated with a term of interest)

    """
    node_neighbors = interaction_graph.neighbors_iter(node)
    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)
    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)
    try:
        max_z_score = max(neighbor_z_scores)
    # max() throws a ValueError if its argument has no elements; in this
    # case, we need to set the max_z_score to zero
    except ValueError, e:
        # Check to make certain max() raised this error
        if 'max()' in e.args[0]:
            max_z_score = 0
        else:
            raise e

    z_prime = interaction_graph.node[node]['weight'] * max_z_score
    return z_prime

Here are the top few calls according to cProfile, sorted by time.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
156067701  352.313    0.000  642.072    0.000 bpln_contextual.py:204(<genexpr>)
156067701  289.759    0.000  289.759    0.000 bpln_contextual.py:202(<genexpr>)
 13963893  174.047    0.000  816.119    0.000 max
 13963885   69.804    0.000  936.754    0.000 bpln_contextual.py:171(calculate_node_z_prime)
  7116883   61.982    0.000   61.982    0.000 method 'update' of 'set' objects
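
For readers who want to reproduce a report of this shape, here is a minimal sketch using the standard library's cProfile and pstats. The workload busy_lookup and the helper profile_by_tottime are hypothetical stand-ins, not part of the original program.

```python
import cProfile
import io
import pstats


def busy_lookup(n):
    """Hypothetical workload: repeated nested dictionary lookups."""
    table = {i: {'weight': float(i)} for i in range(100)}
    return sum(table[i % 100]['weight'] for i in range(n))


def profile_by_tottime(func, *args, limit=5):
    """Run func under cProfile; return its result and a report sorted by tottime."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # 'tottime' is time spent in the function itself, excluding callees
    stats.sort_stats('tottime').print_stats(limit)
    return result, stream.getvalue()


result, report = profile_by_tottime(busy_lookup, 10000)
print(report)
```

Sorting by tottime rather than cumtime is what surfaces the generator expressions themselves, instead of the function that merely drives them.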

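Comments on the question:

Why two loops, neighbors_in_selected_nodes and neighbor_z_scores? Why not one loop? The two-step approach doesn't seem to introduce anything new. Why is it done this way? Could you update the question to explain why you use two comprehensions instead of one?

I only want to consider the weights of nodes that are 1) neighbors and 2) selected (present in selected_nodes). I use two generator expressions to do this: neighbors_in_selected_nodes filters for the selected nodes, and this is chained into neighbor_z_scores to retrieve the weights. Chaining the iterators should make them a single loop, not two; the loop is evaluated inside max().

The two generator expressions could indeed be written as a single for loop; the first expression corresponds to the statement if neighbor in selected_nodes:, and the second to neighbor_z_scores.append(interaction_graph.node[neighbor]['weight']). By using generator expressions, I avoid creating a list and the append operations. From getting the neighbors until evaluating max(neighbor_z_scores), I deal entirely with a chain of iterators.

Answer 1: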

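How about keeping the iteration order of interaction_graph.neighbors_iter(node) sorted by weight (or partially sorted using heapq)? Since you are only trying to find the maximum, you can iterate over node_neighbors in descending order; the first neighbor that is in selected_nodes must then have the maximum weight among the selected neighbors.

Second, how often does selected_nodes change? If it changes rarely, you can save a lot of iterations by keeping a list of "interaction_graph.node[x] for x in selected_nodes" instead of rebuilding it every time.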
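
A minimal sketch of the first idea, assuming the neighbor lists and weights change rarely enough to be precomputed. Plain dictionaries stand in for the NetworkX graph here, and the names adjacency, weights, build_sorted_neighbors and max_selected_neighbor_weight are hypothetical.

```python
def build_sorted_neighbors(adjacency, weights):
    """Precompute each node's neighbors ordered by descending weight."""
    return {
        node: sorted(neighbors, key=weights.__getitem__, reverse=True)
        for node, neighbors in adjacency.items()
    }


def max_selected_neighbor_weight(node, sorted_neighbors, weights, selected_nodes):
    """Return the weight of the heaviest selected neighbor, or 0 if none."""
    for neighbor in sorted_neighbors[node]:
        # Neighbors come heaviest first, so the first hit is the maximum.
        if neighbor in selected_nodes:
            return weights[neighbor]
    return 0


adjacency = {'a': ['b', 'c', 'd'], 'b': ['a'], 'c': ['a'], 'd': ['a']}
weights = {'a': 1.5, 'b': 0.2, 'c': 3.0, 'd': 2.0}
sorted_neighbors = build_sorted_neighbors(adjacency, weights)

print(max_selected_neighbor_weight('a', sorted_neighbors, weights, {'b', 'd'}))  # 2.0
```

The sort is paid once per node rather than once per call. Whether the early exit actually wins depends on how often a selected neighbor appears near the front of the list, so it is worth profiling.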

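EDIT: in reply to the comments

"A sort() would take O(n log n)"

Not necessarily; you are relying too much on your textbook. Despite what the textbook says, you can sometimes beat the O(n log n) barrier by exploiting structure in your data. If you keep the list of neighbors in a naturally sorted data structure in the first place (e.g. heapq, a binary tree), you don't need to re-sort on every iteration. Of course, this is a space-time tradeoff: you need to store redundant neighbor lists, and there is added code complexity in making sure the neighbor list is updated whenever the neighbors change.

Also, Python's list.sort(), which uses the timsort algorithm, is very fast on nearly sorted data (approaching O(n) in some cases). That does not break the O(n log n) lower bound for general comparison sorts, which has been proven mathematically.

You need to profile before dismissing a solution as unlikely to improve performance. When doing extreme optimization, you will likely find that in certain very specific edge cases, old and slow bubble sort may beat a glorified quicksort or merge sort.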

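Comments:

Sorting is an interesting idea. sort() would take O(n log n) (where n is the number of neighbors); that would be more expensive than a linear search. According to the heapq documentation, heapify() is O(n), but I think each pop is either O(log n) or O(n) (not sure), which means that in the worst case it is still twice the number of operations of a simple linear loop through the neighbors. As for the second point, unfortunately selected_nodes changes with every call to this function, so it isn't amenable to caching. Good ideas, though.

Answer 2:

I don't see why your "weight" lookups must take the form ["weight"] (nodes are dictionaries?) instead of .weight (nodes are objects).

If your nodes are objects and don't have many fields, you can take advantage of the __slots__ directive to optimize their storage: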

class Node(object):
    # ... class stuff goes here ...

    __slots__ = ('weight',) # tuple of member names.

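EDIT: So I looked at the NetworkX link you provided, and several things about it bother me. The first is that, right at the top, the definition of "dictionary" is "FIXME".

Overall, it seems to insist on using dictionaries rather than classes that can be subclassed to store attributes. While attribute lookup on an object may be essentially a dictionary lookup, I don't see how using an object could be worse. If anything, it could be better, since object attribute lookup is more likely to be optimized, because:

object attribute lookups are so common; the keyspace for object attributes is far more restricted than for dictionary keys, so an optimized comparison algorithm can be used in the search; and objects have the __slots__ optimization for exactly these cases, where an object has only a few fields and you need optimized access to them.

I frequently use __slots__ on classes that represent coordinates, for example. A tree node would seem, to me, to be another obvious use.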
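
Claims like this are cheap to test. The sketch below times the three access styles with timeit; the classes PlainNode and SlottedNode and the helper time_lookups are hypothetical, and the numbers depend on the interpreter version and machine, so no expected output is shown.

```python
import timeit


class PlainNode:
    """Ordinary class: attributes live in the instance __dict__."""

    def __init__(self, weight):
        self.weight = weight


class SlottedNode:
    """Class with __slots__: no per-instance __dict__."""

    __slots__ = ('weight',)

    def __init__(self, weight):
        self.weight = weight


def time_lookups(number=200000):
    """Time each access style; returns a dict of label -> seconds."""
    candidates = {
        'dict item': ("node['weight']", {'node': {'weight': 1.0}}),
        'instance attribute': ("node.weight", {'node': PlainNode(1.0)}),
        '__slots__ attribute': ("node.weight", {'node': SlottedNode(1.0)}),
    }
    return {
        label: timeit.timeit(stmt, globals=namespace, number=number)
        for label, (stmt, namespace) in candidates.items()
    }


timings = time_lookups()
for label, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f'{label}: {seconds:.4f} s')
```

Differences of this kind are often small relative to the cost of the surrounding loop, so it is worth measuring the real workload as well.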

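So when I read:

node: A node can be any hashable Python object except None.

I think, okay, no problem, but then immediately following is

node attribute: Nodes can have arbitrary Python objects assigned as attributes by using keyword/value pairs when adding a node or assigning to the G.node[n] attribute dictionary for a specified node n.

I think, if a node needs attributes, why store them separately? Why not just put them in the node? Is writing a class with contentString and weight members detrimental? Edges seem even crazier, since they are dictated to be tuples and not objects that you could subclass.

So I'm rather lost as to the design decisions behind NetworkX.

If you're stuck with it, I'd recommend moving the attributes from those dictionaries into the actual nodes or, if that's not an option, using integers instead of strings as keys into the attribute dictionary, so that searches use a much faster comparison algorithm.

Finally, what if you combined your generators: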

neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
        neighbor in node_neighbors if neighbor in selected_nodes)
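
As a sketch of where this leads on Python 3.4 or later, the combined generator can be passed to max() with default=0, which replaces the try/except around the empty case. Plain dictionaries stand in for the NetworkX graph here; node_attributes and neighbors are hypothetical names for Graph.node and the adjacency data.

```python
def calculate_node_z_prime(node, node_attributes, neighbors, selected_nodes):
    """Return the node's weight times the largest weight among its selected neighbors."""
    max_z_score = max(
        (node_attributes[neighbor]['weight']
         for neighbor in neighbors[node] if neighbor in selected_nodes),
        default=0,  # used when no neighbor is selected
    )
    return node_attributes[node]['weight'] * max_z_score


node_attributes = {
    'a': {'weight': 2.0},
    'b': {'weight': 0.5},
    'c': {'weight': 3.0},
}
neighbors = {'a': ['b', 'c'], 'b': ['a'], 'c': ['a']}

print(calculate_node_z_prime('a', node_attributes, neighbors, {'b', 'c'}))  # 6.0
print(calculate_node_z_prime('a', node_attributes, neighbors, set()))  # 0.0
```

The default keyword does not exist in Python 2, which the code in the question targets, so there the try/except (or an explicit emptiness check) is still needed.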

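Comments:

Good question. The nodes themselves are just Python strings (IDs) representing biological entities. Each node's weight attribute is stored in the graph using its node attribute structure: networkx.lanl.gov/reference/glossary.html#term-node-attribute Since attribute lookups are essentially dictionary lookups, I'm not sure switching to real attributes would improve performance.

The performance gain would come from avoiding string comparison and replacing it with an id comparison... or something like that. I didn't write the language, but I'm pretty sure foo.bar generates better bytecode than {'bar': 123}['bar'], because the former case is far more common.

Mike, intuitively I believed you too, but timings I just made in a comparison show otherwise. bitbucket.org/gotgenes/interesting-python-timings/src The summary is that, for Python 2.6.4, class attribute access [...] __slots__ access [...] __slots__ [...] about 10% faster.

Answer 3:

Try accessing the dict directly and catching KeyErrors; it might be faster, depending on your hit/miss ratio: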

# cache this object
ignode = interaction_graph.node
neighbor_z_scores = []
for neighbor in node_neighbors:
    try:
        neighbor_z_scores.append(ignode[neighbor]['weight'])
    except KeyError:
        pass

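Or use dict.get with a default value and generator expressions:

sentinel = object()
# cache this object
ignode = interaction_graph.node
# shared default, so a missing node falls through to the sentinel
empty = {}

# dict.get returns the default instead of raising KeyError
neighbor_z_scores = (ignode.get(neighbor, empty).get('weight', sentinel)
        for neighbor in node_neighbors)
# using identity testing, it's slightly faster
neighbor_z_scores = (score for score in neighbor_z_scores if score is not sentinel)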

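Comments:

There will be far more neighbors that are not members of selected_nodes than neighbors that are, so I filter on that criterion first. This means fewer 'weight' lookups. I can guarantee that all 'weight' lookups will succeed, so the try-except clause offers no benefit there. ignode is a really good idea, though, and will cut out many lookups of that attribute.

Answer 4:

Without delving deeply into your code, try gaining a little speed with itertools.

Add these at module level: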

import itertools as it, operator as op
# each node's attributes live in a dict, so 'weight' is fetched by key
GET_WEIGHT = op.itemgetter('weight')

Change:

neighbors_in_selected_nodes = (neighbor for neighbor in
        node_neighbors if neighbor in selected_nodes)

into:

neighbors_in_selected_nodes = it.ifilter(selected_nodes.__contains__, node_neighbors)

and:

neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
        neighbor in neighbors_in_selected_nodes)

into:

neighbor_z_scores = (
    it.imap(
        GET_WEIGHT,
        it.imap(
            interaction_graph.node.__getitem__,
            neighbors_in_selected_nodes)
    )
)

Do these help?
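
On Python 3, itertools.ifilter and itertools.imap no longer exist; the built-in filter and map are already lazy. Below is a sketch of the same pipeline under that assumption, with a plain dictionary standing in for interaction_graph.node and a hypothetical wrapper function neighbor_z_scores. It uses operator.itemgetter because the weights are stored as dictionary items.

```python
from operator import itemgetter

# Fetches the 'weight' entry from a node's attribute dictionary.
GET_WEIGHT = itemgetter('weight')


def neighbor_z_scores(node_attributes, node_neighbors, selected_nodes):
    """Lazily yield the weights of the neighbors that are in selected_nodes."""
    neighbors_in_selected_nodes = filter(selected_nodes.__contains__, node_neighbors)
    return map(GET_WEIGHT, map(node_attributes.__getitem__, neighbors_in_selected_nodes))


node_attributes = {
    'b': {'weight': 0.5},
    'c': {'weight': 3.0},
    'd': {'weight': 1.0},
}
scores = neighbor_z_scores(node_attributes, ['b', 'c', 'd'], {'b', 'c'})
print(max(scores, default=0))  # 3.0
```

Passing bound methods such as selected_nodes.__contains__ keeps the per-element work out of Python-level bytecode, which is where any speedup would come from; whether it is measurable here is something to profile.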
