Fastest way to create JSON to reflect a tree structure in Python / Django using mptt

Posted 2023-02-24

【Title】: fastest way to create JSON to reflect a tree structure in Python / Django using mptt 【Posted】: 2012-09-15 09:29:35 【Question】:

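What is the fastest way in Python (Django) to create JSON from a Django queryset? Note that parsing it in the template, as suggested here, is not an option.

The background is that I wrote a method that loops over all nodes in a tree, but it is already very slow when converting about 300 nodes. The first (and probably worst) idea that came to mind was to build the JSON "manually" somehow. See the code below.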

#! Solution 1 !!#
from django.utils.encoding import smart_str, smart_unicode  # Python 2 era Django helpers

def quoteStr(input):
    return "\"" + smart_str(smart_unicode(input)) + "\""

# indent replaces the undefined names (indent, layout) used in the original listing
def createJSONTreeDump(user, node, root=False, lastChild=False, indent=""):
    # open tag for object
    json = str("\n" + indent + "{" +
               quoteStr("name") + ": " + quoteStr(node.name) + ",\n" +
               quoteStr("id") + ": " + quoteStr(node.pk))

    childrenTag = "children"
    children = node.get_children()
    if children.count() > 0:
        # create children array opening tag
        json += ",\n" + indent + quoteStr(childrenTag) + ": ["
        for idx, child in enumerate(children):
            # recursive call; only the last child omits the trailing ","
            isLast = (idx + 1) == children.count()
            json += createJSONTreeDump(user, child, False, isLast, indent)
        # add children closing tag
        json += "]\n"

    # closing tag for object
    if lastChild == False:
        # more children following, add ","
        json += indent + "},\n"
    else:
        # last child, do not add ","
        json += indent + "}\n"
    return json

The tree to render is built with mptt, where calling .get_children() returns all children of a node.

The model is as simple as this; mptt takes care of everything else.

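from django.db import models
from mptt.models import MPTTModel, TreeForeignKey

# ExtraManager is the asker's own mixin; its definition is not shown here
class Node(MPTTModel, ExtraManager):
    """
    Representation of a single node
    """
    name = models.CharField(max_length=200)
    parent = TreeForeignKey('self', null=True, blank=True, related_name='%(app_label)s_%(class)s_children')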

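The expected JSON result is created like this in the template: var root = {{ jsonTree|safe }}

Edit: Based on this answer I wrote the following code (definitely the better code), but it feels only slightly faster.

Solution 2: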

def serializable_object(node):
    "Recurse into tree to build a serializable object"
    obj = {'name': node.name, 'id': node.pk, 'children': []}
    for child in node.get_children():
        obj['children'].append(serializable_object(child))
    return obj

import json
jsonTree = json.dumps(serializable_object(nodeInstance))

Solution 3:

def serializable_object_List_Comprehension(node):
    "Recurse into tree to build a serializable object"
    obj = {
        'name': node.name,
        'id': node.pk,
        # recurse into this function (the original listing called Solution 2's function here)
        'children': [serializable_object_List_Comprehension(ch) for ch in node.get_children()]
    }
    return obj

Solution 4:

def recursive_node_to_dict(node):
    result = {
        'name': node.name, 'id': node.pk
    }
    # NOTE: the trailing comma turns children into a 1-tuple, which is what
    # produces the double [] discussed below
    children = [recursive_node_to_dict(c) for c in node.get_children()],
    if children is not None:
        result['children'] = children
    return result

import json
from mptt.templatetags.mptt_tags import cache_tree_children
root_nodes = cache_tree_children(root.get_descendants())
dicts = []
for n in root_nodes:
    # serialize each root node (the original listing always passed root_nodes[0])
    dicts.append(recursive_node_to_dict(n))
jsonTree = json.dumps(dicts, indent=4)

Solution 5 (uses select_related to pre-fetch, though I am not sure it is used correctly):

def serializable_object_select_related(node):
    "Recurse into tree to build a serializable object, make use of select_related"
    obj = {'name': node.get_wbs_code(), 'wbsCode': node.get_wbs_code(), 'id': node.pk,
           'level': node.level, 'position': node.position, 'children': []}
    for child in node.get_children().select_related():
        # recurse into this function (the original listing called Solution 2's function here)
        obj['children'].append(serializable_object_select_related(child))
    return obj

Solution 6 (improved Solution 4, using the cache of child nodes):

def recursive_node_to_dict(node):
    return {
        'name': node.name, 'id': node.pk,
        # Notice the use of node._cached_children instead of node.get_children()
        'children': [recursive_node_to_dict(c) for c in node._cached_children]
    }

Called like this:

from mptt.templatetags.mptt_tags import cache_tree_children
subTrees = cache_tree_children(root.get_descendants(include_self=True))
subTreeDicts = []
for subTree in subTrees:
    subTree = recursive_node_to_dict(subTree)
    subTreeDicts.append(subTree)
jsonTree = json.dumps(subTreeDicts, indent=4)
# optional clean-up: strip the [ ] at the beginning and the end, as needed for D3.js
jsonTree = jsonTree[1:-1]

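Below are the profiling results, created with cProfile as MuMind suggested: a Django view starts the standalone method profileJSON(), which in turn calls the different solutions to create the JSON output.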

def startProfileJSON(request):
    import cProfile
    print("startProfileJSON")
    # profileJSON() is looked up through the globals passed in here
    cProfile.runctx('profileJSON()', globals=globals(), locals=locals())
    print("endProfileJSON")

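Results:

Solution 1: 3350347 function calls (3130372 primitive calls) in 4.969 seconds (details)

Solution 2: 2533705 function calls (2354516 primitive calls) in 3.630 seconds (details)

Solution 3: 2533621 function calls (2354441 primitive calls) in 3.684 seconds (details)

Solution 4: 2812725 function calls (2466028 primitive calls) in 3.840 seconds (details)

Solution 5: 2536504 function calls (2357256 primitive calls) in 3.779 seconds (details)

Solution 6 (improved Solution 4): 2593122 function calls (2299165 primitive calls) in 3.663 seconds (details)

Discussion:

Solution 1: my own encoding implementation. Bad idea.

Solution 2 + 3: currently the fastest, but still painfully slow.

Solution 4: caching the children looks promising, but it performs similarly and currently produces invalid JSON, because the children are put into double []: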

"children": [[]] instead of "children": []

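Solution 5: using select_related makes no difference, though it is probably used the wrong way, since a node always has a ForeignKey to its parent and we are walking from the root down to the children.

Update: Solution 6: to me this looks like the cleanest solution, using the cache of child nodes. But it only performs on par with Solutions 2 + 3, which I find strange.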
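The double brackets reported for Solution 4 above can be reproduced without Django. The following is a minimal sketch in plain Python 3; the variable names are made up for illustration. It shows that a trailing comma after a list turns the value into a one-element tuple, which json.dumps then renders as a nested array.

```python
import json

with_comma = [{"id": 2}],      # trailing comma: Python builds a 1-tuple holding the list
without_comma = [{"id": 2}]    # no trailing comma: a plain list

as_tuple = json.dumps({"children": with_comma})
as_list = json.dumps({"children": without_comma})

print(type(with_comma).__name__)   # tuple
print(as_tuple)                    # {"children": [[{"id": 2}]]}
print(as_list)                     # {"children": [{"id": 2}]}
```

Removing the comma at the end of the children assignment is therefore enough to get a single level of brackets.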

Any more ideas for performance improvements?

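【Comments】:

Have you gone through docs.djangoproject.com/en/dev/topics/serialization?

Thanks for the hint. I went through it with recursive serialization in mind but could not make sense of it. Your hint did make me search for serialization + tree + django, though, and the first hit was this question, which looks like the solution I am after: ***.com/questions/5597136/… .. Will give it a try!

It looks like the main bottleneck in the first version of your code above is the repeated appending to a string, which is a very inefficient operation because it has to keep reallocating contiguous blocks of memory. Appending the strings to a list and doing ''.join(pieces) at the end would perform much better.

Thanks for the hint. Will try it tomorrow. Time for a break.

Of all the nodes in your database, how many are you actually dumping to JSON with this method?

【Answer 1】: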

Organize your data into nested dictionaries or lists, then call the json dumps method:

import json
data = ['foo', {'bar': ('baz', None, 1.0, 2)}]
json.dumps(data)

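【Discussion】:

If I understand correctly, this is what I do in Solution 2. See the update to my question above.

【Answer 2】:

Your updated version looks like it has very little overhead. I think a list comprehension would be slightly more efficient (and more readable too!):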

def serializable_object(node):
    "Recurse into tree to build a serializable object"
    obj = {
        'name': node.name,
        'children': [serializable_object(ch) for ch in node.get_children()]
    }
    return obj

Beyond that, all you can do is profile it to find the bottlenecks. Write some standalone code that loads and serializes your 300 nodes, then run it with

python -m profile serialize_benchmark.py

(or -m cProfile if that works better).
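The same measurement can also be taken from inside a script. This is a minimal sketch in Python 3 using the standard cProfile and pstats modules. build_tree and serialize are hypothetical stand-ins for the real loading and serializing code, which would need a Django setup.

```python
import cProfile
import io
import json
import pstats


def build_tree(depth, width):
    """Hypothetical stand-in workload: build a nested dict tree."""
    if depth == 0:
        return {"name": "leaf", "children": []}
    return {"name": "node",
            "children": [build_tree(depth - 1, width) for _ in range(width)]}


def serialize():
    return json.dumps(build_tree(depth=4, width=4))


profiler = cProfile.Profile()
profiler.enable()
result = serialize()
profiler.disable()

# Sort by cumulative time and keep the ten most expensive entries.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

The ncalls column of the report is the figure to compare against the expected number of nodes.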

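I can see 3 different potential bottlenecks:

DB access (.get_children() and .name): I am not sure exactly what happens behind the scenes, but I have had code like this that ran a database query for every node, adding tremendous overhead. If that is your problem, you can probably configure it to do "eager loading" using select_related or something similar.

Function call overhead (e.g. serializable_object itself): just make sure that ncalls for serializable_object looks like a reasonable number. If I understand your description, it should be in the neighborhood of 300.

Serializing at the end (json.dumps(nodeInstance)): not a likely culprit since you said it is only 300 nodes, but if you do see this taking up a lot of execution time, make sure the compiled speedups for JSON are working properly.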

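If you cannot tell much from profiling, make a stripped-down version that, say, recursively calls node.name and node.get_children() but does not store the results in a data structure, and see how that compares.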
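Such a stripped-down version might look like the sketch below (Python 3). FakeNode is a hypothetical stand-in that only mimics the two attributes being exercised; with a real tree, traverse_only would be called on a model instance instead.

```python
import time


class FakeNode(object):
    """Hypothetical stand-in for an MPTT node; counts get_children() calls."""
    calls = 0

    def __init__(self, name, children=()):
        self.name = name
        self._children = list(children)

    def get_children(self):
        FakeNode.calls += 1
        return self._children


def traverse_only(node):
    """Touch name and children of every node, but build nothing."""
    node.name
    for child in node.get_children():
        traverse_only(child)


root = FakeNode("root", [FakeNode("a", [FakeNode("a1")]), FakeNode("b")])

start = time.perf_counter()
traverse_only(root)
elapsed = time.perf_counter() - start

print(FakeNode.calls, elapsed)
```

If the traversal alone accounts for most of the total run time on the real data, the cost lies in fetching the nodes rather than in building dicts or encoding JSON.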

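Update: There are 2192 calls to execute_sql in Solution 3 and 2192 in Solution 5, so I think excessive DB queries are a problem and select_related did nothing the way it is used above. Looking at django-mptt issue #88: Allow select_related in model methods suggests that you are using it more or less right, but I have my doubts, and get_children vs. get_descendants might make a huge difference.

There is also a lot of time taken up by copy.deepcopy, which is puzzling because you are not calling it directly and I do not see it called from the MPTT code. What is tree.py?

If you are doing a lot of work with profiling, I highly recommend the really slick tool RunSnakeRun, which lets you see your profile data in a very convenient grid form and make sense of it much faster.

Anyway, here is one more attempt at streamlining the DB side of things: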

# A plain dict is used as the cache: built-in dicts cannot be weakly
# referenced, so a weakref.WeakValueDictionary would raise TypeError here.

def serializable_object(node):
    obj_cache = {}
    root_obj = {'name': node.get_wbs_code(), 'wbsCode': node.get_wbs_code(),
            'id': node.pk, 'level': node.level, 'position': node.position,
            'children': []}
    obj_cache[node.pk] = root_obj
    # don't know if the following .select_related() does anything...
    for descendant in node.get_descendants().select_related():
        # get_descendants supposedly traverses in "tree order", which I think
        # means the parent obj will always be created already
        parent_obj = obj_cache[descendant.parent.pk]    # hope parent is cached
        descendant_obj = {'name': descendant.get_wbs_code(),
            'wbsCode': descendant.get_wbs_code(), 'id': descendant.pk,
            'level': descendant.level, 'position': descendant.position,
            'children': []}
        parent_obj['children'].append(descendant_obj)
        obj_cache[descendant.pk] = descendant_obj
    return root_obj

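Note that this is no longer recursive. It proceeds iteratively through the nodes, theoretically after their parents have been visited, and it all uses one big call to MPTTModel.get_descendants(), so hopefully that is well optimized and caches .parent etc. (or maybe there is a more direct way to get at the parent key?). It creates each obj with no children initially, then "grafts" all the values onto their parents afterwards.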
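On the question of a more direct way to reach the parent key: Django stores the raw value of a foreign key on the instance under the field name plus _id (here that would be parent_id), and reading it does not load the related object. The sketch below (Python 3) illustrates the grafting idea with that attribute. Row and graft are hypothetical names, and the namedtuple merely stands in for model instances arriving in tree order.

```python
from collections import namedtuple

# Hypothetical stand-in for model rows; parent_id mirrors the raw key column.
Row = namedtuple("Row", "pk parent_id name")


def graft(rows):
    """Build nested dicts from rows listed in tree order (parents first)."""
    by_pk = {}
    roots = []
    for row in rows:
        obj = {"id": row.pk, "name": row.name, "children": []}
        by_pk[row.pk] = obj
        if row.parent_id in by_pk:
            by_pk[row.parent_id]["children"].append(obj)
        else:
            roots.append(obj)   # parent not in this set: treat as a root
    return roots


rows = [Row(1, None, "root"), Row(2, 1, "a"), Row(3, 2, "a1"), Row(4, 1, "b")]
tree = graft(rows)
print(tree)
```

Each row is visited once and parents are found by dictionary lookup, so no per-node query or recursion is involved.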

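【Discussion】:

Thanks! I checked the speedups and they seem to be in place, using Python 2.7: from _json import scanstring as c_scanstring .

【Answer 3】:

I suspect that by far the biggest slowdown is that this does 1 database query per node. The JSON rendering is trivial compared to hundreds of round trips to your database.

You should cache the children on each node so that those queries can be done all at once. django-mptt has a cache_tree_children() function you can do this with.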

import json
from mptt.templatetags.mptt_tags import cache_tree_children

def recursive_node_to_dict(node):
    result = {
        'id': node.pk,
        'name': node.name,
    }
    children = [recursive_node_to_dict(c) for c in node.get_children()]
    if children:
        result['children'] = children
    return result

root_nodes = cache_tree_children(Node.objects.all())
dicts = []
for n in root_nodes:
    dicts.append(recursive_node_to_dict(n))

print(json.dumps(dicts, indent=4))

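Custom JSON encoding might give a slight speedup in some scenarios, but I would strongly discourage it: it is a lot of code, and it is easy to get very wrong.

【Discussion】:

Thanks. Could you check Solution 4 in my code and suggest how to prevent the double list for children? The children are currently wrapped in two [] instead of one.

Done. It had a trailing comma, which Python happily turned into a tuple :s

Thanks. I used your code since it is the most elegant and reworked it a little. See Solution 6 above. One important thing is to use node._cached_children instead of node.get_children() to take advantage of the caching done by cache_tree_children. Anyway, the code does not perform better than Solution 2, which seems strange to me, since your argument about hitting the db only once is convincing. Any idea why it does not perform better?

【Answer 4】:

After playing around with this for a while, I found that the solutions were all too slow because mptt itself scans the cache multiple times to get_children.

Capitalizing on the fact that mptt returns rows in the correct order to build trees easily, I made this: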

def flat_tree_to_dict(nodes, max_depth):
    tree = []
    last_levels = [None] * max_depth
    for n in nodes:
        d = {'name': n.name}
        if n.level == 0:
            tree.append(d)
        else:
            parent_dict = last_levels[n.level - 1]
            if 'children' not in parent_dict:
                parent_dict['children'] = []
            parent_dict['children'].append(d)
        last_levels[n.level] = d
    return tree

For my dataset it runs 10x faster than the other solutions because it is O(n), iterating over the data only once.

I use it like this:

json.dumps(flat_tree_to_dict(Model.objects.all(), 4), indent=4)
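To try the single-pass idea without a database, here is a sketch in Python 3. It is a variant of the function above, not the answer's exact code: the list of last-seen levels grows on demand, so no max_depth argument is needed. FlatNode is a hypothetical stand-in exposing only name and level, and the input is assumed to arrive in depth-first tree order, as the answer relies on.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for MPTT nodes: only name and level are used.
FlatNode = namedtuple("FlatNode", "name level")


def flat_tree_to_dict(nodes):
    """Build nested dicts in one pass over nodes given in tree order."""
    tree = []
    last_levels = []                      # last_levels[i]: latest dict at level i
    for n in nodes:
        d = {"name": n.name}
        if n.level == 0:
            tree.append(d)
        else:
            last_levels[n.level - 1].setdefault("children", []).append(d)
        del last_levels[n.level:]         # drop stale entries from deeper levels
        last_levels.append(d)
    return tree


nodes = [FlatNode("root", 0), FlatNode("a", 1), FlatNode("a1", 2),
         FlatNode("b", 1), FlatNode("root2", 0)]
tree = flat_tree_to_dict(nodes)
print(json.dumps(tree, indent=4))
```

If the rows came back in any other order, a child could be attached to the wrong parent, so the ordering of the queryset is worth checking before relying on this approach.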

【Discussion】:
