Python: create a tree from a JSON file

Posted: 2019-09-19 10:55:17

Question:

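Suppose we have the following JSON file. For the purposes of the example, it is emulated by a string. The string is the input, and a Tree object should be the output. I'll use the graphical notation of a tree to present the output.

I found the following classes to handle the tree concept in Python: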

class TreeNode(object):
    def __init__(self, data):
        self.data = data
        self.children = []

    def add_child(self, obj):
        self.children.append(obj)

    def __str__(self, level=0):
        ret = "\t"*level+repr(self.data)+"\n"
        for child in self.children:
            ret += child.__str__(level+1)
        return ret

    def __repr__(self):
        return '<tree node representation>'

class Tree:
    def __init__(self):
        self.root = TreeNode('ROOT')

    def __str__(self):
        return self.root.__str__()

Input files can be of different complexity:

Simple case

Input:

json_file = '{"item1": "end1", "item2": "end2"}'

Output:

"ROOT"
    item1
        end1
    item2
        end2

Embedded case

Input:

json_file = {"item1": "end1", "item2": {"item3": "end3"}}

Output:

"ROOT"
    item1
        end1
    item2
        item3
            end3

Array case

Input:

json_file = {"name": "John", "items": [{"item_name": "lettuce", "price": 2.65, "units": "no"}, {"item_name": "ketchup", "price": 1.51, "units": "litres"}]}

Output:

"ROOT"
    name
        John
    items
        1
            item_name
                lettuce
            price
                2.65
            units
                no
        2   
            item_name
                ketchup
            price
                1.51
            units
                litres

Note that each item in the array is described by an integer (starting from 1).
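One way the three cases could be handled uniformly is sketched below. It assumes the input has already been decoded with json.loads; build_tree is a hypothetical helper name, and the TreeNode here is a trimmed copy of the class above that prints str(data) instead of repr(data).

```python
import json


class TreeNode:
    """Trimmed copy of the TreeNode class above (prints str(data), not repr)."""

    def __init__(self, data):
        self.data = data
        self.children = []

    def add_child(self, obj):
        self.children.append(obj)

    def __str__(self, level=0):
        ret = "\t" * level + str(self.data) + "\n"
        for child in self.children:
            ret += child.__str__(level + 1)
        return ret


def build_tree(value, parent):
    """Attach the decoded JSON `value` below `parent` and return `parent`.

    Hypothetical helper, not part of the original question.
    """
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value, start=1)  # array items are numbered from 1
    else:
        parent.add_child(TreeNode(value))  # scalar value: a leaf node
        return parent
    for key, child in items:
        node = TreeNode(key)
        parent.add_child(node)
        build_tree(child, node)  # recurse with the new node as parent
    return parent


json_file = '{"name": "John", "items": [{"item_name": "lettuce", "price": 2.65}]}'
root = build_tree(json.loads(json_file), TreeNode("ROOT"))
print(root)
```

Because the parent node is passed down explicitly and returned at every level, no variable is left unassigned on the recursive calls.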

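So far I've managed to come up with the following function, which solves the simple case. As for the embedded case, I know I have to use recursion, but so far I get UnboundLocalError: local variable 'tree' referenced before assignment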

def create_tree_from_JSON(json, parent=None):
    if not parent:
        tree = Tree()
        node_0 = TreeNode("ROOT")
        tree.root = node_0
        parent = node_0
    else:
        parent = parent

    for key in json:
        if isinstance(json[key], dict):
            head = TreeNode(key)
            create_tree_from_JSON(json[key], head)
        else:
            node = TreeNode(key)
            node.add_child(TreeNode(json[key]))
            parent.add_child(node)

    return tree

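Background of the problem

You may wonder why I need to turn a JSON object into a tree. As you may know, PostgreSQL provides a way to work with JSON fields in a database. Given a JSON object, I can get the value of any field using the -> and ->> notation. More on the topic can be found here and here. I will create new tables based on the names and values of the fields. Unfortunately, the JSON objects vary so much that I can't write the .sql code by hand; I have to find a way to do it automatically.

Suppose I want to create a table based on the embedded case. I need to get the following .sql code: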

select 
    content_json ->> 'item1' as end1,
    content_json -> 'item2' ->> 'item3' as end3
from table_with_json

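Replace "ROOT" with content_json and you can see that each line of the SQL code is just a depth-first traversal from "ROOT" to a leaf (the final step of each path is always written with ->>).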
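The traversal described above can be sketched on a plain decoded dict. This is only an illustration under assumptions: leaf_paths is a hypothetical helper, and the path is taken to end at the last key (the leaf value itself is discarded), which is the convention the array queries in this question use.

```python
def leaf_paths(data, path=()):
    """Yield one tuple of keys per leaf, in depth-first order."""
    if isinstance(data, dict):
        for key, child in data.items():
            yield from leaf_paths(child, path + (key,))
    else:
        yield path  # reached a leaf: the path ends at the last key


embedded = {"item1": "end1", "item2": {"item3": "end3"}}
paths = list(leaf_paths(embedded))

# Every step except the last uses ->, the last step uses ->>.
lines = []
for *parents, last in paths:
    steps = "".join(" -> '%s'" % key for key in parents)
    lines.append("content_json%s ->> '%s'" % (steps, last))

print("\n".join(lines))
```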

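Edit: to make the question clearer, I'm adding the target .sql queries for the array case. I'd like to have as many queries as there are elements in the array: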

select
    content_json ->> 'name' as name,
    content_json -> 'items' -> 1 ->> 'item_name' as item_name,
    content_json -> 'items' -> 1 ->> 'price' as price,
    content_json -> 'items' -> 1 ->> 'units' as units
from table_with_json

select
    content_json ->> 'name' as name,
    content_json -> 'items' -> 2 ->> 'item_name' as item_name,
    content_json -> 'items' -> 2 ->> 'price' as price,
    content_json -> 'items' -> 2 ->> 'units' as units
from table_with_json

Solution so far (07.05.2019)

I'm currently testing the following solution:

from collections import OrderedDict

def treeify(data) -> dict:
    if isinstance(data, dict):  # already have keys, just recurse
        return OrderedDict((key, treeify(children)) for key, children in data.items())
    elif isinstance(data, list):  # make keys from indices
        return OrderedDict((idx, treeify(children)) for idx, children in enumerate(data, start=1))
    else:  # leaf node, no recursion
        return data

def format_query(tree, stack=('content_json',)) -> str:
    if isinstance(tree, dict):  # build stack of keys
        for key, child in tree.items():
            yield from format_query(child, stack + (key,))
    else:  # print complete stack, discarding leaf data in tree
        *keys, field = stack
        path = ' -> '.join(
            str(key) if isinstance(key, int) else "'%s'" % key
            for key in keys
        )
        yield path + " ->> '%s' as %s" % (field, field)

def create_select_query(lines_list):
    query = "select\n"
    for line_number in range(len(lines_list)):
        if "_class" in lines_list[line_number]:
            # ignore '_class' fields
            continue
        query += "\t" + lines_list[line_number]
        if line_number == len(lines_list)-1:
            query += "\n"
        else:
            query += ",\n"
    query += "from table_with_json"
    return query
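One thing to watch in create_select_query: the comma is chosen by position, so if the last line is a skipped '_class' line, the line before it keeps its trailing comma. A possible alternative, sketched here with a hypothetical name (build_select_query) and hypothetical parameters, filters first and then joins:

```python
def build_select_query(lines, table="table_with_json", skip="_class"):
    """Join query lines into one select statement.

    Lines containing `skip` are dropped before joining, so the last
    kept line never ends with a comma.
    """
    kept = [line for line in lines if skip not in line]
    body = ",\n".join("\t" + line for line in kept)
    return "select\n" + body + "\nfrom " + table


lines = [
    "content_json ->> 'name' as name",
    "content_json ->> '_class' as _class",  # skipped, even in last position
]
query = build_select_query(lines)
print(query)
```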

I'm currently working with JSON like this:

stack_nested_example = {"_class":"value_to_be_ignored","first_key":{"second_key":{"user_id":"123456","company_id":"9876","question":{"subject":"some_subject","case_type":"urgent","from_date":{"year":2011,"month":11,"day":11},"to_date":{"year":2012,"month":12,"day":12}},"third_key":[{"role":"driver","weather":"great"},{"role":"father","weather":"rainy"}]}}}

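In the output I get, the only thing that stays the same is the order of the lines produced by the array logic. The other lines come out in a different order. The output I'd like to get is one that respects the order of the keys: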

select
        'content_json' -> 'first_key' -> 'second_key' ->> 'user_id' as user_id,
        'content_json' -> 'first_key' -> 'second_key' ->> 'company_id' as company_id,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'subject' as subject,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' ->> 'case_type' as case_type,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'year' as year,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'month' as month,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'from_date' ->> 'day' as day,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'year' as year,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'month' as month,
        'content_json' -> 'first_key' -> 'second_key' -> 'question' -> 'to_date' ->> 'day' as day,
        'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'role' as role,
        'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 1 ->> 'weather' as weather,
        'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'role' as role,
        'content_json' -> 'first_key' -> 'second_key' -> 'third_key' -> 2 ->> 'weather' as weather
from table_with_json

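Comments:

Why not stick with plain dict and possibly convert lists to dict for consistency? They already form a tree; no extra class is needed.

This may help: ***.com/questions/53828023/… vanya.jp.net/vtree

Answer 1:

You can use recursion: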

def format_query(d):
  if all(not isinstance(i, tuple) for i in d):
    return 'select\n{}\nfrom table_with_json'.format(',\n'.join('\tcontent_json {}'.format("->> '{}' as {}".format(i[0], i[0]) if len(i) == 1 else "-> {} ->> '{}' as {}".format(' -> '.join("'{}'".format(j) for j in i[:-1]), i[-1], i[-1])) for i in d))
  return '\n\n'.join(format_query([c for b in i for c in b]) for i in d)

def get_dict(d, c = []):
  for a, b in d.items():
     if not isinstance(b, (dict, list)):
       yield c+[a]
     elif isinstance(b, dict):
       yield from to_query(b, c+[a])

def to_query(d, q = []):
  if not any(isinstance(i, list) for i in d.values()):
     yield from get_dict(d, c=q)
  else:
     _c = list(get_dict(d))
     for a, b in d.items():
       if isinstance(b, list):
         for i, j in enumerate(b, 1):
            yield (_c, list(get_dict(j, [a, i])))

Now, to format:

json_file = {"name": "John", "items": [{"item_name": "lettuce", "price": 2.65, "units": "no"}, {"item_name": "ketchup", "price": 1.51, "units": "litres"}]}
print(format_query(list(to_query(json_file))))

Output:

select
      content_json ->> 'name' as name,
      content_json -> 'items' -> '1' ->> 'item_name' as item_name,
      content_json -> 'items' -> '1' ->> 'price' as price,
      content_json -> 'items' -> '1' ->> 'units' as units
from table_with_json

select
     content_json ->> 'name' as name,
     content_json -> 'items' -> '2' ->> 'item_name' as item_name,
     content_json -> 'items' -> '2' ->> 'price' as price,
     content_json -> 'items' -> '2' ->> 'units' as units
from table_with_json

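Comments:

Thanks for your comment. Your answer works for printing the tree, but as I pointed out in the question, I need to generate a tree; I'm only using the graphical (tab-indented) representation to present the output I want to get :)

@balkon16 Oops, thanks for pointing that out. For input with lists, what would your SQL query look like?

@balkon16 Thanks. In the input without arrays, the value of the final key is included in the result, i.e. content_json -> 'item_1' ->> 'end1' as end1,. For arrays, however, you only keep the key, content_json ->> 'name' as name,, rather than supplying the value, content_json -> 'name' ->> 'John' as John. Can you clarify that for me? Thanks.

That was my mistake. The .sql query for the embedded case (no arrays) follows the same logic as the array case.

@balkon16 Thanks for your patience :) Please see my recent edit.

Answer 2:

In your create_tree_from_JSON, you never pass the tree along during recursion. Yet you try to return it.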

def create_tree_from_JSON(json, parent=None):
    if not parent:
        tree = Tree()  # tree is only created for root node
        ...
    else:
        parent = parent  # tree is not created here
    ...
    return tree  # tree is always returned
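The same failure can be reproduced in a few lines. This is a minimal sketch with a hypothetical function (make_label) that, like the code above, assigns a local name on one branch only and returns it unconditionally:

```python
def make_label(is_root):
    if is_root:
        label = "ROOT"  # only assigned on this branch
    return label  # raises UnboundLocalError when is_root is False


print(make_label(True))
try:
    make_label(False)
except UnboundLocalError as exc:
    print("UnboundLocalError:", exc)
```

In the question's function the recursive calls pass a TreeNode as parent, so they always take the else branch and hit this error on return.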

Either pass tree along during recursion, or separate the root step from the other steps:

def create_tree_from_JSON(json):  # root case
    tree = Tree()
    node_0 = TreeNode("ROOT")
    tree.root = node_0
    parent = node_0
    _walk_tree(json, parent)
    return tree  # the tree only exists here, so return it here

def _walk_tree(json, parent):  # recursive case
    for key in json:
        if isinstance(json[key], dict):
            head = TreeNode(key)
            _walk_tree(json[key], head)
            parent.add_child(head)  # attach the finished subtree
        else:
            node = TreeNode(key)
            node.add_child(TreeNode(json[key]))
            parent.add_child(node)

Note that what you are doing can be solved more easily with plain dicts. Your classes really just wrap a custom interface around dict.

def treeify(data) -> dict:
    if isinstance(data, dict):  # already have keys, just recurse
        return {key: treeify(children) for key, children in data.items()}
    elif isinstance(data, list):  # make keys from indices
        return {idx: treeify(children) for idx, children in enumerate(data, start=1)}
    else:  # leaf node, no recursion
        return data

You can feed any decoded JSON data to it.

>>> treeify({"name": "John", "items": [{"item_name": "lettuce", "price": 2.65, "units": "no"}, {"item_name": "ketchup", "price": 1.51, "units": "litres"}]})
{'name': 'John', 'items': {1: {'item_name': 'lettuce', 'price': 2.65, 'units': 'no'}, 2: {'item_name': 'ketchup', 'price': 1.51, 'units': 'litres'}}}

To get the desired pretty-printed output, you can walk this structure with a stack of the current keys. A generator is well suited to creating each query line on the fly:

def format_query(tree, stack=('content_json',)) -> str:
    if isinstance(tree, dict):  # build stack of keys
        for key, child in tree.items():
            yield from format_query(child, stack + (key,))
    else:  # print complete stack, discarding leaf data in tree
       *keys, field = stack
       path = ' -> '.join(
           str(key) if isinstance(key, int) else "'%s'" % key
           for key in keys
       )
       yield path + " ->> '%s' as %s" % (field, field)

Given your second example, this lets you get a list of query lines:

>>> list(format_query(treeify({"name": "John", "items": [{"item_name": "lettuce", "price": 2.65, "units": "no"}, {"item_name": "ketchup", "price": 1.51, "units": "litres"}]})))
["'content_json' ->> 'name' as name",
 "'content_json' -> 'items' -> 1 ->> 'item_name' as item_name",
 "'content_json' -> 'items' -> 1 ->> 'price' as price",
 "'content_json' -> 'items' -> 1 ->> 'units' as units",
 "'content_json' -> 'items' -> 2 ->> 'item_name' as item_name",
 "'content_json' -> 'items' -> 2 ->> 'price' as price",
 "'content_json' -> 'items' -> 2 ->> 'units' as units"]
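If the column name has to appear without quotes in the generated SQL, one option is to keep it out of the key stack altogether. The variant below is a sketch, not the answer's code: format_query_lines and its column parameter are hypothetical names, and it expects a tree of nested dicts as produced by treeify.

```python
def format_query_lines(tree, column="content_json", path=()):
    """Yield one query line per leaf; `column` is emitted without quotes."""
    if isinstance(tree, dict):  # build up the path of keys
        for key, child in tree.items():
            yield from format_query_lines(child, column, path + (key,))
    else:  # leaf reached: the last key becomes the field name
        *keys, field = path
        steps = "".join(
            " -> %s" % (key if isinstance(key, int) else "'%s'" % key)
            for key in keys
        )
        yield "%s%s ->> '%s' as %s" % (column, steps, field, field)


tree = {"name": "John", "items": {1: {"price": 2.65}}}
result = list(format_query_lines(tree))
print("\n".join(result))
```

Keeping the column separate from the keys avoids patching the finished string afterwards.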

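Comments:

Thanks for your input. Note that I've changed the logic in the embedded case. The last item in the chain is the last key, preceded by ->>.

I get NameError: name 'keys' is not defined when trying to reproduce your answer.

@balkon16 My bad, format_query should work now.

Thanks for the edit. It turns out that 'content_json' has to be unquoted (content_json). I settled on a less elegant solution: path = ' -> '.join(str(key) if isinstance(key, int) else "'%s'" % key for key in keys).replace("'content_json'", "content_json") (I added the replace(...) part). Do you happen to know a more elegant solution?

@balkon16 If you load the JSON into a dict, you have already lost the ordering there. You have to load the JSON with OrderedDict to represent objects with ordering: ***.com/questions/6921699/…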
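Loading with OrderedDict, as suggested in the last comment, can be done through the object_pairs_hook parameter of json.loads. The sample string below is a shortened, made-up slice of the question's data. On Python 3.7 and later, plain dicts also keep insertion order, so the hook mainly matters on older interpreters.

```python
import json
from collections import OrderedDict

raw = (
    '{"user_id": "123456", "company_id": "9876", '
    '"question": {"subject": "some_subject", "case_type": "urgent"}}'
)

# Every JSON object, at any nesting level, is built as an OrderedDict
# whose keys follow the order in which they appear in the text.
data = json.loads(raw, object_pairs_hook=OrderedDict)

print(list(data))              # ['user_id', 'company_id', 'question']
print(list(data["question"]))  # ['subject', 'case_type']
```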
