Fast traversal of an lxml tree by avoiding a specific branch

Posted 2023-03-06


[Title]: Fast traverse through lxml tree by avoiding specific branch
[Posted]: 2021-05-28 15:30:06
[Question]:

Suppose I have an etree like the following:

my_data.xml

<?xml version="1.0" encoding="UTF-8"?>
<data>
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
    <rank updated="yes">2</rank>
    <holidays>
      <christmas>Yes</christmas>
    </holidays>
    <year>2008</year>
    <gdppc>141100</gdppc>
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
  </country>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
    <continent>Asia</continent>
    <holidays>
      <christmas>Yes</christmas>
    </holidays>
    <rank updated="yes">5</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
    <neighbor name="Malaysia" direction="N"/>
  </country>
  <country name="Panama" xmlns="aaa:bbb:ccc:panama:eee">
    <rank updated="yes">69</rank>
    <year>2011</year>
    <gdppc>13600</gdppc>
    <neighbor name="Costa Rica" direction="W"/>
    <neighbor name="Colombia" direction="E"/>
  </country>
  <ethnicity xmlns="aaa:bbb:ccc:ethnicity:eee">
    <malay>
      <holidays>
        <ramadan>Yes</ramadan>
      </holidays>
    </malay>
  </ethnicity>
</data>

Parsing it:

from lxml import etree

xtree = etree.parse('my_data.xml')
xroot = xtree.getroot()

I want to traverse the tree and process every branch except certain ones. In this example, I want to exclude the ethnicity branch:

node_to_exclude = xroot.xpath('.//*[local-name()="ethnicity"]')
exclude_path = xtree.getelementpath(node_to_exclude[0])

for element in xroot.iter('*'):
    if exclude_path not in xtree.getelementpath(element):
        # ...do stuff...

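But this still walks the entire tree. Is there a better/faster way to do this (i.e. to ignore the whole ethnicity branch altogether)? I am looking for a syntactic solution, not a recursive algorithm.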


[Answer 1]:

XPath can do this for you:

for element in xroot.xpath('.//*[not(ancestor-or-self::*[local-name()="ethnicity"])]'):
    # ...do stuff...
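To see what this expression selects, here is a minimal self-contained sketch. It assumes a shortened inline stand-in for my_data.xml (the `XML` string and the names `kept` and `selected` are made up for illustration) and collects the local names of the elements that pass the filter.

```python
from lxml import etree

# Shortened, hypothetical stand-in for my_data.xml
XML = b"""<data>
  <country name="Panama" xmlns="aaa:bbb:ccc:panama:eee">
    <rank updated="yes">69</rank>
  </country>
  <ethnicity xmlns="aaa:bbb:ccc:ethnicity:eee">
    <malay>
      <holidays><ramadan>Yes</ramadan></holidays>
    </malay>
  </ethnicity>
</data>"""

xroot = etree.fromstring(XML)

# Keep every element that is not <ethnicity> and has no <ethnicity> ancestor
kept = xroot.xpath('.//*[not(ancestor-or-self::*[local-name()="ethnicity"])]')

# QName(...).localname strips the namespace part of the tag
selected = [etree.QName(el).localname for el in kept]
print(selected)
```

Note that the XPath engine still evaluates the predicate for every element in the tree; the filtering only saves you from handling the excluded elements in Python.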

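It might improve performance (or it might not; measure it) to specify which ancestor you mean. For example, if <ethnicity xmlns="..."> is always a child of the top-level element, i.e. the "second-to-last ancestor", you can do this: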

for element in xroot.xpath('.//*[not(ancestor-or-self::*[last()-1][local-name()="ethnicity"])]'):
    # ...do stuff...
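Since the advice is to measure, here is one rough way to compare the two expressions with `timeit`. The synthetic document, the repetition count, and the names `any_ancestor` and `fixed_ancestor` are assumptions for illustration; timings on a real document may look quite different.

```python
import timeit
from lxml import etree

# Hypothetical synthetic document: 200 <country> branches plus one <ethnicity> branch
parts = ['<data>']
for i in range(200):
    parts.append('<country xmlns="aaa:c%d"><rank>1</rank><year>2011</year></country>' % i)
parts.append('<ethnicity xmlns="aaa:eth"><malay><holidays/></malay></ethnicity>')
parts.append('</data>')
xroot = etree.fromstring(''.join(parts))

# Compile both expressions once so that only evaluation is timed
any_ancestor = etree.XPath(
    './/*[not(ancestor-or-self::*[local-name()="ethnicity"])]')
fixed_ancestor = etree.XPath(
    './/*[not(ancestor-or-self::*[last()-1][local-name()="ethnicity"])]')

# Sanity check: both expressions should select the same elements here
tags_any = [el.tag for el in any_ancestor(xroot)]
tags_fixed = [el.tag for el in fixed_ancestor(xroot)]
same_result = tags_any == tags_fixed

t_any = timeit.timeit(lambda: any_ancestor(xroot), number=50)
t_fixed = timeit.timeit(lambda: fixed_ancestor(xroot), number=50)
print(same_result, t_any, t_fixed)
```

The `last()-1` variant only gives the same result while the excluded element really is a direct child of the root, so the sanity check is worth keeping next to the timing.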

Of course, you can also do this:

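for child in xroot:  # iterate over the direct children of the root
    if 'ethnicity' in child.tag:
        continue  # skip the whole ethnicity branch
    # './/*' is relative to child (its descendants); '//*' would select every element in the document
    for element in child.xpath('.//*'):
        # ...do stuff...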
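Another option that is not part of the original answer is `etree.iterwalk()`, whose iterator has a `skip_subtree()` method in recent lxml releases (I believe since 4.1; treat the version as an assumption). The sketch below uses a shortened, hypothetical stand-in for my_data.xml and never descends into the excluded branch. Note that, unlike `.//*`, the walk also yields the root element itself.

```python
from lxml import etree

# Shortened, hypothetical stand-in for my_data.xml
XML = b"""<data>
  <country name="Panama" xmlns="aaa:bbb:ccc:panama:eee">
    <rank updated="yes">69</rank>
  </country>
  <ethnicity xmlns="aaa:bbb:ccc:ethnicity:eee">
    <malay>
      <holidays><ramadan>Yes</ramadan></holidays>
    </malay>
  </ethnicity>
</data>"""

xroot = etree.fromstring(XML)

visited = []
walker = etree.iterwalk(xroot, events=('start',))
for event, element in walker:
    name = etree.QName(element).localname  # tag without the namespace
    if name == 'ethnicity':
        walker.skip_subtree()  # do not descend into this branch
        continue
    visited.append(name)  # stand-in for "...do stuff..."

print(visited)
```

Because the branch is pruned during the walk, its descendants are never visited at all, which is closer to what the question asks for than filtering after the fact.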

[Comments]:

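Thanks. I like the ancestor-or-self solution!

@TristanTran It is elegant, but it is probably not the fastest you can get, because XPath checks every ancestor of every element. If you know your document structure, there are better ways, such as the last suggestion.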
