Remove or filter XML nodes by XPath from a file in R

Posted 2023-02-24

【Title】: Remove or filter XML nodes by Xpaths from file in R 【Posted】: 2021-11-04 11:59:11 【Question】:

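I have very large, complex XML files to process (they look like https://github.com/HL7/C-CDA-Examples/blob/master/General/Parent%20Document%20Replace%20Relationship/CCD%20Parent%20Document%20Replace%20(C-CDAR2.1).xml), but I only need the attributes and values at specific XPaths (nodes). Stripping the unneeded nodes first would shorten processing time, since the fluff is filtered out before the detailed processing starts.

So far I have tried xml_remove: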

library(xml2)

xmlfile <- paste0(dir, "xmlFiles/", filelist[k])
file <- read_xml(xmlfile)
file <- xml_ns_strip(file)

# Remove every node matched by each XPath listed in xpathTable
for (counx in seq_len(nrow(xpathTable))) {
  xr <- xml_find_all(file, xpath = paste0("/", toString(xpathTable$xpaths[counx])))
  xml_remove(xr, free = TRUE)
}

This works when only a few nodes are removed, but it crashes as the number grows (>100).

Below is an example of the input and the result I want.

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book category="cooking">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
    </book>
    <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
        <ISBN>
            <Random>12354</Random>
        </ISBN>
    </book>
    <book category="web">
        <title lang="en">XQuery Kick Start</title>
        <author>James McGovern</author>
        <author>Per Bothner</author>
        <author>Kurt Cagle</author>
        <author>James Linn</author>
        <author>Vaidyanathan Nagarajan</author>
        <year>2003</year>
        <price>49.99</price>
    </book>
    <book category="web">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <ISBN>
            <Random>12345</Random>
        </ISBN>
        <price>39.95</price>
    </book>
</bookstore>

Filter by XPath:

/bookstore/book/title
/bookstore/book/year
/bookstore/book/ISBN/Random

<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
    <book category="cooking">
        <title lang="en">Everyday Italian</title>       
        <year>2005</year>
    </book>
    <book category="children">
        <title lang="en">Harry Potter</title>
        <year>2005</year>
        <ISBN>
            <Random>12354</Random>
        </ISBN>
    </book>
    <book category="web">
        <title lang="en">XQuery Kick Start</title>
        <year>2003</year>
    </book>
    <book category="web">
        <title lang="en">Learning XML</title>
        <year>2003</year>
        <ISBN>
            <Random>12345</Random>
        </ISBN>
    </book>
</bookstore>

【Comments】:

【Answer 1】:

This looks like a job for XQuery. For example, you could recreate your document like this:

<bookstore>
  {for $book in /bookstore/*
   return <book category="{$book/@category}">
     {$book/title}
     {$book/year}
     {$book/ISBN}
   </book>}
</bookstore>

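Using the bookstore example, this produces the result shown below it in the question. You can test this online, with XQuery selected as the option, at https://www.videlibri.de/cgi-bin/xidelcgi

There may be ways to run XQuery from R, but I would rather run it with a tool such as xidel in a preprocessing step on the command line.

【Comments】:

Preprocessing applications are not an option for me; I only have R, Python and SQL.

【Answer 2】: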

All of the elements can be selected with a single XPath 1.0 expression, which works in many languages:

/bookstore/book/descendant::*[name()="title" or name()="year" or name()="Random"]

Equivalent or similar expressions:

/bookstore/book/title | /bookstore/book/year | /bookstore/book/ISBN/Random
 //book/@category | //book/year | //ISBN/Random
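Since the question only needs the attributes and values at specific XPaths, the union expression above could be used to pull them straight into a table without removing anything. The following is a minimal sketch with xml2, not the answerer's code: the one-book sample document, the `wanted` vector and the `result` data frame are made up for illustration.

```r
library(xml2)

# Made-up sample: one book taken from the question's bookstore example
doc <- read_xml('<bookstore>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
    <ISBN><Random>12354</Random></ISBN>
  </book>
</bookstore>')

# Hypothetical list of wanted XPaths, joined into one union expression
wanted <- c("/bookstore/book/title",
            "/bookstore/book/year",
            "/bookstore/book/ISBN/Random")
nodes <- xml_find_all(doc, paste(wanted, collapse = " | "))

# One row per matched node: its path, text value and (if present) lang attribute
result <- data.frame(
  path  = xml_path(nodes),
  value = xml_text(nodes),
  lang  = xml_attr(nodes, "lang"),
  stringsAsFactors = FALSE
)
print(result)
```

Nodes without a `lang` attribute get `NA` in that column, so the same table can hold elements with and without attributes.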

To filter elements out (that is, to select the elements that should be removed):

//book/*[not(name()="title" or name()="year" or name()="ISBN" or name()="Random")]
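In R, this expression could be passed to xml2 to drop everything that is not on the keep list. The sketch below builds the predicate from a vector of element names and removes the matches; the two-book sample and the variable names `keep`, `predicate` and `drop_xpath` are assumptions made for illustration.

```r
library(xml2)

# Made-up sample: two books from the question's bookstore example
doc <- read_xml('<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
    <ISBN><Random>12354</Random></ISBN>
  </book>
</bookstore>')

# Element names to keep; ISBN stays because it is the parent of Random
keep <- c("title", "year", "ISBN", "Random")

# Build: //book/*[not(name()="title" or name()="year" or ...)]
predicate <- paste0('name()="', keep, '"', collapse = " or ")
drop_xpath <- sprintf("//book/*[not(%s)]", predicate)

# Unlink the unwanted children of each book from the document
xml_remove(xml_find_all(doc, drop_xpath))

remaining <- xml_name(xml_find_all(doc, "//book/*"))
print(remaining)
cat(as.character(doc))
```

Note that this predicate only inspects the direct children of `book`, so a kept parent such as `ISBN` keeps all of its descendants.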

For XML with namespaces, local-name() can be used instead of name() if you are not using namespace handling.

For the given example and elements, tested on the command line:
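As a rough illustration of the local-name() approach in R, the sketch below queries a document that declares a default namespace without calling xml_ns_strip() first. The tiny document is a made-up stand-in loosely shaped like a C-CDA header, and its values are invented.

```r
library(xml2)

# Made-up document with a default namespace (values are invented)
doc <- read_xml('<ClinicalDocument xmlns="urn:hl7-org:v3">
  <title>CCD</title>
  <id root="1.2.3"/>
  <effectiveTime value="20210101"/>
</ClinicalDocument>')

# An unprefixed name test only matches elements in no namespace,
# so this finds nothing in a document with a default namespace
plain <- xml_find_all(doc, "//title")
print(length(plain))

# local-name() compares the name without regard to the namespace
nodes <- xml_find_all(
  doc,
  '//*[local-name()="title" or local-name()="effectiveTime"]'
)
print(xml_name(nodes))
```

This avoids rewriting the document with xml_ns_strip(), at the cost of longer expressions.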

echo 'cat /bookstore/book/descendant::*[name()="title" or name()="year" or name()="Random"]' | xmllint --shell test.xml

Result:

/ > cat /bookstore/book/descendant::*[name()="title" or name()="year" or name()="Random"]
 -------
<title lang="en">Everyday Italian</title>
 -------
<year>2005</year>
 -------
<title lang="en">Harry Potter</title>
 -------
<year>2005</year>
 -------
<Random>12354</Random>
 -------
<title lang="en">XQuery Kick Start</title>
 -------
<year>2003</year>
 -------
<title lang="en">Learning XML</title>
 -------
<year>2003</year>
 -------
<Random>12345</Random>
/ >

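As for the R crash mentioned, it is worth taking a look here.

【Comments】:

This might work for files that are not too complex, like the example, but the files I process look like github.com/HL7/C-CDA-Examples/blob/master/General/…

Well, XPath should help to find nodes in almost any file. Please post which elements you want or do not want from that file and I can take a look later.

The list of XPaths whose values and attributes I want to remove is at justpaste.it/84boj

Do you want to remove the elements or attributes themselves, or only their values? Some kind of anonymization of the document? Also, the XPaths in the list above do not contain any @ attributes, like //book/@category

@Bristle I just added a link to my answer about the R crash.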
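One thing that may be worth trying for the crash, offered as a guess rather than a diagnosis: the xml2 documentation for xml_remove() cautions that with `free = TRUE` any existing object still pointing at a removed node or its children can crash R. The sketch below joins all XPaths into one union expression, so the document is searched once, and removes the matches with the default `free = FALSE`. The small `xpathTable` and sample document are hypothetical stand-ins for the asker's data, and the XPaths are assumed to be absolute already.

```r
library(xml2)

# Hypothetical stand-in for the asker's xpathTable (absolute XPaths assumed)
xpathTable <- data.frame(
  xpaths = c("/bookstore/book/author", "/bookstore/book/price"),
  stringsAsFactors = FALSE
)

# Made-up sample: one book from the question's bookstore example
doc <- read_xml('<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
</bookstore>')

# Join every XPath into a single union expression: "a | b | c ..."
union_xpath <- paste(xpathTable$xpaths, collapse = " | ")
unwanted <- xml_find_all(doc, union_xpath)

# Default free = FALSE only unlinks the nodes; nothing is freed while
# `unwanted` still refers to them
xml_remove(unwanted)

remaining <- xml_name(xml_children(xml_find_first(doc, "/bookstore/book")))
print(remaining)
```

With `free = FALSE` it should also not matter if one matched node is a descendant of another matched node, because no memory is released underneath the node set.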
