How to parse an XML node with a colon tag using PHP

Posted 2023-02-24

【Published】: 2015-07-04 19:03:08 【Question】:

I am trying to get the values of the following nodes from [this URL (it takes quite a while to load)][1]. The elements I am interested in are:

title, g:price and g:gtin

The XML starts like this:

<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <channel>
    <title>PhotoSpecialist.de</title>
    <link>http://www.photospecialist.de</link>
    <description/>
    <item>
      <g:id>BEN107C</g:id>
      <title>Benbo Trekker Mk3 + Kugelkopf + Tasche</title>
      <description>
        Benbo Trekker Mk3 + Kugelkopf + Tasche Das Benbo Trekker Mk3 ist eine leichte Variante des beliebten Benbo 1. Sein geringes Gewicht macht das Trekker Mk3 zum idealen Stativ, wenn Sie viel draußen fotografieren und viel unterwegs sind. Sollten Sie in eine Situation kommen, in der maximale Stabilität zählt, verfügt das Benbo Trekker Mk3 über einen Haken an der Mittelsäule. An diesem können Sie das Stativ mit zusätzlichem Gewicht bei Bedarf beschweren. Dank der zwei besonderen Kamera-Befestigungsschrauben können Sie mit dem Benbo Trekker Mk3 sehr nah am Boden fotografieren. So nah, dass in vielen Fällen die einzige Einschränkung die Größe Ihrer Kamera darstellt. In diesem Set erhalten Sie das Benbo Trekker Mk3 zusammen mit einem Kugelkopf, Socket und einer Tasche für den sicheren und komfortablen Transport.
      </description>
      <link>
        http://www.photospecialist.de/benbo-trekker-mk3-kugelkopf-tasche?dfw_tracker=2469-16
      </link>
      <g:image_link>http://static.fotokonijnenberg.nl/media/catalog/product/b/e/benbo_trekker_mk3_tripod_kit_with_b__s_head__bag_ben107c1.jpg</g:image_link>
      <g:price>199.00 EUR</g:price>
      <g:condition>new</g:condition>
      <g:availability>in stock</g:availability>
      <g:identifier_exists>TRUE</g:identifier_exists>
      <g:brand>Benbo</g:brand>
      <g:gtin>5022361100576</g:gtin>
      <g:item_group_id>0</g:item_group_id>
      <g:product_type>Tripod</g:product_type>
      <g:mpn/>
      <g:google_product_category>Kameras &amp; Optik</g:google_product_category>
    </item>
  ...
  </channel>
</rss>

To do this, I wrote the following code:

$z = new XMLReader;
$z->open('https://my.datafeedwatch.com/static/files/1248/8222ebd3847fbfdc119abc9ba9d562b2cdb95818.xml');

$doc = new DOMDocument;

while ($z->read() && $z->name !== 'item')
    ;

while ($z->name === 'item')
{
    $node = new SimpleXMLElement($z->readOuterXML());
    $a = $node->title;
    $b = $node->price;
    $c = $node->gtin;
    echo $a . $b . $c . "<br />";
    $z->next('item');
}

This only returns the title... the price and the gtin are not displayed.

【Comments】:

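My bad, you are using SimpleXMLElement to access the attributes with their own namespace. So the linked duplicate is not entirely correct (you can either use XMLReader::expand() to get a DOMElement directly, convert to DOM via dom_import_simplexml, or make sure to access the namespaced properties directly through SimpleXML, as in the Q&A linked in this comment).

@hakre... I can't use simplexml because the XML is huge, so I have to use XMLReader.

Huh? You are actually using SimpleXML in the code in your question. And when I mentioned it, I was not suggesting that you move away from XMLReader.

@hakre... oops, sorry... I am actually new to this kind of XML coding... by the way, can you help me solve this?

【Answer 1】: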

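The elements you are asking about are not part of the default namespace but live in a different one. You can tell because their names carry a prefix, separated by a colon: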

  ...
  <channel>
    <title>PhotoSpecialist.de</title>
    <!-- title is in the default namespace, no colon in the name -->
    ...
    <g:price>199.00 EUR</g:price>
    ...
    <g:gtin>5022361100576</g:gtin>
    <!-- price and gtin are in a different namespace, colon in the name and prefixed by "g" -->
  ...

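The namespace is referred to through a prefix, in your case "g". Which namespace the prefix stands for is defined here, on the document element: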

<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">

So the namespace is "http://base.google.com/ns/1.0".
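To double-check which URI a prefix is bound to, SimpleXML can list the declarations for you. The following is only a minimal sketch: the XML is a shortened, made-up excerpt of the feed, and loading a document whole with simplexml_load_string() is only reasonable for a small sample like this, not for the large feed itself.

```php
<?php
// Shortened, hypothetical excerpt of the feed from the question.
$xml = <<<XML
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <channel>
    <item>
      <title>Benbo Trekker Mk3 + Kugelkopf + Tasche</title>
      <g:price>199.00 EUR</g:price>
    </item>
  </channel>
</rss>
XML;

$rss = simplexml_load_string($xml);

// getDocNamespaces() returns the namespaces declared on the root element
// as an array of prefix => namespace URI.
$namespaces = $rss->getDocNamespaces();

foreach ($namespaces as $prefix => $uri) {
    echo $prefix, ' => ', $uri, "\n";
}
```

The URI is the actual identity of the namespace; the prefix is just a document-local shorthand for it.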

When you access the child elements by name with SimpleXMLElement, as you currently do:

$a = $node->title;
$b = $node->price;
$c = $node->gtin;

you are only looking in the default namespace. So only the first element actually contains text; the other two are created on the fly and are still empty.
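The effect is easy to reproduce in isolation. This is a small sketch with a hand-written single item (the sample XML is an assumption, not the real feed): the un-namespaced lookup finds the title but matches nothing for price.

```php
<?php
// Hypothetical single <item>, with the namespace declared on the element
// itself so that the snippet is self-contained.
$xml = <<<XML
<item xmlns:g="http://base.google.com/ns/1.0">
  <title>Benbo Trekker Mk3 + Kugelkopf + Tasche</title>
  <g:price>199.00 EUR</g:price>
  <g:gtin>5022361100576</g:gtin>
</item>
XML;

$node = new SimpleXMLElement($xml);

// How many children match each un-namespaced name?
$titleCount = count($node->title);
$priceCount = count($node->price);

// Casting the non-matching lookup to string gives an empty string,
// which is why the echo in the question shows only the title.
$price = (string) $node->price;

var_dump($titleCount, $priceCount, $price);
```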

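To access the namespaced child elements, you need to tell SimpleXMLElement explicitly, using the children() method. It creates a new SimpleXMLElement containing all the children in that namespace instead of the default one: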

$google = $node->children("http://base.google.com/ns/1.0");

$a = $node->title;
$b = $google->price;
$c = $google->gtin;

So much for the isolated example (yes, that is all there is to it).
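As a side note, children() also accepts the prefix instead of the URI when its second argument is true. The sketch below compares both forms on a hand-written item (the sample XML is an assumption):

```php
<?php
// Hypothetical single <item> for illustration.
$xml = <<<XML
<item xmlns:g="http://base.google.com/ns/1.0">
  <title>Benbo Trekker Mk3 + Kugelkopf + Tasche</title>
  <g:price>199.00 EUR</g:price>
  <g:gtin>5022361100576</g:gtin>
</item>
XML;

$node = new SimpleXMLElement($xml);

// Select the namespace by URI (second argument defaults to false) ...
$byUri = $node->children('http://base.google.com/ns/1.0');

// ... or by prefix, by passing true as the second argument.
$byPrefix = $node->children('g', true);

$price = (string) $byUri->price;
$gtin  = (string) $byPrefix->gtin;

echo $node->title, ' - ', $price, ' - ', $gtin, "\n";
```

Selecting by URI is usually the more robust choice, because a feed is free to bind the same namespace to a different prefix.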

A complete example could then look like this (including node expansion on the reader; the code you had was a bit rusty):

<?php
/**
 * How to parse an XML node with a colon tag using PHP
 *
 * @link http://***.com/q/29876898/367456
 */
const HTTP_BASE_GOOGLE_COM_NS_1_0 = "http://base.google.com/ns/1.0";

$url = 'https://my.datafeedwatch.com/static/files/1248/8222ebd3847fbfdc119abc9ba9d562b2cdb95818.xml';

$reader = new XMLReader;
$reader->open($url);

$doc = new DOMDocument;

// move to first item element
while (($valid = $reader->read()) && $reader->name !== 'item') ;

while ($valid) {
    $default    = simplexml_import_dom($reader->expand($doc));
    $googleBase = $default->children(HTTP_BASE_GOOGLE_COM_NS_1_0);
    printf(
        "%s - %s - %s<br />\n"
        , htmlspecialchars($default->title)
        , htmlspecialchars($googleBase->price)
        , htmlspecialchars($googleBase->gtin)
    );

    // move to next item element
    $valid = $reader->next('item');
}

I hope this both explains the problem and broadens the view on how XMLReader can be used.
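Another way to read the expanded node, instead of importing it into SimpleXML, is DOMXPath with the namespace registered under a prefix. The following is a sketch under assumptions: the inline XML is a made-up two-item feed, and the constant name GOOGLE_NS is chosen only for this illustration.

```php
<?php
const GOOGLE_NS = 'http://base.google.com/ns/1.0';

// Hypothetical miniature feed; the real one would be opened with
// $reader->open($url) instead.
$xml = <<<XML
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <channel>
    <title>PhotoSpecialist.de</title>
    <item>
      <title>First item</title>
      <g:price>199.00 EUR</g:price>
      <g:gtin>5022361100576</g:gtin>
    </item>
    <item>
      <title>Second item</title>
      <g:price>10.00 EUR</g:price>
      <g:gtin>0000000000000</g:gtin>
    </item>
  </channel>
</rss>
XML;

$reader = new XMLReader;
$reader->XML($xml);

$doc   = new DOMDocument;
$xpath = new DOMXPath($doc);
// The prefix used in XPath expressions is bound here; it does not have to
// be the same prefix the document uses, only the URI must match.
$xpath->registerNamespace('g', GOOGLE_NS);

// move to first item element
while (($valid = $reader->read()) && $reader->name !== 'item') ;

$rows = [];
while ($valid) {
    // expand() copies the current item into $doc so XPath can query it
    $item   = $reader->expand($doc);
    $rows[] = [
        $xpath->evaluate('string(title)', $item),
        $xpath->evaluate('string(g:price)', $item),
        $xpath->evaluate('string(g:gtin)', $item),
    ];

    // move to next item element
    $valid = $reader->next('item');
}

foreach ($rows as $row) {
    echo implode(' - ', $row), "\n";
}
```

Only one item is expanded at a time, so memory use stays bounded by the size of a single item rather than the whole feed.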

【Comments】:

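@hakre.. thanks for such a nice, informative post... thanks again, my tutor.

A better variant would perhaps be to use DOMXPath. But I remembered that too late just now :) ThW has such an example with XMLReader; I'll see if I can find the link. --- Edit: here, this example fits very well: ***.com/a/23079179/367456

"only the first element actually contains text; the other two are created on the fly and are still empty" - that is not true; all child elements and attributes are retrieved on demand (ultimately here), and the call to ->children($ns) or ->attributes($ns) merely tells SimpleXML which ones to retrieve. I find it less surprising if you think of SimpleXML as an API, like DOM but simpler, rather than as an object that "contains" data.

@IMSoP; I like your description (I have read some of your recent answers in the SimpleXML tag; they are really well written and make me a bit jealous, but hopefully my English benefits from reading them), but some of these elements are also created on access, at least when you write data to them: eval.in/319535 - that is what I meant by created on the fly. The original document does not contain that element (that is $b and $c in my answer above).

@hakre Ah, I think I see what you mean, but they are not created just by reading them: eval.in/319537. Since the question is only about reading, the fact that you can create them by assigning a value is somewhat beside the point. It is an interesting point, though, that referencing them is not invalid, just not useful for the task at hand. :)

【Answer 2】: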

If the main tag itself is a name containing a colon, you have to use

$xml->next($xml->localName);

to move to the next item element.
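A small sketch of that situation follows. The element names g:feed and g:entry are invented for this illustration, and the behaviour relied on here is that, as far as I know, XMLReader::next() compares its argument against the local name of each node rather than the prefixed name.

```php
<?php
// Hypothetical feed in which the repeated element itself carries a prefix.
$xml = <<<XML
<g:feed xmlns:g="http://base.google.com/ns/1.0">
  <g:entry><g:gtin>5022361100576</g:gtin></g:entry>
  <g:entry><g:gtin>0000000000000</g:gtin></g:entry>
</g:feed>
XML;

$reader = new XMLReader;
$reader->XML($xml);

// $reader->name is the qualified name ("g:entry"),
// $reader->localName is the part after the colon ("entry").
while (($valid = $reader->read()) && $reader->localName !== 'entry') ;

$names = [];
while ($valid) {
    $names[] = $reader->name . ' / ' . $reader->localName;

    // move to the next element with the same local name
    $valid = $reader->next($reader->localName);
}

print_r($names);
```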
