Explode random, unpredictable tags into an array

Posted 2023-05-07

【Title】: Explode random unpredictable tags in an array 【Posted】: 2014-06-18 11:46:20 【Question】:

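Below is a random, unpredictable set of tags wrapped in a div tag. How can I explode the innerHTML of all child tags into an array while preserving their order of appearance?

Note: for img and iframe tags, only the URL needs to be extracted.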

 <div>
  <p>para-1</p>
  <p>para-2</p>
  <p>
    text-before-image
    <img src="text-image-src"/>
    text-after-image</p>
  <p>
    <iframe src="p-iframe-url"></iframe>
  </p>
  <iframe src="iframe-url"></iframe>
  <h1>header-1</h1>
  <img src="image-url"/>
  <p>
    <img src="p-image-url"/>
  </p>
  content not wrapped within any tags
  <h2>header-2</h2>
  <p>para-3</p>
  <ul>
    <li>list-item-1</li>
    <li>list-item-2</li>
  </ul>
  <span>span-content</span>
 content not wrapped within any tags
</div>

Expected array:

 ["para-1","para-2","text-before-image","text-image-src","text-after-image",
"p-iframe-url","iframe-url","header-1","image-url",
"p-image-url","content not wrapped within any tags","header-2","para-3",
"list-item-1","list-item-2","span-content","content not wrapped within any tags"]

Relevant code:

        $dom = new DOMDocument();
        @$dom->loadHTML( $content );
        // Get all the paragraph tags, to iterate over their nodes.
        $tags = $dom->getElementsByTagName( 'p' );
        $j = 0;
        foreach ( $tags as $tag ) {
            // get_inner_html() to preserve the node's text & tags
            $con[ $j ] = $this->get_inner_html( $tag );
            // Check if the node has HTML content or not
            if ( $con[ $j ] != strip_tags( $con[ $j ] ) ) {
                // Check if the node contains HTML along with plain text without any tags
                if ( $tag->nodeValue != '' ) {
                    /*
                     * DOM to get the image SRC of a node
                     */
                    $domM = new DOMDocument();
                    /*
                     * Setting encoding type http://in1.php.net/domdocument.loadhtml#74777
                     * Set after initializing DOMDocument();
                     */
                    $con[ $j ] = mb_convert_encoding( $con[ $j ], 'HTML-ENTITIES', "UTF-8" );
                    @$domM->loadHTML( $con[ $j ] );
                    $y = new DOMXPath( $domM );
                    foreach ( $y->query( "//img" ) as $node ) {
                        $con[ $j ] = "img=" . $node->getAttribute( "src" );
                        // Increment the array size to accommodate bad text and image tags.
                        $j++;
                        // Index incremented, fetch the node value and accommodate the text without any tags.
                        $con[ $j ] = $tag->nodeValue;
                    }
                    $domC = new DOMDocument();
                    @$domC->loadHTML( $con[ $j ] );
                    $z = new DOMXPath( $domC );
                    foreach ( $z->query( "//iframe" ) as $node ) {
                        $con[ $j ] = "vid=http:" . $node->getAttribute( "src" );
                        // Increment the array size to accommodate bad text and image tags.
                        $j++;
                        // Index incremented, fetch the node value and accommodate the text without any tags.
                        $con[ $j ] = $tag->nodeValue;
                    }
                } else {
                    /*
                     * DOM to get the image SRC of a node
                     */
                    $domA = new DOMDocument();
                    @$domA->loadHTML( $con[ $j ] );
                    $x = new DOMXPath( $domA );
                    foreach ( $x->query( "//img" ) as $node ) {
                        $con[ $j ] = "img=" . $node->getAttribute( "src" );
                    }

                    if ( $con[ $j ] != strip_tags( $con[ $j ] ) ) {
                        foreach ( $x->query( "//iframe" ) as $node ) {
                            $con[ $j ] = "vid=http:" . $node->getAttribute( "src" );
                        }
                    }
                }
            }
            // Move on to the next array slot
            $j++;
        }

        $this->content = $con;

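【Comments】:

@jeroen Using the DOM API, I managed to extract only the innerHTML of the <p> tags, preserving their order of appearance. But it fails when tags other than p are present.

Why not just strip_tags()? That will pull out all the contained HTML and leave only the text, in the order the HTML/text occurs in the file.

@MarcB What happens to the iframe and image paths if I only use strip_tags()?

You don't want to get the "innerHTML". You want to look at retrieving the values of "attributes" (e.g. the iframe src) and the "text content" of elements. Those keywords should help you on your way.

If you showed us the relevant parts of your code, we would get a better idea of which approach you chose, and we might even spot bugs if you don't (only) have a conceptual error.

【Answer 1】: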

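A quick and easy way to extract interesting information from a DOM document is to use XPath. Below is a basic example showing how to get the text content and the attribute text from within a div element.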

<?php

// Pre-amble, scroll down to interesting stuff...
$html = '<div>
  <p>para-1</p>
  <p>para-2</p>
  <p>
    <iframe src="p-iframe-url"></iframe>
  </p>
  <iframe src="iframe-url"></iframe>
  <h1>header-1</h1>
  <img src="image-url"/>
  <p>
    <img src="p-image-url"/>
  </p>
  content not wrapped within any tags
  <h2>header-2</h2>
  <p>para-3</p>
  <ul>
    <li>list-item-1</li>
    <li>list-item-2</li>
  </ul>
  <span>span-content</span>
 content not wrapped within any tags
</div>';

$doc = new DOMDocument;
$doc->loadHTML($html);
$div = $doc->getElementsByTagName('div')->item(0);

// Interesting stuff:

// Use XPath to get all text nodes and attribute text
// $texts becomes a DOMNodeList filled with DOMText and DOMAttr objects
$xpath = new DOMXPath($doc);
$texts = $xpath->query('descendant-or-self::*/text()|descendant::*/@*', $div);

// You could only include/exclude specific attributes by looking at their name
// e.g. multiple paths: .//@src|.//@href
// or whitelist:        descendant::*/@*[name()="src" or name()="href"]
// or blacklist:        descendant::*/@*[not(name()="ignore")]

// Build an array of the text held by the DOMText and DOMAttr objects
// skipping any boring whitespace
$results = array();
foreach ($texts as $text) {
    $trimmed_text = trim($text->nodeValue);
    if ($trimmed_text !== '') {
        $results[] = $trimmed_text;
    }
}

// Let's see what we have
var_dump($results);
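
The whitelist and blacklist ideas mentioned in the snippet's comments can be combined into a single query. The following is only a sketch under assumptions: the helper name extract_texts() is hypothetical, the "remove" class comes from the discussion in this thread, and libxml_use_internal_errors() is used here to keep parser warnings quiet. It keeps text nodes that are not inside a <p class="remove"> and only the src and href attributes.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the trimmed
// text nodes and src/href attribute values of the first <div>, in document order.
function extract_texts(string $html): array
{
    $doc = new DOMDocument;
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $div   = $doc->getElementsByTagName('div')->item(0);
    $xpath = new DOMXPath($doc);

    // Left side: text nodes whose element is not (inside) a <p class="remove">.
    // Right side: whitelist of attributes, matched by name.
    $query = 'descendant-or-self::*[not(ancestor-or-self::p[@class="remove"])]/text()'
           . '|descendant::*/@*[name()="src" or name()="href"]';

    $results = array();
    foreach ($xpath->query($query, $div) as $node) {
        $trimmed = trim($node->nodeValue);
        if ($trimmed !== '') {          // skip whitespace-only text nodes
            $results[] = $trimmed;
        }
    }
    return $results;
}

// Assumed sample input: the alt/style/height attributes should be ignored.
print_r(extract_texts('<div>
  <p>para-1</p>
  <p class="remove">Dont show me</p>
  <p>before <img src="image-url" alt="x" style="width:1px"/> after</p>
  <iframe src="iframe-url" height="10"></iframe>
</div>'));
```

Note that the predicate [@class="remove"] only matches when the class attribute is exactly "remove"; elements carrying several classes would need a different test.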

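【Comments】:

Thanks for the answer. The snippet works great. However, the array is built with unwanted DOMAttr objects (style, height, width, alt, rel, ..). How do I discard them?

Change the XPath expression to match only the attributes you are interested in. That might involve changing the @* part, or adding predicates (filters) to whitelist or blacklist attribute names. It is entirely up to you.

Thanks a lot @salathe, the XPath expression $texts = $xpath->query('descendant-or-self::*/text()|descendant::*/@*[name()="src" or name()="href"]', $div); solved it. Now, suppose I have to blacklist p tags with the class remove, e.g. <p class="remove">Dont show me</p>. How do I filter those? Thanks...

Exceptions should be filtered in the foreach($texts) loop. If $text->nodeType and/or $text->nodeName and/or other properties of $text identify the node as "bad", don't add it to the results.

You could filter that element out by matching elements that are not paragraphs with that class... e.g. descendant-or-self::*[not(self::p[@class="remove"])] -- it looks like you should spend some time over the weekend reading up on XPath. A good introduction is Tobias Schlitt's XPath overview.

【Answer 2】: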

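Try a recursive approach! Keep an empty array $parts on your class instance and a function extractSomething(DOMNode $source). Your function should handle each separate case and then return. If the source is:

- a TextNode: push it to $parts
- an Element with name=img: push its href to $parts
- other special cases
- any other Element: for each TextNode or Element child, call extractSomething(child)

Now, when the call to extractSomething(yourRootDiv) returns, you will have the list in $this->parts.

Note that you have not defined what should happen with <p> sometext1 <img href="ref" /> sometext2 </p>, but the example above pushes 3 elements for it ("sometext1", "ref" and "sometext2").

This is only a rough outline of a solution. The point is that you need to process every node in the tree (possibly without really considering its position) and, while walking the nodes in the right order, build your array by converting each node into the desired text. Recursion is the fastest way to code it, but you could alternatively try breadth-first traversal or walker tools.

The bottom line is that you have to accomplish two tasks: walk the nodes in the correct order, and convert each node into the desired result.

This is basically a rule of thumb for processing tree/graph structures.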
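
The outline above could be sketched roughly as follows. This is an illustration under assumptions rather than the answerer's actual code: the function name extract_parts() is hypothetical, the parts are collected through a by-reference array instead of a class property, and the src attribute is read (as in the question's markup) for both img and iframe.

```php
<?php
// Hypothetical recursive walker: appends the wanted strings to $parts
// while visiting the nodes in document order.
function extract_parts(DOMNode $source, array &$parts): void
{
    if ($source instanceof DOMText) {
        // Text node: keep its text unless it is only whitespace.
        $text = trim($source->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
        return;
    }
    if ($source instanceof DOMElement) {
        $name = strtolower($source->nodeName);
        if ($name === 'img' || $name === 'iframe') {
            // Special case: keep only the URL and do not descend further.
            if ($source->hasAttribute('src')) {
                $parts[] = $source->getAttribute('src');
            }
            return;
        }
        // Any other element: recurse into its children in order.
        foreach ($source->childNodes as $child) {
            extract_parts($child, $parts);
        }
    }
    // Other node types (comments etc.) are ignored.
}

$doc = new DOMDocument;
libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc->loadHTML('<div><p>sometext1 <img src="ref"/> sometext2</p><h1>header-1</h1>tail</div>');
libxml_clear_errors();

$parts = array();
extract_parts($doc->getElementsByTagName('div')->item(0), $parts);
print_r($parts);
```

Because each child is visited in the order it appears in childNodes, the resulting array follows the order of the markup, which is the first of the two tasks named above; the per-node branches handle the second.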

【Answer 3】:

The simplest approach is to use DOMDocument: http://www.php.net/manual/en/domdocument.loadhtmlfile.php
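
As a minimal illustration of what that might look like, the sketch below loads markup from a string with loadHTML() (the linked loadHTMLFile() does the same for a file) and lists the direct children of the div. The helper name list_div_children() is hypothetical. It shows that a div's children include text nodes and elements other than p, which is why iterating only over getElementsByTagName('p') misses content.

```php
<?php
// Hypothetical helper: returns the node names of the direct children
// of the first <div>, skipping whitespace-only text nodes.
function list_div_children(string $html): array
{
    $doc = new DOMDocument;
    libxml_use_internal_errors(true);   // keep parser warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $div   = $doc->getElementsByTagName('div')->item(0);
    $names = array();
    foreach ($div->childNodes as $child) {
        if ($child instanceof DOMText && trim($child->nodeValue) === '') {
            continue;                   // indentation between tags
        }
        $names[] = $child->nodeName;    // "#text" for text nodes, tag name for elements
    }
    return $names;
}

print_r(list_div_children(
    '<div><p>para-1</p><h1>header-1</h1>loose text<img src="image-url"/></div>'
));
```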

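【Comments】:

I managed to extract only the innerHTML of the <p> tags, preserving their order of appearance. But it fails when tags other than p are present.

Please add some explanation of how the OP could use DOMDocument, and perhaps some examples.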
