Regex & PHP - isolate src attribute from img tag

Posted 2023-02-24

【Title】: Regex & PHP - isolate src attribute from img tag 【Posted】: 2011-01-08 09:53:01 【Question】:

Using PHP, how can I isolate the contents of the src attribute from $foo? The end result I'm looking for would give me just "http://example.com/img/image.jpg".

$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg"    />';

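【Question comments】:

In before the rage about parsing HTML with regex.
@meagar - using regex within this limited scope is valid (though not necessarily the most efficient route).
Don't use regex to parse HTML. (Not sarcasm!)
I misworded the original post title and shouldn't have mentioned regex. I really like karim79's solution, but it requires adding a non-standard class.

【Answer 1】:

If you don't want to use regex (or any non-standard PHP components), a reasonable solution using the built-in DOMDocument class would be as follows: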

<?php
    $doc = new DOMDocument();
    $doc->loadHTML('<img src="http://example.com/img/image.jpg" ... />');
    $imageTags = $doc->getElementsByTagName('img');

    foreach ($imageTags as $tag) {
        echo $tag->getAttribute('src');
    }
?>
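A minimal sketch of the same DOMDocument approach wrapped in a function, assuming the input may be a fragment or sloppy markup. The name extractImageSources is hypothetical; libxml_use_internal_errors is used here only to keep parser warnings out of the output.

```php
<?php
// Hypothetical helper: collect the src value of every <img> in an HTML string.
function extractImageSources(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $tag) {
        // Skip images that have no src attribute at all.
        if ($tag->hasAttribute('src')) {
            $sources[] = $tag->getAttribute('src');
        }
    }
    return $sources;
}

$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg"    />';
print_r(extractImageSources($foo));
```

Returning an array keeps the caller in control of what to do when the markup contains zero or several images.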

【Comments】:

Nice! This is very close to what I ended up doing. I didn't know about DOMDocument, but I'll give it a try.

【Answer 2】:

Code

<?php
    $foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg"    />';
    $array = array();
    preg_match( '/src="([^"]*)"/i', $foo, $array ) ;
    print_r( $array[1] ) ;

Output

http://example.com/img/image.jpg
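A slightly more tolerant variant of the pattern above, offered as a sketch rather than a complete HTML parser. It assumes the attribute value is quoted, accepts either quote character as long as the closing quote matches the opening one, allows whitespace around the equals sign, and decodes entity references in the result. The function name firstSrcByRegex is hypothetical.

```php
<?php
// Hypothetical helper: return the first quoted src value in a tag string, or null.
function firstSrcByRegex(string $tag): ?string
{
    // \s before "src" keeps the pattern from matching names such as "data-src".
    // (["\']) captures the opening quote and \1 requires the same closing quote.
    $pattern = '/\ssrc\s*=\s*(["\'])(.*?)\1/is';

    if (preg_match($pattern, $tag, $matches) !== 1) {
        return null;
    }

    // Turn &amp; and numeric character references back into plain characters.
    return html_entity_decode($matches[2], ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg"    />';
echo firstSrcByRegex($foo), PHP_EOL;
```

Unquoted attribute values are deliberately not handled here; markup that varied is a sign that an HTML parser is the safer tool.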

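【Comments】:

Watch out for &amp; entity references and numeric character references in the result!
As you wish! =) Here is another syntax: /src="(.*?)"/i
HTML allows single quotes as long as they match, and the "alternative syntax" can match more characters than intended. Finally, there can be whitespace around the = sign of the attribute. It should be: /[sS][rR][cC]\s*=\s*['"]([^'"]+)['"]/i

【Answer 3】:

I ended up with this code: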

$dom = new DOMDocument();
$dom->loadHTML($img);
echo $dom->getElementsByTagName('img')->item(0)->getAttribute('src');

Assuming there is only one img :P
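If that assumption might not hold, item(0) returns null when no img is present, and calling getAttribute on null is a fatal error. A guarded sketch (the function name firstImageSrc is hypothetical):

```php
<?php
// Hypothetical helper: src of the first <img>, or null when there is none.
function firstImageSrc(string $html): ?string
{
    $dom = new DOMDocument();

    // Keep libxml parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) is null when the document contains no <img> element.
    $img = $dom->getElementsByTagName('img')->item(0);
    if ($img === null || !$img->hasAttribute('src')) {
        return null;
    }
    return $img->getAttribute('src');
}

var_dump(firstImageSrc('<p>no image here</p>'));
```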

【Comments】:

【Answer 4】:

// Create DOM from string
$html = str_get_html('<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg"    />');

// echo the src attribute
echo $html->find('img', 0)->src;

http://simplehtmldom.sourceforge.net/

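【Comments】:

【Answer 5】:

I'm very late to this, but I have a simple solution that hasn't been mentioned yet. Load the string with simplexml_load_string (if you have SimpleXML enabled), then pass it through json_encode and json_decode.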

$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg"    />';

$parsedFoo = json_decode(json_encode(simplexml_load_string($foo)), true);
var_dump($parsedFoo['@attributes']['src']); // output: "http://example.com/img/image.jpg"

$parsedFoo looks like:

array(1) {
  ["@attributes"]=>
  array(6) {
    ["class"]=>
    string(12) "foo bar test"
    ["title"]=>
    string(10) "test image"
    ["src"]=>
    string(32) "http://example.com/img/image.jpg"
    ["alt"]=>
    string(10) "test image"
    ["width"]=>
    string(3) "100"
    ["height"]=>
    string(3) "100"
  }
}

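I've been using this for months to parse XML and HTML, and it works well. I haven't run into any problems yet, although I haven't had to parse a large file with it (I imagine that running json_encode and json_decode over that much input would get slow). It's convoluted, but it's by far the easiest way I've found to read HTML attributes.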
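SimpleXML also exposes attributes through array-style access, which avoids the JSON round trip and still works when a node has both attributes and text. A minimal sketch, assuming the input is well-formed XML (an img tag that is not self-closed will fail to load); the function name xmlAttribute is hypothetical.

```php
<?php
// Hypothetical helper: read one attribute from the root node of an XML string.
function xmlAttribute(string $markup, string $name): ?string
{
    $node = simplexml_load_string($markup);

    // simplexml_load_string returns false when the markup is not well-formed.
    if ($node === false || !isset($node[$name])) {
        return null;
    }

    // Attributes are SimpleXMLElement objects; cast to get the plain string.
    return (string) $node[$name];
}

$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg"    />';
echo xmlAttribute($foo, 'src'), PHP_EOL;
```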

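【Comments】:

I did find one small problem last week. If an XML node has both attributes and a value, only the value is accessible with this method. I ended up having to write a simple parser that converts SimpleXML to an array while preserving all of the data.

【Answer 6】:

preg_match solves this problem nicely.

See my answer here: How to extract img src, title and alt from html using php?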

【Comments】:

【Answer 7】:

This is what I ended up doing, although I'm not sure how efficient it is:

$imgsplit = explode('"', $data);
foreach ($imgsplit as $item) {
    // the first quoted chunk containing "http" is taken to be the URL
    if (strpos($item, 'http') !== false) {
        $image = $item;
        break;
    }
}

【Comments】:

This approach will run into problems if the image URL is relative to the document, e.g. "../../img/something.jpg"

【Answer 8】:

Try this pattern:

'/< \s* img [^\>]* src \s* = \s* [\""\']? ( [^\""\'\s>]* )/x'

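【Comments】:

This won't work if img is uppercase or if the title contains ">". Using an HTML parser would be more robust.

【Answer 9】:

You can use this function to solve the problem: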

function getTextBetween($start, $end, $text)
{
    $start_from = strpos($text, $start);
    $start_pos = $start_from + strlen($start);
    $end_pos = strpos($text, $end, $start_pos + 1);
    $subtext = substr($text, $start_pos, $end_pos);
    return $subtext;
}

$foo = '<img class="foo bar test" title="test image" src="http://example.com/img/image.jpg"    />';
$img_src = getTextBetween('src="', '"', $foo);

【Comments】:

【Answer 10】:

I use preg_match_all to capture all images in an HTML document:

preg_match_all("~<img.*src\s*=\s*[\"']([^\"']+)[\"'][^>]*>~i", $body, $matches);

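This allows a more relaxed declaration syntax, with whitespace and different quote types.

The regex reads as: <img (any attributes, such as style or border) src (possible whitespace) = (possible whitespace) (' or ") (any non-quote characters) (' or ") (anything up to >) (>)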
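One caveat with the pattern above: the greedy .* can run past the end of one tag, so when several img tags sit on the same line only the last src on that line is captured. A sketch of a stricter variant that stays inside a single tag by using [^>]*? instead; the function name allSrcByRegex is hypothetical, and like any regex approach it assumes quoted attribute values with no ">" inside them.

```php
<?php
// Hypothetical helper: collect every quoted src value from <img> tags.
function allSrcByRegex(string $body): array
{
    // [^>]*? cannot cross a tag boundary, unlike the greedy .* it replaces.
    // (["\']) captures the opening quote and \1 requires the same closing quote.
    $pattern = '~<img\b[^>]*?\ssrc\s*=\s*(["\'])(.*?)\1[^>]*>~is';

    preg_match_all($pattern, $body, $matches);

    // Group 2 holds the attribute values, one entry per matched tag.
    return $matches[2];
}

$body = '<img src="a.jpg"><img alt="x" src=\'b.png\'>';
print_r(allSrcByRegex($body));
```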

【Comments】:

【Answer 11】:

<?php
    $html = '
        <img border="0" src="/images/image1.jpg"    />
        <img border="0" src="/images/image2.jpg"    />
        <img border="0" src="/images/image3.jpg"    />
        ';
    
    $get_Img_Src = '/<img[^>]*src=([\'"])(?<src>.+?)\1[^>]*>/i'; //for get img src path only...
    
    preg_match_all($get_Img_Src, $html, $result); 
    if (!empty($result['src'])) {
        echo $result['src'][0];
        echo $result['src'][1];
    }

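To also get the alt text along with the img src path, use the regex below instead of the one above:

<img[^>]*src=(['"])(?<src>.+?)\1[^>]*alt=(['"])(?<alt>.+?)\3[^>]*>

    // gets the img src path and the alt text
    // \3 is the quote that opened alt: the named groups src and alt are also numbered (2 and 4)
    $get_Img_Src = '/<img[^>]*src=([\'"])(?<src>.+?)\1[^>]*alt=([\'"])(?<alt>.+?)\3[^>]*>/i';

    preg_match_all($get_Img_Src, $html, $result);
    if (!empty($result['src'])) {
        echo $result['src'][0];
        echo $result['src'][1];
        echo $result['alt'][0];
        echo $result['alt'][1];
    }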

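I learned this great solution from here: PHP extract link from a href tag

【Comments】:

Parsing valid HTML with regex is an unnecessary risk.
Yes, but if we want to validate form data or manipulate an HTML string, we can use regex to extract it. I have used the regex above in my project, which is why I shared this regex solution for extracting the src path.
I shared the solution for knowledge purposes only.

【Answer 12】:

Suppose I use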

$text ='<img src="blabla.jpg"  />';

in

getTextBetween('src="','"',$text);

The code will return:

blabla.jpg"

This is wrong. We want the code to return the text between the quotes of the attribute value, i.e. attr = "value".

So:

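function getTextBetween($start, $end, $text)
{
    // take everything after the first occurrence of the start string
    // (end() needs a variable, so the explode result is stored first)
    $parts = explode($start, $text, 2);
    $first_strip = end($parts);

    // keep everything before the first occurrence of the end string
    $final_strip = explode($end, $first_strip)[0];
    return $final_strip;
}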

It works!

Try

   getTextBetween('src="','"',$text);

It will return:

blabla.jpg

Thanks all the same, because your solution gave me the insight I needed for the final solution.
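The same result can also be reached with strpos and substr alone. The point to watch is that the third argument of substr is a length, not an end position. A sketch that additionally returns null when either delimiter is missing; the name textBetween is hypothetical, chosen so it does not clash with the function above.

```php
<?php
// Hypothetical helper: text between the first $start and the next $end, or null.
function textBetween(string $start, string $end, string $text): ?string
{
    $startFrom = strpos($text, $start);
    if ($startFrom === false) {
        return null; // start delimiter not found
    }
    $startPos = $startFrom + strlen($start);

    $endPos = strpos($text, $end, $startPos);
    if ($endPos === false) {
        return null; // end delimiter not found
    }

    // substr takes a length, so subtract the start position from the end position.
    return substr($text, $startPos, $endPos - $startPos);
}

$text = '<img src="blabla.jpg"  />';
echo textBetween('src="', '"', $text), PHP_EOL;
```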

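【Comments】:

I'm not really trying to say your approach is bad, but I do think DOMDocument would solve this problem better. See this for example: ***.com/questions/6441448/…
The DOMDocument library is too heavy for such a simple task. It's like using a bulldozer to crush a snake when you have a machete as an alternative.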
