Get img src that contains a certain word from xpath

【Title】: Get img src that contains a certain word from xpath 【Posted】: 2021-11-13 22:03:04 【Question】:

I fetched a web page in headless mode; this is the relevant part of the inner HTML it returned.

<div class="product__aside">
\t\t\t\t<div class="slider-pdp">
\t\t\t\t\t<div class="slider__clip">
\t\t\t\t\t\t<div class="slides slick-initialized slick-slider slick-dotted" role="toolbar">
<div aria-live="polite" class="slick-list draggable" style="padding: 0px 24.47%;"><div class="slick-track" role="listbox" style="opacity: 1; width: 6010px; transform: translate3d(-1202px, 0px, 0px);"><div class="slide slick-slide slick-cloned" data-slick-index="-2" aria-hidden="true" tabindex="-1" style="width: 601px;"> 
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_04--IMG_600-1812358633.jpg"  data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_04--IMG_1365--489680014.jpg"> 
\t</div>
</div><div class="slide slick-slide slick-cloned" data-slick-index="-1" aria-hidden="true" tabindex="-1" style="width: 601px;">
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_05--IMG_600-251567441.jpg"  data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_05--IMG_1365--146353341.jpg"> 
\t</div>
</div><div class="slide slick-slide slick-current slick-active slick-center" data-slick-index="0" aria-hidden="false" tabindex="-1" role="option" aria-describedby="slick-slide00" style="width: 601px;"> 
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_LEAD--IMG_600--951538759.jpg"  data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_LEAD--IMG_1365--973725436.jpg"> 
\t</div>
</div><div class="slide slick-slide" data-slick-index="1" aria-hidden="true" tabindex="-1" role="option" aria-describedby="slick-slide01" style="width: 601px;"> 
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_01--IMG_600--1234110023.jpg"  data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_01--IMG_1365-140785407.jpg"> 
\t</div>
</div><div class="slide slick-slide" data-slick-index="2" aria-hidden="true" tabindex="-1" role="option" aria-describedby="slick-slide02" style="width: 601px;"> 
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_02--IMG_600--150275930.jpg"  data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_02--IMG_1365-1432102351.jpg"> 
\t</div>
</div><div class="slide slick-slide" data-slick-index="3" aria-hidden="true" tabindex="-1" role="option" aria-describedby="slick-slide03" style="width: 601px;"> 
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_03--IMG_600--102741357.jpg"  data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_03--IMG_1365-1955701010.jpg"> 
\t</div>
</div><div class="slide slick-slide" data-slick-index="4" aria-hidden="true" tabindex="-1" role="option" aria-describedby="slick-slide04" style="width: 601px;"> 
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_04--IMG_600-1812358633.jpg"  data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_04--IMG_1365--489680014.jpg"> 
\t</div>
</div><div class="slide slick-slide" data-slick-index="5" aria-hidden="true" tabindex="-1" role="option" aria-describedby="slick-slide05" style="width: 601px;"> 
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_05--IMG_600-251567441.jpg"  data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_05--IMG_1365--146353341.jpg"> 
\t</div>
</div><div class="slide slick-slide slick-cloned slick-center" data-slick-index="6" aria-hidden="true" tabindex="-1" style="width: 601px;">
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_LEAD--IMG_600--951538759.jpg"  data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_LEAD--IMG_1365--973725436.jpg"> 
\t</div>
</div><div class="slide slick-slide slick-cloned" data-slick-index="7" aria-hidden="true" tabindex="-1" style="width: 601px;">
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_01--IMG_600--1234110023.jpg"  data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_01--IMG_1365-140785407.jpg"> 
\t</div>
</div></div></div>

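From this I need to get the src value that contains the string "PRODUCT_LEAD". To do that I wrote the following code. If I dd($imgs) it reports a length of 10, but the foreach loop does not give me the src value. $pageBody is the inner HTML of the web page.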

                            $doc = new DOMDocument;
                            $doc->preserveWhiteSpace = false;
                            $doc->strictErrorChecking = false;
                            $doc->recover = true;

                            ini_set('user_agent', 'My-Application/2.5');
                            libxml_use_internal_errors(true);
                            $doc->loadHTML($pageBody);
                            $xpath = new \DOMXPath($doc);
                            $imgs  = $xpath->query('//*[@class="slide__image"]');
                            foreach($imgs as $img)
                            {
                                $imgurl = $img->getAttribute('src');
                            }
                            dd($imgurl); // This returns nothing

【Comments】:

【Answer 1】:

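Try $imgs = $xpath->query('//*[@class="slide__image"]/img/@src[contains(., "PRODUCT_LEAD")]');

The part in square brackets is a "predicate" that decides which nodes are selected. The . refers to the current node, which here is the src attribute being tested, so only src attributes whose value contains "PRODUCT_LEAD" are returned.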
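The predicate can be tried on a shortened copy of the markup from the question. This is only a minimal sketch: the three-slide `$pageBody` sample and the `$leadUrls` variable are assumptions made for illustration, and the `array_unique()` step is added here because the slider repeats cloned slides.

```php
<?php
// Shortened sample of the question's markup (three slides only, for illustration).
$pageBody = <<<'HTML'
<div class="slide__image"><img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_04--IMG_600-1812358633.jpg"></div>
<div class="slide__image"><img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_LEAD--IMG_600--951538759.jpg"></div>
<div class="slide__image"><img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_LEAD--IMG_600--951538759.jpg"></div>
HTML;

libxml_use_internal_errors(true); // collect HTML parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($pageBody);
$xpath = new DOMXPath($doc);

// The predicate filters the @src attribute nodes; "." is the attribute under test.
$nodes = $xpath->query('//*[@class="slide__image"]/img/@src[contains(., "PRODUCT_LEAD")]');

$leadUrls = [];
foreach ($nodes as $attr) {
    $leadUrls[] = $attr->nodeValue; // value of the matching src attribute
}

// Cloned slides repeat the same URL, so keep each URL once.
$leadUrls = array_values(array_unique($leadUrls));

print_r($leadUrls);
```

Because the query returns attribute nodes rather than elements, the value is read with `nodeValue` instead of `getAttribute()`.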

【Comments】:

【Answer 2】:

Try this code:

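$imgurl = [];

// getAttribute('src') only returns a value when $imgs holds <img> elements,
// e.g. from the query '//*[@class="slide__image"]/img'.
for($x = 0; $x < $imgs->length; $x++) {
    $imgurl[] = $imgs->item($x)->getAttribute('src');
}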
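For the index-based loop to yield URLs, the node list presumably has to contain the img elements themselves rather than the wrapping divs that the question's query selects. The following is a sketch under that assumption: the two-slide `$pageBody` sample and the `strpos()` filter for "PRODUCT_LEAD" are illustrative additions, not part of the answer above.

```php
<?php
// Shortened sample of the question's markup (two slides only, for illustration).
$pageBody = <<<'HTML'
<div class="slide__image"><img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_05--IMG_600-251567441.jpg"></div>
<div class="slide__image"><img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_LEAD--IMG_600--951538759.jpg"></div>
HTML;

libxml_use_internal_errors(true); // collect HTML parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($pageBody);
$xpath = new DOMXPath($doc);

// Select the <img> children, not the <div class="slide__image"> wrappers.
$imgs = $xpath->query('//*[@class="slide__image"]/img');

$imgurl = [];
for ($x = 0; $x < $imgs->length; $x++) {
    $src = $imgs->item($x)->getAttribute('src');
    // strpos() also works on PHP 7; str_contains() needs PHP 8.
    if (strpos($src, 'PRODUCT_LEAD') !== false) {
        $imgurl[] = $src;
    }
}

print_r($imgurl);
```

With the question's original query, each item is a div with no src attribute, so `getAttribute('src')` returns an empty string for every item.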

【Comments】:

【Answer 3】:
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;

ini_set('user_agent', 'My-Application/2.5');
libxml_use_internal_errors(true);
$doc->loadHTML($pageBody);
$xpath = new \DOMXPath($doc);
$imgs  = $xpath->query('//*[@class="slide__image"]/img/@src');
$imgurl=[];
foreach($imgs as $img)
{
    // str_contains() requires PHP 8.0 or later
    if(str_contains($img->nodeValue,'PRODUCT_LEAD'))
    {
       $leadImage = $img->nodeValue;
    }
}

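I changed the code like this instead of using getAttribute(), and it works fine. But I would like to know whether I can get this URL directly from query(), with something like //img[@src(contains())]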
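One way to get the URL straight from the XPath expression is to put the attribute inside `contains()` and wrap the path in the XPath `string()` function, evaluated with `DOMXPath::evaluate()`. This is a sketch rather than the asker's final code; the two-slide `$pageBody` sample is an assumption made for illustration.

```php
<?php
// Shortened sample of the question's markup (two slides only, for illustration).
$pageBody = <<<'HTML'
<div class="slide__image"><img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_01--IMG_600--1234110023.jpg"></div>
<div class="slide__image"><img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_LEAD--IMG_600--951538759.jpg"></div>
HTML;

libxml_use_internal_errors(true); // collect HTML parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($pageBody);
$xpath = new DOMXPath($doc);

// contains(@src, "...") filters the <img> elements by their src attribute.
// string() converts the node set to the value of its first node in document
// order, or to an empty string when nothing matches.
$leadImage = $xpath->evaluate('string(//img[contains(@src, "PRODUCT_LEAD")]/@src)');

echo $leadImage, PHP_EOL;
```

`query()` always returns a node list, so a plain string result needs `evaluate()`. An empty string means no img matched.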

【Comments】:
