Get img src that contains a certain word from XPath
Posted: 2021-11-13 22:03:04
Question: I fetched a web page in headless mode, and this is the relevant part of the resulting inner HTML.
<div class="product__aside">
\t\t\t\t<div class="slider-pdp">
\t\t\t\t\t<div class="slider__clip">
\t\t\t\t\t\t<div class="slides slick-initialized slick-slider slick-dotted" role="toolbar">
<div aria-live="polite" class="slick-list draggable" style="padding: 0px 24.47%;"><div class="slick-track" role="listbox" style="opacity: 1; width: 6010px; transform: translate3d(-1202px, 0px, 0px);"><div class="slide slick-slide slick-cloned" data-slick-index="-2" aria-hidden="true" tabindex="-1" style="width: 601px;">
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_04--IMG_600-1812358633.jpg" data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_04--IMG_1365--489680014.jpg">
\t</div>
</div><div class="slide slick-slide slick-cloned" data-slick-index="-1" aria-hidden="true" tabindex="-1" style="width: 601px;">
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_05--IMG_600-251567441.jpg" data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_05--IMG_1365--146353341.jpg">
\t</div>
</div><div class="slide slick-slide slick-current slick-active slick-center" data-slick-index="0" aria-hidden="false" tabindex="-1" role="option" aria-describedby="slick-slide00" style="width: 601px;">
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_LEAD--IMG_600--951538759.jpg" data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_LEAD--IMG_1365--973725436.jpg">
\t</div>
</div><div class="slide slick-slide" data-slick-index="1" aria-hidden="true" tabindex="-1" role="option" aria-describedby="slick-slide01" style="width: 601px;">
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_01--IMG_600--1234110023.jpg" data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_01--IMG_1365-140785407.jpg">
\t</div>
</div><div class="slide slick-slide" data-slick-index="2" aria-hidden="true" tabindex="-1" role="option" aria-describedby="slick-slide02" style="width: 601px;">
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_02--IMG_600--150275930.jpg" data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_02--IMG_1365-1432102351.jpg">
\t</div>
</div><div class="slide slick-slide" data-slick-index="3" aria-hidden="true" tabindex="-1" role="option" aria-describedby="slick-slide03" style="width: 601px;">
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_03--IMG_600--102741357.jpg" data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_03--IMG_1365-1955701010.jpg">
\t</div>
</div><div class="slide slick-slide" data-slick-index="4" aria-hidden="true" tabindex="-1" role="option" aria-describedby="slick-slide04" style="width: 601px;">
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_04--IMG_600-1812358633.jpg" data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_04--IMG_1365--489680014.jpg">
\t</div>
</div><div class="slide slick-slide" data-slick-index="5" aria-hidden="true" tabindex="-1" role="option" aria-describedby="slick-slide05" style="width: 601px;">
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_05--IMG_600-251567441.jpg" data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_05--IMG_1365--146353341.jpg">
\t</div>
</div><div class="slide slick-slide slick-cloned slick-center" data-slick-index="6" aria-hidden="true" tabindex="-1" style="width: 601px;">
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_LEAD--IMG_600--951538759.jpg" data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_LEAD--IMG_1365--973725436.jpg">
\t</div>
</div><div class="slide slick-slide slick-cloned" data-slick-index="7" aria-hidden="true" tabindex="-1" style="width: 601px;">
\t<div class="slide__image">
\t\t<img src="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_01--IMG_600--1234110023.jpg" data-zoom-image="https://tnuck.ips.photos/images/skus/P31637-PRODUCT_01--IMG_1365-140785407.jpg">
\t</div>
</div></div></div>
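From this I need to get the src value that contains the string "PRODUCT_LEAD". To do that I wrote the following code. If I dd($imgs), it reports a length of 10, but the foreach loop does not return the src value. $pageBody is the inner HTML of the web page.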
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
ini_set('user_agent', 'My-Application/2.5');
libxml_use_internal_errors(true);
$doc->loadHTML($pageBody);
$xpath = new \DOMXPath($doc);
$imgs = $xpath->query('//*[@class="slide__image"]');
foreach($imgs as $img)
$imgurl = $img->getAttribute('src');
dd($imgurl); // This returns nothing
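Answer 1: Try $imgs = $xpath->query('//*[@class="slide__image"]/img/@src[contains(., "PRODUCT_LEAD")]');
The part in square brackets is a "predicate" that determines which nodes are selected. The . refers to the current node, which here is the src attribute itself. The query therefore returns attribute nodes, so each URL is read from nodeValue rather than with getAttribute().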
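To make the predicate concrete, here is a minimal sketch of that query run end to end. The two-slide $pageBody and the example.com URLs are hypothetical stand-ins for the real page, not data from the question.

```php
<?php
// Hypothetical, shortened stand-in for the question's $pageBody.
$pageBody = <<<'HTML'
<div class="slide__image"><img src="https://example.com/P1-PRODUCT_04--IMG_600.jpg"></div>
<div class="slide__image"><img src="https://example.com/P1-PRODUCT_LEAD--IMG_600.jpg"></div>
HTML;

libxml_use_internal_errors(true); // keep libxml warnings about imperfect HTML quiet
$doc = new DOMDocument();
$doc->loadHTML($pageBody);
$xpath = new DOMXPath($doc);

// The predicate is applied to each @src attribute node; "." is that attribute's value.
$nodes = $xpath->query('//*[@class="slide__image"]/img/@src[contains(., "PRODUCT_LEAD")]');

$urls = [];
foreach ($nodes as $attr) {
    $urls[] = $attr->nodeValue; // attribute nodes carry the URL in nodeValue
}
print_r($urls);
```

On the real page the carousel contains cloned slides, so the same URL may well appear more than once; array_unique($urls) would be one way to collapse the duplicates.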
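Answer 2: Try this code:
// Note: $imgs must contain <img> elements here, e.g. from '//*[@class="slide__image"]/img'.
$imgurl = [];
for ($x = 0; $x < $imgs->length; $x++) {
    $imgurl[] = $imgs->item($x)->getAttribute('src');
}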
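Whether this loop yields URLs depends on what $imgs selects. The sketch below compares the question's selector with one extended by /img, on the assumption that this was the intent. The one-slide $pageBody is a hypothetical stand-in for the real page.

```php
<?php
// Hypothetical one-slide stand-in for the question's $pageBody.
$pageBody = '<div class="slide__image"><img src="https://example.com/a-PRODUCT_LEAD.jpg"></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($pageBody);
$xpath = new DOMXPath($doc);

// The question's selector matches the wrapper <div>, which has no src attribute,
// so getAttribute() falls back to an empty string.
$divs = $xpath->query('//*[@class="slide__image"]');
$fromDiv = $divs->item(0)->getAttribute('src'); // ''

// Descending one step to the <img> child makes the same loop return the URLs.
$imgs = $xpath->query('//*[@class="slide__image"]/img');
$imgurl = [];
for ($x = 0; $x < $imgs->length; $x++) {
    $imgurl[] = $imgs->item($x)->getAttribute('src');
}
print_r($imgurl);
```

This would also explain why dd($imgs) in the question reports ten nodes while no src comes back: the nodes are there, but they are the div wrappers.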
Answer 3:
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
$doc->recover = true;
ini_set('user_agent', 'My-Application/2.5');
libxml_use_internal_errors(true);
$doc->loadHTML($pageBody);
$xpath = new \DOMXPath($doc);
$imgs = $xpath->query('//*[@class="slide__image"]/img/@src');
$leadImage = null;
foreach ($imgs as $img) {
    // $img is the src attribute node; str_contains() requires PHP 8.0+
    if (str_contains($img->nodeValue, 'PRODUCT_LEAD')) {
        $leadImage = $img->nodeValue;
    }
}
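I changed the code as shown above, reading nodeValue instead of calling getAttribute(), and this works fine. But I would like to know whether I can get this URL directly from query(), with something like //img[@src(contains())].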
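In XPath 1.0, contains() is a function that takes the string to search as its first argument, so the predicate would be written as //img[contains(@src, "PRODUCT_LEAD")] rather than @src(contains()). Below is a minimal sketch of both a query() and an evaluate() variant. The $pageBody and the example.com URLs are hypothetical stand-ins for the real page.

```php
<?php
// Hypothetical, shortened stand-in for the question's $pageBody.
$pageBody = <<<'HTML'
<div class="slide__image"><img src="https://example.com/P1-PRODUCT_01--IMG_600.jpg"></div>
<div class="slide__image"><img src="https://example.com/P1-PRODUCT_LEAD--IMG_600.jpg"></div>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($pageBody);
$xpath = new DOMXPath($doc);

// Variant 1: select the <img> elements whose src contains the word.
$leadImgs = $xpath->query('//img[contains(@src, "PRODUCT_LEAD")]');
$leadImage = $leadImgs->length > 0 ? $leadImgs->item(0)->getAttribute('src') : null;

// Variant 2: let XPath return the first matching URL as a plain string.
// string() of an empty node-set is '', so a missing match yields an empty string.
$leadImageDirect = $xpath->evaluate('string(//img[contains(@src, "PRODUCT_LEAD")]/@src)');

echo $leadImage, PHP_EOL, $leadImageDirect, PHP_EOL;
```

query() always returns a node list, so the evaluate() form is probably the closest to getting the URL "directly"; it takes the first match in document order.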