Node-fetch not providing all HTML from a page in body
Posted: 2021-09-01 10:56:54
Question:
I am using cheerio and node-fetch to collect all the product links at a specific URL.
I get back an array of links, but the list is incomplete because the body is missing the HTML that contains the product links.
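// node-fetch v2 (CommonJS); v3 is ESM-only and needs `import fetch from 'node-fetch'`
const fetch = require('node-fetch');
const cheerio = require('cheerio');

fetch('https://shop.gossmanknives.com/shop?olsPage=products')
  .then(res => res.text())
  .then(body => {
    const $ = cheerio.load(body);
    // Matches every <a> plus any element with data-ux="Link"
    const snapshot = $("a, [data-ux='Link']")
      .map((i, x) => $(x).attr('href'))
      .toArray();
    console.log(snapshot);
  });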
This is the array that comes back:
['#', '/', '/', '/', '/shop','#', '/', '/shop', '/', '/shop', 'https://www.godaddy.com/websites/website-builder?isc=pwugc&utm_source=wsb&utm_medium=applications&utm_campaign=en-us_corp_applications_base']
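This seems odd, because an element like the one below *should* be picked up, yet the body returned by fetch() appears to be missing a lot of the HTML that I can see when viewing the page in the browser. I am not sure why. Perhaps the data is dynamic and is not on the page yet when fetch() runs?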
<a rel="" typography="LinkAlpha" data-ux="Link" data-aid="PRODUCT_NAME_RENDERED_Orion" data-page="https://shop.gossmanknives.com/shop" data-page-query="olsPage=products/orion-aebl-black" href="https://shop.gossmanknives.com/shop?olsPage=products/orion-aebl-black" class="x-el x-el-a c2-9 c2-a c2-b c2-c c2-d c2-61 c2-f c2-3 c2-43 c2-4 c2-o c2-62 c2-63 c2-5 c2-6 c2-7 c2-8 x-d-ux x-d-aid x-d-page x-d-page-query" data-tccl="ux2.SHOP.shop1.Section.Default.Link.Default.43.click,click"><div data-ux="ProductCard" class="x-el x-el-div x-el c2-1 c2-2 c2-3 c2-4 c2-5 c2-6 c2-7 c2-8 x-d-ux c2-1 c2-2 c2-3 c2-4 c2-5 c2-6 c2-7 c2-8 x-d-ux"><div data-ux="ProductAsset" name="Orion" class="x-el x-el-div c2-1 c2-2 c2-1e c2-64 c2-65 c2-33 c2-4d c2-66 c2-2y c2-2z c2-30 c2-31 c2-3 c2-4 c2-5 c2-6 c2-7 c2-8 x-d-ux"><div id="guacBg20" role="img" data-ux="Background" data-aid="PRODUCT_IMAGE_RENDERED_Orion" treatmentdata="[object Object]" class="x-el x-el-div c2-1 c2-2 c2-67 c2-68 c2-69 c2-6a c2-1g c2-6b c2-6c c2-1t c2-1i c2-6d c2-71 c2-3 c2-4 c2-5 c2-6 c2-7 c2-8 x-d-ux x-d-aid" data-guac-image="loaded"><script>new guacImage('https://img1.wsimg.com/isteam/ip/94c95d7f-6505-4bfd-9837-ff1bcff87400/ols/IMG_0005-0002.JPG/:/rs=w:width,h:height,cg:false,m',document.getElementById('guacBg20'),"useTreatmentData":true,"backgroundLayers":["linear-gradient(to bottom, rgba(22, 22, 22, 0) 0%, rgba(22, 22, 22, 0) 100%)"])</script></div></div><div data-ux="ProductName" class="x-el x-el-div c2-1 c2-2 c2-6f c2-e c2-4j c2-g c2-3z c2-3 c2-4 c2-6g c2-5 c2-6 c2-7 c2-8 x-d-ux"><p typography="BodyAlpha" data-ux="Text" class="x-el x-el-p c2-1 c2-2 c2-c c2-d c2-4u c2-x c2-y c2-3y c2-6h c2-3 c2-6i c2-12 x-d-ux">Orion</p></div><div data-ux="ProductPrices" class="x-el x-el-div c2-1 c2-2 c2-6j c2-3y c2-3 c2-4 c2-5 c2-6 c2-7 c2-8 x-d-ux"><div typography="BodyAlpha" data-ux="Price" price="[object Object]" data-aid="PRODUCT_PRICE_RENDERED_Orion" class="x-el x-el-div c2-1 c2-2 c2-c c2-d c2-4u c2-x c2-y c2-t c2-3y c2-6k c2-3 
c2-6i c2-12 x-d-ux x-d-aid">$365.00</div></div><p typography="DetailsAlpha" data-ux="ProductLabel" data-aid="PRODUCT_SHIP_FREE_RENDERED_Orion" class="x-el x-el-p c2-1 c2-1p c2-c c2-d c2-4u c2-6f c2-y c2-3y c2-28 c2-4r c2-3 c2-12 c2-29 c2-6q c2-2a c2-2b c2-2c x-d-ux x-d-aid">Free Shipping</p></div></a>
Note: I am using https://www.npmjs.com/package/node-fetch
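A quick way to test the "dynamic data" hypothesis is to look for a known piece of product markup in the raw response text before any parsing happens. The sketch below is only an illustration: `hasProductMarkup` and `PRODUCT_MARKER` are hypothetical names, and the marker is simply the `data-aid` prefix from the sample element above, which may not hold for every product.

```javascript
// Marker taken from the sample element's data-aid attribute in the question.
// Assumption: rendered product links carry this prefix.
const PRODUCT_MARKER = 'PRODUCT_NAME_RENDERED_';

// Returns true when the raw response text already contains product markup.
// If it returns false for the fetched body, the products are most likely
// injected by client-side JavaScript after the initial HTML is served.
function hasProductMarkup(body, marker = PRODUCT_MARKER) {
  return body.includes(marker);
}

// Usage with node-fetch (assumes `fetch` is in scope, as in the question):
// fetch('https://shop.gossmanknives.com/shop?olsPage=products')
//   .then(res => res.text())
//   .then(body => console.log(hasProductMarkup(body)));
```

If the check prints false, no selector change will help, because the elements are not in the string that cheerio receives.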
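Answer 1:
Your selector looks wrong: you are searching for every <a> or any element that has the [data-ux='Link'] attribute, so you pick up many links that lack the attribute. To get only the links that have it, pass "a[data-ux='Link']" (no comma, and no space between the tag and the attribute).
Also, navigation to the product pages goes through the URL query, and for some reason cheerio appears to strip the query part from the URLs. Notice the many "/shop" values in the array, which might really be "/shop?something=123...". Try logging the whole <a> element and see what you can work with from there.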
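Once the hrefs do include their query strings, one possible way to separate product pages from navigation links is to resolve each href against the shop URL and inspect the `olsPage` query parameter. This is a sketch under assumptions: `productLinks` is a hypothetical helper, and the `olsPage=products/...` pattern is inferred from the single sample element in the question.

```javascript
// Hypothetical helper: keep only hrefs that point at an individual product page.
function productLinks(hrefs, base) {
  const seen = new Set(); // de-duplicates links that appear more than once
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, base); // resolves relative hrefs such as "/shop"
    } catch {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    // Assumption: product pages look like ?olsPage=products/<slug>
    const olsPage = url.searchParams.get('olsPage') || '';
    if (olsPage.startsWith('products/')) seen.add(url.href);
  }
  return [...seen];
}
```

Using the built-in `URL` class avoids hand-written string matching on the query, and it treats relative and absolute hrefs the same way.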
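Comments:
Thanks for the reply. It turns out the body was already missing the information before cheerio ever saw it: the call returned before the dynamic data had loaded. Solved with puppeteer.
Answer 2:
The data is not in the body because it is dynamic HTML.
I used puppeteer (original source here) and everything works.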
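// Method from a larger object literal; needs: const puppeteer = require('puppeteer');
getContents(url, name) {
  (async function main() {
    try {
      const browser = await puppeteer.launch();
      const [page] = await browser.pages();
      // 'networkidle0' waits until there are no open network connections
      await page.goto(url, { waitUntil: 'networkidle0' });
      // Serialize the fully rendered document, including dynamic content
      const data = await page.evaluate(() => document.querySelector('*').outerHTML);
      console.log(data);
      await browser.close();
    } catch (err) {
      console.error(err);
    }
  })();
},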
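If only the links are needed, the rendered HTML does not have to be handed back to cheerio; the hrefs can be read inside the page itself. The following is a sketch, not the answerer's code: `collectLinks` is a hypothetical name, the selector assumes product links are `<a>` elements with `data-ux="Link"` as in the question's sample, and the second parameter exists only so the function can be exercised with a stub instead of a real browser.

```javascript
// Sketch: collect link hrefs from the rendered page with puppeteer.
// `puppeteerLib` defaults to the real package and is loaded lazily.
async function collectLinks(url, puppeteerLib = require('puppeteer')) {
  const browser = await puppeteerLib.launch();
  try {
    const page = await browser.newPage();
    // Wait until the network is idle so dynamically loaded products exist
    await page.goto(url, { waitUntil: 'networkidle0' });
    // The callback runs in the browser, where `a.href` is an absolute URL
    return await page.$$eval("a[data-ux='Link']", (els) => els.map((a) => a.href));
  } finally {
    await browser.close(); // always release the browser, even on errors
  }
}

// Usage:
// collectLinks('https://shop.gossmanknives.com/shop?olsPage=products')
//   .then(console.log)
//   .catch(console.error);
```

Closing the browser in `finally` keeps a failed navigation from leaving a Chromium process behind, which the try/catch version above does not guarantee.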