Extract links using cheerio (with puppeteer)

Posted 2023-03-23

[Title]: Extract links using cheerio (with puppeteer)
[Posted]: 2022-01-01 02:39:21
[Question]:

I'm using puppeteer and Cheerio, and I'm new to both. Here is the relevant snippet of the page's HTML source:

<section class="descr">
  <div class="center">
    <a class="mfp-image" href="https://site.pics/store/1234/cat/img.jpg" title="Full size: 642x642" target="_blank"><img class="lazy 123" src="/assets/images/blank.gif" data-src="https://site.pics/store/1234/cat/th_img.jpg" ></a>
  </div>
  <div class="info">JPG | 500px | 1MB 22.11.2021</div>
  <hr id='more-3948099'>
  <br>
  <div class="blockSpoiler dl-links"><span class="fixHeader" id="download-links"></span><i class="sa sa-download-spoiler pl1em"></i><span class="blockTitle pl0">Get from file storage </span></div>
  <div class="blockSpoiler-content txtleft c-dl-links"><a rel="external nofollow noopener" href="https://link1.net/file/a8eaa368334d6214a03e0e648f6e55d4/ssic4Bl4nkin.html" target="_blank">HOST1</a>
    <br><a rel="external nofollow noopener" href="https://link2.file/view/EB54B4FD06B9297/ssic4Bl4nkin" target="_blank">HOST2</a>
    <br><a rel="external nofollow noopener" href="http://www.link3.com/file/3xdhcvtkfnh4/fjJ3ssic4Bl4nkin" target="_blank">HOST3</a>
    <br><a rel="external nofollow noopener" href="https://www.link4.com/riwtuwz9vjr3" target="_blank">HOST4</a>
    <br>
  </div>

I need these links:

https://site.pics/store/1234/cat/img.jpg
https://link1.net/file/a8eaa368334d6214a03e0e648f6e55d4/ssic4Bl4nkin.html
https://link2.file/view/EB54B4FD06B9297/ssic4Bl4nkin
http://www.link3.com/file/3xdhcvtkfnh4/fjJ3ssic4Bl4nkin
https://www.link4.com/riwtuwz9vjr3

Note that in some cases there may also be a link5 (not shown in this example).

In Chrome DevTools I used this code:

document.querySelector("div.blockSpoiler-content.txtleft.c-dl-links").innerHTML

document.querySelector("div.blockSpoiler-content.txtleft.c-dl-links").outerHTML

This gives me a lot of text that contains what I need, along with text I don't need. I've been trying for more than a few hours and can't make any further progress.

When I write the code with cheerio, I don't get any useful output:

const html = await page.content();
const $ = cheerio.load(html);
console.log($("div.blockSpoiler-content.txtleft.c-dl-links"));
console.log($("div.blockSpoiler-content.txtleft.c-dl-links").innerHTML);
console.log($("div.blockSpoiler-content.txtleft.c-dl-links").outerHTML);

Any help is appreciated.

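[Comments]:

What's the point of using Cheerio and Puppeteer together? Puppeteer already has a live HTML parser and selectors; functionally it's a superset of Cheerio.

@ggorlen You've mentioned that before. Some people just like Cheerio better, and it has some useful features (Sizzle pseudo-selectors, for example). You can also evaluate things in the debugger, since it isn't async.

@pguardiario Yeah, I'm still not convinced. Most questions that use both look low quality to me, and I doubt they're using the debugger or Sizzle selectors. It seems like an xy situation to me: the burden of proof seems to be "why cheerio" rather than "why not cheerio". You're re-parsing the whole DOM and creating a situation where the Cheerio state is out of sync with the live page. Do you have any expert reference that would help me understand how (or why) this is "a thing"? I'm happy to be corrected.

@ggorlen You do it once, instead of making multiple CDP trips to the browser and back. Sometimes that's better, and sometimes you don't need DOM updates. But for me, it's mostly the debugger.

[Answer 1]: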

This should help.

const $ = cheerio.load(html);
// Collect the href attribute of every anchor that has one
const urls = $('a[href]').map(function () {
  return $(this).attr('href') || '';
}).toArray();
console.log('urls', urls);

[Discussion]:

[Answer 2]:

In this case, though, it's better to just use puppeteer:

let urls = await page.$$eval('a', as => as.map(a => a.href))
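
The one-liner above returns every link on the page. As a hedged sketch, the same call can be scoped to the two places the question cares about. The helper name `extractLinks` and the combined selector are assumptions for illustration, and `page` is assumed to be a Puppeteer `Page` that has already navigated to the target URL:

```javascript
// Hypothetical helper; `page` is assumed to be an already-navigated Puppeteer Page.
async function extractLinks(page) {
  // Full-size image link plus every anchor inside the download-links block.
  const selector = 'section.descr a.mfp-image, div.c-dl-links a[href]';
  // The callback runs inside the browser, where a.href is the resolved absolute URL.
  return page.$$eval(selector, anchors => anchors.map(a => a.href));
}
```

Inside an async function, `const urls = await extractLinks(page);` would then yield an array of URL strings. The callback is serialized and executed in the page context, so it should not reference variables from the surrounding Node.js scope.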

[Discussion]:
