Puppeteer, save webpage and images

Posted 2023-03-07

Asked: 2019-05-07 12:01:13

Question:

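I'm trying to save a webpage for offline use with Node.js and Puppeteer. I see a lot of examples like this:

await page.screenshot({path: 'example.png'});

But for a bigger webpage a screenshot is not an option. A better option in Puppeteer is to load the page and then save it like this: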

const html = await page.content();
// ... write to file

OK, that worked. Now I want to scroll through pages like Twitter. So I decided to block all images in the Puppeteer page:

// request.abort() and request.continue() require page.setRequestInterception(true)
page.on('request', request => {
    if (request.resourceType() === 'image') {
        const imgUrl = request.url()
        download(imgUrl, 'download').then((output) => {
            images.push({url: output.url, filename: output.filename})
        }).catch((err) => {
            console.log(err)
        })
        request.abort()
    } else {
        request.continue()
    }
})

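OK, I now use the npm "download" library to download all the images. And yes, downloading the images works fine :D.

Now, when I save the content, I want it to point to the offline images in the source.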

const html = await page.content();

But now I'd like to replace all of the

<img src="/pic.png?id=123"> 
<img src="https://twitter.com/pics/1.png">

And also things like:

<div style="background-image: url('this_also.gif')"></div>

So is there a way (in Puppeteer) to scrape a big page and store the whole content offline?

JavaScript and CSS would also be nice.
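The rewriting step asked about here could be sketched roughly as follows. This is only an illustration, not a Puppeteer feature: `rewriteToLocal`, `baseUrl`, and `urlToFile` are hypothetical names, and the sketch assumes you already built a map from absolute resource URLs to local filenames while downloading. It uses plain regular expressions, so it will also touch attributes such as `data-src` and does not decode HTML entities like `&amp;`; a real HTML parser (for example cheerio) would be more robust.

```javascript
// Hypothetical helper: point src="..." attributes and CSS url(...) references
// at local files. `urlToFile` is a Map from absolute URL to local filename,
// and `baseUrl` is the page URL used to resolve relative references.
function rewriteToLocal(html, baseUrl, urlToFile) {
  const localize = (ref) => {
    let absolute;
    try {
      absolute = new URL(ref, baseUrl).href;
    } catch (err) {
      return ref; // leave references that cannot be parsed untouched
    }
    // Unknown resources keep their original reference.
    return urlToFile.get(absolute) ?? ref;
  };

  return html
    // src attributes with double or single quotes, e.g. <img src="/pic.png?id=123">
    .replace(/(\bsrc=)(["'])(.*?)\2/g,
      (match, attr, quote, ref) => `${attr}${quote}${localize(ref)}${quote}`)
    // CSS url(...) references, e.g. inline background-image styles
    .replace(/url\(\s*(['"]?)(.*?)\1\s*\)/g,
      (match, quote, ref) => `url(${quote}${localize(ref)}${quote})`);
}
```

With Puppeteer, the input would typically be `await page.content()` and the base URL `page.url()`.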

Update

For now I will open the big HTML file again with Puppeteer.

And then intercept all requests for files such as: https://dom.com/img/img.jpg, /file.jpg, ....

request.respond({
    status: 200,
    contentType: 'image/jpeg',
    body: '..'
});

I could also do this with a Chrome extension. But I'd like to have a function such as page.html() that takes some options, just like page.pdf().

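Comments:

I think webpages are too dynamic to do something like this... (depending on how much time you want to spend on it). What is your end goal, just viewing it?

Are you asking how to manipulate the HTML? If so, you would use cheerio in Node or jQuery inside page.evaluate.

The problem is how to point to the local downloads when you have CSS, JavaScript, and images.

@Cody, the goal is to save big sites (like Twitter, Facebook, etc.) for offline use.

Answer 1: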

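Going back to your first approach, you can take a screenshot of the whole page with fullPage.

await page.screenshot({path: 'example.png', fullPage: true});

If you really want to download all resources for offline use, yes, you can: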

const fse = require('fs-extra');

page.on('response', async (res) => {
    // save all the data to SOMEWHERE_TO_STORE
    await fse.outputFile(SOMEWHERE_TO_STORE, await res.buffer());
});
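One way to choose SOMEWHERE_TO_STORE for each response is to derive a file path from the resource URL. The following is a minimal sketch under that assumption; `urlToFilePath` and `rootDir` are hypothetical names, not part of Puppeteer or fs-extra. It hashes the full URL so that resources differing only in their query string get distinct files, and keeps the original extension for readability.

```javascript
const path = require('node:path');
const crypto = require('node:crypto');

// Hypothetical helper: map a resource URL to a stable local file path.
function urlToFilePath(rootDir, resourceUrl) {
  const { hostname, pathname } = new URL(resourceUrl);
  // Hash the whole URL (including the query string) to avoid collisions
  // and characters that are not valid in file names.
  const hash = crypto.createHash('sha1').update(resourceUrl).digest('hex').slice(0, 12);
  const ext = path.posix.extname(pathname); // e.g. '.jpg', or '' if there is none
  return path.join(rootDir, hostname, `${hash}${ext}`);
}
```

In the response handler above, this could be called as `urlToFilePath('offline', res.url())`, and again with `req.url()` when replaying.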

Then you can browse the website offline through Puppeteer.

await page.setRequestInterception(true);
page.on('request', async (req) => {
    // handle the request by responding data that you stored in SOMEWHERE_TO_STORE
    // and of course, don't forget THE_FILE_TYPE
    req.respond({
        status: 200,
        contentType: THE_FILE_TYPE,
        body: await fse.readFile(SOMEWHERE_TO_STORE),
    });
});
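The replay step can be sketched a little more concretely if the content type is recorded together with the body. The example below is an assumption-laden illustration: `makeReplayHandler` and `store` are hypothetical names, and `store` is taken to be a Map from URL to `{ contentType, body }` that you filled earlier (for instance from `res.headers()['content-type']` and `await res.buffer()`). Requests with no recorded response are aborted so that nothing reaches the network.

```javascript
// Hypothetical replay handler factory. `store` maps a URL to
// { contentType, body } recorded from an earlier online session.
function makeReplayHandler(store) {
  return (req) => {
    const saved = store.get(req.url());
    if (!saved) {
      // Nothing recorded for this URL: fail the request instead of going online.
      return req.abort();
    }
    return req.respond({
      status: 200,
      contentType: saved.contentType,
      body: saved.body,
    });
  };
}

// Usage (assumes `page` is a Puppeteer Page):
//   await page.setRequestInterception(true);
//   page.on('request', makeReplayHandler(store));
```

Keeping the handler separate from the page makes it easy to try out with stand-in request objects before wiring it into a browser.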

Comments:

It's better to rely on the requestfinished event.

Answer 2:

For now I'm going to use:

https://github.com/dosyago/22120

The goal of this project:

This project literally makes your web browsing available COMPLETELY OFFLINE. 
Your browser does not even know the difference. It's literally that amazing. Yes.

Comments:

Nice, but any MITM proxy can do this.
