Download a working local copy of a webpage [closed]

Posted: 2011-09-14 22:51:22

I would like to download a local copy of a web page and get all of its CSS, images, JavaScript, etc.

In previous discussions (e.g. here and here, both more than two years old), two suggestions are generally put forward: wget -p and httrack. However, both suggestions fail for me. I would very much appreciate help with using either of these tools to accomplish the task; alternatives are also lovely.

Option 1: wget -p

wget -p successfully downloads all of the web page's prerequisites (CSS, images, JS). However, when I load the local copy in a web browser, the page is unable to load those prerequisites, because their paths haven't been modified from the version on the web.

For example:

- In the page's HTML, <link rel="stylesheet" href="/stylesheets/foo.css" /> needs to be corrected to point to the new relative path of foo.css.
- In the CSS file, background-image: url(/images/bar.png) similarly needs to be adjusted.
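For the local copy to work, those references would need to become relative, roughly like the following (illustrative only; the exact layout depends on where the download tool saves the assets):

    <!-- in the saved HTML: relative instead of absolute path -->
    <link rel="stylesheet" href="stylesheets/foo.css" />

    /* in the saved CSS: relative instead of absolute path */
    background-image: url(images/bar.png);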

Is there a way to modify wget -p so that the paths are correct?

Option 2: httrack

httrack seems like a great tool for mirroring entire websites, but it's unclear to me how to use it to create a local copy of a single page. There is a great deal of discussion about this in the httrack forums (e.g. here), but no one seems to have a bullet-proof solution.
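For what it's worth, a single-page mirror with httrack might look something like this (a sketch, not a guaranteed recipe: -O sets the output directory, -r2 limits the mirror depth, and -n / --near also grabs non-HTML files such as images referenced by the page; www.example.com is a placeholder):

    # mirror one page plus its assets into ./localcopy
    httrack "http://www.example.com/page.html" -O ./localcopy -r2 -n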

Option 3: other tools?

Some people have suggested paid tools, but I just can't believe there isn't a free solution out there.

Comments:

- If the answer doesn't work, try: wget -E -H -k -K -p http://example.com - only this worked for me. Credit: superuser.com/a/136335/94039
- There is also software that does this: Teleport Pro.
- wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com
- Possible duplicate of download webpage and dependencies, including css images.
- The way this question was closed, with 203K views to date, shows a clear ongoing demand for the other proposed and linked solutions.
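For reference, here is what the flags in that first suggestion do, per the wget manual (an annotated sketch; example.com is a placeholder):

    # -E  (--adjust-extension) : save files with matching .html/.css extensions
    # -H  (--span-hosts)       : also fetch prerequisites hosted on other domains
    # -k  (--convert-links)    : rewrite links in saved files for local viewing
    # -K  (--backup-converted) : keep each original file with a .orig suffix
    # -p  (--page-requisites)  : download all assets (CSS, images, JS) the page needs
    wget -E -H -k -K -p http://example.com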

Answer 1:

wget is capable of doing what you're asking. Just try the following:

wget -p -k http://www.example.com/

-p will get you all the required elements to view the site correctly (CSS, images, etc.). -k will change all links (including those for CSS and images) to allow you to view the page offline as it appeared online.

From the Wget documentation:

‘-k’
‘--convert-links’
After the download is complete, convert the links in the document to make them
suitable for local viewing. This affects not only the visible hyperlinks, but
any part of the document that links to external content, such as embedded images,
links to style sheets, hyperlinks to non-html content, etc.

Each link will be changed in one of the two ways:

    The links to files that have been downloaded by Wget will be changed to refer
    to the file they point to as a relative link.

    Example: if the downloaded file /foo/doc.html links to /bar/img.gif, also
    downloaded, then the link in doc.html will be modified to point to
    ‘../bar/img.gif’. This kind of transformation works reliably for arbitrary
    combinations of directories.

    The links to files that have not been downloaded by Wget will be changed to
    include host name and absolute path of the location they point to.

    Example: if the downloaded file /foo/doc.html links to /bar/img.gif (or to
    ../bar/img.gif), then the link in doc.html will be modified to point to
    http://hostname/bar/img.gif. 

Because of this, local browsing works reliably: if a linked file was downloaded,
the link will refer to its local name; if it was not downloaded, the link will
refer to its full Internet address rather than presenting a broken link. The fact
that the former links are converted to relative links ensures that you can move
the downloaded hierarchy to another directory.

Note that only at the end of the download can Wget know which links have been
downloaded. Because of that, the work done by ‘-k’ will be performed at the end
of all the downloads. 

Comments:

- I tried this, but somehow internal links like index.html#link-to-element-on-same-page stopped working.
- Entire website: snipplr.com/view/23838/downloading-an-entire-web-site-with-wget
- Some servers respond with a 403 code if you use wget without a user agent; you can add -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4'
- If you find you are still missing images etc., then try adding this: -e robots=off ..... wget actually reads and respects robots.txt - this really made it hard for me to figure out why nothing worked!
- To get resources from foreign hosts, use -H, --span-hosts
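Putting the answer and the comment suggestions together, a fuller invocation might look like this (a sketch only; whether you need each extra flag depends on the site, and www.example.com stands in for the real URL):

    # page requisites, link conversion, sane extensions, cross-host assets,
    # ignore robots.txt, and send a browser-like user agent
    wget -E -H -k -K -p -e robots=off \
         -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4' \
         http://www.example.com/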
