Scraping a dynamically loaded website

Posted 2021-03-28

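When I load the page "http://proxydb.net" using cURL, or try to scrape the page, the response body is empty. Apparently, the page is dynamically loaded using JavaScript.

What are the options for still getting the rendered source code?

I tried Selenium with the Firefox driver, but that pushes CPU usage to 100% within 15 seconds. I don't think that is a viable option, especially for a larger project that involves scraping 100,000+ pages with Selenium.

I would also like to understand the concept of dynamically loaded pages. How do they work? What code is needed to make them work?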

Answer

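When I load the page "http://proxydb.net" using cURL, or try to scrape the page, then the response body is empty - that is because this particular website uses a user-agent whitelist: if your user agent is not on the whitelist, you simply get a blank page. Presumably all major web browsers are whitelisted (Chrome, Internet Explorer, Edge, Safari, Opera, etc.), but here is one specific user agent that is whitelisted: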

Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36

(that is the user agent of Chrome 65 running on Windows 7 x64), so this works:

curl 'http://proxydb.net/' -H 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
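
The same request can be sketched in Python with only the standard library. This is a minimal sketch, not a tested scraper: the helper names `build_request` and `fetch` are made up for illustration, and it assumes the site still applies the same user-agent whitelist as when the answer was written.

```python
import urllib.request

# User agent quoted in the answer (Chrome 65 on Windows 7 x64).
CHROME_65_UA = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
)


def build_request(url, user_agent=CHROME_65_UA):
    """Return a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=CHROME_65_UA, timeout=10):
    """Download the page and return its body as text (performs a network call)."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Example use (hits the network, so it is left commented out):
# html = fetch("http://proxydb.net/")
```

Because this only sends a plain HTTP request, it avoids the cost of driving a full browser through Selenium, which matters when scraping many pages.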

As for how content is loaded dynamically, it is usually done with XMLHttpRequest or, in older code, iframes.
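
When a page really does fetch its data with XMLHttpRequest, the browser's network tab shows the URL the script requests, and a scraper can often call that URL directly instead of rendering the page. The sketch below assumes a hypothetical endpoint that returns a JSON array of objects with `ip` and `port` fields; neither the endpoint nor the field names come from proxydb.net, and the function names are illustrative only.

```python
import json
import urllib.request


def parse_proxies(json_text):
    """Turn a JSON array like [{"ip": "...", "port": 80}] into "ip:port" strings."""
    return [f'{row["ip"]}:{row["port"]}' for row in json.loads(json_text)]


def fetch_proxies(endpoint_url, timeout=10):
    """Request the (hypothetical) XHR endpoint directly and parse its JSON body."""
    # Some JavaScript libraries send this conventional header with XHR calls;
    # whether a given server checks it is an assumption, and it is optional.
    req = urllib.request.Request(
        endpoint_url, headers={"X-Requested-With": "XMLHttpRequest"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_proxies(resp.read().decode("utf-8"))
```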

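Apparently, the page is dynamically loaded using JavaScript. - wrong, these guys do not load the proxy list dynamically; it is embedded directly in the front page (as long as you use a whitelisted user agent), just obfuscated: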

var q = '42.86.831'.split('').reverse().join('');
var yy = /* */ atob('x4dx43x34x79x4dx54x67x3d'.replace(/\x([0-9A-Fa-f]{2})/g, function() {
    return String.fromCharCode(parseInt(arguments[1], 16))
}));
var pp = (3109 - ([] + [])) /**/ + (+document.querySelector('[data-numr]').getAttribute('data-numr')) - [] + [];
document.write('<a href="/' + q + yy + '/' + pp + '#http">' + q + yy + String.fromCharCode(58) + pp + '</a>');

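(In this case, together with the data-numr div, it translates to 138.68.240.218:3128. The port is actually encrypted, and the decryption key is in a div that looks like <div style="display:none" data-numr="19"></div>; here the key is 19, so the port is 3109 + 19 = 3128.)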
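
The JavaScript above can be mirrored without a browser. The following is a rough sketch under the assumption that every entry follows the same pattern as this one (a reversed string, a hex-escaped base64 string, and a port offset by the `data-numr` key); the function and parameter names are invented for illustration, and pulling the three literals out of each script block is left to the caller.

```python
import base64
import re


def read_key(html):
    """Extract the integer stored in the first data-numr attribute."""
    match = re.search(r'data-numr="(\d+)"', html)
    if match is None:
        raise ValueError("no data-numr attribute found")
    return int(match.group(1))


def decode_proxy(reversed_head, hex_tail, port_base, key):
    """Mirror the page's script: reverse, hex-unescape, base64-decode, add the key."""
    # '42.86.831' reversed is '138.68.24'.
    head = reversed_head[::-1]
    # Replace each xHH (optionally written \xHH) with the character it encodes.
    b64 = re.sub(
        r"\\?x([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        hex_tail,
    )
    # The result is base64 text holding the rest of the IP address.
    tail = base64.b64decode(b64).decode("ascii")
    # The real port is the number in the script plus the data-numr key.
    return f"{head}{tail}:{port_base + key}"
```

With the values from the snippet, `decode_proxy("42.86.831", "x4dx43x34x79x4dx54x67x3d", 3109, 19)` gives `138.68.240.218:3128`, matching the address stated in the answer.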
