nodejs爬虫设置动态userAgent

Posted 浅唱年华1920

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了nodejs爬虫设置动态userAgent相关的知识,希望对你有一定的参考价值。

动态 userAgent

这是我收集到的常用的浏览器头部信息,每次爬取的时候从中随机选取一个,并使用 superAgent 设置请求头部的 User-Agent 字段就好了。

userAgent.js

const userAgents = [
  Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12,
  Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506),
  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (Khtml, like Gecko) Chrome/17.0.963.56 Safari/535.11,
  Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20,
  Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6,
  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER,
  Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0) ,Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9,
  Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727),
  Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400),
  Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E),
  Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre,
  Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52,
  Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12,
  Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER),
  Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6,
  Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6,
  Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400),
  Opera/9.25 (Windows NT 5.1; U; en), Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9,
  Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
]

export default userAgents;

app.js

import request from superagent;
import userAgents from ../src/userAgent;

async function doRequest() {
  let userAgent = userAgents[parseInt(Math.random() * userAgents.length)];
  request
    .get(http://www.xxx.com)
    .set({ User-Agent: userAgent })
    .timeout({ response: 5000, deadline: 60000 })
    .end(async (err, res) => {
      // 处理数据
    });
}

 

 

原文:https://andyliwr.github.io/2017/12/05/nodejs_spider_ip/

 

以上是关于nodejs爬虫设置动态userAgent的主要内容,如果未能解决你的问题,请参考以下文章

使用 selenium-webdriver/firefox (NodeJS) 设置 userAgent

scrapy按顺序启动多个爬虫代码片段(python3)

scrapy---反爬虫

javascript 用于在节点#nodejs #javascript内设置react app的代码片段

perl爬虫01-花瓣网相册

为啥都说爬虫PYTHON好