Hands-on: scraping job listings from Lagou with request

Posted by wangyue0925

from urllib import parse, request
import ssl
import json

url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
# Request headers copied from a real browser session.
# Content-Length is omitted: urllib computes it from the request body.
headers = {
    'User-Agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/75.0.3770.100 Safari/537.36",
    'Referer': "https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput=",
    'Origin': "https://www.lagou.com",
    'Accept': "application/json, text/javascript, */*; q=0.01",
    'Content-Type': "application/x-www-form-urlencoded; charset=UTF-8",
    'Accept-Language': "zh-CN,zh;q=0.9",
    'Connection': "keep-alive",
    'Cookie': "JSESSIONID=ABAAABAAAIAACBI7B0E6DD979133FD3E0688BD2A172D462; user_trace_token=20190625152253-372d4fd2-d2d9-4a1e-b1db-adbaf15de59b; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1561447375; _ga=GA1.2.502816238.1561447375; LGSID=20190625152254-0c9bc1d7-971a-11e9-a4bc-5254005c3644; LGUID=20190625152254-0c9bc483-971a-11e9-a4bc-5254005c3644; _gid=GA1.2.1461701224.1561447375; index_location_city=%E5%85%A8%E5%9B%BD; TG-TRACK-CODE=index_search; X_HTTP_TOKEN=d0da23584e25293624994416516081f1b40cdf8579; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1561449942; LGRID=20190625160542-0718c5c5-9720-11e9-a4bc-5254005c3644; SEARCH_ID=af21aa4087114adf8c011b4f809dc9bd",
}
# Form fields sent in the POST body: page number (pn) and search keyword (kd).
data = {
    'first': 'true',
    'pn': 1,
    'kd': 'Python',
}
new_data = parse.urlencode(data)
req = request.Request(url, headers=headers, data=new_data.encode('utf-8'), method='POST')
# Note: this context disables TLS certificate verification.
context = ssl._create_unverified_context()
res = request.urlopen(req, context=context, timeout=60)
res_json = json.loads(res.read())
print(res_json)
print(res_json['content']['positionResult']['result'])
# res_json is a dict, so serialize it back to a string before writing.
with open('/Users/mac/PycharmProjects/TEST/TEST/爬虫day/file/lago.txt', 'w', encoding='utf-8') as f:
    f.write(json.dumps(res_json, ensure_ascii=False))

# If the site reports that requests are too frequent, the fix is to mimic a browser by filling in complete request headers.
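
When the site rejects a request, the script above fails on the res_json['content'] lookup with a KeyError. The sketch below is one way to guard against that. It assumes, without confirmation from the site, that a throttled reply is a JSON object lacking the content key. The helper name extract_positions and both sample replies (including the positionName field) are made up for illustration.

```python
def extract_positions(payload):
    """Return the list of job dicts from a decoded positionAjax.json reply.

    Returns an empty list when the expected keys are missing, which is
    assumed here to be what a throttled reply looks like.
    """
    try:
        return payload["content"]["positionResult"]["result"]
    except (KeyError, TypeError):
        return []


# Hypothetical sample replies, made up for illustration only.
ok_reply = {"content": {"positionResult": {"result": [{"positionName": "Python Engineer"}]}}}
throttled_reply = {"status": False, "msg": "too frequent"}

print(extract_positions(ok_reply))
print(extract_positions(throttled_reply))
```

Returning an empty list keeps the calling code simple; raising an exception instead would be a reasonable choice if a throttled reply should stop the crawl.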

 

A way to avoid the "requests too frequent" error

import requests
import time
import json

def main():
    url_start = "https://www.lagou.com/jobs/list_python?city=%E6%88%90%E9%83%BD&cl=false&fromSearch=true&labelWords=&suginput="
    url_parse = "https://www.lagou.com/jobs/positionAjax.json?city=天津&needAddtionalResult=false"
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Referer': "https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput=",
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    }
    for x in range(1, 5):
        data = {
            'first': 'true',
            'pn': str(x),
            'kd': 'Python',
        }
        s = requests.Session()  # create a fresh session object for each page
        s.get(url_start, headers=headers, timeout=3)  # GET the listing page with the session to obtain cookies
        cookie = s.cookies  # the cookies obtained by this request
        response = s.post(url_parse, data=data, headers=headers, cookies=cookie, timeout=3)  # POST the search form with those cookies
        time.sleep(5)  # pause between pages to keep the request rate low
        response.encoding = response.apparent_encoding
        text = json.loads(response.text)
        info = text["content"]["positionResult"]["result"]
        print(info)


if __name__ == '__main__':
    main()
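
The loop above pauses for a fixed five seconds and gives up on a page as soon as one reply is unusable. A possible extension, sketched below, is to retry with a growing pause. This is not part of the original script: fetch_with_retry is a hypothetical helper, fetch stands for any zero-argument callable that returns the decoded JSON (for example a closure around the session GET and POST above), and the check for a "content" key is an assumption about what a successful reply looks like.

```python
import time


def fetch_with_retry(fetch, attempts=3, pause=5.0, sleep=time.sleep):
    """Call fetch() until its reply contains "content", pausing between tries.

    fetch is a hypothetical zero-argument callable returning the decoded
    JSON dict; sleep is injectable so the helper can be tried offline.
    """
    for attempt in range(1, attempts + 1):
        reply = fetch()
        if isinstance(reply, dict) and "content" in reply:
            return reply
        if attempt < attempts:
            sleep(pause * attempt)  # wait a little longer after each failure
    raise RuntimeError("no usable reply after %d attempts" % attempts)


# Offline demonstration with a fake fetch that fails twice, then succeeds.
replies = iter([{"msg": "too frequent"}, {"msg": "too frequent"}, {"content": {}}])
waits = []
result = fetch_with_retry(lambda: next(replies), sleep=waits.append)
print(result, waits)
```

The demonstration records the pauses in a list instead of actually sleeping, so it runs instantly and makes no network calls.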

 
