Scraping GitHub Projects
Posted by luobiao-114
import requests
from bs4 import BeautifulSoup

url = 'https://github.com/login'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Referer': 'https://github.com/',
    'Upgrade-Insecure-Requests': '1',  # must be the string '1', not the number 1
    'Host': 'github.com',
    'Connection': 'keep-alive',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

res1 = requests.get(url, headers=headers)
# Sanity check
print(res1.status_code)
print(res1.reason)

# Parse the login page to extract the dynamic token
soup = BeautifulSoup(res1.text, 'lxml')
tag_input = soup.find(name='input', attrs={'name': 'authenticity_token'})
authenticity_token = tag_input.get('value')

data = {'commit': 'Sign in',  # requests url-encodes form data itself, so pass 'Sign in', not 'Sign+in'
        'utf8': '✓',  # the login form submits a UTF-8 checkmark field
        'authenticity_token': authenticity_token,
        'login': '[email protected]',
        'password': '234523456345'}
cookies = res1.cookies.get_dict()

# Note: this URL is https://github.com/session, not https://github.com/login
res2 = requests.post(url='https://github.com/session', headers=headers, cookies=cookies, data=data)
print(authenticity_token)
print(res2.status_code)
print(res2.reason)

# Carry the post-login cookies on subsequent requests
cookies.update(res2.cookies.get_dict())
res3 = requests.get(url='https://github.com/settings/repositories', cookies=cookies, headers=headers)
print(res3.url)
print(res3.status_code)
print(res3.reason)

soup3 = BeautifulSoup(res3.text, 'lxml')
project = soup3.find(name='div', attrs={'class': 'listgroup'})
print(project)
project_list = project.find_all(name='a', attrs={'class': 'mr-1'})
for i in project_list:
    project_name = i.text
    project_ = i.get('href')  # href looks like '/user/repo'
    project_href = 'https://github.com/' + project_.split('/', maxsplit=1)[1]
    print('Project name: %s, project link: %s' % (project_name, project_href))

# Notes on scraping GitHub:
# 1. Subsequent requests must carry the cookies obtained after logging in.
# 2. The login page contains a dynamic token; extract it with bs4 or a regular expression.
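The notes above mention that the token can also be pulled out with a regular expression instead of bs4. A minimal sketch of that alternative follows; the pattern assumes the hidden input renders with name before value (e.g. <input type="hidden" name="authenticity_token" value="...">), which may need adjusting if GitHub changes its markup:

import re
import requests

res = requests.get('https://github.com/login')
# Assumes the attribute order name="authenticity_token" ... value="..."
match = re.search(r'name="authenticity_token"[^>]*value="([^"]+)"', res.text)
authenticity_token = match.group(1) if match else None
print(authenticity_token)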
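As a design note, requests.Session() keeps cookies across requests automatically, which removes the manual cookies.get_dict()/update() bookkeeping in the script above. A sketch under the same assumptions (the placeholder credentials from the original, token parsed with bs4):

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # carries cookies between requests automatically
res1 = session.get('https://github.com/login')
token = BeautifulSoup(res1.text, 'lxml').find(
    'input', attrs={'name': 'authenticity_token'}).get('value')

data = {'commit': 'Sign in', 'utf8': '✓', 'authenticity_token': token,
        'login': '[email protected]', 'password': '234523456345'}
res2 = session.post('https://github.com/session', data=data)

# The post-login cookies now live on the session, so this request is authenticated
res3 = session.get('https://github.com/settings/repositories')
print(res3.status_code)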