Problem scraping the Zhihu homepage with Python
My code is as follows:
import urllib.request
import urllib.parse
import http.cookiejar

url_a = "https://www.zhihu.com/"          # homepage: fails (see question below)
url_a = "https://www.zhihu.com/explore"   # Explore page: works
url_b = "https://www.zhihu.com/signup?next=%2F"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134"}
data = {"username": "***",
        "password": "***"}
data_post = urllib.parse.urlencode(data).encode("utf-8")

# POST the credentials and collect whatever cookies the server sets
req = urllib.request.Request(url_b, data=data_post, headers=headers)
cookie = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie))
print(cookie)
opener.open(req)
print(cookie)

# Fetch the target page with the same opener (and its cookies) and save it
req = urllib.request.Request(url_a, headers=headers)
resq = opener.open(req)
print(cookie)
file = open("D://Desktop//txt.txt", "wb")
file.write(resq.read())
file.close()
Problem description: if the target URL is url_a="https://www.zhihu.com/explore", the Explore page is scraped fine, but if the target is url_a="https://www.zhihu.com/", it fails. Why is that?
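One way to narrow this down is to check where the final request actually lands: if the session is not logged in, the homepage request is typically redirected to a signin page. A minimal diagnostic sketch to drop in after the opener.open(req) call above (the exact redirect target is an assumption; check what your own run prints):

resq = opener.open(req)
print(resq.geturl())    # if this is a signin/signup URL rather than https://www.zhihu.com/, the login did not stick
print(resq.getcode())   # HTTP status of the final response, after any redirects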
Hmm, you probably didn't log in successfully.
The Explore page ("发现 - 知乎") can be fetched without logging in,
but the Zhihu homepage can't be fetched without a login.
I looked into it: logging in to Zhihu is not this simple, so your login didn't succeed.
I already fetched things from my personal profile page, so the login did succeed.
Follow-up answer: Hmm... what exactly did you get from the personal profile? Is the login even for this same Zhihu?
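Since a plain urlencoded POST to the signup URL is unlikely to pass Zhihu's real login flow (it involves tokens and captchas), a common workaround is to log in once in a browser and reuse that session's Cookie header. A minimal sketch, assuming you copy the cookie string from your browser's developer tools (the placeholder below is not a real cookie):

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36",
    "Cookie": "<paste the Cookie header from a logged-in browser session here>",  # placeholder value
}
resp = requests.get("https://www.zhihu.com/", headers=headers)
print(resp.status_code, resp.url)  # 200 and the homepage URL suggest the session was accepted
with open("zhihu_home.html", "wb") as f:
    f.write(resp.content)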
Scraping the Zhihu "Python中文社区" column
Scraping article data from the Zhihu "Python中文社区" column, https://zhuanlan.zhihu.com/zimei
import requests
from urllib.parse import urlencode
from pyquery import PyQuery as pq
from pymongo import MongoClient
import json
import time

base_url = 'https://www.zhihu.com/api/v4/columns/zimei/articles?limit=10&'
headers = {
    'authority': 'www.zhihu.com',
    'referer': 'https://zhuanlan.zhihu.com/zimei',
    'origin': 'https://zhuanlan.zhihu.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
}

client = MongoClient()
db = client['zhihu']
collection = db['zhihu']
max_page = 100


def get_page(page):
    # Build the paginated API URL (offset steps by 10 to match limit=10)
    params = {
        'offset': page * 10
    }
    url = base_url + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)


def parse_page(json_1, page):
    if json_1:
        items = json_1.get('data')
        for item in items:
            if page == 1:
                continue  # page 1 is skipped (original logic kept as-is)
            zhihu = {}
            zhihu['name'] = item.get('author').get('name')
            zhihu['title'] = item.get('title')
            zhihu['text'] = pq(item.get('excerpt')).text()
            zhihu['comments'] = item.get('comment_count')
            zhihu['reposts'] = item.get('voteup_count')
            zhihu['data'] = time.strftime('%Y-%m-%d %H:%M', time.localtime(item.get('updated')))
            yield zhihu


def write_to_file(content):
    # Append one JSON object per line
    with open('zhihu.json', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


def save_to_mongo(result):
    if collection.insert_one(result):
        print('Saved to Mongo')


if __name__ == '__main__':
    for page in range(1, max_page + 1):
        json_1 = get_page(page)
        results = parse_page(json_1, page)
        for result in results:
            print(result)
            write_to_file(result)
            save_to_mongo(result)
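To sanity-check the API half of this without a running MongoDB, you can fetch and parse a single page directly; a small sketch using the functions defined above (page 2 is chosen because parse_page skips page 1):

json_1 = get_page(2)
for item in parse_page(json_1, 2):
    print(item['title'], item['data'])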