Multi-threaded Crawling of Baidu Baike
- Preface:
This is a note from my Evernote; it took me three blog posts to work through... I'm still very much a beginner. Baidu Baike turned out not to be all that different from 故事网: tweak the encoding, do some debugging, and it can be crawled. What I crawled is fairly basic, though, and after roughly 3,000 links the server seemed to start refusing my requests. The crawl was also slow, so I will probably need to look into optimization or true multiprocessing later, since Python threads are constrained by the GIL and do not run in parallel.
The code and project this post follows come from: http://kexue.fm/archives/4385/
- Source code:
#! -*- coding:utf-8 -*-
import requests as rq
import re
import time
import datetime
from multiprocessing.dummy import Pool, Queue  # thread-backed Pool and Queue
import pymysql
from urllib import parse
import html
from bs4 import BeautifulSoup

unescape = html.unescape  # decode HTML entities (e.g. &amp;) in extracted links

tasks = Queue()      # URLs waiting to be crawled
tasks_pass = set()   # URLs that have already been queued
tasks.put('http://baike.baidu.com/item/科学')
count = 0            # total number of pages crawled

url_split_re = re.compile(r'&|\+')

def clean_url(url):
    # Drop params/query/fragment, then cut at the first '&' or '+'
    url = parse.urlparse(url)
    return url_split_re.split(parse.urlunparse((url.scheme, url.netloc, url.path, '', '', '')))[0]

def main():
    global count, tasks_pass
    while True:
        url = tasks.get()  # pop one URL (removing it from the queue)
        web = rq.get(url).content.decode('utf8', 'ignore')
        urls = re.findall(u'href="(/item/.*?)"', web)  # find all in-site links
        for u in urls:
            u = unescape(u)  # un-escape HTML entities in the link
            u = 'http://baike.baidu.com' + u
            u = clean_url(u)
            if u not in tasks_pass:  # queue links we have not seen before
                tasks.put(u)
                tasks_pass.add(u)
                web1 = rq.get(u).content.decode('utf8', 'ignore')
                bsObj = BeautifulSoup(web1, "lxml")
                text = bsObj.title.get_text()
                print(datetime.datetime.now(), ' ', u, ' ', text)
                db = pymysql.connect("localhost", "testuser", "test123", "TESTDB", charset='utf8')
                dbc = db.cursor()
                sql = "insert ignore into baidubaike(url,title) values(%s,%s);"
                dbc.execute(sql, (u, text))
                db.commit()
                dbc.close()
                db.close()
                count += 1
                if count % 100 == 0:
                    print('%s done.' % count)

pool = Pool(4, main)  # 4 worker threads, each running main()

total = 0
while True:  # if no new links were queued in the last 60 seconds, end the script
    time.sleep(60)
    if len(tasks_pass) > total:
        total = len(tasks_pass)
    else:
        break

pool.terminate()
print("terminated normally")
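For reference, the INSERT above assumes a baidubaike table already exists in TESTDB. The original post never shows the schema, so the following is only a sketch of one that would work; the unique key on url (which is what lets insert ignore silently skip duplicates) is my assumption:

# One-time setup sketch; the table layout is assumed, not taken from the original post.
import pymysql

db = pymysql.connect("localhost", "testuser", "test123", "TESTDB", charset='utf8')
dbc = db.cursor()
dbc.execute("""
    CREATE TABLE IF NOT EXISTS baidubaike (
        url   VARCHAR(512) NOT NULL,
        title VARCHAR(255),
        UNIQUE KEY uk_url (url(191))
    ) DEFAULT CHARSET=utf8
""")
db.commit()
dbc.close()
db.close()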
- BUG:
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
The cause is that the requests are not disguised with browser-like headers (no User-Agent is set), so the server drops the connection; a sketch of the fix follows the source link below.
Source: http://blog.csdn.net/u013424864/article/details/60778031
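A minimal sketch of that fix, assuming any mainstream desktop browser User-Agent string (the exact UA below is only illustrative); the same headers dict would be passed to every rq.get call in the script:

import requests as rq

# Illustrative UA string: make the request look like a desktop browser
# instead of the default python-requests client.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/90.0.4430.93 Safari/537.36')
}

web = rq.get('http://baike.baidu.com/item/科学',
             headers=HEADERS, timeout=10).content.decode('utf8', 'ignore')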