Anaconda3 ships with the bs4 package, which saved me from having to install it myself.
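Lately I have come to feel that writing code in a modular way makes it clear and easy to read, and that as the code grows, bugs become easier to track down. (I can't write that much code yet.) Modular code also carries a tool-making, borrow-what-works mindset; using tools is, after all, the privilege of humans and a few other intelligent animals. Going forward I also want to practice the try/except pattern more, since it makes errors plainly visible.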
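As a rough illustration of the try/except idea, here is a minimal sketch that catches specific exceptions and reports them instead of letting the program crash. It uses the standard library's `urllib` as a stand-in for `requests` so that it runs without extra packages; the function name `fetch_text` and the default timeout are my own assumptions, not part of the script in this post.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    Hypothetical helper for illustration only.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except ValueError as e:
        # Raised for malformed URLs, e.g. a missing "https://" scheme.
        print(f"bad URL {url!r}: {e}")
    except OSError as e:
        # Covers urllib.error.URLError, HTTP errors, and socket timeouts.
        print(f"request failed for {url!r}: {e}")
    return None


# A malformed URL is reported and turned into None; no network access is needed.
result = fetch_text("not-a-url")
```

Catching narrow exception types, rather than a bare `Exception` that is immediately re-raised, is what actually makes the cause of a failure visible at the call site.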
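I'm new to web scraping. For now I can only scrape clean, well-structured static pages like Douban; complex dynamic pages driven by JavaScript are still beyond me.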
# -*- coding: utf-8 -*-
"""
Created on Tue Jan 2 17:44:30 2018

@author: xglc
Fetch the contents of the "New Books Express" (新书速递) section on Douban Books.
"""
import requests
from bs4 import BeautifulSoup


def _gethtml():
    # Download the Douban Books home page and return its HTML wrapped in a list.
    try:
        req = requests.get('https://book.douban.com/')
        data1 = []
        data1.append(req.text)
    except Exception as e:
        raise e
    return data1


def _getdata(html):
    # Collect the title and author of every book in the "New Books Express" list.
    title = []
    author = []
    data2 = {}
    soup = BeautifulSoup(html, 'html.parser')
    for li in soup.find('ul', attrs={'class': 'list-col list-col5 list-express slide-item'}).find_all('li'):
        title.append(li.find('div', class_='info').find('div', class_='title').text)
        author.append(li.find('div', class_='info').find('div', class_='author').text)
    data2['title'] = title
    data2['author'] = author
    # print(data2)
    return data2


def _txt(data3):
    # Write the titles to a text file. The original read the global `data`
    # instead of the `data3` parameter; the `with` block closes the file itself.
    with open('f://book.txt', 'w', encoding='utf-8') as f:
        for title in data3['title']:
            f.write(title)


if __name__ == '__main__':
    htmls = _gethtml()
    data = _getdata(htmls[0])
    _txt(data)
    # print(data['title'])
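One fragile spot in the script above is that `find()` returns `None` when an element is missing, so a small change in the page layout would raise an `AttributeError`. The sketch below shows one possible way to parse the same kind of list more defensively, using CSS selectors. The HTML sample is hypothetical markup that only imitates the structure the script expects, and `parse_books` is an illustrative name, not part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the expected structure;
# it is not copied from the real Douban page.
SAMPLE_HTML = """
<ul class="list-col list-col5 list-express slide-item">
  <li><div class="info">
    <div class="title"><a href="#">Book A</a></div>
    <div class="author">Author A</div>
  </div></li>
  <li><div class="info">
    <div class="title"><a href="#">Book B</a></div>
  </div></li>
</ul>
"""


def parse_books(html):
    """Return a list of {'title', 'author'} dicts, skipping malformed items."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    # select() returns an empty list when nothing matches, so the loop is safe.
    for li in soup.select("ul.list-express li"):
        title = li.select_one("div.info div.title")
        author = li.select_one("div.info div.author")
        if title is None:
            continue  # no title, nothing useful to record
        books.append({
            "title": title.get_text(strip=True),
            "author": author.get_text(strip=True) if author else "",
        })
    return books


books = parse_books(SAMPLE_HTML)
```

Keeping each book as one dictionary, rather than two parallel lists, means a missing author cannot shift the titles and authors out of alignment.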