Writing a Simple Python Web Crawler
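1. The main thing to learn here is how the program is put together
a. Read and parse the site
b. Find the relevant page
c. Find the elements that hold the image links
d. Save the images to a folder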
.....
Break the task into individual steps and implement each one as its own function; this keeps the code easy to read.
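As a rough illustration of that decomposition, here is a minimal offline sketch. It is not the program from this post: `FAKE_PAGE`, `read_page`, `find_links`, `save_all`, and `run` are hypothetical names, the page is a canned string rather than a real site, and step b (finding the relevant page) is left out to keep it short.

```python
import os
import tempfile

# Hypothetical stand-in for a web page, so the sketch runs without a network
FAKE_PAGE = '<img src="http://example.com/a.jpg"><img src="http://example.com/b.jpg">'


def read_page():
    # Step a: read the page (here: just return the canned HTML)
    return FAKE_PAGE


def find_links(html):
    # Step c: collect every address between 'img src="' and '.jpg'
    links = []
    start = html.find('img src="')
    while start != -1:
        end = html.find('.jpg', start)
        if end == -1:
            break
        # 'img src="' is 9 characters long, '.jpg' is 4
        links.append(html[start + 9:end + 4])
        start = html.find('img src="', end)
    return links


def save_all(folder, links):
    # Step d: write one empty placeholder file per link into folder
    names = []
    for link in links:
        name = link.split('/')[-1]
        with open(os.path.join(folder, name), 'wb') as f:
            f.write(b'')
        names.append(name)
    return names


def run(folder):
    # The driver reads almost like the step list itself
    return save_all(folder, find_links(read_page()))


saved = run(tempfile.mkdtemp())
```

Because each step is a separate function, a step can be checked on its own, for example by calling `find_links` on a small string before any real download is attempted.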
## Note: the code may raise errors when run and may still need modification
import os
import urllib.request


def url_open(url):
    # Read the URL and return the raw response bytes
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36')
    response = urllib.request.urlopen(req)
    html = response.read()
    return html


def get_page(url):
    # Find the relevant (current) page number
    html = url_open(url).decode('utf-8')
    # Assumption: the page contains current-comment-page">[N], so the number
    # starts 23 characters after the marker begins and ends at the next ']'
    a = html.find('current-comment-page') + 23
    b = html.find(']', a)
    return html[a:b]


def find_imgs(url):
    # Find the elements that hold the image links
    html = url_open(url).decode('utf-8')
    img_addrs = []
    a = html.find('img src=')
    while a != -1:
        # Only accept a '.jpg' that appears within 255 characters of the tag
        b = html.find('.jpg', a, a + 255)
        if b != -1:
            # Skip 'img src="' (9 characters) and keep through '.jpg' (4 characters)
            img_addrs.append(html[a + 9:b + 4])
        else:
            b = a + 9
        a = html.find('img src=', b)
    return img_addrs


def save_imgs(folder, img_addrs):
    # Save the images; download_mm has already changed into the folder
    for each in img_addrs:
        filename = each.split('/')[-1]
        with open(filename, 'wb') as f:
            img = url_open(each)
            f.write(img)


def download_mm(folder='OOXX', pages=10):
    os.makedirs(folder, exist_ok=True)
    os.chdir(folder)
    url = 'http://jandan.net/ooxx/'
    page_num = int(get_page(url))
    for i in range(pages):
        # Step back one page at a time from the current page
        page_url = url + 'page-' + str(page_num - i) + '#comments'
        img_addrs = find_imgs(page_url)
        save_imgs(folder, img_addrs)


if __name__ == '__main__':
    download_mm()
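The page-number step is the easiest one to get wrong, so it may help to try it on a string first. The sketch below assumes the marker appears in the page as `current-comment-page">[N]`; that markup, the `SAMPLE` string, and the helper names `extract_page_num` and `page_urls` are illustrative assumptions, and the real site may look different.

```python
# Made-up fragment; the real site's markup is an assumption and may differ
SAMPLE = '<span class="current-comment-page">[1293]</span>'
MARKER = 'current-comment-page'


def extract_page_num(html):
    # Locate the marker, then skip it plus the 3 characters '">['
    a = html.find(MARKER)
    if a == -1:
        raise ValueError('page marker not found')
    a += len(MARKER) + 3
    # The number runs up to the next closing bracket
    b = html.find(']', a)
    return int(html[a:b])


def page_urls(base, newest, pages):
    # Count down from the newest page, exactly one page per step
    return [base + 'page-' + str(newest - i) + '#comments' for i in range(pages)]


num = extract_page_num(SAMPLE)
urls = page_urls('http://jandan.net/ooxx/', num, 3)
```

Computing each page as `newest - i` avoids a subtle trap: subtracting `i` from a running counter inside the loop (`page_num -= i`) would step back by 0, 1, 2, 3, ... pages and skip pages in between.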