Building a Simple Text Crawler
Posted by 喵吉欧尼酱
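What to scrape: the image URLs in a saved page, then download those images into a specified folder.
How it works:
1. Save the web page's source code to a file
2. Load the source code by reading the file in Python
3. Extract the image URLs with a regular expression
4. Download the images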
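Steps 2 and 3 can be tried without any network access. The following is only a minimal sketch, assuming the saved source uses the same `<img class="uptodate-img" ...>` markup as the sample in this post. The function name `extract_image_urls`, the constant `IMG_PATTERN`, and the `example.com` URLs in `sample` are hypothetical and introduced purely for illustration.

```python
import re

# Pattern for the <img class="uptodate-img" src="..."> tags; the lazy (.*?)
# stops at the first closing quote, so only the URL is captured.
IMG_PATTERN = re.compile(r'<img class="uptodate-img" src="(.*?)"', re.S)


def extract_image_urls(html):
    """Return the src value of every uptodate-img tag, in page order."""
    return IMG_PATTERN.findall(html)


# Hypothetical two-item sample standing in for the saved page source
sample = (
    '<li class="uptodate"><img class="uptodate-img" '
    'src="https://example.com/a.jpg" alt=""></li>\n'
    '<li class="uptodate"><img class="uptodate-img" '
    'src="https://example.com/b.jpg" alt=""></li>'
)

print(extract_image_urls(sample))
```

Keeping the extraction in its own function makes it easy to check the pattern against a small string before pointing it at a full page.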
Source file
# _*_ coding:utf-8 _*_
<!DOCTYPE html>
<html lang='en' xmlns='http://www.w3.org/1999/xhtml'>
<li class="uptodate">
    <a href="/zhiye/course/140.html?type=49" target="_blank">
        <img class="uptodate-img" src="https://jiuye-res.jikexueyuan.com/zhiye/showcase/attach-/20170929/72683326-3cda-45bc-8cd7-7969b93fcfcf.jpg" alt="">
        <p class="uptodate-title">[garbled]</p>
        <p class="uptodate-info"> [garbled] <span>|</span>1[garbled] </p>
    </a>
</li>
<li class="uptodate">
    <a href="/zhiye/course/143.html?type=38" target="_blank">
        <img class="uptodate-img" src="https://jiuye-res.jikexueyuan.com/zhiye/showcase/attach-/20171101/b12ae422-fd63-4b7d-a0d3-13c3ab4479c5.jpg" alt="">
        <p class="uptodate-title">[garbled]Python[garbled]</p>
        <p class="uptodate-info"> [garbled] <span>|</span>4[garbled] </p>
    </a>
</li>
<li class="uptodate">
    <a href="/zhiye/course/135.html?type=50" target="_blank">
        <img class="uptodate-img" src="https://jiuye-res.jikexueyuan.com/zhiye/showcase/attach-/20170928/8cc3edeb-0115-43ea-a46f-db6c6e9255ca.jpg" alt="">
        <p class="uptodate-title">Keras[garbled]</p>
        <p class="uptodate-info"> [garbled] <span>|</span>4[garbled] </p>
    </a>
</li>
<li class="uptodate">
    <a href="/zhiye/course/132.html?type=42" target="_blank">
        <img class="uptodate-img" src="https://jiuye-res.jikexueyuan.com/zhiye/showcase/attach-/20170927/3b4b951a-1b27-494f-9ec8-01e7ce7b7de8.jpg" alt="">
        <p class="uptodate-title">MVC[garbled]MVVM[garbled]</p>
        <p class="uptodate-info"> [garbled] <span>|</span>4[garbled] <span>|</span>4[garbled] </p>
    </a>
</li>
</ul>
</div>
<div class="zhiye">
<h2>[garbled]</h2>
<a href="/zhiye/" target="_blank">[garbled] >></a>
<ul>
Source code
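import os
import re

import requests

# Read the saved page source (the with block closes the file automatically)
with open('ts.txt', 'r', encoding='utf-8') as f:
    html = f.read()

# Match the image URLs
pic_urls = re.findall(r'<img class="uptodate-img" src="(.*?)"', html, re.S)

# Create the output folder if it does not exist yet
os.makedirs('picture', exist_ok=True)

for i, url in enumerate(pic_urls):
    print('Downloading image %s' % i)
    pic = requests.get(url)
    # Keep the extension from the URL (.jpg in the sample); fall back to .png
    ext = os.path.splitext(url)[1] or '.png'
    with open(os.path.join('picture', str(i) + ext), 'wb') as fp:
        fp.write(pic.content)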
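The download step can fail on a slow or missing image, so it may be worth adding a timeout and a status check. The sketch below is one possible variant, not the author's method: the helper names `target_path` and `download_images`, the 10-second default timeout, and the `.jpg` fallback extension are all assumptions made for illustration. It uses only `requests.get`, `Response.raise_for_status`, and `requests.RequestException` from the requests library.

```python
import os
from urllib.parse import urlparse


def target_path(url, index, folder='picture'):
    """Build the local path for image number `index`, keeping the URL's extension."""
    # urlparse(...).path drops any ?query part before the extension is read
    ext = os.path.splitext(urlparse(url).path)[1] or '.jpg'  # assumed fallback
    return os.path.join(folder, str(index) + ext)


def download_images(urls, folder='picture', timeout=10):
    """Download each URL into `folder` and return the list of paths written."""
    # Imported here so target_path can be used without requests installed
    import requests

    os.makedirs(folder, exist_ok=True)
    saved = []
    for i, url in enumerate(urls):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()  # treat 4xx/5xx responses as failures
        except requests.RequestException as exc:
            print('Skipping %s: %s' % (url, exc))
            continue
        path = target_path(url, i, folder)
        with open(path, 'wb') as fp:
            fp.write(resp.content)
        saved.append(path)
    return saved
```

With the list of URLs from the regular expression step, a call such as `download_images(pic_urls)` would write the files into the `picture` folder and skip any URL that cannot be fetched.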