扒虫程序

Posted 2020-10-23 有挫败才有成长

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了扒虫程序相关的知识，希望对你有一定的参考价值。

一、使用字符串查找的方法find扒取教师姓名

 1 import urllib2
 2 response=urllib2.urlopen("http://www.whit.edu.cn/list/650")
 3 html=response.read()
 4 # print html[:1000]
 5 result=[]
 6 temp=html
 7 str_begin=\'<div class="tea_title">\' #每个教师的姓名都在class为tea_title的div下的a标签中
 8 while 1:
 9     pos_begin=temp.find(str_begin)
10     if pos_begin==-1:
11         break
12     len_begin=len(str_begin)
13     temp=temp[(pos_begin+len_begin):]#从上一处结束的位置开始下一次查找
14     
15     str_begin2=\'target="_blank">\'#a的开标签尾部字符串
16     str_end2=\'</a>\' #a的闭标签
17     len_begin2=len(str_begin2)
18     pos_begin2=temp.find(str_begin2) #a的开标签尾部字符串的位置
19     pos_end2=temp.find(str_end2) #a的闭标签的位置
20     result.append(temp[pos_begin2+len_begin2:pos_end2]) #从被a包住的就是姓名
21     
22 i=1   
23 for x in result:
24     print "%d.%s"%(i,x)
25     i+=1

二、使用正则表达式的方法扒取教师姓名

1 import urllib2
2 import re
3 response=urllib2.urlopen("http://www.whit.edu.cn/list/650")
4 html=response.read()
5 name_list=re.findall(r\'<div class="tea_title">.*target="_blank">(.+)</a>\',html)
6 i=1
7 for name in name_list:
8     print "%d.%s"%(i,name)
9     i+=1

三、使用beautifusoup扒取学院名称

from bs4 import BeautifulSoup
import urllib
obj_html=urllib.urlopen("http://zs.whit.edu.cn/list/297")
# print type(obj_html)
str_html=obj_html.read()
# print type(str_html)
soup=BeautifulSoup(str_html,\'lxml\')
# print type(soup)
# print soup.prettify()
tags_a=soup.find_all("a",class_="lf") #或者tags=soup.select("a.lf")#可根据css选择器来查找元素
# print type(tags_a)
i=1
base_url="http://zs.whit.edu.cn"
for tag_a in tags_a:
    print "%d.%s %s"%(i,tag_a.text,base_url+tag_a.attrs["href"])
    i+=1

四、使用lxml+xpath扒取京东的所有笔记本信息

 1 # coding:UTF-8
 2 from lxml import etree #须添加lxml库
 3 import urllib2
 4 url1="https://list.jd.com/list.html?cat=670,671,672&page="
 5 url2="&sort=sort_totalsales15_desc&trans=1&JL=6_0_0#J_main"
 6 headers={"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3298.4 Safari/537.36"}
 7 f=open("notebook.txt",\'w\')
 8 #一共有1101页
 9 for page in range(1,1102):
10     url=url1+str(page)+url2
11     request=urllib2.Request(url=url,headers=headers)
12     response=urllib2.urlopen(request)
13     html=response.read()
14     selector=etree.HTML(html) #得到html元素,是etree.Element类的对象,该类对象有xpath方法进行查询
15     list_product=selector.xpath(\'//div[@id="plist"]//li[@class="gl-item"]//div[@class="p-name"]//a//em/text()\')
16     #因为xpath最终查到的是text的，所以list_product是一个字符串的列表
17     for p in list_product:
18         f.write(p.strip().encode("utf-8")) #因为含有汉字，所以采用utf-8编码转换
19         f.write("\\n")
20     f.write("---------------------------------------"+str(page)+"--------------------------------------------\\n")
21 f.close()

五、使用webdriver扒取京东笔记本电脑的价格

 1 import time
 2 from selenium import webdriver
 3 driver=webdriver.Chrome()
 4 driver.maximize_window()
 5 url="https://list.jd.com/list.html?cat=670,671,672"
 6 driver.get(url)
 7 # print driver.page_source
 8 driver.execute_script(\'window.scrollBy(0,document.body.scrollHeight);\')
 9 time.sleep(2)
10 print driver.page_source
11 elememts=driver.find_elements_by_css_selector("div#plist li.gl-item div.p-price i")
12 print type(elememts)

六、scrapy

1、创建项目

　　$>scrapy startproject <项目名>　　#在当前目录下创建

2、创建爬虫格式

　　$>scrapy genspider <文件名> <url>

　　#url指定爬取的范围

3、运行爬虫

　　$>scrapy crawl <文件名>

以上是关于扒虫程序的主要内容，如果未能解决你的问题，请参考以下文章