Python in Practice: Crawling a Girl-Picture Site That Guys Dream About (and That May Be Taken Down at Any Time)
Posted 日常分享Python
An image crawler that downloads every image on the site.
Tools and environment:
- Google Chrome
- Python
- PyCharm
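Dependencies:
- requests: sends HTTP requests and downloads the images
- lxml: parses the HTML pages
- grequests: a gevent-based asynchronous HTTP request library, used to speed up crawling
Source files:
- get_image.py: sends one request at a time
- get_image_gevent.py: sends five requests at a time
Note: the directory where images are stored can be changed in the get_images function.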
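As a minimal sketch of making the storage directory a parameter instead of a hard-coded value, the helper below builds the local path for one image from a base directory, an album title, and an image URL. The name `build_image_path`, its parameters, and the rule for cleaning up the title are illustrative assumptions and are not part of the original script. It is written for Python 3.

```python
import os
import re


def build_image_path(base_dir, album_title, image_url):
    """Return the local file path for image_url inside an album folder.

    Hypothetical helper: not part of the original script.
    """
    # Replace characters that are unsafe in directory names (assumed rule).
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", album_title).strip()
    # Use the last URL path segment as the file name, as the script does.
    file_name = image_url.split("/")[-1]
    return os.path.join(base_dir, safe_title, file_name)


if __name__ == "__main__":
    print(build_image_path("Pictures", "Album 1/2", "http://example.com/img/001.jpg"))
```

Passing the base directory in from the caller means the download location can be changed without editing the body of get_images.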
Full code:
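# -*- coding: utf-8 -*-
# Rewritten with grequests to speed up image crawling
import os
import time

# grequests applies gevent monkey patching on import, so import it before requests
import grequests
import requests
from lxml import html
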
def get_response(url):
    # Send a browser-like User-Agent; the header name must be "User-Agent"
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    return response

# Collect the URL of every listing page
def get_page_urls():
    start_url = 'http://girl-atlas.com/'
    response = get_response(start_url)
    page_urls = [start_url]
    while True:
        parsed_body = html.fromstring(response.text)
        # Follow the "next" link until the last page is reached
        next_url = parsed_body.xpath('//a[@class="btn-form next"]/@href')
        if not next_url:
            break
        next_url = start_url + next_url[0]
        page_urls.append(next_url)
        response = get_response(next_url)
    print("get_page_urls done!!!")
    return page_urls

# Collect the URL of every girl album
def get_girl_urls(page_urls):
    girl_urls = []
    # Use grequests with 5 concurrent connections
    rs = (grequests.get(url) for url in page_urls)
    responses = grequests.map(rs, size=5)
    for response in responses:
        # grequests.map returns None for a request that failed
        if response is None:
            continue
        parsed_body = html.fromstring(response.text)
        girl = parsed_body.xpath('//div[@class="grid_title"]/a/@href')
        girl_urls.extend(girl)
    return girl_urls

# Collect the image URLs of every album as {album title: image URLs}
def get_image_urls(girl_urls):
    girl_list = []
    # Use 5 concurrent connections
    rs = (grequests.get(url) for url in girl_urls)
    responses = grequests.map(rs, size=5)
    for response in responses:
        # Skip albums whose request failed
        if response is None:
            continue
        parsed_body = html.fromstring(response.text)
        girl_title = parsed_body.xpath('//title/text()')
        image_urls = parsed_body.xpath('//li[@class="slide "]/img/@src | //li[@class="slide "]/img/@delay')
        # print(image_urls)
        girl_dict = {girl_title[0]: image_urls}
        girl_list.append(girl_dict)
    print("get_image_urls done!!!")
    return girl_list

# Download every image into one directory per album
def get_images(girl_list):
    count = 1
    # Default directory for downloaded images
    start_dir = '/home/pein/Pictures/'
    for girl in girl_list:
        # Each dict holds a single {album title: image URLs} pair
        title, urls = list(girl.items())[0]
        dir_name = os.path.join(start_dir, title)
        if not os.path.exists(dir_name):
            os.makedirs(dir_name)
        rs = (grequests.get(url) for url in urls)
        responses = grequests.map(rs)
        for url, r in zip(urls, responses):
            # Skip images whose request failed
            if r is None:
                continue
            print(url)
            with open(os.path.join(dir_name, url.split('/')[-1]), 'wb') as f:
                f.write(r.content)
        print("")
        print("%s %s done!!!" % (count, title))
        count += 1
        print("")

if __name__ == '__main__':
    page_urls = get_page_urls()
    start_time = time.time()
    girl_urls = get_girl_urls(page_urls)
    girl_list = get_image_urls(girl_urls)
    print("girl %s" % len(girl_urls))
    get_images(girl_list)
    elapsed_time = time.time() - start_time
    print("")
    print("elapsed %s seconds!!!!" % elapsed_time)
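The script relies on `grequests.map(rs, size = 5)` to keep at most five requests in flight. The same bounded-concurrency idea can be sketched with the standard library's `ThreadPoolExecutor`. This is a swapped-in technique for illustration, not what the original code uses. The names `fetch_all`, `safe_fetch`, and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real HTTP call so the sketch runs without network access. It is written for Python 3.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=5):
    """Call fetch(url) for every URL with at most max_workers running at once.

    Results come back in the same order as urls; a failed fetch becomes None.
    """
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            # Record the failure as None so the caller can skip it.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input iterable.
        return list(pool.map(safe_fetch, urls))


def fake_fetch(url):
    """Stand-in for a real HTTP request (assumption for this sketch)."""
    if "bad" in url:
        raise IOError("simulated network failure")
    return "body of " + url


if __name__ == "__main__":
    print(fetch_all(["http://example.com/a", "http://example.com/bad"], fake_fetch))
```

In real use, `fetch` could be something like `lambda url: requests.get(url, timeout=10)`. Whichever approach is used, the caller should check for `None` before reading a response.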
Note: the site may be taken down, or its URL may change, at any time.
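Since the address may change, it can help to keep the base URL in one place and to resolve page links against it rather than concatenating strings. The sketch below uses `urljoin` from the standard library. The helper name `resolve_next_url` and the example href values are made up for illustration, because the source does not show what the site's links actually look like. It is written for Python 3.

```python
from urllib.parse import urljoin


def resolve_next_url(base_url, href):
    """Resolve a 'next page' href against base_url (hypothetical helper)."""
    return urljoin(base_url, href)


if __name__ == "__main__":
    base = "http://girl-atlas.com/"
    # Relative and root-relative hrefs resolve to the same absolute URL.
    print(resolve_next_url(base, "index1?p=2"))
    print(resolve_next_url(base, "/index1?p=2"))
```

Plain concatenation of the base URL and a root-relative href would produce a double slash after the host name, which `urljoin` avoids.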
Summary:
This is not an exhaustive list. If you use other packages, feel free to share them in the comments.