Shell script: crawling the elements of a domain's first-level page and judging their cacheability
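Given a domain, how do you tell whether it can be cached? A full-blown professional crawler can of course do the analysis, but if the analysis does not have to be very rigorous, a shell script can do it too. Have a look at my little one-level page crawler, haha. First, the result of running the script (shown as a screenshot in the original post).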
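During processing, elinks dumps every element (link) found on the page and the script counts them. curl then probes the response headers of each URL, and the Cache-Control header decides whether that URL is cacheable. If more than 70% of the URLs under a host are cacheable, I simply treat that host as cacheable. This is crude, but it should be enough as a rough reference and for learning.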
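The decision rule described above can be isolated into a small function that needs no network access. This is only a sketch under assumptions: the function name `classify_cache_control` is made up for illustration, and the rule (no-cache or no-store means "no", max-age=0 means "no", a positive max-age means "yes", anything else is "unknown") is my reading of the description, not a complete treatment of HTTP caching.

```shell
#!/bin/bash
# Hypothetical helper (the name is an assumption, not part of the original post):
# classify one Cache-Control header value as yes / no / unknown.
classify_cache_control() {
    local cc=$1 max_age
    # no-cache or no-store anywhere in the value means "not cacheable"
    if printf '%s\n' "$cc" | grep -Eiq 'no-cache|no-store'; then
        echo no
        return
    fi
    # Extract the digits that follow max-age=, if the directive is present
    max_age=$(printf '%s\n' "$cc" | sed -n 's/.*max-age=\([0-9][0-9]*\).*/\1/p')
    if [ -z "$max_age" ]; then
        echo unknown      # no usable max-age: cannot tell
    elif [ "$max_age" -eq 0 ]; then
        echo no           # max-age=0: expires immediately
    else
        echo yes          # positive lifetime
    fi
}

classify_cache_control "public, max-age=3600"
classify_cache_control "max-age=0"
classify_cache_control "private, no-store"
classify_cache_control ""
```

The four calls print yes, no, no and unknown, one per line. Headers such as Expires or s-maxage are deliberately ignored here, which is one reason the result is only a rough indication.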
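The script is as follows. It takes the domain as its first argument and a grep pattern, used to filter the dumped links, as its second: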
#!/bin/bash
#### Analyze the cacheability of the elements on a domain's first-level page ####
# writer: gaolixu
# usage: $1 = domain to analyze, $2 = grep pattern used to filter the dumped links

data=/tmp/cachetest/cache.test

# Count the records of host $1 whose status (last '+'-separated field) is exactly $2.
# An exact match is required: a plain grep for "no" would also count "unknown".
count_status() {
    awk -F'+' -v h="$1" -v s="$2" '$1 == h && $NF == s {n++} END {print n + 0}' "$data"
}

showprint() {
    tput clear
    echo "$(tput bold)Hosts referenced by the page (Cache is yes when more than 70% of a host's items are cacheable):"
    echo -e "$(tput bold; tput setaf 1)Times\tHost\t\t\t\tIp\t\t\t\t\t\t\tyes%\t\tno%\t\tunknown%\tCache(yes/no)"
    tput sgr0
    i=2
    # One "count host" line per host, most frequently referenced host first
    awk -F'+' '{print $1}' "$data" | sort | uniq -c | sort -nr | while read -r url
    do
        url_host=$(echo "$url" | awk '{print $2}')
        url_yes=$(count_status "$url_host" yes)
        url_no=$(count_status "$url_host" no)
        url_unknown=$(count_status "$url_host" unknown)
        ((url_sum = url_yes + url_no + url_unknown))
        url_yes_b=$(awk -v n="$url_yes" -v sum="$url_sum" 'BEGIN{printf "%.1f", n/sum*100}')
        url_no_b=$(awk -v n="$url_no" -v sum="$url_sum" 'BEGIN{printf "%.1f", n/sum*100}')
        url_unknown_b=$(awk -v n="$url_unknown" -v sum="$url_sum" 'BEGIN{printf "%.1f", n/sum*100}')
        url_ip=$(awk -F'+' -v h="$url_host" '$1 == h {print $3; exit}' "$data")
        url_status=$(awk -v p="$url_yes_b" 'BEGIN{if (p > 70) print "yes"; else print "no"}')
        # Print each column at a fixed screen position
        echo -n -e "$url" | sed 's/ /\t/'
        tput cup $i 40;  echo -n "$url_ip"
        tput cup $i 96;  echo -n "$url_yes_b%"
        tput cup $i 112; echo -n "$url_no_b%"
        tput cup $i 128; echo -n "$url_unknown_b%"
        tput cup $i 144; echo "$url_status"
        ((i += 1))
    done
}

# Dump the page with elinks, keep the lines matching the pattern,
# and put every http: link on a line of its own
list_links() {
    elinks --dump "$1" | grep -E -- "$2" | sed '/http:/s/http:/\nhttp:/' | awk '/^http/' | sort
}

host "$1" &>/dev/null || { echo "Cannot resolve host: $1"; exit 1; }
mkdir -p /tmp/cachetest
rm -f "$data"

num=$(list_links "$1" "$2" | wc -l)
echo "$(tput bold)Total links: $num"
echo "$(tput bold)The analysis is running..."
tput sgr0

list_links "$1" "$2" | while read -r url
do
    url_host=$(echo "$url" | awk -F'[/|,]' '{print $3}')
    url_url=${url#*"$url_host"}    # path: everything after the host
    # All addresses of the host, joined with "/"
    url_ip=$(host "$url_host" | grep address | awk '{print $NF}' | paste -sd/ -)
    # First Cache-Control header; header names are case-insensitive, and the trailing \r is dropped
    cc=$(curl -s -I -m 3 "$url" | grep -i '^Cache-Control' | head -1 | tr -d '\r')
    [ "$cc" ] || cc="no cache-control flags or time out"
    # Number following max-age=, empty when the directive is absent
    cc_i=$(echo "$cc" | sed -n 's/.*max-age=\([0-9][0-9]*\).*/\1/p')
    if echo "$cc" | grep -Eiq 'no-cache|no-store'; then
        cc_status="no"
    elif [ -z "$cc_i" ]; then
        cc_status="unknown"
    elif [ "$cc_i" -eq 0 ]; then
        cc_status="no"
    else
        cc_status="yes"
    fi
    # One record per URL: host+path+ip+header+status
    echo "$url_host+$url_url+$url_ip+$cc+$cc_status" >> "$data"
    echo -n "#"
done
sleep 2
showprint
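The per-host statistics that showprint computes with several passes over the record file can also be produced in a single awk pass. The sketch below assumes the same `+`-separated record layout (host, path, ip, header, status) with the status as the last field. The function name `summarize_hosts` and the sample records are made up for illustration, and it prints plain text instead of positioning columns with tput.

```shell
#!/bin/bash
# Hypothetical helper: print "host total yes% cacheable" for every host in a record file.
summarize_hosts() {
    awk -F'+' '
        { total[$1]++; count[$1, $NF]++ }      # $NF is the status: yes / no / unknown
        END {
            for (h in total) {
                pct = count[h, "yes"] / total[h] * 100
                # A host counts as cacheable when more than 70% of its URLs are
                printf "%s %d %.1f %s\n", h, total[h], pct, (pct > 70 ? "yes" : "no")
            }
        }' "$1" | sort
}

# Made-up sample records, for demonstration only
sample=$(mktemp)
cat > "$sample" <<'EOF'
a.example.com+/x.js+192.0.2.1+Cache-Control: max-age=60+yes
a.example.com+/y.css+192.0.2.1+Cache-Control: max-age=60+yes
a.example.com+/z.png+192.0.2.1+Cache-Control: max-age=60+yes
a.example.com+/api+192.0.2.1+Cache-Control: no-store+no
b.example.com+/a.js+192.0.2.2+Cache-Control: max-age=60+yes
b.example.com+/b+192.0.2.2+no cache-control flags or time out+unknown
EOF
summarize_hosts "$sample"
rm -f "$sample"
```

With the sample data, a.example.com has 3 of 4 URLs cacheable (75.0%, so yes) and b.example.com has 1 of 2 (50.0%, so no). Reading the status from the last field keeps the count correct even if a URL path itself contains a `+`.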
This article comes from the "奔跑的linux" blog. Please keep this attribution: http://benpaozhe.blog.51cto.com/10239098/1747543