爬虫之初试

Posted I can do this all day

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了爬虫之初试相关的知识,希望对你有一定的参考价值。

  MD,这个星期简直了~~有时候真是一步错步步错啊,各种套路,好了,废话不多说,因为急求数据,于是按照同事的方法,自己写了一份爬虫,算是初试吧。

  1 <?php
  2 header("Content-type: text/html; charset=utf-8");
  3 include_once "Connection.php";
  4 include_once "autoload.php";
  5 
  6 $db = NewADOConnection(‘pgsql‘);
  7 $db->Connect(~~~~~);
  8 $db->Execute("set names utf8;");
 14 $city= [
 15     [‘beijing‘,‘北京‘],
 16     [‘tianjin‘,‘天津‘],
 17     [‘shanghai‘,‘上海‘],
 18     [‘chongqing‘,‘重庆‘],
 19     [‘guangzhou‘,‘广州‘],
 20     [‘shenzhen‘,‘深圳‘],
 21     [‘shijiazhuang‘,‘石家庄‘],
 22     [‘taiyuan‘,‘太原‘],
 23     [‘huhehaote‘,‘呼和浩特‘],
 24     [‘shenyang‘,‘沈阳‘],
 25     [‘changchun‘,‘长春‘],
 26     [‘haerbin‘,‘哈尔滨‘],
 27     [‘nanjing‘,‘南京‘],
 28     [‘hangzhou‘,‘杭州‘],
 29     [‘hefei‘,‘合肥‘],
 30     [‘fujianfuzhou‘,‘福州‘],
 31     [‘nanchang‘,‘南昌‘],
 32     [‘jinan‘,‘济南‘],
 33     [‘zhengzhou‘,‘郑州‘],
 34     [‘wuhan‘,‘武汉‘],
 35     [‘changsha‘,‘长沙‘],
 36     [‘guangzhou‘,‘广州‘],
 37     [‘nanning‘,‘南宁‘],
 38     [‘haikou‘,‘海口‘],
 39     [‘chengdu‘,‘成都‘],
 40     [‘guiyang‘,‘贵阳‘],
 41     [‘kunming‘,‘昆明‘],
 42     [‘lasa‘,‘拉萨‘],
 43     [‘xian‘,‘西安‘],
 44     [‘lanzhou‘,‘兰州‘],
 45     [‘xining‘,‘西宁‘],
 46     [‘yinchuan‘,‘银川‘],
 47     [‘wulumuqi‘,‘乌鲁木齐‘]
 48 ];
 49 
 50 
 51 for($i = 1; $i<count($city);$i++){
 52     $cityData = (getCityData($city[$i][0]));
 53     var_dump($cityData);
 54     $res = toDataBase($cityData,$city[$i][1]);
 55     if($res === false){
 56         var_dump($city[$i]);
 57         break;
 58     }
 59     var_dump($res);
 60 }
 61 
 62 
 63 
 64 //获取城市12月数据
 65 function getCityData($city="beijing"){
 67     $beginYear = "2014";
 68     $month = ["01","02","03","04","05","06","07","08","09","10","11","12"];
 69 
 70     $result = [];
 71     for($i = 0; $i<12;$i++){
 72         $time = $beginYear.$month[$i];
 73         $url = "http://www.tianqihoubao.com/lishi/".$city."/month/".$time.".html";   //地址拼接
 74         $output = httpcurl($url);
 75         $output = iconv("GBK", "UTF-8", $output);   //转码
 76         $output = pregData($output);
 77         $result[]  = $output;
 78     }
 79     return $result;
 80 }
 81 
 82 function toDataBase($data, $city){
 83     if(!$data){
 84         return false;
 85     }
 86     $sql = "insert into air_report(date,state,temperature,wind,city) values ";
 87     $monthLength = count($data);
 88     $outerValues = "";
 89     for($i=0; $i<$monthLength; $i++ ){
 90         $dayLength = count($data[$i]);
 91         $values = "";
 92         for($j=0; $j<$dayLength; $j++){
 93             $values .= "(";
 94             $values .= implode($data[$i][$j],‘,‘).",‘".$city."‘";
 95             $values.="),";
 96 //            var_dump($values);
 97         }
 98         $outerValues = $outerValues.$values;
 99     }
100     $outerValues = substr($outerValues,0,strlen($outerValues)-1);
101     $sql = $sql.$outerValues;
102     global $db;
103     $res = $db->Execute($sql);
104     if($res == false){
105         return false;
106     }
107 //    var_dump($db->errorMsg());
108     return $res;
109 }
110 
111 
112 //解析数据
113 function pregData($str){
114     $rule = "/<table width=\"100%\".*?>.*?<\/table>/ism";
115     $ruleTable = "/<td.*?>.*?<\/td>/ism";
116     $ruleTr = "/<tr.*?>.*?<\/tr>/ism";
117     $output = "";
118     preg_match_all($rule, $str, $output);
119     preg_match_all($ruleTable, $output[0][0], $output);
120     $output = $output[0];
121     $output = array_slice($output, 4);
122     $trLength = count($output);
123     $info = [];
124     for($i = 0; $i<$trLength; $i++){
125         $output[$i] = "‘".trim(strip_tags($output[$i]))."‘";    //直接strip_tags()取数据
126     }
127     $info = array_chunk($output, 4);
128     return $info;
129 }
130 
131 
132 function httpcurl($url, $post_data = null)
133 {
134     $ch = curl_init();
135     curl_setopt($ch, CURLOPT_URL, $url);//x
136     curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
137 //    curl_setopt($ch, CURLOPT_POST, 1);
138 //    curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
139 //    curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
140     $output = curl_exec($ch);
141     curl_close($ch);
142     return $output;
143 }

  数据有时取出来后格式还是有问题的,刚好碰上的一个取出来的值里面包含不可见的字符,想要去除暂时还没有想到合适的方法。

以上是关于爬虫之初试的主要内容,如果未能解决你的问题,请参考以下文章

scrapy爬虫-1-初试页面抓取

爬虫初试

python爬虫框架scrapy初试

Python 初试爬虫博客园

爬虫初试

Python爬虫实战,零基础初试爬虫下载图片(附源码和分析过程)