Basic Python Web Scraping Project: Scraping the NetEase News Rankings
1. Basic Scraping
Most scraping amounts to GET requests, i.e., fetching data directly from the target server.
Python 2 ships with the urllib and urllib2 modules, which are enough for basic page fetching. Beyond the standard library, requests is a very useful package, and httplib2 and others fill a similar role.
Requests:
import requests
response = requests.get(url)
content = response.content  # reuse the response instead of fetching the page a second time
print "response headers:", response.headers
print "content:", content
Urllib2:
import urllib2
response = urllib2.urlopen(url)
content = response.read()  # read from the response opened above rather than requesting again
print "response headers:", response.headers
print "content:", content
Httplib2:
import httplib2
http = httplib2.Http()
response_headers, content = http.request(url, 'GET')
print "response headers:", response_headers
print "content:", content
In addition, for URLs that carry a query string, a GET request generally appends the request data to the URL: a ? separates the URL from the data, and multiple parameters are joined with &.
data = {'data1':'XXXXX', 'data2':'XXXXX'}
Requests: data is a dict (json-style key-value pairs)
import requests
response = requests.get(url=url, params=data)
Urllib2: data is a string
import urllib, urllib2
data = urllib.urlencode(data)
full_url = url+'?'+data
response = urllib2.urlopen(full_url)
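To check that both approaches produce the same request, you can print the final URL that requests composed (response.url is part of the requests API; url and the data values here are placeholders, as above):
import requests
data = {'data1': 'XXXXX', 'data2': 'XXXXX'}
response = requests.get(url=url, params=data)
print response.url  # e.g. http://.../path?data1=XXXXX&data2=XXXXX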
Notes on the NetEase news ranking scraper:
- Fetch pages with the urllib2 or requests package.
- Parse the first-level (ranking) page with regular expressions and the second-level pages with XPath.
- Save the extracted titles and links to local files.
The code:
# -*- coding: utf-8 -*-
import os
import re
import urllib2  # only needed for the commented-out urllib2 alternatives below
import requests
from lxml import etree

def StringListSave(save_path, filename, slist):
    '''Write a list of (title, link) pairs to save_path/filename.txt, one pair per line.'''
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    path = save_path + "/" + filename + ".txt"
    with open(path, "w+") as fp:
        for s in slist:
            fp.write("%s\t\t%s\n" % (s[0].encode("utf8"), s[1].encode("utf8")))

def Page_Info(myPage):
    '''Regex: pull (section title, section url) pairs off the first-level ranking page.'''
    mypage_Info = re.findall(
        r'<div class="titleBar" id=".*?"><h2>(.*?)</h2><div class="more"><a href="(.*?)">.*?</a></div></div>',
        myPage, re.S)
    return mypage_Info

def New_Page_Info(new_page):
    '''Regex (slow) or XPath (fast): pull (title, url) pairs off a second-level page.'''
    # new_page_Info = re.findall(r'<td class=".*?">.*?<a href="(.*?)\.html".*?>(.*?)</a></td>', new_page, re.S)
    # # new_page_Info = re.findall(r'<td class=".*?">.*?<a href="(.*?)">(.*?)</a></td>', new_page, re.S) # bugs
    # results = []
    # for url, item in new_page_Info:
    #     results.append((item, url + ".html"))
    # return results
    dom = etree.HTML(new_page)
    new_items = dom.xpath('//tr/td/a/text()')
    new_urls = dom.xpath('//tr/td/a/@href')
    assert(len(new_items) == len(new_urls))
    return zip(new_items, new_urls)

def Spider(url):
    i = 0
    print "downloading ", url
    # NetEase serves these pages as GBK, so decode explicitly.
    myPage = requests.get(url).content.decode("gbk")
    # myPage = urllib2.urlopen(url).read().decode("gbk")
    myPageResults = Page_Info(myPage)
    save_path = u"网易新闻抓取"
    filename = str(i) + "_" + u"新闻排行榜"
    StringListSave(save_path, filename, myPageResults)
    i += 1
    for item, url in myPageResults:
        print "downloading ", url
        new_page = requests.get(url).content.decode("gbk")
        # new_page = urllib2.urlopen(url).read().decode("gbk")
        newPageResults = New_Page_Info(new_page)
        filename = str(i) + "_" + item
        StringListSave(save_path, filename, newPageResults)
        i += 1

if __name__ == '__main__':
    print "start"
    start_url = "http://news.163.com/rank/"
    Spider(start_url)
    print "end"