Web Scraping with R, Part 1


For collecting and analyzing data.

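[Note] Everything shared here is what I learned from a few specialist books. It may include some tricks, insights, and small lessons from my own use, but it is far less rich and detailed than the books themselves. While studying, I found that publicly searchable resources on web scraping with R are not as plentiful as those on other topics, so I am sharing these notes for those who come after me. For the same reason I make no claim of originality for this material. Discussion, criticism, and corrections are welcome, so that we can all improve together.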

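1. WHY R?

Even non-specialists have probably heard that R currently lags well behind other software for web scraping. R is not a purpose-built scraping tool, and simple point-and-click applications such as 八爪鱼 (Bazhuayu) can fully meet the needs of "occasional users" like us. So why scrape with R? I believe everyone who comes searching for R scraping techniques has their own answer.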

A few advantages worth noting:

#1. R is a software environment with a primarily statistical focus.

#2. It can produce excellent visualizations.

#3. It may cover the complete workflow, from collecting data to analyzing it.

2. About basics.

We need to prepare ourselves with some basic knowledge of HTML, XML, regular expressions, and XPath, BUT all the operations are executed from WITHIN R!
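As a small taste of the regular-expression part, here is a minimal sketch using only base R. The HTML fragment, the film titles, and the numbers are all made up for illustration; they are not taken from any real page.

```r
# A made-up fragment of HTML source (titles and figures are invented)
html_line <- "<td>Film A</td><td>1200</td><td>Film B</td><td>350</td>"

# Regular expression: one or more digits sitting between <td> and </td>
pattern <- "<td>[0-9]+</td>"

# gregexpr() locates every match, regmatches() pulls out the matched text
cells <- regmatches(html_line, gregexpr(pattern, html_line))[[1]]

# Strip the surrounding tags and convert the remaining text to numbers
box_office <- as.numeric(gsub("</?td>", "", cells))
print(box_office)
```

Regular expressions like this work on the raw text and know nothing about the document's structure, which is why XPath is usually the better choice once a page has been parsed.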

3. RECOMMENDATION

http://www.r-datacollection.com

4. A little case study.

# Scrape movie box-office data
library(stringr)
library(XML)
library(maps)
# htmlParse() interprets (parses) the HTML
# and creates a parsed-document object
movie_parsed <- htmlParse("http://58921.com/boxoffice/wangpiao/20161004",
                          encoding = "UTF-8")
# The next step: extract the tables/data
# readHTMLTable() identifies the tables in the page and reads them out
tables <- readHTMLTable(movie_parsed, stringsAsFactors = FALSE)
# Check what kind of object came back
is.matrix(tables)
is.character(tables)
is.data.frame(tables)
is.list(tables)
# So we get a "list": one element for each table found in the page
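Since the live page may change or disappear, the same steps can be tried offline. The following is only a sketch: the inline page, its table, and the values in it are invented, and `header = FALSE` is set explicitly because this toy table has no header row.

```r
library(XML)

# A tiny made-up page with one table (rows are invented for illustration)
page <- "<html><body>
<table>
  <tr><td>Film A</td><td>1200</td></tr>
  <tr><td>Film B</td><td>350</td></tr>
</table>
</body></html>"

# asText = TRUE tells htmlParse() that 'page' holds HTML content,
# not a file name or URL
doc <- htmlParse(page, asText = TRUE)

# header = FALSE: treat every row as data
tables <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)

is.list(tables)          # the result is a list ...
box <- tables[[1]]       # ... and each element is a data frame
is.data.frame(box)

# Cell values arrive as character, so convert before doing arithmetic
box[[2]] <- as.numeric(box[[2]])
sum(box[[2]])
```

Pulling a single element out with `[[1]]` is the step that turns the list into something you can analyze directly.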

 

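Because R's support for Chinese is not very good, running into garbled Chinese characters is normal, so we need more advanced text manipulation tools. (In this example, some columns are lost entirely because the site serves the data in those columns as .png images.)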
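One common cause of garbled Chinese text is an encoding mismatch, and base R's `iconv()` can often repair it. The sketch below simulates that situation; the assumption that the incoming bytes are GBK-encoded is hypothetical and is not a claim about the site used above.

```r
# "电影" (movie), written with Unicode escapes so the script stays ASCII-only
utf8_text <- "\u7535\u5f71"

# Simulate the bytes a GBK-encoded page would deliver (assumed encoding)
gbk_bytes <- iconv(utf8_text, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]
gbk_text <- rawToChar(gbk_bytes)

# Read as UTF-8 these bytes are garbage; converting from GBK restores the text
fixed <- iconv(gbk_text, from = "GBK", to = "UTF-8")
identical(fixed, utf8_text)
```

If the conversion returns `NA`, the guessed source encoding is probably wrong, or the platform's iconv does not support it.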

5. ABCs of...

For browsing the Web, there is a hidden standard behind the scenes that structures how information is displayed.

#HTML, the HyperText Markup Language

HTML is not a dedicated data storage format, but it usually contains the information we are after. In general, HTML is used to shape how information is displayed.

#XML, the Extensible Markup Language

The main purpose of XML is to store data. HTML documents are interpreted and transformed into pretty-looking output by browsers, whereas XML is "just" data wrapped in user-defined tags. The user-defined tags make XML much more flexible for storing data than HTML. Both HTML and XML-style documents offer natural, often hierarchical, structures for data storage.
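To make "data wrapped in user-defined tags" concrete, here is a minimal sketch with the XML package. The tag names (`movies`, `movie`, `title`, `tickets`) and all values are my own invention for illustration.

```r
library(XML)

# Made-up XML: the tag names are user-defined, chosen just for this example
xml_text <- "<movies>
  <movie year='2016'><title>Film A</title><tickets>1200</tickets></movie>
  <movie year='2016'><title>Film B</title><tickets>350</tickets></movie>
</movies>"

# asText = TRUE: parse the string itself rather than treating it as a file name
doc <- xmlParse(xml_text, asText = TRUE)

# XPath follows the hierarchy: every <title> directly under a <movie>
titles <- xpathSApply(doc, "//movie/title", xmlValue)
tickets <- as.numeric(xpathSApply(doc, "//movie/tickets", xmlValue))

# Collect the extracted values into a data frame for analysis
movies <- data.frame(title = titles, tickets = tickets,
                     stringsAsFactors = FALSE)
print(movies)
```

The hierarchy of the document (movies, then movie, then title) is exactly what the XPath expression walks through.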

(unfinished......)

 
