Static Web Page Scraping in R
Posted by 熊彼特的厨房
Sharing a short snippet here; I will fold it into the lecture notes when I have time.
# install.packages('stringr')
# install.packages('RCurl')
# install.packages('XML')
library(RCurl)
library(XML)
library(stringr)
# Helper to parse listing information: split x on "|" and return the j-th field, trimmed
gettp <- function(x, j) {
  return(str_trim(unlist(str_split(x, pattern = "\\|"))[j]))
}
# Accumulates the listings from every results page
xujiahui <- data.frame()
for (i in 1:31) {
  # The first page has no "pg" segment in its URL; later pages are pg2, pg3, ...
  if (i == 1) {
    url <- "https://sh.lianjia.com/ershoufang/rs%E5%BE%90%E5%AE%B6%E6%B1%87/"
  } else {
    url <- paste0('https://sh.lianjia.com/ershoufang/pg', i, 'rs%E5%BE%90%E5%AE%B6%E6%B1%87/')
  }
  # Download the page and parse it into an HTML tree
  doc <- getURL(url)
  parsed_doc <- htmlParse(doc)
  # Pull each column out of the tree with an XPath query
  community <- xpathSApply(parsed_doc, '//a[@data-el="region"]', xmlValue)
  totalprice <- xpathSApply(parsed_doc, '//*[contains(concat( " ", @class, " " ), concat( " ", "totalPrice", " " ))]//span', xmlValue)
  type_all <- xpathSApply(parsed_doc, '//*[contains(concat( " ", @class, " " ), concat( " ", "houseInfo", " " ))]', xmlValue)
  # for(i in 1:30){
  #   print(str_trim(unlist(str_split(type_all[i],pattern = "\\|"))[1]))
  # }
  # houseInfo is one "|"-separated string per listing; take fields 1, 2 and 6
  type <- unlist(lapply(type_all, function(x) gettp(x, 1)))
  size <- unlist(lapply(type_all, function(x) gettp(x, 2)))
  year <- unlist(lapply(type_all, function(x) gettp(x, 6)))
  avgprice <- xpathSApply(parsed_doc, '//*[contains(concat( " ", @class, " " ), concat( " ", "unitPrice", " " ))]//span', xmlValue)
  # Append this page's rows to the running result
  result <- data.frame(community, totalprice, type, size, year, avgprice)
  xujiahui <- rbind(xujiahui, result)
  print(i)  # progress indicator
}
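To see what the XPath queries and the `gettp` helper do without hitting the live site, here is a minimal offline sketch that runs the same extraction steps on an inline HTML string. The markup in `sample_html` is hypothetical, invented only to mirror the attributes and class names the queries above look for; the real page structure may differ and can change over time. The field positions (1, 2, 6) are the ones used in the script above.

```r
library(XML)
library(stringr)

# Same helper as in the script: split on "|" and return the j-th trimmed field.
gettp <- function(x, j) {
  str_trim(unlist(str_split(x, pattern = "\\|"))[j])
}

# Hypothetical markup (an assumption for illustration, not the site's real HTML).
sample_html <- '
<html><body>
  <div class="info">
    <a data-el="region">Community A</a>
    <div class="houseInfo">2 rooms | 65.3 sqm | South | Renovated | Mid floor | Built 1995 | Slab</div>
    <div class="totalPrice"><span>520</span></div>
    <div class="unitPrice"><span>79632 per sqm</span></div>
  </div>
  <div class="info">
    <a data-el="region">Community B</a>
    <div class="houseInfo">3 rooms | 98.0 sqm | South North | Plain | High floor | Built 2008 | Tower</div>
    <div class="totalPrice"><span>880</span></div>
    <div class="unitPrice"><span>89796 per sqm</span></div>
  </div>
</body></html>'

# asText = TRUE tells htmlParse the argument is HTML content, not a file or URL.
parsed <- htmlParse(sample_html, asText = TRUE)

# One XPath query per column, each returning a character vector.
community  <- xpathSApply(parsed, '//a[@data-el="region"]', xmlValue)
totalprice <- xpathSApply(parsed, '//*[contains(concat(" ", @class, " "), " totalPrice ")]//span', xmlValue)
avgprice   <- xpathSApply(parsed, '//*[contains(concat(" ", @class, " "), " unitPrice ")]//span', xmlValue)
type_all   <- xpathSApply(parsed, '//*[contains(concat(" ", @class, " "), " houseInfo ")]', xmlValue)

# Split the "|"-separated houseInfo string into individual columns.
type <- vapply(type_all, gettp, character(1), j = 1, USE.NAMES = FALSE)
size <- vapply(type_all, gettp, character(1), j = 2, USE.NAMES = FALSE)
year <- vapply(type_all, gettp, character(1), j = 6, USE.NAMES = FALSE)

result <- data.frame(community, totalprice, type, size, year, avgprice,
                     stringsAsFactors = FALSE)
print(result)
```

Note that `data.frame()` stops with an error if the vectors differ in length, so if one query matches fewer nodes than the others on a real page (for example, a listing with a missing field), that page will fail rather than silently misalign the columns.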