R web scraping: Unable to read the main table

Posted: 2020-04-22 13:04:43

【Question】I am new to web scraping. I am trying to scrape a table with the code below, but I am unable to get it. The data source is:
https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1
url <- "https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1"
urlYAnalysis <- paste(url, sep = "")
webpage <- readLines(urlYAnalysis)
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")
Tab <- readHTMLTable(tableNodes[[1]])
I copied this approach from Web scraping of key stats in Yahoo! Finance with R, where it is applied to Yahoo Finance data.

In that example, readHTMLTable(tableNodes[[12]]) reads table 12, but when I try tableNodes[[12]] here it always gives me an error:
Error in do.call(data.frame, c(x, alis)) :
variable names are limited to 10000 bytes
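A quick check that hints at the cause (a minimal sketch, assuming the objects from the code above exist): count the table nodes in the static HTML. The screener's main table is filled in by JavaScript after the page loads, so it is absent from the raw source that readLines() downloads.

length(tableNodes)  # far fewer <table> nodes than the browser view suggests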
Please suggest how I can extract the table and combine it with the data from the other tabs (Fundamental, Technical and Performance).
【Answer 1】This data is returned dynamically as json. In R (the behaviour differs from Python requests) you get HTML back, from which you can extract a given page's results as json. A page includes the info for all tabs plus 50 records. From the first page you get the total record count, so you can calculate the total number of pages to loop over to retrieve all results, perhaps combining them into a final dataframe during the loop. You alter the pn parameter of the XHR POST body to the appropriate page number to get the desired results in each new POST request. There are two required headers.

It is probably a good idea to write a function that accepts a page number in its signature and returns the given page's json as a dataframe, then apply it via the tidyverse to handle the loop and combine the results into a final dataframe:
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)

# Both headers are required for the endpoint to answer
headers = c(
  'User-Agent' = 'Mozilla/5.0',
  'X-Requested-With' = 'XMLHttpRequest'
)

# Body of the XHR POST request the screener page itself makes
data = list(
  'country[]' = '6',
  'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
  'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
  'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
  'exchange[]' = '109',
  'exchange[]' = '127',
  'exchange[]' = '51',
  'exchange[]' = '108',
  'pn' = '1',  # page number; alter this in a loop over all pages (50 rows per page)
  'order[col]' = 'eq_market_cap',
  'order[dir]' = 'd'
)

r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks',
                httr::add_headers(.headers = headers), body = data)

# The json comes back wrapped in HTML; pull the text out of the <p> node
s <- r %>% read_html() %>% html_node('p') %>% html_text()

page1_data <- jsonlite::fromJSON(str_match(s, '(\\[.*\\])')[1, 2])
total_rows <- str_match(s, '"totalCount\":(\\d+),')[1, 2] %>% as.integer()
num_pages <- ceiling(total_rows / 50)
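To get a feel for what one page's payload looks like before building the loop, a quick inspection (assuming the request above succeeded; the exact fields depend on your screener settings):

str(page1_data, max.level = 1)  # ~50 rows, one per instrument, plus a nested viewData column
total_rows                      # total matching instruments reported by the endpoint
num_pages                       # pages needed to cover them at 50 rows per page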
Here is my current attempt at combining all pages; I welcome feedback. It keeps every returned column across all pages, so I have to handle missing columns and differing column ordering, as well as one column that is itself a data.frame. Since far more columns are returned than are visible on the page, you could simply subset the returned columns with a mask for only the columns present in a given tab (a sketch of this follows the code).
library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)
library(tidyverse)
library(data.table)

headers = c(
  'User-Agent' = 'Mozilla/5.0',
  'X-Requested-With' = 'XMLHttpRequest'
)

data = list(
  'country[]' = '6',
  'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
  'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
  'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
  'exchange[]' = '109',
  'exchange[]' = '127',
  'exchange[]' = '51',
  'exchange[]' = '108',
  'pn' = '1',  # page number; overwritten inside get_data()
  'order[col]' = 'eq_market_cap',
  'order[dir]' = 'd'
)

# Request one page. Page 1 is returned as the raw json string so the total
# count can be read from it; later pages are returned as a dataframe.
get_data <- function(page_number) {
  data['pn'] = page_number
  r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks',
                  httr::add_headers(.headers = headers), body = data)
  s <- r %>% read_html() %>% html_node('p') %>% html_text()
  if (page_number == 1) {
    return(s)
  } else {
    return(data.frame(jsonlite::fromJSON(str_match(s, '(\\[.*\\])')[1, 2])))
  }
}

# Expand the nested viewData column (itself a data.frame) into top-level columns
clean_df <- function(df) {
  interim <- df['viewData']
  df_minus <- subset(df, select = -c(viewData))
  df_clean <- cbind.data.frame(c(interim, df_minus))
  return(df_clean)
}

initial_data <- get_data(1)
df <- clean_df(data.frame(jsonlite::fromJSON(str_match(initial_data, '(\\[.*\\])')[1, 2])))
total_rows <- str_match(initial_data, '"totalCount\":(\\d+),')[1, 2] %>% as.integer()
num_pages <- ceiling(total_rows / 50)

# Fetch and clean the remaining pages, then bind everything together,
# filling columns that are missing on some pages
dfs <- map(.x = 2:num_pages, .f = ~ clean_df(get_data(.)))
r <- rbindlist(c(list(df), dfs), use.names = TRUE, fill = TRUE)

write_csv(r, 'data.csv')
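As a sketch of the column-mask idea mentioned above (the names in overview_cols are assumptions for illustration; inspect names(r) on a live run to find the real ones):

# Hypothetical column names -- check names(r) before relying on these
overview_cols <- c('viewData.name', 'viewData.symbol', 'last', 'pair_change_percent', 'eq_market_cap')
overview <- r[, intersect(overview_cols, names(r)), with = FALSE]  # data.table column subset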