r Web scraping: Unable to read the main table

Posted: 2020-04-22 13:04:43

【Question】:

I am new to web scraping. I am trying to scrape a table with the code below, but I cannot get it. The data source is:

https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1


library(XML)  # provides htmlTreeParse(), getNodeSet(), readHTMLTable()

url <- "https://www.investing.com/stock-screener/?sp=country::6|sector::a|industry::a|equityType::a|exchange::a%3Ceq_market_cap;1"
webpage <- readLines(url)                  # download the static HTML source
html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
tableNodes <- getNodeSet(html, "//table")  # every <table> node in that HTML
Tab <- readHTMLTable(tableNodes[[1]])

I copied this approach from Web scraping of key stats in Yahoo! Finance with R, where it is applied to Yahoo Finance data.

As I see it, the main table should be table 12, i.e. readHTMLTable(tableNodes[[12]]). But whenever I try tableNodes[[12]], it gives me an error:

Error in do.call(data.frame, c(x, alis)) : 
  variable names are limited to 10000 bytes
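
For reference, here is a minimal diagnostic (a sketch reusing tableNodes from the code above) to see what tables the static HTML actually contains:

length(tableNodes)                                  # how many <table> nodes the static HTML contains
sapply(tableNodes, function(t) nchar(saveXML(t)))   # raw size of each node; oversized script-heavy nodes can trip readHTMLTable()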

Please suggest how I can extract this table and combine it with the data from the other tabs (Fundamental, Technical, and Performance).

【Comments】:

【Answer 1】:

This data is returned dynamically as JSON. In R (unlike with Python's requests), the response you get is HTML, from which you can extract the JSON for a given page. One page includes the information for all tabs plus 50 records. The first page gives you the total record count, so you can calculate the total number of pages to loop over for all results, combining them into a final dataframe during the loop. You change the pn parameter of the XHR POST body to the appropriate page number to get the desired results with each new POST request. Two headers are required.

It is probably a good idea to write a function that accepts a page number in its signature and returns the given page's JSON as a dataframe, then apply it via a tidyverse package to handle the loop and combine the results into the final dataframe.

library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)

headers = c(
  'User-Agent' = 'Mozilla/5.0',
  'X-Requested-With' = 'XMLHttpRequest'
)

data = list(
  'country[]' = '6',
  'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
  'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
  'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
  'exchange[]' = '109',
  'exchange[]' = '127',
  'exchange[]' = '51',
  'exchange[]' = '108',
  'pn' = '1', # this is page number and should be altered in a loop over all pages. 50 results per page i.e. rows
  'order[col]' = 'eq_market_cap',
  'order[dir]' = 'd'
)

r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers = headers), body = data)
s <- r %>% read_html() %>% html_node('p') %>% html_text()  # JSON payload comes back wrapped in HTML
page1_data <- jsonlite::fromJSON(str_match(s, '(\\[.*\\])')[1, 2])
total_rows <- str_match(s, '"totalCount\":(\\d+),')[1, 2] %>% as.integer()
num_pages <- ceiling(total_rows / 50)  # 50 records per page
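
As a quick sanity check on the first page (a sketch; viewData is the nested data.frame column that the combining code below has to unpack):

dim(page1_data)           # 50 rows, one per instrument on the page
str(page1_data$viewData)  # nested data.frame of display fields
total_rows                # total matching records reported by the endpoint
num_pages                 # pages needed at 50 records per page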

My current attempt at combining the pages, on which I welcome feedback. This keeps all returned columns across all pages, so I had to handle missing columns, differing column order, and one column (viewData) that is itself a data.frame. Since far more columns are returned than are visible on the page, you could simply subset the returned columns with a mask for only those present in the tabs.

library(httr)
library(jsonlite)
library(magrittr)
library(rvest)
library(stringr)
library(tidyverse)
library(data.table)

headers = c(
  'User-Agent' = 'Mozilla/5.0',
  'X-Requested-With' = 'XMLHttpRequest'
)

data = list(
  'country[]' = '6',
  'sector' = '7,5,12,3,8,9,1,6,2,4,10,11',
  'industry' = '81,56,59,41,68,67,88,51,72,47,12,8,50,2,71,9,69,45,46,13,94,102,95,58,100,101,87,31,6,38,79,30,77,28,5,60,18,26,44,35,53,48,49,55,78,7,86,10,1,34,3,11,62,16,24,20,54,33,83,29,76,37,90,85,82,22,14,17,19,43,89,96,57,84,93,27,74,97,4,73,36,42,98,65,70,40,99,39,92,75,66,63,21,25,64,61,32,91,52,23,15,80',
  'equityType' = 'ORD,DRC,Preferred,Unit,ClosedEnd,REIT,ELKS,OpenEnd,Right,ParticipationShare,CapitalSecurity,PerpetualCapitalSecurity,GuaranteeCertificate,IGC,Warrant,SeniorNote,Debenture,ETF,ADR,ETC,ETN',
  'exchange[]' = '109',
  'exchange[]' = '127',
  'exchange[]' = '51',
  'exchange[]' = '108',
  'pn' = '1', # this is page number and should be altered in a loop over all pages. 50 results per page i.e. rows
  'order[col]' = 'eq_market_cap',
  'order[dir]' = 'd'
)

get_data <- function(page_number) {
  data['pn'] = page_number  # request the given page; 50 rows per page
  r <- httr::POST(url = 'https://www.investing.com/stock-screener/Service/SearchStocks', httr::add_headers(.headers = headers), body = data)
  s <- r %>% read_html() %>% html_node('p') %>% html_text()
  if (page_number == 1) {
    return(s)  # page 1: return the raw string so totalCount can also be extracted
  } else {
    return(data.frame(jsonlite::fromJSON(str_match(s, '(\\[.*\\])')[1, 2])))
  }
}



clean_df <- function(df) {
  interim <- df['viewData']                      # pull out the nested data.frame column
  df_minus <- subset(df, select = -c(viewData))  # everything else
  df_clean <- cbind.data.frame(c(interim, df_minus))
  return(df_clean)
}


initial_data <- get_data(1)
df <- clean_df(data.frame(jsonlite::fromJSON(str_match(initial_data, '(\\[.*\\])' )[1,2])))
total_rows <- str_match(initial_data, '"totalCount\":(\\d+),' )[1,2] %>% as.integer()
num_pages <- ceiling(total_rows/50)

# Pages 2..num_pages come back as dataframes directly; clean each one
dfs <- map(.x = 2:num_pages, .f = ~ clean_df(get_data(.x)))

# Bind all pages, tolerating missing columns and differing column order
r <- rbindlist(c(list(df), dfs), use.names = TRUE, fill = TRUE)
write_csv(r, 'data.csv')
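
Following the masking idea above, you could then keep only the columns that are visible in the tabs. The column names below are hypothetical placeholders, not verified against the endpoint's output; inspect names(r) first:

# Sketch only: 'name' and 'last' are assumed column names; eq_market_cap is
# taken from the POST body's order[col] parameter. Adjust after checking names(r).
visible_cols <- intersect(c('name', 'last', 'eq_market_cap'), names(r))
r_visible <- r[, ..visible_cols]  # data.table column subset (r is a data.table from rbindlist)
write_csv(r_visible, 'data_visible.csv')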

【Comments】:
