Is there a way to fix HTTP error 403 when webscraping ESPN's NBA data?
Posted 2021-12-15 04:33:00

I am trying to scrape data from this website, but for each NBA team.
However, when I run the following code, I keep getting HTTP error 403; specifically:

Error in open.connection(x, "rb") : HTTP error 403.

I don't know how to fix this, because I have seen other projects scrape this same exact website with the same exact code and no problems.
library(rvest)
library(lubridate)
library(tidyverse)
library(stringr)
library(zoo)
library(h2o)
teams<-c("tor", "mil", "den", "gs", "ind", "phi", "okc", "por", "bos", "hou", "lac", "sa",
"lal", "utah", "mia", "sac", "min", "bkn", "dal", "no", "cha", "mem", "det", "orl",
"wsh", "atl", "phx", "ny", "chi", "cle")
teams_fullname<-c("Toronto", "Milwaukee", "Denver", "Golden State", "Indiana", "Philadelphia", "Oklahoma City","Portland",
"Boston", "Houston", "LA", "San Antonio", "Los Angeles", "Utah", "Miami", "Sacramento", "Minnesota", "***lyn",
"Dallas", "New Orleans", "Charlotte", "Memphis", "Detroit", "Orlando", "Washington", "Atlanta", "Phoenix",
"New York", "Chicago", "Cleveland")
by_team <- NULL
for (i in 1:length(teams)) {
  url <- paste0("http://www.espn.com/nba/team/schedule/_/name/", teams[i])
  #print(url)
  webpage <- read_html(url)
  team_table <- html_nodes(webpage, 'table')
  team_c <- html_table(team_table, fill = TRUE, header = TRUE)[[1]]
  # keep only completed games (everything above the first "TIME" row)
  team_c <- team_c[1:(which(team_c$RESULT == "TIME") - 1), ]
  team_c$URLTeam <- toupper(teams[i])
  team_c$FullURLTeam <- teams_fullname[i]
  by_team <- rbind(by_team, team_c)
}
# remove the postponed games
by_team <- by_team %>% filter(RESULT != 'Postponed')
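To rule out something in the loop itself, a bare request shows what the server actually returns. A minimal sanity-check sketch, assuming the httr package:

library(httr)

# Request one schedule page directly and inspect the raw status code.
resp <- GET("http://www.espn.com/nba/team/schedule/_/name/tor")
status_code(resp)  # 403 here means the server itself is refusing the request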
I am just wondering why this is happening and/or how to fix this error. Any help is appreciated.
Answer 1:

Fewer and fewer websites allow a direct rvest::read_html(url). Make the request with httr::GET(url) or httr::RETRY('GET', url) first. (With the native pipe, R >= 4.1:)
webpage <- url |>
  httr::GET() |>
  rvest::read_html()
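Applied to the loop in the question, that might look like the sketch below. The User-Agent header is an extra assumption on my part, since some servers answer R's default agent string with a 403; treat it as one more thing to try rather than a guaranteed fix.

library(httr)
library(rvest)

url <- paste0("http://www.espn.com/nba/team/schedule/_/name/", teams[i])
# RETRY re-issues the request a few times if it fails; user_agent() swaps out
# R's default libcurl identifier, which some sites reject with a 403.
resp <- RETRY("GET", url, user_agent("Mozilla/5.0"), times = 3)
webpage <- read_html(resp)

read_html() accepts the httr response object directly, so the rest of the loop (html_nodes(), html_table()) stays unchanged.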