Is there a way to fix HTTP error 403 when webscraping ESPN's NBA data?
Posted 2021-12-15 04:33:00

I am trying to scrape data from this website, but for each NBA team.
However, when I run the following code, I keep getting HTTP error 403; specifically:

Error in open.connection(x, "rb") : HTTP error 403.

I don't know how to fix this, because I have seen other projects scrape this same exact website with the same exact code and no problems.
library(rvest)
library(lubridate)
library(tidyverse)
library(stringr)
library(zoo)
library(h2o)
teams<-c("tor", "mil", "den", "gs", "ind", "phi", "okc", "por", "bos", "hou", "lac", "sa",
"lal", "utah", "mia", "sac", "min", "bkn", "dal", "no", "cha", "mem", "det", "orl",
"wsh", "atl", "phx", "ny", "chi", "cle")
teams_fullname<-c("Toronto", "Milwaukee", "Denver", "Golden State", "Indiana", "Philadelphia", "Oklahoma City","Portland",
"Boston", "Houston", "LA", "San Antonio", "Los Angeles", "Utah", "Miami", "Sacramento", "Minnesota", "***lyn",
"Dallas", "New Orleans", "Charlotte", "Memphis", "Detroit", "Orlando", "Washington", "Atlanta", "Phoenix",
"New York", "Chicago", "Cleveland")
by_team <- NULL
for (i in 1:length(teams)) {
  url <- paste0("http://www.espn.com/nba/team/schedule/_/name/", teams[i])
  #print(url)
  webpage <- read_html(url)
  team_table <- html_nodes(webpage, 'table')
  team_c <- html_table(team_table, fill = TRUE, header = TRUE)[[1]]
  # keep only completed games (everything above the first "TIME" row)
  team_c <- team_c[1:(which(team_c$RESULT == "TIME") - 1), ]
  team_c$URLTeam <- toupper(teams[i])
  team_c$FullURLTeam <- teams_fullname[i]
  by_team <- rbind(by_team, team_c)
}
# remove the postponed games
by_team <- by_team %>% filter(RESULT != 'Postponed')
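To rule out something in the loop itself, a bare request shows what the server actually returns. A minimal sanity-check sketch, assuming the httr package:

library(httr)

# Request one schedule page directly and inspect the raw status code.
resp <- GET("http://www.espn.com/nba/team/schedule/_/name/tor")
status_code(resp)  # 403 here means the server itself is refusing the request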
I am just wondering why this is happening and/or how to fix this error. Any help is appreciated.
Answer 1:

Fewer and fewer websites allow a direct rvest::read_html(url). Make the request with httr::GET(url) or httr::RETRY('GET', url) first. (With the native pipe, R >= 4.1:)
webpage <- url |>
  httr::GET() |>
  rvest::read_html()
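Applied to the loop in the question, that might look like the sketch below. The User-Agent header is an extra assumption on my part, since some servers answer R's default agent string with a 403; treat it as one more thing to try rather than a guaranteed fix.

library(httr)
library(rvest)

url <- paste0("http://www.espn.com/nba/team/schedule/_/name/", teams[i])
# RETRY re-issues the request a few times if it fails; user_agent() swaps out
# R's default libcurl identifier, which some sites reject with a 403.
resp <- RETRY("GET", url, user_agent("Mozilla/5.0"), times = 3)
webpage <- read_html(resp)

read_html() accepts the httr response object directly, so the rest of the loop (html_nodes(), html_table()) stays unchanged.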