How to scrape a file whose address changes in R

Posted 2023-03-12

Asked: 2021-03-14 14:47:32

Question:

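I am interested in this Excel file, whose structure does not change: https://rigcount.bakerhughes.com/static-files/cc0aed5c-b4fc-440d-9522-18680fb2ef6a

I can get it from this page: https://rigcount.bakerhughes.com/na-rig-count

The second URL does not change over time, whereas the first one does.

I assume the file's URL sits somewhere in an element of the fixed page, even when it changes, and that the file name is generated by a repeatable process.

So, is there a way in R to fetch the file automatically (it is updated roughly once a week) without downloading it by hand each time?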

Answer 1:

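Your question leaves out what you have already tried and whether you searched for tutorials online. That said, this is easy to do: read the fixed page, select the links inside the elements whose class marks Excel binary files, and pull out each link's title and href. Look up an rvest tutorial for a fuller explanation.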

library(rvest) # to allow easy scraping
library(magrittr) # to allow %>% pipe commands

page <- read_html("https://rigcount.bakerhughes.com/na-rig-count")

# Find links that match excel type files as defined by the page
links <- page %>%
  html_nodes("span.file--mime-application-vnd-ms-excel-sheet-binary-macroEnabled-12") %>%
  html_nodes("a")

links_df <- data.frame(
  title = links %>% html_attr("title"),
  link = links %>% html_attr("href")
)

links_df
#                                                                 title
# 1              north_america_rotary_rig_count_jan_2000_-_current.xlsb
# 2 north_american_rotary_rig_count_pivot_table_feb_2011_-_current.xlsb
#                                                                                 link
# 1 https://rigcount.bakerhughes.com/static-files/cc0aed5c-b4fc-440d-9522-18680fb2ef6a
# 2 https://rigcount.bakerhughes.com/static-files/c7852ea5-5bf5-4c47-b52c-f025597cdddf
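To go from the link table to an actual download, something along the following lines might work. It is only a sketch: `pick_link()` and `fetch_rig_count()` are hypothetical helper names, and the title pattern and destination file name are assumptions. It also assumes the page keeps the same CSS class and that the href values stay absolute URLs, as in the output above.

```r
# Hypothetical helper: return the first link whose title matches `pattern`.
pick_link <- function(links_df, pattern) {
  hits <- links_df$link[grepl(pattern, links_df$title)]
  if (length(hits) == 0) stop("No link title matches: ", pattern)
  hits[[1]]
}

# Hypothetical helper: scrape the current file URL from the fixed page,
# then download the file. Needs the rvest package and network access.
fetch_rig_count <- function(dest = "rig_count.xlsb",  # assumed file name
                            pattern = "^north_america_rotary_rig_count",
                            page_url = "https://rigcount.bakerhughes.com/na-rig-count") {
  page <- rvest::read_html(page_url)
  spans <- rvest::html_nodes(
    page,
    "span.file--mime-application-vnd-ms-excel-sheet-binary-macroEnabled-12"
  )
  links <- rvest::html_nodes(spans, "a")
  links_df <- data.frame(
    title = rvest::html_attr(links, "title"),
    link = rvest::html_attr(links, "href"),
    stringsAsFactors = FALSE
  )
  # mode = "wb" writes the file as binary, which matters on Windows
  download.file(pick_link(links_df, pattern), destfile = dest, mode = "wb")
  invisible(dest)
}

# Usage (not run here): fetch_rig_count("rig_count.xlsb")
```

Calling `fetch_rig_count()` from a weekly scheduled job (cron, Windows Task Scheduler) would remove the manual download step. Matching on the title rather than on the URL is deliberate, since the URL is the part that changes.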

Comments:

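I got lost in Google Chrome's dev tools trying to find some JSON file, so I was nowhere near anything like this. Thanks.

Looking for a data file in the Network tab is a good place to start, and there may even be one. But this site is very simple, with only one page a day, so finding a working HTML class on the page to identify the links by was easier.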
