Web scraping multiple pages in R


Posted: 2022-01-21 05:41:23

Question:

I'm new to R and would appreciate some help. I'm trying to scrape data about dog breeds from a website.

The link to the list of breeds is here: https://dogtime.com/dog-breeds/profiles

Each breed profile's URL uses https://dogtime.com/dog-breeds/ as the base, with the breed name appended (e.g. https://dogtime.com/dog-breeds/golden-retriever).

I've successfully scraped the data for one breed with the code below, but I'd now like to collect the data for all 392 breeds on the site and store the results in a data frame.

library(rvest)
library(dplyr)
library(purrr)

# Create a new variable for the website link
link <- "https://dogtime.com/dog-breeds/golden-retriever"

# Get HTML code from this website
page <- read_html(link)

# Create a vector of URLs to every breed profile (read from the index page,
# not from the single-breed page above)
dog_links <- read_html("https://dogtime.com/dog-breeds/profiles") %>%
  html_nodes(".list-item-title") %>%
  html_attr("href")

# Create variables for each of the attributes
breed <- page %>% html_nodes("h1") %>% html_text()
adaptability = page %>% html_nodes(".title-box+ .paws .parent-characteristic .characteristic-star-block") %>% html_text()
apartment_living = page %>% html_nodes(".title-box+ .paws .parent-characteristic+ .child-characteristic .characteristic-star-block") %>% html_text()
novice_owners = page %>% html_nodes(".title-box+ .paws .child-characteristic:nth-child(3) .characteristic-star-block") %>% html_text()
sensitivity_level = page %>% html_nodes(".title-box+ .paws .child-characteristic:nth-child(4) .characteristic-star-block") %>% html_text()
tolerates_alone = page %>% html_nodes(".title-box+ .paws .child-characteristic:nth-child(5) .characteristic-star-block") %>% html_text()
tolerates_cold = page %>% html_nodes(".title-box+ .paws .child-characteristic:nth-child(6) .characteristic-star-block") %>% html_text()
tolerates_hot = page %>% html_nodes(".title-box+ .paws .child-characteristic:nth-child(7) .characteristic-star-block") %>% html_text()
friendliness = page %>% html_nodes(".paws:nth-child(3) .parent-characteristic .characteristic-star-block") %>% html_text()
affectionate = page %>% html_nodes(".paws:nth-child(3) .parent-characteristic+ .child-characteristic .characteristic-star-block") %>% html_text()
kid_friendly = page %>% html_nodes(".paws:nth-child(3) .child-characteristic:nth-child(3) .characteristic-star-block") %>% html_text()
dog_friendly = page %>% html_nodes(".paws:nth-child(3) .child-characteristic:nth-child(4) .characteristic-star-block") %>% html_text()
stranger_friendly = page %>% html_nodes(".paws:nth-child(3) .child-characteristic:nth-child(5) .characteristic-star-block") %>% html_text()
health_grooming = page %>% html_nodes(".paws:nth-child(4) .parent-characteristic .characteristic-star-block") %>% html_text()
shedding = page %>% html_nodes(".paws:nth-child(4) .parent-characteristic+ .child-characteristic .characteristic-star-block") %>% html_text()
drooling = page %>% html_nodes(".paws:nth-child(4) .child-characteristic:nth-child(3) .characteristic-star-block") %>% html_text()
easy_groom = page %>% html_nodes(".paws:nth-child(4) .child-characteristic:nth-child(4) .characteristic-star-block") %>% html_text()
general_health = page %>% html_nodes(".paws:nth-child(4) .child-characteristic:nth-child(5) .characteristic-star-block") %>% html_text()
weight_gain = page %>% html_nodes(".paws:nth-child(4) .child-characteristic:nth-child(6) .characteristic-star-block") %>% html_text()
size = page %>% html_nodes(".paws:nth-child(4) .child-characteristic:nth-child(7) .characteristic-star-block") %>% html_text()
trainability = page %>% html_nodes("#cf_hagn+ .paws .parent-characteristic .characteristic-star-block") %>% html_text()
easy_train = page %>% html_nodes("#cf_hagn+ .paws .parent-characteristic+ .child-characteristic .characteristic-star-block") %>% html_text()
intelligence = page %>% html_nodes("#cf_hagn+ .paws .child-characteristic:nth-child(3) .characteristic-star-block") %>% html_text()
mouthiness = page %>% html_nodes("#cf_hagn+ .paws .child-characteristic:nth-child(4) .characteristic-star-block") %>% html_text()
prey_drive = page %>% html_nodes("#cf_hagn+ .paws .child-characteristic:nth-child(5) .characteristic-star-block") %>% html_text()
barking = page %>% html_nodes("#cf_hagn+ .paws .child-characteristic:nth-child(6) .characteristic-star-block") %>% html_text()
wanderlust = page %>% html_nodes("#cf_hagn+ .paws .child-characteristic:nth-child(7) .characteristic-star-block") %>% html_text()
physical_needs = page %>% html_nodes("#cf_hagn~ .paws+ .paws .parent-characteristic .characteristic-star-block") %>% html_text()
energy_level = page %>% html_nodes("#cf_hagn~ .paws+ .paws .parent-characteristic+ .child-characteristic .characteristic-star-block") %>% html_text()
intensity = page %>% html_nodes("#cf_hagn~ .paws+ .paws .child-characteristic:nth-child(3) .characteristic-star-block") %>% html_text()
exercise_needs = page %>% html_nodes("#cf_hagn~ .paws+ .paws .child-characteristic:nth-child(4) .characteristic-star-block") %>% html_text()
playfulness = page %>% html_nodes("#cf_hagn~ .paws+ .paws .child-characteristic:nth-child(5) .characteristic-star-block") %>% html_text()
breed_group = page %>% html_nodes(".vital-stat-box:nth-child(1)") %>% html_text()
height = page %>% html_nodes(".vital-stat-box:nth-child(2)") %>% html_text()
weight = page %>% html_nodes(".vital-stat-box:nth-child(3)") %>% html_text()
life_span = page %>% html_nodes(".vital-stat-box:nth-child(4)") %>% html_text() 

# Create a data frame
dogs = data.frame(breed, adaptability, apartment_living, novice_owners, sensitivity_level, tolerates_alone, tolerates_cold, tolerates_hot, friendliness, affectionate, kid_friendly, dog_friendly, stranger_friendly, health_grooming, shedding, drooling, easy_groom, general_health, weight_gain, size, trainability, easy_train, intelligence, mouthiness, prey_drive, barking, wanderlust, physical_needs, energy_level, intensity, exercise_needs, playfulness, breed_group, height, weight, life_span, stringsAsFactors = FALSE)

# view data frame
View(dogs)

Sorry there are so many variables to store in the code. I think I need a for loop to iterate over each breed's URL, but I'm not sure how I would write it, since the 'i' values are characters rather than numbers.

Can anyone tell me whether this is the best approach and, if so, how I would go about it?

Thanks very much for any help,

James

Comments:

Answer 1:

We can use html_attr('href') to get the links to all the dog breeds:

library(rvest)
library(dplyr)

url <- 'https://dogtime.com/dog-breeds/profiles'
url %>% read_html() %>% 
  html_nodes(".list-item-img") %>%
  html_attr('href')

Output:

  [1] "https://dogtime.com/dog-breeds/afador"                            
  [2] "https://dogtime.com/dog-breeds/affenhuahua"                       
  [3] "https://dogtime.com/dog-breeds/affenpinscher"                     
  [4] "https://dogtime.com/dog-breeds/afghan-hound"                      
  [5] "https://dogtime.com/dog-breeds/airedale-terrier" 

You can then loop over these links with your code.
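For example, the links can be fed straight into `purrr::map_dfr()` to build one data frame. This is only a sketch: `scrape_breed()` is a hypothetical wrapper around the per-breed selector code from the question, assumed to take one URL and return a one-row data frame.

```r
library(rvest)
library(purrr)

# All breed profile links from the index page
dog_links <- read_html("https://dogtime.com/dog-breeds/profiles") %>%
  html_nodes(".list-item-img") %>%
  html_attr("href")

# scrape_breed() is a hypothetical function wrapping the question's
# selector code; map_dfr() row-binds its one-row results into one data frame
all_dogs <- map_dfr(dog_links, scrape_breed)
```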

Also, I'd suggest using the class to get the data, since it reduces the big block of code to a small one:

url = "https://dogtime.com/dog-breeds/golden-retriever"
url %>% read_html() %>% 
  html_nodes(".characteristic-title") %>%
  html_text()
 [1] " Adaptability"                   "Adapts Well To Apartment Living" "Good For Novice Owners"         
 [4] "Sensitivity Level"               "Tolerates Being Alone"           "Tolerates Cold Weather"         
 [7] "Tolerates Hot Weather"           " All Around Friendliness"        "Affectionate With Family"       
[10] "Kid-Friendly"                    "Dog Friendly"                    "Friendly Toward Strangers"      
[13] " Health And Grooming Needs"      "Amount Of Shedding"              "Drooling Potential"             
[16] "Easy To Groom"                   "General Health"                  "Potential For Weight Gain"      
[19] "Size"                            " Trainability"                   "Easy To Train"                  
[22] "Intelligence"                    "Potential For Mouthiness"        "Prey Drive"                     
[25] "Tendency To Bark Or Howl"        "Wanderlust Potential"            " Physical Needs"                
[28] "Energy Level"                    "Intensity"                       "Exercise Needs"                 
[31] "Potential For Playfulness"  



url %>% read_html() %>% 
  html_nodes(".characteristic-star-block") %>% html_nodes('.star') %>% 
  html_text()

[1] ""  "2" "3" "5" "1" "3" "3" ""  "5" "5" "5" "5" ""  "5" "4" "2" "2" "5" "3" ""  "5" "5" "5" "3" "3" "2" ""  "5" "2" "5" "5"
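Putting the two class-based selectors together, a single breed page can be reduced to a one-row data frame without 30+ hand-written selectors. A sketch, assuming the titles and star values line up one-to-one on every breed page as they do in the output above (the empty strings correspond to the parent-category rows):

```r
library(rvest)

page <- read_html("https://dogtime.com/dog-breeds/golden-retriever")

# Characteristic names and their star ratings, in matching order
titles <- page %>% html_nodes(".characteristic-title") %>%
  html_text() %>% trimws()
stars <- page %>% html_nodes(".characteristic-star-block") %>%
  html_nodes(".star") %>% html_text()

# One row per breed, one column per characteristic
row <- as.data.frame(t(stars), stringsAsFactors = FALSE)
names(row) <- titles
```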

Discussion:

Answer 2:

If you're happy with your golden retriever code, this will give you a character vector of all the dogs:

link <- "https://dogtime.com/dog-breeds/"
page <- read_html("https://dogtime.com/dog-breeds/profiles")
dogs <- page %>% html_nodes(".list-item") %>% html_text()

Then you can paste them together as follows:

dog_urls <- paste0(link, dogs)

and use your existing code in a loop to retrieve all the dogs:

dog_data <- list()
for (dog_url in dog_urls) {
  dog_data <- append(dog_data, list(scrape_function(dog_url)))
}

Discussion:

Thanks David, this has definitely got me further. I added to your code with a for loop along these lines: `# create a vector of dogs ... %>% html_nodes(".list-item-title") %>% html_text() dog_urls dogtime.com/dog-breeds" , dogs) for (i in dogs) breed = ...` However, the resulting data frame only contains around 20 breeds (each repeated several times), and the scores from the underlying URLs are the same in every row. Do you have any suggestions as to why that is?

I'm not sure; it may be that you didn't change the link as above. With that change I get 392 dogs. Nad Pat above also gave a good suggestion on how to simplify the scraping function using classes.
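One common cause of the symptom described in this comment thread (the same scores repeated in every row) is calling `read_html()` once outside the loop, so every iteration re-parses the same page; the fix is to read each URL inside the loop. A minimal sketch, where the single selector line stands in for the full list from the question:

```r
library(rvest)
library(dplyr)

results <- list()
for (dog_url in dog_urls) {
  page <- read_html(dog_url)  # re-read once per breed, inside the loop
  breed <- page %>% html_nodes("h1") %>% html_text()
  # ... the rest of the question's selector code goes here ...
  results[[dog_url]] <- data.frame(breed, stringsAsFactors = FALSE)
}
dogs <- bind_rows(results)
```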
