How to extract data from the different hyperlinks on a web page
I want to extract data from the different hyperlinks on this web page.
I am using the following code to extract the table of hyperlinks:
library(rvest)  # also re-exports the %>% pipe

url <- "https://www.maritime-database.com/company.php?cid=66304"
webpage <- read_html(url)
df <-
  webpage %>%
  html_node("table") %>%
  html_table(fill = TRUE)
With this code I can extract all the hyperlinks in the table, but I do not know how to extract data from those hyperlinks.
For example, from this link I want to extract the data shown in the screenshot: [![data from the link provided in the example][1]][1]
Answer
First, let's load the libraries we need:
library(rvest)
library(tidyverse)
library(stringr)
Then we can open the page in question and extract all of its links:
url <- "https://www.maritime-database.com/company.php?cid=66304"
webpage <- read_html(url)
urls <- webpage %>% html_nodes("a") %>% html_attr("href")
Let's take a look at what we found:
> head(urls, 100)
  [1] "/"                               "/areas/"
  [3] "/countries/"                     "/ports/"
  [5] "/ports/topports.php"             "/addcompany.php"
  [7] "/aboutus.php"                    "/activity.php?aid=28"
  [9] "/activity.php?aid=9"             "/activity.php?aid=16"
 [11] "/activity.php?aid=24"            "/activity.php?aid=27"
 [13] "/activity.php?aid=29"            "/activity.php?aid=25"
 [15] "/activity.php?aid=5"             "/activity.php?aid=11"
 [17] "/activity.php?aid=19"            "/activity.php?aid=17"
 [19] "/activity.php?aid=2"             "/activity.php?aid=31"
 [21] "/activity.php?aid=1"             "/activity.php?aid=13"
 [23] "/activity.php?aid=23"            "/activity.php?aid=18"
 [25] "/activity.php?aid=22"            "/activity.php?aid=12"
 [27] "/activity.php?aid=4"             "/activity.php?aid=26"
 [29] "/activity.php?aid=10"            "/activity.php?aid=14"
 [31] "/activity.php?aid=7"             "/activity.php?aid=30"
 [33] "/activity.php?aid=21"            "/activity.php?aid=20"
 [35] "/activity.php?aid=8"             "/activity.php?aid=6"
 [37] "/activity.php?aid=15"            "/activity.php?aid=3"
 [39] "/africa/"                        "/centralamerica/"
 [41] "/northamerica/"                  "/southamerica/"
 [43] "/asia/"                          "/caribbean/"
 [45] "/europe/"                        "/middleeast/"
 [47] "/oceania/"                       "company-contact.php?cid=66304"
 [49] "http://www.quadrantplastics.com" "/company.php?cid=313402"
 [51] "/company.php?cid=262400"         "/company.php?cid=262912"
 [53] "/company.php?cid=263168"         "/company.php?cid=263424"
 [55] "/company.php?cid=67072"          "/company.php?cid=263680"
 [57] "/company.php?cid=67328"          "/company.php?cid=264192"
 [59] "/company.php?cid=67840"          "/company.php?cid=264448"
 [61] "/company.php?cid=264704"         "/company.php?cid=68352"
 [63] "/company.php?cid=264960"         "/company.php?cid=68608"
 [65] "/company.php?cid=265216"         "/company.php?cid=68864"
 [67] "/company.php?cid=265472"         "/company.php?cid=200192"
 [69] "/company.php?cid=265728"         "/company.php?cid=69376"
 [71] "/company.php?cid=200448"         "/company.php?cid=265984"
 [73] "/company.php?cid=200704"         "/company.php?cid=266240"
After some inspection, we find that we are only interested in the URLs that start with /company.php. Let's find out how many of them there are, and create a placeholder list for our results:
numcompanies <- length(which(!is.na(str_extract(urls, '/company.php'))))
mylist <- vector("list", numcompanies)
We find that there are 40034 company URLs to scrape. This will take a while...
> numcompanies
[1] 40034
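The same filtering can also be done directly with a pattern match, which returns the matching URLs themselves rather than just a count. A minimal sketch on a toy vector of hrefs (these example links are made up for illustration, not taken from the live site):

```r
library(stringr)

# toy sample of what html_attr("href") returns
urls <- c("/areas/", "/company.php?cid=313402",
          "/ports/", "/company.php?cid=262400")

# keep only the company pages; fixed() matches the literal string
company_urls <- urls[str_detect(urls, fixed("/company.php"))]
print(company_urls)
# [1] "/company.php?cid=313402" "/company.php?cid=262400"
```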
Now we simply loop over each matching URL, one by one, and save its text:
i <- 0
for (u in urls) {
  if (!is.na(str_match(u, '/company.php'))) {
    Sys.sleep(1)  # be polite to the server
    i <- i + 1
    companypage <- read_html(paste0('https://www.maritime-database.com', u))
    cat(paste('page nr', i, '; saved text from: ', u, '\n'))
    text <- companypage %>%
      html_nodes('.txt') %>%
      html_text()
    names(mylist)[i] <- u
    mylist[[i]] <- text
  }
}
In the loop above, we exploit the observation that the information we want always has class="txt" (see the screenshot below). Assuming it takes about one second to open each page, scraping all of them will take roughly 11 hours.
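Once the loop finishes, the named list can be flattened into a two-column data frame (URL, text fragment) for further analysis. A minimal sketch, using a made-up stand-in for mylist; the real values come from the loop above:

```r
# stand-in for the scraped results (each element is a character
# vector of text fragments keyed by the source URL)
mylist <- list(
  "/company.php?cid=313402" = c("Company A", "Address line"),
  "/company.php?cid=262400" = "Company B"
)

# one row per text fragment, repeating each URL as often as needed
df <- data.frame(
  url  = rep(names(mylist), lengths(mylist)),
  text = unlist(mylist, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(df)
```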