Extract text from search result URLs using R

Posted 2023-02-19


[Title]: Extract text from search result URLs using R [Posted]: 2018-02-05 03:10:52 [Question]:

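I know a bit of R, but I am not an expert. I am using R for a text-mining project.

I searched the Federal Reserve website for the keyword "inflation". The second page of the search results has this URL: (https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation).

That page has 10 search results (10 URLs). I want to write R code that "reads" the page behind each of those 10 URLs and extracts the text of each page into a .txt file. My only input is the URL mentioned above.

Thanks for your help. If there are similar older posts, please point me to them as well. Thank you.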

[Comments on the question]:

[Answer 1]:

Here you go. For the main search page you can use a regular expression, because the URLs are easy to identify in the page source.

(With help from https://statistics.berkeley.edu/computing/r-reading-webpages)

library('RCurl')
library('stringr')
library('XML')

# Read the search results page (the URL must sit on a single line)
pageToRead <- readLines('https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation')
urlPattern <- 'URL: <a href="(.+)">'
urlLines <- grep(urlPattern, pageToRead, value=TRUE)

# Cut the matched substring out of each line, then keep only the captured URL
getexpr <- function(s,g)substring(s, g, g + attr(g, 'match.length') - 1)
gg <- gregexpr(urlPattern, urlLines)
matches <- mapply(getexpr, urlLines, gg)
result = gsub(urlPattern,'\\1', matches)
names(result) = NULL

for (i in seq_along(result)) {
  subURL <- result[i]

  # Only .htm pages are handled here
  if (str_sub(subURL, -4, -1) == ".htm") {
    content <- readLines(subURL)
    doc <- htmlParse(content, asText=TRUE)
    # Keep text nodes that are not inside script, style, noscript or form elements
    doc <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
    writeLines(doc, paste("inflationText_", i, ".txt", sep=""))
  }
}
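The URL extraction step can be tried offline before hitting the real site. The following is a minimal sketch in base R, assuming the page source contains lines of the form `URL: <a href="...">`; the sample lines and the `example.org` addresses are made up for illustration. It uses `[^"]+` instead of `.+` so the capture stops at the first closing quote.

```r
# Made-up lines imitating the search page source (hypothetical sample data)
sample_lines <- c(
  '<p>URL: <a href="https://example.org/a.htm">first result</a></p>',
  '<p>no link on this line</p>',
  '<p>URL: <a href="https://example.org/b.pdf">second result</a></p>'
)

# The capture group matches everything up to the first closing quote
url_pattern <- 'URL: <a href="([^"]+)">'

# regexec() locates the full match and its capture groups;
# regmatches() extracts them as one character vector per input line
hits <- regmatches(sample_lines, regexec(url_pattern, sample_lines))

# Drop lines without a match and keep element 2 (the captured URL)
urls <- vapply(Filter(length, hits), function(m) m[2], character(1))
print(urls)
```

Working with `regexec()` and `regmatches()` returns the capture group directly, so the separate `gsub()` pass is not needed in this variant.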

As you may have noticed, the code above only parses .htm pages. For the .pdf documents linked in the search results, I suggest you take a look at: http://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
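For the PDF links, one possible approach is sketched below. It assumes the `pdftools` package is installed, which is my choice for illustration and not necessarily what the linked article uses. The helper names `is_pdf_url` and `save_pdf_text` and the example URLs are hypothetical.

```r
# TRUE for URLs that end in .pdf (case-insensitive)
is_pdf_url <- function(u) grepl("\\.pdf$", u, ignore.case = TRUE)

# Download one PDF and write its text to a .txt file.
# Needs the pdftools package and network access when called.
save_pdf_text <- function(u, out_file) {
  if (!requireNamespace("pdftools", quietly = TRUE)) {
    stop("the pdftools package is required for this sketch")
  }
  tmp <- tempfile(fileext = ".pdf")
  download.file(u, tmp, mode = "wb", quiet = TRUE)
  pages <- pdftools::pdf_text(tmp)  # one character string per page
  writeLines(pages, out_file)
  invisible(out_file)
}

# Hypothetical URLs, used only to show the filter
example_urls <- c("https://example.org/a.htm", "https://example.org/b.PDF")
pdf_flags <- is_pdf_url(example_urls)
print(pdf_flags)
```

Inside a loop over the result URLs, a call such as `if (is_pdf_url(subURL)) save_pdf_text(subURL, paste0("inflationText_", i, ".txt"))` would cover the links that the .htm branch skips.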

[Comments]:

Thank you very much, Vincent. This was very useful and helped me a lot!

[Answer 2]:

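Here is the basic idea of how to scrape this page, although it may be slow if there are many pages to scrape. Your question is a bit ambiguous: you want the end result to be .txt files, but what about the web pages that are PDFs? You can still use this code and change the file extension to .pdf for the pages that are PDFs.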

library(xml2)
library(rvest)

urll <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

# Collect the result links, read each page, and write its <body> to a temp file
urll %>% read_html() %>% html_nodes("div#results a") %>% html_attr("href") %>%
  .[!duplicated(.)] %>% lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
  Map(function(x, y) write_html(x, tempfile(y, fileext = ".txt"), options = "format"), .,
      paste("tmp", seq_along(.)))

Here is a breakdown of the code above. The URL you want to scrape:

 urll="https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

Get all the URLs you need:

  allurls <- urll%>%read_html()%>%html_nodes("div#results a")%>%html_attr("href")%>%.[!duplicated(.)]
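To see what the selector step does without touching the network, here is a small sketch that runs the same `html_nodes()` and `html_attr()` calls on an inline HTML string. The markup and the `example.org` links are made up; only the `div#results a` selector comes from the answer.

```r
library(rvest)

# Hypothetical stand-in for the search results page
page <- read_html('<html><body><div id="results">
  <a href="https://example.org/a.htm">A</a>
  <a href="https://example.org/a.htm">A again</a>
  <a href="https://example.org/b.htm">B</a>
</div><a href="https://example.org/other.htm">outside</a></body></html>')

# Take the href of every <a> inside <div id="results">, then drop duplicates
links <- page %>% html_nodes("div#results a") %>% html_attr("href")
links <- unique(links)
print(links)
```

`unique(links)` gives the same result as `links[!duplicated(links)]`; the link outside the results div is ignored because the selector does not match it.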

Where do you want to save your text? Create temporary files:

 tmps <- tempfile(c(paste("tmp",1:length(allurls))),fileext=".txt")
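A short base R sketch of how `tempfile()` behaves with a vector of name prefixes may help here. The count `n`, the folder name `inflation_txt`, and the `page_%02d.txt` naming are assumptions for illustration. Note that `paste("tmp", 1:3)` puts a space into each prefix, while `paste0()` does not.

```r
n <- 3  # stand-in for length(allurls)

# tempfile() is vectorised over the pattern: one path per prefix,
# each with a random suffix, all located in tempdir()
tmps <- tempfile(pattern = paste0("tmp", seq_len(n), "_"), fileext = ".txt")
print(basename(tmps))

# Alternative: predictable names in a folder of your choosing
# (placed under tempdir() here only so the sketch leaves no files behind)
out_dir <- file.path(tempdir(), "inflation_txt")
dir.create(out_dir, showWarnings = FALSE)
out_files <- file.path(out_dir, sprintf("page_%02d.txt", seq_len(n)))
print(basename(out_files))
```

Either vector can be passed as the second list to `Map()` in place of `tmps`.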

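As it stands, your allurls is of class character. You have to convert each URL to an xml document in order to scrape it. Then, finally, write the results into the tmp files created above: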

  allurls%>%lapply(function(x) read_html(x)%>%html_nodes("body"))%>%  
         Map(function(x,y) write_html(x,y,options="format"),.,tmps)

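Please do not leave anything out. For example, after ..."format"), there is a period (the `.` placeholder that stands for the piped-in value). Take that into account. Your files are now written to tempdir. To find out where they are, just type the command tempdir() in the console and it will give you the location of the files. You can also change the location of the scraped files in the tempfile command.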
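Since the question asks for plain text, and `write_html()` saves HTML markup, one possible variation is to pull out the text nodes and save them with `writeLines()`. The sketch below runs on a made-up inline page; the file name `page_text_example.txt` is hypothetical, and the XPath filter is borrowed from the first answer in shortened form.

```r
library(rvest)

# Hypothetical page content standing in for one search result
page <- read_html("<html><body><h1>Inflation</h1><p>Prices rose.</p><script>var x = 1;</script></body></html>")

# Select text nodes that are not inside script or style elements
nodes <- html_nodes(page, xpath = "//body//text()[not(ancestor::script)][not(ancestor::style)]")

# Convert to character, trim whitespace, and drop empty strings
txt <- trimws(html_text(nodes))
txt <- txt[nzchar(txt)]

# Write one line per text node and read it back to check the result
out_file <- file.path(tempdir(), "page_text_example.txt")
writeLines(txt, out_file)
print(readLines(out_file))
```

Running `list.files(tempdir(), pattern = "\\.txt$")` afterwards shows the files written so far.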

Hope this helps.

[Comments]:
