How to access Wikipedia from R?


【Title】: How to access Wikipedia from R? 【Posted】: 2011-08-31 01:22:43 【Question】:

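Is there an R package that can query Wikipedia (most likely through the MediaWiki API) to get a list of available articles relevant to a query, and import selected articles for text mining?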

【Comments】:

You may find the following useful: ragtag.info/2011/feb/10/processing-every-wikipedia-article

【Answer 1】:

The wikifacts package (on CRAN) is a new and very promising option:

library(wikifacts)
wiki_define('R (programming language)')
## R (programming language) 
## "R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls, data mining surveys, and studies of scholarly literature databases show substantial increases in popularity; as of April 2021, R ranks 16th in the TIOBE index, a measure of popularity of programming languages.The official R software environment is a GNU package.\nIt is written primarily in C, Fortran, and R itself (thus, it is partially self-hosting) and is freely available under the GNU General Public License. Pre-compiled executables are provided for various operating systems."
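Since the question mentions text mining, here is a minimal sketch of a first step on text like the output above: counting word frequencies with base R only. The `definition` string is a shortened excerpt of that output, and `word_counts` is a hypothetical helper name, not part of wikifacts.

```r
# Assumption: `definition` holds text such as the string returned by
# wiki_define() above (shortened here to the first sentence).
definition <- paste(
  "R is a programming language and free software environment for",
  "statistical computing and graphics supported by the R Foundation",
  "for Statistical Computing."
)

# Hypothetical helper: lower-case the text, turn every run of non-letters
# into a single space, split on spaces, and tabulate the words.
word_counts <- function(text) {
  cleaned <- gsub("[^a-z]+", " ", tolower(text))
  words <- strsplit(trimws(cleaned), " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  sort(table(words), decreasing = TRUE)
}

counts <- word_counts(definition)
head(counts, 5)
```

For anything beyond a quick look, a dedicated text-mining package would likely handle stop words and stemming better than this.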

【Comments】:

【Answer 2】:

WikipediR, 'A MediaWiki API wrapper in R'

library(devtools)
install_github("Ironholds/WikipediR")
library(WikipediR)

It includes the following functions:

ls("package:WikipediR")
 [1] "wiki_catpages"      "wiki_con"           "wiki_diff"          "wiki_page"         
 [5] "wiki_pagecats"      "wiki_recentchanges" "wiki_revision"      "wiki_timestamp"    
 [9] "wiki_usercontribs"  "wiki_userinfo"  

Here it is in use, fetching contribution details and user details for a handful of users:

library(RCurl)
library(XML)

# scrape page to get usernames of users with highest numbers of edits
top_editors_page <- "http://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits"
top_editors_table <- readHTMLTable(top_editors_page)
very_top_editors <- as.character(top_editors_table[[3]][1:5,]$User)

# setup connection to wikimedia project 
con <- wiki_con("en", project = c("wikipedia"))

# connect to API and get last 50 edits per user
user_data <- lapply(very_top_editors,  function(i) wiki_usercontribs(con, i) )
# and get information about the users (registration date, gender, editcount, etc)
user_info <- lapply(very_top_editors,  function(i) wiki_userinfo(con, i) )

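【Comments】:

【Answer 3】:

Use the RCurl package to retrieve the information, and the XML or RJSONIO package to parse the response.

If you are behind a proxy, set your options.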

opts <- list(
  proxy = "136.233.91.120", 
  proxyusername = "mydomain\\myusername", 
  proxypassword = 'whatever', 
  proxyport = 8080
)

Use the getForm function to access the API.

library(RCurl)    # provides getForm
library(RJSONIO)  # provides fromJSON, used below

search_example <- getForm(
  "http://en.wikipedia.org/w/api.php", 
  action  = "opensearch", 
  search  = "Te", 
  format  = "json",
  .opts   = opts
)
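When a particular search string misbehaves, it can help to look at the URL the request corresponds to. The sketch below builds an equivalent query URL by hand with base R, so the percent-encoding of the search term is visible. `build_api_url` is a hypothetical helper, not part of RCurl, and the `https` base URL is an assumption that differs from the `http` one used above.

```r
# Hypothetical helper: assemble a MediaWiki API URL from a named list of
# parameters, percent-encoding every value (reserved characters included).
build_api_url <- function(params, base = "https://en.wikipedia.org/w/api.php") {
  encoded <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste0(base, "?", paste(names(params), encoded, sep = "=", collapse = "&"))
}

# Same parameters as the getForm call above.
url <- build_api_url(list(action = "opensearch", search = "Te", format = "json"))

# A search term with a space and an ampersand, to show the encoding.
url_special <- build_api_url(list(action = "opensearch", search = "R & D", format = "json"))

print(url)
print(url_special)
```

Pasting the printed URL into a browser is a quick way to check whether a problem lies with the search term or with the network.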

Parse the results.

fromJSON(rawToChar(search_example))
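An opensearch response is a four-element array: the search term, the matching titles, their descriptions, and their URLs. As a rough sketch, the parsed result can be turned into a data frame of candidate articles. The `parsed` object below is a made-up stand-in for the output of `fromJSON`, with illustrative values, and `opensearch_to_df` is a hypothetical helper name.

```r
# Assumption: `parsed` mimics what fromJSON() returns for an opensearch
# request; the values are illustrative, not a recorded API response.
parsed <- list(
  "Te",
  c("Te", "Tea", "Tennis"),
  c("", "", ""),
  c("https://en.wikipedia.org/wiki/Te",
    "https://en.wikipedia.org/wiki/Tea",
    "https://en.wikipedia.org/wiki/Tennis")
)

# Hypothetical helper: pair each title (element 2) with its URL (element 4).
opensearch_to_df <- function(parsed) {
  titles <- as.character(unlist(parsed[[2]]))
  urls <- as.character(unlist(parsed[[4]]))
  data.frame(title = titles, url = urls, stringsAsFactors = FALSE)
}

articles <- opensearch_to_df(parsed)
print(articles)
```

The `title` column can then be used to choose which articles to download for text mining.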

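【Comments】:

I am having trouble using this with some search terms, but I suspect that is a problem with the network I am on. I need volunteers to check the example code with different strings in the search parameter.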
