How can I obtain "Categories" from Wikipedia in the following way using rvest in R?
Posted: 2018-02-15 16:40:20
Question: I want to get the categories (listed at the very bottom of the page) using rvest in R. I used SelectorGadget to identify the HTML node to use for extracting the categories. The code I used is as follows:
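library(rvest)

# Download and parse the article
thepage <- read_html("https://en.wikipedia.org/wiki/San_Diego")
# Select the category container and extract its text
Categories <- thepage %>%
  html_nodes("#mw-normal-catlinks") %>%
  html_text()
Categories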
The result is as follows:
"Categories: San Diego1769 establishments in California1850 establishments in CaliforniaCities in San Diego County, CaliforniaCounty seats in CaliforniaIncorporated cities and towns in CaliforniaPopulated coastal places in CaliforniaPopulated places established in 1769San Antonio-San Diego Mail LineSan Diego County, CaliforniaSan Diego metropolitan areaSpanish mission settlements in North AmericaSpecial economic zones of the United StatesStagecoach stops in the United States"
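As you can see, there is no separator between the categories. The first category is "San Diego", the second is "1769 establishments in California", and so on. How can I get these categories as a list, or separated in some other way?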
Answer 1: Each category is a list item, so you need to select the items inside the list rather than the container:
thepage %>%
  html_nodes(".mw-normal-catlinks ul li") %>%
  html_text()
[1] "San Diego" "1769 establishments in California"
[3] "1850 establishments in California" "Cities in San Diego County, California"
[5] "County seats in California" "Incorporated cities and towns in California"
[7] "Populated coastal places in California" "Populated places established in 1769"
[9] "San Antonio-San Diego Mail Line" "San Diego County, California"
[11] "San Diego metropolitan area" "Spanish mission settlements in North America"
[13] "Special economic zones of the United States" "Stagecoach stops in the United States"
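The difference between the two selectors can be reproduced offline. The following is a minimal sketch, assuming a hand-written HTML fragment that only imitates the nesting implied by the selector above (a container holding a `ul` of `li` links); the fragment, the two sample categories, and the variable names are illustrative and were not fetched from Wikipedia. It also shows one possible way to collect the link target of each category with `html_attr()`.

```r
library(rvest)

# Hypothetical fragment imitating the category box; an assumption for
# illustration, not the page's actual markup.
html <- '
<div id="mw-normal-catlinks" class="mw-normal-catlinks">
  <a href="/wiki/Help:Category">Categories</a>:
  <ul>
    <li><a href="/wiki/Category:San_Diego">San Diego</a></li>
    <li><a href="/wiki/Category:County_seats_in_California">County seats in California</a></li>
  </ul>
</div>'

# read_html() accepts literal HTML as well as a URL
page <- read_html(html)

# Selecting the container matches ONE node, so html_text() returns a
# single string with the text of every descendant run together.
container_text <- page %>%
  html_nodes("#mw-normal-catlinks") %>%
  html_text()

# Selecting each <li> matches one node per category, so html_text()
# returns a character vector with one element per category.
items <- page %>% html_nodes("#mw-normal-catlinks ul li")
categories <- html_text(items)

# Link target of each category, taken from the <a> inside each item
links <- items %>%
  html_nodes("a") %>%
  html_attr("href")

length(container_text)
categories
links
```

Because `html_text()` returns one string per matched node, the fix is in the selector, not in any post-processing of the fused string.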