Maintain user-defined metadata with custom functions for tm_map
Posted: 2014-01-31 14:56:00
Question: I have a function that translates tokens according to a key/value dictionary.
dictionary <- c("casa", "barco", "carro", "arbol")
names(dictionary) <- c("home", "boat", "car", "tree")
translate2 <- function(text, dictionary) {
  text_out <- character(0)
  for (i in seq_along(text)) {
    # Split on whitespace and look each token up by name in the dictionary
    text.split <- strsplit(text[i], "\\s")
    translation <- dictionary[unlist(text.split)]
    text_out <- append(text_out, paste(translation, sep="", collapse=" "))
  }
  # Rebuild the document, carrying over the built-in id and author metadata
  PlainTextDocument(text_out, id = ID(text), author = Author(text))
}
This function works correctly for the Author metadata:
library(tm)
text <- "My car is on the tree next to my home under the boat"
corpus <- Corpus(VectorSource(text))
meta(corpus, "Author", type="local") <- "Kant"
meta(corpus, "TextID", type="local") <- "121212"
meta(corpus[[1]], "Author")
# [1] "Kant"
corpus <- tm_map(corpus, translate2, dictionary)
meta(corpus[[1]], "Author")
# [1] "Kant"
corpus[[1]]
# NA carro NA NA NA arbol NA NA NA casa NA NA barco
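The NA entries in the output come from indexing the named vector with words it does not contain. The following is a minimal base-R sketch of that lookup, independent of tm. The fallback step that keeps untranslated words is a hypothetical variant and not part of the original function.

```r
# Named character vector: names are the English keys, values the Spanish words
dictionary <- c(home = "casa", boat = "barco", car = "carro", tree = "arbol")

words <- strsplit("My car is on the tree", "\\s")[[1]]

# Indexing by name returns NA for any key the dictionary does not contain
translation <- unname(dictionary[words])

# Hypothetical variant: fall back to the original word when there is no match
kept <- ifelse(is.na(translation), words, translation)
result <- paste(kept, collapse = " ")
print(result)
```

Whether to keep or drop unmatched words depends on the use case; the original function keeps the NA values as the literal string "NA".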
But when I try to pass user-defined metadata, for example TextID, using a slightly modified version of the function:
translate1 <- function(text, dictionary) {
  text_out <- character(0)
  for (i in seq_along(text)) {
    text.split <- strsplit(text[i], "\\s")
    translation <- dictionary[unlist(text.split)]
    text_out <- append(text_out, paste(translation, sep="", collapse=" "))
  }
  # Attempt to pass the user-defined TextID tag along with id and author
  PlainTextDocument(text_out, id = ID(text), author = Author(text),
                    TextID = TextID(text))
}
I get:
text <- "My car is on the tree next to my home under the boat"
corpus <- Corpus(VectorSource(text))
meta(corpus, "Author", type="local") <- "Kant"
meta(corpus, "TextID", type="local") <- "121212"
meta(corpus[[1]], "Author")
# [1] "Kant"
meta(corpus[[1]], "TextID")
# [1] "121212"
corpus <- tm_map(corpus, translate1, dictionary)
# Error in PlainTextDocument(text_out, id = ID(text), author = Author(text), :
# unused argument (TextID = TextID(text))
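Answer 1: There are a few problems with your approach:
- PlainTextDocument has no argument named TextID (this is what caused your error).
- There is no function named TextID.
According to ?PlainTextDocument, the argument you are looking for appears to be called localmetadata.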
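The "unused argument" error is raised by R's argument matching, not by anything specific to tm. The sketch below illustrates the mechanism with make_doc, a hypothetical stand-in that only mimics a constructor with a fixed argument list; it is not tm's PlainTextDocument.

```r
# Hypothetical stand-in for a constructor with a fixed argument list;
# it is NOT tm's PlainTextDocument, only an illustration of the mechanism
make_doc <- function(x, id = NULL, author = NULL, localmetadata = list()) {
  list(content = x, id = id, author = author, localmetadata = localmetadata)
}

# An argument name the function does not declare is rejected by R itself
msg <- tryCatch(
  make_doc("texto", id = "1", TextID = "121212"),
  error = function(e) conditionMessage(e)
)
print(msg)

# Custom tags travel inside the declared list argument instead
doc <- make_doc("texto", id = "1", localmetadata = list(TextID = "121212"))
print(doc$localmetadata$TextID)
```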
Here is a version of translate1 that seems to work as expected:
translate1 <- function(text, dictionary) {
  text_out <- character(0)
  for (i in seq_along(text)) {
    text.split <- strsplit(text[i], "\\s")
    translation <- dictionary[unlist(text.split)]
    text_out <- append(text_out, paste(translation, sep="", collapse=" "))
  }
  # User-defined tags go into localmetadata as a named list
  PlainTextDocument(text_out, id = ID(text), author = Author(text),
                    localmetadata = list(TextID = meta(text, "TextID")))
}
text <- "My car is on the tree next to my home under the boat"
corpus <- Corpus(VectorSource(text))
meta(corpus, "Author", type="local") <- "Kant"
meta(corpus, "TextID", type="local") <- "121212"
meta(corpus[[1]], "Author")
# [1] "Kant"
meta(corpus[[1]], "TextID")
# [1] "121212"
corpus <- tm_map(corpus, translate1, dictionary)
meta(corpus[[1]], "Author")
# [1] "Kant"
meta(corpus[[1]], "TextID")
# [1] "121212"
corpus[[1]]
# NA carro NA NA NA arbol NA NA NA casa NA NA barco
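The code above relies on the tm API as it was when the question was posted (ID(), Author(), and the localmetadata argument). As a hedged alternative, the sketch below assumes tm 0.6 or later, where content_transformer() wraps a plain character function so that only the document content is replaced and the metadata is left in place. The name translate_text is hypothetical, and the behaviour should be checked against the installed tm version.

```r
library(tm)  # assumption: tm >= 0.6, which provides content_transformer()

dictionary <- c(home = "casa", boat = "barco", car = "carro", tree = "arbol")

# Hypothetical helper: character in, character out; no document is rebuilt here
translate_text <- function(text, dictionary) {
  vapply(strsplit(text, "\\s+"), function(words) {
    paste(dictionary[words], collapse = " ")
  }, character(1))
}

corpus <- VCorpus(VectorSource("My car is on the tree next to my home under the boat"))
meta(corpus[[1]], "author") <- "Kant"
meta(corpus[[1]], "TextID") <- "121212"

# content_transformer() swaps the content and keeps the document's metadata
corpus <- tm_map(corpus, content_transformer(translate_text), dictionary)
print(meta(corpus[[1]], "TextID"))
print(content(corpus[[1]]))
```

With this design the translation function never touches metadata, so user-defined tags do not have to be copied by hand.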