Generating synthetic data for unsupervised learning
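[Posted]: 2012-08-02 23:12:01
[Question]:
I want to prepare data for unsupervised learning with random forests. The procedure is as follows:
- Take the data and add an attribute "class" with value 1 to all examples.
- Generate synthetic data from the original data. While you have fewer examples than the original data, construct a new example:
  - For each attribute, sample a new value from all values of that attribute in the original data.
  - Do this for all attributes and combine the sampled values into a new example.
- Give the synthetic data the value 2 for the attribute "class".
- Bind the two data sets together.
In the end it looks like this: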
...       | Class
          | 1
Original  | 1
Data      | 1
          | 1
----------+------
          | 2
Synthetic | 2
Data      | 2
          | 2
My R code looks like this:
library(gtools) #for smartbind()
sample1 <- function(X) sample(X, replace=T)
g1 <- function(dat) apply(dat,2,sample1)
data$class <- rep(1, times=nrow(data)) #add attribute 'class' with value 1
synthData<-data.frame(g1(data[,1:ncol(data)])) #generate synthetic data with sampling from data
synthData$class <- rep(2, times=nrow(synthData)) #attribute 'class' is 2
colnames(synthData) <- colnames(data)
newData <- smartbind(data, synthData) #bind the data together
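As you can see, I am really new to R, but it does work. There is just one problem: the attribute types in the synthetic data differ from those in the original data. Columns that were originally num are now factors. How can I keep the same types when generating the synthetic data?
Thanks!
Data1 (numerics become factors):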
structure(list(V2 = c(1.51793, 1.51711, 1.51645, 1.51916, 1.51131), V3 = c(13.21, 12.89, 13.44, 14.15, 13.69), V4 = c(3.48, 3.62, 3.61, 0, 3.2), V5 = c(1.41, 1.57, 1.54, 2.09, 1.81), V6 = c(72.64, 72.96, 72.39, 72.74, 72.81), V7 = c(0.59, 0.61, 0.66, 0, 1.76), V8 = c(8.43, 8.11, 8.03, 10.88, 5.43), V9 = c(0, 0, 0, 0, 1.19), V10 = c(0, 0, 0, 0, 0), realClass = structure(c(1L, 2L, 2L, 5L, 6L), .Label = c("1", "2", "3", "5", "6", "7"), class = "factor")), .Names = c("V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "realClass"), row.names = c(27L, 138L, 77L, 183L, 186L), class = "data.frame")
Data2 (factors become chrs):
structure(list(realClass = structure(c(2L, 2L, 2L, 1L, 2L), .Label = c("e", "p"), class = "factor"), V2 = structure(c(6L, 3L, 4L, 6L, 6L), .Label = c("b", "c", "f", "k", "s", "x"), class = "factor"), V3 = structure(c(4L, 4L, 3L, 1L, 1L), .Label = c("f", "g", "s", "y"), class = "factor"), V4 = structure(c(5L, 5L, 5L, 3L, 4L), .Label = c("b", "c", "e", "g", "n", "p", "r", "u", "w", "y"), class = "factor"), V5 = structure(c(1L, 1L, 1L, 2L, 1L), .Label = c("f", "t"), class = "factor"), V6 = structure(c(3L, 9L, 3L, 6L, 3L), .Label = c("a", "c", "f", "l", "m", "n", "p", "s", "y"), class = "factor"), V7 = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("a", "f"), class = "factor"), V8 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("c", "w"), class = "factor"), V9 = structure(c(2L, 2L, 2L, 1L, 1L), .Label = c("b", "n"), class = "factor"), V10 = structure(c(1L, 1L, 1L, 10L, 4L), .Label = c("b", "e", "g", "h", "k", "n", "o", "p", "r", "u", "w", "y"), class = "factor"), V11 = structure(c(2L, 2L, 2L, 2L, 1L), .Label = c("e", "t"), class = "factor"), V12 = structure(c(NA, NA, NA, 1L, 1L), .Label = c("b", "c", "e", "r"), class = "factor"), V13 = structure(c(3L, 2L, 3L, 3L, 2L), .Label = c("f", "k", "s", "y"), class = "factor"), V14 = structure(c(3L, 3L, 2L, 3L, 2L), .Label = c("f", "k", "s", "y"), class = "factor"), V15 = structure(c(7L, 8L, 7L, 4L, 7L), .Label = c("b", "c", "e", "g", "n", "o", "p", "w", "y"), class = "factor"), V16 = structure(c(7L, 7L, 8L, 4L, 1L), .Label = c("b", "c", "e", "g", "n", "o", "p", "w", "y"), class = "factor"), V17 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "p", class = "factor"), V18 = structure(c(3L, 3L, 3L, 3L, 3L), .Label = c("n", "o", "w", "y"), class = "factor"), V19 = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("n", "o", "t"), class = "factor"), V20 = structure(c(1L, 1L, 1L, 5L, 3L), .Label = c("e", "f", "l", "n", "p"), class = "factor"), V21 = structure(c(8L, 8L, 8L, 4L, 2L), .Label = c("b", "h", "k", "n", "o", "r", "u", "w", "y"), class = "factor"), V22 = structure(c(5L, 5L, 5L, 5L, 6L), .Label = c("a", "c", "n", "s", "v", "y"), class = "factor"), V23 = structure(c(3L, 3L, 5L, 1L, 2L), .Label = c("d", "g", "l", "m", "p", "u", "w"), class = "factor")), .Names = c("realClass", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12", "V13", "V14", "V15", "V16", "V17", "V18", "V19", "V20", "V21", "V22", "V23"), row.names = c(4105L, 6207L, 6696L, 2736L, 3756L), class = "data.frame")
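[Comments]:
Since you haven't shown your data, we can't tell why you get factors instead of numerics, but you can do numcol <- as.numeric(as.character(factcol))
Yes, this works. Is there a more general solution, so that attributes keep their type after processing whatever that type is?
It is easier to find an answer with a reproducible example. In this case we don't know much about your data (post str(data) or, better, dput(data)).
Thanks. In the meantime I have posted an answer; see whether it solves your problem.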
[Answer 1]:
You can always use this trick to convert a factor column back to numeric:
numcol <- as.numeric(as.character(factcol))
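The intermediate as.character() step matters. As a minimal sketch, assuming a hypothetical factor factcol whose labels are numbers stored as text, calling as.numeric() directly returns the internal level codes rather than the values:

```r
# Hypothetical factor whose labels are numbers stored as text
factcol <- factor(c("10.5", "2", "2", "7"))

# as.numeric() alone returns the internal integer codes of the levels
codes <- as.numeric(factcol)

# Converting to character first recovers the original numbers
numcol <- as.numeric(as.character(factcol))

print(codes)   # 1 2 2 3
print(numcol)  # 10.5 2.0 2.0 7.0
```

The codes are positions in levels(factcol), which is why they bear no relation to the original values.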
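But I suspect you have factor variables in your data.frame.
apply first coerces the data frame to a matrix, and a matrix can hold only one type. If the data contain even one factor column, every column becomes character, and data.frame() then turns those strings into factors (its default behaviour before R 4.0.0).
Here is an example using a toy dataset: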
set.seed(123)
toydat <- data.frame(A = 1:10, B = rnorm(10), C = LETTERS[1:10])
str(toydat)
## 'data.frame': 10 obs. of 3 variables:
## $ A: int 1 2 3 4 5 6 7 8 9 10
## $ B: num -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
## $ C: Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
set.seed(1)
str(data.frame(apply(toydat[,1:2], 2, sample, replace = TRUE)))
## 'data.frame': 10 obs. of 2 variables:
## $ A: num 3 4 6 10 3 9 10 7 7 1
## $ B: num 1.5587 -0.2302 0.4609 0.0705 -1.2651 ...
# with the factor column C
set.seed(2)
str(data.frame(apply(toydat[,1:3], 2, sample, replace = TRUE)))
## 'data.frame': 10 obs. of 3 variables:
## $ A: Factor w/ 6 levels "10"," 2"," 5",..: 2 5 4 2 1 1 2 6 3 4
## $ B: Factor w/ 8 levels " 0.129288","-0.230177",..: 8 7 6 2 1 5 3 7 1 4
## $ C: Factor w/ 6 levels "B","D","E","G",..: 4 2 5 1 2 3 1 2 6 1
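To see where the types are lost, one can inspect the matrix that apply works on. This is a small sketch with a made-up data frame df (not the toy data above), with the factor column created explicitly so it behaves the same in any R version:

```r
# Small illustrative data frame (names are hypothetical)
df <- data.frame(A = 1:3, B = c(0.5, 1.5, 2.5), C = factor(c("x", "y", "z")))

# apply() operates on as.matrix(df); a matrix can hold only one type
m_numeric <- as.matrix(df[, 1:2])  # numeric columns only
m_mixed   <- as.matrix(df)         # factor column included

print(typeof(m_numeric))  # "double"
print(typeof(m_mixed))    # "character"
```

With only numeric columns the matrix stays numeric, which is why the first apply call above returns num columns; adding the factor column forces everything to character.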
The plyr package is useful here because its **ply functions let you control the type of the output. In this case, though, the colwise function is enough:
require(plyr)
set.seed(2)
mysamplingfun <- colwise(function(x) sample(x, replace = TRUE))
str(mysamplingfun(toydat[,1:3]))
## 'data.frame': 10 obs. of 3 variables:
## $ A: int 2 8 6 2 10 10 2 9 5 6
## $ B: num 1.715 1.559 -1.265 -0.23 0.129 ...
## $ C: Factor w/ 10 levels "A","B","C","D",..: 7 4 9 2 4 5 2 4 10 2
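If adding a dependency is not desirable, a similar effect can probably be had in base R with lapply, which visits each column separately and so never builds a matrix. The sketch below is an alternative to colwise, not part of the answer above; the helper name resample_columns and the sample data are assumptions made for illustration. It also carries the procedure from the question through to the combined data set.

```r
# Resample every column independently, with replacement.
# Assigning into dat[] keeps the data.frame structure, and sample() on a
# single column keeps that column's type and factor levels.
resample_columns <- function(dat) {
  dat[] <- lapply(dat, function(x) sample(x, replace = TRUE))
  dat
}

set.seed(42)
original <- data.frame(
  A = 1:10,
  B = rnorm(10),
  C = factor(LETTERS[1:10])
)

synthetic <- resample_columns(original)

# Label the two halves and stack them; both have identical column types,
# so base rbind() is enough here
original$class  <- 1
synthetic$class <- 2
newData <- rbind(original, synthetic)

str(newData)
```

One caveat: sample(x) treats a single number x >= 1 as the range 1:x, so this sketch assumes the data have more than one row.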
[Comments]:
Yes, colwise does what I need. Thank you very much for your help.