Practicing k-means clustering in R
Posted by 积水成渊数据分析
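Strengths: works on most numeric data sets, and is simple and fast.
Weaknesses: k has to be specified in advance, and the algorithm cannot handle irregularly shaped clusters, such as non-spherical clusters, clusters of differing size or density, or ring-shaped clusters.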
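To make the ring-shaped limitation concrete, here is a minimal sketch on synthetic data. The data, the variable names and the seed are all made up for illustration and have nothing to do with the teens data set used below. It builds a tight blob surrounded by a ring, asks kmeans() for two clusters, and measures how well the result matches the true grouping.

```r
# Synthetic example (hypothetical data): a tight blob inside a ring
set.seed(42)
n <- 200
theta <- runif(n, 0, 2 * pi)

inner <- cbind(x = rnorm(n, sd = 0.3), y = rnorm(n, sd = 0.3))
outer <- cbind(x = 5 * cos(theta) + rnorm(n, sd = 0.2),
               y = 5 * sin(theta) + rnorm(n, sd = 0.2))

pts   <- rbind(inner, outer)
truth <- rep(1:2, each = n)   # 1 = blob, 2 = ring

# Ask k-means for two clusters, with several random restarts
fit <- kmeans(pts, centers = 2, nstart = 10)

# Cluster labels are arbitrary, so score the better of the two matchings
agree    <- mean(fit$cluster == truth)
accuracy <- max(agree, 1 - agree)

print(table(truth, cluster = fit$cluster))
print(accuracy)
```

Because k-means assigns every point to its nearest center, the boundary between two clusters is a straight line, so it cuts across the ring rather than separating the ring from the blob.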
1. Analysis goal
Segment the customers using the fields in the data set.
2. Workflow
Data acquisition: graduation year, gender, age, number of friends, and interest keywords (originally a list recording whether each person mentioned a given sport or trending word; already expanded into one numeric column per keyword).
Data exploration
Check the data structure: almost every column is numeric, with two caveats:
1. Gender is a categorical variable, which k-means cannot use directly, so it has to be dummy-coded.
2. The data contain missing values.
> str(teens)
'data.frame':	30000 obs. of  40 variables:
 $ gradyear    : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
 $ gender      : Factor w/ 2 levels "F","M": 2 1 2 1 NA 1 1 2 1 1 ...
 $ age         : num  19 18.8 18.3 18.9 19 ...
 $ friends     : int  7 0 69 0 10 142 72 17 52 39 ...
 $ basketball  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ football    : int  0 1 1 0 0 0 0 0 0 0 ...
 $ soccer      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ softball    : int  0 0 0 0 0 0 0 1 0 0 ...
 $ volleyball  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ swimming    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cheerleading: int  0 0 0 0 0 0 0 0 0 0 ...
 $ baseball    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ tennis      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ sports      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cute        : int  0 1 0 1 0 0 0 0 0 1 ...
 $ sex         : int  0 0 0 0 1 1 0 2 0 0 ...
 $ sexy        : int  0 0 0 0 0 0 0 1 0 0 ...
 $ hot         : int  0 0 0 0 0 0 0 0 0 1 ...
 $ kissed      : int  0 0 0 0 5 0 0 0 0 0 ...
 $ dance       : int  1 0 0 0 1 0 0 0 0 0 ...
 $ band        : int  0 0 2 0 1 0 1 0 0 0 ...
 $ marching    : int  0 0 0 0 0 1 1 0 0 0 ...
 $ music       : int  0 2 1 0 3 2 0 1 0 1 ...
 $ rock        : int  0 2 0 1 0 0 0 1 0 1 ...
 $ god         : int  0 1 0 0 1 0 0 0 0 6 ...
 $ church      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ jesus       : int  0 0 0 0 0 0 0 0 0 2 ...
 $ bible       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hair        : int  0 6 0 0 1 0 0 0 0 1 ...
 $ dress       : int  0 4 0 0 0 1 0 0 0 0 ...
 $ blonde      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mall        : int  0 1 0 0 0 0 2 0 0 0 ...
 $ shopping    : int  0 0 0 0 2 1 0 0 0 1 ...
 $ clothes     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hollister   : int  0 0 0 0 0 0 2 0 0 0 ...
 $ abercrombie : int  0 0 0 0 0 0 0 0 0 0 ...
 $ die         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ death       : int  0 0 1 0 0 0 0 0 0 0 ...
 $ drunk       : int  0 0 0 0 1 1 0 0 0 0 ...
 $ drugs       : int  0 0 0 0 1 0 0 0 0 0 ...
Check for missing values
> summary(teens)
    gradyear     gender            age              friends          basketball
 Min.   :2006   F   :22054   Min.   :  3.086   Min.   :  0.00   Min.   : 0.0000
 1st Qu.:2007   M   : 5222   1st Qu.: 16.312   1st Qu.:  3.00   1st Qu.: 0.0000
 Median :2008   NA's: 2724   Median : 17.287   Median : 20.00   Median : 0.0000
 Mean   :2008                Mean   : 17.994   Mean   : 30.18   Mean   : 0.2673
 3rd Qu.:2008                3rd Qu.: 18.259   3rd Qu.: 44.00   3rd Qu.: 0.0000
 Max.   :2009                Max.   :106.927   Max.   :830.00   Max.   :24.0000
                             NA's   :5086
    football           soccer           softball         volleyball
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000
 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000
 Mean   : 0.2523   Mean   : 0.2228   Mean   : 0.1612   Mean   : 0.1431
 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000
 Max.   :15.0000   Max.   :27.0000   Max.   :17.0000   Max.   :14.0000
    swimming        cheerleading        baseball           tennis
 Min.   : 0.0000   Min.   :0.0000   Min.   : 0.0000   Min.   : 0.00000
 1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.: 0.0000   1st Qu.: 0.00000
 Median : 0.0000   Median :0.0000   Median : 0.0000   Median : 0.00000
 Mean   : 0.1344   Mean   :0.1066   Mean   : 0.1049   Mean   : 0.08733
 3rd Qu.: 0.0000   3rd Qu.:0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.00000
 Max.   :31.0000   Max.   :9.0000   Max.   :16.0000   Max.   :15.00000
     sports           cute              sex                sexy
 Min.   : 0.00   Min.   : 0.0000   Min.   :  0.0000   Min.   : 0.0000
 1st Qu.: 0.00   1st Qu.: 0.0000   1st Qu.:  0.0000   1st Qu.: 0.0000
 Median : 0.00   Median : 0.0000   Median :  0.0000   Median : 0.0000
 Mean   : 0.14   Mean   : 0.3229   Mean   :  0.2094   Mean   : 0.1412
 3rd Qu.: 0.00   3rd Qu.: 0.0000   3rd Qu.:  0.0000   3rd Qu.: 0.0000
 Max.   :12.00   Max.   :18.0000   Max.   :114.0000   Max.   :18.0000
      hot              kissed            dance              band
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000
 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000
 Mean   : 0.1266   Mean   : 0.1032   Mean   : 0.4252   Mean   : 0.2996
 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000
 Max.   :10.0000   Max.   :26.0000   Max.   :30.0000   Max.   :66.0000
    marching           music             rock              god
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000
 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000
 Mean   : 0.0406   Mean   : 0.7378   Mean   : 0.2433   Mean   : 0.4653
 3rd Qu.: 0.0000   3rd Qu.: 1.0000   3rd Qu.: 0.0000   3rd Qu.: 1.0000
 Max.   :11.0000   Max.   :64.0000   Max.   :21.0000   Max.   :79.0000
     church            jesus             bible               hair
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.0000
 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.0000
 Median : 0.0000   Median : 0.0000   Median : 0.00000   Median : 0.0000
 Mean   : 0.2482   Mean   : 0.1121   Mean   : 0.02133   Mean   : 0.4226
 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.00000   3rd Qu.: 0.0000
 Max.   :44.0000   Max.   :30.0000   Max.   :11.00000   Max.   :37.0000
     dress           blonde              mall            shopping
 Min.   :0.000   Min.   :  0.0000   Min.   : 0.0000   Min.   : 0.000
 1st Qu.:0.000   1st Qu.:  0.0000   1st Qu.: 0.0000   1st Qu.: 0.000
 Median :0.000   Median :  0.0000   Median : 0.0000   Median : 0.000
 Mean   :0.111   Mean   :  0.0989   Mean   : 0.2574   Mean   : 0.353
 3rd Qu.:0.000   3rd Qu.:  0.0000   3rd Qu.: 0.0000   3rd Qu.: 1.000
 Max.   :9.000   Max.   :327.0000   Max.   :12.0000   Max.   :11.000
    clothes         hollister        abercrombie           die
 Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   : 0.0000
 1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 0.0000
 Median :0.0000   Median :0.00000   Median :0.00000   Median : 0.0000
 Mean   :0.1485   Mean   :0.06987   Mean   :0.05117   Mean   : 0.1841
 3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.: 0.0000
 Max.   :8.0000   Max.   :9.00000   Max.   :8.00000   Max.   :22.0000
     death             drunk             drugs
 Min.   : 0.0000   Min.   :0.00000   Min.   : 0.00000
 1st Qu.: 0.0000   1st Qu.:0.00000   1st Qu.: 0.00000
 Median : 0.0000   Median :0.00000   Median : 0.00000
 Mean   : 0.1142   Mean   :0.08797   Mean   : 0.06043
 3rd Qu.: 0.0000   3rd Qu.:0.00000   3rd Qu.: 0.00000
 Max.   :14.0000   Max.   :8.00000   Max.   :16.00000
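Handling missing values
> table(teens$gender, useNA = 'ifany')  # counts per level, including missing values
    F     M  <NA>
22054  5222  2724
> teens$female <- ifelse(teens$gender == 'F' & !is.na(teens$gender), 1, 0)  # dummy-code gender: 1 for a known female, 0 for male or NA
> teens$no_gender <- ifelse(is.na(teens$gender), 1, 0)  # 1 when gender is missing, 0 otherwise; together the two dummies encode F, M and missing
> teens$gender <- NULL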
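The role of the is.na() guard in the dummy coding above can be seen on a tiny example. This is only a sketch on a made-up four-element factor (not the teens data); the name `naive` is introduced here purely for comparison.

```r
# Toy factor with one missing value (hypothetical data)
gender <- factor(c("F", "M", NA, "F"))

# Without the is.na() guard, the missing entry stays NA in the dummy
naive <- ifelse(gender == "F", 1, 0)

# With the guard, NA maps to 0 and is tracked by its own indicator
female    <- ifelse(gender == "F" & !is.na(gender), 1, 0)
no_gender <- ifelse(is.na(gender), 1, 0)

print(data.frame(gender, naive, female, no_gender))
```

A male row is the one where both `female` and `no_gender` are 0, so no third column is needed.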
Handling age: a high-school student cannot be 106 years old, or 3, and many values are missing.
> summary(teens$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  3.086  16.310  17.290  17.990  18.260 106.900    5086
> teens$age <- ifelse(teens$age >= 16 & teens$age < 20, teens$age, NA)  # treat ages from 16 up to (but not including) 20 as plausible for high-school students; set everything else to NA
> aggregate(data = teens, age ~ gradyear, mean, na.rm = TRUE)  # mean age per graduation year: the later the graduation year, the younger the cohort
  gradyear      age
1     2006 18.68560
2     2007 17.71878
3     2008 16.78118
4     2009 16.33252
# Build a vector as long as the data set that holds, for each row, the mean age of its graduation year
> ave_age <- ave(teens$age, teens$gradyear, FUN = function(x) mean(x, na.rm = TRUE))
> teens$age <- ifelse(is.na(teens$age), ave_age, teens$age)  # fill missing ages with the cohort mean; keep known ages unchanged
> summary(teens$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  16.00   16.33   17.25   17.38   18.22   20.00
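The group-mean imputation with ave() is easier to follow on a few rows. The sketch below uses a made-up five-row data frame; `toy`, `group_mean` and `age_filled` are illustrative names, not part of the original analysis.

```r
# Hypothetical mini data set: two graduation years, two missing ages
toy <- data.frame(
  gradyear = c(2006, 2006, 2006, 2007, 2007),
  age      = c(18, NA, 19, 17, NA)
)

# ave() returns one value per row: the mean age of that row's gradyear group
group_mean <- ave(toy$age, toy$gradyear,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing ages; known ages are left untouched
toy$age_filled <- ifelse(is.na(toy$age), group_mean, toy$age)

print(toy)
```

Here the 2006 group mean is 18.5 and the 2007 group mean is 17, so the two missing ages are filled with 18.5 and 17 respectively.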
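Selecting the features
> interests <- teens[, 4:39]  # the 36 interest-keyword columns, basketball through drugs
> interests_z <- data.frame(lapply(interests, scale))  # standardize each column (z-scores) so that differences in scale do not dominate the distances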
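As a quick check of what scale() does to each column, the following sketch compares it with the z-score formula written out by hand, (x - mean) / sd. The vector `x` is a made-up example.

```r
# Hypothetical column with mean 4 and standard deviation 2
x <- c(2, 4, 6)

# z-score by hand: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; flatten it for comparison
z_scale <- as.numeric(scale(x))

print(z_manual)   # values: -1, 0, 1
print(z_scale)
```

After standardization every column has mean 0 and standard deviation 1, which is why the cluster centers are read as "above or below average" for each keyword.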
Fitting the k-means model
> set.seed(11)  # fix the random seed so the result is reproducible
> teen_clusters <- kmeans(interests_z, 5)
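The source fixes k at 5 without showing how that value was chosen. One common heuristic, which the source does not use, is the "elbow" method: fit the model for several values of k and look at where the total within-cluster sum of squares stops dropping sharply. The sketch below runs it on synthetic data with three obvious groups; the data and the names `blobs`, `ks` and `wss` are assumptions for illustration.

```r
# Synthetic data (hypothetical): three well-separated blobs of 50 points each
set.seed(1)
n <- 50
blobs <- rbind(
  cbind(rnorm(n, 0, 0.5), rnorm(n, 0, 0.5)),
  cbind(rnorm(n, 6, 0.5), rnorm(n, 0, 0.5)),
  cbind(rnorm(n, 0, 0.5), rnorm(n, 6, 0.5))
)

# Total within-cluster sum of squares for k = 1..6
ks  <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(blobs, centers = k, nstart = 10)$tot.withinss
})

print(round(wss, 1))
```

Calling plot(ks, wss, type = "b") draws the curve; on this toy data the bend should appear at k = 3. On real data such as interests_z the bend is often much less clear, so the choice of k usually also depends on what the segments are meant to be used for.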
Evaluating the model
> head(teen_clusters$cluster)  # cluster label of the first six rows
[1] 2 1 2 2 5 2
> teen_clusters$size  # number of rows in each of the 5 clusters
[1]  5233 22182   731   803  1051
> teen_clusters$centers  # cluster centers, in z-score units
  basketball    football      soccer     softball  volleyball    swimming cheerleading    baseball
1  0.4531058  0.43339400 -0.05953070  0.375996747  0.40342075  0.30482805   0.46479149  0.24682651
2 -0.1290187 -0.12977694 -0.14899365 -0.097191693 -0.10066663 -0.09079804  -0.11397570 -0.06918139
3  0.2457823  0.23577386  5.02937767  0.033589676  0.09057228  0.13165120  -0.02646162  0.03223472
4 -0.1109113  0.03602234 -0.14104152 -0.007481003 -0.07966170  0.04595234  -0.12500102 -0.11042346
5  0.3807668  0.38961431  0.05069711  0.161530303  0.11384059  0.27191272   0.20520869  0.29309699
       tennis     sports        cute          sex         sexy         hot      kissed       dance
1  0.13565618  0.2229282  0.65570947  0.005138338  0.230127643  0.50391601 -0.02252810  0.55362283
2 -0.04152949 -0.1020853 -0.17622878 -0.094412671 -0.077760392 -0.13325067 -0.13207336 -0.15193168
3  0.11156263  0.3998271  0.01361179 -0.056096990 -0.008332101  0.06411155 -0.06295344 -0.02447142
4  0.01655266 -0.1041384 -0.04851623 -0.046718513 -0.019765548 -0.06929059 -0.06814111  0.02098573
5  0.11082240  0.8460723  0.48219547  2.041764818  0.516256470  0.31165437  2.99550632  0.45107250
         band    marching       music        rock           god      church       jesus
1 -0.04813421 -0.12344684  0.26060636  0.18108418  0.3304953087  0.51972951  0.24097725
2 -0.12756333 -0.13513757 -0.13216778 -0.10905451 -0.1001591088 -0.13566360 -0.06601809
3 -0.04525084 -0.09376867  0.04330389  0.09518405 -0.0001367759  0.10922246  0.01664811
4  3.37541403  4.63879117  0.39528646  0.16181449  0.0735799460  0.03691392  0.07066919
5  0.38450719 -0.01216509  1.15977396  1.21019696  0.4122385221  0.17132449  0.12793743
        bible         hair        dress      blonde        mall    shopping     clothes
1  0.22258000  0.325589087  0.444453842  0.04528480  0.64138010  0.85834274  0.51465115
2 -0.05930602 -0.198168525 -0.132551874 -0.02830573 -0.17908734 -0.22075867 -0.18246937
3  0.03613298 -0.004854472  0.005732568  0.03146825  0.03512586  0.21142729 -0.04501739
4  0.04788738 -0.041196192  0.049581302 -0.01631323 -0.09605527 -0.06611646  0.01252734
5  0.08173007  2.596190010  0.542753982  0.36251061  0.63523179  0.28896231  1.31038466
    hollister  abercrombie          die       death       drunk       drugs
1  0.57661846  0.557958327  0.043023482  0.12784283  0.01420576 -0.05433707
2 -0.15090261 -0.146583539 -0.089778750 -0.07512888 -0.08530937 -0.11084192
3  0.06677613  0.002922202 -0.003454604 -0.01723961 -0.04902532 -
[output truncated in the source]
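A common next step, which the source does not show, is to attach the cluster labels back onto the data frame and compare variables that were not used for clustering (such as age) across clusters. The sketch below does this on a small synthetic data frame; `toy`, its columns and `profile` are hypothetical names, and with the real data the same pattern would use teens and teen_clusters$cluster.

```r
# Hypothetical data: two interest columns used for clustering, plus age
set.seed(7)
toy <- data.frame(
  sports = c(rnorm(30, 0), rnorm(30, 4)),
  music  = c(rnorm(30, 4), rnorm(30, 0)),
  age    = c(rnorm(30, 17, 0.5), rnorm(30, 18, 0.5))
)

# Standardize the clustering features and fit k-means with k = 2
toy_z <- as.data.frame(scale(toy[, c("sports", "music")]))
fit   <- kmeans(toy_z, centers = 2, nstart = 10)

# Attach the labels, then summarize a variable that was not clustered on
toy$cluster <- fit$cluster
profile <- aggregate(age ~ cluster, data = toy, FUN = mean)

print(fit$size)
print(profile)
```

If the clusters differ clearly on such held-out variables, that is some evidence the segmentation reflects real structure rather than noise.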