Practicing k-means clustering in R

Posted by 积水成渊数据分析


Strengths of the algorithm: simple, fast, and applicable to most kinds of numeric data.

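Weaknesses of the algorithm: k has to be chosen in advance, and irregular clusters are handled poorly, for example non-spherical clusters, clusters of different sizes or densities, and ring-shaped clusters.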
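To make the ring-shaped case concrete, here is a minimal sketch on synthetic data. The two concentric rings, the variable names, and the choice of `nstart = 10` are illustrative assumptions and are not part of the original analysis.

```r
# Synthetic data: two concentric rings (an assumption for illustration only).
set.seed(1)
n <- 200
theta  <- runif(2 * n, 0, 2 * pi)                           # random angles
radius <- rep(c(1, 5), each = n) + rnorm(2 * n, sd = 0.1)   # inner ring, then outer ring
rings  <- data.frame(x = radius * cos(theta), y = radius * sin(theta))
true_ring <- rep(c("inner", "outer"), each = n)

# k-means with the "right" k = 2; nstart = 10 keeps the best of 10 random starts.
fit <- kmeans(rings, centers = 2, nstart = 10)

# Cross-tabulate the true ring against the k-means label.
ring_vs_cluster <- table(true_ring, cluster = fit$cluster)
print(ring_vs_cluster)
```

Because k-means assigns each point to its nearest centroid, the boundary between two clusters is a straight line, so the outer ring ends up divided between both clusters instead of being recovered as one group.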

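I. Analysis goal

Segment customers into groups based on the fields in the dataset.

II. Workflow

Data acquisition: graduation year, gender, age, number of friends, and the interest keywords (originally a list recording whether a user mentioned each sport or trending term, already expanded into one numeric column per term).

Data exploration

Check the data structure: the columns are numeric overall, with two things to handle:

1. gender is a categorical variable, which k-means cannot use directly, so it needs to be dummy-coded.

2. The data contains missing values.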

> str(teens)
'data.frame':    30000 obs. of  40 variables:
 $ gradyear    : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
 $ gender      : Factor w/ 2 levels "F","M": 2 1 2 1 NA 1 1 2 1 1 ...
 $ age         : num  19 18.8 18.3 18.9 19 ...
 $ friends     : int  7 0 69 0 10 142 72 17 52 39 ...
 $ basketball  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ football    : int  0 1 1 0 0 0 0 0 0 0 ...
 $ soccer      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ softball    : int  0 0 0 0 0 0 0 1 0 0 ...
 $ volleyball  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ swimming    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cheerleading: int  0 0 0 0 0 0 0 0 0 0 ...
 $ baseball    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ tennis      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ sports      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ cute        : int  0 1 0 1 0 0 0 0 0 1 ...
 $ sex         : int  0 0 0 0 1 1 0 2 0 0 ...
 $ sexy        : int  0 0 0 0 0 0 0 1 0 0 ...
 $ hot         : int  0 0 0 0 0 0 0 0 0 1 ...
 $ kissed      : int  0 0 0 0 5 0 0 0 0 0 ...
 $ dance       : int  1 0 0 0 1 0 0 0 0 0 ...
 $ band        : int  0 0 2 0 1 0 1 0 0 0 ...
 $ marching    : int  0 0 0 0 0 1 1 0 0 0 ...
 $ music       : int  0 2 1 0 3 2 0 1 0 1 ...
 $ rock        : int  0 2 0 1 0 0 0 1 0 1 ...
 $ god         : int  0 1 0 0 1 0 0 0 0 6 ...
 $ church      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ jesus       : int  0 0 0 0 0 0 0 0 0 2 ...
 $ bible       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hair        : int  0 6 0 0 1 0 0 0 0 1 ...
 $ dress       : int  0 4 0 0 0 1 0 0 0 0 ...
 $ blonde      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mall        : int  0 1 0 0 0 0 2 0 0 0 ...
 $ shopping    : int  0 0 0 0 2 1 0 0 0 1 ...
 $ clothes     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hollister   : int  0 0 0 0 0 0 2 0 0 0 ...
 $ abercrombie : int  0 0 0 0 0 0 0 0 0 0 ...
 $ die         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ death       : int  0 0 1 0 0 0 0 0 0 0 ...
 $ drunk       : int  0 0 0 0 1 1 0 0 0 0 ...
 $ drugs       : int  0 0 0 0 1 0 0 0 0 0 ...

Check the missing values

> summary(teens)
    gradyear     gender           age             friends         basketball     
 Min.   :2006   F   :22054   Min.   :  3.086   Min.   :  0.00   Min.   : 0.0000  
 1st Qu.:2007   M   : 5222   1st Qu.: 16.312   1st Qu.:  3.00   1st Qu.: 0.0000  
 Median :2008   NA's: 2724   Median : 17.287   Median : 20.00   Median : 0.0000  
 Mean   :2008                Mean   : 17.994   Mean   : 30.18   Mean   : 0.2673  
 3rd Qu.:2008                3rd Qu.: 18.259   3rd Qu.: 44.00   3rd Qu.: 0.0000  
 Max.   :2009                Max.   :106.927   Max.   :830.00   Max.   :24.0000  
                             NA's   :5086                                        
    football           soccer           softball         volleyball     
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000  
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000  
 Mean   : 0.2523   Mean   : 0.2228   Mean   : 0.1612   Mean   : 0.1431  
 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
 Max.   :15.0000   Max.   :27.0000   Max.   :17.0000   Max.   :14.0000  
                                                                        
    swimming        cheerleading       baseball           tennis        
 Min.   : 0.0000   Min.   :0.0000   Min.   : 0.0000   Min.   : 0.00000  
 1st Qu.: 0.0000   1st Qu.:0.0000   1st Qu.: 0.0000   1st Qu.: 0.00000  
 Median : 0.0000   Median :0.0000   Median : 0.0000   Median : 0.00000  
 Mean   : 0.1344   Mean   :0.1066   Mean   : 0.1049   Mean   : 0.08733  
 3rd Qu.: 0.0000   3rd Qu.:0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.00000  
 Max.   :31.0000   Max.   :9.0000   Max.   :16.0000   Max.   :15.00000  
                                                                        
     sports           cute              sex                sexy        
 Min.   : 0.00   Min.   : 0.0000   Min.   :  0.0000   Min.   : 0.0000  
 1st Qu.: 0.00   1st Qu.: 0.0000   1st Qu.:  0.0000   1st Qu.: 0.0000  
 Median : 0.00   Median : 0.0000   Median :  0.0000   Median : 0.0000  
 Mean   : 0.14   Mean   : 0.3229   Mean   :  0.2094   Mean   : 0.1412  
 3rd Qu.: 0.00   3rd Qu.: 0.0000   3rd Qu.:  0.0000   3rd Qu.: 0.0000  
 Max.   :12.00   Max.   :18.0000   Max.   :114.0000   Max.   :18.0000  
                                                                       
      hot              kissed            dance              band        
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000  
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000  
 Mean   : 0.1266   Mean   : 0.1032   Mean   : 0.4252   Mean   : 0.2996  
 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
 Max.   :10.0000   Max.   :26.0000   Max.   :30.0000   Max.   :66.0000  
                                                                        
    marching           music              rock              god         
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000  
 Median : 0.0000   Median : 0.0000   Median : 0.0000   Median : 0.0000  
 Mean   : 0.0406   Mean   : 0.7378   Mean   : 0.2433   Mean   : 0.4653  
 3rd Qu.: 0.0000   3rd Qu.: 1.0000   3rd Qu.: 0.0000   3rd Qu.: 1.0000  
 Max.   :11.0000   Max.   :64.0000   Max.   :21.0000   Max.   :79.0000  
                                                                        
     church            jesus             bible               hair        
 Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.0000  
 1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.0000  
 Median : 0.0000   Median : 0.0000   Median : 0.00000   Median : 0.0000  
 Mean   : 0.2482   Mean   : 0.1121   Mean   : 0.02133   Mean   : 0.4226  
 3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.00000   3rd Qu.: 0.0000  
 Max.   :44.0000   Max.   :30.0000   Max.   :11.00000   Max.   :37.0000  
                                                                         
     dress           blonde              mall            shopping     
 Min.   :0.000   Min.   :  0.0000   Min.   : 0.0000   Min.   : 0.000  
 1st Qu.:0.000   1st Qu.:  0.0000   1st Qu.: 0.0000   1st Qu.: 0.000  
 Median :0.000   Median :  0.0000   Median : 0.0000   Median : 0.000  
 Mean   :0.111   Mean   :  0.0989   Mean   : 0.2574   Mean   : 0.353  
 3rd Qu.:0.000   3rd Qu.:  0.0000   3rd Qu.: 0.0000   3rd Qu.: 1.000  
 Max.   :9.000   Max.   :327.0000   Max.   :12.0000   Max.   :11.000  
                                                                      
    clothes         hollister        abercrombie           die         
 Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   : 0.0000  
 1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 0.0000  
 Median :0.0000   Median :0.00000   Median :0.00000   Median : 0.0000  
 Mean   :0.1485   Mean   :0.06987   Mean   :0.05117   Mean   : 0.1841  
 3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.: 0.0000  
 Max.   :8.0000   Max.   :9.00000   Max.   :8.00000   Max.   :22.0000  
                                                                       
     death             drunk             drugs         
 Min.   : 0.0000   Min.   :0.00000   Min.   : 0.00000  
 1st Qu.: 0.0000   1st Qu.:0.00000   1st Qu.: 0.00000  
 Median : 0.0000   Median :0.00000   Median : 0.00000  
 Mean   : 0.1142   Mean   :0.08797   Mean   : 0.06043  
 3rd Qu.: 0.0000   3rd Qu.:0.00000   3rd Qu.: 0.00000  
 Max.   :14.0000   Max.   :8.00000   Max.   :16.00000  
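The `summary()` output above is long, and the missing values are easy to overlook in it. A more compact check could look like the sketch below. It runs on a small stand-in data frame `toy` with made-up values, since the real `teens` data is not included here.

```r
# Toy stand-in for `teens` (hypothetical values, not the real data).
toy <- data.frame(
  gender  = factor(c("F", "M", NA, "F", NA)),
  age     = c(18.2, NA, 17.5, NA, NA),
  friends = c(7L, 0L, 69L, 10L, 142L)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the columns that actually contain missing values.
print(na_counts[na_counts > 0])
```

Applied to the real data, the same call would be `colSums(is.na(teens))`, which should report missing values only for gender and age, in line with the summary above.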

Handling missing values

> table(teens$gender, useNA = 'ifany')  # count each level, including NA

    F     M  <NA> 
22054  5222  2724 
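> teens$female <- ifelse(teens$gender == 'F' & !is.na(teens$gender), 1, 0)  # dummy-code gender: 1 for a non-missing female, 0 for male or NA
> teens$no_gender <- ifelse(is.na(teens$gender), 1, 0)  # 1 if gender is missing, 0 otherwise, so missingness is kept as its own indicator
> teens$gender <- NULL  # drop the original factor column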
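The `!is.na()` guard in the first line matters, because comparing `NA` with `'F'` yields `NA` rather than `FALSE`. The sketch below shows the difference on a four-element factor with made-up values.

```r
gender <- factor(c("F", "M", NA, "F"))   # hypothetical values

# Without the guard, a missing gender propagates into the dummy as NA.
naive <- ifelse(gender == "F", 1, 0)
print(naive)

# With the guard, NA rows get 0 in `female` and 1 in `no_gender`.
female    <- ifelse(gender == "F" & !is.na(gender), 1, 0)
no_gender <- ifelse(is.na(gender), 1, 0)
print(data.frame(gender, female, no_gender))
```

With this coding, male is the implicit reference level: a row with `female = 0` and `no_gender = 0` is a male.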

Handling age: a high school student cannot be 106 years old, nor 3 years old, and many values are missing.

> summary(teens$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  3.086  16.310  17.290  17.990  18.260 106.900    5086 
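> teens$age <- ifelse(teens$age >= 16 & teens$age < 20, teens$age, NA)  # treat ages in [16, 20) as plausible for high school students; set all other values to NA
> aggregate(data = teens, age ~ gradyear, mean, na.rm = TRUE)  # mean age by graduation year: the later the graduation year, the younger the group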
  gradyear      age
1     2006 18.68560
2     2007 17.71878
3     2008 16.78118
4     2009 16.33252
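# Build a vector as long as the data set that holds, for each row, the mean age of its graduation year
> ave_age <- ave(teens$age, teens$gradyear, FUN = function(x) { return(mean(x, na.rm = TRUE)) })
> teens$age <- ifelse(is.na(teens$age), ave_age, teens$age)  # fill missing ages with the group mean; keep observed ages unchanged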
> summary(teens$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  16.00   16.33   17.25   17.38   18.22   20.00 
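The group-wise imputation above can be checked by hand on a small example. The data frame `toy` and the column `age_filled` below are hypothetical names with made-up values.

```r
# Toy data: two graduation years, each with one missing age (hypothetical values).
toy <- data.frame(
  gradyear = c(2006, 2006, 2006, 2007, 2007, 2007),
  age      = c(18.5, 19.0, NA, 17.5, NA, 18.0)
)

# ave() returns one value per row: the mean age of that row's gradyear group.
group_mean <- ave(toy$age, toy$gradyear,
                  FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing ages; observed ages are left untouched.
toy$age_filled <- ifelse(is.na(toy$age), group_mean, toy$age)
print(toy)
```

The missing 2006 age becomes 18.75 (the mean of 18.5 and 19.0) and the missing 2007 age becomes 17.75 (the mean of 17.5 and 18.0).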

Selecting the clustering features

> interests <- teens[, 4:39]
> interests_z <- data.frame(lapply(interests, scale))  # standardize each column (z-scores) so that differences in scale do not dominate the distances
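Standardizing here means replacing each value x by (x - mean) / sd of its column. As a small sketch of what `scale()` produces, using made-up counts and the hypothetical names `interests_toy` and `z`:

```r
# Hypothetical counts for two interest terms with very different spreads.
interests_toy <- data.frame(
  music  = c(0, 2, 1, 0, 3, 12),
  tennis = c(0, 0, 1, 0, 0, 1)
)

# scale() on the whole data frame returns a matrix of z-scores;
# as.data.frame() turns it back into a data frame with plain numeric columns.
z <- as.data.frame(scale(interests_toy))

print(round(colMeans(z), 10))   # column means are 0 (up to rounding)
print(sapply(z, sd))            # column standard deviations are 1
```

After this step a rarely used term such as tennis carries the same weight in the Euclidean distance as a frequently used one such as music.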

k-means modeling

> set.seed(11)  # set the random seed so that the result is reproducible
> teen_clusters <- kmeans(interests_z, 5)
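The value k = 5 is given here without a justification. One common heuristic for choosing k, which the original post does not use, is the elbow method: fit k-means for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below uses synthetic data with three well-separated groups; the data, the names `blobs`, `ks`, `wss`, and the choice of `nstart = 20` are all assumptions for illustration.

```r
# Synthetic data: three well-separated blobs (an assumption for illustration).
set.seed(42)
centers_true <- matrix(c(0, 0, 10, 0, 0, 10), ncol = 2, byrow = TRUE)
blobs <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers_true[i, 1]), rnorm(50, centers_true[i, 2]))
}))

# Total within-cluster sum of squares for k = 1..6.
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(blobs, centers = k, nstart = 20)$tot.withinss)

print(round(wss, 1))
# Uncomment to draw the elbow curve:
# plot(ks, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")
```

On this toy data the curve falls steeply up to k = 3 and flattens afterwards. On real data such as `interests_z` the elbow is usually much less clear, so domain knowledge about how many segments are useful still matters.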

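Evaluating the model

> head(teen_clusters$cluster)  # cluster labels of the first rows
[1] 2 1 2 2 5 2
> teen_clusters$size  # number of observations in each of the 5 clusters
[1]  5233 22182   731   803  1051
> teen_clusters$centers  # cluster centers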
  basketball    football      soccer     softball  volleyball    swimming cheerleading    baseball
1  0.4531058  0.43339400 -0.05953070  0.375996747  0.40342075  0.30482805   0.46479149  0.24682651
2 -0.1290187 -0.12977694 -0.14899365 -0.097191693 -0.10066663 -0.09079804  -0.11397570 -0.06918139
3  0.2457823  0.23577386  5.02937767  0.033589676  0.09057228  0.13165120  -0.02646162  0.03223472
4 -0.1109113  0.03602234 -0.14104152 -0.007481003 -0.07966170  0.04595234  -0.12500102 -0.11042346
5  0.3807668  0.38961431  0.05069711  0.161530303  0.11384059  0.27191272   0.20520869  0.29309699
       tennis     sports        cute          sex         sexy         hot      kissed       dance
1  0.13565618  0.2229282  0.65570947  0.005138338  0.230127643  0.50391601 -0.02252810  0.55362283
2 -0.04152949 -0.1020853 -0.17622878 -0.094412671 -0.077760392 -0.13325067 -0.13207336 -0.15193168
3  0.11156263  0.3998271  0.01361179 -0.056096990 -0.008332101  0.06411155 -0.06295344 -0.02447142
4  0.01655266 -0.1041384 -0.04851623 -0.046718513 -0.019765548 -0.06929059 -0.06814111  0.02098573
5  0.11082240  0.8460723  0.48219547  2.041764818  0.516256470  0.31165437  2.99550632  0.45107250
         band    marching       music        rock           god      church       jesus
1 -0.04813421 -0.12344684  0.26060636  0.18108418  0.3304953087  0.51972951  0.24097725
2 -0.12756333 -0.13513757 -0.13216778 -0.10905451 -0.1001591088 -0.13566360 -0.06601809
3 -0.04525084 -0.09376867  0.04330389  0.09518405 -0.0001367759  0.10922246  0.01664811
4  3.37541403  4.63879117  0.39528646  0.16181449  0.0735799460  0.03691392  0.07066919
5  0.38450719 -0.01216509  1.15977396  1.21019696  0.4122385221  0.17132449  0.12793743
        bible         hair        dress      blonde        mall    shopping     clothes
1  0.22258000  0.325589087  0.444453842  0.04528480  0.64138010  0.85834274  0.51465115
2 -0.05930602 -0.198168525 -0.132551874 -0.02830573 -0.17908734 -0.22075867 -0.18246937
3  0.03613298 -0.004854472  0.005732568  0.03146825  0.03512586  0.21142729 -0.04501739
4  0.04788738 -0.041196192  0.049581302 -0.01631323 -0.09605527 -0.06611646  0.01252734
5  0.08173007  2.596190010  0.542753982  0.36251061  0.63523179  0.28896231  1.31038466
    hollister  abercrombie          die       death       drunk       drugs
1  0.57661846  0.557958327  0.043023482  0.12784283  0.01420576 -0.05433707
2 -0.15090261 -0.146583539 -0.089778750 -0.07512888 -0.08530937 -0.11084192
3  0.06677613  0.002922202 -0.003454604 -0.01723961 -0.04902532 -...
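Because the features were standardized, each entry of the centers matrix says how far a cluster sits above or below the overall mean of that feature, in standard deviations. In the output above, for example, cluster 3 stands out on soccer (about 5.03) and cluster 4 on band and marching (about 3.38 and 4.64). One way to pull out such leading features automatically is sketched below on synthetic data; `toy_z`, `top_feature`, and the generated values are hypothetical.

```r
# Synthetic stand-in for standardized interest counts (hypothetical data):
# one group scores high on football, the other on band.
set.seed(7)
sporty  <- cbind(football = rnorm(40, 4), band = rnorm(40, 0), shopping = rnorm(40, 0))
musical <- cbind(football = rnorm(40, 0), band = rnorm(40, 4), shopping = rnorm(40, 0))
toy_z   <- as.data.frame(scale(rbind(sporty, musical)))

fit <- kmeans(toy_z, centers = 2, nstart = 10)

# For each cluster, name the feature whose center lies furthest above the mean.
top_feature <- apply(fit$centers, 1, function(center) names(which.max(center)))
print(round(fit$centers, 2))
print(top_feature)
```

For the real model the same `apply()` call could be run on `teen_clusters$centers`. A cluster whose centers are all slightly negative, like cluster 2 above with 22182 members, has no strong leading feature and is better described as having below-average activity on every term.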
