Cluster analysis in R with multiple individuals
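【Title】: cluster analysis in r with multiple individuals 【Posted】: 2011-08-25 08:08:26 【Question】: Sorry, I don't know how to use HTML or anything else to make this look "pretty", especially to make my example data useful to everyone. I'm just learning as I go.
I am trying to run a cluster analysis on the variables PersVel, TurnVel, and Velocity (possibly others later, but these will do for now). My data are already separated by year, but each year has a different number of individuals (ID holds the individuals' names). I want to run k-means and/or hierarchical clustering on these variables for each individual. The sample data below contain only about 20 data points. Once clusters have been identified from the variables of interest, I want to link them back to the calendar date or the date/time variable. Ultimately I want to know when each cluster occurs.
I have already written code that turns ID into levels, and I have been told that I need to standardize the variables for k-means clustering (so I assume you would do the same for hierarchical clustering, but that's no big deal). What I can't work out is how to make it loop through the individuals.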
IDNames = levels(Data$ID)
for (i in 1:(length(IDNames))) {
  # Now what??? How do I write the next part to run this analysis?
}
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("c_002",
"c_102", "c_104", "c_401", "c_402", "c_406", "c_409", "c_411",
"c_412", "c_413", "c_414", "c_415", "c_417", "c_418", "c_420",
"c_421", "c_423", "c_425", "c_426", "c_602", "c_604", "c_9809",
"c_9814", "c_9815", "c_9816", "c_9819", "c_9908", "c_9911"), class = "factor"),
x = c(229539.8109, 231122.438, 231290.6472, 231355.2828,
230910.8116, 230928.7384, 231164.6592, 231113.9708, 231186.0565,
231270.4396, 231334.5768, 231153.0715, 231215.2728, 231200.7462,
231325.1136, 231777.6369, 231522.6185, 231674.6925, 231684.3388,
231924.464, 232065.5961), y = c(2229114.92, 2229455.232,
2230388.77, 2232003.32, 2232559.623, 2232521.689, 2232434.829,
2232996.109, 2233038.608, 2233160.861, 2233371.836, 2233471.823,
2233307.792, 2233285.778, 2233204.662, 2231630.353, 2231054.838,
2231056.299, 2230981.267, 2230840.082, 2230998.991), DateTime = structure(c(1148853637,
1148871660, 1148889637, 1148907637, 1148925637, 1148943666,
1148961637, 1148979636, 1148997636, 1149015637, 1149033637,
1149051690, 1149069666, 1149087665, 1149105637, 1149123683,
1149141654, 1149159637, 1149177636, 1149195696, 1149213696
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), RunClock_days = c(1179.58332175926,
1179.79192129630, 1179.99998842593, 1180.20832175926, 1180.41665509259,
1180.62532407407, 1180.83332175926, 1181.04164351852, 1181.24997685185,
1181.45832175926, 1181.66665509259, 1181.87560185185, 1182.08365740741,
1182.29197916667, 1182.49998842593, 1182.70885416667, 1182.91685185185,
1183.12498842593, 1183.33331018519, 1183.54233796296, 1183.75067129630
), CalDay = c("148", "149", "149", "149", "149", "149", "150",
"150", "150", "150", "151", "151", "151", "151", "151", "152",
"152", "152", "152", "152", "153"), DistX = c(1582.62709999998,
168.209200000012, 64.6355999999796, -444.4712, 17.9268000000156,
235.920799999993, -50.6883999999845, 72.085699999996, 84.3831000000064,
64.1371999999974, -181.505300000019, 62.2013000000152, -14.5266000000120,
124.367400000017, 452.523300000001, -255.018400000001, 152.073999999993,
9.64629999999306, 240.125200000009, 141.132099999988, -3159.38569999998
), DistY = c(340.311999999918, 933.538000000175, 1614.54999999981,
556.303000000305, -37.9340000003576, -86.8599999998696, 561.280000000261,
42.4989999998361, 122.253000000026, 210.975000000093, 99.9869999997318,
-164.030999999959, -22.0139999999665, -81.1159999999218,
-1574.30899999989, -575.51500000013, 1.46100000012666, -75.032000000123,
-141.185000000056, 158.908999999985, -5943.84400000004),
Dist = c(1618.80227174238, 948.571311188026, 1615.84326693116,
712.058758417295, 41.9566265835052, 251.402632191101, 563.564151002218,
83.6810202224823, 148.547310896621, 220.508573640299, 207.223488285096,
175.428534402698, 26.3749559916007, 148.482509538166, 1638.05560483262,
629.48542442515, 152.081017872048, 75.649534880978, 278.555768025113,
212.533150194039, 6731.34455348268), LnDist = c(7.38944181635036,
6.8549569696676, 7.3876122460922, 6.56816043387389, 3.73663638428818,
5.527055766233, 6.33428117083723, 4.42701219219356, 5.00090349939957,
5.39593657685343, 5.33379786440982, 5.16723174859221, 3.27241492322993,
5.00046717041827, 7.4012652106211, 6.44490269900689, 5.02441339116178,
4.32611129191379, 5.62961828357648, 5.35909797711072, 8.81453018774869
), TimeDif = c(5.00638888888889, 4.99361111111111, 5, 5,
5.00805555555556, 4.99194444444444, 4.99972222222222, 5,
5.00027777777778, 5, 5.01472222222222, 4.99333333333333,
4.99972222222222, 4.99222222222222, 5.01277777777778, 4.99194444444444,
4.99527777777778, 4.99972222222222, 5.01666666666667, 5,
4.98361111111111), Velocity = c(323.347288368894, 189.956985051838,
323.168653386232, 142.411751683459, 8.3778277053979, 50.3616646757533,
112.719092372242, 16.7362040444965, 29.7078117453384, 44.1017147280597,
41.3230243076688, 35.1325502809141, 5.27528426966845, 29.7427684363120,
326.776026676129, 126.100246393108, 30.4449571450467, 15.130747573283,
55.5260667159693, 42.5066300388078, 1350.69619266137), LnVelocity = c(5.77872694180175,
5.24679765206538, 5.7781743336581, 4.95872252143979, 2.12558865719019,
3.91923026414518, 4.72489881550196, 2.81757427975946, 3.39141003295307,
3.78649866441933, 3.72141983391736, 3.55912805917125, 1.66303256789466,
3.39258602467242, 5.78927500251085, 4.83707719691907, 3.41592036944078,
2.71672893657851, 4.0168525810497, 3.74966006467662, 7.2083754367735
), Heading = c(1.35899167682096, 0.178271769107279, 0.040011832151945,
5.60907076311214, 2.70012174242416, 1.92356952639201, 6.193121040462,
1.03808707214764, 0.604141059039809, 0.295125938335282, 5.21590486031959,
2.77914091577713, 3.72488212039469, 2.14873677066758, 2.86169595063768,
3.55870493136089, 1.56118945741765, 3.01373153808326, 2.10231890072709,
0.726219128764754, 3.63015207232184), Angle = c(0.609592148368293,
-1.18071990771368, -0.138259936955334, 5.5690589309602, -2.90894902068798,
-0.776552216032153, 4.26955151407000, -5.15503396831437,
-0.433946013107828, -0.309015120704527, 4.92077892198431,
-2.43676394454246, 0.945741204617556, -1.57614534972711,
0.712959179970102, 0.697008980723212, -1.99751547394325,
1.45254208066561, -0.911412637356172, -1.37609977196233,
2.90393294355708), CosAngle = c(0.81988159459602, 0.380259094713527,
0.990457310809811, 0.755665715954353, -0.973060304449898,
0.713334063328187, -0.428504949324728, 0.428331029577178,
0.907313699540722, 0.952633553896418, 0.206884943442359,
-0.761722542920473, 0.585141974104434, -0.00534899742449928,
0.756429664977827, 0.766765630720815, -0.413886381311673,
0.117978826229562, 0.612629855005907, 0.193468831871728,
-0.971891607429047), SinAngle = c(0.572533117682013, -0.924880003507292,
-0.137819865996876, -0.654957499179294, -0.230550740410589,
-0.700824167745161, -0.90353943378483, 0.903621894987807,
-0.420454338336196, -0.304120555028232, -0.978365279523375,
-0.647903362861135, 0.810930866437557, -0.999985694010946,
0.654075043050515, 0.641927151275992, -0.910328546934967,
0.993016110927459, -0.7903699518298, -0.981106421900391,
0.235428765041538), PersVel = c(265.106490396188, 72.2328711703229,
320.084755370955, 107.615678296195, -8.15213157764327, 35.9246908991267,
-48.300688964897, 7.1686355095929, 26.9543045799223, 42.0127732343175,
8.54911154675927, -26.7612555392593, 3.08679025151586, -0.159093991763311,
247.183080381409, 96.6893349596614, -12.6007531419523, 1.78510783867173,
34.0169262012526, 8.22370806041186, -1312.73029383395), TurnVel = c(185.127031103868,
-175.687417000979, -44.5390605040812, -93.273644736341, -1.93151438051183,
-35.2946717326457, -101.846144898756, 15.1232004135905, -12.4907783308025,
-13.4122379607943, -40.4290122275236, -22.7624974728921,
4.27789084350665, -29.7423429365923, 213.736043716065, 80.9471719423284,
-27.7149135993477, 15.0250761106466, -43.886134675599, -41.7035277044184,
317.992736584573)), .Names = c("ID", "x", "y", "DateTime",
"RunClock_days", "CalDay", "DistX", "DistY", "Dist", "LnDist",
"TimeDif", "Velocity", "LnVelocity", "Heading", "Angle", "CosAngle",
"SinAngle", "PersVel", "TurnVel"), row.names = 150:170, class = "data.frame")
【Answer 1】: To keep things simple (assuming your original data.frame is named orgData):
results <- list()
IDNames <- levels(orgData$ID)
for (i in seq_along(IDNames)) {
  # rows belonging to the i'th individual (empty if that level has no rows)
  dataForCurrentIndividual <- orgData[orgData$ID == IDNames[i], ]
  # now do whatever analysis you're interested in on data.frame dataFor...
  # after your analysis, I assume the result is in a variable resCurIndv
  results[[i]] <- resCurIndv  # keep your results in the i'th spot in the results list
}
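As a minimal sketch of what could go in place of the comments in the loop, the block below standardizes the three variables with scale() and runs kmeans() for each individual. The toy data frame, the choice of three clusters, and the nstart value are assumptions made for illustration only; they are not taken from the question.

```r
set.seed(42)  # k-means uses random starts; fix the seed for reproducibility

# Hypothetical toy data standing in for orgData: 2 individuals, 15 rows each
orgData <- data.frame(
  ID       = factor(rep(c("c_002", "c_102"), each = 15)),
  PersVel  = rnorm(30, mean = 50, sd = 80),
  TurnVel  = rnorm(30, mean = 0, sd = 60),
  Velocity = rexp(30, rate = 1 / 100)
)
clusterVars <- c("PersVel", "TurnVel", "Velocity")

results <- list()
for (id in levels(orgData$ID)) {
  dataForCurrentIndividual <- orgData[orgData$ID == id, ]
  if (nrow(dataForCurrentIndividual) == 0) next  # skip factor levels with no rows

  # standardize each column to mean 0 and standard deviation 1
  scaled <- scale(dataForCurrentIndividual[, clusterVars])

  # centers = 3 is an arbitrary choice for this sketch
  results[[id]] <- kmeans(scaled, centers = 3, nstart = 10)
}
```

Storing each fit under the individual's ID (rather than a numeric index) makes it easy to look one up later, for example results[["c_002"]]$cluster.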
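Once you have that working, a good next step is to make your code more R-ish.
First, turn the above into a function. That is, take all the code you wrote where my comments (the lines starting with #) are, and wrap it in a function like this: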
analysisPerIndividual <- function(dataForCurrentIndividual) {
  # now do whatever analysis you're interested in on data.frame dataFor...
  # after your analysis, I assume the result is in a variable resCurIndv
  return(resCurIndv)
}
Now you can put it to proper use like this (note that you need to have the plyr package installed for this):
require(plyr)
dlply(orgData, "ID", analysisPerIndividual)
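If plyr is not available, a similar per-individual split can be sketched in base R with split() and lapply(). In the block below, the body of analysisPerIndividual, the k argument, and the toy data are all assumptions for illustration. The function returns each row's cluster next to its DateTime, which is one possible way to see when each cluster occurs.

```r
set.seed(1)  # reproducible k-means starts

# Hypothetical toy data: 2 individuals, 12 fixes each, one fix every 5 hours
start <- as.POSIXct("2006-05-28 22:00:00", tz = "UTC")
orgData <- data.frame(
  ID       = factor(rep(c("c_002", "c_102"), each = 12)),
  DateTime = start + rep(0:11, times = 2) * 5 * 3600,
  PersVel  = rnorm(24, mean = 50, sd = 80),
  TurnVel  = rnorm(24, mean = 0, sd = 60),
  Velocity = rexp(24, rate = 1 / 100)
)

# One possible analysis: standardize, cluster, and keep the time stamps
analysisPerIndividual <- function(dataForCurrentIndividual, k = 2) {
  vars <- c("PersVel", "TurnVel", "Velocity")
  fit <- kmeans(scale(dataForCurrentIndividual[, vars]), centers = k, nstart = 10)
  dataForCurrentIndividual$cluster <- fit$cluster
  dataForCurrentIndividual[, c("ID", "DateTime", "cluster")]
}

# drop = TRUE discards factor levels that have no rows
perIndividual <- lapply(split(orgData, orgData$ID, drop = TRUE), analysisPerIndividual)

# stack the per-individual results into one data frame
clusterTimes <- do.call(rbind, perIndividual)
```

Note that cluster labels are assigned separately for each individual, so cluster 1 for one ID does not necessarily mean the same thing as cluster 1 for another.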
For information on standardizing variables, see ?scale, and for k-means clustering, see ?kmeans.
Good luck!
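For a rough idea of how these functions fit together for a single individual, here is a small sketch. It also runs hclust(), since the question mentions hierarchical clustering as an alternative. The toy data, the number of clusters, and the Ward linkage are assumptions for illustration, not recommendations from the answer.

```r
set.seed(7)  # reproducible random data and k-means starts

# Hypothetical data for one individual: 20 observations of the three variables
oneIndividual <- data.frame(
  PersVel  = rnorm(20, mean = 50, sd = 80),
  TurnVel  = rnorm(20, mean = 0, sd = 60),
  Velocity = rexp(20, rate = 1 / 100)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
scaled <- scale(oneIndividual)

# k-means with 3 clusters (arbitrary choice) and several random starts
km <- kmeans(scaled, centers = 3, nstart = 25)

# hierarchical clustering on the same standardized data, cut into 3 groups
hc <- hclust(dist(scaled), method = "ward.D2")
hcGroups <- cutree(hc, k = 3)

# cross-tabulate the two groupings to see how far they agree
agreement <- table(kmeans = km$cluster, hclust = hcGroups)
print(agreement)
```

Standardizing first keeps a variable with a large range, such as Velocity here, from dominating the Euclidean distances that both methods rely on.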