How to compute statistics by decile groups in data.table
[Title]: How can I compute statistics by decile groups in data.table
[Posted]: 2014-04-29 10:08:03
[Question]: I have a data.table and want to compute statistics by group.
R) library(data.table)
R) set.seed(1)
R) DT=data.table(a=rnorm(100),b=rnorm(100))
The groups should be defined by
R) quantile(DT$a,probs=seq(.1,.9,.1))
10% 20% 30% 40% 50% 60% 70% 80% 90%
-1.05265747329 -0.61386923071 -0.37534201964 -0.07670312896 0.11390916079 0.37707993057 0.58121734252 0.77125359976 1.18106507751
How can I compute, say, the mean of b for each bin? For example, if b=-.5 I am within [-0.61386923071,-0.37534201964], so in bin 3.
[Answer 1]: How about:
> DT[, mean(b), keyby=cut(a,quantile(a,probs=seq(.1,.9,.1)))]
cut V1
1: NA -0.31359818
2: (-1.05,-0.614] -0.14103182
3: (-0.614,-0.375] -0.33474492
4: (-0.375,-0.0767] 0.20827735
5: (-0.0767,0.114] 0.14890251
6: (0.114,0.377] 0.16685304
7: (0.377,0.581] 0.07086979
8: (0.581,0.771] 0.17950572
9: (0.771,1.18] -0.04951607
To look into that NA (checking the results is worthwhile anyway), I next did:
> DT[, list(mean(b),.N,list(a)), keyby=cut(a,quantile(a,probs=seq(.1,.9,.1)))]
cut V1 N V3
1: NA -0.31359818 20 1.59528080213779,1.51178116845085,-2.2146998871775,-1.98935169586337,-1.47075238389927,1.35867955152904,
2: (-1.05,-0.614] -0.14103182 10 -0.626453810742332,-0.835628612410047,-0.820468384118015,-0.621240580541804,-0.68875569454952,-0.70749515696212,
3: (-0.614,-0.375] -0.33474492 10 -0.47815005510862,-0.41499456329968,-0.394289953710349,-0.612026393250771,-0.443291873218433,-0.589520946188072,
4: (-0.375,-0.0767] 0.20827735 10 -0.305388387156356,-0.155795506705329,-0.102787727342996,-0.164523596253587,-0.253361680136508,-0.112346212150228,
5: (-0.0767,0.114] 0.14890251 10 -0.0449336090152309,-0.0161902630989461,0.0745649833651906,-0.0561287395290008,-0.0538050405829051,-0.0593133967111857,
6: (0.114,0.377] 0.16685304 10 0.183643324222082,0.329507771815361,0.36458196213683,0.341119691424425,0.188792299514343,0.153253338211898,
7: (0.377,0.581] 0.07086979 10 0.487429052428485,0.575781351653492,0.389843236411431,0.417941560199702,0.387671611559369,0.556663198673657,
8: (0.581,0.771] 0.17950572 10 0.738324705129217,0.593901321217509,0.61982574789471,0.763175748457544,0.696963375404737,0.768532924515416,
9: (0.771,1.18] -0.04951607 10 1.12493091814311,0.943836210685299,0.821221195098089,0.918977371608218,0.782136300731067,1.10002537198388,
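Aside: I returned a list column (each cell is itself a vector) to get a quick look at the values going into each bin, just as a check. data.table shows commas when printing (and only the first 6 items of each cell), but each cell of V3 actually holds a numeric vector.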
So the values outside the first and last break are coded together as NA. It isn't obvious to me how to tell cut not to do that, so I just added -Inf and +Inf:
> DT[,list(mean(b),.N),keyby=cut(a,c(-Inf,quantile(a,probs=seq(.1,.9,.1)),+Inf))]
cut V1 N
1: (-Inf,-1.05] -0.16938368 10
2: (-1.05,-0.614] -0.14103182 10
3: (-0.614,-0.375] -0.33474492 10
4: (-0.375,-0.0767] 0.20827735 10
5: (-0.0767,0.114] 0.14890251 10
6: (0.114,0.377] 0.16685304 10
7: (0.377,0.581] 0.07086979 10
8: (0.581,0.771] 0.17950572 10
9: (0.771,1.18] -0.04951607 10
10: (1.18, Inf] -0.45781268 10
That's better. Or alternatively:
> DT[, list(mean(b),.N), keyby=cut(a,quantile(a,probs=seq(0,1,.1)),include=TRUE)]
cut V1 N
1: [-2.21,-1.05] -0.16938368 10
2: (-1.05,-0.614] -0.14103182 10
3: (-0.614,-0.375] -0.33474492 10
4: (-0.375,-0.0767] 0.20827735 10
5: (-0.0767,0.114] 0.14890251 10
6: (0.114,0.377] 0.16685304 10
7: (0.377,0.581] 0.07086979 10
8: (0.581,0.771] 0.17950572 10
9: (0.771,1.18] -0.04951607 10
10: (1.18,2.4] -0.45781268 10
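That way you see what the min and max are, rather than -Inf and +Inf. Note that you need to pass include=TRUE to cut (R partially matches it to the include.lowest argument). Otherwise 11 groups are returned, with just 1 value in the first: the minimum falls outside the left-open first interval and is coded as NA.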
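The behaviour of cut discussed above can be reproduced in base R on a toy vector. This is a minimal sketch: the vector x and the break values are assumptions chosen for illustration, not the question's data.

```r
# Toy data: the integers 1..10 (an assumption for illustration, not the question's DT).
x <- 1:10

# Inner breaks only: values at or below the lowest break, or above the
# highest break, fall outside every interval and become NA.
inner <- cut(x, breaks = c(3, 7))

# Fix 1: pad the breaks with -Inf and +Inf so every value lands in an interval.
padded <- cut(x, breaks = c(-Inf, 3, 7, Inf))

# Fix 2: use breaks that span the full range of the data.
full <- quantile(x, probs = seq(0, 1, 0.5))
open_left   <- cut(x, breaks = full)                         # the minimum is left out
closed_left <- cut(x, breaks = full, include.lowest = TRUE)  # the minimum is kept

print(table(inner, useNA = "ifany"))
print(table(padded, useNA = "ifany"))
print(table(open_left, useNA = "ifany"))
print(table(closed_left, useNA = "ifany"))
```

Intervals are right-closed by default, which is why the minimum is dropped unless include.lowest=TRUE is given.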
[Comments]:
I had missed that cut command! Very cool, this works perfectly. I have also never used keyby; why use keyby here rather than by?
@hadley To sort the bins. by= returns the groups in order of first appearance.
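A small sketch of the by versus keyby difference described in the comment above. The table dt and its columns g and v are made up for illustration.

```r
library(data.table)

# Hypothetical data: groups first appear in the order b, a, c.
dt <- data.table(g = c("b", "a", "b", "c", "a"), v = 1:5)

# by= keeps groups in order of first appearance.
by_res <- dt[, .(total = sum(v)), by = g]

# keyby= returns the groups sorted and sets the grouping column as the key.
keyby_res <- dt[, .(total = sum(v)), keyby = g]

print(by_res)
print(keyby_res)
print(key(keyby_res))
```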
@MattDowle When I run DT[, list(mean(b),.N,list(a)), keyby=cut(a,quantile(a,probs=seq(.1,.9,.1)))] I get Error in setkeyv(ans, names(ans)[seq_along(byval)]) : Item 4 of list is not a vector. The other 3 commands work fine, though. If I run DT[, list(mean(b),.N,paste0(as.character(a),collapse=",")), keyby=cut(a,quantile(a,probs=seq(.1,.9,.1)))] instead, it works. I am running DT 1.9.2 on R 3.0.1, Win7. Any idea what is going on here?
[Answer 2]:
I do this kind of thing a lot, so I wrote a very flexible bin_data() function for it in my R package, mltools. It is built entirely on data.table and takes advantage of the new non-equi joins.
To answer your specific question, set Bin1 as a column in DT and then group by Bin1
library(data.table)
library(mltools)
DT[, Bin1 := bin_data(vals=a, bins=seq(.1, .9, .1), binType="quantile")]
DT[, list(mean(b)), keyby=Bin1]
Bin1 V1
1: NA -0.31359818
2: [-1.05265747329296, -0.613869230708978) -0.14103182
3: [-0.613869230708978, -0.375342019639661) -0.33474492
4: [-0.375342019639661, -0.0767031289639095) 0.20827735
5: [-0.0767031289639095, 0.113909160788544) 0.14890251
6: [0.113909160788544, 0.377079930573521) 0.16685304
7: [0.377079930573521, 0.581217342522697) 0.07086979
8: [0.581217342522697, 0.771253599758546) 0.17950572
9: [0.771253599758546, 1.1810650775142] -0.04951607
You can do other cool things too
Make 10 bins equally spaced by quantile
DT[, Bin2 := bin_data(vals=a, bins=10, binType="quantile")]
DT[, list(mean(b)), keyby=Bin2]
Bin2 V1
1: [-2.2146998871775, -1.05265747329296) -0.16938368
2: [-1.05265747329296, -0.613869230708978) -0.14103182
3: [-0.613869230708978, -0.375342019639661) -0.33474492
4: [-0.375342019639661, -0.0767031289639095) 0.20827735
5: [-0.0767031289639095, 0.113909160788544) 0.14890251
6: [0.113909160788544, 0.377079930573521) 0.16685304
7: [0.377079930573521, 0.581217342522697) 0.07086979
8: [0.581217342522697, 0.771253599758546) 0.17950572
9: [0.771253599758546, 1.1810650775142) -0.04951607
10: [1.1810650775142, 2.40161776050478] -0.45781268
Make the last boundary left-closed, right-open
DT[, Bin3 := bin_data(vals=a, bins=10, binType="quantile", boundaryType="lcro)")]
DT[, list(mean(b)), keyby=Bin3]
Bin3 V1
1: NA 0.42510038
2: [-2.2146998871775, -1.05265747329296) -0.16938368
3: [-1.05265747329296, -0.613869230708978) -0.14103182
4: [-0.613869230708978, -0.375342019639661) -0.33474492
5: [-0.375342019639661, -0.0767031289639095) 0.20827735
6: [-0.0767031289639095, 0.113909160788544) 0.14890251
7: [0.113909160788544, 0.377079930573521) 0.16685304
8: [0.377079930573521, 0.581217342522697) 0.07086979
9: [0.581217342522697, 0.771253599758546) 0.17950572
10: [0.771253599758546, 1.1810650775142) -0.04951607
11: [1.1810650775142, 2.40161776050478) -0.55591413
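The NA row above appears because the maximum sits exactly on the last, now open, boundary. The same effect can be seen with base R's cut. This is a sketch on a toy vector, independent of mltools; x and br are assumptions for illustration.

```r
# Toy data whose maximum equals the highest break.
x <- 1:10
br <- c(1, 5, 10)

# Left-closed, right-open intervals: the maximum falls outside the last one.
half_open <- cut(x, breaks = br, right = FALSE)

# With right = FALSE, include.lowest = TRUE closes the last interval on the
# right instead, so the maximum is kept.
closed_end <- cut(x, breaks = br, right = FALSE, include.lowest = TRUE)

print(table(half_open, useNA = "ifany"))
print(table(closed_end, useNA = "ifany"))
```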
Specify your own explicit bins (note that empty bins are returned)
bin_data(dt=DT, binCol="a", bins=seq(-5, 5, 1), returnDT=TRUE)
Bin a b
1: [-5, -4) NA NA
2: [-4, -3) NA NA
3: [-3, -2) -2.214700 -0.65069635
4: [-2, -1) -1.989352 -0.17955653
5: [-2, -1) -1.470752 -0.03763417
---
100: [1, 2) 1.586833 -1.20808279
101: [2, 3) 2.401618 0.42510038
102: [2, 3) 2.172612 0.20753834
103: [3, 4) NA NA
104: [4, 5] NA NA
Use variable-sized bins
bin_data(dt=DT, binCol="a", bins=data.table(LB=c(-5, 0, 1), RB=c(0, 1, Inf)), returnDT=TRUE)
Bin a b
1: [-5, 0) -0.626453811 -0.62036668
2: [-5, 0) -0.835628612 -0.91092165
3: [-5, 0) -0.820468384 1.76728727
4: [-5, 0) -0.305388387 1.68217608
5: [-5, 0) -0.621240581 1.43228224
---
95: [1, Inf] 2.172611670 0.20753834
96: [1, Inf] 1.178086997 0.21992480
97: [1, Inf] 1.063099837 1.46458731
98: [1, Inf] 1.207867806 0.40201178
99: [1, Inf] 1.160402616 -0.73174817
100: [1, Inf] 1.586833455 -1.20808279
Bin a b
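For readers curious how a non-equi join can assign rows to bins given as lower and upper bounds, here is a minimal sketch in plain data.table. It is not necessarily how bin_data() is implemented. The tables vals and bins and the id column are hypothetical, and unlike the output above the last bin here is right-open.

```r
library(data.table)

# Hypothetical values to bin (not the question's DT).
vals <- data.table(a = c(-0.5, 0.2, 1.5, 0.99))

# Bin edges as in the variable-sized example, plus an illustrative id column.
bins <- data.table(id = 1:3, LB = c(-5, 0, 1), RB = c(0, 1, Inf))

# Update join: for each bin, match the rows of vals with LB <= a < RB
# and write that bin's id into a new column.
vals[bins, on = .(a >= LB, a < RB), bin := i.id]

print(vals)
```

Rows that match no bin would simply keep NA in the bin column.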
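[Comments]:
Thanks @statquant. I built it to address a need that kept coming up. Note that it still needs some argument checking and error handling, but it will work as long as you give it appropriate inputs.
I will use it and come back with suggestions/improvements. Typically I would add a by, so that you can subset first and then do the binning (I do this all the time, e.g. by 2014, 2015, ...).
@ben. Looks interesting. I will take a look at mltools.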
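The per-subset binning suggested in the comment above can be approximated today with cut inside a grouped data.table assignment. A sketch under stated assumptions: the year column and the simulated data are made up, and this uses base cut rather than any mltools feature.

```r
library(data.table)

set.seed(1)
# Hypothetical data: 100 observations for each of two years.
DT <- data.table(year = rep(c(2014L, 2015L), each = 100),
                 a = rnorm(200), b = rnorm(200))

# Compute the decile breaks separately within each year; labels = FALSE
# returns the bin number (1..10) instead of an interval label.
DT[, decile := cut(a, quantile(a, probs = seq(0, 1, 0.1)),
                   include.lowest = TRUE, labels = FALSE),
   by = year]

# Statistics per year and decile, sorted by both.
res <- DT[, .(mean_b = mean(b), .N), keyby = .(year, decile)]
print(res)
```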