Partial group by statements to find the count for multiple instances of a variable
Posted: 2015-02-09 22:33:46
Question:
I am struggling with a fairly simple task. I have the following data and want to find the count of items in event_list for each visit_high. The data might look like this.
Visit_high visit event_list
101 1 3
101 2 5
102 1 2
103 1 6
103 2 8
103 3 5
...
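Visit_high is the user id, visit is the visit number, and event_list is the number of actions taken during that visit. So user 101 visited the site twice, performing 3 actions on the first visit and 5 on the second.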
> dput(tail(mydf[1:50,c(5,10)], 10))
structure(list(event_list = structure(c(2L, 2L, 2L, 2L, 76L,
36L, 64L, 37L, 14L, 25L), .Label = c("", "100,101,102,115,116",
"100,101,102,115,116,146", "100,101,102,116", "100,101,102,116,146",
"100,101,115,116", "100,101,117,118", "100,102,115,116", "100,102,115,116,146",
"100,102,116", "100,102,116,146", "100,107,115,116", "100,107,116",
"100,115,116", "100,115,116,146", "100,116", "100,116,146", "100,117",
"102,115,116", "102,115,116,146", "102,116", "102,116,146", "107,115,116",
"108,117,118", "115,116", "115,116,146", "116", "116,146", "202",
"202,120", "205,100,101,109,117,118", "206,115,116", "206,115,116,146",
"206,116", "206,116,146", "206,214,115,116", "206,214,115,116,146",
"206,214,116", "206,214,116,146", "206,215,115,116", "206,215,115,116,146",
"207,102,115,116", "207,102,115,116,146", "207,102,116", "207,102,116,146",
"207,115,116", "208,100,101,102,115,116", "208,100,101,102,116",
"208,100,102,115,116", "208,100,115,116", "208,102,109,115,116",
"208,102,109,116", "208,102,115,116", "208,102,116", "208,109,115,116",
"208,109,115,116,146", "208,109,116", "208,115,116", "208,116",
"210,102,108,115,116", "210,102,108,116", "212,102,109,115,116",
"212,102,109,116", "212,109,115,116", "212,109,116", "212,115,116",
"214,100,101,102,115,116", "214,100,101,102,115,116,146", "214,100,115,116",
"214,100,115,116,146", "214,100,116", "214,100,116,146", "214,102,115,116",
"214,102,115,116,146", "214,102,116", "214,115,116", "214,115,116,146",
"214,116", "214,116,146", "214,207,102,115,116", "214,221,102,115,116",
"214,221,102,115,116,146", "215,100,101,102,115,116", "215,100,101,102,115,116,146",
"215,100,101,102,116", "215,100,101,115,116", "215,100,102,115,116",
"215,100,102,116", "215,100,115,116", "215,100,115,116,146",
"215,100,116", "215,102,115,116", "215,102,115,116,146", "215,102,116",
"215,115,116", "215,115,116,146", "215,116", "215,207,102,115,116",
"215,207,102,116", "215,221,100,102,115,116", "215,221,100,102,116",
"215,221,102,115,116", "215,221,102,116", "220,102,115,116",
"221,100,102,115,116", "221,100,102,115,116,146", "221,100,102,116",
"221,102,115,116", "221,102,115,116,146", "221,102,116", "226,100,117,119,120",
"227,102,115,116", "227,102,116", "228,102,115,116", "232,102,115,116",
"234,102,115,116", "235"), class = "factor"), visid_high = c(2710815361820866560,
2710815518587167232, 2710815707565725184, 2710815726893081600,
2710815857889578496, 2710815857889578496, 2710815857889578496,
2710815883659387904, 2710815902986739712, 2710815950231374336
)), .Names = c("event_list", "visid_high"), row.names = 41:50, class = "data.frame")
I have the number of visits for each visitor id, but I am a bit lost on how to distinguish each instance of visit_high.
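# event_list is a factor in the dput output, so convert it to character before strsplit();
# data.frame() is used instead of cbind() so the factor is not reduced to its integer codes
event_sum <- data.frame(visid_high = mmf$visid_high, event_list = mmf$event_list, n_events = sapply(strsplit(as.character(mmf$event_list), ","), length))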
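A minimal sketch of one way to get there, assuming the goal is the number of comma-separated codes in event_list per visit and in total per visid_high. The toy data and the column names n_events and visit are made up for illustration; only mmf, visid_high, and event_list come from the question. The visit numbering also assumes the rows are already in chronological order.

```r
# Toy data shaped like the question's (ids and event codes are made up)
mmf <- data.frame(
  visid_high = c("A", "A", "B"),
  event_list = factor(c("100,101,102", "115,116", ""))
)

# Number of event codes in each row; an empty string counts as zero events
mmf$n_events <- lengths(strsplit(as.character(mmf$event_list), ","))

# Visit number within each visitor (1, 2, ... in row order)
mmf$visit <- ave(seq_along(mmf$visid_high), mmf$visid_high, FUN = seq_along)

# Total number of events per visitor
per_visitor <- aggregate(n_events ~ visid_high, data = mmf, FUN = sum)
```

Here mmf keeps one row per visit, which distinguishes the repeated visid_high values, while per_visitor collapses them to one row per visitor.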
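Comments:
What exactly are you trying to achieve? The sum of event_list for each Visit_high? For example: 8 actions for visitor 101?
It would help if your example used consistent spelling, e.g. visid vs. visit. But what we really need is sample output for the very short sample input you provided.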
Answer 1:
Hopefully I understood your question correctly (calling your data DF):
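# Split one row's comma-separated event_list into one record per event code
myfun <- function(row) {
  data.frame(event_list = unlist(strsplit(row[1], ",")), visid_high = row[2])
}
# Stack the per-row records and cross-tabulate event code by visitor
table(do.call(rbind, apply(DF, 1, myfun)))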
visid_high
event_list 2.710815e+18 2.710816e+18
100 1 4
101 1 3
102 1 3
115 1 9
116 1 9
214 0 3
206 0 2
109 0 1
212 0 1
146 0 1
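One thing to watch in the output above: the visid_high columns print as 2.710815e+18 and 2.710816e+18, which suggests that apply() turned the ids into strings rounded to 7 significant digits, so several distinct visitors appear to share a column. The following is a hedged sketch of a variant that keeps the ids apart by formatting them as plain digit strings first. The toy DF, and the helper names codes, ids, long, and tab, are assumptions for illustration.

```r
# Toy data shaped like the question's: comma-separated event codes per visit
DF <- data.frame(
  event_list = c("100,101,116", "116", "100,116,146"),
  visid_high = c(2710815361820866560, 2710815361820866560, 2710815518587167232),
  stringsAsFactors = FALSE
)

# Split each event_list into its individual codes
codes <- strsplit(as.character(DF$event_list), ",")

# Render the ids as plain digit strings so distinct visitors stay distinct
ids <- sprintf("%.0f", DF$visid_high)

# Repeat each id once per event code, then cross-tabulate code by visitor
long <- data.frame(
  event_list = unlist(codes),
  visid_high = rep(ids, lengths(codes))
)
tab <- table(long)
```

Each column of tab then corresponds to exactly one visitor id, and colSums(tab) gives the total number of events per visitor.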
Answer 2:
How about this.
# dummy data based on your example
du <- data.frame(Visit_high=c(101,101,102,103,103,103),
                 visit=c(1,2,1,1,2,3),
                 event_list=c(3,5,2,6,8,5))
# a function to sum the events over all visits for a given Visit_high
fu <- function(vh) {
  rows_for_this_user <- which(du$Visit_high==vh)
  events_for_this_user <- sum(du$event_list[rows_for_this_user])
  # return a vector with the user id and events count
  c(Visit_high=vh, event_sum=events_for_this_user)
}
# apply it to every distinct user and bind the results into a data frame
data.frame(t(sapply(unique(du$Visit_high), fu)))
Sample output:
Visit_high event_sum
1 101 8
2 102 2
3 103 19
Answer 3:
If you need the sum by group, there are many ways to do it.
library(data.table)
setDT(du)[, list(event_sum=sum(event_list)), by=Visit_high]
# Visit_high event_sum
#1: 101 8
#2: 102 2
#3: 103 19
Or a base R approach:
aggregate(cbind(event_sum=event_list)~Visit_high, du, FUN=sum)
Data
du <- data.frame(Visit_high=c(101,101,102,103,103,103),
visit=c(1,2,1,1,2,3),
event_list=c(3,5,2,6,8,5))
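As one more illustration of the "many ways" mentioned above, here is a small sketch using tapply() and table() from base R. It also reports the number of visits per user alongside the event total; the names event_sum, n_visits, and summary_df are chosen here for illustration.

```r
du <- data.frame(Visit_high = c(101, 101, 102, 103, 103, 103),
                 visit = c(1, 2, 1, 1, 2, 3),
                 event_list = c(3, 5, 2, 6, 8, 5))

# Named vector of event totals, one element per Visit_high
event_sum <- tapply(du$event_list, du$Visit_high, sum)

# Number of rows (visits) per Visit_high, in the same sorted order
n_visits <- table(du$Visit_high)

# Combine both into one summary table
summary_df <- data.frame(Visit_high = as.numeric(names(event_sum)),
                         event_sum = as.vector(event_sum),
                         n_visits = as.vector(n_visits))
```

Both tapply() and table() order their results by the sorted group values, which is why the two vectors line up when combined.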