Memory efficiency of rolling max / mins with variable windows in R

Posted 2023-03-07

Asked: 2021-10-28 20:49:11

Question:

I've been working on an exercise where I need to calculate max / mins with variable window lengths on some large datasets (~100 - 250 million rows).

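In short, I have a table of start and end indexes (denoted "Lookup_table" below) that refer to row positions in a second table (called "Price_table" below). Using these row positions, I need to extract the maximum and minimum values of a specific column in "Price_table". I need to repeat this for every row of the lookup table.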

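For example, if the first row of "Lookup_table" has Start = 1 and End = 5, I need to find the max / min of the target column in Price_table from row 1 to row 5 (inclusive). Then, if the second row has Start = 6 and End = 12, I find the max / min over rows 6 to 12 of Price_table, and so on.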
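As a minimal sketch of the lookup described above, here is the same operation on a toy example. The prices and the names `price` and `lookup` are made up for illustration and are not part of the real data.

```r
# Toy data (hypothetical values, for illustration only)
price  <- c(10, 14, 9, 12, 11, 20, 18, 17, 25, 16, 15, 19)
lookup <- data.frame(Start = c(1, 6), End = c(5, 12))

# For each lookup row, take the max / min of price[Start:End] (inclusive)
lookup$High <- mapply(function(s, e) max(price[s:e]), lookup$Start, lookup$End)
lookup$Low  <- mapply(function(s, e) min(price[s:e]), lookup$Start, lookup$End)

print(lookup)
```

The first row covers positions 1 to 5 and the second covers positions 6 to 12, matching the example in the text.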

I've created a dummy set of data with 10,000 rows below (apologies for all the package requirements).

require(data.table)
require(dplyr)
require(purrr)
   
# Number of rows
nn <- 10000
# Create a random table of Price data with nn rows
Price_table <- data.table(Price = runif(nn,300,1000)) %>% mutate(.,Index = seq_len(nrow(.)))

# Create a lookup table with start / end indexes
Lookup_table <- data.table(Start = floor(sort(runif(nn,1,nn)))) %>% mutate(.,End = shift(Start,fill = nn,type = "lead"))

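I initially calculated the max / mins using the following line of code. Unfortunately, it fails on the very large datasets because it runs out of memory (I have 64 GB of RAM).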

# Option 1: Determine the max / min between the Start / End prices in the "Price_table" table's "Price" column
Lookup_table[,(c("High_2","Low_2")) := rbindlist(pmap(.l = list(x = Start,y = End),
                       function(x,y) data.table(Max = max(Price_table$Price[x:y],na.rm = T),
                                                Min = min(Price_table$Price[x:y],na.rm = T))))]

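I have not yet re-tested the following alternative on the full dataset, but based on some smaller datasets it appears to be more memory efficient (well, at least according to memory.size(), which may or may not give an accurate reflection...).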

# Option 2: Use mapply separately to determine the max / min between the Start / End prices in the "Price_table" table's "Price" column
Lookup_table[,High := mapply(function(x,y) max(Price_table$Price[x:y],na.rm = T),Start,End)]
Lookup_table[,Low := mapply(function(x,y) min(Price_table$Price[x:y],na.rm = T),Start,End)]

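I have two questions:

1. If I'm right that the mapply approach is more memory efficient (not in general, but at least relative to my first attempt), can someone explain why that is the case? Is it because the first attempt uses the rbindlist() + data.table() calls?
2. Are there any other memory-efficient (and faster?) approaches I should consider when working with larger datasets?

Thanks in advance. Phil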

Comments:

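Side note: library-vs-require, ***.com/a/51263513/3358272. If you use require, check its return value; if you don't, the script will happily continue even when the package is unavailable, which makes the script hard to distribute.

Thank you - I wasn't aware of that!

I think you want frollmin(adaptive=TRUE), but min (and max) are not implemented yet.

Answer 1: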

You can use non-equi joins:

Price_table[Lookup_table,.(Price,Start,End),on=.(Index>=Start,Index<=End)][
            ,.(Low = min(Price), High = max(Price)),by=.(Start,End)]

      Start   End      Low     High
   1:     3     5 668.3308 908.1017
   2:     5     5 908.1017 908.1017
   3:     5     6 333.3477 908.1017
   4:     6     7 333.3477 827.1258
   5:     7     8 785.8887 827.1258
  ---                              
8947:  9993  9995 395.8449 827.7860
8948:  9995  9995 827.7860 827.7860
8949:  9995  9997 418.7436 827.7860
8950:  9997  9999 418.7436 947.1398
8951:  9999 10000 489.3145 634.6268
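On the question of why Option 1 runs out of memory while Option 2 does not, one plausible factor (a guess that can be checked on a small sample, not a confirmed diagnosis) is that Option 1 builds one small data.table per lookup row and holds them all in a list until rbindlist() runs, whereas mapply simplifies straight to a plain numeric vector. The sketch below only compares the size of those two kinds of intermediate result; `n` and the constant values are placeholders, and object.size() is itself only an estimate.

```r
library(data.table)

n <- 1000  # small sample size (assumption); the real data has far more rows

# Option 1 style: one tiny data.table per lookup row, all kept in a list
per_row_tables <- lapply(seq_len(n), function(i) data.table(Max = 1, Min = 0))

# Option 2 style: two plain numeric vectors of length n
high <- numeric(n)
low  <- numeric(n)

size_tables  <- as.numeric(object.size(per_row_tables))
size_vectors <- as.numeric(object.size(high)) + as.numeric(object.size(low))

cat("list of per-row data.tables:", size_tables, "bytes\n")
cat("two numeric vectors:        ", size_vectors, "bytes\n")
```

If the per-row overhead seen here scales to hundreds of millions of rows, it would account for a large share of the memory used by the first approach.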

Comments:

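I think it can be done inside []: Price_table[Lookup_table, on=.(Index>=Start, Index<=End), by=.EACHI, as.list(range(x.Price))]. Unless there are dupes in Lookup_table.

@chinsoon12, thanks for the suggestion. There may be dupes in the lookup table, as the results for the above example were not exactly the same.

Thanks so much for the quick reply. I'll look into non-equi joins in more detail. Really appreciated!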
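The by=.EACHI suggestion can be sketched as follows and checked against the mapply approach from the question. This is a variant of the one-liner in the comment: it names the output columns with min() / max() instead of as.list(range()). It also assumes a fixed seed, a smaller `nn`, and integer Start / End values, none of which come from the original post.

```r
library(data.table)

set.seed(1)   # assumption: fixed seed so the sketch is reproducible
nn <- 1000    # assumption: smaller than the question's 10,000 rows

Price_table <- data.table(Price = runif(nn, 300, 1000), Index = seq_len(nn))
starts <- as.integer(floor(sort(runif(nn, 1, nn))))
Lookup_table <- data.table(Start = starts,
                           End = shift(starts, fill = nn, type = "lead"))

# One group per row of Lookup_table; the summary is computed inside the join,
# so the fully expanded join result is never stored
res <- Price_table[Lookup_table,
                   on = .(Index >= Start, Index <= End),
                   by = .EACHI,
                   .(Low = min(x.Price), High = max(x.Price))]

# Reference values from the question's Option 2
high_ref <- mapply(function(x, y) max(Price_table$Price[x:y]), starts, Lookup_table$End)
low_ref  <- mapply(function(x, y) min(Price_table$Price[x:y]), starts, Lookup_table$End)

print(isTRUE(all.equal(res$High, high_ref)))
print(isTRUE(all.equal(res$Low, low_ref)))
```

Because by=.EACHI returns one row per lookup row, duplicate Start / End pairs each get their own row, whereas grouping with by=.(Start,End) after the join collapses them.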
