R: Pattern-matching financial time-series data with 2 large data sets

Posted 2023-03-25


【Title】R: Pattern-matching financial time-series data with 2 large data sets 【Posted】2015-06-14 10:08:42 【Question】:

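My question may be somewhat complicated, so please bear with me.

I am working on the following case: I have two financial time-series data sets from 2 exchanges (New York and London).

The two data sets look like this:

London data set: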

Date        time.second Price
2015-01-05  32417   238.2
2015-01-05  32418   238.2
2015-01-05  32421   238.2
2015-01-05  32422   238.2
2015-01-05  32423   238.2
2015-01-05  32425   238.2
2015-01-05  32427   238.2
2015-01-05  32431   238.2
2015-01-05  32435   238.47
2015-01-05  32436   238.47

New York data set:

NY.Date     Time    Price
2015-01-05  32416   1189.75
2015-01-05  32417   1189.665
2015-01-05  32418   1189.895
2015-01-05  32419   1190.15
2015-01-05  32420   1190.075
2015-01-05  32421   1190.01
2015-01-05  32422   1190.175
2015-01-05  32423   1190.12
2015-01-05  32424   1190.14
2015-01-05  32425   1190.205
2015-01-05  32426   1190.2
2015-01-05  32427   1190.33
2015-01-05  32428   1190.29
2015-01-05  32429   1190.28
2015-01-05  32430   1190.05
2015-01-05  32432   1190.04

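As you can see, there are 3 columns: date, time (in seconds), and price.

What I want to do is use the London data set as the reference and find the nearest but earlier data item in the New York data set.

What do I mean by nearest but earlier? For example,

"2015-01-01","21610","15.6871" in the London data set: I want to find the data in the New York data set with the same date and the nearest time that is earlier or the same. Looking at my current program should help: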

# I am trying to avoid using a for-loop
# Assumes london_data and NY_data are matrices with columns: date, time, price
temp <- matrix(0, nrow = nrow(london_data), ncol = 2)  # holds NYTime and NYPrice
for (i in seq_len(nrow(london_data))) {  # for each row in the London data set
    tempRow <- london_data[i, ]
    dateMatch <- which(NY_data[, 1] == tempRow[1])    # select the same date
    dataNeeded <- NY_data[dateMatch, , drop = FALSE]  # subset the same-date data
    # find the nearest but earlier (or same-time) data in the NY data set
    earlier <- which(as.numeric(dataNeeded[, 2]) <= as.numeric(tempRow[2]))
    Found <- dataNeeded[earlier, , drop = FALSE]
    if (nrow(Found) > 0) {
        # we only need "time" and "price" (2nd and 3rd columns),
        # taken from the final row of Found
        temp[i, ] <- Found[nrow(Found), 2:3]
    } else {
        temp[i, ] <- c(0, 0)  # if nothing is found, just insert 0 and 0
    }
    print(paste("time is", temp[i, 1]))  # monitor the loop
}

res <- cbind(london_data, temp)
colnames(res) <- c("LondonDate", "LondonTime", "LondonPrice", "NYTime", "NYPrice")

The correct output for the data sets listed above is **(partial only)**:

      "LondonDate","LondonTime","LondonPrice","NYTime","NYPrice"
 [1,] "2015-01-05" "32417"      "238.2"       "32417"    "1189.665" 
 [2,] "2015-01-05" "32418"      "238.2"       "32418"    "1189.895" 
 [3,] "2015-01-05" "32421"      "238.2"       "32421"    "1190.01"  
 [4,] "2015-01-05" "32422"      "238.2"       "32422"    "1190.175" 
 [5,] "2015-01-05" "32423"      "238.2"       "32423"    "1190.12"  
 [6,] "2015-01-05" "32425"      "238.2"       "32425"    "1190.205" 
 [7,] "2015-01-05" "32427"      "238.2"       "32427"    "1190.33"  
 [8,] "2015-01-05" "32431"      "238.2"       "32430"    "1190.05"  
 [9,] "2015-01-05" "32435"      "238.47"      "32432"    "1190.04"  
 [10,] "2015-01-05" "32436"      "238.47"      "32432"    "1190.04"

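My problem is that the London data set has more than 5,000,000 rows. I tried to avoid for-loops, but I still need at least one. The program above runs successfully, but it takes about 24 hours.

How can I avoid the for-loop and speed up the program?

Any help would be appreciated.

【Comments】: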

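Take a look at rolling joins, dt1[dt2, roll=TRUE]. Someone will surely post an answer soon. Let us know the timing you get with a rolling join.

Please provide a small (10-row) reproducible example of your 2 data sets with ?dput

@RockScience I put a small example in the question, please read it.

@GeekCat Best practice is to use the R function dput rather than pasting data, because people can then load the data set directly in exactly the same format (same date format, etc.). If you do this, people are more likely to answer your question. See ***.com/questions/5963269/…

【Answer 1】: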

Building on @Jan Gorecki's comment, here is a solution using data.table:

library(data.table)

df1 <- data.table(Date=rep("05/01/2015", 10),   
              time.second=c(32417, 32418, 32421, 32422, 32423, 32425, 32427, 32431, 32435, 32436),  
              Price=c(238.2, 238.2, 238.2, 238.2, 238.2, 238.2, 238.2, 238.2, 238.47, 238.47))

df2 <- data.table(NY.Date=rep("05/01/2015", 16),    
              Time=c(32416, 32417, 32418, 32419, 32420, 32421, 32422, 32423, 32424, 32425, 32426, 32427, 32428, 32429, 32430, 32432),   
              Price=c(1189.75, 1189.665, 1189.895, 1190.15, 1190.075, 1190.01, 1190.175, 1190.12, 1190.14, 1190.205, 1190.2, 1190.33, 1190.29, 1190.28, 1190.05, 1190.04))


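# Rename df2's columns so both tables share the same key column names
setnames(df2, c("Date", "time.second", "NYPrice"))

# Key (and sort) both tables by date, then time
setkey(df1,"Date", "time.second")
setkey(df2,"Date", "time.second")

# Keep a copy of the New York time; after the join, time.second holds df1's time
df2[, NYTime:=time.second]

# Rolling join: for each df1 row, take the df2 row with the same Date and the
# latest time.second that is <= df1's time.second
df3 <- df2[df1, roll=TRUE]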
df3
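
For readers wondering what roll=TRUE does here: within each Date, it pairs every df1 row with the df2 row whose time.second is the largest value not exceeding it. The same lookup can be sketched in base R with findInterval. This is only an illustrative cross-check, assuming a single date and New York times sorted in ascending order; the vector names london_time, ny_time and ny_price are made up for this sketch and are not part of the original answer.

```r
# Times (in seconds) and prices copied from the sample data in the question;
# every row shares the same date, so the date column is left out here.
london_time <- c(32417, 32418, 32421, 32422, 32423, 32425, 32427, 32431, 32435, 32436)
ny_time <- c(32416, 32417, 32418, 32419, 32420, 32421, 32422, 32423, 32424,
             32425, 32426, 32427, 32428, 32429, 32430, 32432)
ny_price <- c(1189.75, 1189.665, 1189.895, 1190.15, 1190.075, 1190.01, 1190.175,
              1190.12, 1190.14, 1190.205, 1190.2, 1190.33, 1190.29, 1190.28,
              1190.05, 1190.04)

# For each London time, findInterval() gives the index of the last New York
# time that is <= it, or 0 when no such time exists. ny_time must be sorted.
idx <- findInterval(london_time, ny_time)
idx[idx == 0] <- NA  # no earlier-or-equal New York row: report NA

matched <- data.frame(LondonTime = london_time,
                      NYTime = ny_time[idx],
                      NYPrice = ny_price[idx])
print(matched)
```

With several dates, the same lookup would have to be applied per date (for example by splitting on the date column), which is the bookkeeping the keyed rolling join handles for you.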

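【Comments】:

Thanks for posting. It runs in about 1 second instead of 24 hours. Would you mind explaining? Awesome.

There is an excellent blog post on rolling joins: gormanalysis.com/r-data-table-rolling-joins

@GeekCat Don't forget to mark the question as answered.

@dimitris_ps I have run into an issue: I may not want an exact match. I want to find the nearest earlier data, but not data at the same time. How can I do that? I tried shifting the data, but it does not seem to give the right result.

@GeekCat You could try something like: dt[, time.seconds := time.seconds * 10L] on both data sets, then dt[, time.seconds := time.seconds - 1L] on only one data set, so you should never have identical values.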
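
The scale-and-shift idea in the last comment can be sketched as follows. This is a minimal sketch rather than the commenter's exact code. It assumes the times are integer second counts, and it assumes the shift is applied to the London table (the lookup side), which is what makes the match strictly earlier. The table names london and ny and the helper columns LondonTime and NYTime are hypothetical, introduced here only to keep the unscaled times visible.

```r
library(data.table)

# Toy data for one date; values taken from the sample data in the question.
london <- data.table(Date = "2015-01-05",
                     time.second = c(32417L, 32418L, 32431L))
ny <- data.table(Date = "2015-01-05",
                 time.second = c(32416L, 32417L, 32418L, 32430L),
                 NYPrice = c(1189.75, 1189.665, 1189.895, 1190.05))

# Keep the original times, because the join key is about to be rescaled.
london[, LondonTime := time.second]
ny[, NYTime := time.second]

# Scale both keys by 10, then shift only the London key down by 1, so that a
# London key can never equal a New York key.
london[, time.second := time.second * 10L - 1L]
ny[, time.second := time.second * 10L]

setkey(london, Date, time.second)
setkey(ny, Date, time.second)

# Rolling join: each London row gets the last New York row with a strictly
# earlier time on the same date.
strict <- ny[london, roll = TRUE]
print(strict[, .(Date, LondonTime, NYTime, NYPrice)])
```

In this toy data the London time 32417 pairs with the New York time 32416 rather than with the identical 32417, which is the behaviour the follow-up question asks for.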
