Find overlapping regions between two data frames

Posted 2023-04-18

【Title】Find Overlapping Regions between two dataframes 【Posted】2018-05-30 03:47:10 【Question】:

I have two data frames as follows:

Data1:

chr19   45770502    45770503    5.26315789473684
chr19   45770513    45770514    3.17460317460317
chr19   45770516    45770517    6.56063618290259
chr19   45770526    45770527    7.3558648111332
chr19   45770538    45770539    5.81162324649299
chr19   45770539    45770540    0
chr19   45770541    45770542    6.85483870967742
chr19   47430080    47430081    0
chr19   47430099    47430100    0
chr19   47430113    47430114    0
chr19   47430127    47430128    0
chr19   47430164    47430165    0
chr19   47430166    47430167    0
chr19   47430175    47430176    0
chr19   47430187    47430188    0
chr19   47430189    47430190    0
chr19   47430191    47430192    0
chr19   47430196    47430197    0
chr19   47430205    47430206    0
chr19   47430208    47430209    0
chr19   47430211    47430212    0
chr19   47430222    47430223    0
chr19   47430228    47430229    0
chr7    23904987    23904988    0
chr7    23904990    23904991    0

Data2:

chr19   45770509    45777447    uc061acd.1  0   -   45770509    45777447    0   5   131,98,112,86,121,  0,1058,2131,4439,6817,
chr19   45770921    45772712    uc061ace.1  0   -   45771157    45772712    0   4   475,98,158,72,  0,646,849,1719,
chr19   45770981    45772504    uc061acf.1  0   +   45770981    45770981    0   3   98,186,199, 0,508,1324,
chr19   45770995    45772504    uc061acg.1  0   +   45770995    45770995    0   3   84,95,199,  0,594,1310,
chr19   45771012    45772504    uc061ach.1  0   +   45771012    45771012    0   3   67,86,199,  0,577,1293,
chr19   45771532    45775268    uc061aci.1  0   -   45771532    45771532    0   4   133,158,112,320,    0,238,1108,3416,
chr19   45774947    45777037    uc061acj.1  0   -   45774947    45774947    0   2   87,379, 0,1711,

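I want to create an output that gives the start and end positions of the region where Data1 and Data2 overlap, together with the sum of the values in column 4 of Data1 over that region.

Example output:

chr19 45770513 45770542 35

That is, I want to sum the values in column 4 of Data1 for the rows whose start and end positions overlap with Data2.

How can I produce output in this format for every overlap, given that the chromosome (chr) can change from row to row?

Thanks in advance.

【Question discussion】:

【Answer 1】: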

If I understood the question correctly, you can try a data.table approach:

library(data.table)

# Convert the sample data to data.tables
DT1 <- as.data.table(df1)
DT2 <- as.data.table(df2)

# Tag each row of DT1 that lies entirely within a DT2 interval on the same
# chromosome with that interval's range (stored in the 'pos_range' column)
DT1[DT2, pos_range := paste(V2, V3, sep = '-'),
    on = .(col1 == V1, col2 >= V2, col3 <= V3)]
# Rows with no match (NA) get their own row number as pos_range, so that each
# of them forms a separate group in the summary below
DT1[, pos_range := ifelse(is.na(pos_range), .I, pos_range)]

# Summarise per chromosome and pos_range: first start, last end, sum of col4
DT <- unique(DT1[, c("start_pos", "end_pos", "value_sum") := list(first(col2), last(col3), sum(col4)),
                 .(col1, pos_range)][, .(col1, start_pos, end_pos, value_sum)])

The output is:

> DT
     col1 start_pos  end_pos value_sum
 1: chr19  45770502 45770503  5.263158
 2: chr19  45770513 45770542 29.757566
 3: chr19  47430080 47430081  0.000000
 4: chr19  47430099 47430100  0.000000
...
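
As a sanity check on the value_sum of the merged chr19 region, the same number can be reproduced in base R. This is only a sketch: the vectors are copied from the first seven rows of Data1, the bounds from the first row of Data2, and all variable names are made up for illustration.

```r
# First seven chr19 rows of Data1: start position and value (end = start + 1)
start <- c(45770502L, 45770513L, 45770516L, 45770526L,
           45770538L, 45770539L, 45770541L)
end   <- start + 1L
value <- c(5.26315789473684, 3.17460317460317, 6.56063618290259,
           7.3558648111332, 5.81162324649299, 0, 6.85483870967742)

# Bounds of the first Data2 interval (uc061acd.1)
lo <- 45770509L
hi <- 45777447L

# A Data1 row counts only if it lies entirely inside the interval
inside <- start >= lo & end <= hi

region_start <- min(start[inside])
region_end   <- max(end[inside])
value_sum    <- sum(value[inside])

print(c(region_start, region_end))
print(value_sum)
```

The first row (45770502 to 45770503) starts before the interval, so it is left out. Adding its value would give roughly 35.02, which is close to the 35 in the question's example.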

Update: if you only want the overlapping rows, simply drop the rows of DT1 whose pos_range column is NA:

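library(data.table)
DT1 <- as.data.table(df1)
DT2 <- as.data.table(df2)

# Tag rows of DT1 that lie within a DT2 interval on the same chromosome,
# then keep only the tagged rows
DT <- DT1[DT2, pos_range := paste(V2, V3, sep = '-'),
          on = .(col1 == V1, col2 >= V2, col3 <= V3)][!is.na(pos_range), .(col1, col2, col3, col4, pos_range)]
# Summarise per chromosome and pos_range: first start, last end, sum of col4
DT <- unique(DT[, c("start_pos", "end_pos", "value_sum") := list(first(col2), last(col3), sum(col4)),
                .(col1, pos_range)][, .(col1, start_pos, end_pos, value_sum)])
DT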
#    col1 start_pos  end_pos value_sum
#1: chr19  45770513 45770542  29.75757
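
The join condition used here keeps only Data1 rows that lie entirely inside a Data2 interval. If rows that only partially overlap an interval should count as well, data.table's foverlaps() is one option. Below is a minimal sketch on made-up toy data; the tables x and y and their column names are assumptions for illustration, not the asker's objects.

```r
library(data.table)

# Toy stand-in for Data1: the second row straddles the interval boundary
x <- data.table(chr   = c("chr1", "chr1", "chr1", "chr2"),
                start = c(5L, 8L, 20L, 12L),
                end   = c(6L, 12L, 21L, 13L),
                value = c(1, 2, 4, 8))

# Toy stand-in for Data2
y <- data.table(chr = "chr1", start = 10L, end = 30L, name = "geneA")

# foverlaps() needs y to be keyed, with the two interval columns last
setkey(y, chr, start, end)

# type = "any" keeps partial overlaps; nomatch = NULL drops unmatched rows
ov <- foverlaps(x, y, by.x = c("chr", "start", "end"),
                type = "any", nomatch = NULL)

# Columns of x that clash with y come back prefixed with "i."; rename them
setnames(ov, c("i.start", "i.end"), c("x_start", "x_end"))

# One row per y interval: span of the overlapping x rows and their summed value
res <- ov[, .(start = min(x_start), end = max(x_end), value_sum = sum(value)),
          by = .(chr, name)]
print(res)
```

Switching to type = "within" should restrict the result to rows fully contained in an interval, which is the behaviour of the non-equi join in this answer.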

Sample data:

df1 <- structure(list(col1 = c("chr19", "chr19", "chr19", "chr19", "chr19", 
"chr19", "chr19", "chr19", "chr19", "chr19", "chr19", "chr19", 
"chr19", "chr19", "chr19", "chr19", "chr19", "chr19", "chr19", 
"chr19", "chr19", "chr19", "chr19", "chr7", "chr7"), col2 = c(45770502L, 
45770513L, 45770516L, 45770526L, 45770538L, 45770539L, 45770541L, 
47430080L, 47430099L, 47430113L, 47430127L, 47430164L, 47430166L, 
47430175L, 47430187L, 47430189L, 47430191L, 47430196L, 47430205L, 
47430208L, 47430211L, 47430222L, 47430228L, 23904987L, 23904990L
), col3 = c(45770503L, 45770514L, 45770517L, 45770527L, 45770539L, 
45770540L, 45770542L, 47430081L, 47430100L, 47430114L, 47430128L, 
47430165L, 47430167L, 47430176L, 47430188L, 47430190L, 47430192L, 
47430197L, 47430206L, 47430209L, 47430212L, 47430223L, 47430229L, 
23904988L, 23904991L), col4 = c(5.26315789473684, 3.17460317460317, 
6.56063618290259, 7.3558648111332, 5.81162324649299, 0, 6.85483870967742, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("col1", 
"col2", "col3", "col4"), class = "data.frame", row.names = c(NA, 
-25L))

df2 <- structure(list(V1 = c("chr19", "chr19", "chr19", "chr19", "chr19", 
"chr19", "chr19"), V2 = c(45770509L, 45770921L, 45770981L, 45770995L, 
45771012L, 45771532L, 45774947L), V3 = c(45777447L, 45772712L, 
45772504L, 45772504L, 45772504L, 45775268L, 45777037L), V4 = c("uc061acd.1", 
"uc061ace.1", "uc061acf.1", "uc061acg.1", "uc061ach.1", "uc061aci.1", 
"uc061acj.1"), V5 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), V6 = c("-", 
"-", "+", "+", "+", "-", "-"), V7 = c(45770509L, 45771157L, 45770981L, 
45770995L, 45771012L, 45771532L, 45774947L), V8 = c(45777447L, 
45772712L, 45770981L, 45770995L, 45771012L, 45771532L, 45774947L
), V9 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), V10 = c(5L, 4L, 3L, 3L, 
3L, 4L, 2L), V11 = c("131,98,112,86,121,", "475,98,158,72,", 
"98,186,199,", "84,95,199,", "67,86,199,", "133,158,112,320,", 
"87,379,"), V12 = c("0,1058,2131,4439,6817,", "0,646,849,1719,", 
"0,508,1324,", "0,594,1310,", "0,577,1293,", "0,238,1108,3416,", 
"0,1711,")), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7", 
"V8", "V9", "V10", "V11", "V12"), class = "data.frame", row.names = c(NA, 
-7L))

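【Discussion】:

Hi, this solution works really well. But what if I only want to keep the overlapping positions for which a sum is calculated? How can I do that? This output lists all positions rather than only those where there is an overlap and a sum is calculated.

Glad it helped! Regarding the question above, please see the updated answer.

【Answer 2】:

You can try GenomicRanges from Bioconductor. Here is a not-so-clean solution: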

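suppressPackageStartupMessages(library(GenomicRanges))

# dtt1 and dtt2 are assumed to hold Data1 and Data2 with default column names (V1, V2, ...)
gr1 <- GRanges(seqnames = dtt1$V1, ranges = IRanges(start = dtt1$V2, end = dtt1$V3), score = dtt1$V4)

gr2 <- GRanges(seqnames = dtt2$V1, ranges = IRanges(start = dtt2$V2, end = dtt2$V3))

# Overlapping pairs: queryHits() indexes gr1, subjectHits() indexes gr2
x <- findOverlaps(gr1, gr2)

# For each gr2 range, merge the gr1 ranges that overlap it into a single range
# and attach the sum of their scores
y <- lapply(split(gr1[queryHits(x)], subjectHits(x)), function(g) {
    res <- reduce(g, min.gapwidth = max(end(g)) - min(start(g)))
    score(res) <- sum(score(g))
    res
})
as(y, 'GRangesList')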

# GRangesList object of length 1:
# $1 
# GRanges object with 1 range and 1 metadata column:
#       seqnames               ranges strand |            score
#          <Rle>            <IRanges>  <Rle> |        <numeric>
#   [1]    chr19 [45770513, 45770542]      * | 29.7575661248094

You can also do this with a non-equi join in data.table, or with the command-line tool bedtools.
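
A non-equi join of that kind might look like the sketch below. It runs on made-up toy data; the tables d1 and d2, their column names, and the transcript names txA and txB are assumptions for illustration. Unlike an update join, this returns one row per matching pair, so a position that falls inside several intervals is counted once for each of them.

```r
library(data.table)

# Toy stand-in for Data1 (positions with a value)
d1 <- data.table(chr   = c("chr19", "chr19", "chr19", "chr7"),
                 start = c(100L, 110L, 140L, 110L),
                 end   = c(101L, 111L, 141L, 111L),
                 value = c(5.25, 3.5, 6.75, 1))

# Toy stand-in for Data2 (two nested intervals on chr19)
d2 <- data.table(chr   = c("chr19", "chr19"),
                 start = c(105L, 130L),
                 end   = c(200L, 150L),
                 name  = c("txA", "txB"))

# Same chromosome, and the d1 row must lie entirely inside the d2 interval.
# x.start / x.end refer to d1's own coordinates; nomatch = NULL drops
# intervals without any matching position.
hits <- d1[d2, on = .(chr, start >= start, end <= end), nomatch = NULL,
           .(chr, name, pos_start = x.start, pos_end = x.end, value)]

# One row per interval: span of the matching positions and their summed value
res <- hits[, .(start = min(pos_start), end = max(pos_end),
                value_sum = sum(value)),
            by = .(chr, name)]
print(res)
```

Grouping by the interval name rather than by a pasted range string keeps intervals with identical coordinates apart, which may or may not be what is wanted.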

【Discussion】:
