在pandas df中,对列的值在范围内的行进行分组。
Posted
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了在pandas df中,对列的值在范围内的行进行分组。相关的知识,希望对你有一定的参考价值。
我有一个大熊猫的DF。
number sample chrom1 start chrom2 end
1 s1 1 0 2 1500
2 s1 2 10 2 50
19 s2 3 3098318 3 3125700
19 s3 3 3098720 3 3125870
20 s4 3 3125694 3 3126976
20 s1 3 3125694 3 3126976
20 s1 3 3125695 3 3126976
20 s5 3 3125700 3 3126976
21 s3 3 3125870 3 3134920
22 s2 3 3126976 3 3135039
24 s5 3 17286051 3 17311472
25 s2 3 17286052 3 17294628
26 s4 3 17286052 3 17311472
26 s1 3 17286052 3 17311472
27 s3 3 17286405 3 17294550
28 s4 3 17293197 3 17294628
28 s1 3 17293197 3 17294628
28 s5 3 17293199 3 17294628
29 s2 3 17294628 3 17311472
我想把数字不同的线进行分组,但在这里... start
是在 +/- 10
AND 大限已到 +/- 10
在同一染色体上。
在这个例子中,我想找到这两条线。
24 s5 3 17286051 3 17311472
26 s4 3 17286052 3 17311472
这两条线都有相同的 chrom1
[3]
和 chrom2
[3]
和 start
和终值是 +/- 10
并将它们归入同一编号。
24 s5 3 17286051 3 17311472
24 s4 3 17286052 3 17311472 # Change the number to the first seen in this series
这就是我的尝试
import pandas as pd
from collections import defaultdict
def parse_vars(inFile):
df = pd.read_csv(inFile, delimiter=" ")
df = df[['number', 'chrom1', 'start', 'chrom2', 'end']]
vars = {}
seen_l = defaultdict(lambda: defaultdict(dict)) # To track the `starts`
seen_r = defaultdict(lambda: defaultdict(dict)) # To track the `ends`
for index in df.index:
event = df.loc[index, 'number']
c1 = df.loc[index, 'chrom1']
b1 = int(df.loc[index, 'start'])
c2 = df.loc[index, 'chrom2']
b2 = int(df.loc[index, 'end'])
print [event, c1, b1, c2, b2]
vars[event] = [c1, b1, c2, b2]
# Iterate over windows +/- 10
for i, j in zip( range(b1-10, b1+10), range(b2-10, b2+10) ):
# if :
# i in seen_l[c1] AND
# j in seen_r[c2] AND
# the 'number' for these two instances is the same:
if i in seen_l[c1] and j in seen_r[c2] and seen_l[c1][i] == seen_r[c2][j]:
print seen_l[c1][i], seen_r[c2][j]
if seen_l[c1][i] != event: print"Seen: %s %s in event %s %s" % (event, [c1, b1, c2, b2], seen_l[c1][i], vars[seen_l[c1][i]])
seen_l[c1][b1] = event
seen_r[c2][b2] = event
我遇到的问题是 seen_l[3][17286052]
并存 numbers
25
和 26
以及作为各自的 seen_r
事件(seen_r[3][17294628] = 25
, seen_r[3][17311472] = 26
)不相等,我无法将这些线条连接起来。
有什么方法可以让我用一个 start
的嵌套键,作为 seen_l
dict?
答案
区间重叠很容易在 黄种人. 下面的大部分代码是将开始和结束分成两个不同的dfs。然后根据+-10的间隔重合度将它们连接起来。
from io import StringIO
import pandas as pd
import pyranges as pr
c = """number sample chrom1 start chrom2 end
1 s1 1 0 2 1500
2 s1 2 10 2 50
19 s2 3 3098318 3 3125700
19 s3 3 3098720 3 3125870
20 s4 3 3125694 3 3126976
20 s1 3 3125694 3 3126976
20 s1 3 3125695 3 3126976
20 s5 3 3125700 3 3126976
21 s3 3 3125870 3 3134920
22 s2 3 3126976 3 3135039
24 s5 3 17286051 3 17311472
25 s2 3 17286052 3 17294628
26 s4 3 17286052 3 17311472
26 s1 3 17286052 3 17311472
27 s3 3 17286405 3 17294550
28 s4 3 17293197 3 17294628
28 s1 3 17293197 3 17294628
28 s5 3 17293199 3 17294628
29 s2 3 17294628 3 17311472"""
df = pd.read_table(StringIO(c), sep="s+")
df1 = df[["chrom1", "start", "number", "sample"]]
df1.insert(2, "end", df.start + 1)
df2 = df[["chrom2", "end", "number", "sample"]]
df2.insert(2, "start", df.end - 1)
names = ["Chromosome", "Start", "End", "number", "sample"]
df1.columns = names
df2.columns = names
gr1, gr2 = pr.PyRanges(df1), pr.PyRanges(df2)
j = gr1.join(gr2, slack=10)
# +--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------+
# | Chromosome | Start | End | number | sample | Start_b | End_b | number_b | sample_b |
# | (category) | (int32) | (int32) | (int64) | (object) | (int32) | (int32) | (int64) | (object) |
# |--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------|
# | 3 | 3125694 | 3125695 | 20 | s4 | 3125700 | 3125699 | 19 | s2 |
# | 3 | 3125694 | 3125695 | 20 | s1 | 3125700 | 3125699 | 19 | s2 |
# | 3 | 3125695 | 3125696 | 20 | s1 | 3125700 | 3125699 | 19 | s2 |
# | 3 | 3125700 | 3125701 | 20 | s5 | 3125700 | 3125699 | 19 | s2 |
# | ... | ... | ... | ... | ... | ... | ... | ... | ... |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 25 | s2 |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 28 | s5 |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 28 | s1 |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 28 | s4 |
# +--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------+
# Unstranded PyRanges object has 13 rows and 9 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
# to get the data as a pandas df:
jdf = j.df
以上是关于在pandas df中,对列的值在范围内的行进行分组。的主要内容,如果未能解决你的问题,请参考以下文章
python&pandas:列表中具有值的子集数据框[重复]