Comparing row values in a pandas DataFrame
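[Title]: Comparing row values in pandas dataframe [Posted]: 2014-11-24 16:11:20 [Question]: I have data in a pandas dataframe in which two columns contain numeric sequences (start and stop). I want to identify which rows have a stop value that overlaps the start value of the next row. I then need to concatenate them into a single row, so that I end up with a single non-overlapping numeric sequence represented by the start and stop values in each row.
I have loaded the data into a pandas dataframe: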
     chr     start      stop geneID
0  chr13  32889584  32889814  BRCA2
1  chr13  32890536  32890737  BRCA2
2  chr13  32893194  32893307  BRCA2
3  chr13  32893282  32893400  BRCA2
4  chr13  32893363  32893466  BRCA2
5  chr13  32899127  32899242  BRCA2
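I want to compare the rows in the dataframe: check whether the stop value of each row is less than the start value of the next row, and then create a row in a new dataframe with the correct start and stop values. Ideally, when several rows all overlap, they would be concatenated in one go, but I suspect I will have to iterate over my output until no overlaps remain.
My code so far can identify whether there is an overlap (adapted from this post):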
import pandas as pd
import numpy as np
columns = ['chr','start','stop','geneID']
# r'\s+' splits on any run of whitespace; regex separators need the python engine
bed = pd.read_table('bedfile.txt', sep=r'\s+', names=columns, engine='python')
def bed_prepare(inp_bed):
    inp_bed['next_start'] = inp_bed['start'].shift(periods=-1)
    inp_bed['distance_to_next'] = inp_bed['next_start'] - inp_bed['stop']
    inp_bed['next_region_overlap'] = inp_bed['next_start'] < inp_bed['stop']
    intermediate_bed = inp_bed
    return intermediate_bed
This gives me output like this:
print(bed_prepare(bed))
     chr     start      stop geneID  next_start  distance_to_next next_region_overlap
0  chr13  32889584  32889814  BRCA2    32890536               722               False
1  chr13  32890536  32890737  BRCA2    32893194              2457               False
2  chr13  32893194  32893307  BRCA2    32893282               -25                True
3  chr13  32893282  32893400  BRCA2    32893363               -37                True
4  chr13  32893363  32893466  BRCA2    32899127              5661               False
I want to pass this intermediate dataframe into the following function to get the desired output (shown below):
new_bed = pd.DataFrame(data=np.zeros((0,len(columns))),columns=columns)
def bed_collapse(intermediate_bed, new_bed,columns=columns):
    for row in bed.itertuples():
        output = {}
        if row[7] == False:
            # If row doesn't overlap next row, insert into new dataframe unchanged.
            output_row = list(row[1:5])
        if row[7] == True:
            # For overlapping rows take the chromosome and start coordinate
            output_row = list(row[1:3])
            # Iterate to next row
            bed.itertuples().next()
            # Append stop coordinate and geneID
            output_row.append(row[3])
            output_row.append(row[4])
        #print output_row
        for k, v in zip(columns,output_row): output[k] = v
        #print output
        new_bed = new_bed.append(output,ignore_index=True)
    output_bed = new_bed
    return output_bed
int_bed = bed_prepare(bed)
print(bed_collapse(int_bed,new_bed))
Desired output:
     chr     start      stop geneID
0  chr13  32889584  32889814  BRCA2
1  chr13  32890536  32890737  BRCA2
2  chr13  32893194  32893466  BRCA2
5  chr13  32899127  32899242  BRCA2
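However, when I run the function, I get my original dataframe back unchanged. I know the problem is where I call bed.itertuples().next(), because that is clearly not the right syntax or place for the call, but I don't know the correct way to fix it.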
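A note on why that call has no effect: every call to itertuples() builds a fresh iterator, so calling next on it returns the first row again and does not touch the iterator the for loop is using. The sketch below shows the difference on a toy frame (the frame and the names rows, seen and skipped are illustrative, not from the post). It uses the built-in next(), which works in Python 2.6+ and Python 3.

```python
import pandas as pd

df = pd.DataFrame({"start": [1, 5, 8], "stop": [4, 9, 12]})

# Each call to itertuples() creates a new iterator, so both calls
# below return the first row and advance nothing else.
first_a = next(df.itertuples())
first_b = next(df.itertuples())

# Holding one iterator in a variable lets next() advance the same
# iterator that the for loop is consuming.
rows = df.itertuples()
seen = []
for row in rows:
    seen.append(row.Index)
    if row.Index == 0:
        skipped = next(rows)  # consumes row 1, so the loop resumes at row 2

print(first_a.Index, first_b.Index, seen, skipped.Index)
```

Advancing a shared iterator by hand works, but it is easy to run off the end of the data, so tracking state between rows (as in the answers) is usually simpler.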
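Some pointers would be great.
SB :)
Update
This is a BED file, in which each row refers to an amplicon (genomic region) with start and stop coordinates. Some amplicons overlap, i.e. the start coordinate lies before the stop coordinate of the previous row. I therefore need to identify which rows overlap and join up the correct start and stop, so that each row represents a completely distinct amplicon that does not overlap any other row.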
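The merge described here can also be done without an explicit loop. The following is a minimal sketch, not code from the thread: merge_overlaps is a hypothetical helper, it assumes rows should only be merged within the same chr and geneID, and it treats a start equal to the previous stop as overlapping (change <= to < if touching regions should stay separate). A running maximum of stop marks where each block of overlapping rows ends, and a cumulative sum over the "new block" flags gives each block an id to group on.

```python
import pandas as pd

bed = pd.DataFrame({
    "chr": ["chr13"] * 6,
    "start": [32889584, 32890536, 32893194, 32893282, 32893363, 32899127],
    "stop": [32889814, 32890737, 32893307, 32893400, 32893466, 32899242],
    "geneID": ["BRCA2"] * 6,
})

def merge_overlaps(df):
    """Hypothetical helper: collapse overlapping rows within each chr/geneID."""
    df = df.sort_values(["chr", "geneID", "start"]).reset_index(drop=True)
    # Furthest stop seen among the earlier rows of the same chr/geneID group.
    prev_max_stop = df.groupby(["chr", "geneID"])["stop"].transform(
        lambda s: s.cummax().shift()
    )
    # A row opens a new block unless it starts at or before that furthest stop.
    # The first row of each group compares against NaN, which is False, hence ~.
    new_block = ~(df["start"] <= prev_max_stop)
    block_id = new_block.cumsum()
    merged = df.groupby(block_id).agg(
        chr=("chr", "first"),
        start=("start", "min"),
        stop=("stop", "max"),
        geneID=("geneID", "first"),
    )
    return merged.reset_index(drop=True)

merged = merge_overlaps(bed)
print(merged)
```

On the sample data this collapses rows 2 to 4 into a single region from 32893194 to 32893466 and leaves the other three rows as they are. Using the running maximum rather than only the previous row's stop also covers a region nested entirely inside an earlier one.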
[Question comments]:
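[Answer 1]: I'll try to give you a few pointers.
The first pointer is that you want to select rows based on a series of shifted boolean values. You can probably get the shifted series like this:
Boolean_Series = intermediate_bed.loc[:,'next_region_overlap'].shift(periods=1, fill_value=False).astype(bool)
More background on this function: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.shift.html
The second pointer is that, using this shifted series, you can select from the dataframe like this: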
int_bed = bed.loc[Boolean_Series, :]
More information on indexing can be found here: http://pandas.pydata.org/pandas-docs/dev/indexing.html
These are only pointers for now; I don't know whether this is actually a workable solution.
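To make the two pointers concrete, here is a small sketch on made-up coordinates (the frame and the names overlaps_next, overlaps_prev and region_starts are illustrative). It assumes a pandas version where shift accepts fill_value; the astype(bool) call is a safeguard so that the mask can be negated with ~.

```python
import pandas as pd

bed = pd.DataFrame({
    "start": [10, 50, 90, 95, 200],
    "stop": [20, 60, 100, 120, 210],
})

# True where the next row starts before this row stops.
overlaps_next = bed["start"].shift(-1) < bed["stop"]

# Shift the flags down one row: True where the previous row overlaps this one.
# fill_value avoids a NaN in the first position.
overlaps_prev = overlaps_next.shift(1, fill_value=False).astype(bool)

# Boolean indexing with .loc keeps only the rows that open a new region.
region_starts = bed.loc[~overlaps_prev]
print(region_starts)
```

This only selects rows; the stop value of each merged region still has to be taken from the last row of its overlapping run.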
[Comments]:
[Answer 2]: I modified the bed_prepare function to check for overlap with both the previous and the next genomic region:
def bed_prepare(inp_bed):
    ''' Takes pandas dataframe bed file and identifies which regions overlap '''
    inp_bed['next_start'] = inp_bed['start'].shift(periods=-1)
    inp_bed['distance_to_next'] = inp_bed['next_start'] - inp_bed['stop']
    inp_bed['next_region_overlap'] = inp_bed['next_start'] <= inp_bed['stop']
    inp_bed['previous_stop'] = inp_bed['stop'].shift(periods=1)
    inp_bed['distance_from_previous'] = inp_bed['start'] - inp_bed['previous_stop']
    inp_bed['previous_region_overlap'] = inp_bed['previous_stop'] >= inp_bed['start']
    intermediate_bed = inp_bed
    return intermediate_bed
I then use these boolean outputs to decide what to store in variables for the write step:
# Create empty dataframe to fill with parsed values
new_bed = pd.DataFrame(data=np.zeros((0,len(columns))),columns=columns,dtype=int)
def bed_collapse(intermediate_bed, new_bed,columns=columns):
    ''' Takes a pandas dataframe bed file with overlap information and returns
    genomic regions without overlaps '''
    output_row = []
    # row[7] is next_region_overlap, row[10] is previous_region_overlap
    for row in intermediate_bed.itertuples():
        output = {}
        if row[7] == False and row[10] == False:
            # If row doesn't overlap next row, insert into new dataframe unchanged.
            output_row = list(row[1:5])
        elif row[7] == True and row[10] == False:
            # Only next region overlaps; take the chromosome and start coordinate
            output_row = list(row[1:3])
        elif row[7] == True and row[10] == True:
            # Next and previous regions overlap. Skip row.
            pass
        elif row[7] == False and row[10] == True:
            # Only previous region overlaps; append stop coordinate and geneID to output_row variable
            output_row.append(row[3])
            output_row.append(row[4])
        if row[7] == False:
            #Zip columns and output_row values together to form a dict for appending
            for k, v in zip(columns,output_row): output[k] = v
            #print output
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            new_bed = pd.concat([new_bed, pd.DataFrame([output])], ignore_index=True)
    output_bed = new_bed
    return output_bed
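This has now solved my problem and gives the desired output specified in the question. :)
[Comments]:
[Answer 3]: I'm not sure I understand why you are doing what you are doing, but you can get the output you want simply by using indexing. For example: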
# assume your data is stored in <df>
# call the temporary dataframe <tmp>
tmp = df[ ['chr','start','stop','geneID'] ][(df.stop - df.start.shift(-1))>0]
Is that what you ultimately want to do?
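For reference, a small sketch of what that mask selects on the sample data from the question. The frame is rebuilt here so the snippet is self-contained, and the name overlaps_next is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "chr": ["chr13"] * 6,
    "start": [32889584, 32890536, 32893194, 32893282, 32893363, 32899127],
    "stop": [32889814, 32890737, 32893307, 32893400, 32893466, 32899242],
    "geneID": ["BRCA2"] * 6,
})

# True where this row's stop extends past the next row's start.
# The last row compares against NaN and is therefore False.
overlaps_next = (df["stop"] - df["start"].shift(-1)) > 0
tmp = df.loc[overlaps_next, ["chr", "start", "stop", "geneID"]]
print(tmp)
```

The result contains only rows 2 and 3, so the mask flags the rows that run into their successor but does not merge anything by itself.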
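Update: OK, I see what you're doing. Bear in mind that I have never worked with genomic data, so I don't know how many rows you have, and a simple loop may be slow (with a few billion rows this could take a while), but it is the only solution that comes to mind. Here is my first attempt (note: this is not a finished product, because you still need to decide how to handle the NaNs that get introduced and how to handle loop termination).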
import numpy as np
import pandas as pd
df = pd.DataFrame(index = [0,1,2,3,4,5],columns=['chr','start','stop','geneID'])
df['chr'] = np.array( ['chr13']*6 )
df['start'] = np.array( [32889584,32890536,32893194,32893282,32893363,32899127] )
df['stop'] = np.array( [32889814,32890737,32893307,32893400,32893466,32899242] )
df['geneID'] = np.array( ['BRCA2']*6 )
# calculate difference between start/stop times for adjacent rows
# this will effectively "look into the future" to see if the upcoming row has
# a start time that is greater than the current stop time
df['tdiff'] = (df.start - df.stop.shift(1)).shift(-1)
# create new dataframe
df_cut = df.copy()*0
r = 0
while r < df.shape[0]:
    if df.tdiff.iloc[r] > 0:
        df_cut.iloc[r] = df.iloc[r]
        r+=1
    elif df.tdiff.iloc[r] < 0: # have to determine how you will handle the NaN's later
        df_cut.loc[r, 'chr'] = df.loc[r, 'chr']
        df_cut.loc[r, 'start'] = df.loc[r, 'start']
        df_cut.loc[r, 'geneID'] = df.loc[r, 'geneID']
        # get the next-valid row and put "stop" value into <df_cut>
        # (.ix was removed from pandas; .loc slices by label instead)
        nxt = df.loc[r:]
        nxt = nxt[nxt.tdiff > 0]
        df_cut.loc[r, 'stop'] = nxt.stop.iloc[0]
        # determine new index location for <r>
        r = nxt.index[0] + 1
    else:
        # tdiff is NaN on the last row (or exactly 0): stop rather than loop forever
        break
# eliminate empty rows
df_cut = df_cut[df_cut.start != 0]
After running:
>>> df_cut
chr start stop geneID tdiff
0 chr13 32889584 32889814 BRCA2 722
1 chr13 32890536 32890737 BRCA2 2457
2 chr13 32893194 32893466 BRCA2 -0
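[Comments]:
Thanks for your help, but this isn't quite what I'm after. Your code gives me a new dataframe that is identical to my intermediate dataframe. I have updated the original post to make it clearer. Hope this helps. :)
[Answer 4]: pyranges lets you do this very quickly in a single line of code: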
import pyranges as pr
c = """Chromosome Start End geneID
chr13 32889584 32889814 BRCA2
chr13 32890536 32890737 BRCA2
chr13 32893194 32893307 BRCA2
chr13 32893282 32893400 BRCA2
chr13 32893363 32893466 BRCA2
chr13 32899127 32899242 BRCA2"""
gr = pr.from_string(c)
# +--------------+-----------+-----------+------------+
# | Chromosome | Start | End | geneID |
# | (category) | (int32) | (int32) | (object) |
# |--------------+-----------+-----------+------------|
# | chr13 | 32889584 | 32889814 | BRCA2 |
# | chr13 | 32890536 | 32890737 | BRCA2 |
# | chr13 | 32893194 | 32893307 | BRCA2 |
# | chr13 | 32893282 | 32893400 | BRCA2 |
# | chr13 | 32893363 | 32893466 | BRCA2 |
# | chr13 | 32899127 | 32899242 | BRCA2 |
# +--------------+-----------+-----------+------------+
# Unstranded PyRanges object has 6 rows and 4 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
m = gr.merge(by="geneID")
# +--------------+-----------+-----------+------------+
# | Chromosome | Start | End | geneID |
# | (category) | (int32) | (int32) | (object) |
# |--------------+-----------+-----------+------------|
# | chr13 | 32889584 | 32889814 | BRCA2 |
# | chr13 | 32890536 | 32890737 | BRCA2 |
# | chr13 | 32893194 | 32893466 | BRCA2 |
# | chr13 | 32899127 | 32899242 | BRCA2 |
# +--------------+-----------+-----------+------------+
# Unstranded PyRanges object has 4 rows and 4 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
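Note that by="geneID" means intervals are merged only if they overlap and have the same geneID value. If you want to merge interval metadata with a custom function, see also the cluster method.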
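To illustrate what merging "only within the same geneID" means, here is a plain-Python sort-and-sweep sketch. It is not how pyranges is implemented, and merge_intervals is a hypothetical helper. It assumes records are (chrom, start, end, gene_id) tuples and treats an interval that starts exactly at the current end as overlapping, which may differ from the library's own convention.

```python
from itertools import groupby

def merge_intervals(records):
    """Hypothetical helper: merge overlapping intervals per (chrom, gene_id)."""
    # Sort so that each (chrom, gene_id) group is contiguous and ordered by start.
    ordered = sorted(records, key=lambda r: (r[0], r[3], r[1]))
    merged = []
    for (chrom, gene), group in groupby(ordered, key=lambda r: (r[0], r[3])):
        cur_start = cur_end = None
        for _, start, end, _ in group:
            if cur_end is not None and start <= cur_end:
                # Overlaps the open interval: extend it if this one reaches further.
                cur_end = max(cur_end, end)
            else:
                # Gap found: emit the open interval (if any) and start a new one.
                if cur_end is not None:
                    merged.append((chrom, cur_start, cur_end, gene))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, gene))
    return merged

records = [
    ("chr13", 100, 200, "BRCA2"),
    ("chr13", 150, 250, "BRCA2"),
    ("chr13", 180, 300, "OTHER"),  # overlaps the others but has a different gene
]
result = merge_intervals(records)
print(result)
```

The two BRCA2 intervals are merged into one, while the OTHER interval stays separate even though its coordinates overlap them.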
[Comments]: