Accessing pandas data a million times - need to improve efficiency

Question

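I am a biologist trying to validate an experiment. In my experiment, I found 71 mutations after a particular treatment. To determine whether these mutations are truly due to my treatment, I want to compare them to a set of randomly generated mutations. It was suggested to me that I might try generating a million sets of 71 random mutations for statistical comparison.

To start with, I have a dataframe with the 7000 genes in the genome of interest. I know their start and end positions. The first five rows of the dataframe look like this: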

    transcript_id   protein_id  start   end kogClass
0   g2.t1   695054  1   1999    Replication, recombination and repair 
1   g3.t1   630170  2000    3056    General function prediction only 
2   g5.t1   695056  3057    4087    Signal transduction mechanisms 
3   g6.t1   671982  4088    5183    N/A
4   g7.t1   671985  5184    8001    Chromatin structure and dynamics

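Now for the million sets of 71 random mutations: I have written a function that I call a million times, and it does not seem very efficient, because after 4 hours it was only 1/10 of the way through. Here is my code. If anyone can suggest a way to speed things up, I would owe you a beer! And my appreciation.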

from random import randint

def get_71_random_genes(df, outfile):
    # how many nucleotides are there in all transcripts?
    # (column 3 is 'end'; the last row is taken to hold the largest end position)
    end_pos_last_gene = df.iloc[-1, 3]

    # this loop will go 71 times
    for i in range(71):
        # generate a number from 1 to the end of all transcripts (inclusive)
        random_number = randint(1, end_pos_last_gene)
        # this is the boolean condition - checks which gene a random number falls within
        mask = (df['start'] <= random_number) & (df['end'] >= random_number)
        # collect the rows that match
        data = df.loc[mask]
        # write data to file. outfile should be an open file handle;
        # a path would be overwritten on every call to to_csv.
        data.to_csv(outfile, sep='\t', index=False, header=False)
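One possible way to speed this up, sketched under some assumptions: if the genes are sorted by position and do not overlap (as the five rows shown suggest), the full-frame boolean mask can be replaced by a binary search on the `end` column, and all random positions can be drawn in a single call. The function name `sample_gene_hits` and its parameters `n_sets`, `n_mutations`, and `seed` are hypothetical, and the small `genes` frame only mirrors the rows shown in the question.

```python
import numpy as np
import pandas as pd

def sample_gene_hits(df, n_sets, n_mutations=71, seed=None):
    """Return an (n_sets, n_mutations) array of row positions in df.

    Assumes df is sorted by 'start' and that genes do not overlap.
    A value of -1 marks a random position that falls in a gap between genes.
    """
    starts = df["start"].to_numpy()
    ends = df["end"].to_numpy()
    rng = np.random.default_rng(seed)

    # Draw every random position at once: integers in 1..ends[-1] inclusive.
    positions = rng.integers(1, ends[-1] + 1, size=(n_sets, n_mutations))

    # Binary search: index of the first gene whose end is >= the position.
    idx = np.searchsorted(ends, positions, side="left")

    # Keep the index only if the position is also at or past that gene's start.
    hit = starts[idx] <= positions
    return np.where(hit, idx, -1)

# Toy frame built from the five rows shown in the question.
genes = pd.DataFrame({
    "transcript_id": ["g2.t1", "g3.t1", "g5.t1", "g6.t1", "g7.t1"],
    "start": [1, 2000, 3057, 4088, 5184],
    "end": [1999, 3056, 4087, 5183, 8001],
})

hits = sample_gene_hits(genes, n_sets=1000, seed=0)

# How often each gene was hit across all simulated sets.
counts = np.bincount(hits[hits >= 0], minlength=len(genes))
```

For a full million sets, the position array holds 71 million integers, so it may be more comfortable to call the function in chunks (for example 100 calls with `n_sets=10_000`) and accumulate `counts`, rather than writing every matching row to a file.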


Edit to add a more valid approach