05 Computing GC Content
Posted by thinkanddo
Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.
In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
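The definition can be checked with a few lines of Python. This is a minimal sketch, not part of the original post: the function names gc_content and reverse_complement are made up for illustration, and the input is assumed to be a non-empty string containing only A, C, G and T.

```python
def gc_content(dna):
    """Return the GC-content of a DNA string as a percentage."""
    dna = dna.upper()
    gc = dna.count("G") + dna.count("C")
    # Assumption: dna is non-empty, otherwise this divides by zero.
    return 100.0 * gc / len(dna)


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(dna.upper()))


print(gc_content("AGCTATAG"))                      # 37.5
print(reverse_complement("AGCTATAG"))              # CTATAGCT
print(gc_content(reverse_complement("AGCTATAG")))  # 37.5
```

Complementing swaps every G with a C and every C with a G, and reversing only changes the order, so the combined count of G and C stays the same.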
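The FASTA layout described above can be read with a small parser. The following is a hedged sketch rather than a full FASTA implementation: parse_fasta is a hypothetical helper name, the two records are invented toy data, and the whole text after '>' is treated as the label.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping label -> sequence."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is its label.
            label = line[1:]
            records[label] = ""
        elif label is not None:
            # Sequence lines are joined until the next '>' line.
            records[label] += line
    return records


# Toy input (invented for illustration, not from the sample dataset).
toy = ">Rosalind_0001\nACGT\nAC\n>Rosalind_0002\nGGCC\n"
print(parse_fasta(toy))
# {'Rosalind_0001': 'ACGTAC', 'Rosalind_0002': 'GGCC'}
```

Because each record is created when its '>' line is seen and extended in place, the last sequence needs no special handling after the loop.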
Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
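To illustrate the tolerance, here is a small sketch that assumes "error" means the absolute difference between the submitted value and the expected value. The function within_absolute_error is a hypothetical helper; Rosalind's actual checker is not shown in this post.

```python
def within_absolute_error(answer, expected, tolerance=0.001):
    """Return True if answer differs from expected by at most tolerance."""
    return abs(answer - expected) <= tolerance


expected = 60.919540  # GC-content from the sample output
print(within_absolute_error(60.9195, expected))  # True
print(within_absolute_error(60.92, expected))    # True
print(within_absolute_error(60.9, expected))     # False
```

Under this reading, printing the percentage with three or more decimal places is enough to stay inside the allowed error.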
Sample Dataset
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
Sample Output
Rosalind_0808 60.919540
Method 1:
# -*- coding: utf-8 -*-
# Open the FASTA-format sequence file
with open('Computing_GC_Content.txt', 'r') as f:
    s = f.readlines()

# Create two lists, one for names, one for sequences
name_list = []
seq_list = []
data = ''  # joins a sequence that is spread over several lines
for line in s:
    line = line.strip()
    if line.startswith('>'):
        name_list.append(line[1:])
        if data:
            seq_list.append(data)  # store the concatenated nucleotide string
            data = ''  # reset data once the sequence has been stored
    else:
        data = data + line.upper()  # make sure every base is upper case
# The last sequence has no following '>' line, so it is added here.
# Is there a way to include the last sequence in the for loop?
seq_list.append(data)

GC_list = []
for seq in seq_list:
    i = 0
    for k in seq:
        if k == 'G' or k == 'C':
            i += 1
    GC_cont = float(i) / len(seq) * 100.0
    GC_list.append(GC_cont)

m = max(GC_list)
print(name_list[GC_list.index(m)])  # find the index of the max GC
print("{:0.6f}".format(m))  # keep 6 decimal places