Pagine

martedì 8 novembre 2011

Compute a sequence profile

Given an alignment in fasta format you can easily compute the sequence profile referring to the first sequence in the alignment. This is the script:


X=open(sys.argv[1],'r').readlines()
LL=[]
seq=''
for x in X:
    if x[0]!='>':
        seq=seq+x.strip()
    else:
        LL.append(seq)
        seq=''
LL.append(seq)
L=LL[1:]
reference=L[0]
e=0
POS=[]
for r in reference:
    if r!='-':
        POS.append(e)
    e=e+1
AA='VLIMFWYGAPSTCHRKQEND-'
for i in POS:
    values=[]
    for aa in AA:
        aac=0
        for seq in L:
            if seq[i]==aa:
                aac=aac+1
        values.append(float(aac))
    SUM=sum(values)
    for val in values:
        print "%.2f" % (val/SUM),
    print

The script take as argument the fasta alignments and computes position by position the amino acid and the gap frequencies. If you want to eliminate the gap you have to replace the string  AA='VLIMFWYGAPSTCHRKQEND-' with AA='VLIMFWYGAPSTCHRKQEND'.
If you want more that 2 decimal, you have to substitute  "%.2f" % with the number desired.

Nessun commento:

Posta un commento