Scripts for dummies: novembre 2011

venerdì 25 novembre 2011

Protein Clusters

Two proteins, with sequence identity higher than a certain percentage, are similar in term of 3Dmentional structure and probably they come from a common ancestor.
Clusterize the homologous protein sequences can be very useful. For example, if you have to collect a non-redundant protein dataset you can compute the clusters and than choose one sequence per cluster...
The principle commonly used to build the clusters is the following:
if protein A is homologous to protein B and protien B is homologous to protein C, than proteins A, B and C will be put into the same cluster.
This is a python script to generate the cluster starting from BLAST output.
The script requests 2 arguments: the Blast alignment output in TAB format and the homology threshold. You need only the first 3 column of the first argument, therefore you can use an input file like the following:

seq1    seq1    100.00
seq1    seq2    56.90
seq2    seq1    56.90
seq2    seq2    100.00
seq3    seq3    100.00
seq1    seq3    20.43
seq3    seq1    20.40

where you have respectively query sequence, target sequence and identity percentage. The file must contain all the sequences compared with itself (see the first line in the example).
Here, follow the instructions to execute the script.

#!/usr/bin/python
from sets import Set
import sys
import os
#first argument= concatenated BLAST output in TAB format

#second argument= homology threshold
if len(sys.argv)<3:
    print 'USAGE: python cluster.py <blast output> <homology threshold>'
    sys.exit()

os.system("awk '$3>="+sys.argv[2]+"' "+sys.argv[1]+" > tmp")

B=open(sys.argv[1],'r').readlines()
X=Set([])
for b in B:
    line=b.split()
    X.add(line[0])

Y=open('tmp','r').readlines()

D={}
for x in X:
    D[x.strip()]=Set([x.strip()])
for y in Y:
    line=y.split()
    D[line[0]].add(line[1])
for x in X:
    for e in D[x.strip()]:
        if x.strip() not in D[e]:
            D[e].add(x.strip())

def CLU (elemento,LAB,D,CLUSTER):
        if elemento not in LAB:
                LAB.add(elemento)
                NEW=D[elemento]
                for n in NEW:
                        CLUSTER.add(n)
                        CLUSTER=CLU(n,LAB,D,CLUSTER)
        return CLUSTER


C=Set([])
LAB=Set([])
for x in X:
        CLUSTER=Set([])
        cluster=CLU(x.strip(),LAB,D,CLUSTER)
        C.add(cluster)

i=0
for c in C:
    if c!=Set([]):
        i=i+1
        print 'CLUSTER', str(i), '(', str(len(c)), 'elements)'
        for e in c:
            print e,
        print ''
os.system('rm tmp')

The script is not faultless...unfortunately it is limitated to the maximun number of recursion depth allowed from your system. Anyway I'm a dummy, I'm hopeless...so you cannot ask for more!!!!!

giovedì 17 novembre 2011

Python implementation of DOT PLOT

Hello! This is a very simple script to compute the dot plot in order to compare two biological sequences.

#!/usr/bin/python

import sys

X=open(sys.argv[1],'r').readlines()

Y=open(sys.argv[2],'r').readlines()

seq1=''

seq2=''

for f in X[1:]:

        seq1=seq1+f.strip()

for f in Y[1:]:

        seq2=seq2+f.strip()

i=0

g=open('DP.dat','w')

for A in seq1:

        dot=''

        i=i+1

        e=0

        for B in seq2:

                e=e+1

                if A==B:

                        dot=dot+'/'

                        g.write(str(i)+' '+str(e)+'\n')

                else:

                        dot=dot+' '

The script takes as arguments two fasta sequences and generates a file named DP.dat that can be visualized with different tools.(Instruction to execution)
For example you can use gnuplot typing:

gnuplot

and than:

gnuplot> plot "DP.dat" with dots

You will obtain a plot like the following, where the numbers in the axis correspond to amino acid positions in the protein sequences.

Dop Plot obtained with GNUPLOT

martedì 8 novembre 2011

Compute a sequence profile

Given an alignment in fasta format you can easily compute the sequence profile referring to the first sequence in the alignment. This is the script:

X=open(sys.argv[1],'r').readlines()
LL=[]
seq=''
for x in X:
    if x[0]!='>':
        seq=seq+x.strip()
    else:
        LL.append(seq)
        seq=''
LL.append(seq)
L=LL[1:]
reference=L[0]
e=0
POS=[]
for r in reference:
    if r!='-':
        POS.append(e)
    e=e+1
AA='VLIMFWYGAPSTCHRKQEND-'
for i in POS:
    values=[]
    for aa in AA:
        aac=0
        for seq in L:
            if seq[i]==aa:
                aac=aac+1
        values.append(float(aac))
    SUM=sum(values)
    for val in values:
        print "%.2f" % (val/SUM),
    print

The script take as argument the fasta alignments and computes position by position the amino acid and the gap frequencies. If you want to eliminate the gap you have to replace the string AA='VLIMFWYGAPSTCHRKQEND-' with AA='VLIMFWYGAPSTCHRKQEND'.
If you want more that 2 decimal, you have to substitute "%.2f" % with the number desired.