Scripts for dummies: Protein Clusters

Two proteins, with sequence identity higher than a certain percentage, are similar in term of 3Dmentional structure and probably they come from a common ancestor.
Clusterize the homologous protein sequences can be very useful. For example, if you have to collect a non-redundant protein dataset you can compute the clusters and than choose one sequence per cluster...
The principle commonly used to build the clusters is the following:
if protein A is homologous to protein B and protien B is homologous to protein C, than proteins A, B and C will be put into the same cluster.
This is a python script to generate the cluster starting from BLAST output.
The script requests 2 arguments: the Blast alignment output in TAB format and the homology threshold. You need only the first 3 column of the first argument, therefore you can use an input file like the following:

seq1    seq1    100.00
seq1    seq2    56.90
seq2    seq1    56.90
seq2    seq2    100.00
seq3    seq3    100.00
seq1    seq3    20.43
seq3    seq1    20.40

where you have respectively query sequence, target sequence and identity percentage. The file must contain all the sequences compared with itself (see the first line in the example).
Here, follow the instructions to execute the script.

#!/usr/bin/python
from sets import Set
import sys
import os
#first argument= concatenated BLAST output in TAB format

#second argument= homology threshold
if len(sys.argv)<3:
    print 'USAGE: python cluster.py <blast output> <homology threshold>'
    sys.exit()

os.system("awk '$3>="+sys.argv[2]+"' "+sys.argv[1]+" > tmp")

B=open(sys.argv[1],'r').readlines()
X=Set([])
for b in B:
    line=b.split()
    X.add(line[0])

Y=open('tmp','r').readlines()

D={}
for x in X:
    D[x.strip()]=Set([x.strip()])
for y in Y:
    line=y.split()
    D[line[0]].add(line[1])
for x in X:
    for e in D[x.strip()]:
        if x.strip() not in D[e]:
            D[e].add(x.strip())

def CLU (elemento,LAB,D,CLUSTER):
        if elemento not in LAB:
                LAB.add(elemento)
                NEW=D[elemento]
                for n in NEW:
                        CLUSTER.add(n)
                        CLUSTER=CLU(n,LAB,D,CLUSTER)
        return CLUSTER


C=Set([])
LAB=Set([])
for x in X:
        CLUSTER=Set([])
        cluster=CLU(x.strip(),LAB,D,CLUSTER)
        C.add(cluster)

i=0
for c in C:
    if c!=Set([]):
        i=i+1
        print 'CLUSTER', str(i), '(', str(len(c)), 'elements)'
        for e in c:
            print e,
        print ''
os.system('rm tmp')

The script is not faultless...unfortunately it is limitated to the maximun number of recursion depth allowed from your system. Anyway I'm a dummy, I'm hopeless...so you cannot ask for more!!!!!

Scripts for dummies

Pagine

venerdì 25 novembre 2011

Protein Clusters

Nessun commento:

Posta un commento