Clusterize the homologous protein sequences can be very useful. For example, if you have to collect a non-redundant protein dataset you can compute the clusters and than choose one sequence per cluster...
The principle commonly used to build the clusters is the following:
if protein A is homologous to protein B and protien B is homologous to protein C, than proteins A, B and C will be put into the same cluster.
This is a python script to generate the cluster starting from BLAST output.
The script requests 2 arguments: the Blast alignment output in TAB format and the homology threshold. You need only the first 3 column of the first argument, therefore you can use an input file like the following:
seq1 seq1 100.00
seq1 seq2 56.90
seq2 seq1 56.90
seq2 seq2 100.00
seq3 seq3 100.00
seq1 seq3 20.43
seq3 seq1 20.40
seq1 seq2 56.90
seq2 seq1 56.90
seq2 seq2 100.00
seq3 seq3 100.00
seq1 seq3 20.43
seq3 seq1 20.40
where you have respectively query sequence, target sequence and identity percentage. The file must contain all the sequences compared with itself (see the first line in the example).
Here, follow the instructions to execute the script.
#!/usr/bin/python
from sets import Set
import sys
import os
#first argument= concatenated BLAST output in TAB format
from sets import Set
import sys
import os
#first argument= concatenated BLAST output in TAB format
#second argument= homology threshold
if len(sys.argv)<3:
print 'USAGE: python cluster.py <blast output> <homology threshold>'
sys.exit()
os.system("awk '$3>="+sys.argv[2]+"' "+sys.argv[1]+" > tmp")
B=open(sys.argv[1],'r').readlines()
X=Set([])
for b in B:
line=b.split()
X.add(line[0])
Y=open('tmp','r').readlines()
D={}
for x in X:
D[x.strip()]=Set([x.strip()])
for y in Y:
line=y.split()
D[line[0]].add(line[1])
for x in X:
for e in D[x.strip()]:
if x.strip() not in D[e]:
D[e].add(x.strip())
def CLU (elemento,LAB,D,CLUSTER):
if elemento not in LAB:
LAB.add(elemento)
NEW=D[elemento]
for n in NEW:
CLUSTER.add(n)
CLUSTER=CLU(n,LAB,D,CLUSTER)
return CLUSTER
C=Set([])
LAB=Set([])
for x in X:
CLUSTER=Set([])
cluster=CLU(x.strip(),LAB,D,CLUSTER)
C.add(cluster)
i=0
for c in C:
if c!=Set([]):
i=i+1
print 'CLUSTER', str(i), '(', str(len(c)), 'elements)'
for e in c:
print e,
print ''
os.system('rm tmp')
if len(sys.argv)<3:
print 'USAGE: python cluster.py <blast output> <homology threshold>'
sys.exit()
os.system("awk '$3>="+sys.argv[2]+"' "+sys.argv[1]+" > tmp")
B=open(sys.argv[1],'r').readlines()
X=Set([])
for b in B:
line=b.split()
X.add(line[0])
Y=open('tmp','r').readlines()
D={}
for x in X:
D[x.strip()]=Set([x.strip()])
for y in Y:
line=y.split()
D[line[0]].add(line[1])
for x in X:
for e in D[x.strip()]:
if x.strip() not in D[e]:
D[e].add(x.strip())
def CLU (elemento,LAB,D,CLUSTER):
if elemento not in LAB:
LAB.add(elemento)
NEW=D[elemento]
for n in NEW:
CLUSTER.add(n)
CLUSTER=CLU(n,LAB,D,CLUSTER)
return CLUSTER
C=Set([])
LAB=Set([])
for x in X:
CLUSTER=Set([])
cluster=CLU(x.strip(),LAB,D,CLUSTER)
C.add(cluster)
i=0
for c in C:
if c!=Set([]):
i=i+1
print 'CLUSTER', str(i), '(', str(len(c)), 'elements)'
for e in c:
print e,
print ''
os.system('rm tmp')
The script is not faultless...unfortunately it is limitated to the maximun number of recursion depth allowed from your system. Anyway I'm a dummy, I'm hopeless...so you cannot ask for more!!!!!