Scripts for dummies

lunedì 5 dicembre 2011

Convert ClustalW format in FASTA format

When you are working with multiple sequence alignments you may have some different alignments format.
This is a simple tool to convert the ClustalW format:

P31937          ----------------MAASLRLLGAASGLRYWSR-RLRPAAGSFAAVCSRSVASKTPVG
P00508          MALLQSRLL-------LSAPRRAAATARASSWWSHVEMGPPDPILGVTEAFKRDTNSKKM
P12344          MALLHSGRFLSGVAAAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAFKRDTNSKKM
P00505          MALLHSGRVLPGIAAAFHPGLAAAASARASSWWTHVEMGPPDPILGVTEAFKRDTNSKKM
P05202          MALLHSSRILSGMAAAFHPGLAAAASARASSWWTHVEMGPPDPILGVTEAFKRDTNSKKM
                                : .     .:* . :*:. : *.   :... : . :::

P31937          FIGLGNM----GNPMAKNLMKHGYPLIIYDVFPDAC------KEFQDAGEQVVSSPADVA
P00508          NLGVGAYRDDNGKSYVLNCVRKAEAMIAAKKMDKEYLPIAGLADFTRASAELALGENSEA
P12344          NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIAGLAEFCKASAELALGENNEV
P00505          NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIGGLAEFCKASAELALGENSEV
P05202          NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIGGLAEFCKASAELALGENNEV
                 :*:*      *:. .   :.:. . * . : .         :* *. ::. . .

in FASTA format (one line sequence one line):

>P31937
----------------MAASLRLLGAASGLRYWSR-RLRPAAGSFAAVCSRSVASKTPVG
FIGLGNM----GNPMAKNLMKHGYPLIIYDVFPDAC------KEFQDAGEQVVSSPADVA
>P00508
MALLQSRLL-------LSAPRRAAATARASSWWSHVEMGPPDPILGVTEAFKRDTNSKKM
NLGVGAYRDDNGKSYVLNCVRKAEAMIAAKKMDKEYLPIAGLADFTRASAELALGENSEA
>P00505
MALLHSGRVLPGIAAAFHPGLAAAASARASSWWTHVEMGPPDPILGVTEAFKRDTNSKKM
NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIGGLAEFCKASAELALGENSEV
>P12344
MALLHSGRFLSGVAAAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAFKRDTNSKKM
NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIAGLAEFCKASAELALGENNEV
>P05202
MALLHSSRILSGMAAAFHPGLAAAASARASSWWTHVEMGPPDPILGVTEAFKRDTNSKKM
NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIGGLAEFCKASAELALGENNEV

The script take as argument a file containing the alignment in ClustalW format

#!/usr/bin/python

import sys

CluW=open(sys.argv[1],'r').readlines()

FASTA={}

for x in CluW[1:]:

        line=x.split()

        if len(line)==2:

                FASTA[line[0]]=''

for x in CluW[1:]:

        line=x.split()

        if len(line)==2:

                FASTA[line[0]]=FASTA[line[0]]+line[1].strip()

for k,v in FASTA.iteritems():

        print '>'+k

        print v

Follow the instruction to execute the script.

venerdì 25 novembre 2011

Protein Clusters

Two proteins, with sequence identity higher than a certain percentage, are similar in term of 3Dmentional structure and probably they come from a common ancestor.
Clusterize the homologous protein sequences can be very useful. For example, if you have to collect a non-redundant protein dataset you can compute the clusters and than choose one sequence per cluster...
The principle commonly used to build the clusters is the following:
if protein A is homologous to protein B and protien B is homologous to protein C, than proteins A, B and C will be put into the same cluster.
This is a python script to generate the cluster starting from BLAST output.
The script requests 2 arguments: the Blast alignment output in TAB format and the homology threshold. You need only the first 3 column of the first argument, therefore you can use an input file like the following:

seq1    seq1    100.00
seq1    seq2    56.90
seq2    seq1    56.90
seq2    seq2    100.00
seq3    seq3    100.00
seq1    seq3    20.43
seq3    seq1    20.40

where you have respectively query sequence, target sequence and identity percentage. The file must contain all the sequences compared with itself (see the first line in the example).
Here, follow the instructions to execute the script.

#!/usr/bin/python
from sets import Set
import sys
import os
#first argument= concatenated BLAST output in TAB format

#second argument= homology threshold
if len(sys.argv)<3:
    print 'USAGE: python cluster.py <blast output> <homology threshold>'
    sys.exit()

os.system("awk '$3>="+sys.argv[2]+"' "+sys.argv[1]+" > tmp")

B=open(sys.argv[1],'r').readlines()
X=Set([])
for b in B:
    line=b.split()
    X.add(line[0])

Y=open('tmp','r').readlines()

D={}
for x in X:
    D[x.strip()]=Set([x.strip()])
for y in Y:
    line=y.split()
    D[line[0]].add(line[1])
for x in X:
    for e in D[x.strip()]:
        if x.strip() not in D[e]:
            D[e].add(x.strip())

def CLU (elemento,LAB,D,CLUSTER):
        if elemento not in LAB:
                LAB.add(elemento)
                NEW=D[elemento]
                for n in NEW:
                        CLUSTER.add(n)
                        CLUSTER=CLU(n,LAB,D,CLUSTER)
        return CLUSTER


C=Set([])
LAB=Set([])
for x in X:
        CLUSTER=Set([])
        cluster=CLU(x.strip(),LAB,D,CLUSTER)
        C.add(cluster)

i=0
for c in C:
    if c!=Set([]):
        i=i+1
        print 'CLUSTER', str(i), '(', str(len(c)), 'elements)'
        for e in c:
            print e,
        print ''
os.system('rm tmp')

The script is not faultless...unfortunately it is limitated to the maximun number of recursion depth allowed from your system. Anyway I'm a dummy, I'm hopeless...so you cannot ask for more!!!!!

giovedì 17 novembre 2011

Python implementation of DOT PLOT

Hello! This is a very simple script to compute the dot plot in order to compare two biological sequences.

#!/usr/bin/python

import sys

X=open(sys.argv[1],'r').readlines()

Y=open(sys.argv[2],'r').readlines()

seq1=''

seq2=''

for f in X[1:]:

        seq1=seq1+f.strip()

for f in Y[1:]:

        seq2=seq2+f.strip()

i=0

g=open('DP.dat','w')

for A in seq1:

        dot=''

        i=i+1

        e=0

        for B in seq2:

                e=e+1

                if A==B:

                        dot=dot+'/'

                        g.write(str(i)+' '+str(e)+'\n')

                else:

                        dot=dot+' '

The script takes as arguments two fasta sequences and generates a file named DP.dat that can be visualized with different tools.(Instruction to execution)
For example you can use gnuplot typing:

gnuplot

and than:

gnuplot> plot "DP.dat" with dots

You will obtain a plot like the following, where the numbers in the axis correspond to amino acid positions in the protein sequences.

Dop Plot obtained with GNUPLOT

martedì 8 novembre 2011

Compute a sequence profile

Given an alignment in fasta format you can easily compute the sequence profile referring to the first sequence in the alignment. This is the script:

X=open(sys.argv[1],'r').readlines()
LL=[]
seq=''
for x in X:
    if x[0]!='>':
        seq=seq+x.strip()
    else:
        LL.append(seq)
        seq=''
LL.append(seq)
L=LL[1:]
reference=L[0]
e=0
POS=[]
for r in reference:
    if r!='-':
        POS.append(e)
    e=e+1
AA='VLIMFWYGAPSTCHRKQEND-'
for i in POS:
    values=[]
    for aa in AA:
        aac=0
        for seq in L:
            if seq[i]==aa:
                aac=aac+1
        values.append(float(aac))
    SUM=sum(values)
    for val in values:
        print "%.2f" % (val/SUM),
    print

The script take as argument the fasta alignments and computes position by position the amino acid and the gap frequencies. If you want to eliminate the gap you have to replace the string AA='VLIMFWYGAPSTCHRKQEND-' with AA='VLIMFWYGAPSTCHRKQEND'.
If you want more that 2 decimal, you have to substitute "%.2f" % with the number desired.

venerdì 28 ottobre 2011

Split fasta file

Today I want to introduce you a simple method to split a FASTA file into single files, each one containig a single sequence.
Before starting, let me take a back step. Usually, when you download a fasta file you find something like this:

>sp|P47064|AP3S_YEAST AP-3 complex subunit sigma
MIHAVLIFNKKCQPRLVKFYTPVDLPKQKLLLEQVYELISQRNSDFQSSFLVTPPSLLLS
NENNNDEVNNEDIQIIYKNYATLYFTFIVDDQESELAILDLIQTFVESLDRCFTEVNELD
LIFNWQTLESVLEEIVQGGMVIETNVNRIVASVDELNKAAESTDSKIGRLTSTGFGSALQ
AFAQGGFAQWATGQ
>sp|Q29RJ1|AP4A_BOVIN Bis(5'-nucleosyl)-tetraphosphatase
MALRACGLIIFRRRLIPKVDNTAIEFLLLQASDGIHHWTPPKGHVEPGESDLETALRETQ
EEAGIEAGQLTIIEGFRRELSYVARAKPKIVIYWLAEVKDCDVEVRLSREHQAYRWLELE
DACQLAQFEEMKAALQEGHQFLCSTAT

First of all, you can perform an operation in order to transform the sequence name in a simpler one and to put the sequence in a single line.
This is the python code:

#!/usr/bin/python

import sys 

X=open(sys.argv[1],'r').readlines()
g=open(sys.argv[1]+'.format','w')
for x in X:

    if x[0]=='>':
        line=x.split('|')

        if X.index(x)==0:

            g.write('>'+line[1]+'\n')

        else:
            g.write('\n>'+line[1]+'\n')

    else:
        g.write(x.strip())

This script takes the fasta file as argument. In this way you obtain a formatted fasta file:

>sp|P47064|AP3S_YEAST AP-3 complex subunit sigma
MIHAVLIFNKKCQPRLVKFYTPVDLPKQKLLLEQVYELISQRNSDF.....AFAQGGFAQWATGQ
>sp|Q29RJ1|AP4A_BOVIN Bis(5'-nucleosyl)-tetraphosphatase

MALRACGLIIFRRRLIPKVDNTAIEFLLLQASDGIHHWTPPKGHVE.....ALQEGHQFLCSTAT

Now is very simple splitting in single files:

#!/usr/bin/python 

import sys

if len(sys.argv)==2:

    DIR=sys.argv[2]+'/'

else:

    DIR='./'

X=open(sys.argv[1],'r').readlines() 
for i in range(len(X)-1):
    if X[i][0]=='>':
        g=open(DIR+X[i][1:-1]+'.fasta','w')
        g.write(X[i]+X[i+1])
        g.close()

This script takes two arguments: first the fasta file formatted as showed before and second the name of the directory where you want to save the single sequence files. If no direcroty is specified the files will be saved into the work directory.

mercoledì 26 ottobre 2011

RNA to Protein

Today, I would like to describe a script that given a RNA sequence returns a protein sequence.

First of all, you have to download this file, containing the dictionary related to genetic code (in order to understand how I created the file follow this link).

Now, copy the script in a file named script.py in the same directory in which you downloaded the dictionary, then execute the script.

RNA sequence is request as argument. The character may be lower-case or upper-case indistinctly, in fact the string.upper() function converts the whole string in upper-case characters.

#!/usr/bin/python

import pickle

import sys

dictionary=open('genetic_code.dic','r')

D=pickle.load(dictionary)

RNA=sys.argv[1].upper()

protein=''

for i in range(0,len(RNA),3):

if i+3<=len(RNA):

codon=RNA[i]+RNA[i+1]+RNA[i+2]

protein=protein+D[codon]

print protein

This script is very stupid!! It is builded using uniquely a computer science approach. In fact, strarting from the first nucleotide, it reads the codons 3 by 3, and convert the codon into the relative amino acid.
Actually a biologist knows that, to initialize the translation, we need a specific codon in the RNA sequence (AUG that corresponds to Metionin), as well as for the termination (UAA, UAG o UGA that are the stop codons).
So we can add some line in the script code, in order to keep into account these molecular biology roles.
We will do nothing but verify in which RNA sequence positions the first AUG and the first STOP codons appear, and we will cositer these positions as the first and the last of the RNA sequence.

#!/usr/bin/python

import pickle

import sys

dictionary=open('genetic_code.dic','r')

D=pickle.load(dictionary)

RNA=sys.argv[1].upper()

if 'AUG' in RNA:

s=RNA.index('AUG')

protein='-' * s

PROTEIN=''

else:

s=0

print RNA

print '-' * len(RNA)

print 'AUG codon was not found'

sys.exit()

for i in range(s,len(RNA),3):

if i+3<=len(RNA):

codon=RNA[i]+RNA[i+1]+RNA[i+2]

if D[codon]=='STOP':

protein=protein+'-*-'

break

else:

protein=protein+'-'+D[codon]+'-'

PROTEIN=PROTEIN+D[codon]

end='-' * (len(RNA)-len(protein))

print RNA

print protein+end

print 'Protein sequence --->', PROTEIN

A dictionary for ever

When you write data into an object (list, set, dictionary, e.g.), save this object in a file may be very useful. By this, you can access to data structure without re-build it from scratch.

The python module that allows to perform this is pickle.
For example, take a txt file containing the genetic code data, and try to seve a dictionary.

#!/usr/bin/python
import pickle
GC=open('genetic_code.txt','r').readlines()
D={}
for i in GC:
        line=i.split()
        codons=line[-1].split(',')
        for c in codons[:-1]:
                D[c]=line[1]
g=open('dictionary','w')
pickle.dump(D,g)
g.close()

Now you will find a file named 'dictionary' in the working directory.
Opening the file, you will see nothing of readable, but you can import and use the file when the dictionary will serve you.
In order to import the dictionary you have to use this syntax:

import pickle
F=open('dizionario','r')
D=pickle.load(F)

Now the variable D is your dictionary again.

In spite of the post title, the procedure works also with lists, sets and whatever you want immortalize.

Pagine

lunedì 5 dicembre 2011

Convert ClustalW format in FASTA format

venerdì 25 novembre 2011

Protein Clusters

giovedì 17 novembre 2011

Python implementation of DOT PLOT

martedì 8 novembre 2011

Compute a sequence profile

venerdì 28 ottobre 2011

Split fasta file

mercoledì 26 ottobre 2011

RNA to Protein

A dictionary for ever