Pagine

lunedì 5 dicembre 2011

Convert ClustalW format in FASTA format

When you are working with multiple sequence alignments you may have some different alignments format.
This is a simple tool to convert the ClustalW  format:

P31937          ----------------MAASLRLLGAASGLRYWSR-RLRPAAGSFAAVCSRSVASKTPVG
P00508          MALLQSRLL-------LSAPRRAAATARASSWWSHVEMGPPDPILGVTEAFKRDTNSKKM
P12344          MALLHSGRFLSGVAAAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAFKRDTNSKKM
P00505          MALLHSGRVLPGIAAAFHPGLAAAASARASSWWTHVEMGPPDPILGVTEAFKRDTNSKKM
P05202          MALLHSSRILSGMAAAFHPGLAAAASARASSWWTHVEMGPPDPILGVTEAFKRDTNSKKM
                                : .     .:* .  :*:.  : *.   :... : .  :::

 

P31937          FIGLGNM----GNPMAKNLMKHGYPLIIYDVFPDAC------KEFQDAGEQVVSSPADVA
P00508          NLGVGAYRDDNGKSYVLNCVRKAEAMIAAKKMDKEYLPIAGLADFTRASAELALGENSEA
P12344          NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIAGLAEFCKASAELALGENNEV
P00505          NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIGGLAEFCKASAELALGENSEV
P05202          NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIGGLAEFCKASAELALGENNEV
                 :*:*      *:. .   :.:. . *  . : .         :*  *. ::. .  . 

in FASTA format (one line sequence one line):

>P31937
----------------MAASLRLLGAASGLRYWSR-RLRPAAGSFAAVCSRSVASKTPVG
FIGLGNM----GNPMAKNLMKHGYPLIIYDVFPDAC------KEFQDAGEQVVSSPADVA
>P00508
MALLQSRLL-------LSAPRRAAATARASSWWSHVEMGPPDPILGVTEAFKRDTNSKKM
NLGVGAYRDDNGKSYVLNCVRKAEAMIAAKKMDKEYLPIAGLADFTRASAELALGENSEA
>P00505
MALLHSGRVLPGIAAAFHPGLAAAASARASSWWTHVEMGPPDPILGVTEAFKRDTNSKKM
NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIGGLAEFCKASAELALGENSEV
>P12344
MALLHSGRFLSGVAAAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAFKRDTNSKKM
NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIAGLAEFCKASAELALGENNEV
>P05202
MALLHSSRILSGMAAAFHPGLAAAASARASSWWTHVEMGPPDPILGVTEAFKRDTNSKKM
NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIGGLAEFCKASAELALGENNEV


The script take as argument a file containing the alignment in ClustalW format


#!/usr/bin/python


import sys


CluW=open(sys.argv[1],'r').readlines()
FASTA={}
for x in CluW[1:]:
        line=x.split()
        if len(line)==2:
                FASTA[line[0]]=''


for x in CluW[1:]:
        line=x.split()
        if len(line)==2:
                FASTA[line[0]]=FASTA[line[0]]+line[1].strip()


for k,v in FASTA.iteritems():
        print '>'+k
        print v




Follow the instruction to execute the script.

venerdì 25 novembre 2011

Protein Clusters

Two proteins, with sequence identity higher than a certain percentage, are similar in term of 3Dmentional structure and probably they come from a common ancestor.
Clusterize the homologous protein sequences can be very useful. For example, if you have to collect a non-redundant protein dataset you can compute the clusters and than choose one sequence per cluster...
The principle commonly used to build the clusters is the following:
if protein A is homologous to protein B and protien B is homologous to protein C, than proteins A, B and C will be put into the same cluster.
This is a python script to generate the cluster starting from BLAST output.
The script requests 2 arguments: the Blast alignment output in TAB format and the homology threshold. You need only the first 3 column of the first argument, therefore you can use an input file like the following:

seq1    seq1    100.00
seq1    seq2    56.90
seq2    seq1    56.90
seq2    seq2    100.00
seq3    seq3    100.00
seq1    seq3    20.43
seq3    seq1    20.40


where you have respectively query sequence, target sequence and identity percentage. The file must contain all the sequences compared with itself (see the first line in the example).
Here, follow the instructions to execute the script.


#!/usr/bin/python
from sets import Set
import sys
import os
#first argument= concatenated BLAST output in TAB format
#second argument= homology threshold
if len(sys.argv)<3:
    print 'USAGE: python cluster.py <blast output> <homology threshold>'
    sys.exit()

os.system("awk '$3>="+sys.argv[2]+"' "+sys.argv[1]+" > tmp")

B=open(sys.argv[1],'r').readlines()
X=Set([])
for b in B:
    line=b.split()
    X.add(line[0])

Y=open('tmp','r').readlines()

D={}
for x in X:
    D[x.strip()]=Set([x.strip()])
for y in Y:
    line=y.split()
    D[line[0]].add(line[1])
for x in X:
    for e in D[x.strip()]:
        if x.strip() not in D[e]:
            D[e].add(x.strip())

def CLU (elemento,LAB,D,CLUSTER):
        if elemento not in LAB:
                LAB.add(elemento)
                NEW=D[elemento]
                for n in NEW:
                        CLUSTER.add(n)
                        CLUSTER=CLU(n,LAB,D,CLUSTER)
        return CLUSTER


C=Set([])
LAB=Set([])
for x in X:
        CLUSTER=Set([])
        cluster=CLU(x.strip(),LAB,D,CLUSTER)
        C.add(cluster)

i=0
for c in C:
    if c!=Set([]):
        i=i+1
        print 'CLUSTER', str(i), '(', str(len(c)), 'elements)'
        for e in c:
            print e,
        print ''
os.system('rm tmp')


The script is not faultless...unfortunately it is limitated to the maximun number of recursion depth allowed from your system. Anyway I'm a dummy, I'm hopeless...so you cannot ask for more!!!!!

giovedì 17 novembre 2011

Python implementation of DOT PLOT

Hello! This is a very simple script to compute the dot plot in order to compare two biological sequences.




#!/usr/bin/python


import sys


X=open(sys.argv[1],'r').readlines()
Y=open(sys.argv[2],'r').readlines()
seq1=''
seq2=''
for f in X[1:]:
        seq1=seq1+f.strip()
for f in Y[1:]:
        seq2=seq2+f.strip()
i=0
g=open('DP.dat','w')
for A in seq1:
        dot=''
        i=i+1
        e=0
        for B in seq2:
                e=e+1
                if A==B:
                        dot=dot+'/'
                        g.write(str(i)+' '+str(e)+'\n')
                else:
                        dot=dot+' '


The script takes as arguments two fasta sequences and generates a file named DP.dat that can be visualized with different tools.(Instruction to execution)
For example you can use gnuplot typing:


gnuplot


and than:


gnuplot> plot "DP.dat" with dots


You will obtain a plot like the following, where the numbers in the axis correspond to amino acid positions in the protein sequences.


Dop Plot obtained with GNUPLOT




martedì 8 novembre 2011

Compute a sequence profile

Given an alignment in fasta format you can easily compute the sequence profile referring to the first sequence in the alignment. This is the script:


X=open(sys.argv[1],'r').readlines()
LL=[]
seq=''
for x in X:
    if x[0]!='>':
        seq=seq+x.strip()
    else:
        LL.append(seq)
        seq=''
LL.append(seq)
L=LL[1:]
reference=L[0]
e=0
POS=[]
for r in reference:
    if r!='-':
        POS.append(e)
    e=e+1
AA='VLIMFWYGAPSTCHRKQEND-'
for i in POS:
    values=[]
    for aa in AA:
        aac=0
        for seq in L:
            if seq[i]==aa:
                aac=aac+1
        values.append(float(aac))
    SUM=sum(values)
    for val in values:
        print "%.2f" % (val/SUM),
    print

The script take as argument the fasta alignments and computes position by position the amino acid and the gap frequencies. If you want to eliminate the gap you have to replace the string  AA='VLIMFWYGAPSTCHRKQEND-' with AA='VLIMFWYGAPSTCHRKQEND'.
If you want more that 2 decimal, you have to substitute  "%.2f" % with the number desired.

venerdì 28 ottobre 2011

Split fasta file

Today I want to introduce you a simple method to split a FASTA file into single files, each one containig a single sequence.
Before starting, let me take a back step. Usually, when you download a fasta file you find something like this:

>sp|P47064|AP3S_YEAST AP-3 complex subunit sigma
MIHAVLIFNKKCQPRLVKFYTPVDLPKQKLLLEQVYELISQRNSDFQSSFLVTPPSLLLS
NENNNDEVNNEDIQIIYKNYATLYFTFIVDDQESELAILDLIQTFVESLDRCFTEVNELD
LIFNWQTLESVLEEIVQGGMVIETNVNRIVASVDELNKAAESTDSKIGRLTSTGFGSALQ
AFAQGGFAQWATGQ
>sp|Q29RJ1|AP4A_BOVIN Bis(5'-nucleosyl)-tetraphosphatase
MALRACGLIIFRRRLIPKVDNTAIEFLLLQASDGIHHWTPPKGHVEPGESDLETALRETQ
EEAGIEAGQLTIIEGFRRELSYVARAKPKIVIYWLAEVKDCDVEVRLSREHQAYRWLELE
DACQLAQFEEMKAALQEGHQFLCSTAT
 
First of all, you can perform an operation in order to transform the sequence name in a simpler one and to put the sequence in a single line.
This is the python code:


#!/usr/bin/python
import sys
X=open(sys.argv[1],'r').readlines()
g=open(sys.argv[1]+'.format','w')
for x in X:
    if x[0]=='>':
        line=x.split('|')
        if X.index(x)==0:
            g.write('>'+line[1]+'\n')
        else:
            g.write('\n>'+line[1]+'\n')
    else:
        g.write(x.strip())


This script takes the fasta file as argument. In this way you obtain a formatted fasta file:

>sp|P47064|AP3S_YEAST AP-3 complex subunit sigma
MIHAVLIFNKKCQPRLVKFYTPVDLPKQKLLLEQVYELISQRNSDF.....AFAQGGFAQWATGQ
>sp|Q29RJ1|AP4A_BOVIN Bis(5'-nucleosyl)-tetraphosphatase 
MALRACGLIIFRRRLIPKVDNTAIEFLLLQASDGIHHWTPPKGHVE.....ALQEGHQFLCSTAT


Now is very simple splitting in single files:


#!/usr/bin/python
import sys
if len(sys.argv)==2:
    DIR=sys.argv[2]+'/'
else:
    DIR='./'
X=open(sys.argv[1],'r').readlines()
for i in range(len(X)-1):
    if X[i][0]=='>':
        g=open(DIR+X[i][1:-1]+'.fasta','w')
        g.write(X[i]+X[i+1])
        g.close()
This script takes two arguments: first the fasta file formatted as showed before and second the name of the directory where you want to save the single sequence files. If no direcroty is specified the files will be saved into the work directory.

mercoledì 26 ottobre 2011

RNA to Protein

Today, I would like to describe a script that given a RNA sequence returns a protein sequence. 


First of all, you have to download this file, containing the dictionary related to genetic code (in order to understand how I created the file follow this link).


Now, copy the script in a file named script.py in the same directory in which you downloaded the dictionary, then execute the script.


RNA sequence is request as argument. The character may be lower-case or upper-case indistinctly, in fact the string.upper() function converts the whole string in upper-case characters.




#!/usr/bin/python

import pickle
import sys

dictionary=open('genetic_code.dic','r')
D=pickle.load(dictionary)

RNA=sys.argv[1].upper()
protein=''
for i in range(0,len(RNA),3):
        if i+3<=len(RNA):
                codon=RNA[i]+RNA[i+1]+RNA[i+2]
                protein=protein+D[codon]
print protein




This script is very stupid!! It is builded using uniquely a computer science approach. In fact, strarting from the first nucleotide, it reads the codons 3 by 3, and convert the codon into the relative amino acid.
Actually a biologist knows that, to initialize the translation, we need a specific codon in the RNA sequence (AUG that corresponds to Metionin), as well as for the termination (UAA, UAG o UGA that are the stop codons).
So we can add some line in the script code, in order to keep into account these molecular biology roles.
We will do nothing but verify in which RNA sequence positions the first AUG and the first STOP codons appear, and we will cositer these positions as the first and the last of the RNA sequence.




#!/usr/bin/python 

import pickle 
import sys 

dictionary=open('genetic_code.dic','r')
D=pickle.load(dictionary) 

RNA=sys.argv[1].upper() 
if 'AUG' in RNA: 
    s=RNA.index('AUG')
    protein='-' * s 
    PROTEIN='' 
else: 
    s=0 
    print RNA 
    print '-' * len(RNA)
    print 'AUG codon was not found' 
    sys.exit() 

for i in range(s,len(RNA),3):
    if i+3<=len(RNA): 
        codon=RNA[i]+RNA[i+1]+RNA[i+2] 
        if D[codon]=='STOP': 
            protein=protein+'-*-' 
            break 
        else: 
            protein=protein+'-'+D[codon]+'-' 
            PROTEIN=PROTEIN+D[codon] 
end='-' * (len(RNA)-len(protein)) 
print RNA 
print protein+end 
print 'Protein sequence --->', PROTEIN

A dictionary for ever

When you write data into an object (list, set, dictionary, e.g.),  save this object in a file may be very useful. By this, you can access to data structure without re-build it from scratch.


The python module that allows to perform this is pickle.
For example, take a txt file containing the genetic code data, and try to seve a dictionary.
 


#!/usr/bin/python
import pickle
GC=open('genetic_code.txt','r').readlines()
D={}
for i in GC:
        line=i.split()
        codons=line[-1].split(',')
        for c in codons[:-1]:
                D[c]=line[1]
g=open('dictionary','w')
pickle.dump(D,g)
g.close()


Now you will find a file named 'dictionary' in the working directory.
Opening the file, you will see nothing of readable, but you can import and use the file when the dictionary will serve you.
In order to import the dictionary you have to use this syntax:


import pickle
F=open('dizionario','r')
D=pickle.load(F)




Now the variable D is your dictionary again.

In spite of the post title, the procedure works also with lists, sets and whatever you want immortalize.

martedì 25 ottobre 2011

I Dizionari in Python

Quando si usa il linguaggio di programmazione python l'utilizzo dei dizionari può essere la chiave di volta per risolvere un marea di problemi.
Un dizionario e' una struttura dati, ovvero un oggetto in cui e' possibile memorizzare dati, che funziona come una tabella. Per meglio dire, si puo' immaginare il dizionario come se fosse una tabella con due colonne in cui, ad ogni campo della prima colonna (chiave), ne corrisponde uno della seconda (valore). I campi della prima colonna sono unici (non ci puo' essere piu' volte la stessa chiave) .


Ad esempio immaginiamo questa tabella:



Alunno(CHIAVI) Eta'(VALORI)
Pippo 10
Ciccio 11
Franco 10
Gianni 9


 Il relativo dizionario sara':


D={'Pippo':10, Ciccio':11, Franco':10, 'Gianni':9}


D e' la variabile che contiene il dizionario, il contenuto si deve scrivere tra parentesi graffe con la sintassi chiave : valore ,
Quindi ogni volta che si vuole memorizzare una serie di dati di tipo chiave-valore il dizionario e' la struttura dati piu' adatta. Inoltre, grazie ad un sistema che puo' tranquillamente rimanere oscuro ad un biologo, le operazioni che un dizionario consente sono veloci e molto pratiche.
Provero' ad elencarne alcune:


- Per aggiungere elementi al dizionario
D['chiave']='valore'


-Se ho una chiave e ho bisogno di andare a ripescare il corrispondente valore
V=D['chiave']


-Per testare se una determinata chiave c'e' nel dizionario
T=D.has_key('chiave')
se la chiave c'e' T=True, se la chiave non c'e' T=False


-D.keys()
Ti da una lista che contiene tutte le chiavi del dizionario ['Pippo','Ciccio','Franco','Gianni']


-D.values()
Ti da una lista che contiene tutti i valori di un dizionario [10,11,10,9]


-D.items()
Ti da una tupla che contiene tutti i dati [('Pippo', 10),('Ciccio',11),('Franco',10),('Gianni',9)]
A mio avviso le tuple non servono a niente (qualcuno mi potrebbe tirare una sassata, e me la prenderei volentieri se mi venisse anche spiegato il perche')





DNA to RNA

As well as for "reverse strand", the transformation of a string od DNA in a string of RNA is trivial for any biologist. The basic idea is very similar, here the script:


#!/usr/bin/python
import sys
DNA=sys.argv[1]
CODE={'A':'U','T':'A','C':'G','G':'C'} 
RNA=''
for c in DNA:
        RNA=RNA+CODE[c]
print RNA


In comparison of "reverse strand" script, you have to change the cariable CODE contents. Such code is passed in form of dictionary, one of the most virtuous python data structures . Also in this case the script request a DNA sequence as argument.





lunedì 24 ottobre 2011

Reverse strand

Data una sequenza di DNA, scrivere la sequenza complementare e' cosa banale, senza dubbio. Ogni biologo conosce il codice genetico e sa come fare...ma se la sequenza e' lunga lunga lunga farlo con carta e penna puo' essere un lavoretto lungo e noioso.
Ecco poche righe di codice python che ci semplificano la vita:


#!/usr/bin/python
import sys
plus_seq=sys.argv[1]
CODE={'A':'T','T':'A','C':'G','G':'C'} 
minus_seq=''
for c in plus_seq:
        minus_seq=minus_seq+CODE[c]
print minus_seq


Questo script va eseguito dal terminale dando come argomento la sequenza di DNA.
Il risultato dello script e' la sequenza complementare. Se la sequenza di input viene data in direzione 5'-3' allora la sequenza di output sara' 3'-5'.
Talvolta puo' essere utile ottenere la sequenza complementare in direzione 5'-3', in questo caso non dovete fare altro che aggiungere una sola riga di codice:




#!/usr/bin/python
import sys
plus_seq=sys.argv[1]
CODE={'A':'T','T':'A','C':'G','G':'C'} 
minus_seq=''
for c in plus_seq:
        minus_seq=minus_seq+CODE[c]
reverse_seq=minus_seq[::-1]
print reverse_seq


Piu' in generale la riga di codice in rosso permette di scrivere una qualsiasi stringa al contrario.
Provate questo esempio:


stringa1='acetone'
stringa2=stringa1[::-1]
print stringa2


Sulla variabile stringa2 sara' contenuta la parola 'enoteca'.


 





What is an argument???


Before understanding what an argument was, I spent a lot of time!
Now I try to explain it in a less nerdy as possible.


When you run a script, the goal is, given the input, compute the output.
The input may be passed as argument, from command line instead of directly inside the script.
For example, suppose you have a script that take as input a DNA sequence and convert it in RNA. Such string of character is passed as follow:


python script.py ATCTAGTGTCTGATCCTAGTCTTGATC


the string is an argument.


As well as input, you can pass as arguments also the script parameters.
For example, if you have a script that, given a protein sequence, count the frequency of a certain amino acid, you have:


python script.py PPLMAAIEKSNISAWWNF S


Where the first argument is the protein sequence, and the second argument is the character that you want to count (Serine in this case). 
By this you obtain a more general script, that is useful to count any amino acid in any sequence.
You can apply the same criteria for both python and bach scripts.
Within the script you will find the functions suitable to recognize and to read the arguments.


With python first you have to import the sys module; the sintax is the follow:
import sys
ARGUMENT1=sys.argv[1]
ARGUMENTO2=sys.argv[2]


With bash:
ARGUMENT1=$1
ARGUMENT2=$2


By this, the variables ARGUMENT1 and ARGUMENT2 are the "object" that you passed from command line.



domenica 23 ottobre 2011

Script Execution

In order to execute a script you have to perform the following steps:


1)Copy the script in txt file;


2)Rename the file with a proper extension:
      if the script is in python the name will be script.py
      if the script is in bash the name will be script.sh


3)Open the Terminal and change directory where you saved the script;


4)Execute the script:
     if the script is in python type python script.py
     if the script is in bash type bash script.sh

     typing enter the script will be executed.
     In the case of some arguments are request be careful!!


5)In my scripts usually the results will be printed in the terminal as standard outputl. You can redirect the output in a file in this way:
    python script.py > script_results