Pagine

venerdì 28 ottobre 2011

Split fasta file

Today I want to introduce you a simple method to split a FASTA file into single files, each one containig a single sequence.
Before starting, let me take a back step. Usually, when you download a fasta file you find something like this:

>sp|P47064|AP3S_YEAST AP-3 complex subunit sigma
MIHAVLIFNKKCQPRLVKFYTPVDLPKQKLLLEQVYELISQRNSDFQSSFLVTPPSLLLS
NENNNDEVNNEDIQIIYKNYATLYFTFIVDDQESELAILDLIQTFVESLDRCFTEVNELD
LIFNWQTLESVLEEIVQGGMVIETNVNRIVASVDELNKAAESTDSKIGRLTSTGFGSALQ
AFAQGGFAQWATGQ
>sp|Q29RJ1|AP4A_BOVIN Bis(5'-nucleosyl)-tetraphosphatase
MALRACGLIIFRRRLIPKVDNTAIEFLLLQASDGIHHWTPPKGHVEPGESDLETALRETQ
EEAGIEAGQLTIIEGFRRELSYVARAKPKIVIYWLAEVKDCDVEVRLSREHQAYRWLELE
DACQLAQFEEMKAALQEGHQFLCSTAT
 
First of all, you can perform an operation in order to transform the sequence name in a simpler one and to put the sequence in a single line.
This is the python code:


#!/usr/bin/python
import sys
X=open(sys.argv[1],'r').readlines()
g=open(sys.argv[1]+'.format','w')
for x in X:
    if x[0]=='>':
        line=x.split('|')
        if X.index(x)==0:
            g.write('>'+line[1]+'\n')
        else:
            g.write('\n>'+line[1]+'\n')
    else:
        g.write(x.strip())


This script takes the fasta file as argument. In this way you obtain a formatted fasta file:

>sp|P47064|AP3S_YEAST AP-3 complex subunit sigma
MIHAVLIFNKKCQPRLVKFYTPVDLPKQKLLLEQVYELISQRNSDF.....AFAQGGFAQWATGQ
>sp|Q29RJ1|AP4A_BOVIN Bis(5'-nucleosyl)-tetraphosphatase 
MALRACGLIIFRRRLIPKVDNTAIEFLLLQASDGIHHWTPPKGHVE.....ALQEGHQFLCSTAT


Now is very simple splitting in single files:


#!/usr/bin/python
import sys
if len(sys.argv)==2:
    DIR=sys.argv[2]+'/'
else:
    DIR='./'
X=open(sys.argv[1],'r').readlines()
for i in range(len(X)-1):
    if X[i][0]=='>':
        g=open(DIR+X[i][1:-1]+'.fasta','w')
        g.write(X[i]+X[i+1])
        g.close()
This script takes two arguments: first the fasta file formatted as showed before and second the name of the directory where you want to save the single sequence files. If no direcroty is specified the files will be saved into the work directory.

mercoledì 26 ottobre 2011

RNA to Protein

Today, I would like to describe a script that given a RNA sequence returns a protein sequence. 


First of all, you have to download this file, containing the dictionary related to genetic code (in order to understand how I created the file follow this link).


Now, copy the script in a file named script.py in the same directory in which you downloaded the dictionary, then execute the script.


RNA sequence is request as argument. The character may be lower-case or upper-case indistinctly, in fact the string.upper() function converts the whole string in upper-case characters.




#!/usr/bin/python

import pickle
import sys

dictionary=open('genetic_code.dic','r')
D=pickle.load(dictionary)

RNA=sys.argv[1].upper()
protein=''
for i in range(0,len(RNA),3):
        if i+3<=len(RNA):
                codon=RNA[i]+RNA[i+1]+RNA[i+2]
                protein=protein+D[codon]
print protein




This script is very stupid!! It is builded using uniquely a computer science approach. In fact, strarting from the first nucleotide, it reads the codons 3 by 3, and convert the codon into the relative amino acid.
Actually a biologist knows that, to initialize the translation, we need a specific codon in the RNA sequence (AUG that corresponds to Metionin), as well as for the termination (UAA, UAG o UGA that are the stop codons).
So we can add some line in the script code, in order to keep into account these molecular biology roles.
We will do nothing but verify in which RNA sequence positions the first AUG and the first STOP codons appear, and we will cositer these positions as the first and the last of the RNA sequence.




#!/usr/bin/python 

import pickle 
import sys 

dictionary=open('genetic_code.dic','r')
D=pickle.load(dictionary) 

RNA=sys.argv[1].upper() 
if 'AUG' in RNA: 
    s=RNA.index('AUG')
    protein='-' * s 
    PROTEIN='' 
else: 
    s=0 
    print RNA 
    print '-' * len(RNA)
    print 'AUG codon was not found' 
    sys.exit() 

for i in range(s,len(RNA),3):
    if i+3<=len(RNA): 
        codon=RNA[i]+RNA[i+1]+RNA[i+2] 
        if D[codon]=='STOP': 
            protein=protein+'-*-' 
            break 
        else: 
            protein=protein+'-'+D[codon]+'-' 
            PROTEIN=PROTEIN+D[codon] 
end='-' * (len(RNA)-len(protein)) 
print RNA 
print protein+end 
print 'Protein sequence --->', PROTEIN

A dictionary for ever

When you write data into an object (list, set, dictionary, e.g.),  save this object in a file may be very useful. By this, you can access to data structure without re-build it from scratch.


The python module that allows to perform this is pickle.
For example, take a txt file containing the genetic code data, and try to seve a dictionary.
 


#!/usr/bin/python
import pickle
GC=open('genetic_code.txt','r').readlines()
D={}
for i in GC:
        line=i.split()
        codons=line[-1].split(',')
        for c in codons[:-1]:
                D[c]=line[1]
g=open('dictionary','w')
pickle.dump(D,g)
g.close()


Now you will find a file named 'dictionary' in the working directory.
Opening the file, you will see nothing of readable, but you can import and use the file when the dictionary will serve you.
In order to import the dictionary you have to use this syntax:


import pickle
F=open('dizionario','r')
D=pickle.load(F)




Now the variable D is your dictionary again.

In spite of the post title, the procedure works also with lists, sets and whatever you want immortalize.

martedì 25 ottobre 2011

I Dizionari in Python

Quando si usa il linguaggio di programmazione python l'utilizzo dei dizionari può essere la chiave di volta per risolvere un marea di problemi.
Un dizionario e' una struttura dati, ovvero un oggetto in cui e' possibile memorizzare dati, che funziona come una tabella. Per meglio dire, si puo' immaginare il dizionario come se fosse una tabella con due colonne in cui, ad ogni campo della prima colonna (chiave), ne corrisponde uno della seconda (valore). I campi della prima colonna sono unici (non ci puo' essere piu' volte la stessa chiave) .


Ad esempio immaginiamo questa tabella:



Alunno(CHIAVI) Eta'(VALORI)
Pippo 10
Ciccio 11
Franco 10
Gianni 9


 Il relativo dizionario sara':


D={'Pippo':10, Ciccio':11, Franco':10, 'Gianni':9}


D e' la variabile che contiene il dizionario, il contenuto si deve scrivere tra parentesi graffe con la sintassi chiave : valore ,
Quindi ogni volta che si vuole memorizzare una serie di dati di tipo chiave-valore il dizionario e' la struttura dati piu' adatta. Inoltre, grazie ad un sistema che puo' tranquillamente rimanere oscuro ad un biologo, le operazioni che un dizionario consente sono veloci e molto pratiche.
Provero' ad elencarne alcune:


- Per aggiungere elementi al dizionario
D['chiave']='valore'


-Se ho una chiave e ho bisogno di andare a ripescare il corrispondente valore
V=D['chiave']


-Per testare se una determinata chiave c'e' nel dizionario
T=D.has_key('chiave')
se la chiave c'e' T=True, se la chiave non c'e' T=False


-D.keys()
Ti da una lista che contiene tutte le chiavi del dizionario ['Pippo','Ciccio','Franco','Gianni']


-D.values()
Ti da una lista che contiene tutti i valori di un dizionario [10,11,10,9]


-D.items()
Ti da una tupla che contiene tutti i dati [('Pippo', 10),('Ciccio',11),('Franco',10),('Gianni',9)]
A mio avviso le tuple non servono a niente (qualcuno mi potrebbe tirare una sassata, e me la prenderei volentieri se mi venisse anche spiegato il perche')





DNA to RNA

As well as for "reverse strand", the transformation of a string od DNA in a string of RNA is trivial for any biologist. The basic idea is very similar, here the script:


#!/usr/bin/python
import sys
DNA=sys.argv[1]
CODE={'A':'U','T':'A','C':'G','G':'C'} 
RNA=''
for c in DNA:
        RNA=RNA+CODE[c]
print RNA


In comparison of "reverse strand" script, you have to change the cariable CODE contents. Such code is passed in form of dictionary, one of the most virtuous python data structures . Also in this case the script request a DNA sequence as argument.





lunedì 24 ottobre 2011

Reverse strand

Data una sequenza di DNA, scrivere la sequenza complementare e' cosa banale, senza dubbio. Ogni biologo conosce il codice genetico e sa come fare...ma se la sequenza e' lunga lunga lunga farlo con carta e penna puo' essere un lavoretto lungo e noioso.
Ecco poche righe di codice python che ci semplificano la vita:


#!/usr/bin/python
import sys
plus_seq=sys.argv[1]
CODE={'A':'T','T':'A','C':'G','G':'C'} 
minus_seq=''
for c in plus_seq:
        minus_seq=minus_seq+CODE[c]
print minus_seq


Questo script va eseguito dal terminale dando come argomento la sequenza di DNA.
Il risultato dello script e' la sequenza complementare. Se la sequenza di input viene data in direzione 5'-3' allora la sequenza di output sara' 3'-5'.
Talvolta puo' essere utile ottenere la sequenza complementare in direzione 5'-3', in questo caso non dovete fare altro che aggiungere una sola riga di codice:




#!/usr/bin/python
import sys
plus_seq=sys.argv[1]
CODE={'A':'T','T':'A','C':'G','G':'C'} 
minus_seq=''
for c in plus_seq:
        minus_seq=minus_seq+CODE[c]
reverse_seq=minus_seq[::-1]
print reverse_seq


Piu' in generale la riga di codice in rosso permette di scrivere una qualsiasi stringa al contrario.
Provate questo esempio:


stringa1='acetone'
stringa2=stringa1[::-1]
print stringa2


Sulla variabile stringa2 sara' contenuta la parola 'enoteca'.


 





What is an argument???


Before understanding what an argument was, I spent a lot of time!
Now I try to explain it in a less nerdy as possible.


When you run a script, the goal is, given the input, compute the output.
The input may be passed as argument, from command line instead of directly inside the script.
For example, suppose you have a script that take as input a DNA sequence and convert it in RNA. Such string of character is passed as follow:


python script.py ATCTAGTGTCTGATCCTAGTCTTGATC


the string is an argument.


As well as input, you can pass as arguments also the script parameters.
For example, if you have a script that, given a protein sequence, count the frequency of a certain amino acid, you have:


python script.py PPLMAAIEKSNISAWWNF S


Where the first argument is the protein sequence, and the second argument is the character that you want to count (Serine in this case). 
By this you obtain a more general script, that is useful to count any amino acid in any sequence.
You can apply the same criteria for both python and bach scripts.
Within the script you will find the functions suitable to recognize and to read the arguments.


With python first you have to import the sys module; the sintax is the follow:
import sys
ARGUMENT1=sys.argv[1]
ARGUMENTO2=sys.argv[2]


With bash:
ARGUMENT1=$1
ARGUMENT2=$2


By this, the variables ARGUMENT1 and ARGUMENT2 are the "object" that you passed from command line.



domenica 23 ottobre 2011

Script Execution

In order to execute a script you have to perform the following steps:


1)Copy the script in txt file;


2)Rename the file with a proper extension:
      if the script is in python the name will be script.py
      if the script is in bash the name will be script.sh


3)Open the Terminal and change directory where you saved the script;


4)Execute the script:
     if the script is in python type python script.py
     if the script is in bash type bash script.sh

     typing enter the script will be executed.
     In the case of some arguments are request be careful!!


5)In my scripts usually the results will be printed in the terminal as standard outputl. You can redirect the output in a file in this way:
    python script.py > script_results