Monday, 5 December 2011

Convert ClustalW format to FASTA format

When working with multiple sequence alignments, you may come across several different alignment formats.
This is a simple tool to convert the ClustalW format:

P31937          ----------------MAASLRLLGAASGLRYWSR-RLRPAAGSFAAVCSRSVASKTPVG
P00508          MALLQSRLL-------LSAPRRAAATARASSWWSHVEMGPPDPILGVTEAFKRDTNSKKM
P12344          MALLHSGRFLSGVAAAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAFKRDTNSKKM
P00505          MALLHSGRVLPGIAAAFHPGLAAAASARASSWWTHVEMGPPDPILGVTEAFKRDTNSKKM
P05202          MALLHSSRILSGMAAAFHPGLAAAASARASSWWTHVEMGPPDPILGVTEAFKRDTNSKKM
                                : .     .:* .  :*:.  : *.   :... : .  :::

 

P31937          FIGLGNM----GNPMAKNLMKHGYPLIIYDVFPDAC------KEFQDAGEQVVSSPADVA
P00508          NLGVGAYRDDNGKSYVLNCVRKAEAMIAAKKMDKEYLPIAGLADFTRASAELALGENSEA
P12344          NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIAGLAEFCKASAELALGENNEV
P00505          NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIGGLAEFCKASAELALGENSEV
P05202          NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIGGLAEFCKASAELALGENNEV
                 :*:*      *:. .   :.:. . *  . : .         :*  *. ::. .  . 

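to FASTA format (one header line per sequence, followed by the sequence itself; the script prints each sequence on a single line, shown here wrapped for display):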

>P31937
----------------MAASLRLLGAASGLRYWSR-RLRPAAGSFAAVCSRSVASKTPVG
FIGLGNM----GNPMAKNLMKHGYPLIIYDVFPDAC------KEFQDAGEQVVSSPADVA
>P00508
MALLQSRLL-------LSAPRRAAATARASSWWSHVEMGPPDPILGVTEAFKRDTNSKKM
NLGVGAYRDDNGKSYVLNCVRKAEAMIAAKKMDKEYLPIAGLADFTRASAELALGENSEA
>P00505
MALLHSGRVLPGIAAAFHPGLAAAASARASSWWTHVEMGPPDPILGVTEAFKRDTNSKKM
NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIGGLAEFCKASAELALGENSEV
>P12344
MALLHSGRFLSGVAAAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAFKRDTNSKKM
NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIAGLAEFCKASAELALGENNEV
>P05202
MALLHSSRILSGMAAAFHPGLAAAASARASSWWTHVEMGPPDPILGVTEAFKRDTNSKKM
NLGVGAYRDDNGKPYVLPSVRKAEAQIAAKNLDKEYLPIGGLAEFCKASAELALGENNEV
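
The FASTA listing above breaks each sequence into 60-column lines. If wrapped output like that is wanted, it can be produced with a small helper. The following is only a minimal sketch: `format_fasta` and its `width` parameter are illustrative names that are not part of the original script, and the input is assumed to be a dict mapping sequence names to aligned sequences.

```python
def format_fasta(records, width=60):
    """Return FASTA text for a {name: sequence} dict, wrapping sequences at `width` columns."""
    lines = []
    for name, seq in records.items():
        lines.append('>' + name)
        # Slice the sequence into chunks of at most `width` characters.
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    # Terminate every line, including the last one, with a newline.
    return ''.join(line + '\n' for line in lines)


if __name__ == '__main__':
    # Hypothetical, shortened records used only for demonstration.
    demo = {'P31937': '-' * 16 + 'MAASLRLLGAASGLRYWSR', 'P00508': 'MALLQSRLL'}
    print(format_fasta(demo, width=20), end='')
```

Gap characters (`-`) are kept as they are, so the wrapped lines still line up column by column across sequences.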


The script takes as its argument a file containing the alignment in ClustalW format:


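#!/usr/bin/env python3
# Convert a ClustalW alignment to FASTA.
# The ClustalW file is passed as the first command-line argument.

import sys

# Read all lines of the alignment file and close it afterwards.
with open(sys.argv[1]) as handle:
    clustal_lines = handle.readlines()

# Map each sequence ID to its concatenated aligned sequence.
fasta = {}
for raw in clustal_lines[1:]:  # skip the "CLUSTAL ..." header line
    # Blank lines and conservation lines (which start with whitespace)
    # carry no sequence data, so skip them.
    if not raw.strip() or raw[0].isspace():
        continue
    fields = raw.split()
    if len(fields) == 2:  # "<sequence ID> <alignment block>"
        fasta[fields[0]] = fasta.get(fields[0], '') + fields[1]

# Print one header line and one sequence line per record.
for name, seq in fasta.items():
    print('>' + name)
    print(seq)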




Follow the instructions to execute the script.
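
Once the script has run, one quick sanity check is that every sequence in the FASTA output has the same length, since all rows of a multiple sequence alignment span the same columns. The sketch below assumes the FASTA output is available as a string; `parse_fasta` and `alignment_lengths` are hypothetical helper names introduced here for illustration only.

```python
def parse_fasta(text):
    """Parse FASTA text into a {name: sequence} dict, joining wrapped sequence lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A header line starts a new record.
            name = line[1:]
            records[name] = ''
        elif name is not None:
            # Any other line extends the current record's sequence.
            records[name] += line
    return records


def alignment_lengths(records):
    """Return the set of distinct sequence lengths; an intact alignment yields one value."""
    return {len(seq) for seq in records.values()}


if __name__ == '__main__':
    # Hypothetical, shortened alignment used only for demonstration.
    sample = '>P31937\n--MAAS\nFIGL--\n>P00508\nMALLQS\nNLGVGA\n'
    lengths = alignment_lengths(parse_fasta(sample))
    if len(lengths) == 1:
        print('All sequences have the same length')
    else:
        print('Length mismatch:', sorted(lengths))
```

More than one distinct length suggests that an alignment block was dropped or that a non-sequence line was read as sequence data.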