
Friday, 28 October 2011

Split a FASTA file

Today I want to show you a simple method to split a FASTA file into single files, each one containing a single sequence.
Before starting, let me take a step back. Usually, when you download a FASTA file you find something like this:

>sp|P47064|AP3S_YEAST AP-3 complex subunit sigma
MIHAVLIFNKKCQPRLVKFYTPVDLPKQKLLLEQVYELISQRNSDFQSSFLVTPPSLLLS
NENNNDEVNNEDIQIIYKNYATLYFTFIVDDQESELAILDLIQTFVESLDRCFTEVNELD
LIFNWQTLESVLEEIVQGGMVIETNVNRIVASVDELNKAAESTDSKIGRLTSTGFGSALQ
AFAQGGFAQWATGQ
>sp|Q29RJ1|AP4A_BOVIN Bis(5'-nucleosyl)-tetraphosphatase
MALRACGLIIFRRRLIPKVDNTAIEFLLLQASDGIHHWTPPKGHVEPGESDLETALRETQ
EEAGIEAGQLTIIEGFRRELSYVARAKPKIVIYWLAEVKDCDVEVRLSREHQAYRWLELE
DACQLAQFEEMKAALQEGHQFLCSTAT
 
First of all, you can reduce each sequence name to a simpler one (the accession between the '|' characters) and put the whole sequence on a single line.
This is the Python code:


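#!/usr/bin/env python3
import sys

# Read the whole FASTA file given as the first argument.
with open(sys.argv[1]) as f:
    X = f.readlines()

with open(sys.argv[1] + '.format', 'w') as g:
    first = True
    for x in X:
        if x[0] == '>':
            # Keep only the accession (second '|'-separated field).
            line = x.split('|')
            if first:
                g.write('>' + line[1].strip() + '\n')
                first = False
            else:
                # Close the previous sequence line before the new header.
                g.write('\n>' + line[1].strip() + '\n')
        else:
            # Append sequence chunks without line breaks.
            g.write(x.strip())
    # End the last sequence with a newline.
    g.write('\n')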


This script takes the FASTA file as its argument and writes the result to a new file with the same name plus the '.format' suffix. In this way you obtain a formatted FASTA file:

>P47064
MIHAVLIFNKKCQPRLVKFYTPVDLPKQKLLLEQVYELISQRNSDF.....AFAQGGFAQWATGQ
>Q29RJ1
MALRACGLIIFRRRLIPKVDNTAIEFLLLQASDGIHHWTPPKGHVE.....ALQEGHQFLCSTAT
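
The header simplification in the script above relies on the name having the form 'db|accession|name description' and keeps only the second field. As a minimal sketch of that single step, the helper below (accession_from_header is a hypothetical name, not part of the original scripts) extracts the accession. The fallback for headers without '|' is an added assumption; the script above would fail on such headers.

```python
def accession_from_header(header):
    """Return the middle field of a 'db|accession|name description' header.

    Assumption: when the header has no '|' separators, fall back to the
    first whitespace-separated word (the original script does not do this).
    """
    text = header.strip().lstrip('>')
    fields = text.split('|')
    if len(fields) >= 2:
        return fields[1]
    words = text.split()
    return words[0] if words else ''


print(accession_from_header(">sp|P47064|AP3S_YEAST AP-3 complex subunit sigma"))
```

For the first sequence of the example file this prints P47064, which is the name used in the formatted file.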


Now it is very simple to split the formatted file into single files:


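#!/usr/bin/env python3
import os
import sys

# The output directory is optional (second argument).
if len(sys.argv) >= 3:
    DIR = sys.argv[2]
else:
    DIR = '.'

with open(sys.argv[1]) as f:
    X = f.readlines()

for i in range(len(X) - 1):
    if X[i][0] == '>':
        # The header (without '>' and newline) becomes the file name.
        name = X[i][1:].strip()
        with open(os.path.join(DIR, name + '.fasta'), 'w') as g:
            # Header line followed by its single-line sequence.
            g.write(X[i] + X[i + 1].rstrip('\n') + '\n')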

This script takes two arguments: first the FASTA file formatted as shown before, and second the name of the directory where you want to save the single-sequence files. The directory must already exist. If no directory is specified, the files are saved in the current working directory.
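
If you prefer to skip the intermediate '.format' file, the two steps can be combined in a single pass. The following is only a sketch under a few assumptions: read_fasta and split_fasta are hypothetical names introduced here, headers are expected in the 'db|accession|name' form (otherwise the first word of the header is used), and the output directory already exists.

```python
import os


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence chunk of the current record
    if header is not None:
        yield header, ''.join(chunks)


def split_fasta(path, out_dir='.'):
    """Write one '<accession>.fasta' file per record; return the paths."""
    written = []
    with open(path) as handle:
        for header, sequence in read_fasta(handle):
            fields = header.split('|')
            # Assumption: UniProt-style header; otherwise use the first word.
            if len(fields) >= 2:
                name = fields[1]
            else:
                name = (header.split() or ['unnamed'])[0]
            target = os.path.join(out_dir, name + '.fasta')
            with open(target, 'w') as out:
                out.write('>' + name + '\n' + sequence + '\n')
            written.append(target)
    return written
```

Calling split_fasta('proteins.fasta', 'out') (both names are placeholders) would write one file per sequence into the 'out' directory, each with the simplified header and the sequence on a single line.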
