pyfasta

fast, memory-efficient, pythonic access to fasta sequence files

Author:: Brent Pedersen (brentp)
Email:: bpederse@gmail.com
License:: MIT

Implementation

Requires Python >= 2.5. Stores a flattened version of the fasta file, without spaces or headers, and reads it through either a memory-mapped numpy array or fseek/fread, so the full sequence data is never read into memory. Saves a pickle (.gdx) of the start and stop offsets (used by fseek/mmap) of each header in the fasta file for internal use.
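To make the layout concrete, here is a minimal sketch of the flatten-and-index idea using only the standard library. It is an illustration under stated assumptions, not pyfasta's own code: the helper names flatten_fasta and read_slice are hypothetical, the index is kept as a plain dict rather than pickled to a .gdx file, and the sketch targets Python 3.

```python
import os
import tempfile


def flatten_fasta(fasta_path, flat_path):
    """Write sequence bytes only (no headers, no newlines) to flat_path.

    Returns {header: (start, stop)} byte offsets into the flat file.
    Hypothetical helper, not part of pyfasta.
    """
    index = {}
    name, start, pos = None, 0, 0
    with open(fasta_path, "rb") as src, open(flat_path, "wb") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            if line.startswith(b">"):
                # Close the previous record before starting a new one.
                if name is not None:
                    index[name] = (start, pos)
                name, start = line[1:].decode("ascii"), pos
            else:
                dst.write(line)
                pos += len(line)
        if name is not None:
            index[name] = (start, pos)
    return index


def read_slice(flat_path, index, name, begin, end):
    """Read bases [begin, end) of one record by seeking into the flat file."""
    start, stop = index[name]
    begin, end = start + begin, min(start + end, stop)
    with open(flat_path, "rb") as fh:
        fh.seek(begin)
        return fh.read(max(end - begin, 0)).decode("ascii")


# Tiny demonstration on a throwaway file.
tmpdir = tempfile.mkdtemp()
fasta_path = os.path.join(tmpdir, "demo.fasta")
with open(fasta_path, "w") as fh:
    fh.write(">chr1\nACTGACTG\nACTG\n>chr2\nAAAACCCC\n")

demo_index = flatten_fasta(fasta_path, fasta_path + ".flat")
first_bases = read_slice(fasta_path + ".flat", demo_index, "chr2", 2, 6)
```

Because each record is a contiguous byte range in the flat file, a slice only needs one seek and one read of exactly the requested length.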

Usage

>>> from pyfasta import Fasta

>>> f = Fasta('tests/data/three_chrs.fasta')
>>> sorted(f.keys())
['chr1', 'chr2', 'chr3']

>>> f['chr1']
NpyFastaRecord(0..80)

Slicing

>>> f['chr1'][:10]
'ACTGACTGAC'

# get the 1st basepair in every codon (it's python yo)
>>> f['chr1'][::3]
'AGTCAGTCAGTCAGTCAGTCAGTCAGT'

# can query by a 'feature' dictionary
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9})
'CTGACTGA'

# same as:
>>> f['chr1'][1:9]
'CTGACTGA'

# with reverse complement (automatic for - strand)
>>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9, 'strand': '-'})
'TCAGTCAG'
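The coordinate convention in the feature examples above (1-based, inclusive start and stop) can be sketched in plain Python. This is only an illustration of the arithmetic, not pyfasta's implementation: feature_sequence is a hypothetical helper, the sketch targets Python 3, and the string "ACTG" * 20 is a stand-in chosen to be consistent with the chr1 outputs shown above.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def feature_sequence(seq, feature):
    """Return the bases covered by a feature with 1-based, inclusive coordinates."""
    # 1-based inclusive [start, stop] maps to the 0-based half-open slice [start-1, stop).
    sub = seq[feature["start"] - 1:feature["stop"]]
    if feature.get("strand") == "-":
        # Reverse complement for the minus strand.
        sub = sub.translate(COMPLEMENT)[::-1]
    return sub


chr1 = "ACTG" * 20  # assumed stand-in for the 80 bp chr1 record
plus = feature_sequence(chr1, {"start": 2, "stop": 9})
minus = feature_sequence(chr1, {"start": 2, "stop": 9, "strand": "-"})
```

With this stand-in, plus is 'CTGACTGA' and minus is 'TCAGTCAG', matching the doctest outputs above.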

Numpy

The default is to use a memmapped numpy array as the backend, in which case it’s possible to get back an array directly:

>>> f['chr1'].tostring = False
>>> f['chr1'][:10] # doctest: +NORMALIZE_WHITESPACE
memmap(['A', 'C', 'T', 'G', 'A', 'C', 'T', 'G', 'A', 'C'], dtype='|S1')

>>> import numpy as np
>>> a = np.array(f['chr2'])
>>> a.shape[0] == len(f['chr2'])
True

>>> a[10:14] # doctest: +NORMALIZE_WHITESPACE
array(['A', 'A', 'A', 'A'], dtype='|S1')

mask a sub-sequence

>>> a[11:13] = np.array('N', dtype='c')
>>> a[10:14].tostring()
'ANNA'
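The doctests above reflect Python 2 and the numpy of the time. On current numpy releases, tostring() is deprecated in favour of tobytes(), and under Python 3 the elements are bytes. A rough equivalent of the masking step, using an ordinary in-memory array as a stand-in for the pyfasta record (the sequence below is made up for illustration):

```python
import numpy as np

# Stand-in sequence; frombuffer gives a read-only view, so copy to get a writable array.
seq = np.frombuffer(b"ACGTACGTACAAAAGG", dtype="S1").copy()

# Mask positions 11 and 12 with N; the single byte is broadcast across the slice.
seq[11:13] = b"N"

# tobytes() joins the one-byte elements back into a bytes object.
masked = seq[10:14].tobytes()
```

Here masked is b'ANNA'. Copying first means the mask is applied to an in-memory array and the file on disk is left untouched.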

Backends (Record class)

It’s also possible to specify another record class as the underlying work-horse for slicing and reading. Currently, the available classes are:

NpyFastaRecord (the default), which uses numpy memmap.

FastaRecord, which uses fseek/fread.

MemoryRecord, which reads everything into memory and must reparse the original fasta every time.

TCRecord, which is identical to NpyFastaRecord except that it saves the index in a TokyoCabinet hash database, for cases when there are enough records that loading the entire index from a pickle into memory is unwise. (Note that the sequence is not loaded into memory in either case.)

It’s possible to specify the class used with the record_class kwarg to the Fasta constructor:

>>> from pyfasta import FastaRecord # default is NpyFastaRecord
>>> f = Fasta('tests/data/three_chrs.fasta', record_class=FastaRecord)
>>> f['chr1']
FastaRecord('tests/data/three_chrs.fasta.flat', 0..80)

Other than the repr, it should behave exactly like the Npy record class backend.

It’s possible to create your own by subclassing FastaRecord. See the source in pyfasta/records.py for details.
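The general pattern of passing a class in as the backend can be sketched without pyfasta. Everything below is hypothetical (SliceRecord, UpperRecord, and MiniFasta are made-up names) and it does not reproduce the hooks a real FastaRecord subclass has to provide; pyfasta/records.py remains the reference for those.

```python
class SliceRecord:
    """Hypothetical base record: holds a sequence string and slices it."""

    def __init__(self, seq):
        self.seq = seq

    def __len__(self):
        return len(self.seq)

    def __getitem__(self, key):
        return self.seq[key]


class UpperRecord(SliceRecord):
    """Hypothetical subclass: same storage, but slices come back upper-cased."""

    def __getitem__(self, key):
        return super().__getitem__(key).upper()


class MiniFasta:
    """Toy container that builds one record per sequence with the chosen class."""

    def __init__(self, sequences, record_class=SliceRecord):
        self.records = {name: record_class(seq) for name, seq in sequences.items()}

    def __getitem__(self, name):
        return self.records[name]


data = {"chr1": "actgACTGac"}
default = MiniFasta(data)
upper = MiniFasta(data, record_class=UpperRecord)
```

The container never needs to know which record class it was given, which is why swapping the backend should only change the repr and the storage strategy, not the slicing interface.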

Command Line Interface

There’s also a command line interface to manipulate / view fasta files. The pyfasta executable is installed via setuptools; running it will show help text.

split a fasta file into 6 new files of relatively even size:

$ pyfasta split -n 6 original.fasta

create 1 new fasta file with the sequence split into 10K-mers:

$ pyfasta split -n 1 -k 10000 original.fasta

2 new fasta files with the sequence split into 10K-mers with 2K overlap:

$ pyfasta split -n 2 -k 10000 -o 2000 original.fasta
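What a k-mer split with overlap means can be sketched in a few lines. This is an assumption about the semantics, not code taken from the command line tool: split_kmers is a hypothetical helper in which each chunk starts k - overlap bases after the previous one, and the last chunks may be shorter than k.

```python
def split_kmers(seq, k, overlap=0):
    """Yield (start, chunk) pairs; consecutive chunks share `overlap` bases."""
    if not 0 <= overlap < k:
        raise ValueError("overlap must be non-negative and smaller than k")
    step = k - overlap
    for start in range(0, len(seq), step):
        # Slicing past the end simply yields a shorter final chunk.
        yield start, seq[start:start + k]


chunks = list(split_kmers("ABCDEFGHIJ", k=4, overlap=2))
```

For this input the chunks start at 0, 2, 4, 6 and 8, so every adjacent pair shares two characters.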

show some info about the file (and show gc content):

$ pyfasta info --gc test/data/three_chrs.fasta
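For reference, GC content is the fraction of bases that are G or C. A minimal sketch of that calculation follows; gc_content is a hypothetical helper, and how the command line tool treats N or other ambiguous bases is not stated here, so this version simply counts them in the denominator.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (case-insensitive); 0.0 for an empty sequence."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


gc_chr1 = gc_content("ACTG" * 20)  # assumed stand-in for an 80 bp record
```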

Extract sequences from the file. Use the header flag to make a new fasta file. The args are a list of sequences to extract.

$ pyfasta extract --header --fasta test/data/three_chrs.fasta seqa seqb seqc

cleanup

(though for real use these will remain for faster access)

>>> import os
>>> os.unlink('tests/data/three_chrs.fasta.gdx')
>>> os.unlink('tests/data/three_chrs.fasta.flat')

Testing

There is currently > 99% test coverage for the 2 modules and all included record classes. To run the tests:

$ python setup.py nosetests

Changes

0.3.2

separate out backends into records.py
use nosetests (python setup.py nosetests)
add a TCRecord backend for next-gen sequencing, available if tc is (easy-)installed.
improve test coverage.

Release history

0.5.2 (Apr 3, 2014)
0.5.1 (Oct 3, 2013)
0.5.0 (Aug 29, 2013)
0.4.5 (Feb 21, 2012)
0.4.4 (Oct 12, 2011)
0.4.3 (May 31, 2011)
0.4.2 (Apr 5, 2011)
0.4.1 (Dec 1, 2010)
0.4.0 (Oct 25, 2010)
0.3.9 (Mar 17, 2010)
0.3.7 (Dec 21, 2009)
0.3.6 (Dec 21, 2009)
0.3.5 (Dec 20, 2009)
0.3.4 (Dec 15, 2009)
0.3.3 (Dec 6, 2009)
0.3.2 (Dec 3, 2009), this version
0.3.1 (Nov 17, 2009)
0.3.0 (Nov 17, 2009)
0.2.9 (Nov 10, 2009)
0.2.8 (Nov 6, 2009)
0.2.5 (Sep 23, 2009)
0.2.4 (Sep 9, 2009)
0.2.3 (Sep 8, 2009)
0.2.2 (Sep 8, 2009)
0.2.1 (Jul 13, 2009)
0.2 (Jul 13, 2009)
0.1 (May 27, 2009)
