Skip to main content

present a collection of indexed fasta files as a single source

Project description

MultiFastaDB presents a collection of indexed fasta files as a single source. The intent is to simplify accessing a virtual database of sequences that is distributed across multiple files.

$ pip install multifastadb

$ python

>>> from multifastadb import MultiFastaDB

The simplest use is by passing a list of files or directories:

>>> mfdb = MultiFastaDB(['tests/data/ncbi'])

By default, MultiFastaDB looks for files ending in .fasta, .fa, .faa, .fna, and compressed versions of these ending in .gz. (NOTE: One must use bgzip for compression; using gzip will fail on reading.)

Fasta files from NCBI contain multiple identifiers for a single sequence encoded in the accession line, such as (gi|53292629|ref|NP_001005405.1|). Optionally, MultiFastaDB will create a meta index to the ref entries:

>>> mfdb = MultiFastaDB(['tests/data/ncbi'], use_meta_index=True)

Sequences may be retrieved by the fetch() method, with optional sequence start and end bounds (in 0-based or interbase coordinates):

>>> seq = mfdb.fetch('NP_001005405.1')
>>> seq = mfdb.fetch('NP_001005405.1',0,10)

NOTE: Fetching subsequences with bounds is much more efficient than:

>>> seq = mfdb.fetch('NP_001005405.1')[0:10]    # Don't do this!

If a sequence occurs more than once, only the first version is returned (intentionally).

Attribute-based retrieval is also supported:

>>> seq = mfdb['NP_001005405.1']
>>> seq = mfdb['NP_001005405.1'][0:10]

Attribute-based retrieval does not fetch any sequence immediately. Instead it returns a SequenceProxy object that fetches sequence lazily and transparently. This is particularly useful for accessing large sequences (e.g., chromosomes).

The locations of a given accession may be found with the where_is() method:

>>> mfdb.where_is('gi|53292629|ref|NP_001005405.1|')   # doctest: +ELLIPSIS
[('tests/data/ncbi/f1.human.protein.small.faa.gz', <pysam.cfaidx.Fastafile object at ...>)]

Project details

Release history Release notifications

History Node


History Node


History Node


History Node


History Node


History Node


History Node


History Node


History Node


This version
History Node


History Node


History Node


History Node


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Filename, size & hash SHA256 hash help File type Python version Upload date
multifastadb-0.2.0-py2.7.egg (2.4 kB) Copy SHA256 hash SHA256 Egg 2.7 Sep 1, 2014
multifastadb-0.2.0.tar.gz (402.4 kB) Copy SHA256 hash SHA256 Source None Sep 1, 2014

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging CloudAMQP CloudAMQP RabbitMQ AWS AWS Cloud computing Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page