"samtools faidx" compatible FASTA indexing in pure python

Please cite Shirley, Matthew (2014): pyfaidx: efficient pythonic random access to fasta subsequences. figshare. DOI:10.6084/m9.figshare.972933.

Description

Samtools provides a function “faidx” (FAsta InDeX), which creates a small flat index file “.fai” allowing fast random access to any subsequence in the indexed FASTA file while loading only a minimal amount of the file into memory.
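To make the random-access idea concrete, here is a minimal sketch of how an index entry can be turned into a byte offset. It is not the samtools or pyfaidx implementation. It assumes every sequence line in a record (except possibly the last) has the same length, and it borrows the field names rlen, offset, lenc and lenb from the Faidx index shown later in this description, interpreting them as sequence length, byte offset of the first base, bases per line and bytes per line. The function names build_index and fetch are hypothetical.

```python
import os
import tempfile


def build_index(fasta_path):
    """Scan the FASTA file once, recording where each record's bases begin."""
    index = {}
    entry = None
    offset = 0  # byte position of the start of the current line
    with open(fasta_path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                entry = {"rlen": 0, "offset": offset + len(line),
                         "lenc": None, "lenb": None}
                index[name] = entry
            elif entry is not None and line.strip():
                bases = len(line.rstrip(b"\r\n"))
                if entry["lenc"] is None:
                    # The first sequence line fixes the line geometry.
                    entry["lenc"], entry["lenb"] = bases, len(line)
                entry["rlen"] += bases
            offset += len(line)
    return index


def fetch(fasta_path, index, name, start, end):
    """Return bases start..end (1-based, inclusive) by seeking, not scanning."""
    entry = index[name]
    if not 1 <= start <= end <= entry["rlen"]:
        raise ValueError("coordinates out of range")
    lenc, lenb = entry["lenc"], entry["lenb"]

    def byte_pos(i):
        # i is a 0-based base position; whole lines cost lenb bytes each.
        return entry["offset"] + (i // lenc) * lenb + (i % lenc)

    first, last = byte_pos(start - 1), byte_pos(end - 1)
    with open(fasta_path, "rb") as handle:
        handle.seek(first)
        raw = handle.read(last - first + 1)
    # Drop the line terminators that fall inside the requested span.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.fa")
        with open(path, "wb") as out:
            out.write(b">seq1 demo\nACGTACGTAC\nGTACGTACGT\nAC\n>seq2\nTTTTGGGG\nCC\n")
        idx = build_index(path)
        print(fetch(path, idx, "seq1", 9, 12))  # prints ACGT
```

Only the bytes between the first and last requested base are read, which is why the memory cost stays small regardless of the size of the FASTA file.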

pyfaidx provides an interface for creating and using this index for fast random access to DNA subsequences in huge FASTA files in a “pythonic” manner. Indexing speed is comparable to samtools, and in some cases sequence retrieval is much faster (benchmark). For example:

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> genes
Fasta("tests/data/genes.fasta")

Acts like a dictionary.

>>> genes.keys()
['NR_104215.1', 'KF435150.1', 'NM_001282548.1', 'NM_001282549.1',
'XM_005249644.1', 'NM_001282543.1', 'NR_104216.1', 'XM_005265508.1',
'XR_241079.1', 'AB821309.1', 'XM_005249645.1', 'XR_241081.1',
'XM_005249643.1', 'XM_005249642.1', 'NM_001282545.1', 'NR_104212.1',
'XR_241080.1', 'XM_005265507.1', 'KF435149.1', 'NM_000465.3']

>>> genes['NM_001282543.1'][200:230]
NM_001282543.1:201-230
CTCGTTCCGCGCCCGCCATGGAACCGGATG

>>> genes['NM_001282543.1'][200:230].seq
'CTCGTTCCGCGCCCGCCATGGAACCGGATG'

>>> genes['NM_001282543.1'][200:230].name
'NM_001282543.1:201-230'

>>> genes['NM_001282543.1'][200:230].start
201

>>> genes['NM_001282543.1'][200:230].end
230

Slices just like a string:

>>> genes['NM_001282543.1'][200:230][:10]
NM_001282543.1:201-210
CTCGTTCCGC

>>> genes['NM_001282543.1'][200:230][::-1]
NM_001282543.1:230-201
GTAGGCCAAGGTACCGCCCGCGCCTTGCTC

>>> genes['NM_001282543.1'][200:230][::3]
NM_001282543.1:201-230
CGCCCCTACA

Complements and reverse complements just like DNA:

>>> genes['NM_001282543.1'][200:230].complement
NM_001282543.1 (complement):201-230
GAGCAAGGCGCGGGCGGTACCTTGGCCTAC

>>> -genes['NM_001282543.1'][200:230]
NM_001282543.1 (complement):230-201
CATCCGGTTCCATGGCGGGCGCGGAACGAG
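For readers who want to check the sequences above by hand, the two operations can be sketched in plain Python with a translation table. This is an illustration only, not how pyfaidx implements them; the helper names complement and reverse_complement are hypothetical, and the table covers only A, C, G, T and N.

```python
# Translation table for unambiguous DNA bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def complement(seq):
    """Swap each base for its pairing partner, keeping the order."""
    return seq.translate(_COMPLEMENT)


def reverse_complement(seq):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(seq)[::-1]


seq = "CTCGTTCCGCGCCCGCCATGGAACCGGATG"  # NM_001282543.1:201-230 from above
print(complement(seq))          # GAGCAAGGCGCGGGCGGTACCTTGGCCTAC
print(reverse_complement(seq))  # CATCCGGTTCCATGGCGGGCGCGGAACGAG
```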

It also provides a command-line script:

cli script: faidx

$ faidx tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320
>NM_001282543.1:201-210
CTCGTTCCGC
>NM_001282543.1:300-320
GTAATTGTGTAAGTGACTGCA

The region syntax is the same as that of samtools faidx.
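As an illustration of the name:start-end region syntax, the sketch below splits a region string into its parts. The function parse_region is hypothetical and is not part of pyfaidx. It splits on the last colon because sequence names such as EM_PHG:V01146 can themselves contain one; how samtools or pyfaidx actually resolve that ambiguity is not covered here and is an assumption of this sketch.

```python
import re


def parse_region(region):
    """Split 'name:start-end' into (name, start, end).

    A region without a numeric 'start-end' suffix is treated as a bare
    sequence name and returned as (name, None, None).
    """
    name, sep, span = region.rpartition(":")
    if sep and re.fullmatch(r"\d+-\d+", span):
        start, end = (int(part) for part in span.split("-"))
        if start < 1 or end < start:
            raise ValueError("invalid coordinates in region: " + region)
        return name, start, end
    return region, None, None


print(parse_region("NM_001282543.1:201-210"))  # ('NM_001282543.1', 201, 210)
print(parse_region("EM_PHG:V01146"))           # ('EM_PHG:V01146', None, None)
```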

A lower-level Faidx class is also available:

>>> from pyfaidx import Faidx
>>> fa = Faidx('T7.fa')
>>> fa.build('T7.fa', 'T7.fa.fai')
>>> fa.index
{'EM_PHG:V01146': {'lenc': 60, 'lenb': 61, 'rlen': 39937, 'offset': 40571}, 'EM_PHG:GU071091': {'lenc': 60, 'lenb': 61, 'rlen': 39778, 'offset': 74}}

>>> fa.fetch('EM_PHG:V01146', 1, 10)
>EM_PHG:V01146
TCTCACAGTG

>>> fa.fetch('EM_PHG:V01146', 100, 120)
>EM_PHG:V01146
GGTTGGGGATGACCCTTGGGT

  • If the FASTA file is not yet indexed, the build method runs automatically when Faidx is initialized, producing “filename.fa.fai”, where “filename.fa” is the original FASTA file.
  • Start and end coordinates are 1-based and inclusive, so fetch('EM_PHG:V01146', 1, 10) returns 10 bases.
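The examples in this description use two conventions: Python slices on a Fasta record are 0-based and half-open ([200:230] is reported as 201-230), while Faidx.fetch and the command-line regions use 1-based coordinates. A small sketch of the conversion, using hypothetical helper names that are not part of pyfaidx:

```python
def slice_to_region(name, start, stop):
    """Turn a 0-based, half-open slice [start, stop) into 'name:first-last'."""
    if not 0 <= start < stop:
        raise ValueError("expected 0 <= start < stop")
    return "{0}:{1}-{2}".format(name, start + 1, stop)


def region_to_slice(start, end):
    """Turn 1-based, inclusive coordinates into an equivalent Python slice."""
    if not 1 <= start <= end:
        raise ValueError("expected 1 <= start <= end")
    return slice(start - 1, end)


print(slice_to_region("NM_001282543.1", 200, 230))  # NM_001282543.1:201-230
print("ACGTACGT"[region_to_slice(1, 4)])            # ACGT
```

Both forms describe the same number of bases: stop - start for the slice and end - start + 1 for the region.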

Installation

This package is tested under Python 3.3, 3.2, 2.7, 2.6, and PyPy.

pip install pyfaidx

or

python setup.py install

CLI Usage

“samtools faidx” compatible FASTA indexing in pure python.

usage: faidx [-h] [-n] fasta [regions [regions ...]]

Fetch sequence from faidx-indexed FASTA

positional arguments:
  fasta       FASTA file
  regions     space separated regions of sequence to fetch e.g. chr1:1-1000

optional arguments:
  -h, --help  show this help message and exit
  -n, --name  print sequence names

Acknowledgements

This project is freely licensed by the author, Matthew Shirley, and was completed under the mentorship and financial support of Drs. Sarah Wheelan and Vasan Yegnasubramanian at the Sidney Kimmel Comprehensive Cancer Center in the Department of Oncology.

Release history

0.5.3.1, 0.5.3, 0.5.2, 0.5.1, 0.5.0.1, 0.5.0, 0.4.9.2, 0.4.9.1, 0.4.9, 0.4.8.4, 0.4.8.3, 0.4.8.2, 0.4.8.1, 0.4.8, 0.4.7.1, 0.4.7, 0.4.6, 0.4.5.2, 0.4.5.1, 0.4.5, 0.4.4, 0.4.3.1, 0.4.3, 0.4.2, 0.4.1.1, 0.4.1, 0.4.0.1, 0.4.0, 0.3.9.1, 0.3.9, 0.3.8.1, 0.3.8, 0.3.7.1, 0.3.7, 0.3.6.1, 0.3.6, 0.3.5, 0.3.4, 0.3.3, 0.3.2, 0.3.1, 0.3.0, 0.2.9, 0.2.8, 0.2.7, 0.2.6, 0.2.5, 0.2.4, 0.2.3, 0.2.1, 0.2.0, 0.1.9, 0.1.8, 0.1.7, 0.1.6, 0.1.5, 0.1.4 (this version), 0.1.3, 0.1.2, 0.1.1, 0.1.0

Download files

pyfaidx-0.1.4.tar.gz (14.4 kB), source distribution, uploaded Apr 15, 2014
