pyfaidx

pyfaidx: efficient pythonic random access to fasta subsequences

Development Status
- 5 - Production/Stable
Environment
- Console
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Natural Language
- English
Operating System
- Unix
Programming Language
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

Description

Samtools provides a function “faidx” (FAsta InDeX), which creates a small flat index file “.fai” allowing for fast random access to any subsequence in the indexed FASTA file, while loading a minimal amount of the file into memory. pyfaidx provides Pythonic random access to FASTA subsequences using such an index.
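To make the role of the index concrete, here is a minimal sketch of how four numbers per record are enough to seek straight to a subsequence. It is not pyfaidx's implementation: the fetch function is a hypothetical helper, and its parameter names simply mirror the IndexRecord fields (offset, lenc, lenb) shown later in this document.

```python
import os
import tempfile


def fetch(path, offset, lenc, lenb, start, stop):
    """Return bases [start, stop) (0-based, end-exclusive) of one record.

    offset: byte offset of the record's first base
    lenc:   bases per full sequence line
    lenb:   bytes per full sequence line (bases plus line ending)
    """
    # Byte position of base i: the whole lines before it, plus its column.
    first = offset + (start // lenc) * lenb + (start % lenc)
    last = offset + (stop // lenc) * lenb + (stop % lenc)
    with open(path, "rb") as handle:
        handle.seek(first)
        chunk = handle.read(last - first)
    # The chunk may span line breaks; drop them to recover the bases.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


# A tiny FASTA file: seq1 has 24 bases wrapped at 10 bases per line.
fasta = b">seq1 demo\nACGTACGTAC\nGGGTTTCCCA\nTTAA\n>seq2\nCCCC\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".fa") as tmp:
    tmp.write(fasta)
path = tmp.name

# ">seq1 demo\n" is 11 bytes long, so the first base of seq1 is at byte 11.
result = fetch(path, offset=11, lenc=10, lenb=11, start=8, stop=13)
os.remove(path)
print(result)
```

Only the requested bytes are read, so the cost of a query does not depend on the size of the file.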

Installation

This package is tested under Python 3.2-3.4, 2.7, 2.6, and PyPy.

pip install pyfaidx

or

python setup.py install

Usage

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> genes
Fasta("tests/data/genes.fasta")  # set strict_bounds=True for bounds checking

Acts like a dictionary.

>>> genes.keys()
('AB821309.1', 'KF435150.1', 'KF435149.1', 'NR_104216.1', 'NR_104215.1', 'NR_104212.1', 'NM_001282545.1', 'NM_001282543.1', 'NM_000465.3', 'NM_001282549.1', 'NM_001282548.1', 'XM_005249645.1', 'XM_005249644.1', 'XM_005249643.1', 'XM_005249642.1', 'XM_005265508.1', 'XM_005265507.1', 'XR_241081.1', 'XR_241080.1', 'XR_241079.1')

>>> genes['NM_001282543.1'][200:230]
>NM_001282543.1:201-230
CTCGTTCCGCGCCCGCCATGGAACCGGATG

>>> genes['NM_001282543.1'][200:230].seq
'CTCGTTCCGCGCCCGCCATGGAACCGGATG'

>>> genes['NM_001282543.1'][200:230].name
'NM_001282543.1:201-230'

>>> genes['NM_001282543.1'][200:230].start
201

>>> genes['NM_001282543.1'][200:230].end
230

>>> len(genes['NM_001282543.1'])
5466

Indexes like a list:

>>> genes[0][:50]
>AB821309.1:1-50
ATGGTCAGCTGGGGTCGTTTCATCTGCCTGGTCGTGGTCACCATGGCAAC

Slices just like a string:

>>> genes['NM_001282543.1'][200:230][:10]
>NM_001282543.1:201-210
CTCGTTCCGC

>>> genes['NM_001282543.1'][200:230][::-1]
>NM_001282543.1:230-201
GTAGGCCAAGGTACCGCCCGCGCCTTGCTC

>>> genes['NM_001282543.1'][200:230][::3]
>NM_001282543.1:201-230
CGCCCCTACA

>>> genes['NM_001282543.1'][:]
>NM_001282543.1:1-5466
CCCCGCCCCT........

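Slice coordinates are 0-based and end-exclusive, just like Python. The name, start, and end attributes of the returned record report the same range as 1-based, inclusive coordinates, so [200:230] is labeled 201-230.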
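The relationship between a Python slice and the label printed in the examples above can be written down in a couple of lines. The region_label function is a hypothetical helper for illustration only, not part of pyfaidx:

```python
def region_label(name, start, stop):
    """Format a 0-based, end-exclusive slice as a 1-based, inclusive label."""
    # Only the start shifts by one: an exclusive 0-based end already
    # equals the inclusive 1-based end.
    return "{0}:{1}-{2}".format(name, start + 1, stop)


print(region_label("NM_001282543.1", 200, 230))
```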

Sequence can be buffered in memory using a read-ahead buffer:

>>> genes = Fasta('tests/data/genes.fasta', read_ahead=100)

>>> genes['NM_001282543.1'][200:230][::-1]
>NM_001282543.1:230-201
GTAGGCCAAGGTACCGCCCGCGCCTTGCTC

>>> len(genes.buffer)
100
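Why a read-ahead buffer helps sequential access can be sketched without pyfaidx. The ReadAhead class, its get method, and the fetch callable below are hypothetical names chosen for this illustration; they are an assumption about the general technique, not a description of pyfaidx internals.

```python
class ReadAhead:
    """Serve fetch(start, stop) requests from a cached window when possible."""

    def __init__(self, fetch, read_ahead):
        self.fetch = fetch            # callable(start, stop) -> str
        self.read_ahead = read_ahead  # minimum number of bases per refill
        self.buffer = ""
        self.buffer_start = 0
        self.misses = 0               # how often the backing store was hit

    def get(self, start, stop):
        buffer_stop = self.buffer_start + len(self.buffer)
        if not (self.buffer_start <= start and stop <= buffer_stop):
            # Miss: refill from `start`, reading at least `read_ahead` bases.
            self.misses += 1
            self.buffer_start = start
            self.buffer = self.fetch(start, max(stop, start + self.read_ahead))
        offset = start - self.buffer_start
        return self.buffer[offset:offset + (stop - start)]


sequence = "ACGT" * 50  # stands in for a sequence stored on disk
reader = ReadAhead(lambda start, stop: sequence[start:stop], read_ahead=100)
first = reader.get(0, 10)    # miss: fills the buffer with bases 0-99
second = reader.get(10, 20)  # hit: answered from the buffer
print(first, second, reader.misses, len(reader.buffer))
```

Many small neighbouring queries then cost a single read of the underlying file, at the price of holding read_ahead bases in memory.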

Complements and reverse complements just like DNA:

>>> genes['NM_001282543.1'][200:230].complement
>NM_001282543.1 (complement):201-230
GAGCAAGGCGCGGGCGGTACCTTGGCCTAC

>>> genes['NM_001282543.1'][200:230].reverse
>NM_001282543.1:230-201
GTAGGCCAAGGTACCGCCCGCGCCTTGCTC

>>> -genes['NM_001282543.1'][200:230]
>NM_001282543.1 (complement):230-201
CATCCGGTTCCATGGCGGGCGCGGAACGAG
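For readers who want to check the outputs above by hand, the three operations can be reproduced on plain strings. This is a minimal sketch using only the standard library; the function names are illustrative, and ambiguity codes other than A, C, G, and T are left unchanged.

```python
# Translation table for the four unambiguous bases, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Swap each base for its partner without changing the order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Complement the sequence and read it from the other end."""
    return complement(seq)[::-1]


seq = "CTCGTTCCGCGCCCGCCATGGAACCGGATG"  # genes['NM_001282543.1'][200:230]
print(complement(seq))
print(seq[::-1])
print(reverse_complement(seq))
```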

Custom key functions provide cleaner access:

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', key_function = lambda x: x.split('.')[0])
>>> genes.keys()
dict_keys(['NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])
>>> genes['NR_104212'][:10]
>NR_104212:1-10
CCCCGCCCCT
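The idea behind a key function is simply that lookups go through a transformed name. The sketch below shows that mapping on a plain list of names; build_lookup is a hypothetical helper, and raising an error on colliding keys is an assumption of this sketch rather than documented pyfaidx behavior.

```python
def build_lookup(names, key_function=lambda name: name):
    """Map key_function(name) to the original record name."""
    lookup = {}
    for name in names:
        key = key_function(name)
        if key in lookup:
            # Two records collapsing to one key would make lookups ambiguous.
            raise ValueError("duplicate key: " + key)
        lookup[key] = name
    return lookup


names = ["NR_104212.1", "NM_001282543.1", "NM_000465.3"]
lookup = build_lookup(names, key_function=lambda x: x.split(".")[0])
print(lookup["NM_000465"])
```

A key function that drops the version suffix is convenient, but it is worth checking that it does not map two records to the same key.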

Or just get a Python string:

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', as_raw=True)
>>> genes
Fasta("tests/data/genes.fasta", as_raw=True)

>>> genes['NM_001282543.1'][200:230]
CTCGTTCCGCGCCCGCCATGGAACCGGATG

You can also perform line-based iteration, receiving the sequence lines as they appear in the FASTA file:

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> for line in genes['NM_001282543.1']:
...   print(line)
CCCCGCCCCTCTGGCGGCCCGCCGTCCCAGACGCGGGAAGAGCTTGGCCGGTTTCGAGTCGCTGGCCTGC
AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
CGATGCCGGATAATCGGCAGCCGAGGAACCGGCAGCCGAGGATCCGCTCCGGGAACGAGCCTCGTTCCGC
...

If you want to modify the contents of your FASTA file in-place, you can use the mutable argument. Any portion of the FastaRecord can be replaced with an equivalent-length string. Warning: This will change the contents of your file immediately and permanently:

>>> genes = Fasta('tests/data/genes.fasta', mutable=True)
>>> type(genes['NM_001282543.1'])
<class 'pyfaidx.MutableFastaRecord'>

>>> genes['NM_001282543.1'][:10]
>NM_001282543.1:1-10
CCCCGCCCCT
>>> genes['NM_001282543.1'][:10] = 'NNNNNNNNNN'
>>> genes['NM_001282543.1'][:15]
>NM_001282543.1:1-15
NNNNNNNNNNCTGGC
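The equal-length requirement follows from how in-place editing works: bytes are overwritten where they sit, so nothing after them may move. Here is a minimal sketch of that technique on a line-wrapped file. The overwrite function is hypothetical, its parameters mirror the index fields used elsewhere in this document, and it does no bounds checking, so the caller must keep the replacement inside the record.

```python
import os
import tempfile


def overwrite(path, offset, lenc, lenb, start, replacement):
    """Overwrite bases beginning at 0-based `start`, leaving newlines alone."""
    with open(path, "r+b") as handle:
        position = start
        remaining = replacement.encode("ascii")
        while remaining:
            column = position % lenc
            # Write only to the end of the current line so the line
            # ending that follows it is never touched.
            piece = remaining[:lenc - column]
            handle.seek(offset + (position // lenc) * lenb + column)
            handle.write(piece)
            position += len(piece)
            remaining = remaining[len(piece):]


with tempfile.NamedTemporaryFile(delete=False, suffix=".fa") as tmp:
    tmp.write(b">seq1\nACGTACGTAC\nGGGTTTCCCA\n")
path = tmp.name

# Replace bases 8-11 (0-based), which straddle the first line break.
overwrite(path, offset=6, lenc=10, lenb=11, start=8, replacement="NNNN")
with open(path, "rb") as handle:
    content = handle.read()
os.remove(path)
print(content)
```

Because the file length and line layout are unchanged, an existing index stays valid after the edit.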

It also provides a command-line script:

cli script: faidx

For usage type faidx -h.

$ faidx tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320
>NM_001282543.1
CTCGTTCCGC
>NM_001282543.1
GTAATTGTGTAAGTGACTGCA

$ faidx --complement tests/data/genes.fasta NM_001282543.1:201-210
>NM_001282543.1
GAGCAAGGCG

$ faidx --reverse tests/data/genes.fasta NM_001282543.1:201-210
>NM_001282543.1
CGCCTTGCTC

$ faidx tests/data/genes.fasta NM_001282543.1
>NM_001282543.1
CCCCGCCCCT........

$ faidx tests/data/genes.fasta --list regions.txt
...

The syntax is similar to that of samtools faidx.
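Region strings such as NM_001282543.1:201-210 are easy to take apart when scripting around the command line. The parser below is a minimal sketch, not the one used by the faidx script; parse_region and the REGION pattern are hypothetical names.

```python
import re

# A record name, optionally followed by ":start-end".
REGION = re.compile(r"^(?P<name>.+?)(?::(?P<start>\d+)-(?P<end>\d+))?$")


def parse_region(region):
    """Split 'name:start-end' into (name, start, end).

    start and end are None when only a record name is given.
    """
    match = REGION.match(region)
    if match is None:
        raise ValueError("not a valid region: {0!r}".format(region))
    start, end = match.group("start"), match.group("end")
    if start is None:
        return match.group("name"), None, None
    return match.group("name"), int(start), int(end)


print(parse_region("NM_001282543.1:201-210"))
print(parse_region("NM_001282543.1"))
```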

A lower-level Faidx class is also available:

>>> from pyfaidx import Faidx
>>> fa = Faidx('genes.fa')  # can return str with as_raw=True
>>> fa.index
OrderedDict([('AB821309.1', IndexRecord(rlen=3510, offset=12, lenc=70, lenb=71)), ('KF435150.1', IndexRecord(rlen=481, offset=3585, lenc=70, lenb=71)),... ])

>>> fa.index['AB821309.1'].rlen
3510

>>> fa.fetch('AB821309.1', 1, 10)
>AB821309.1:1-10
ATGGTCAGCT

If the FASTA file is not indexed when Faidx is initialized, the build_index method runs automatically and the index is written to “filename.fa.fai” with write_fai(), where “filename.fa” is the original FASTA file.
Start and end coordinates passed to fetch are 1-based and inclusive.
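What building an index involves can be sketched as a single streaming pass over the file. The code below is an illustration, not pyfaidx's build_index: the namedtuple only mimics the IndexRecord fields printed above, build_index and fai_line are stand-in functions, and the input is assumed to be a well-formed FASTA file with uniformly wrapped lines.

```python
import io
from collections import OrderedDict, namedtuple

IndexRecord = namedtuple("IndexRecord", ["rlen", "offset", "lenc", "lenb"])


def build_index(handle):
    """Scan a binary FASTA stream once and return name -> IndexRecord."""
    index = OrderedDict()
    name = None
    position = 0  # byte offset of the line currently being read
    for line in handle:
        if line.startswith(b">"):
            if name is not None:
                index[name] = IndexRecord(rlen, offset, lenc, lenb)
            # The record name is the header text up to the first whitespace.
            name = line[1:].split()[0].decode("ascii")
            rlen, lenc, lenb = 0, 0, 0
            offset = position + len(line)  # the first base follows the header
        elif name is not None:
            bases = len(line.rstrip(b"\r\n"))
            if lenc == 0:
                # The first sequence line defines the wrapping of the record.
                lenc, lenb = bases, len(line)
            rlen += bases
        position += len(line)
    if name is not None:
        index[name] = IndexRecord(rlen, offset, lenc, lenb)
    return index


def fai_line(name, record):
    """Format one record as a tab-separated .fai line."""
    return "\t".join([name] + [str(field) for field in record])


data = b">seq1 demo\nACGTACGTAC\nGGGTTTCCCA\nTTAA\n>seq2\nCCCC\n"
index = build_index(io.BytesIO(data))
for name, record in index.items():
    print(fai_line(name, record))
```

The columns follow the samtools .fai layout: name, sequence length, byte offset of the first base, bases per line, and bytes per line.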

Changes

New in version 0.3.1:

Fasta can now accept an integer index in addition to string keys.

New in version 0.3.0:

FastaRecord now works as a line-based iterator (#30)
Added MutableFastaRecord class that allows same-length in-place replacement for FASTA (#29)

New in version 0.2.9:

Added read-ahead buffer for fast sequential sequence access (#26)
Fixed a condition where as_raw parameter was not respected (#27)

New in version 0.2.8:

Small internal refactoring

New in version 0.2.7:

Faidx and Fasta strict_bounds bounds checking logic is more correct
Fasta default_seq parameter now works
CLI script faidx now takes a BED file for fetching regions from a fasta

New in version 0.2.6:

Faidx no longer has raw_index attribute or rebuild_index method (reduce memory footprint)
Faidx index memory usage decreased by 31-40%
.fai creation is streaming, performance increase for very large indices
Possible speed regression when performing many small queries using Fasta class

New in version 0.2.5:

Fasta and Faidx can take default_seq in addition to as_raw, key_function, and strict_bounds parameters.
Fixed issue #20
Faidx has attribute raw_index which is a list representing the fai file.
Faidx has rebuild_index and write_fai functions for building and writing raw_index to file.
Extra test cases, and test cases against Biopython SeqIO

New in version 0.2.4:

Faidx index order is stable and non-random

New in version 0.2.3:

Fixed a bug affecting Python 2.6

New in version 0.2.2:

Fasta can receive the strict_bounds argument

New in version 0.2.1:

FastaRecord str attribute returns a string
Fasta is now an iterator

New in version 0.2.0:

as_raw keyword arg for Faidx and Fasta allows a simple string return type
__str__ method for FastaRecord returns entire contig sequence

New in version 0.1.9:

line wrapping of faidx is set based on the wrapping of the indexed fasta file
added --reverse and --complement arguments to faidx

New in version 0.1.8:

key_function keyword argument to Fasta allows lookup based on function output

Acknowledgements

This project is freely licensed by the author, Matthew Shirley, and was completed under the mentorship and financial support of Drs. Sarah Wheelan and Vasan Yegnasubramanian at the Sidney Kimmel Comprehensive Cancer Center in the Department of Oncology.

Release history

0.9.0.3

Sep 3, 2025

0.9.0.2

Sep 3, 2025

0.9.0.1

Aug 21, 2025

0.9.0

Aug 21, 2025

0.8.2

Aug 15, 2025

0.8.1.4

May 5, 2025

0.8.1.3

Oct 10, 2024

0.8.1.2

Aug 5, 2024

0.8.1.1

Jan 19, 2024

0.8.1

Jan 18, 2024

0.8.0

Jan 8, 2024

0.7.2.2

Sep 22, 2023

0.7.2.1

Feb 16, 2023

0.7.2

Feb 15, 2023

0.7.1

Jul 26, 2022

0.7.0

Jun 2, 2022

0.6.4

Jan 31, 2022

0.6.3.1

Oct 28, 2021

0.6.3

Oct 28, 2021

0.6.2

Aug 30, 2021

0.6.1

Jul 14, 2021

0.6.0

Jun 29, 2021

0.5.9.5

Feb 28, 2021

0.5.9.2

Dec 9, 2020

0.5.9.1

Jul 29, 2020

0.5.9

Jun 22, 2020

0.5.8

Jan 16, 2020

0.5.7

Dec 10, 2019

0.5.6

Nov 22, 2019

0.5.5.2

Oct 27, 2018

0.5.5.1

Oct 14, 2018

0.5.5

Sep 19, 2018

0.5.4.2

Aug 4, 2018

0.5.4.1

Jun 20, 2018

0.5.4

May 12, 2018

0.5.3.1

Feb 9, 2018

0.5.3

Feb 5, 2018

0.5.2

Jan 25, 2018

0.5.1

Oct 26, 2017

0.5.0.1

Sep 11, 2017

0.5.0

Jul 25, 2017

0.4.9.2

Jun 15, 2017

0.4.9.1

Jun 15, 2017

0.4.9

Jun 14, 2017

0.4.8.4

Apr 18, 2017

0.4.8.3

Mar 26, 2017

0.4.8.2

Mar 2, 2017

0.4.8.1

Oct 20, 2016

0.4.8

Oct 16, 2016

0.4.7.1

Dec 31, 2015

0.4.7

Dec 10, 2015

0.4.6

Dec 8, 2015

0.4.5.2

Nov 17, 2015

0.4.5.1

Nov 17, 2015

0.4.5

Nov 16, 2015

0.4.4

Oct 26, 2015

0.4.3.1

Oct 24, 2015

0.4.3

Oct 22, 2015

0.4.2

Aug 3, 2015

0.4.1.1

May 12, 2015

0.4.1

May 8, 2015

0.4.0.1

May 12, 2015

0.4.0

Apr 30, 2015

0.3.9.1

May 12, 2015

0.3.9

Apr 5, 2015

0.3.8.1

May 12, 2015

0.3.8

Apr 2, 2015

0.3.7.1

May 12, 2015

0.3.7

Mar 3, 2015

0.3.6.1

May 12, 2015

0.3.6

Mar 2, 2015

0.3.5

Feb 13, 2015

0.3.4

Jan 13, 2015

0.3.3

Jan 7, 2015

0.3.2

Dec 26, 2014

This version

0.3.1

Dec 15, 2014

0.3.0

Nov 20, 2014

0.2.9

Nov 8, 2014

0.2.8

Oct 28, 2014

0.2.7

Sep 9, 2014

0.2.6

Sep 4, 2014

0.2.5

Aug 27, 2014

0.2.4

Aug 6, 2014

0.2.3

Jul 29, 2014

0.2.1

Jul 3, 2014

0.2.0

Jul 1, 2014

0.1.9

Jun 23, 2014

0.1.8

Jun 16, 2014

0.1.7

Jun 10, 2014

0.1.6

Apr 21, 2014

0.1.5

Apr 16, 2014

0.1.4

Apr 15, 2014

0.1.3

Mar 3, 2014

0.1.2

Feb 25, 2014

0.1.1

Feb 17, 2014

0.1.0

Feb 3, 2014

Download files

Source Distribution

pyfaidx-0.3.1.tar.gz (15.7 kB)

Uploaded Dec 15, 2014 Source

File details

Details for the file pyfaidx-0.3.1.tar.gz.

File metadata

Download URL: pyfaidx-0.3.1.tar.gz
Upload date: Dec 15, 2014
Size: 15.7 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for pyfaidx-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`d7d1ec68bbba8d92a01758e553eb37beeba34b677e546d1b84534e2248a6c622`
MD5	`6dcaea2c3d334a605e44d9203e410504`
BLAKE2b-256	`1467526750a2e36416dab90992568c7b002457910cf95699fb7a60fda3810cc4`


pyfaidx 0.3.1
