Yet Another FASTA Index and Sequence Extraction Tool
Project description
YAFAX (Yet Another FASTA Index and Extraction Tool)
Introduction
Pure-Python implementation of samtools faidx with no external dependencies. The tool provides two modes: index, which creates a genome index, and getseq, which retrieves sequences from an indexed genome. As with samtools faidx, the genome must be indexed before sequence retrieval. By default, the genome index is expected to be located in the same directory as the genome FASTA file. Since this is an exact subset of samtools faidx, it is fully interoperable with it. An index created using yafax index can be used by samtools faidx for sequence retrieval, and vice versa. It offers reasonable performance: fetching 2 million queries, each with a width of 1 kbp, takes less than 2 minutes for the GRCh38 build of human genome. It is faster than the existing pyFaidx package due to lower overhead, but it offers fewer options.
Dependencies
YAFAX does not have any external dependencies. However, since the source code utilizes type hints, Python 3.11 or higher is recommended. Additionally, the argparse module in Python 3.13 includes enhanced formatting features that are used in the help message, so Python 3.13 or higher is recommended for full compatibility.
YAFAX has been tested on Ubuntu 20.04, 22.04, 24.04, and Alpine Linux v3.22 with Python versions 3.10–3.13. It should also work on Windows, although Windows Subsystem for Linux (WSL) is recommended for best results.
Installation
Installation from PyPI
pip install yafax
Installation from source
git clone https://github.com/dasprosad/yafax.git
cd yafax
pip install .
Usage and options
For help and usage
yafax -h/--help
yafax <mode> -h/--help
General options
usage: yafax <mode> --help
yafax <mode> [mode optional arguments] <fasta file>
YAFAX (Yet Another FASTA Index and Extraction Tool)
positional arguments:
FASTA FILE path to FASTA file
options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
yafax mode:
{index,getseq}
index generate a FASTA index
getseq fetch sequence of an interval using position
YAFAX index options
usage: yafax index [optional arguments] <fasta file>
options:
-h, --help show this help message and exit
-d, --out-dir directory of the output FASTA index
YAFAX getseq options
usage: yafax getseq [optional arguments] <position> <fasta file>
positional arguments:
position position interval in format chr:start-end
options:
-h, --help show this help message and exit
-i, --index-path path of FASTA index (Default: file.fa.fai)
-r, --revcomp reverse complement the sequence (Default: False)
-w, --width wrapping width of FASTA lines (Default: 50)
-n, --no-header do not include FASTA header (Default: False)
-b, --bedfile BED file with regions
-o, --outfile output file, only used when BED is provided (Default: STDOUT)
Examples
YAFAX index examples
- To index a genome use
yafax index genome.fa
- Index a genome and store it into some other location
yafax index --out-dir /some/other/path genome.fa
YAFAX getseq examples
- To get sequence of an interval
chrK:start-end
yafax getseq chrK:start-end genome.fa
- To get sequence using index path
yafax getseq chrK:start-end --index-path /path/to/index/genome.fa.fai
- To get a reverse-complemented sequence of an interval
yafax getseq --revcomp chrK:start-end genome.fa
- By default the FASTA sequence are wrapped to 50 characters. To set a custom width use
yafax getseq --width INT chrK:start-end genome.fa
- If you do not want the FASTA headers on the sequences, use
yafax getseq --no-header chrK:start-end genome.fa
- By default the output is sent to STDOUT. They can be saved to a file by using
yafax getseq chrK:start-end --outfile FILENAME.fa genome.fa
- If multiple entries to be used it is convenient use a BED file
yafax getseq --bedfile POSITIONS.bed genome.fa
- To fetch reverse-complemented sequence of the interval chrM:10000-11000 with no FASTA header
yafax getseq --revcomp --no-header chrM:10000-11000 genome.fa
Benchmarking with faidx cli and samtools faidx
When benchmarked with both faidx cli from pyFaidx and the original samtools faidx, yafax getseq was almost 43x faster than faidx and only 2.26x slower than samtools faidx. Given that samtools is written in C, pure Python yafax is very much performant. All of the tests were done on Ubuntu 22.04 LTS x86-64 using Intel Xeon 2.30GHz, single core, and a solid state drive. The results are the following.
Pure samtools faidx (SAMtools v1.22.1 with htslib 1.22.1)
BED file was transformed to chr:start-end to meet samtools faidx requirement.
time samtools faidx hg38.fa --region-file seed100_interval1000.txt --output samtoolsfaidxout.fa
real 0m43.708s
user 0m4.674s
sys 0m21.075s
faidx cli from pyFaidx with CPython v3.11 (pyFaidex v0.9.0.3)
time faidx --bed seed100_interval1000.bed --out faidxout.fa hg38.fa
real 42m47.490s
user 3m32.369s
sys 0m53.095s
getseq from yafax with CPython v3.13 (faidex v1.0.0)
time yafax getseq --bedfile seed100_interval1000.bed --outfile yafaxgetseqout.fa hg38.fa
real 1m39.025s
user 1m24.853s
sys 0m13.172s
License
YAFAX is distributed under the GNU General Public License v3. You should have received a copy the license with the program.
Release history
- 1.0.0 First stable release
Typing
YAFAX ships with inline type hints and includes a py.typed marker. It is compatible with mypy and pyright.
Development
For testing tests are located in the tests/ directory and be tested by cloning the repository and running
pip install -e .
pip install pytest
pytest
Reporting issues
If you find any bugs or would like a new feature, please report the same in the GitHub YAFAX issue tracker.
Contributing
- Fork the repository on GitHub (https://github.com/dasprosad/yafax)
- Clone your fork (
git clone https://github.com/yourname/yourproject) - Create your feature branch (
git checkout -b feature/my-new-feature) - Commit your changes (
git commit -am 'Describe your changes here') - Push to the branch (
git push origin feature/my-new-feature) - Create a new Pull Request
Reference
-
Li, H., Handsaker, R. E., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., & Durbin, R.; 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. DOI:10.1093/bioinformatics/btp352.
-
Shirley, M. D., Ma, Z., Pedersen, B. S., & Wheelan, S. J. (2015). Efficient “pythonic” access to FASTA files using pyfaidx. PeerJ PrePrints, 3, e1196. DOI:10.7287/peerj.preprints.970v1.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file yafax-1.0.0.tar.gz.
File metadata
- Download URL: yafax-1.0.0.tar.gz
- Upload date:
- Size: 24.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
878091c2c540040ea01570f857c716d0db0ec09a654a7a7a155db89f8e1f5ee5
|
|
| MD5 |
54032cce7987ca584dd06ce33f3da923
|
|
| BLAKE2b-256 |
8c28d450d00701db30308aa99dabe44a2b4ce9280193c5ba0fe812d8c0e010aa
|
File details
Details for the file yafax-1.0.0-py3-none-any.whl.
File metadata
- Download URL: yafax-1.0.0-py3-none-any.whl
- Upload date:
- Size: 23.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8a9c6018eba0cac41fcb88c7564f76ffebd1b61754c23d75abb4cc622ae9e268
|
|
| MD5 |
f2b5d39a121a921e57200c2372c7577e
|
|
| BLAKE2b-256 |
04042aad8a06752180c9e9679dde3c4c331408dd2000da5bff78a64e9662a295
|