Filter UniProt FASTA files by NCBI taxonomy
taxafasta
Filter UniProt protein FASTA files by NCBI taxonomy.
Overview
taxafasta filters large UniProt FASTA files (Swiss-Prot and/or TrEMBL) to include only proteins from specified NCBI taxonomy subtrees. It can work with local FASTA files or stream directly from UniProt — filtering on the fly without ever saving the full database to disk. It is designed for files at the scale of full UniProt TrEMBL (250M+ entries, hundreds of GB). The tool uses the NCBI taxonomy hierarchy to automatically include all descendants of specified taxonomy IDs and handles merged/deprecated taxonomy IDs transparently.
How It Works
The tool parses NCBI taxonomy dump files (nodes.dmp, merged.dmp) to build a parent→child tree in memory. From user-supplied taxonomy IDs, it pre-computes a flat set of all allowed taxonomy IDs (the specified IDs plus all their descendants), reducing per-entry filtering to an O(1) set-membership check.
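The pre-computation described above can be sketched roughly as follows. This is a minimal illustration, not taxafasta's actual code: the function names (`parse_dmp_pairs`, `build_children`, `allowed_ids`) are hypothetical, the toy hierarchy is collapsed and the ID 99999 is invented, and it assumes merged.dmp is used to map an old taxonomy ID to its current one before the tree is walked.

```python
from collections import defaultdict, deque
from typing import Iterable


def parse_dmp_pairs(lines: Iterable[str]) -> list[tuple[int, int]]:
    """Return the first two integer columns of an NCBI .dmp file.

    nodes.dmp rows start with (tax_id, parent_tax_id); merged.dmp rows
    are (old_tax_id, new_tax_id). Columns are separated by "|".
    """
    pairs = []
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 2 and fields[0]:
            pairs.append((int(fields[0]), int(fields[1])))
    return pairs


def build_children(node_pairs: Iterable[tuple[int, int]]) -> dict[int, list[int]]:
    """Build a parent -> children map from (tax_id, parent_id) pairs."""
    children = defaultdict(list)
    for tax_id, parent_id in node_pairs:
        if tax_id != parent_id:  # the root (1) lists itself as its parent
            children[parent_id].append(tax_id)
    return children


def allowed_ids(roots: Iterable[int], children: dict[int, list[int]],
                merged: dict[int, int]) -> set[int]:
    """Collect each root (after resolving merged IDs) plus all descendants."""
    allowed: set[int] = set()
    queue = deque(merged.get(r, r) for r in roots)
    while queue:
        tax_id = queue.popleft()
        if tax_id in allowed:
            continue
        allowed.add(tax_id)
        queue.extend(children.get(tax_id, ()))
    return allowed


# Toy data in .dmp layout; the hierarchy is simplified for illustration.
nodes_dmp = [
    "1\t|\t1\t|\tno rank\t|\n",
    "2\t|\t1\t|\tsuperkingdom\t|\n",
    "1224\t|\t2\t|\tphylum\t|\n",
    "562\t|\t1224\t|\tspecies\t|\n",
    "9606\t|\t1\t|\tspecies\t|\n",
]
merged_dmp = ["99999\t|\t562\t|\n"]  # invented old ID pointing at 562

children = build_children(parse_dmp_pairs(nodes_dmp))
merged = dict(parse_dmp_pairs(merged_dmp))
bacteria = allowed_ids([2], children, merged)
print(sorted(bacteria))  # [2, 562, 1224]
```

Once the set exists, deciding whether to keep an entry is a single `tax_id in bacteria` lookup, regardless of how deep the taxonomy tree is.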
The FASTA file is streamed line-by-line and never loaded into memory. Each entry's OX= field is extracted and checked against the pre-computed set. Matching entries are written to the (gzip-compressed by default) output. A log file is generated for every run recording parameters, taxonomy version, warnings, and summary statistics.
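A minimal sketch of this streaming filter is shown below. It is an illustration under stated assumptions rather than the tool's implementation: `filter_fasta` and `OX_PATTERN` are hypothetical names, the sequences are placeholders, and dropping entries whose header has no OX= field is a choice made for this sketch only.

```python
import re
from typing import Iterable, Iterator

# UniProt FASTA headers carry the organism's taxonomy ID as "OX=<digits>".
OX_PATTERN = re.compile(r"\bOX=(\d+)")


def filter_fasta(lines: Iterable[str], allowed: set[int]) -> Iterator[str]:
    """Yield only the lines of entries whose OX= ID is in `allowed`.

    Lines are consumed one at a time, so memory use does not grow with
    the size of the input.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            match = OX_PATTERN.search(line)
            # Sketch-only choice: headers without OX= are dropped.
            keep = match is not None and int(match.group(1)) in allowed
        if keep:
            yield line


# Toy input: two entries with placeholder sequences.
records = [
    ">sp|P0A7G6|RECA_ECOLI Protein RecA OS=Escherichia coli (strain K12) OX=83333 GN=recA PE=1 SV=2\n",
    "MAAAA\n",
    ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2\n",
    "MCCCC\n",
]
kept = list(filter_fasta(records, {83333}))
print("".join(kept), end="")
```

Because `filter_fasta` accepts any iterable of lines, the same function works on an open file handle, for example one returned by `gzip.open(path, "rt")`.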
When --input is omitted, TrEMBL and Swiss-Prot are streamed directly from UniProt's FTP server, decompressed on the fly, and filtered without saving the full databases to disk.
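Decompressing on the fly can be sketched with the standard library alone. The helper name `iter_gzip_lines` is hypothetical and the README does not say how taxafasta implements this internally; the sketch only shows that a gzip stream can be read line by line from any binary file-like object, such as the response returned by `urllib.request.urlopen`, without first writing it to disk.

```python
import gzip
import io
from typing import BinaryIO, Iterator


def iter_gzip_lines(raw: BinaryIO) -> Iterator[str]:
    """Yield decoded text lines from a gzip-compressed byte stream.

    Data is decompressed incrementally as it is read, so the full
    payload is never held in memory or saved to disk.
    """
    with gzip.GzipFile(fileobj=raw, mode="rb") as gz:
        for line in io.TextIOWrapper(gz, encoding="utf-8"):
            yield line


# In-memory stand-in for a network response (placeholder sequence).
payload = gzip.compress(b">sp|P69905|HBA_HUMAN OS=Homo sapiens OX=9606\nMCCCC\n")
lines = list(iter_gzip_lines(io.BytesIO(payload)))
print(lines)
```

With a real download, `io.BytesIO(payload)` would be replaced by the open response object, and the yielded lines could be passed straight into a filtering step.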
Requirements
- Python 3.10 or newer (or Docker)
Installation
pip
pip install taxafasta
# With recommended performance dependencies:
pip install "taxafasta[all]"
Troubleshooting: If you see an error like:
ERROR: Could not find a version that satisfies the requirement taxafasta (from versions: none)
ERROR: No matching distribution found for taxafasta
Your Python version is likely too old. Verify with python --version; taxafasta requires Python 3.10+.
Docker
docker pull ghcr.io/mriffle/taxafasta:latest
Quick Start
# Stream from UniProt directly (no local FASTA needed)
taxafasta -t 2 -o bacteria.fasta
# Or filter a local file
taxafasta -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta
# This produces:
# bacteria.fasta.gz — gzip-compressed FASTA with only bacterial proteins
# bacteria.fasta.log — run log with parameters, warnings, and statistics
Usage
Filter to a single taxonomic group
taxafasta -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta
Filter to multiple groups (bacteria + viruses)
taxafasta -i uniprot_trembl.fasta.gz -t 2 -t 10239 -o bacteria_viruses.fasta
Exclude a subtree (eukaryotes minus mammals)
taxafasta -i uniprot_trembl.fasta.gz -t 2759 -e 40674 -o euk_no_mammals.fasta
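One way to picture the exclusion is as a set difference between two subtrees. The sketch below assumes that -e removes the excluded ID together with all of its descendants; the `subtree` helper is hypothetical, and the toy tree collapses the many intermediate ranks between these groups.

```python
def subtree(root: int, children: dict[int, list[int]]) -> set[int]:
    """Return `root` plus every ID reachable below it."""
    seen: set[int] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen


# Collapsed toy tree: Eukaryota > Metazoa > Vertebrata > Mammalia > Homo sapiens,
# with Fungi as a second branch under Eukaryota.
children = {
    2759: [33208, 4751],
    33208: [7742],
    7742: [40674],
    40674: [9606],
}

allowed = subtree(2759, children) - subtree(40674, children)
print(sorted(allowed))  # [2759, 4751, 7742, 33208]
```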
Use pre-downloaded taxonomy files
taxafasta -i uniprot_trembl.fasta.gz -t 2 --taxdump /path/to/taxdump/ -o bacteria.fasta
Uncompressed output
taxafasta -i uniprot_trembl.fasta.gz -t 9606 -o human.fasta --no-gzip
Verbose progress
taxafasta -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta -v
Filter multiple local files (TrEMBL + Swiss-Prot)
taxafasta -i uniprot_trembl.fasta.gz -i uniprot_sprot.fasta.gz -t 2 -o bacteria.fasta
Stream from UniProt directly (no local FASTA needed)
taxafasta -t 2 -o bacteria.fasta
Stream only Swiss-Prot (skip TrEMBL)
taxafasta -t 9606 -o human.fasta --no-trembl
Stream only TrEMBL (skip Swiss-Prot)
taxafasta -t 2 -o bacteria_trembl.fasta --no-swissprot
Docker Usage
docker run --rm --user "$(id -u):$(id -g)" -v "$PWD:$PWD" -w "$PWD" ghcr.io/mriffle/taxafasta:latest \
-i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta
# With pre-downloaded taxonomy
docker run --rm \
--user "$(id -u):$(id -g)" \
-v "$PWD:$PWD" \
-w "$PWD" \
ghcr.io/mriffle/taxafasta:latest \
-i uniprot_trembl.fasta.gz -t 2 --taxdump taxonomy -o bacteria.fasta
Common Taxonomy IDs
| Taxonomy ID | Name |
|---|---|
| 2 | Bacteria |
| 2157 | Archaea |
| 2759 | Eukaryota |
| 10239 | Viruses |
| 9606 | Homo sapiens |
| 7742 | Vertebrata |
| 40674 | Mammalia |
| 33208 | Metazoa |
| 3193 | Embryophyta (land plants) |
| 4751 | Fungi |
NCBI Taxonomy Data
By default, the tool automatically downloads and caches taxdump.tar.gz from NCBI's FTP server on first run. Users can supply pre-downloaded taxonomy files with --taxdump. See: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
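For readers who want to inspect the archive themselves, the following sketch reads one member of a taxdump-style tar.gz without unpacking it to disk. The helper name `iter_member_lines` is hypothetical, the archive here is a tiny in-memory stand-in, and the member name nodes.dmp at the archive root is an assumption about the real archive layout.

```python
import io
import tarfile
from typing import BinaryIO, Iterator


def iter_member_lines(archive: BinaryIO, name: str) -> Iterator[str]:
    """Yield decoded lines of one file inside a gzip-compressed tar archive."""
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        member = tar.extractfile(name)  # raises KeyError if the name is absent
        if member is None:
            raise FileNotFoundError(f"{name} is not a regular file")
        for raw in member:
            yield raw.decode("utf-8")


# Build a tiny stand-in archive in memory with a two-row nodes.dmp.
data = b"1\t|\t1\t|\n2\t|\t1\t|\n"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo("nodes.dmp")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

rows = list(iter_member_lines(buf, "nodes.dmp"))
print(rows)
```

With a downloaded archive, `buf` would be replaced by `open("taxdump.tar.gz", "rb")`.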
Development
git clone https://github.com/mriffle/taxafasta.git
cd taxafasta
pip install -e ".[all,dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=taxafasta
# Run smoke tests (requires network)
pytest -m smoke
# Lint and format
ruff check src/ tests/
ruff format src/ tests/
# Type check
mypy src/
License
Apache 2.0 — see LICENSE for details.