taxafasta

Filter UniProt protein FASTA files by NCBI taxonomy.

Overview

taxafasta filters large UniProt FASTA files (Swiss-Prot and/or TrEMBL) to include only proteins from specified NCBI taxonomy subtrees. It can work with local FASTA files or stream directly from UniProt — filtering on the fly without ever saving the full database to disk. It is designed for files at the scale of full UniProt TrEMBL (250M+ entries, hundreds of GB). The tool uses the NCBI taxonomy hierarchy to automatically include all descendants of specified taxonomy IDs and handles merged/deprecated taxonomy IDs transparently.

How It Works

The tool parses NCBI taxonomy dump files (nodes.dmp, merged.dmp) to build a parent→child tree in memory. From user-supplied taxonomy IDs, it pre-computes a flat set of all allowed taxonomy IDs (the specified IDs plus all their descendants), reducing per-entry filtering to an O(1) set-membership check.
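
The pre-computation described above can be sketched roughly as follows. This is an illustrative sketch, not taxafasta's actual code: the function names (parse_nodes, parse_merged, expand) are hypothetical, and it assumes the standard NCBI dump layout in which fields are separated by a tab, a pipe, and a tab. How the tool itself treats merged IDs is not specified here; the sketch only maps a user-supplied old ID to its current one.

```python
from collections import defaultdict, deque
from typing import Iterable


def parse_nodes(lines: Iterable[str]) -> dict[int, list[int]]:
    """Build a parent -> children map from nodes.dmp-style lines."""
    children: dict[int, list[int]] = defaultdict(list)
    for line in lines:
        fields = line.split("|")
        if len(fields) < 2:
            continue
        # int() tolerates the surrounding tabs left over from the split.
        tax_id, parent_id = int(fields[0]), int(fields[1])
        if tax_id != parent_id:  # the root node lists itself as its parent
            children[parent_id].append(tax_id)
    return children


def parse_merged(lines: Iterable[str]) -> dict[int, int]:
    """Map old (merged) taxonomy IDs to current ones from merged.dmp-style lines."""
    merged: dict[int, int] = {}
    for line in lines:
        fields = line.split("|")
        if len(fields) >= 2:
            merged[int(fields[0])] = int(fields[1])
    return merged


def expand(
    roots: Iterable[int],
    children: dict[int, list[int]],
    merged: dict[int, int],
) -> set[int]:
    """Return the given IDs plus all of their descendants (breadth-first)."""
    allowed: set[int] = set()
    queue = deque(merged.get(root, root) for root in roots)
    while queue:
        tax_id = queue.popleft()
        if tax_id in allowed:
            continue
        allowed.add(tax_id)
        queue.extend(children.get(tax_id, ()))
    return allowed
```

Once the set is built, each FASTA entry costs only a single `tax_id in allowed` lookup, regardless of how deep the taxonomy tree is.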

The FASTA file is streamed line-by-line and never loaded into memory. Each entry's OX= field is extracted and checked against the pre-computed set. Matching entries are written to the (gzip-compressed by default) output. A log file is generated for every run recording parameters, taxonomy version, warnings, and summary statistics.
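
A minimal sketch of this streaming filter is shown below. It assumes UniProt-style headers that carry the taxonomy ID as OX=<digits>; the names OX_PATTERN, filter_fasta, and filter_file are hypothetical and not part of taxafasta's API. Entries whose header lacks an OX= field are dropped in this sketch, which may differ from the tool's behavior.

```python
import gzip
import re
from typing import Iterable, Iterator

# Matches the taxonomy ID in a UniProt header, e.g. "... OS=Homo sapiens OX=9606 GN=..."
OX_PATTERN = re.compile(r"\bOX=(\d+)")


def filter_fasta(lines: Iterable[str], allowed: set[int]) -> Iterator[str]:
    """Yield only the lines of entries whose OX= taxonomy ID is in `allowed`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A header starts a new entry, so decide once per entry.
            match = OX_PATTERN.search(line)
            keep = match is not None and int(match.group(1)) in allowed
        if keep:
            yield line


def filter_file(in_path: str, out_path: str, allowed: set[int]) -> None:
    """Stream a gzip-compressed FASTA file through the filter, one line at a time."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(filter_fasta(src, allowed))
```

Because the filter is a generator over lines, memory use stays constant no matter how large the input file is.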

When --input is omitted, TrEMBL and Swiss-Prot are streamed directly from UniProt's FTP server, decompressed on the fly, and filtered without saving the full databases to disk.
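
On-the-fly decompression of a remote gzip stream can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the tool's implementation: iter_gzip_lines and stream_remote_fasta are hypothetical names, and the URL is left as a parameter because the exact endpoint taxafasta uses is not given here.

```python
import gzip
import io
import urllib.request
from typing import BinaryIO, Iterator


def iter_gzip_lines(raw: BinaryIO) -> Iterator[str]:
    """Decompress a gzip byte stream incrementally and yield text lines."""
    # GzipFile pulls compressed bytes from `raw` only as they are needed,
    # so the full archive is never held in memory or written to disk.
    with gzip.GzipFile(fileobj=raw) as gz:
        yield from io.TextIOWrapper(gz, encoding="utf-8")


def stream_remote_fasta(url: str) -> Iterator[str]:
    """Yield decompressed FASTA lines straight from an HTTP(S) response."""
    with urllib.request.urlopen(url) as response:
        yield from iter_gzip_lines(response)
```

The lines produced this way can be fed directly into a per-entry filter, so only the matching entries ever reach the output file.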

Requirements

  • Python 3.10 or newer (or Docker)

Installation

pip

pip install taxafasta

# With recommended performance dependencies:
pip install "taxafasta[all]"

Troubleshooting: If you see an error like:

ERROR: Could not find a version that satisfies the requirement taxafasta (from versions: none)
ERROR: No matching distribution found for taxafasta

This usually means your Python version is too old. Verify with python --version; taxafasta requires Python 3.10 or newer.

Docker

docker pull ghcr.io/mriffle/taxafasta:latest

Quick Start

# Stream from UniProt directly (no local FASTA needed)
taxafasta -t 2 -o bacteria.fasta

# Or filter a local file
taxafasta -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta

# This produces:
#   bacteria.fasta.gz   — gzip-compressed FASTA with only bacterial proteins
#   bacteria.fasta.log  — run log with parameters, warnings, and statistics

Usage

Filter to a single taxonomic group

taxafasta -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta

Filter to multiple groups (bacteria + viruses)

taxafasta -i uniprot_trembl.fasta.gz -t 2 -t 10239 -o bacteria_viruses.fasta

Exclude a subtree (eukaryotes minus mammals)

taxafasta -i uniprot_trembl.fasta.gz -t 2759 -e 40674 -o euk_no_mammals.fasta
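
Conceptually, combining -t and -e amounts to a set difference between two groups of subtrees. The sketch below illustrates that idea on a toy tree with intermediate ranks omitted; allowed_ids and the tree itself are hypothetical and only meant to show the arithmetic, not the tool's internals.

```python
from typing import Iterable


def allowed_ids(
    include: Iterable[int],
    exclude: Iterable[int],
    children: dict[int, list[int]],
) -> set[int]:
    """Return every ID under `include`, minus every ID under `exclude`."""

    def subtree(root: int) -> set[int]:
        # Iterative depth-first walk collecting the root and all descendants.
        seen: set[int] = set()
        stack = [root]
        while stack:
            tax_id = stack.pop()
            if tax_id not in seen:
                seen.add(tax_id)
                stack.extend(children.get(tax_id, ()))
        return seen

    kept = set().union(*(subtree(t) for t in include))
    dropped = set().union(*(subtree(t) for t in exclude))
    return kept - dropped


# Toy tree: Eukaryota -> {Metazoa -> Vertebrata -> Mammalia -> Homo sapiens, Fungi}
TOY_TREE = {2759: [33208, 4751], 33208: [7742], 7742: [40674], 40674: [9606]}
EUK_NO_MAMMALS = allowed_ids([2759], [40674], TOY_TREE)
```

In this toy example, Homo sapiens (9606) is dropped together with Mammalia (40674) because it sits inside the excluded subtree.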

Use pre-downloaded taxonomy files

taxafasta -i uniprot_trembl.fasta.gz -t 2 --taxdump /path/to/taxdump/ -o bacteria.fasta

Uncompressed output

taxafasta -i uniprot_trembl.fasta.gz -t 9606 -o human.fasta --no-gzip

Verbose progress

taxafasta -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta -v

Filter multiple local files (TrEMBL + Swiss-Prot)

taxafasta -i uniprot_trembl.fasta.gz -i uniprot_sprot.fasta.gz -t 2 -o bacteria.fasta

Stream from UniProt directly (no local FASTA needed)

taxafasta -t 2 -o bacteria.fasta

Stream only Swiss-Prot (skip TrEMBL)

taxafasta -t 9606 -o human.fasta --no-trembl

Stream only TrEMBL (skip Swiss-Prot)

taxafasta -t 2 -o bacteria_trembl.fasta --no-swissprot

Network resilience: When streaming, transient network errors (broken pipes, connection resets) are automatically retried up to 5 times with exponential backoff. The download resumes from the exact byte offset via HTTP Range headers, so no data is lost or reprocessed.
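
The retry-and-resume behavior can be sketched as below. This is a hedged illustration rather than taxafasta's code: open_from and resilient_chunks are hypothetical names, the one-second base delay is an assumption, and the sketch counts consecutive failures, which may not match how the tool counts its 5 retries. It also assumes the server honors Range requests; a server that ignores them would resend the file from the start.

```python
import http.client
import time
import urllib.request
from typing import Callable, Iterator


def open_from(url: str, offset: int):
    """Open `url`, asking the server to start at byte `offset`."""
    request = urllib.request.Request(url, headers={"Range": f"bytes={offset}-"})
    return urllib.request.urlopen(request)


def resilient_chunks(
    url: str,
    opener: Callable = open_from,
    max_retries: int = 5,
    base_delay: float = 1.0,  # assumed value, doubled after each failure
    chunk_size: int = 1 << 20,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[bytes]:
    """Yield the body of `url` in chunks, resuming after transient errors."""
    offset = 0    # bytes already delivered to the caller
    failures = 0  # consecutive failures since the last successful read
    while True:
        try:
            with opener(url, offset) as stream:
                while chunk := stream.read(chunk_size):
                    offset += len(chunk)
                    failures = 0
                    yield chunk
            return  # clean end of stream
        except (OSError, http.client.HTTPException):
            failures += 1
            if failures > max_retries:
                raise
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(base_delay * 2 ** (failures - 1))
```

Tracking the offset on the client side is what lets a reconnect pick up exactly where the previous connection stopped, so no bytes are delivered twice.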

Docker Usage

docker run --rm --user "$(id -u):$(id -g)" -v "$PWD:$PWD" -w "$PWD" ghcr.io/mriffle/taxafasta:latest \
  -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta

# With pre-downloaded taxonomy
docker run --rm \
  --user "$(id -u):$(id -g)" \
  -v "$PWD:$PWD" \
  -w "$PWD" \
  ghcr.io/mriffle/taxafasta:latest \
  -i uniprot_trembl.fasta.gz -t 2 --taxdump taxonomy -o bacteria.fasta

Common Taxonomy IDs

Taxonomy ID   Name
2             Bacteria
2157          Archaea
2759          Eukaryota
10239         Viruses
9606          Homo sapiens
7742          Vertebrata
40674         Mammalia
33208         Metazoa
3193          Embryophyta (land plants)
4751          Fungi

NCBI Taxonomy Data

By default, the tool automatically downloads and caches taxdump.tar.gz from NCBI's FTP server on first run. Users can supply pre-downloaded taxonomy files with --taxdump. See: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

Development

git clone https://github.com/mriffle/taxafasta.git
cd taxafasta
pip install -e ".[all,dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=taxafasta

# Run smoke tests (requires network)
pytest -m smoke

# Lint and format
ruff check src/ tests/
ruff format src/ tests/

# Type check
mypy src/

License

Apache 2.0 — see LICENSE for details.

