Filter UniProt FASTA files by NCBI taxonomy

taxafasta

Filter UniProt protein FASTA files by NCBI taxonomy.

Overview

taxafasta filters large UniProt FASTA files (Swiss-Prot and/or TrEMBL) to include only proteins from specified NCBI taxonomy subtrees. It can work with local FASTA files or stream directly from UniProt — filtering on the fly without ever saving the full database to disk. It is designed for files at the scale of full UniProt TrEMBL (250M+ entries, hundreds of GB). The tool uses the NCBI taxonomy hierarchy to automatically include all descendants of specified taxonomy IDs and handles merged/deprecated taxonomy IDs transparently.

How It Works

The tool parses NCBI taxonomy dump files (nodes.dmp, merged.dmp) to build a parent→child tree in memory. From user-supplied taxonomy IDs, it pre-computes a flat set of all allowed taxonomy IDs (the specified IDs plus all their descendants), reducing per-entry filtering to an O(1) set-membership check.
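The pre-computation described above can be sketched in a few functions. This is a minimal illustration, not taxafasta's actual code: the function names (`split_dmp`, `build_children`, `load_merged`, `allowed_ids`) are hypothetical, and resolving merged IDs only for the user-supplied IDs is an assumption about how "transparent" handling might work.

```python
from collections import defaultdict
from typing import Iterable


def split_dmp(line: str) -> list[str]:
    """Split one NCBI .dmp row; fields are delimited by tab-pipe-tab."""
    return line.rstrip("\t|\n").split("\t|\t")


def build_children(nodes_lines: Iterable[str]) -> dict[int, list[int]]:
    """Map each parent taxonomy ID to its direct children (nodes.dmp rows)."""
    children: dict[int, list[int]] = defaultdict(list)
    for line in nodes_lines:
        fields = split_dmp(line)
        tax_id, parent_id = int(fields[0]), int(fields[1])
        if tax_id != parent_id:  # the root node lists itself as its parent
            children[parent_id].append(tax_id)
    return children


def load_merged(merged_lines: Iterable[str]) -> dict[int, int]:
    """Map each retired taxonomy ID to its replacement (merged.dmp rows)."""
    merged: dict[int, int] = {}
    for line in merged_lines:
        old_id, new_id = split_dmp(line)[:2]
        merged[int(old_id)] = int(new_id)
    return merged


def allowed_ids(
    children: dict[int, list[int]],
    merged: dict[int, int],
    requested: Iterable[int],
) -> set[int]:
    """Return the requested IDs plus all of their descendants as a flat set."""
    allowed: set[int] = set()
    # Assumption: retired IDs are mapped to their current ID before traversal.
    stack = [merged.get(tax_id, tax_id) for tax_id in requested]
    while stack:  # iterative walk avoids recursion limits on deep lineages
        tax_id = stack.pop()
        if tax_id not in allowed:
            allowed.add(tax_id)
            stack.extend(children.get(tax_id, ()))
    return allowed
```

Once the set exists, deciding whether an entry passes is a single `tax_id in allowed` lookup, independent of how deep the taxonomy is.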

The FASTA file is streamed line-by-line and never loaded into memory. Each entry's OX= field is extracted and checked against the pre-computed set. Matching entries are written to the output file, which is gzip-compressed by default. Every run also produces a log file recording parameters, taxonomy version, warnings, and summary statistics.
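The streaming filter can be illustrated with a short generator. This is a sketch under stated assumptions rather than the tool's implementation: `filter_fasta`, `filter_file`, and the `OX_FIELD` pattern are hypothetical names, and entries whose header has no OX= field are simply dropped here.

```python
import gzip
import re
from typing import Iterable, Iterator

# UniProt headers carry the NCBI taxonomy ID as "OX=<digits>".
OX_FIELD = re.compile(r"\bOX=(\d+)")


def filter_fasta(lines: Iterable[str], allowed: set[int]) -> Iterator[str]:
    """Yield only the lines of entries whose OX= taxonomy ID is in `allowed`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A header starts a new entry; decide once, then reuse for its
            # sequence lines.
            match = OX_FIELD.search(line)
            keep = bool(match) and int(match.group(1)) in allowed
        if keep:
            yield line


def filter_file(in_path: str, out_path: str, allowed: set[int]) -> None:
    """Filter a gzip-compressed FASTA file into a gzip-compressed output."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(filter_fasta(src, allowed))
```

Because both functions consume and produce lines lazily, memory use stays flat regardless of input size.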

When --input is omitted, TrEMBL and Swiss-Prot are streamed directly from UniProt's FTP server, decompressed on the fly, and filtered without saving the full databases to disk.
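On-the-fly decompression of a remote stream might look like the following sketch. It uses only the standard library; the function names are hypothetical, the URL is left to the caller, and nothing here is confirmed to match how taxafasta performs its download.

```python
import gzip
import urllib.request
from typing import BinaryIO, Iterator


def iter_gzip_lines(raw: BinaryIO) -> Iterator[str]:
    """Yield decoded text lines from a gzip-compressed binary stream."""
    # gzip.open accepts an existing file object and decompresses as it reads.
    with gzip.open(raw, "rt", encoding="utf-8") as text:
        yield from text


def iter_remote_lines(url: str) -> Iterator[str]:
    """Stream a remote .gz file line by line without saving it to disk."""
    with urllib.request.urlopen(url) as response:
        yield from iter_gzip_lines(response)
```

The lines produced by `iter_remote_lines` can be passed to any line-based filter, so only matching entries ever reach the disk.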

Requirements

  • Python 3.10 or newer (or Docker)

Installation

pip

pip install taxafasta

# With recommended performance dependencies:
pip install "taxafasta[all]"

Troubleshooting: If you see an error like:

ERROR: Could not find a version that satisfies the requirement taxafasta (from versions: none)
ERROR: No matching distribution found for taxafasta

Your Python version is likely too old. Verify with python --version — taxafasta requires Python 3.10+.

Docker

docker pull ghcr.io/mriffle/taxafasta:latest

Quick Start

# Stream from UniProt directly (no local FASTA needed)
taxafasta -t 2 -o bacteria.fasta

# Or filter a local file
taxafasta -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta

# This produces:
#   bacteria.fasta.gz   — gzip-compressed FASTA with only bacterial proteins
#   bacteria.fasta.log  — run log with parameters, warnings, and statistics

Usage

Filter to a single taxonomic group

taxafasta -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta

Filter to multiple groups (bacteria + viruses)

taxafasta -i uniprot_trembl.fasta.gz -t 2 -t 10239 -o bacteria_viruses.fasta

Exclude a subtree (eukaryotes minus mammals)

taxafasta -i uniprot_trembl.fasta.gz -t 2759 -e 40674 -o euk_no_mammals.fasta
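One plausible way to model -t together with -e is a set difference between the included and excluded subtrees. The sketch below assumes that an excluded ID removes itself and all of its descendants; `subtree` and `resolve_allowed` are hypothetical names, and the toy hierarchy omits most intermediate ranks.

```python
from typing import Iterable


def subtree(children: dict[int, list[int]], root: int) -> set[int]:
    """Return `root` and every taxonomy ID below it."""
    seen: set[int] = set()
    stack = [root]
    while stack:
        tax_id = stack.pop()
        if tax_id not in seen:
            seen.add(tax_id)
            stack.extend(children.get(tax_id, ()))
    return seen


def resolve_allowed(
    children: dict[int, list[int]],
    include: Iterable[int],
    exclude: Iterable[int],
) -> set[int]:
    """Union of included subtrees minus the union of excluded subtrees."""
    included = set().union(*(subtree(children, t) for t in include))
    excluded = set().union(*(subtree(children, t) for t in exclude))
    return included - excluded


# Toy hierarchy: Eukaryota > Metazoa > Vertebrata > Mammalia > Homo sapiens,
# plus Fungi directly under Eukaryota.
TOY_CHILDREN = {2759: [33208, 4751], 33208: [7742], 7742: [40674], 40674: [9606]}
```

With this toy tree, `resolve_allowed(TOY_CHILDREN, [2759], [40674])` keeps Eukaryota, Metazoa, Vertebrata, and Fungi while dropping Mammalia and Homo sapiens.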

Use pre-downloaded taxonomy files

taxafasta -i uniprot_trembl.fasta.gz -t 2 --taxdump /path/to/taxdump/ -o bacteria.fasta

Uncompressed output

taxafasta -i uniprot_trembl.fasta.gz -t 9606 -o human.fasta --no-gzip

Verbose progress

taxafasta -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta -v

Stream from UniProt directly (no local FASTA needed)

taxafasta -t 2 -o bacteria.fasta

Stream only Swiss-Prot (skip TrEMBL)

taxafasta -t 9606 -o human.fasta --no-trembl

Stream only TrEMBL (skip Swiss-Prot)

taxafasta -t 2 -o bacteria_trembl.fasta --no-swissprot

Docker Usage

docker run --rm --user "$(id -u):$(id -g)" -v "$PWD:$PWD" -w "$PWD" ghcr.io/mriffle/taxafasta:latest \
  -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta

# With pre-downloaded taxonomy
docker run --rm \
  --user "$(id -u):$(id -g)" \
  -v "$PWD:$PWD" \
  -w "$PWD" \
  ghcr.io/mriffle/taxafasta:latest \
  -i uniprot_trembl.fasta.gz -t 2 --taxdump taxonomy -o bacteria.fasta

Common Taxonomy IDs

Taxonomy ID   Name
2             Bacteria
2157          Archaea
2759          Eukaryota
10239         Viruses
9606          Homo sapiens
7742          Vertebrata
40674         Mammalia
33208         Metazoa
3193          Embryophyta (land plants)
4751          Fungi

NCBI Taxonomy Data

By default, the tool automatically downloads and caches taxdump.tar.gz from NCBI's FTP server on first run. Users can supply pre-downloaded taxonomy files with --taxdump. See: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
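For readers preparing their own taxonomy files, the two tables the tool relies on can be pulled out of a taxdump.tar.gz archive with the standard library. This is a sketch only: `read_taxdump` is a hypothetical helper, and it assumes nodes.dmp and merged.dmp sit at the top level of the archive.

```python
import tarfile
from typing import BinaryIO


def read_taxdump(
    archive: BinaryIO,
    wanted: tuple[str, ...] = ("nodes.dmp", "merged.dmp"),
) -> dict[str, list[str]]:
    """Read the requested members of a taxdump.tar.gz into lists of lines."""
    tables: dict[str, list[str]] = {}
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for name in wanted:
            member = tar.extractfile(name)  # raises KeyError if absent
            if member is None:
                raise ValueError(f"{name} is not a regular file in the archive")
            with member:
                text = member.read().decode("utf-8")
            tables[name] = text.splitlines(keepends=True)
    return tables
```

A typical call would open the archive in binary mode, for example `with open("taxdump.tar.gz", "rb") as fh: tables = read_taxdump(fh)`, and then feed `tables["nodes.dmp"]` to whatever builds the taxonomy tree.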

Development

git clone https://github.com/mriffle/taxafasta.git
cd taxafasta
pip install -e ".[all,dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=taxafasta

# Run smoke tests (requires network)
pytest -m smoke

# Lint and format
ruff check src/ tests/
ruff format src/ tests/

# Type check
mypy src/

License

Apache 2.0 — see LICENSE for details.
