Filter UniProt FASTA files by NCBI taxonomy

taxafasta

Filter UniProt protein FASTA files by NCBI taxonomy.

Overview

taxafasta filters large UniProt FASTA files (Swiss-Prot and/or TrEMBL) to include only proteins from specified NCBI taxonomy subtrees. It is designed for files at the scale of full UniProt TrEMBL (250M+ entries, hundreds of GB). The tool uses the NCBI taxonomy hierarchy to automatically include all descendants of specified taxonomy IDs and handles merged/deprecated taxonomy IDs transparently.

How It Works

The tool parses NCBI taxonomy dump files (nodes.dmp, merged.dmp) to build a parent→child tree in memory. From user-supplied taxonomy IDs, it pre-computes a flat set of all allowed taxonomy IDs (the specified IDs plus all their descendants), reducing per-entry filtering to an O(1) set-membership check.
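The pre-computation described above can be sketched roughly as follows. This is an illustration only, not taxafasta's actual code: the function names (parse_nodes, parse_merged, allowed_ids) are hypothetical, and the sample lines use made-up taxonomy IDs in the pipe-delimited layout of the NCBI dump files.

```python
from collections import defaultdict, deque


def _fields(line):
    # NCBI .dmp files separate fields with "\t|\t" and end each row with "\t|".
    return [f.strip() for f in line.split("|")]


def parse_nodes(lines):
    """Build a parent -> children map from nodes.dmp rows (tax_id, parent, ...)."""
    children = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        fields = _fields(line)
        tax_id, parent = int(fields[0]), int(fields[1])
        if tax_id != parent:  # the root node lists itself as its parent
            children[parent].append(tax_id)
    return children


def parse_merged(lines):
    """Map old (merged) tax IDs to their current IDs from merged.dmp rows."""
    merged = {}
    for line in lines:
        if not line.strip():
            continue
        fields = _fields(line)
        merged[int(fields[0])] = int(fields[1])
    return merged


def allowed_ids(roots, children, merged):
    """Return the requested IDs plus all descendants as a flat set."""
    allowed = set()
    queue = deque(merged.get(r, r) for r in roots)  # resolve merged IDs first
    while queue:
        node = queue.popleft()
        if node in allowed:
            continue
        allowed.add(node)
        queue.extend(children.get(node, ()))
    return allowed


# Toy data with made-up IDs: 1 is the root, 10 has children 11 and 12.
NODES = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tclade\t|\n",
    "11\t|\t10\t|\tspecies\t|\n",
    "12\t|\t10\t|\tspecies\t|\n",
    "20\t|\t1\t|\tclade\t|\n",
]
MERGED = ["99\t|\t10\t|\n"]  # old ID 99 was merged into 10

children = parse_nodes(NODES)
merged = parse_merged(MERGED)
print(sorted(allowed_ids([99], children, merged)))
```

Once the set is built, deciding whether an entry passes is a single `in` test, regardless of how deep the taxonomy is.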

The FASTA file is streamed line by line and never loaded into memory. Each entry's OX= field is extracted from its header and checked against the pre-computed set. Matching entries are written to the output file, which is gzip-compressed by default. Every run also generates a log file that records the parameters, taxonomy version, warnings, and summary statistics.
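A minimal sketch of this streaming pass, assuming UniProt-style headers that carry an OX=<taxonomy ID> token. The names filter_fasta and OX_RE are hypothetical, and details such as merged-ID resolution, logging, and gzip handling are left out.

```python
import re

# UniProt headers look like:
# >sp|P12345|NAME_HUMAN Protein name OS=Homo sapiens OX=9606 GN=ABC PE=1 SV=1
OX_RE = re.compile(r"\bOX=(\d+)")


def filter_fasta(lines, allowed):
    """Yield only the lines of entries whose OX= taxonomy ID is in `allowed`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A header starts a new entry; decide once and reuse for its sequence lines.
            match = OX_RE.search(line)
            keep = match is not None and int(match.group(1)) in allowed
        if keep:
            yield line


# Two toy entries; only the human one (OX=9606) should survive.
RECORDS = [
    ">sp|P00001|A_HUMAN Example protein OS=Homo sapiens OX=9606 PE=1 SV=1\n",
    "MKTAYIAK\n",
    "QRQISFVK\n",
    ">tr|Q00002|B_ECOLI Example protein OS=Escherichia coli OX=562 PE=4 SV=1\n",
    "MSDNELLK\n",
]
kept = list(filter_fasta(RECORDS, {9606}))
print("".join(kept), end="")
```

Because filter_fasta is a generator over any iterable of lines, the same function could be fed from `gzip.open(path, "rt")` so that the input is decompressed and filtered without ever holding the whole file in memory.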

Installation

pip

pip install taxafasta

# With recommended performance dependencies:
pip install "taxafasta[all]"

Docker

docker pull ghcr.io/mriffle/taxafasta:latest

Quick Start

# Filter TrEMBL to bacteria (taxonomy ID 2). The NCBI taxonomy is downloaded
# automatically on first run unless you provide it with --taxdump.
taxafasta -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta

# This produces:
#   bacteria.fasta.gz   — gzip-compressed FASTA with only bacterial proteins
#   bacteria.fasta.log  — run log with parameters, warnings, and statistics

Usage

Filter to a single taxonomic group

taxafasta -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta

Filter to multiple groups (bacteria + viruses)

taxafasta -i uniprot_trembl.fasta.gz -t 2 10239 -o bacteria_viruses.fasta

Exclude a subtree (eukaryotes minus mammals)

taxafasta -i uniprot_trembl.fasta.gz -t 2759 -e 40674 -o euk_no_mammals.fasta
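Conceptually, an exclusion like this can be thought of as a set difference between two subtrees. The sketch below assumes that -e removes the excluded ID together with all of its descendants; the helper subtree and the heavily abbreviated toy tree (intermediate ranks omitted) are illustrative only and are not taken from taxafasta's source.

```python
def subtree(root, children):
    """Return `root` and all of its descendants in a parent -> children map."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen


# Toy tree using IDs from the "Common Taxonomy IDs" table; the real NCBI
# lineage has many intermediate nodes between these.
toy_children = {
    2759: [33208, 4751],  # Eukaryota -> Metazoa, Fungi
    33208: [7742],        # Metazoa -> Vertebrata
    7742: [40674],        # Vertebrata -> Mammalia
    40674: [9606],        # Mammalia -> Homo sapiens
}

included = subtree(2759, toy_children)
excluded = subtree(40674, toy_children)
euk_no_mammals = included - excluded
print(sorted(euk_no_mammals))
```

In this toy tree the result keeps Eukaryota, Fungi, Metazoa, and Vertebrata, while Mammalia and Homo sapiens are dropped.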

Use pre-downloaded taxonomy files

taxafasta -i uniprot_trembl.fasta.gz -t 2 --taxdump /path/to/taxdump/ -o bacteria.fasta

Uncompressed output

taxafasta -i uniprot_trembl.fasta.gz -t 9606 -o human.fasta --no-gzip

Verbose progress

taxafasta -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta -v

Docker Usage

docker run --rm -v /data:/data ghcr.io/mriffle/taxafasta:latest \
  -i /data/uniprot_trembl.fasta.gz -t 2 -o /data/bacteria.fasta

# With pre-downloaded taxonomy
docker run --rm \
  -v /data:/data \
  -v /taxonomy:/taxonomy:ro \
  ghcr.io/mriffle/taxafasta:latest \
  -i /data/uniprot_trembl.fasta.gz -t 2 --taxdump /taxonomy -o /data/bacteria.fasta

Common Taxonomy IDs

Taxonomy ID   Name
2             Bacteria
2157          Archaea
2759          Eukaryota
10239         Viruses
9606          Homo sapiens
7742          Vertebrata
40674         Mammalia
33208         Metazoa
3193          Embryophyta (land plants)
4751          Fungi

NCBI Taxonomy Data

By default, the tool automatically downloads and caches taxdump.tar.gz from NCBI's FTP server on first run. Users can supply pre-downloaded taxonomy files with --taxdump. See: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
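If you download taxdump.tar.gz yourself, the two dump files the tool parses can be pulled out of the archive with a few lines of Python. This is a sketch under assumptions: unpack_taxdump is a hypothetical helper, it assumes the archive stores nodes.dmp and merged.dmp at its top level, and whether --taxdump expects a directory of extracted files in exactly this layout should be checked against the tool's help output.

```python
import shutil
import tarfile
from pathlib import Path


def unpack_taxdump(archive, dest, names=("nodes.dmp", "merged.dmp")):
    """Copy the named members of a taxdump .tar.gz archive into `dest`."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        for name in names:
            src = tar.extractfile(name)  # raises KeyError if the member is missing
            if src is None:
                raise ValueError(f"{name} is not a regular file in {archive}")
            # Stream to disk so large dump files are not read into memory at once.
            with src, open(dest / name, "wb") as out:
                shutil.copyfileobj(src, out)
    return dest
```

For example, `unpack_taxdump("taxdump.tar.gz", "/path/to/taxdump")` would leave nodes.dmp and merged.dmp in /path/to/taxdump.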

Development

git clone https://github.com/mriffle/taxafasta.git
cd taxafasta
pip install -e ".[all,dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=taxafasta

# Lint and format
ruff check src/ tests/
ruff format src/ tests/

# Type check
mypy src/

License

Apache 2.0 — see LICENSE for details.

Download files

Source Distribution

taxafasta-1.0.0.tar.gz (43.2 kB)

Uploaded Source

Built Distribution

taxafasta-1.0.0-py3-none-any.whl (17.7 kB)

Uploaded Python 3

File details

Details for the file taxafasta-1.0.0.tar.gz.

File metadata

  • Download URL: taxafasta-1.0.0.tar.gz
  • Size: 43.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for taxafasta-1.0.0.tar.gz
Algorithm Hash digest
SHA256 56ab9ad989e7e5c1844376681e669d74dc0751dbd8369d151daa118fa5bea1ba
MD5 d1cdafa429eddf0c4f5be6afd152539c
BLAKE2b-256 5dc728796f9bdd7fd3bffcf95b11af97cb9931c83d7a5c447a2c9f9f1dce98c4

Provenance

The following attestation bundles were made for taxafasta-1.0.0.tar.gz:

Publisher: release.yml on mriffle/taxafasta

File details

Details for the file taxafasta-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: taxafasta-1.0.0-py3-none-any.whl
  • Size: 17.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for taxafasta-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a1eccbc2f9fa6bf067274ff4fc596207542dec346d84d836cd0f5f2edba6e571
MD5 2f69d0377221f921ba679c0bc7d8cb03
BLAKE2b-256 d3286c9c749e71cd0e432d5bf9ec3b1d694240eb8f99cfbce362e3433b19a442

Provenance

The following attestation bundles were made for taxafasta-1.0.0-py3-none-any.whl:

Publisher: release.yml on mriffle/taxafasta
