taxafasta

Filter UniProt protein FASTA files by NCBI taxonomy.

Overview

taxafasta filters large UniProt FASTA files (Swiss-Prot and/or TrEMBL) to include only proteins from specified NCBI taxonomy subtrees. It can work with local FASTA files or stream directly from UniProt — filtering on the fly without ever saving the full database to disk. It is designed for files at the scale of full UniProt TrEMBL (250M+ entries, hundreds of GB). The tool uses the NCBI taxonomy hierarchy to automatically include all descendants of specified taxonomy IDs and handles merged/deprecated taxonomy IDs transparently.

How It Works

The tool parses NCBI taxonomy dump files (nodes.dmp, merged.dmp) to build a parent→child tree in memory. From user-supplied taxonomy IDs, it pre-computes a flat set of all allowed taxonomy IDs (the specified IDs plus all their descendants), reducing per-entry filtering to an O(1) set-membership check.
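The pre-computation described above can be sketched roughly as follows. This is a minimal illustration, not taxafasta's actual code: the function names (`parse_nodes`, `parse_merged`, `expand`) are hypothetical and the sample rows are toy data. It relies only on the documented taxdump layout, in which fields are separated by `|` and the first two columns of nodes.dmp are the taxonomy ID and its parent ID.

```python
from collections import defaultdict
from typing import Iterable


def parse_nodes(lines: Iterable[str]) -> dict[int, list[int]]:
    """Build a parent -> children map from nodes.dmp lines."""
    children: dict[int, list[int]] = defaultdict(list)
    for line in lines:
        fields = line.split("|")
        tax_id, parent_id = int(fields[0]), int(fields[1])
        if tax_id != parent_id:  # the root node lists itself as its parent
            children[parent_id].append(tax_id)
    return children


def parse_merged(lines: Iterable[str]) -> dict[int, int]:
    """Map old (merged) taxonomy IDs to their current IDs from merged.dmp lines."""
    merged: dict[int, int] = {}
    for line in lines:
        fields = line.split("|")
        merged[int(fields[0])] = int(fields[1])
    return merged


def expand(roots: Iterable[int], children: dict[int, list[int]],
           merged: dict[int, int]) -> set[int]:
    """Return the given IDs (after resolving merges) plus all their descendants."""
    allowed: set[int] = set()
    stack = [merged.get(r, r) for r in roots]
    while stack:
        node = stack.pop()
        if node in allowed:
            continue
        allowed.add(node)
        stack.extend(children.get(node, ()))
    return allowed


# Toy rows in the taxdump layout (not real NCBI data).
NODES = [
    "1\t|\t1\t|\tno rank\t|\n",
    "2\t|\t1\t|\tdomain\t|\n",
    "1224\t|\t2\t|\tphylum\t|\n",
    "2759\t|\t1\t|\tdomain\t|\n",
]
MERGED = ["999999\t|\t1224\t|\n"]

tree = parse_nodes(NODES)
old_to_new = parse_merged(MERGED)
allowed = expand([2], tree, old_to_new)
```

Once `allowed` is built, each FASTA entry needs only an `in allowed` test, which is what makes the per-entry cost constant regardless of how deep the requested subtree is.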

The FASTA file is streamed line-by-line and never loaded into memory. Each entry's OX= field is extracted and checked against the pre-computed set. Matching entries are written to the (gzip-compressed by default) output. A log file is generated for every run recording parameters, taxonomy version, warnings, and summary statistics.
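A streaming filter of this kind might look like the sketch below. The function name `filter_fasta` is hypothetical, the headers and sequences are toy data in UniProt's header layout, and dropping entries that lack an OX= field is an assumption of this sketch rather than documented taxafasta behavior.

```python
import re
from typing import Iterable, Iterator

# UniProt headers carry the organism's NCBI taxonomy ID as OX=<digits>.
OX_RE = re.compile(r"\bOX=(\d+)")


def filter_fasta(lines: Iterable[str], allowed: set[int]) -> Iterator[str]:
    """Yield header and sequence lines of entries whose OX= ID is in `allowed`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            match = OX_RE.search(line)
            # Assumption: entries without an OX= field are dropped.
            keep = match is not None and int(match.group(1)) in allowed
        if keep:
            yield line


# Toy entries (made-up accessions and sequences).
sample = [
    ">tr|X00001|X00001_ECOLI Toy protein OS=Escherichia coli OX=562 PE=4 SV=1\n",
    "MKTAYIAK\n",
    ">sp|X00002|TOY_HUMAN Toy protein OS=Homo sapiens OX=9606 GN=TOY PE=1 SV=1\n",
    "MEEPQSDP\n",
    "SVEPPLSQ\n",
]
kept = list(filter_fasta(sample, {9606}))
```

Because the function is a generator over lines, only one line is held at a time; the caller can pass an open file handle and write the yielded lines to a handle opened with `gzip.open(path, "wt")`.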

When --input is omitted, TrEMBL and Swiss-Prot are streamed directly from UniProt's FTP server, decompressed on the fly, and filtered without saving the full databases to disk.
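On-the-fly decompression of a remote file can be done with the standard library alone, as in this sketch. The helper names are hypothetical, and the URL is an assumption about where UniProt publishes the current Swiss-Prot release; taxafasta may use a different endpoint or transport.

```python
import gzip
import io
import urllib.request
from typing import BinaryIO, Iterator

# Assumed location of the current Swiss-Prot FASTA; check UniProt's download page.
SPROT_URL = (
    "https://ftp.uniprot.org/pub/databases/uniprot/current_release/"
    "knowledgebase/complete/uniprot_sprot.fasta.gz"
)


def iter_gzip_lines(raw: BinaryIO) -> Iterator[str]:
    """Decompress a gzip byte stream incrementally and yield text lines."""
    with gzip.GzipFile(fileobj=raw) as gz:
        yield from io.TextIOWrapper(gz, encoding="utf-8")


def stream_remote(url: str) -> Iterator[str]:
    """Yield decompressed lines from a remote .gz file without saving it to disk."""
    with urllib.request.urlopen(url) as response:
        yield from iter_gzip_lines(response)
```

Calling `stream_remote(SPROT_URL)` would start the download lazily, so lines can be fed to a filter as they arrive and discarded immediately afterwards.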

Requirements

  • Python 3.10 or newer (or Docker)

Installation

pip

pip install taxafasta

# With recommended performance dependencies:
pip install "taxafasta[all]"

Troubleshooting: If you see an error like:

ERROR: Could not find a version that satisfies the requirement taxafasta (from versions: none)
ERROR: No matching distribution found for taxafasta

Your Python version is likely too old. Verify with python --version — taxafasta requires Python 3.10+.

Docker

docker pull ghcr.io/mriffle/taxafasta:latest

Quick Start

# Stream from UniProt directly (no local FASTA needed)
taxafasta -t 2 -o bacteria.fasta

# Or filter a local file
taxafasta -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta

# This produces:
#   bacteria.fasta.gz   — gzip-compressed FASTA with only bacterial proteins
#   bacteria.fasta.log  — run log with parameters, warnings, and statistics

Usage

Filter to a single taxonomic group

taxafasta -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta

Filter to multiple groups (bacteria + viruses)

taxafasta -i uniprot_trembl.fasta.gz -t 2 -t 10239 -o bacteria_viruses.fasta

Exclude a subtree (eukaryotes minus mammals)

taxafasta -i uniprot_trembl.fasta.gz -t 2759 -e 40674 -o euk_no_mammals.fasta
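Conceptually, an exclusion like this can be modeled as a set difference between two expanded subtrees. The sketch below assumes that -e removes the excluded ID together with all of its descendants; the function names and the toy tree are illustrative only.

```python
from typing import Iterable


def descendants(root: int, children: dict[int, list[int]]) -> set[int]:
    """Return `root` plus every node below it in the parent -> children map."""
    seen: set[int] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen


def allowed_ids(include: Iterable[int], exclude: Iterable[int],
                children: dict[int, list[int]]) -> set[int]:
    """Union of the included subtrees minus the union of the excluded subtrees."""
    included: set[int] = set()
    for root in include:
        included |= descendants(root, children)
    excluded: set[int] = set()
    for root in exclude:
        excluded |= descendants(root, children)
    return included - excluded


# Toy tree: Eukaryota -> Metazoa -> Vertebrata -> Mammalia -> Homo sapiens,
# with Fungi as a second child of Eukaryota (intermediate ranks omitted).
toy_children = {
    2759: [33208, 4751],
    33208: [7742],
    7742: [40674],
    40674: [9606],
}
euk_no_mammals = allowed_ids([2759], [40674], toy_children)
```

In this toy tree the result keeps Eukaryota, Metazoa, Vertebrata, and Fungi while dropping Mammalia and Homo sapiens.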

Use pre-downloaded taxonomy files

taxafasta -i uniprot_trembl.fasta.gz -t 2 --taxdump /path/to/taxdump/ -o bacteria.fasta

Uncompressed output

taxafasta -i uniprot_trembl.fasta.gz -t 9606 -o human.fasta --no-gzip

Verbose progress

taxafasta -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta -v

Filter multiple local files (TrEMBL + Swiss-Prot)

taxafasta -i uniprot_trembl.fasta.gz -i uniprot_sprot.fasta.gz -t 2 -o bacteria.fasta

Stream from UniProt directly (no local FASTA needed)

taxafasta -t 2 -o bacteria.fasta

Stream only Swiss-Prot (skip TrEMBL)

taxafasta -t 9606 -o human.fasta --no-trembl

Stream only TrEMBL (skip Swiss-Prot)

taxafasta -t 2 -o bacteria_trembl.fasta --no-swissprot

Docker Usage

docker run --rm --user "$(id -u):$(id -g)" -v "$PWD:$PWD" -w "$PWD" ghcr.io/mriffle/taxafasta:latest \
  -i uniprot_trembl.fasta.gz -t 2 -o bacteria.fasta

# With pre-downloaded taxonomy
docker run --rm \
  --user "$(id -u):$(id -g)" \
  -v "$PWD:$PWD" \
  -w "$PWD" \
  ghcr.io/mriffle/taxafasta:latest \
  -i uniprot_trembl.fasta.gz -t 2 --taxdump taxonomy -o bacteria.fasta

Common Taxonomy IDs

Taxonomy ID    Name
2              Bacteria
2157           Archaea
2759           Eukaryota
10239          Viruses
9606           Homo sapiens
7742           Vertebrata
40674          Mammalia
33208          Metazoa
3193           Embryophyta (land plants)
4751           Fungi

NCBI Taxonomy Data

By default, the tool automatically downloads and caches taxdump.tar.gz from NCBI's FTP server on first run. Users can supply pre-downloaded taxonomy files with --taxdump. See: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
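For readers preparing their own taxonomy files, the two dump files the tool needs can be pulled out of a taxdump.tar.gz archive with Python's tarfile module. This is a sketch under the assumption that nodes.dmp and merged.dmp sit at the top level of the archive; the function name is hypothetical and this is not how taxafasta itself necessarily reads the archive.

```python
import io
import tarfile
from typing import BinaryIO


def read_taxdump_members(
    fileobj: BinaryIO,
    names: tuple[str, ...] = ("nodes.dmp", "merged.dmp"),
) -> dict[str, list[str]]:
    """Return the text lines of the requested members of a taxdump .tar.gz."""
    contents: dict[str, list[str]] = {}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for name in names:
            member = tar.extractfile(name)  # raises KeyError if the member is missing
            if member is None:  # not a regular file
                raise ValueError(f"{name} is not a regular file in the archive")
            with member:
                text = member.read().decode("utf-8")
            contents[name] = text.splitlines(keepends=True)
    return contents
```

A typical call would be `read_taxdump_members(open("taxdump.tar.gz", "rb"))`. Reading both files fully into memory is acceptable here because they are small compared with the FASTA input.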

Development

git clone https://github.com/mriffle/taxafasta.git
cd taxafasta
pip install -e ".[all,dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=taxafasta

# Run smoke tests (requires network)
pytest -m smoke

# Lint and format
ruff check src/ tests/
ruff format src/ tests/

# Type check
mypy src/

License

Apache 2.0 — see LICENSE for details.
