ePLACE: environmental Phylogenetic Localisation and Clade Estimation - A library for analyzing eDNA sequences

A Python library for analyzing environmental DNA (eDNA) sequences through BLAST comparison and taxonomic classification.

Features

  • NCBI Database Management: Download and manage NCBI BLAST databases (core_nt)
  • FASTA File Processing: Read and validate FASTA files
  • BLAST Search: Run blastn searches with configurable parameters
  • Result Filtering: Filter BLAST results by identity and coverage thresholds
  • Taxonomic Analysis: Extract representative sequences per taxonomic rank
  • Sequence Extraction: Retrieve sequences from BLAST databases
  • Sequence Trimming: Trim reference sequences to aligned regions based on BLAST coordinates
  • Multiple Sequence Alignment: Align sequences using MAFFT with auto-orientation
  • Phylogenetic Trees: Build and label phylogenetic trees using IQTree
  • Results Summary Output: Write a tab-separated file summarising the per-sequence matches
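To make the "read and validate FASTA files" feature concrete, here is a minimal standalone sketch of what such validation can involve. It is not the ePLACE implementation; the function name `parse_fasta` and the specific checks (duplicate IDs, IUPAC characters, empty records) are assumptions chosen for illustration.

```python
from typing import Dict, Iterable

# IUPAC nucleotide codes plus the gap character (assumed alphabet for this sketch)
VALID_BASES = set("ACGTUNRYKMSWBDHV-")


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {sequence_id: sequence}, validating as we go."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header with no identifier")
            current_id = header[0]  # the ID is the first word of the header
            if current_id in records:
                raise ValueError(f"duplicate sequence id: {current_id}")
            records[current_id] = ""
        else:
            if current_id is None:
                raise ValueError("sequence data before the first header")
            bad = set(line.upper()) - VALID_BASES
            if bad:
                raise ValueError(f"invalid characters in {current_id}: {sorted(bad)}")
            records[current_id] += line.upper()
    empty = [seq_id for seq_id, seq in records.items() if not seq]
    if empty:
        raise ValueError(f"records with no sequence: {empty}")
    return records
```

To use it on a file, pass an open file handle, for example `parse_fasta(open("query.fasta"))`.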

Installation

# Clone the repository
git clone https://github.com/linsalrob/eplace.git
cd eplace

# Install the package
pip install .

# Or install in development mode
pip install -e .

# Or with development dependencies
pip install -e ".[dev]"

After installation, the eplace command will be available in your environment.

Requirements

  • Python 3.8 or higher
  • BLAST+ tools (blastn, blastdbcmd) must be installed separately:
    # Ubuntu/Debian
    sudo apt-get install ncbi-blast+
    
    # macOS with Homebrew
    brew install blast
    
  • TaxonKit (for taxonomy lookup):
    # Download from: https://github.com/shenwei356/taxonkit/releases
    # Or install via conda:
    conda install -c bioconda taxonkit
    
  • MAFFT (optional, for sequence alignment):
    # Ubuntu/Debian
    sudo apt-get install mafft
    
    # macOS with Homebrew
    brew install mafft
    
    # Or via conda:
    conda install -c bioconda mafft
    
  • IQTree (optional, for phylogenetic tree building):
    # Ubuntu/Debian
    sudo apt-get install iqtree
    
    # macOS with Homebrew
    brew install iqtree
    
    # Or via conda:
    conda install -c bioconda iqtree
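Before running a workflow it can help to confirm that the external tools listed above are on your PATH. The following is a small sketch using only the Python standard library. The helper names are hypothetical, and the executable names are assumptions: depending on how IQTree was installed, the binary may be called `iqtree2` rather than `iqtree`.

```python
import shutil
from typing import Dict, Iterable, List, Optional

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ("blastn", "blastdbcmd", "taxonkit")
OPTIONAL_TOOLS = ("mafft", "iqtree")


def find_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names: Iterable[str]) -> List[str]:
    """Return the tool names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    for name in missing_tools(REQUIRED_TOOLS):
        print(f"required tool not found: {name}")
    for name in missing_tools(OPTIONAL_TOOLS):
        print(f"optional tool not found: {name}")
```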
    

Quick Start

ePLACE provides a unified command-line interface with three main commands:

1. Download NCBI Database

# Download the core_nt database to default location
eplace download

# Force redownload even if database exists
eplace download --force

2. Run Individual BLAST Analysis

Run BLAST search and build one phylogenetic tree per query sequence:

# Basic usage with default parameters
eplace blast query.fasta output_dir

# With custom parameters
eplace blast query.fasta output_dir \
    --rank genus \
    --min-identity 95 \
    --min-coverage 85 \
    --num-threads 4

# Skip alignment and tree building (BLAST and extraction only)
eplace blast query.fasta output_dir --skip-alignment

# Show help
eplace blast --help
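If you want to drive the command line from a Python script (for example, to loop over several query files), one option is to build the argument list and hand it to `subprocess`. This sketch uses only the flags documented in this README; the helper names `build_blast_command` and `run_blast_command` are hypothetical and not part of ePLACE.

```python
import subprocess
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def build_blast_command(
    query: PathLike,
    output_dir: PathLike,
    rank: str = "genus",
    min_identity: float = 90.0,
    min_coverage: float = 80.0,
    num_threads: int = 1,
    skip_alignment: bool = False,
) -> List[str]:
    """Assemble an `eplace blast` argument list from documented options."""
    cmd = [
        "eplace", "blast", str(query), str(output_dir),
        "--rank", rank,
        "--min-identity", str(min_identity),
        "--min-coverage", str(min_coverage),
        "--num-threads", str(num_threads),
    ]
    if skip_alignment:
        cmd.append("--skip-alignment")
    return cmd


def run_blast_command(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems with paths that contain spaces.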

3. Run Grouped BLAST Analysis

Run BLAST search and group queries by taxonomic rank for joint phylogenetic analysis:

# Basic usage (group by class, default)
eplace grouped query.fasta output_dir

# Group by different taxonomic rank
eplace grouped query.fasta output_dir --group-rank order

# Specify both representative and grouping ranks
eplace grouped query.fasta output_dir --rank genus --group-rank family

# Show help
eplace grouped --help

Using the Library API

You can also use ePLACE as a Python library:

from eplace_lib import setup_ncbi_database

# Download the core_nt database
success, message = setup_ncbi_database()

from pathlib import Path
from eplace_lib import run_blast_search, process_blast_results_for_taxonomy

# Run BLAST search with filtering
success, filtered_hits = run_blast_search(
    query_fasta=Path("query.fasta"),
    output_file=Path("blast_results.txt"),
    min_identity=90.0,    # 90% identity threshold
    min_coverage=80.0     # 80% query coverage threshold
)

# Extract representative sequences by taxonomic rank
results = process_blast_results_for_taxonomy(
    blast_hits=filtered_hits,
    output_dir=Path("output"),
    rank="species"  # Options: phylum, class, order, family, genus, species
)
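The identity and coverage thresholds above can be understood with a small standalone sketch. It assumes each hit is a dictionary keyed by BLAST tabular field names (`pident`, `qstart`, `qend`, `qlen`) and computes coverage from the aligned query span. How `run_blast_search` computes coverage internally is not documented here, so treat this as an illustration of the idea rather than the library's method.

```python
from typing import Dict, Iterable, List

Hit = Dict[str, float]


def query_coverage(hit: Hit) -> float:
    """Percentage of the query covered by the alignment (1-based, inclusive)."""
    aligned = abs(hit["qend"] - hit["qstart"]) + 1
    return 100.0 * aligned / hit["qlen"]


def filter_hits(
    hits: Iterable[Hit],
    min_identity: float = 90.0,
    min_coverage: float = 80.0,
) -> List[Hit]:
    """Keep hits that meet both the identity and the coverage threshold."""
    return [
        hit for hit in hits
        if hit["pident"] >= min_identity and query_coverage(hit) >= min_coverage
    ]
```

For example, a hit aligned over positions 1 to 190 of a 200 bp query has 95% coverage and passes the default 80% threshold, while one aligned over positions 1 to 100 has 50% coverage and is dropped.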

Command-Line Interface

The eplace command provides three subcommands:

eplace download

Download and setup the NCBI core_nt BLAST database.

Usage:

eplace download [--force]

Options:

  • --force: Force redownload even if database exists

Notes:

  • Database will be stored in $BLASTDB if set, otherwise ~/blastdb
  • The download is large (several GB) and may take time
  • MD5 checksums are verified automatically
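The notes above describe two behaviours that are easy to sketch with the standard library: choosing the database directory and verifying an MD5 checksum. The function names are hypothetical, and the sketch simplifies by treating `$BLASTDB` as a single directory; it does not reproduce the actual download code.

```python
import hashlib
import os
from pathlib import Path
from typing import Mapping, Optional, Union


def resolve_blastdb_dir(environ: Optional[Mapping[str, str]] = None) -> Path:
    """Return $BLASTDB if it is set and non-empty, otherwise ~/blastdb."""
    env = os.environ if environ is None else environ
    value = env.get("BLASTDB")
    return Path(value) if value else Path.home() / "blastdb"


def md5_matches(path: Union[str, Path], expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Compare a file's MD5 digest with the expected hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large database archives are never loaded whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```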

eplace blast

Run BLAST search with individual taxonomy analysis. Creates one phylogenetic tree per query sequence.

Usage:

eplace blast QUERY_FASTA OUTPUT_DIR [OPTIONS]

Required Arguments:

  • QUERY_FASTA: Path to query FASTA file
  • OUTPUT_DIR: Output directory for results

Optional Arguments:

  • --rank {phylum,class,order,family,genus,species}: Taxonomic rank for representative selection (default: genus)
  • --tree-label-rank {phylum,class,order,family,genus,species}: Taxonomic rank for tree labeling (default: genus)
  • --min-identity FLOAT: Minimum percent identity for BLAST hits (default: 90.0)
  • --min-coverage FLOAT: Minimum query coverage percentage (default: 80.0)
  • --database NAME: BLAST database name (default: core_nt)
  • --blastdb-path PATH: Path to BLAST database directory
  • --num-threads INT: Number of threads for BLAST and alignment (default: 1)
  • --overwrite-existing-blast: Overwrite existing BLAST results
  • --skip-alignment: Skip alignment and tree building steps
  • --output-classification PATH: Path to output classification TSV file

eplace grouped

Run BLAST search with grouped taxonomy analysis. Groups queries by taxonomic rank and creates one phylogenetic tree per group.

Usage:

eplace grouped QUERY_FASTA OUTPUT_DIR [OPTIONS]

Required Arguments:

  • QUERY_FASTA: Path to query FASTA file
  • OUTPUT_DIR: Output directory for results

Optional Arguments:

  • --rank {phylum,class,order,family,genus,species}: Taxonomic rank for representative selection (default: genus)
  • --group-rank {phylum,class,order,family,genus,species}: Taxonomic rank for grouping sequences (default: class)
  • --tree-label-rank {phylum,class,order,family,genus,species}: Taxonomic rank for tree labeling (default: genus)
  • --min-identity FLOAT: Minimum percent identity for BLAST hits (default: 90.0)
  • --min-coverage FLOAT: Minimum query coverage percentage (default: 80.0)
  • --database NAME: BLAST database name (default: core_nt)
  • --blastdb-path PATH: Path to BLAST database directory
  • --num-threads INT: Number of threads for BLAST and alignment (default: 1)
  • --overwrite-existing-blast: Overwrite existing BLAST results
  • --skip-alignment: Skip alignment and tree building steps
  • --alignment-tolerance INT: Maximum coordinate difference for alignment consistency (default: 50)
  • --output-classification PATH: Path to output classification TSV file

Documentation

Full documentation is available at Read the Docs.

Local Documentation

You can also build the documentation locally:

cd docs
make html
# Open docs/build/html/index.html in your browser

Workflow Comparison

Individual Workflow (eplace blast)

The individual workflow processes each query sequence independently:

  • Creates one output directory per query sequence
  • Extracts representative sequences for each query at the specified taxonomic rank
  • Builds one multiple sequence alignment per query
  • Creates one phylogenetic tree per query

Use when: You want to analyze each query sequence in its own phylogenetic context.
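One way to picture "representative sequences at the specified taxonomic rank" is to keep a single best hit per taxon. The sketch below assumes each hit already carries a `taxon` name at the chosen rank, a subject ID (`sseqid`) and a `bitscore`, and it picks the highest-scoring hit per taxon. The selection criterion ePLACE actually uses is not stated in this README, so this is an assumption for illustration only.

```python
from typing import Dict, Iterable, Union

Hit = Dict[str, Union[str, float]]


def pick_representatives(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Return one hit per taxon, choosing the highest bitscore (assumed rule)."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        taxon = hit["taxon"]
        # Keep the first hit seen for a taxon unless a later one scores higher.
        if taxon not in best or hit["bitscore"] > best[taxon]["bitscore"]:
            best[taxon] = hit
    return best
```

With `--rank genus`, for instance, ten hits spread across three genera would reduce to three representative sequences under this rule.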

Grouped Workflow (eplace grouped)

The grouped workflow combines queries by taxonomic classification:

  • Groups all queries that match to the same taxonomic rank (e.g., class, order)
  • Creates one FASTA file per group containing all queries and unique reference sequences
  • Removes redundant reference sequences within each group
  • Builds one alignment and phylogenetic tree per group (instead of per query)

Use when: You want to analyze multiple related queries together in a single phylogenetic context.
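The grouping and de-duplication described above can be sketched as follows. The input format, a list of `(query_id, group_name, reference_ids)` tuples, is a hypothetical simplification chosen for this example and does not mirror ePLACE's internal data structures.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Assignment = Tuple[str, str, List[str]]


def group_queries(assignments: Iterable[Assignment]) -> Dict[str, Dict[str, List[str]]]:
    """Collect queries per taxonomic group and keep each reference once."""
    groups: Dict[str, Dict[str, List[str]]] = defaultdict(
        lambda: {"queries": [], "references": []}
    )
    for query_id, group_name, reference_ids in assignments:
        group = groups[group_name]
        group["queries"].append(query_id)
        for ref in reference_ids:
            # Skip references already contributed by another query in the group.
            if ref not in group["references"]:
                group["references"].append(ref)
    return dict(groups)
```

Two queries assigned to the same class that share a reference sequence end up in one group, with that reference listed only once.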

Examples

# Group queries by class (default)
eplace grouped query.fasta output_dir

# Group by a different taxonomic rank
eplace grouped query.fasta output_dir --group-rank order

# Specify both representative rank and grouping rank
eplace grouped query.fasta output_dir --rank genus --group-rank family

Testing

Run the test suite:

# Run all tests
pytest tests/ -v

# Run specific test modules
pytest tests/test_blast_analysis.py -v
pytest tests/test_taxonomy.py -v
pytest tests/test_workflow.py -v

# Run with coverage
pytest tests/ --cov=eplace_lib --cov-report=html

Project Structure

eplace/
├── src/
│   └── eplace_lib/
│       ├── __init__.py
│       ├── blast_analysis.py    # BLAST operations
│       ├── ncbi_download.py     # Database management
│       ├── sequences.py         # Sequence analysis utilities
│       └── taxonomy.py          # Taxonomy extraction
├── tests/
│   ├── test_blast_analysis.py
│   ├── test_ncbi_download.py
│   ├── test_taxonomy.py
│   └── test_workflow.py
├── examples/
│   ├── blast_workflow_example.py
│   └── download_ncbi_example.py
├── docs/
│   ├── blast_workflow.md
│   └── ncbi_download.md
└── pyproject.toml

Workflow Overview

  1. Download Database: Use setup_ncbi_database() to download NCBI core_nt database
  2. Prepare Query: Create a FASTA file with your query sequences
  3. Run BLAST: Use run_blast_search() to search against the database
  4. Filter Results: Automatically filter by identity and coverage thresholds
  5. Extract Representatives: Select representative sequences per taxonomic rank
  6. Trim Sequences: Extract aligned regions from reference sequences based on BLAST coordinates
  7. Align Sequences: Use MAFFT to align query with trimmed reference sequences (optional)
  8. Build Tree: Build phylogenetic tree using IQTree with taxonomic labels (optional)
  9. Output: Get FASTA files, alignments, and trees (one set per query)
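Step 6 relies on BLAST subject coordinates, which are 1-based and inclusive, and which are reported with start greater than end for minus-strand hits. The sketch below shows one way to trim a reference to the aligned region under those conventions. Reverse-complementing minus-strand hits is an assumption made here; since the README notes that MAFFT is run with auto-orientation, the actual tool may leave orientation to the aligner.

```python
# Translation table for complementing unambiguous DNA bases
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def trim_to_alignment(sequence: str, sstart: int, send: int) -> str:
    """Return the aligned region of a reference given BLAST subject coordinates."""
    low, high = min(sstart, send), max(sstart, send)
    if low < 1 or high > len(sequence):
        raise ValueError("coordinates fall outside the reference sequence")
    region = sequence[low - 1:high]  # convert 1-based inclusive to a Python slice
    if sstart > send:
        # Minus-strand hit: reverse-complement so it reads in query orientation.
        region = region.translate(_COMPLEMENT)[::-1]
    return region
```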

Grouped Workflow Overview

The grouped workflow follows the standard workflow through representative extraction, then adds grouping and consistency-check steps:

  1-5. Same as the standard workflow, through representative extraction
  6. Group by Rank: Group all queries by the specified taxonomic rank (e.g., class)
  7. Create Grouped FASTA: Combine all queries and unique references for each group
  8. Trim Sequences: Trim references to aligned regions
  9. Check Consistency: Verify that BLAST hits align to similar locations on the references
  10. Align and Build Trees: Create one alignment and tree per taxonomic group
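The consistency check can be illustrated with a short sketch. It assumes that "similar locations" means the start and end coordinates of all hits on one reference differ by no more than a tolerance, in the spirit of the `--alignment-tolerance` option (default 50). The exact rule ePLACE applies is not described in this README, so the function below is a hypothetical interpretation.

```python
from typing import Iterable, Tuple


def coordinates_consistent(hits: Iterable[Tuple[int, int]], tolerance: int = 50) -> bool:
    """Check that (sstart, send) pairs on one reference agree within tolerance."""
    pairs = list(hits)
    if not pairs:
        return True  # nothing to compare
    # Normalise so minus-strand hits (start > end) are compared consistently.
    starts = [min(start, end) for start, end in pairs]
    ends = [max(start, end) for start, end in pairs]
    return (max(starts) - min(starts) <= tolerance
            and max(ends) - min(ends) <= tolerance)
```

Under this interpretation, hits at 100-400 and 120-390 on the same reference are consistent with the default tolerance, while hits at 100-400 and 300-600 are not.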

Output Structure

Standard Workflow Output

output_dir/
├── blast_results.txt              # Raw BLAST results
├── blast_results_annotated.txt    # BLAST results with taxonomic annotations
├── query1_id/
│   ├── query1_id_representatives.fasta          # Representative sequences
│   ├── query1_id_with_query.fasta              # Query + representatives
│   ├── query1_id_trimmed.fasta                 # Trimmed to aligned regions
│   ├── query1_id_aligned.fasta                 # Multiple sequence alignment
│   ├── query1_id_tree.treefile                 # Phylogenetic tree
│   ├── query1_id_tree_labeled.treefile         # Tree with taxonomic labels
│   └── query1_id_tree.* (other IQTree files)
├── query2_id/
│   └── query2_id_representatives.fasta
└── ...
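Given the layout above, downstream scripts often need to find which queries actually produced a labeled tree. The sketch below walks an output directory with `pathlib` and relies only on the `*_tree_labeled.treefile` naming shown in the listing; the function name `labeled_trees` is hypothetical.

```python
from pathlib import Path
from typing import Dict, Union


def labeled_trees(output_dir: Union[str, Path]) -> Dict[str, Path]:
    """Map each subdirectory name to its labeled tree file, if one exists."""
    root = Path(output_dir)
    trees: Dict[str, Path] = {}
    # Look one level down: output_dir/<name>/<name>_tree_labeled.treefile
    for tree in sorted(root.glob("*/*_tree_labeled.treefile")):
        trees[tree.parent.name] = tree
    return trees
```

Subdirectories that contain only a representatives FASTA (for example, when `--skip-alignment` was used) are simply absent from the result.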

Grouped Workflow Output

output_dir/
├── blast_results.txt              # Raw BLAST results
├── blast_results_annotated.txt    # BLAST results with taxonomic annotations
├── query1_id/                     # Per-query representative sequences (from step 5)
│   └── query1_id_representatives.fasta
├── query2_id/
│   └── query2_id_representatives.fasta
├── Taxonomic_Group_1/             # One directory per taxonomic group
│   ├── Taxonomic_Group_1_combined.fasta        # All queries + unique references
│   ├── Taxonomic_Group_1_trimmed.fasta         # Trimmed to aligned regions
│   ├── Taxonomic_Group_1_aligned.fasta         # Multiple sequence alignment
│   ├── Taxonomic_Group_1_tree.treefile         # Phylogenetic tree
│   ├── Taxonomic_Group_1_tree_labeled.treefile # Tree with taxonomic labels
│   └── Taxonomic_Group_1_tree.* (other IQTree files)
├── Taxonomic_Group_2/
│   └── ...
└── ...

License

MIT License - See LICENSE file for details

Authors

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Citation

If you use ePLACE in your research, please cite:

Edwards, R. (2024). ePLACE: environmental Phylogenetic Localisation and Clade Estimation.
GitHub repository: https://github.com/linsalrob/eplace

Support

For issues, questions, or suggestions, please open an issue on GitHub: https://github.com/linsalrob/eplace/issues
