Skip to main content

ConSite: conserved-domain alignment and conserved-site visualization from protein FASTA

Project description

ConSite

ConSite is a bioinformatics tool that takes a protein FASTA sequence as input, identifies conserved domains via local Pfam/HMMER searches, detects conserved sites within aligned regions, and outputs both structured data and publication-quality visualizations.

Features

  • FASTA input → conserved domain search using local Pfam database and HMMER
  • Automatic domain alignment using Pfam SEED alignments
  • Per-position conservation scoring (entropy, Jensen–Shannon divergence, consensus frequency)
  • Conserved site detection with adjustable thresholds
  • Publication-quality visualization:
    • Linear domain maps with highlighted conserved sites
    • Per-domain alignment panels with legible sequence display
    • Hollow red circles marking conserved positions
  • Command-line interface (CLI) with comprehensive logging
  • Reproducible outputs (JSON, TSV, PNG, Stockholm alignments)

Installation

Prerequisites

  • Python 3.10 or higher
  • HMMER 3.x installed and available in PATH
  • Pfam database files (see Quick Start below)

Installing HMMER

macOS (Homebrew):

brew install hmmer

Linux (APT):

sudo apt-get update
sudo apt-get install hmmer

Windows (conda):

conda install -c conda-forge hmmer

Verify installation:

hmmsearch --version

From Source (1)

git clone https://github.com/yangli-evo/ConSite.git
cd ConSite

Quick Start

Option 1: Automatic Setup (Recommended) (2)

We provide helper scripts to automate the setup process:

# Make scripts executable
chmod +x scripts/*.sh

# Download and set up Pfam database
./scripts/get_pfam.sh

# Run the demo
./scripts/quickstart.sh

Note: The scripts have different purposes:

  • get_pfam.sh: Downloads and prepares the Pfam database files
  • quickstart.sh: Sets up the Python environment and runs the demo

Option 2: Manual Setup (2)

If you prefer to set up things manually or already have some components:

1. Install ConSite

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .

2. Download Pfam Database

# Create directory for Pfam files
mkdir -p pfam_db

# Download Pfam-A HMM library
curl -L -o pfam_db/Pfam-A.hmm.gz https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip pfam_db/Pfam-A.hmm.gz

# Download Pfam-A SEED alignments
curl -L -o pfam_db/Pfam-A.seed.gz https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.seed.gz
gunzip pfam_db/Pfam-A.seed.gz

# Press the HMM library for HMMER
hmmpress pfam_db/Pfam-A.hmm

3. Run ConSite

# Basic run with example protein
consite \
  --fasta examples/P05362.fasta \
  --pfam-hmm pfam_db/Pfam-A.hmm \
  --pfam-seed pfam_db/Pfam-A.seed \
  --out results \
  --id P05362

# With custom parameters
consite \
  --fasta myprotein.fasta \
  --pfam-hmm pfam_db/Pfam-A.hmm \
  --pfam-seed pfam_db/Pfam-A.seed \
  --out results \
  --topn 5 \
  --cpu 8 \
  --jsd-top-percent 15 \
  --log results/run.log

Output Files (4)

Each run produces a results folder containing:

  • query.fasta - Input sequence used for analysis
  • hits.json - Structured domain hit information
  • scores.tsv - Per-position conservation scores and flags (columns: pos, in_domain, jsd, entropy, is_conserved)
  • domain_map.png - Full sequence domain visualization with conserved sites marked
  • *_panel.png - Individual domain alignment panels showing sequence and conserved positions
  • *_aligned.sto - Stockholm format alignments of query to domain HMMs
  • hmmsearch.domtblout - Raw HMMER domain table output
  • run.log - Complete log of all external tool executions

Command Line Options (5)

Option Description Default
--fasta Input protein FASTA file Required
--pfam-hmm Path to Pfam-A.hmm (pressed) Required
--pfam-seed Path to Pfam-A.seed Required
--out Output directory Required
--id Custom run ID (default: FASTA header) Auto-detected
--topn Number of top domains to analyze 2
--cpu Number of CPU cores for HMMER 4
--jsd-top-percent Top % of positions called conserved 10.0
--log Log file for external tool output results/<id>/run.log
--quiet Suppress console output False
--keep Preserve existing output folder False

How It Works (6)

  1. Domain Detection: Uses hmmsearch against Pfam-A.hmm to identify conserved domains
  2. SEED Extraction: Pulls the corresponding Pfam SEED alignment for each hit
  3. HMM Building: Creates a per-family HMM from the SEED using hmmbuild
  4. Sequence Alignment: Aligns the query protein to the domain HMM using hmmalign
  5. Conservation Scoring: Computes JSD and entropy scores for each position
  6. Visualization: Generates domain maps and alignment panels

Current Status

What works:

  • Complete domain detection and alignment pipeline
  • Publication-quality visualizations
  • Structured data outputs (JSON, TSV)
  • Robust HMMER integration with logging

Known limitations:

  • Conservation scoring currently uses query-only alignment (JSD/entropy are placeholders)
  • Remote CDD mode is not yet implemented
  • Conservation thresholds are relative (top X%) rather than absolute

Next steps planned:

  • Integrate Pfam SEED sequences for real conservation scoring
  • Add absolute conservation thresholds
  • Implement remote CDD mode

Example Output (7)

For the included ICAM1 example (P05362), ConSite identifies:

  • PF03921 (positions 25-115): Ig-like domain
  • PF21146 (positions 219-308): Ig-like domain

The tool produces clean visualizations showing domain boundaries and conserved sites as hollow red circles.

Expected Results

After running the demo, you should see:

  • A results/P05362/ folder containing all outputs
  • Two domain panels: 1_PF03921_panel.png and 2_PF21146_panel.png
  • A full sequence map: domain_map.png
  • Structured data: hits.json and scores.tsv

Troubleshooting (8)

Common Issues

"command not found: hmmsearch"

  • Install HMMER using the instructions in Prerequisites above
  • Ensure it's in your PATH: which hmmsearch

"No such file or directory: pfam_db/Pfam-A.hmm"

  • Run the helper script: ./scripts/get_pfam.sh
  • Or manually download Pfam files as shown in Manual Setup

"Permission denied" when running scripts

  • Make scripts executable: chmod +x scripts/*.sh

Large log files

  • Use --quiet to suppress verbose output
  • Check --log path is writable

Development (9)

Current Project Structure

ConSite/
├── src/consite/          # Main package source
│   ├── cli.py            # Command-line interface
│   ├── hmmer_local.py    # HMMER tool wrappers
│   ├── parse_domtbl.py   # HMMER output parsing
│   ├── pfam.py           # Pfam SEED extraction
│   ├── msa_io.py         # Multiple sequence alignment I/O
│   ├── conserve.py       # Conservation scoring algorithms
│   ├── viz.py            # Visualization functions
│   └── utils.py          # Shared utilities
├── examples/              # Example input files
├── scripts/               # Helper automation scripts
├── pfam_db/              # Pfam database files (not included in repo)
└── results/               # Output directory (not included in repo)

Important Notes for Collaborators

  • Large files are excluded: The pfam_db/ and results/ directories are in .gitignore
  • Pfam database: You'll need to download this using the helper scripts
  • Virtual environment: The .venv/ folder is also excluded - you'll create your own
  • Example data: The examples/P05362.fasta file is included for testing

Dependencies

  • biopython ≥ 1.81 - Sequence and alignment handling
  • numpy ≥ 2.0 - Numerical computations
  • matplotlib ≥ 3.7 - Visualization generation
  • pandas ≥ 2.0 - Data manipulation
  • scipy ≥ 1.16 - Scientific computing
  • Python ≥ 3.10 - Required Python version

Citation (10)

If you use ConSite in your research, please cite:

Joey Wagner, Yang Li. ConSite: conserved-domain alignment and conserved-site visualization from protein FASTA.

License

This project is licensed under the terms specified in the LICENSE file.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

consite-0.1.0a1.tar.gz (19.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

consite-0.1.0a1-py3-none-any.whl (18.1 kB view details)

Uploaded Python 3

File details

Details for the file consite-0.1.0a1.tar.gz.

File metadata

  • Download URL: consite-0.1.0a1.tar.gz
  • Upload date:
  • Size: 19.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for consite-0.1.0a1.tar.gz
Algorithm Hash digest
SHA256 c890e358b4852e1b1461aec52e933c73f0169268c07959dc50b280b8a1f2852c
MD5 36e0ff601a90413451d9c51f9cf5b446
BLAKE2b-256 48b7b6abe4651e7f0fa0f5916b779b3b740883db2340c1e3c6a35158d3da4298

See more details on using hashes here.

File details

Details for the file consite-0.1.0a1-py3-none-any.whl.

File metadata

  • Download URL: consite-0.1.0a1-py3-none-any.whl
  • Upload date:
  • Size: 18.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for consite-0.1.0a1-py3-none-any.whl
Algorithm Hash digest
SHA256 08f71efea54c90620d4e4426766476df6191a88085e4b8d0db56451aab944231
MD5 4d29162da25acaa6fa8aade4549a5567
BLAKE2b-256 d7dbbd49a369306eb2665ab3703fc6f35ac25165b53ac1cbfc616df5945b7580

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page