ConSite: conserved-domain alignment and conserved-site visualization from protein FASTA
ConSite
ConSite is a bioinformatics tool that takes a protein FASTA sequence as input, identifies conserved domains via local Pfam/HMMER searches, detects conserved sites within aligned regions, and outputs both structured data and publication-quality visualizations.
Features
- FASTA input → conserved domain search using local Pfam database and HMMER
- Automatic domain alignment using Pfam SEED alignments
- Per-position conservation scoring (entropy, Jensen–Shannon divergence, consensus frequency)
- Conserved site detection with adjustable thresholds
- Publication-quality visualization:
  - Linear domain maps with highlighted conserved sites
  - Per-domain alignment panels with legible sequence display
  - Hollow red circles marking conserved positions
- Command-line interface (CLI) with comprehensive logging
- Reproducible outputs (JSON, TSV, PNG, Stockholm alignments)
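The three per-position scores named above can be illustrated on a single alignment column. This is a minimal sketch, not the code in conserve.py: the function names, the uniform background distribution, and the small pseudocount are all assumptions made for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: a uniform background; a real scorer may use observed frequencies.
UNIFORM = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}


def column_frequencies(column, pseudocount=1e-6):
    """Amino-acid frequencies for one alignment column; gaps are ignored."""
    counts = Counter(c for c in column.upper() if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def shannon_entropy(freqs):
    """Entropy in bits; low values indicate a conserved column."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def jensen_shannon_divergence(freqs, background=UNIFORM):
    """JSD in bits between the column and the background; high means conserved."""
    def kl(p, q):
        return sum(p[a] * math.log2(p[a] / q[a]) for a in p if p[a] > 0)

    mid = {a: 0.5 * (freqs[a] + background[a]) for a in freqs}
    return 0.5 * kl(freqs, mid) + 0.5 * kl(background, mid)


def consensus_frequency(column):
    """Fraction of non-gap residues that match the most common residue."""
    residues = [c for c in column.upper() if c in AMINO_ACIDS]
    if not residues:
        return 0.0
    return Counter(residues).most_common(1)[0][1] / len(residues)
```

A fully conserved column such as "AAAA" gives entropy close to 0 and a higher JSD than a mixed column such as "ACDE".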
Installation
Prerequisites
- Python 3.10 or higher
- HMMER 3.x installed and available in PATH
- Pfam database files (see Quick Start below)
Installing HMMER
macOS (Homebrew):
brew install hmmer
Linux (APT):
sudo apt-get update
sudo apt-get install hmmer
Windows (conda):
conda install -c conda-forge hmmer
Verify installation:
hmmsearch --version
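The same check can be scripted from Python before a run. This is a small sketch using only the standard library; the helper names are made up for the example, and the list of tools is taken from the HMMER programs mentioned in this README.

```python
import shutil

# HMMER programs referenced in this README.
HMMER_TOOLS = ("hmmsearch", "hmmbuild", "hmmalign", "hmmpress")


def find_tools(names=HMMER_TOOLS):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=HMMER_TOOLS):
    """Return the names that could not be resolved on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All HMMER tools found.")
```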
From Source
git clone https://github.com/yangli-evo/ConSite.git
cd ConSite
Quick Start
Option 1: Automatic Setup (Recommended)
We provide helper scripts to automate the setup process:
# Make scripts executable
chmod +x scripts/*.sh
# Download and set up Pfam database
./scripts/get_pfam.sh
# Run the demo
./scripts/quickstart.sh
Note: The scripts have different purposes:
- get_pfam.sh: Downloads and prepares the Pfam database files
- quickstart.sh: Sets up the Python environment and runs the demo
Option 2: Manual Setup
If you prefer to set up things manually or already have some components:
1. Install ConSite
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -e .
2. Download Pfam Database
# Create directory for Pfam files
mkdir -p pfam_db
# Download Pfam-A HMM library
curl -L -o pfam_db/Pfam-A.hmm.gz https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip pfam_db/Pfam-A.hmm.gz
# Download Pfam-A SEED alignments
curl -L -o pfam_db/Pfam-A.seed.gz https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.seed.gz
gunzip pfam_db/Pfam-A.seed.gz
# Press the HMM library for HMMER
hmmpress pfam_db/Pfam-A.hmm
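hmmpress writes four binary index files next to the HMM library (.h3f, .h3i, .h3m, .h3p). A quick way to confirm the library is pressed is to look for them. The helper below is a hypothetical convenience for this README, not part of ConSite.

```python
from pathlib import Path

# Index files that hmmpress creates alongside the HMM library.
PRESSED_SUFFIXES = (".h3f", ".h3i", ".h3m", ".h3p")


def missing_pressed_files(hmm_path):
    """Return the names of hmmpress index files not found next to hmm_path."""
    hmm = Path(hmm_path)
    return [
        hmm.name + suffix
        for suffix in PRESSED_SUFFIXES
        if not Path(str(hmm) + suffix).exists()
    ]


if __name__ == "__main__":
    missing = missing_pressed_files("pfam_db/Pfam-A.hmm")
    if missing:
        print("Not pressed yet, missing: " + ", ".join(missing))
    else:
        print("Pfam-A.hmm is pressed.")
```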
3. Run ConSite
# Basic run with example protein
consite \
--fasta examples/P05362.fasta \
--pfam-hmm pfam_db/Pfam-A.hmm \
--pfam-seed pfam_db/Pfam-A.seed \
--out results \
--id P05362
# With custom parameters
consite \
--fasta myprotein.fasta \
--pfam-hmm pfam_db/Pfam-A.hmm \
--pfam-seed pfam_db/Pfam-A.seed \
--out results \
--topn 5 \
--cpu 8 \
--jsd-top-percent 15 \
--log results/run.log
Output Files
Each run produces a results folder containing:
- query.fasta - Input sequence used for analysis
- hits.json - Structured domain hit information
- scores.tsv - Per-position conservation scores and flags (columns: pos, in_domain, jsd, entropy, is_conserved)
- domain_map.png - Full sequence domain visualization with conserved sites marked
- *_panel.png - Individual domain alignment panels showing sequence and conserved positions
- *_aligned.sto - Stockholm format alignments of query to domain HMMs
- hmmsearch.domtblout - Raw HMMER domain table output
- run.log - Complete log of all external tool executions
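For downstream analysis, scores.tsv can be read with the standard csv module. The sketch below relies only on the column names listed above. How the is_conserved flag is encoded is an assumption (here "1", "true", or "yes" count as set), and the sample rows are made up for illustration, not real ConSite output.

```python
import csv
import io

# Assumption: these strings mark a conserved position in the is_conserved column.
TRUTHY = {"1", "true", "yes"}


def conserved_positions(handle):
    """Return the positions whose is_conserved flag is set in a scores.tsv handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        int(row["pos"])
        for row in reader
        if row["is_conserved"].strip().lower() in TRUTHY
    ]


# Made-up rows for illustration only.
SAMPLE = (
    "pos\tin_domain\tjsd\tentropy\tis_conserved\n"
    "1\t0\t0.00\t0.00\t0\n"
    "2\t1\t0.71\t0.35\t1\n"
    "3\t1\t0.12\t2.10\t0\n"
)

if __name__ == "__main__":
    print(conserved_positions(io.StringIO(SAMPLE)))
```

For a real run, the handle would come from opening the scores.tsv file in the results folder instead of the in-memory sample.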
Command Line Options
| Option | Description | Default |
|---|---|---|
| --fasta | Input protein FASTA file | Required |
| --pfam-hmm | Path to Pfam-A.hmm (pressed) | Required |
| --pfam-seed | Path to Pfam-A.seed | Required |
| --out | Output directory | Required |
| --id | Custom run ID (default: FASTA header) | Auto-detected |
| --topn | Number of top domains to analyze | 2 |
| --cpu | Number of CPU cores for HMMER | 4 |
| --jsd-top-percent | Top % of positions called conserved | 10.0 |
| --log | Log file for external tool output | results/<id>/run.log |
| --quiet | Suppress console output | False |
| --keep | Preserve existing output folder | False |
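To make the --jsd-top-percent option concrete, here is one way a relative "top X%" cutoff could be applied to per-position scores. This is a sketch under stated assumptions: the function name is hypothetical, the count is rounded up, and at least one position is always returned. ConSite's actual rounding and tie handling may differ.

```python
import math


def top_percent_positions(scores, percent=10.0):
    """Return the positions with the highest scores, as a sorted list.

    scores maps position -> score. The number of positions kept is
    ceil(len(scores) * percent / 100), with a minimum of one.
    """
    if not scores or percent <= 0:
        return []
    n_keep = max(1, math.ceil(len(scores) * percent / 100.0))
    ranked = sorted(scores, key=lambda pos: scores[pos], reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    # Toy scores: position i has score i / 100.
    toy = {i: i / 100 for i in range(1, 21)}
    print(top_percent_positions(toy, percent=10.0))
```

Because the cutoff is relative, some positions are always flagged, even in a poorly conserved domain.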
How It Works
- Domain Detection: Uses hmmsearch against Pfam-A.hmm to identify conserved domains
- SEED Extraction: Pulls the corresponding Pfam SEED alignment for each hit
- HMM Building: Creates a per-family HMM from the SEED using hmmbuild
- Sequence Alignment: Aligns the query protein to the domain HMM using hmmalign
- Conservation Scoring: Computes JSD and entropy scores for each position
- Visualization: Generates domain maps and alignment panels
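The three HMMER calls in the steps above can be sketched as plain command lines built in Python. This is not the code in hmmer_local.py: the function names and the logging helper are assumptions, and only standard HMMER options are used (--cpu and --domtblout for hmmsearch, -o for hmmalign).

```python
import subprocess


def hmmsearch_cmd(hmm_db, fasta, domtblout, cpu=4):
    """Search the query FASTA against the pressed Pfam-A library."""
    return ["hmmsearch", "--cpu", str(cpu), "--domtblout", str(domtblout),
            str(hmm_db), str(fasta)]


def hmmbuild_cmd(hmm_out, seed_sto):
    """Build a per-family HMM from a SEED alignment in Stockholm format."""
    return ["hmmbuild", str(hmm_out), str(seed_sto)]


def hmmalign_cmd(hmm, fasta, out_sto):
    """Align the query to the family HMM; hmmalign writes Stockholm by default."""
    return ["hmmalign", "-o", str(out_sto), str(hmm), str(fasta)]


def run_logged(cmd, log_path):
    """Run one command, appending the command line and its output to a log file."""
    with open(log_path, "a") as log:
        log.write("$ " + " ".join(cmd) + "\n")
        log.flush()
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.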
Current Status
What works:
- Complete domain detection and alignment pipeline
- Publication-quality visualizations
- Structured data outputs (JSON, TSV)
- Robust HMMER integration with logging
Known limitations:
- Conservation scoring currently uses query-only alignment (JSD/entropy are placeholders)
- Remote CDD mode is not yet implemented
- Conservation thresholds are relative (top X%) rather than absolute
Next steps planned:
- Integrate Pfam SEED sequences for real conservation scoring
- Add absolute conservation thresholds
- Implement remote CDD mode
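As groundwork for the first planned step, the SEED record for one family can be pulled out of Pfam-A.seed, which is a concatenation of Stockholm records that each end with a "//" line and carry their accession on a "#=GF AC" line. The sketch below is not the code in pfam.py: the function name is hypothetical, and the sample records and accessions are made up.

```python
import io


def extract_seed_block(handle, accession):
    """Return the Stockholm text of the first record matching accession, or None.

    The version suffix is ignored, so "PF99999" matches "PF99999.2".
    """
    wanted_acc = accession.split(".")[0]
    block, wanted = [], False
    for line in handle:
        block.append(line)
        if line.startswith("#=GF AC"):
            parts = line.split()
            wanted = len(parts) >= 3 and parts[2].split(".")[0] == wanted_acc
        elif line.startswith("//"):
            if wanted:
                return "".join(block)
            block, wanted = [], False  # start collecting the next record
    return None


# Made-up two-record file for illustration only.
SEED_SAMPLE = (
    "# STOCKHOLM 1.0\n#=GF ID   FamA\n#=GF AC   PF99998.1\nseqA/1-4  ACDE\n//\n"
    "# STOCKHOLM 1.0\n#=GF ID   FamB\n#=GF AC   PF99999.2\nseqB/1-4  ACDF\n//\n"
)

if __name__ == "__main__":
    print(extract_seed_block(io.StringIO(SEED_SAMPLE), "PF99999"))
```

When reading the real Pfam-A.seed, opening the file with encoding="latin-1" may be needed if it contains non-UTF-8 bytes; treat that as an assumption to verify against your download.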
Example Output
For the included ICAM1 example (P05362), ConSite identifies:
- PF03921 (positions 25-115): Ig-like domain
- PF21146 (positions 219-308): Ig-like domain
The tool produces clean visualizations showing domain boundaries and conserved sites as hollow red circles.
Expected Results
After running the demo, you should see:
- A results/P05362/ folder containing all outputs
- Two domain panels: 1_PF03921_panel.png and 2_PF21146_panel.png
- A full sequence map: domain_map.png
- Structured data: hits.json and scores.tsv
Troubleshooting
Common Issues
"command not found: hmmsearch"
- Install HMMER using the instructions in Prerequisites above
- Ensure it's in your PATH: which hmmsearch
"No such file or directory: pfam_db/Pfam-A.hmm"
- Run the helper script: ./scripts/get_pfam.sh
- Or manually download Pfam files as shown in Manual Setup
"Permission denied" when running scripts
- Make scripts executable: chmod +x scripts/*.sh
Large log files
- Use --quiet to suppress verbose output
- Check --log path is writable
Development
Current Project Structure
ConSite/
├── src/consite/ # Main package source
│ ├── cli.py # Command-line interface
│ ├── hmmer_local.py # HMMER tool wrappers
│ ├── parse_domtbl.py # HMMER output parsing
│ ├── pfam.py # Pfam SEED extraction
│ ├── msa_io.py # Multiple sequence alignment I/O
│ ├── conserve.py # Conservation scoring algorithms
│ ├── viz.py # Visualization functions
│ └── utils.py # Shared utilities
├── examples/ # Example input files
├── scripts/ # Helper automation scripts
├── pfam_db/ # Pfam database files (not included in repo)
└── results/ # Output directory (not included in repo)
Important Notes for Collaborators
- Large files are excluded: The pfam_db/ and results/ directories are in .gitignore
- Pfam database: You'll need to download this using the helper scripts
- Virtual environment: The .venv/ folder is also excluded - you'll create your own
- Example data: The examples/P05362.fasta file is included for testing
Dependencies
- biopython ≥ 1.81 - Sequence and alignment handling
- numpy ≥ 2.0 - Numerical computations
- matplotlib ≥ 3.7 - Visualization generation
- pandas ≥ 2.0 - Data manipulation
- scipy ≥ 1.16 - Scientific computing
- Python ≥ 3.10 - Required Python version
Citation
If you use ConSite in your research, please cite:
Joey Wagner, Yang Li. ConSite: conserved-domain alignment and conserved-site visualization from protein FASTA.
License
This project is licensed under the terms specified in the LICENSE file.