genome_entropy
Quantify information content across multiple biological representations derived from genomic sequences.
genome_entropy is a complete bioinformatics pipeline that converts DNA sequences → ORFs → proteins → 3Di structural tokens, computing Shannon entropy at each representation level.
Why genome_entropy?
We refer to this framework as genome-entropy to emphasise its unifying focus on quantifying information content across multiple biological representations derived from the same genomic sequence. Rather than restricting analysis to a single abstraction, such as nucleotide composition or predicted coding regions, genome-entropy integrates DNA sequences, open reading frames, translated proteins, and structure-derived encodings (3Di) within a common information-theoretic framework. The name reflects both the biological scope of the approach—operating at the level of whole genomes and metagenomes—and the central analytical principle, entropy, which provides a consistent and comparable measure of complexity, organisation, and constraint across representations. This design allows direct comparison of informational signatures across molecular layers while remaining extensible to additional encodings as methods and data evolve.
Documentation
📚 Read the full documentation on GitHub Pages
📚 Read the full documentation on Read The Docs
The documentation includes:
- Installation guide
- Quick start tutorial
- Complete CLI reference
- Python API documentation
- User guide with detailed explanations
- Developer guide for contributors
Features
- 🧬 ORF Finding: Extract Open Reading Frames from DNA sequences using customizable genetic codes
- 🔄 Translation: Convert ORFs to protein sequences with support for all NCBI genetic code tables
- 🏗️ 3Di Encoding: Predict structural alphabet tokens directly from sequences using ProstT5
- 📊 Entropy Analysis: Calculate Shannon entropy at DNA, ORF, protein, and 3Di levels
- ⚡ GPU Acceleration: Auto-detect and use CUDA, MPS (Apple Silicon), or CPU
- 🚀 Multi-GPU Support: Parallelize 3Di encoding across multiple GPUs for faster processing
- 🔧 Modular CLI: Run complete pipeline or individual steps
- 📝 Comprehensive Logging: Configurable log levels and output to file or STDOUT
Quick Start
Installation
Recommended
Install with pip:
pip install genome-entropy
For developers
# Clone repository
git clone https://github.com/linsalrob/genome_entropy.git
cd genome_entropy
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install development dependencies (optional)
pip install -e ".[dev]"
Basic Usage
# Run complete pipeline
genome_entropy run --input examples/example_small.fasta --output results.json
# Or run individual steps
genome_entropy orf --input input.fasta --output orfs.json
genome_entropy translate --input orfs.json --output proteins.json
genome_entropy encode3di --input proteins.json --output 3di.json
genome_entropy entropy --input 3di.json --output entropy.json
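The individual steps above can also be chained from a script. The following is a minimal sketch, not part of the package: it only assembles the command lines shown above and runs them with subprocess. The helper names step_commands and run_steps are hypothetical, and it assumes the genome_entropy executable is on PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def step_commands(fasta: str, workdir: str = ".") -> List[List[str]]:
    """Build the four step-by-step commands, each reading the previous output."""
    out = Path(workdir)
    orfs = str(out / "orfs.json")
    proteins = str(out / "proteins.json")
    three_di = str(out / "3di.json")
    entropy = str(out / "entropy.json")
    return [
        ["genome_entropy", "orf", "--input", fasta, "--output", orfs],
        ["genome_entropy", "translate", "--input", orfs, "--output", proteins],
        ["genome_entropy", "encode3di", "--input", proteins, "--output", three_di],
        ["genome_entropy", "entropy", "--input", three_di, "--output", entropy],
    ]


def run_steps(fasta: str, workdir: str = ".") -> None:
    """Run each step in order; check=True stops at the first failing step."""
    for command in step_commands(fasta, workdir):
        subprocess.run(command, check=True)
```

Calling run_steps("input.fasta") would execute the same four commands as the shell example, stopping with an exception if any step exits with a non-zero status.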
Multi-GPU Usage
Speed up 3Di encoding by distributing batches across multiple GPUs:
# Auto-discover and use all available GPUs
genome_entropy run --input input.fasta --output results.json --multi-gpu
# Use specific GPUs
genome_entropy run --input input.fasta --output results.json --multi-gpu --gpu-ids 0,1,2
# Works with SLURM job schedulers (GPUs auto-discovered from SLURM_JOB_GPUS)
srun --gres=gpu:4 genome_entropy run --input input.fasta --output results.json --multi-gpu
# Multi-GPU encoding also works for the encode3di command
genome_entropy encode3di --input proteins.json --output 3di.json --multi-gpu
GPU Discovery Priority:
1. SLURM_JOB_GPUS environment variable (SLURM job allocations)
2. SLURM_GPUS environment variable
3. CUDA_VISIBLE_DEVICES environment variable
4. torch.cuda.device_count() (all available GPUs)
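The priority order can be illustrated with a short sketch. This is not the package's own implementation: the function names are hypothetical, and it assumes each environment variable holds a comma-separated list of integer device indices. Real SLURM and CUDA values can take other forms (counts, UUIDs), which this sketch ignores.

```python
import os
from typing import List, Mapping, Optional


def parse_gpu_list(value: str) -> List[int]:
    """Parse a comma-separated list such as "0,1,2" into integer IDs."""
    return [int(part) for part in value.split(",") if part.strip().isdigit()]


def discover_gpu_ids(env: Optional[Mapping[str, str]] = None) -> List[int]:
    """Return GPU IDs from the first source, in priority order, that yields any."""
    env = os.environ if env is None else env
    for name in ("SLURM_JOB_GPUS", "SLURM_GPUS", "CUDA_VISIBLE_DEVICES"):
        value = env.get(name, "").strip()
        if value:
            ids = parse_gpu_list(value)
            if ids:
                return ids
    # Last resort: ask PyTorch how many CUDA devices it can see.
    try:
        import torch
    except ImportError:
        return []
    return list(range(torch.cuda.device_count()))
```

Passing a plain dictionary instead of os.environ makes the ordering easy to check without touching the real environment.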
See examples/multi_gpu_example.py for more usage examples.
Requirements
Python Dependencies
- Python 3.8 or higher
- PyTorch >= 2.0.0 (GPU support optional)
- Transformers >= 4.30.0 (HuggingFace)
- pygenetic-code >= 0.1.0
- typer >= 0.9.0
External Binary: get_orfs
The ORF finder requires the get_orfs binary from https://github.com/linsalrob/get_orfs
Installation:
# Clone and build get_orfs
git clone https://github.com/linsalrob/get_orfs.git /tmp/get_orfs
cd /tmp/get_orfs
mkdir build && cd build
cmake ..
make
cmake --install . --prefix ..
# Add to PATH or set environment variable
export PATH="/tmp/get_orfs/bin:$PATH"
# Or set GET_ORFS_PATH environment variable
export GET_ORFS_PATH=/tmp/get_orfs/bin/get_orfs
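To check which get_orfs binary would be picked up, a lookup along these lines can help. It is a sketch only: the helper name find_get_orfs is hypothetical, and the assumption that GET_ORFS_PATH takes precedence over PATH is ours, not something the package documents here.

```python
import os
import shutil
from typing import Mapping, Optional


def find_get_orfs(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a path to the get_orfs executable, or None if it cannot be found."""
    env = os.environ if env is None else env
    explicit = env.get("GET_ORFS_PATH")
    # Assumption: an explicit, executable GET_ORFS_PATH wins over PATH.
    if explicit and os.path.isfile(explicit) and os.access(explicit, os.X_OK):
        return explicit
    # Otherwise search PATH (shutil.which uses os.environ when path is None).
    return shutil.which("get_orfs", path=env.get("PATH"))
```

A return value of None corresponds to the "get_orfs binary not found" situation.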
CLI Commands
genome_entropy run - Complete Pipeline
Run all steps from DNA to 3Di with entropy calculation:
genome_entropy run \
--input input.fasta \
--output results.json \
--table 11 \
--min-aa 30 \
--model Rostlab/ProstT5_fp16 \
--device auto
Options:
- --input, -i: Input FASTA file (required)
- --output, -o: Output JSON file (required)
- --table, -t: NCBI genetic code table ID (default: 11)
- --min-aa: Minimum protein length in amino acids (default: 30)
- --model, -m: ProstT5 model name (default: Rostlab/ProstT5_fp16)
- --device, -d: Device for inference (auto/cuda/mps/cpu)
- --skip-entropy: Skip entropy calculation
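The auto setting for --device can be pictured as a fallback chain from CUDA to MPS to CPU. The sketch below shows one plausible way to implement such a chain with PyTorch. The function name pick_device is hypothetical, and the exact order the package uses is an assumption.

```python
def pick_device(requested: str = "auto") -> str:
    """Resolve a device name; "auto" tries CUDA, then MPS, then falls back to CPU."""
    if requested != "auto":
        return requested
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to probe, so report CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is absent in older PyTorch builds, hence getattr.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```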
genome_entropy orf - Find ORFs
Extract Open Reading Frames from DNA sequences:
genome_entropy orf --input input.fasta --output orfs.json --table 11 --min-nt 90
genome_entropy translate - Translate ORFs
Translate ORFs to protein sequences:
genome_entropy translate --input orfs.json --output proteins.json --table 11
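For readers unfamiliar with codon tables, the translation step amounts to a codon-by-codon lookup. The sketch below builds the lookup from the NCBI compact amino-acid string for table 1. Table 11 assigns the same amino acids and differs only in its permitted start codons. This is an illustration, not the pygenetic-code API that the package uses, and mapping unknown codons to "X" is an assumption.

```python
BASES = "TCAG"  # NCBI base order for the compact table representation
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 codons to its amino acid ("*" marks a stop codon).
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(orf_nt: str) -> str:
    """Translate a nucleotide ORF in frame 0; ambiguous codons become "X"."""
    seq = orf_nt.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(seq[pos:pos + 3], "X") for pos in range(0, usable, 3)
    )
```

For example, translate("ATGGCCTAA") returns "MA*": methionine, alanine, then a stop.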
genome_entropy encode3di - Encode to 3Di
Convert proteins to 3Di structural tokens using ProstT5:
genome_entropy encode3di \
--input proteins.json \
--output 3di.json \
--model Rostlab/ProstT5_fp16 \
--device auto \
--batch-size 4
genome_entropy entropy - Calculate Entropy
Compute Shannon entropy at all representation levels:
genome_entropy entropy --input 3di.json --output entropy.json --normalize
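Shannon entropy of a sequence is H = -Σ p_i log2(p_i), where p_i is the observed frequency of symbol i. The sketch below computes it in bits and shows one plausible meaning of --normalize, namely dividing by log2 of the alphabet size so that values fall between 0 and 1. Both the log base and this normalization are assumptions rather than the package's documented behaviour.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy in bits of the symbol frequencies in a sequence."""
    if not sequence:
        return 0.0
    total = len(sequence)
    counts = Counter(sequence)
    # p * log2(1/p) is the same as -p * log2(p) and avoids returning -0.0.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def normalized_entropy(sequence: str, alphabet_size: int) -> float:
    """Entropy divided by its maximum, log2(alphabet_size) (assumed normalization)."""
    if alphabet_size < 2:
        return 0.0
    return shannon_entropy(sequence) / math.log2(alphabet_size)
```

Under this definition a DNA sequence with equal amounts of A, C, G, and T reaches the maximum of 2 bits, and a homopolymer scores 0.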
genome_entropy download - Pre-download Models
Pre-download ProstT5 models to cache:
genome_entropy download --model Rostlab/ProstT5_fp16
Logging
All genome_entropy commands support comprehensive logging with configurable output and verbosity.
Global Logging Options
Every command accepts these logging options:
genome_entropy [OPTIONS] COMMAND [ARGS]
Global Options:
--log-level, -l TEXT Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) [default: INFO]
--log-file PATH Path to log file (default: log to STDOUT)
Usage Examples
Default logging (INFO level to STDOUT):
genome_entropy run --input data.fasta --output results.json
Debug logging to see detailed progress:
genome_entropy --log-level DEBUG run --input data.fasta --output results.json
Log to a file:
genome_entropy --log-file pipeline.log run --input data.fasta --output results.json
Debug logging to file:
genome_entropy --log-level DEBUG --log-file debug.log run --input data.fasta --output results.json
Quiet mode (only warnings and errors):
genome_entropy --log-level WARNING run --input data.fasta --output results.json
Log Levels
- DEBUG: Detailed information for diagnosing problems (sequence lengths, batch info, etc.)
- INFO: General informational messages (default - shows major steps and progress)
- WARNING: Warning messages for unusual conditions
- ERROR: Error messages for failures
- CRITICAL: Critical errors that may cause the program to abort
What Gets Logged
The logging system tracks:
- File I/O: Reading/writing FASTA and JSON files with sequence counts
- ORF Finding: Number of ORFs found, binary checks, parsing progress
- Translation: Translation progress, codon handling, error details
- 3Di Encoding: Model loading, batch processing, memory usage, timing estimates
- Entropy Calculation: Entropy values at each representation level
- Pipeline Progress: Step-by-step progress through the complete pipeline
Example log output (INFO level):
2026-01-19 10:30:15 - genome_entropy.io.fasta - INFO - Reading FASTA file: input.fasta
2026-01-19 10:30:15 - genome_entropy.io.fasta - INFO - Successfully read 5 sequence(s) from input.fasta
2026-01-19 10:30:15 - genome_entropy.orf.finder - INFO - Starting ORF finding for 5 sequence(s) (table=11, min_length=90)
2026-01-19 10:30:16 - genome_entropy.orf.finder - INFO - Found 47 ORF(s) in 5 sequence(s)
2026-01-19 10:30:16 - genome_entropy.translate.translator - INFO - Translating 47 ORF(s) with table 11
2026-01-19 10:30:16 - genome_entropy.encode3di.encoder - INFO - Loading ProstT5 model: Rostlab/ProstT5_fp16
2026-01-19 10:30:20 - genome_entropy.encode3di.encoder - INFO - Loaded model Rostlab/ProstT5_fp16 on device cuda
2026-01-19 10:30:20 - genome_entropy.encode3di.encoding - INFO - 3Di encoding batch 1 of 12 batches...
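The layout of these lines (timestamp, logger name, level, message) matches a standard Python logging format string. The sketch below reproduces that layout with the standard library. The function name configure_logging is hypothetical, and this is not necessarily how the package sets up its loggers.

```python
import logging
import sys
from typing import Optional

LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def configure_logging(level: str = "INFO", log_file: Optional[str] = None) -> logging.Logger:
    """Send genome_entropy.* log records to a file if given, otherwise to STDOUT."""
    if log_file:
        handler: logging.Handler = logging.FileHandler(log_file)
    else:
        handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))

    logger = logging.getLogger("genome_entropy")
    logger.handlers.clear()          # avoid duplicate lines on repeated calls
    logger.addHandler(handler)
    logger.setLevel(level.upper())   # setLevel accepts level names such as "DEBUG"
    logger.propagate = False
    return logger
```

Child loggers such as genome_entropy.io.fasta inherit the level and handler, which is why a single setting controls output from every module.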
Data Flow
DNA FASTA → ORF Finder → ORFs (nucleotides)
↓
Translator → Proteins (amino acids)
↓
ProstT5 → 3Di tokens (structural alphabet)
↓
Shannon Entropy → Entropy Report
Genetic Code Tables
The pipeline supports all NCBI genetic code tables. Common ones:
- Table 1: Standard genetic code
- Table 11: Bacterial, archaeal, and plant plastid code (default)
- Table 4: Mold, protozoan, and coelenterate mitochondrial code
See full list at: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
Output Format
Results are saved as JSON with the following structure:
[
{
"input_id": "seq1",
"input_dna_length": 1000,
"orfs": [...],
"proteins": [...],
"three_dis": [...],
"entropy": {
"dna_entropy_global": 2.5,
"orf_nt_entropy": {"orf1": 1.8},
"protein_aa_entropy": {"orf1": 3.2},
"three_di_entropy": {"orf1": 2.9},
"alphabet_sizes": {"dna": 4, "protein": 20, "three_di": 20}
}
}
]
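Because the output is plain JSON, it can be post-processed with the standard library. The sketch below condenses each record into per-sequence mean entropies, using only the field names shown above. The helper names and the sample values in the usage note are illustrative, and treating the entropy block as optional (for example after --skip-entropy) is an assumption.

```python
import json
from statistics import mean
from typing import Any, Dict, List, Optional


def _mean_or_none(values: Dict[str, float]) -> Optional[float]:
    """Mean of the per-ORF values, or None when there are no ORFs."""
    return mean(values.values()) if values else None


def summarize(records: List[Dict[str, Any]]) -> Dict[str, Dict[str, Optional[float]]]:
    """Map each input_id to its global DNA entropy and mean per-ORF entropies."""
    summary = {}
    for record in records:
        entropy = record.get("entropy") or {}  # may be absent if entropy was skipped
        summary[record["input_id"]] = {
            "dna": entropy.get("dna_entropy_global"),
            "mean_orf_nt": _mean_or_none(entropy.get("orf_nt_entropy", {})),
            "mean_protein": _mean_or_none(entropy.get("protein_aa_entropy", {})),
            "mean_three_di": _mean_or_none(entropy.get("three_di_entropy", {})),
        }
    return summary


def summarize_file(path: str) -> Dict[str, Dict[str, Optional[float]]]:
    """Load a results file written by the pipeline and summarize it."""
    with open(path, encoding="utf-8") as handle:
        return summarize(json.load(handle))
```

For instance, summarize_file("results.json") would return one entry per input sequence, keyed by input_id.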
Development
Running Tests
# Run unit tests
pytest
# Run with coverage
pytest --cov=genome_entropy
# Skip integration tests (default)
pytest -k "not integration"
# Run integration tests (downloads models, slow)
RUN_INTEGRATION=1 pytest -v -m integration
Code Quality
# Format code
black src/ tests/
# Lint
ruff check src/ tests/
# Type check
mypy src/genome_entropy/
Project Structure
genome_entropy/
├── src/genome_entropy/
│ ├── io/ # FASTA and JSON I/O
│ ├── orf/ # ORF finding and types
│ ├── translate/ # Protein translation
│ ├── encode3di/ # 3Di encoding (ProstT5)
│ ├── entropy/ # Shannon entropy calculation
│ ├── pipeline/ # End-to-end orchestration
│ └── cli/ # Command-line interface
├── tests/ # Unit and integration tests
└── examples/ # Example data and scripts
Citation
If you use this software, please cite:
- ProstT5: Heinzinger et al. (2023), "ProstT5: Bilingual Language Model for Protein Sequence and Structure"
- get_orfs: https://github.com/linsalrob/get_orfs
- pygenetic-code: https://github.com/linsalrob/genetic_codes
License
MIT License - see LICENSE file for details.
Author
Rob Edwards (@linsalrob)
Email: raedwards@gmail.com
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
Troubleshooting
Common Issues
ModuleNotFoundError: No module named 'genome_entropy'
- Run pip install -e . from the repository root
get_orfs binary not found
- Install get_orfs and add to PATH or set GET_ORFS_PATH environment variable
CUDA out of memory
- Use the CPU with --device cpu, or reduce the batch size with --batch-size 1
Model download fails
- Check internet connection
- Verify HuggingFace cache permissions (~/.cache/huggingface/)
Integration tests run unexpectedly
- Use pytest -k "not integration" to skip them
Acknowledgments
- ProstT5 model by Rostlab
- get_orfs by Rob Edwards
- genetic_codes by Rob Edwards