Fast gene prediction for prokaryotic genomes (Rust reimplementation of Prodigal)
Rustygal
Prokaryotic Dynamic Programming Genefinding Algorithm
A high-performance Rust reimplementation of Prodigal, the widely-used prokaryotic gene prediction tool.
Overview
Rustygal is a fast, memory-safe reimplementation of Prodigal v2.6.3 for identifying protein-coding genes in bacterial and archaeal genomes. It reproduces the output of the original C version exactly while running about twice as fast on the E. coli K-12 benchmark.
Key Features
- 🚀 1.96× faster than the original C implementation (3.3 s vs 6.4 s on E. coli K-12)
- 🎯 Identical output to Prodigal v2.6.3 (all 4,319 predicted E. coli K-12 genes match)
- 🔒 Memory-safe - leverages Rust's ownership system to rule out the memory errors behind most segfaults (buffer overflows, use-after-free)
- ⚡ Optimized - 3-bit nucleotide encoding, pre-computed translation tables, and specialized connection scoring
- 🧬 Complete - implements the full Prodigal algorithm with all features
- 📦 Easy to build - standard Rust cargo workflow
Performance
Benchmark on E. coli K-12 MG1655 (4.6 Mbp, 4,319 genes)
| Implementation | Time | Speedup | Accuracy |
|---|---|---|---|
| C Prodigal v2.6.3 | 6.4s | - | 100% |
| Rustygal v0.1.0 | 3.3s | 1.96× faster ⚡ | 100% ✓ |
Phase 1 Optimizations
Rustygal achieves its performance through three key optimizations:
- 3-bit nucleotide encoding with XOR-based complement (eliminates rseq array)
- Pre-computed translation tables (2,176-byte lookup tables for O(1) translation)
- Specialized connection scoring functions (eliminates redundant branch checks)
See OPTIMIZATION_RESULTS.md for detailed analysis.
Installation
From Source
# Clone the repository
git clone https://github.com/yourusername/rustygal.git
cd rustygal
# Build release version
cargo build --release
# Binary will be in target/release/prodigal
Requirements
- Rust 1.70 or later
- Cargo (included with Rust)
Usage
Command-Line Interface
Rustygal supports all Prodigal command-line options and file formats.
Basic usage
# Single genome mode (with training); -f gff is needed because the default output format is gbk
./target/release/prodigal -i genome.fna -f gff -o genes.gff
# Use existing training file
./target/release/prodigal -i genome.fna -t training.trn -f gff -o genes.gff
# Metagenomic mode
./target/release/prodigal -i metagenome.fna -p meta -f gff -o genes.gff
# Write protein translations
./target/release/prodigal -i genome.fna -a proteins.faa -f gff -o genes.gff
# Specify output format (gff, gbk, sco)
./target/release/prodigal -i genome.fna -f gff -o genes.gff
Command-line options
Usage: prodigal [-a trans_file] [-c] [-d nuc_file] [-f output_type]
[-g tr_table] [-h] [-i input_file] [-m] [-n] [-o output_file]
[-p mode] [-q] [-s start_file] [-t training_file] [-v]
-a: Write protein translations to the selected file.
-c: Closed ends. Do not allow genes to run off edges.
-d: Write nucleotide sequences of genes to the selected file.
-f: Select output format (gbk, gff, or sco). Default is gbk.
-g: Specify a translation table to use (default 11).
-h: Print help menu and exit.
-i: Specify FASTA/Genbank input file (default reads from stdin).
-m: Treat runs of N as masked sequence; don't build genes across them.
-n: Bypass Shine-Dalgarno trainer and force a full motif scan.
-o: Specify output file (default writes to stdout).
-p: Select procedure (single or meta). Default is single.
-q: Run quietly (suppress normal stderr output).
-s: Write all potential genes (with scores) to the selected file.
-t: Write a training file (if none exists); otherwise, read and use
the specified training file.
-v: Print version number and exit.
Examples
# Train on a genome and save the training file
./target/release/prodigal -i ecoli.fna -t ecoli.trn -o genes.gff
# Analyze multiple contigs using the same training
./target/release/prodigal -i contigs.fna -t ecoli.trn -f gff -o genes.gff
# Metagenomic analysis
./target/release/prodigal -i metagenome.fna -p meta -f gff -o genes.gff
# Get protein and nucleotide sequences
./target/release/prodigal -i genome.fna -a proteins.faa -d genes.fna -f gff -o genes.gff
Compatibility
Rustygal produces identical output to Prodigal v2.6.3 for:
- Gene predictions (start/stop coordinates)
- Scores and confidence values
- RBS motif detection
- GC content calculations
- Training file format
Validation:
- ✅ E. coli K-12 MG1655: All 4,319 genes match exactly
- ✅ 100% accuracy verified programmatically
- ✅ All 59 unit tests pass
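As a rough illustration of what a programmatic output comparison can look like, the sketch below compares the gene coordinates in two GFF outputs. It is not the project's own validation script. The names `gene_keys` and `compare` are hypothetical, and the sketch assumes predicted genes appear as tab-separated GFF lines with feature type `CDS`.

```rust
use std::collections::BTreeSet;

/// (sequence id, start, end, strand) identifies one predicted gene.
type GeneKey = (String, u64, u64, char);

/// Collect gene coordinates from GFF text.
/// Assumption: genes are tab-separated lines whose third column is "CDS".
fn gene_keys(gff: &str) -> BTreeSet<GeneKey> {
    gff.lines()
        .filter(|l| !l.starts_with('#') && !l.trim().is_empty())
        .filter_map(|l| -> Option<GeneKey> {
            let f: Vec<&str> = l.split('\t').collect();
            if f.len() < 8 || f[2] != "CDS" {
                return None;
            }
            let start: u64 = f[3].parse().ok()?;
            let end: u64 = f[4].parse().ok()?;
            let strand: char = f[6].chars().next()?;
            Some((f[0].to_string(), start, end, strand))
        })
        .collect()
}

/// Returns (shared, only_in_a, only_in_b) gene counts.
fn compare(a: &str, b: &str) -> (usize, usize, usize) {
    let (ka, kb) = (gene_keys(a), gene_keys(b));
    let shared = ka.intersection(&kb).count();
    (shared, ka.len() - shared, kb.len() - shared)
}
```

Two runs agree on coordinates when `compare` returns zero for both "only in" counts. Scores and attributes are deliberately ignored here; a stricter check would compare whole lines.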
Algorithm Details
Rustygal implements the complete Prodigal algorithm:
- Training phase - analyzes genome composition to learn organism-specific features:
  - GC content and codon usage
  - Shine-Dalgarno motif patterns
  - Start codon preferences (ATG, GTG, TTG)
  - Translation table selection
- Gene finding phase - uses dynamic programming to identify genes:
  - Builds a directed acyclic graph of potential genes
  - Scores genes based on coding potential and regulatory signals
  - Finds the optimal gene set via Viterbi-like algorithm
- Metagenomic mode - uses pre-trained parameters for mixed communities
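To make the dynamic programming step more concrete, here is a heavily simplified sketch: given scored candidate genes, it picks the highest-scoring chain of non-overlapping candidates. It is not Prodigal's or Rustygal's actual algorithm, which also handles both strands, permitted overlaps, and connection scores between nodes. The `Candidate` type and `best_gene_set` function are hypothetical names for this example.

```rust
/// A candidate gene with inclusive coordinates and a precomputed score.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    start: usize,
    end: usize,
    score: f64,
}

/// Pick the chain of non-overlapping candidates with the highest total score.
/// O(n^2) for clarity; real gene finders limit how far back they look.
fn best_gene_set(cands: &[Candidate]) -> Vec<Candidate> {
    let mut sorted = cands.to_vec();
    sorted.sort_by_key(|c| c.end);
    let n = sorted.len();
    // best[i]: best total score of a chain ending at candidate i.
    let mut best = vec![0.0f64; n];
    // prev[i]: predecessor of i on that chain, for traceback.
    let mut prev: Vec<Option<usize>> = vec![None; n];
    for i in 0..n {
        best[i] = sorted[i].score;
        for j in 0..i {
            let compatible = sorted[j].end < sorted[i].start;
            if compatible && best[j] + sorted[i].score > best[i] {
                best[i] = best[j] + sorted[i].score;
                prev[i] = Some(j);
            }
        }
    }
    // Trace back from the best-scoring end point.
    let mut cur = (0..n).max_by(|&a, &b| best[a].total_cmp(&best[b]));
    let mut path = Vec::new();
    while let Some(i) = cur {
        path.push(sorted[i].clone());
        cur = prev[i];
    }
    path.reverse();
    path
}
```

The traceback through `prev` is what makes this Viterbi-like: each node remembers only its best predecessor, so the optimal set is recovered without enumerating all combinations.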
For technical details, see the original Prodigal paper:
Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010;11:119.
Differences from C Prodigal
Improvements
- 1.96× faster - optimized implementation with specialized functions
- Memory safety - no buffer overflows, use-after-free, or null pointer dereferences
- Better error handling - descriptive error messages instead of segfaults
- Modern tooling - integrated with Rust ecosystem (cargo, docs.rs)
- Parallel processing - multi-threaded for multi-FASTA files (via rayon)
Compatibility notes
- Training files are binary compatible with C Prodigal
- Output files are identical (verified via systematic comparison)
- Command-line interface is 100% compatible
- Can be used as a drop-in replacement
Optimization Details
Rustygal includes three Phase 1 optimizations:
1. Sequence Processing (3-bit encoding)
Original C: 9 function calls + 6 bitmap accesses per codon
Rustygal: O(1) lookup with 3-bit encoding (A=000, G=001, C=010, T=011)
- XOR-based complement: nucleotide ^ 0b011
- Pre-computed stop/start codon lookup tables
- Eliminates rseq array allocation
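The encoding and XOR complement described here can be sketched as follows. This is an illustration rather than Rustygal's source: the helper names are hypothetical, and using code 4 (the third bit) for N and other ambiguous bases is an assumption of this sketch.

```rust
// Nucleotide codes as listed above.
const A: u8 = 0b000;
const G: u8 = 0b001;
const C: u8 = 0b010;
const T: u8 = 0b011;
// Assumption of this sketch: the third bit marks N / ambiguous bases.
const N: u8 = 0b100;

/// Encode one ASCII base into its 3-bit code.
fn encode(base: u8) -> u8 {
    match base.to_ascii_uppercase() {
        b'A' => A,
        b'G' => G,
        b'C' => C,
        b'T' => T,
        _ => N,
    }
}

/// Encode a whole sequence.
fn encode_seq(s: &str) -> Vec<u8> {
    s.bytes().map(encode).collect()
}

/// A<->T and G<->C differ in both low bits, so XOR with 0b011 swaps them.
/// Ambiguous bases are left unchanged.
fn complement(code: u8) -> u8 {
    if code <= T { code ^ 0b011 } else { code }
}

/// Read position `i` of the reverse strand straight from the forward array,
/// so no separate reverse-complement copy has to be allocated.
fn reverse_strand_at(seq: &[u8], i: usize) -> u8 {
    complement(seq[seq.len() - 1 - i])
}
```

Because the reverse strand is computed on demand from the forward array, a second array holding the reverse complement is unnecessary, which matches the point about dropping the rseq allocation.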
2. Translation Tables
Original C: 64-way if-else branching
Rustygal: 2,176-byte pre-computed lookup tables
- 34 genetic codes × 64 codons
- Index formula: (x0 << 4) + (x1 << 2) + x2
- O(1) amino acid lookup
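A minimal sketch of one such 64-entry table, built for translation table 11 and indexed with the formula above. It assumes the nucleotide codes A=0, G=1, C=2, T=3 from the encoding section. The names `build_table` and `translate` are hypothetical, and deriving the table from the NCBI amino-acid string is a choice made for this example, not necessarily how Rustygal stores its tables.

```rust
/// Amino acids for NCBI translation table 11, in NCBI's T, C, A, G codon order.
const NCBI_TABLE_11: &[u8; 64] =
    b"FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/// Maps this document's codes (A=0, G=1, C=2, T=3)
/// to NCBI's ordering (T=0, C=1, A=2, G=3).
const TO_NCBI: [usize; 4] = [2, 3, 1, 0];

/// Build a 64-byte table indexed by (x0 << 4) + (x1 << 2) + x2.
fn build_table() -> [u8; 64] {
    let mut table = [b'X'; 64];
    for x0 in 0..4usize {
        for x1 in 0..4usize {
            for x2 in 0..4usize {
                let idx = (x0 << 4) + (x1 << 2) + x2;
                let ncbi = (TO_NCBI[x0] << 4) + (TO_NCBI[x1] << 2) + TO_NCBI[x2];
                table[idx] = NCBI_TABLE_11[ncbi];
            }
        }
    }
    table
}

/// O(1) codon translation; each code must be in 0..=3.
fn translate(table: &[u8; 64], x0: u8, x1: u8, x2: u8) -> char {
    let idx = ((x0 as usize) << 4) + ((x1 as usize) << 2) + (x2 as usize);
    table[idx] as char
}
```

With one table per genetic code, 34 codes × 64 bytes gives the 2,176 bytes quoted above. Ambiguous bases would need a separate check before indexing, since the table only covers the four unambiguous codes.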
3. Connection Scoring
Original C: Generic function with repeated strand/type checks
Rustygal: 4 specialized functions
- score_connection_forward_start()
- score_connection_forward_stop()
- score_connection_backward_start()
- score_connection_backward_stop()
- Eliminates redundant checks in inner loop
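The general idea of hoisting the strand/type check out of the inner loop can be sketched like this. Only the four function names come from the list above; their signatures, the `Node` type, and the scoring formulas are placeholders invented for this example and do not reflect Prodigal's real scoring.

```rust
#[derive(Clone, Copy)]
enum Strand { Forward, Backward }

#[derive(Clone, Copy)]
enum Kind { Start, Stop }

/// Hypothetical node: a start or stop codon at a position on one strand.
#[derive(Clone, Copy)]
struct Node { pos: i64, strand: Strand, kind: Kind }

type ScoreFn = fn(&Node, &Node) -> f64;

fn gap(prev: &Node, cur: &Node) -> f64 {
    (cur.pos - prev.pos) as f64
}

// Placeholder bodies: each variant would hold only the logic for its case.
fn score_connection_forward_start(prev: &Node, cur: &Node) -> f64 { -0.001 * gap(prev, cur) }
fn score_connection_forward_stop(prev: &Node, cur: &Node) -> f64 { 1.0 - 0.001 * gap(prev, cur) }
fn score_connection_backward_start(prev: &Node, cur: &Node) -> f64 { 1.0 - 0.002 * gap(prev, cur) }
fn score_connection_backward_stop(prev: &Node, cur: &Node) -> f64 { -0.002 * gap(prev, cur) }

/// The strand/type check happens once per node...
fn select_scorer(cur: &Node) -> ScoreFn {
    match (cur.strand, cur.kind) {
        (Strand::Forward, Kind::Start) => score_connection_forward_start,
        (Strand::Forward, Kind::Stop) => score_connection_forward_stop,
        (Strand::Backward, Kind::Start) => score_connection_backward_start,
        (Strand::Backward, Kind::Stop) => score_connection_backward_stop,
    }
}

/// ...so the inner loop over predecessors contains no branching on node type.
fn best_connection(nodes: &[Node], i: usize) -> Option<(usize, f64)> {
    let cur = &nodes[i];
    let score = select_scorer(cur);
    nodes[..i]
        .iter()
        .enumerate()
        .map(|(j, prev)| (j, score(prev, cur)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```

A function pointer is used here to keep the sketch short; calling the specialized function directly from four separate loops would let the compiler inline it as well.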
Result: 38% faster than the unoptimized Rust version and 1.96× faster than C Prodigal (3.3 s vs 6.4 s on E. coli K-12)
Building from Source
# Clone and build
git clone https://github.com/yourusername/rustygal.git
cd rustygal
cargo build --release
# Run tests
cargo test
# Run on test genome
./target/release/prodigal -i ../test/MG1655.fna -f gff -o /tmp/test.gff -q
Testing
# Run all unit tests (59 tests)
cargo test
# Run with output
cargo test -- --nocapture
# Test specific module
cargo test sequence::tests
License
Rustygal is licensed under the GNU General Public License v3.0 or later, the same license as the original Prodigal.
This ensures that improvements to the algorithm remain open source and available to the scientific community.
See LICENSE for the full license text.
Authors
- Sunju Kim - Rust reimplementation
- Doug Hyatt - Original Prodigal C implementation
Acknowledgments
- Original Prodigal by Doug Hyatt, Oak Ridge National Laboratory
- University of Tennessee / UT-Battelle
- Rust community for excellent tooling and libraries
Citation
If you use Rustygal in your research, please cite both:
Original Prodigal:
Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ.
Prodigal: prokaryotic gene recognition and translation initiation site identification.
BMC Bioinformatics. 2010 Mar 8;11:119. doi: 10.1186/1471-2105-11-119.
Rustygal:
Kim S. Rustygal: A high-performance Rust reimplementation of Prodigal.
Version 0.1.0. 2026. https://github.com/yourusername/rustygal
Links
- Original Prodigal: https://github.com/hyattpd/Prodigal
- Issues: https://github.com/yourusername/rustygal/issues
Version History
v0.1.0 (2026-04-16)
Performance:
- ⚡ 1.96× faster than C Prodigal (3.3 s vs 6.4 s on E. coli K-12)
- Complete reimplementation of Prodigal v2.6.3
- 100% output compatibility verified
Phase 1 Optimizations:
- 3-bit nucleotide encoding with XOR complement
- Pre-computed translation tables (34 tables × 64 codons)
- Specialized connection scoring functions (4 functions)
- SIMD experiments attempted and removed (18-20% slowdown)
Validation:
- ✓ 100% accuracy: All 4,319 E. coli genes match C Prodigal
- ✓ All 59 unit tests pass
- ✓ Systematic correctness verification
Features:
- Memory-safe Rust implementation
- Parallel processing support (rayon)
- All core Prodigal features implemented
- Binary compatible training files
Documentation:
- OPTIMIZATION_RESULTS.md: Detailed analysis and benchmarks
- PYRODIGAL_OPTIMIZATIONS.md: Deep dive into optimization techniques
- Comprehensive README with examples