Unsupervised Gene Discovery via Evo 2 & SAE Interpretability
Unsupervised Biological Significance Mapping via |
PlatyGeno identifies genomic landmarks directly from raw sequence data. By leveraging the Evo 2 foundation model, it isolates biologically significant DNA structures (promoters, coding sequences, precise motifs) based purely on AI confidence—without requiring labels, databases, or BLAST.
🧪 Technical Validation (IBD-MDB)
PlatyGeno v1.0.4 has been validated using the clinical IBD Metagenomic Database dataset. For complete statistical data and methodology, see the PlatyGeno Technical Audit and the PlatyGeno Technical Supplemental.
Summary of Results:
- Novel Genomic Landmarks: Identified Feature 7393, a 101 bp sequence with no prior database matches and a high-confidence structural model (AlphaFold2 best prediction; pLDDT ≈ 80).
  Figure: Representative best structural prediction (AlphaFold2) of the novel Feature 7393 discovered autonomously by PlatyGeno.
- Statistical Correlation: Verified a Pearson correlation of r = 0.84 (p < 10⁻⁵⁰) between sequence length and match significance.
- Resolution Gain: Consensus assembly provided a 10³⁸-fold increase in E-value confidence over isolated 60 bp fragments.
- Taxonomic Profile: 72% of high-activation discoveries successfully cross-validated with target gut microbiota.
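As a rough illustration of the correlation statistic reported above, the sketch below computes a Pearson r between sequence length and match significance. Expressing significance as the negative base-10 logarithm of the E-value is an assumption made for this example, and every number in it is invented; none of it comes from the IBD-MDB run.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values only (hypothetical, not taken from the validation run).
lengths_bp = [60, 75, 101, 150, 220]
e_values = [1e-3, 1e-6, 1e-12, 1e-20, 1e-38]

# Assumed significance scale: smaller E-value -> larger score.
significance = [-math.log10(e) for e in e_values]

r = pearson_r(lengths_bp, significance)
print(f"r = {r:.2f}")
```

Working on a log scale keeps E-values that span dozens of orders of magnitude from being dominated by a single extreme point.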
🏗️ Technical Foundation
PlatyGeno operates as a Reference-Free Microscope, detecting the "Signal" of life directly from genomic grammar.
🔭 The Discovery Core
- AI-Native Interpretation: We use a Sparse Autoencoder (SAE) to translate the complex DNA "grammar" understood by Evo 2 into 32,768 human-interpretable biological concepts (e.g., promoters, viral motifs).
- Peak Pinpointing (Layer 26): The engine intercepts signals at Layer 26 to identify the exact coordinate where a biological feature fires with the highest intensity.
- Dual-Mode Discovery: Preserves both narrow Precision Snippets (isolated, high-interest DNA clips) and Consensus Assemblies (overlapping sequences from multiple reads of the same feature, pieced together).
[!IMPORTANT] Performance Highlight: While both modes are preserved in discovery, validation benchmarks confirm that Consensus Assembly yields statistically superior significance (E-values) and cleaner taxonomic resolution.
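The peak-pinpointing and dual-mode ideas above can be pictured with a small, dependency-free sketch. It is not PlatyGeno's implementation: `peak_snippet` and `merge_reads` are hypothetical helpers, the activation values are invented, and the greedy suffix-prefix merge merely stands in for whatever assembly strategy the engine actually uses.

```python
def peak_snippet(sequence, activations, flank=30):
    """Return (peak_index, snippet) around the position with the highest activation."""
    if len(sequence) != len(activations) or not sequence:
        raise ValueError("need one activation value per base")
    peak = max(range(len(activations)), key=activations.__getitem__)
    start = max(0, peak - flank)
    end = min(len(sequence), peak + flank + 1)
    return peak, sequence[start:end]


def overlap_len(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(reads, min_overlap=4):
    """Greedily extend the first read to the right with overlapping reads."""
    consensus, remaining = reads[0], list(reads[1:])
    extended = True
    while extended and remaining:
        extended = False
        # Pick the read whose prefix overlaps the current consensus the most.
        best = max(remaining, key=lambda r: overlap_len(consensus, r, min_overlap))
        size = overlap_len(consensus, best, min_overlap)
        if size:
            consensus += best[size:]
            remaining.remove(best)
            extended = True
    return consensus


# Toy inputs (hypothetical).
seq = "ACGTTGACCTGA"
acts = [0.1, 0.2, 0.1, 0.4, 2.5, 9.1, 3.0, 0.3, 0.2, 0.1, 0.0, 0.1]
peak, snippet = peak_snippet(seq, acts, flank=2)
print(peak, snippet)

reads = ["ACGTTGAC", "TGACCTGA", "CTGAAGGT"]
print(merge_reads(reads))
```

A snippet keeps only the bases around one peak, while the merged consensus spans several reads, which is one intuition for why longer assemblies can reach stronger database matches.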
👉 For a full hierarchical deep-dive into the methodology and validation trail, see Technical Architecture.
⚙️ Setup & Installation
⚙️ Installation & Quick Start
PlatyGeno requires a CUDA-enabled GPU (RTX 3090, 4090, A100, or H100).
# 1. Install the core package
pip install platygeno
# 2. Install high-performance GPU kernels (Mandatory for speed)
pip install ninja # for faster installation of flash-attn
pip install flash-attn --no-build-isolation
# 3. Verify & Run Discovery (on the validation sample)
platygeno --input data/sample.fastq --limit 5000 --threshold 5.0
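Before running discovery, it can help to confirm that the pieces installed above are visible to Python. The following is a minimal pre-flight sketch using only the standard library; `preflight` is a hypothetical helper, and finding `nvidia-smi` on the PATH is only a proxy for a usable CUDA GPU, not a guarantee.

```python
import importlib.util
import shutil


def preflight():
    """Report whether the tooling the install steps rely on can be found."""
    return {
        # nvidia-smi ships with the NVIDIA driver; its absence usually means no usable GPU.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # find_spec locates a top-level package without importing it.
        "platygeno_installed": importlib.util.find_spec("platygeno") is not None,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
    }


for check, ok in preflight().items():
    print(f"{check}: {'ok' if ok else 'missing'}")
```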
🚀 Quick Start for GitHub Clones
# 1. Clone & Enter
git clone https://github.com/khoatran1995/PlatyGeno.git
cd PlatyGeno
# 2. Install High-Performance Kernels & editable package
pip install flash-attn --no-build-isolation
pip install -e .
# 3. Trigger Discovery
platygeno --input data/sample.fastq --limit 5000
📚 Documentation
API Reference: Details on Evo 2 integrations and technical Python parameters.
🚀 Usage & API Reference
🚀 Advanced Python Discovery
Researchers can integrate the engine into custom discovery pipelines:
import platygeno
# Advanced Discovery: Tuning parameters for clinical audits
results = platygeno.discover_genes(
    input_path="data/sample.fastq",
    scan_end=5000,
    min_activation=8.0,  # High-confidence threshold
    batch_size=32,       # GPU-optimized batching
)
# View discovered biological features
print(results[['feature_id', 'feature_name', 'activation', 'sequence']])
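Since a high activation is a pointer for follow-up rather than proof of function, a natural next step is to export discovered sequences for external checks. The sketch below assumes `results` behaves like a pandas DataFrame (the column-list indexing above suggests this, but it is not confirmed here), in which case rows could be obtained with `results.to_dict("records")`. `to_fasta` is a hypothetical helper and the records are invented.

```python
def to_fasta(records, line_width=60):
    """Format records with 'feature_id', 'activation', 'sequence' keys as FASTA text."""
    lines = []
    for rec in records:
        lines.append(f">feature_{rec['feature_id']} activation={rec['activation']:.2f}")
        seq = rec["sequence"]
        # Wrap the sequence at a fixed width, as is conventional for FASTA.
        lines.extend(seq[i:i + line_width] for i in range(0, len(seq), line_width))
    return "\n".join(lines) + "\n"


# Toy records (hypothetical); with real output, rows might come from
# results.to_dict("records") if `results` is a pandas DataFrame.
records = [
    {"feature_id": 101, "activation": 9.1, "sequence": "ACGT" * 20},
    {"feature_id": 12, "activation": 8.4, "sequence": "GGCATC"},
]
fasta_text = to_fasta(records)
print(fasta_text)
```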
platygeno.discover_genes() Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| `input_path` | `str` | Required | Path to sequence file. |
| `min_activation` | `float` | `5.0` | Minimum signal strength. |
| `rel_freq_max` | `float` | `1.0` | Rarity cap (1.0 = All significance). |
| `scan_end` | `int` | `None` | Last read index (None for end of file). |
| `top_n` | `int` | `-1` | Max features to return (-1 for ALL). |
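One way to keep parameter sets reproducible across runs is to hold them in a small config object. The sketch below mirrors only the parameter names and defaults documented above; `DiscoveryConfig` is a hypothetical helper, not part of the package, and the range checks are assumptions rather than documented constraints.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DiscoveryConfig:
    """Mirror of the documented discover_genes() parameters and defaults."""

    input_path: str
    min_activation: float = 5.0
    rel_freq_max: float = 1.0
    scan_end: Optional[int] = None
    top_n: int = -1

    def __post_init__(self):
        # These range checks are assumptions, not documented constraints.
        if not 0.0 < self.rel_freq_max <= 1.0:
            raise ValueError("rel_freq_max is assumed to lie in (0, 1]")
        if self.scan_end is not None and self.scan_end < 0:
            raise ValueError("scan_end must be None or non-negative")
        if self.top_n != -1 and self.top_n < 1:
            raise ValueError("top_n must be -1 (all) or a positive count")

    def to_kwargs(self):
        """Return the settings as keyword arguments."""
        return asdict(self)


strict = DiscoveryConfig("data/sample.fastq", min_activation=8.0, scan_end=5000)
print(strict.to_kwargs())
# With the package installed, these could be passed as:
# results = platygeno.discover_genes(**strict.to_kwargs())
```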
⚡ Performance
| Mode | Engine Implementation | Runtime (20k Reads) | Discovery Speed |
|---|---|---|---|
| v1.0.4 | Batched Mean-Pooling | ~4.8 Minutes | 🚀 100% (High Speed) |
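The benchmark row above works out to roughly 69 reads per second (20,000 reads in about 288 seconds). The sketch below does that arithmetic and extrapolates to a larger input, assuming runtime scales linearly with read count on the same hardware; that assumption is not stated in the benchmark and real scaling may differ.

```python
def reads_per_second(reads, minutes):
    """Average throughput for a run of `reads` reads taking `minutes` minutes."""
    return reads / (minutes * 60.0)


def estimated_minutes(reads, rate):
    """Projected runtime in minutes at a constant rate (linear-scaling assumption)."""
    return reads / rate / 60.0


rate = reads_per_second(20_000, 4.8)
print(f"{rate:.1f} reads/s")
print(f"{estimated_minutes(100_000, rate):.0f} min for 100k reads")
```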
⚠️ Technical Limitations
- Pre-training Bias: Sensitivity depends on the Evo 2 pre-training corpus.
- SAE Bottleneck: Discrete compression may miss extremely subtle biological nuances.
- Validation Requirement: High significance is a "Beacon," not final functional proof.
📜 References
@software{PlatyGeno2026,
author = {Khoa Tu Tran},
title = {PlatyGeno: Unsupervised Significance Mapping via Evo 2},
url = {https://github.com/khoatran1995/PlatyGeno},
doi = {10.5281/zenodo.19581708},
year = {2026}
}
Thanks to Together AI (Evo 2), Goodfire AI (SAE interpretability), and the IBD Metagenomic Database (IBD-MDB). Please cite the relevant references when using PlatyGeno.
File details
Details for the file platygeno-1.0.4.tar.gz.
File metadata
- Download URL: platygeno-1.0.4.tar.gz
- Upload date:
- Size: 48.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 60c87f4072fef365f70853c134ea99d8146a14fbaca10b98c2c037f07e29f529 |
| MD5 | d59b298f2a3474667cc67dd8fb3a678b |
| BLAKE2b-256 | 3f90f80e37f2c5128d0c3edbdd771865fad8fbeabec4a5172756ab4a42549d53 |
File details
Details for the file platygeno-1.0.4-py3-none-any.whl.
File metadata
- Download URL: platygeno-1.0.4-py3-none-any.whl
- Upload date:
- Size: 45.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ab9d7c1552165b5ba113fd39c651f20b16cefa43102095d21dc9da171cb37546 |
| MD5 | f8fcc7201738a90a43e601d76f74bcf5 |
| BLAKE2b-256 | 457d675447a5617f02fb2c221c01dcdde1eb106850c56682c287ca03dc0354aa |
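A downloaded archive can be checked against the published SHA256 digest with the standard library alone. The sketch below is one way to do it; `sha256_of` and `matches` are hypothetical helper names, and the local file path in the usage comment is an assumption about where the file was saved.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA256 equals the expected digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the wheel was saved to the current directory:
# matches(
#     "platygeno-1.0.4-py3-none-any.whl",
#     "ab9d7c1552165b5ba113fd39c651f20b16cefa43102095d21dc9da171cb37546",
# )
```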