Unsupervised Gene Discovery via Evo 2 & SAE Interpretability

Unsupervised Biological Significance Mapping via Evo 2 & Sparse Autoencoders

PlatyGeno identifies genomic landmarks directly from raw sequence data. Using the Evo 2 foundation model, it isolates biologically significant DNA structures (promoters, coding sequences, precise motifs) based purely on model confidence, without requiring labels, databases, or BLAST.

🧪 Technical Validation (IBD-MDB)

PlatyGeno v1.0.4 has been validated on the clinical IBD Metagenomic Database (IBD-MDB). For complete statistics and methodology, see the PlatyGeno Technical Audit and the PlatyGeno Technical Supplemental.

Summary of Results:

  • Novel Genomic Landmarks: Identified Feature 7393, a 101 bp sequence with no prior database matches and a high-confidence structural model (AlphaFold2 best prediction; pLDDT ≈ 80).

    Representative best structural prediction (AlphaFold2) of the novel Feature 7393 discovered autonomously by PlatyGeno.

  • Statistical Correlation: Verified a Pearson correlation of r = 0.84 (p < 10⁻⁵⁰) between sequence length and match significance.

  • Resolution Gain: Consensus assembly provided a 10³⁸-fold increase in E-value confidence over isolated 60 bp fragments.

  • Taxonomic Profile: 72% of high-activation discoveries successfully cross-validated with target gut microbiota.


🏗️ Technical Foundation

PlatyGeno operates as a Reference-Free Microscope, detecting the "Signal" of life directly from genomic grammar.

🔭 The Discovery Core

  • AI-Native Interpretation: We use a Sparse Autoencoder (SAE) to translate the complex DNA "grammar" understood by Evo 2 into 32,768 human-interpretable biological concepts (e.g., promoters, viral motifs).
  • Peak Pinpointing (Layer 26): The engine intercepts signals at Layer 26 to identify the exact coordinate where a biological feature fires with the highest intensity.
  • Dual-Mode Discovery: Preserves both narrow Precision Snippets (individual high-interest DNA clips) and Consensus Assemblies (overlapping sequences from multiple reads of the same feature, pieced together).
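
The first two bullets above can be sketched in a few lines. This is a toy illustration, not PlatyGeno's or Goodfire's implementation: it assumes a plain ReLU encoder (real SAEs may use other sparsity mechanisms), uses 2-dimensional hidden states and 3 features instead of Evo 2 Layer 26 activations and 32,768 features, and all names (`sae_encode`, `peak_position`) are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def sae_encode(hidden: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Assumed encoder: ReLU(W @ hidden + b), one activation per feature."""
    return [
        max(0.0, sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(weights, bias)
    ]


def peak_position(
    hidden_per_position: List[Vector],
    weights: Matrix,
    bias: Vector,
    feature_id: int,
) -> Tuple[int, float]:
    """Return (position, activation) where one feature fires most strongly."""
    activations = [
        sae_encode(hidden, weights, bias)[feature_id]
        for hidden in hidden_per_position
    ]
    best = max(range(len(activations)), key=activations.__getitem__)
    return best, activations[best]


# Toy encoder with 3 features over 2-dimensional hidden states.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, -0.5, -1.0]
# One hidden state per sequence position (4 positions).
hidden_states = [[0.1, 0.2], [0.9, 0.1], [0.3, 2.0], [0.0, 0.0]]

pos, act = peak_position(hidden_states, W, b, feature_id=1)
print(f"feature 1 peaks at position {pos} with activation {act:.2f}")
```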

[!IMPORTANT] Performance Highlight: Both modes are preserved during discovery, but validation benchmarks indicate that Consensus Assembly yields more significant E-values and cleaner taxonomic resolution.
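
How PlatyGeno builds its Consensus Assemblies is not described here, so the following is only a generic greedy overlap merge to show the idea of piecing overlapping reads together. It assumes exact suffix/prefix matches and ignores reverse complements and sequencing errors; the function names and the `min_overlap` parameter are hypothetical.

```python
from typing import List


def overlap_len(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_consensus(reads: List[str], min_overlap: int = 4) -> List[str]:
    """Repeatedly merge the pair of contigs with the largest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, c in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_len(a, c, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [
            c for k, c in enumerate(contigs) if k not in (best_i, best_j)
        ] + [merged]
    return contigs


# Three toy reads that overlap by 4 bases each.
reads = ["ATGCGTAC", "GTACCTTA", "CTTAGGCA"]
assembled = greedy_consensus(reads)
print(assembled)
```

A longer merged sequence gives a database search more to match against, which is consistent with the note above that consensus sequences score better than isolated fragments.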

👉 For a full hierarchical deep-dive into the methodology and validation trail, see Technical Architecture.


⚙️ Setup & Installation

⚙️ Installation & Quick Start

PlatyGeno requires a CUDA-enabled GPU (RTX 3090, 4090, A100, or H100).

# 1. Install the core package
pip install platygeno

# 2. Install high-performance GPU kernels (required for full speed)
pip install ninja # speeds up the flash-attn build
pip install flash-attn --no-build-isolation

# 3. Verify & Run Discovery (on the validation sample)
platygeno --input data/sample.fastq --limit 5000 --threshold 5.0
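
Before running the command above, it can help to confirm that the expected executables are on `PATH`. The sketch below uses only the Python standard library. It checks for `nvidia-smi` (shipped with NVIDIA drivers) and the `platygeno` command used above; it does not verify that the GPU is one of the supported models or that flash-attn built correctly. The `preflight` helper is hypothetical.

```python
import shutil
from typing import Dict, Optional, Sequence


def preflight(
    tools: Sequence[str] = ("nvidia-smi", "platygeno"),
) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in preflight().items():
        status = f"found at {path}" if path else "NOT FOUND"
        print(f"{tool:<12} {status}")
```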

🚀 Quick Start for GitHub Clones

# 1. Clone & Enter
git clone https://github.com/khoatran1995/PlatyGeno.git
cd PlatyGeno

# 2. Install High-Performance Kernels & editable package
pip install flash-attn --no-build-isolation
pip install -e .

# 3. Trigger Discovery
platygeno --input data/sample.fastq --limit 5000

📚 Documentation

API Reference: Details on Evo 2 integrations and technical Python parameters.


🚀 Usage & API Reference

🚀 Advanced Python Discovery

Researchers can integrate the engine into custom discovery pipelines:

import platygeno

# Advanced Discovery: Tuning parameters for clinical audits
results = platygeno.discover_genes(
    input_path="data/sample.fastq",
    scan_end=5000,
    min_activation=8.0,      # High-confidence threshold
    batch_size=32            # GPU-optimized batching
)

# View discovered biological features
print(results[['feature_id', 'feature_name', 'activation', 'sequence']])
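
Discovered sequences usually need to go into downstream tools such as BLAST or structure predictors, which typically accept FASTA. The sketch below writes records to FASTA text. It assumes each record exposes the `feature_id`, `activation`, and `sequence` fields shown above; the `to_fasta` helper and the header layout are hypothetical, and the example record is invented.

```python
from typing import Iterable, Mapping


def to_fasta(records: Iterable[Mapping], width: int = 60) -> str:
    """Render records as FASTA, wrapping sequences at `width` characters."""
    lines = []
    for rec in records:
        activation = float(rec["activation"])
        lines.append(f">feature_{rec['feature_id']} activation={activation:.2f}")
        seq = str(rec["sequence"]).upper()
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n" if lines else ""


# Invented example record, for illustration only.
demo = [{"feature_id": 42, "activation": 9.5, "sequence": "acgtacgtacgt"}]
fasta_text = to_fasta(demo, width=8)
print(fasta_text, end="")
```

If `results` is a pandas DataFrame (the column indexing above suggests so, but that is an assumption), `results.to_dict("records")` yields rows in the shape this helper expects.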

platygeno.discover_genes() Reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| input_path | str | required | Path to the sequence file. |
| min_activation | float | 5.0 | Minimum signal strength. |
| rel_freq_max | float | 1.0 | Rarity cap on feature relative frequency (1.0 keeps all features). |
| scan_end | int | None | Last read index (None for end of file). |
| top_n | int | -1 | Maximum number of features to return (-1 for all). |
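
A small wrapper can catch bad argument values before a long GPU run starts. The sketch below builds a keyword-argument dict from the parameters in the table above. The validation ranges are assumptions inferred from the descriptions (for example, that `rel_freq_max` lies in (0, 1]), not documented constraints, and `build_discovery_kwargs` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def build_discovery_kwargs(
    input_path: str,
    min_activation: float = 5.0,
    rel_freq_max: float = 1.0,
    scan_end: Optional[int] = None,
    top_n: int = -1,
) -> Dict[str, Any]:
    """Validate arguments (assumed ranges) and return them as a dict."""
    if not input_path:
        raise ValueError("input_path is required")
    if min_activation < 0:
        raise ValueError("min_activation must be non-negative")
    if not 0.0 < rel_freq_max <= 1.0:
        raise ValueError("rel_freq_max must be in (0, 1]")
    if scan_end is not None and scan_end <= 0:
        raise ValueError("scan_end must be positive or None")
    if top_n != -1 and top_n <= 0:
        raise ValueError("top_n must be positive, or -1 for all")
    return {
        "input_path": input_path,
        "min_activation": min_activation,
        "rel_freq_max": rel_freq_max,
        "scan_end": scan_end,
        "top_n": top_n,
    }


kwargs = build_discovery_kwargs(
    "data/sample.fastq", min_activation=8.0, scan_end=5000, top_n=20
)
print(kwargs)
```

The resulting dict can then be passed on as `platygeno.discover_genes(**kwargs)`.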

⚡ Performance

| Version | Engine Implementation | Runtime (20k Reads) | Discovery Speed |
| --- | --- | --- | --- |
| v1.0.4 | Batched Mean-Pooling | ~4.8 minutes | 🚀 100% (High Speed) |
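
The benchmark above translates into a rough throughput figure that can be used to plan runs. The sketch below assumes runtime scales linearly with read count and that your hardware matches the (unspecified) benchmark machine, so treat the estimates as ballpark numbers. The helper names are hypothetical.

```python
BENCH_READS = 20_000   # reads in the reported benchmark
BENCH_MINUTES = 4.8    # reported runtime (approximate)


def reads_per_second() -> float:
    """Average throughput implied by the benchmark."""
    return BENCH_READS / (BENCH_MINUTES * 60)


def estimate_minutes(n_reads: int) -> float:
    """Estimated runtime for n_reads, assuming linear scaling."""
    return n_reads * BENCH_MINUTES / BENCH_READS


print(f"throughput: {reads_per_second():.1f} reads/s")
print(f"5,000 reads: {estimate_minutes(5000):.1f} min")
```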

⚠️ Technical Limitations

  • Pre-training Bias: Sensitivity depends on the Evo 2 pre-training corpus.
  • SAE Bottleneck: Discrete compression may miss extremely subtle biological nuances.
  • Validation Requirement: High significance is a "Beacon," not final functional proof.

📜 References

@software{PlatyGeno2026,
  author = {Khoa Tu Tran},
  title = {PlatyGeno: Unsupervised Significance Mapping via Evo 2},
  url = {https://github.com/khoatran1995/PlatyGeno},
  doi = {10.5281/zenodo.19581708},
  year = {2026}
}

Thanks to Together AI (Evo 2), Goodfire AI (SAE interpretability), and the IBD Metagenomic Database (IBD-MDB). Please cite the relevant references when using PlatyGeno.
