Unsupervised Gene Discovery via Evo 2 & SAE Interpretability

Reason this release was yanked:

testing release

Unsupervised Biological Significance Mapping via Evo 2 & Sparse Autoencoders

PlatyGeno is a Python package for identifying genomic landmarks directly from raw sequence data. Using the Evo 2 foundation model, it flags biologically significant DNA structures (promoters, coding sequences, precise motifs, and so on) based purely on model activation strength, without requiring labels, databases, or BLAST.


🔭 Scientific Philosophy: Zero-Reference Significance

Most bioinformatics tools are designed to find "matches" to known lists. PlatyGeno is a Reference-Free Microscope that detects the "Signal" of life itself:

  • Signals over Samples: Traditional tools (like BLAST) find genes by comparing them to a library. PlatyGeno "reads" the DNA grammar and detects significance peaks directly, so an important sequence can be flagged even if nothing like it has been sequenced before.
  • Significance First: We prioritize Activation Strength (the intensity of the AI's internal response). A high activation score is a "Biological Beacon" that points to a functional region.
  • Optional Novelty Mining: Once significant landmarks are identified, researchers can optionally filter for Rarity to isolate "Genomic Dark Matter" (novel viruses, exotic enzymes, or extremophiles).

⚙️ Installation

PlatyGeno requires a CUDA-enabled GPU (for example, an RTX 3090, RTX 4090, A100, or H100).

```bash
# 1. Install the core package
pip install platygeno

# 2. Install high-performance GPU kernels (mandatory for speed)
pip install ninja  # speeds up the flash-attn build
pip install flash-attn --no-build-isolation
```

🚀 Quick Start for GitHub Clones

If you are cloning the repository for research or development, follow these three steps to run your first discovery:

```bash
# 1. Clone & Enter
git clone https://github.com/khoatran1995/PlatyGeno.git
cd PlatyGeno

# 2. Install in Editable Mode
pip install -e .

# 3. Trigger Discovery (on the validation sample)
platygeno --input data/sample.fastq --limit 5000
```

🏗️ Simplified Architecture

PlatyGeno adds a "decoding" layer on top of the Evo 2 foundation model:

  1. Evo 2 (The Brain): A 7B-parameter model trained to capture the grammar of DNA across all domains of life.
  2. Sparse Autoencoders (The Interpreter): 32,768 discrete concept nodes that translate the model's internal activations into human-interpretable biological signals.
  3. Landmark Scouter: Scans raw FASTQ data to find the precise coordinates where these concept nodes fire with the highest intensity.

📚 Documentation & Reference



🚀 One-Line Discovery Quickstart

Researchers can perform complete biological landmark discovery with just a single function call:

```python
import platygeno

# Complete Discovery: Scan, Pool, Extract, and Annotate
results = platygeno.discover_genes(
    input_path="data/sample.fastq",
    scan_end=5000,          # Scan first 5000 reads
    min_activation=5.0      # Target high-confidence landmarks
)

# View discovered biological features
print(results[['feature_id', 'feature_name', 'activation', 'sequence']])
```
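
To make the scan_end argument more concrete, here is a minimal sketch of reading only the first N reads from a FASTQ file. The helper name `read_fastq` is hypothetical and is not part of the PlatyGeno API; it assumes the common four-lines-per-record FASTQ layout and ignores quality strings.

```python
import io
from itertools import islice
from typing import Iterator, Optional, TextIO, Tuple


def read_fastq(handle: TextIO, scan_end: Optional[int] = None) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) pairs from four-line FASTQ records.

    Hypothetical helper: stops after `scan_end` reads, or at end of file
    when `scan_end` is None.
    """
    def records() -> Iterator[Tuple[str, str]]:
        while True:
            header = handle.readline()
            if not header:
                return  # end of file
            sequence = handle.readline().strip()
            handle.readline()  # '+' separator line
            handle.readline()  # quality string (unused in this sketch)
            yield header.strip()[1:], sequence  # drop the leading '@'

    return islice(records(), scan_end)


# In-memory stand-in for a FASTQ file with three reads.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n")
first_two = list(read_fastq(demo, scan_end=2))
print(first_two)
```

Reading lazily like this keeps memory use flat regardless of file size, which is one plausible reason to expose a read limit at all.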


🚀 Step-by-Step Discovery (Ph.D. Suite)

PlatyGeno is organized as a unified discovery workflow:

  1. One-Touch Discovery: python validation/discovery_pipeline.py --input sample.fastq performs both significance scanning and automated BLAST validation (via validation/step2_blast.py).
  2. AI-Aware Validation: The engine automatically labels known features (Coding Regions, Alpha Helices) and prioritizes unknown "Dark Matter" for validation.
  3. Recursive OOM Guard: Automatically scales batch sizes to fit your GPU VRAM, ensuring large files don't crash the discovery process.
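
The OOM guard idea can be sketched as follows. This is an illustration under stated assumptions, not PlatyGeno's implementation: the function names are hypothetical, the loop halves the batch size iteratively rather than recursing, and a plain `MemoryError` stands in for the GPU out-of-memory exception (with PyTorch one would likely pass `torch.cuda.OutOfMemoryError` as `oom_error`).

```python
from typing import Callable, List, Sequence, Type


def run_with_oom_guard(
    reads: Sequence[str],
    score_batch: Callable[[Sequence[str]], List[float]],
    batch_size: int,
    oom_error: Type[BaseException] = MemoryError,
) -> List[float]:
    """Score all reads, halving the batch size whenever a batch runs out of memory."""
    results: List[float] = []
    start = 0
    while start < len(reads):
        batch = reads[start:start + batch_size]
        try:
            results.extend(score_batch(batch))
            start += len(batch)  # advance only after a successful batch
        except oom_error:
            if batch_size == 1:
                raise  # even a single read does not fit
            batch_size = max(1, batch_size // 2)
    return results


def fake_scorer(batch: Sequence[str]) -> List[float]:
    """Stand-in for GPU inference: pretends more than 4 reads exhaust memory."""
    if len(batch) > 4:
        raise MemoryError("simulated out-of-memory")
    return [float(len(read)) for read in batch]


scores = run_with_oom_guard(["ACGT"] * 10, fake_scorer, batch_size=16)
print(len(scores))
```

Because the start index only advances after a successful batch, a failed batch is retried at the smaller size and no read is skipped or scored twice.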

⚙️ Hardware Optimization

PlatyGeno is optimized for high-throughput discovery. To avoid the "12-hour bottleneck" on large datasets, use the Batched Inference engine.

Batch Size Guide (--batch-size)

Parallelizing your scan is the fastest way to get results. Match this setting to your GPU VRAM:

| Hardware | VRAM | Recommended Batch Size |
|---|---|---|
| A100 / H100 | 80GB | 32-64 |
| RTX 3090 / 4090 | 24GB | 8-16 |
| RTX 3060 / 4070 | 12GB | 1-2 |

Tip: Out of memory? If you encounter an OOM error, lower the --batch-size.



🚀 One-Line Discovery (Terminal)

If you prefer the command line, you can trigger a full biological scan with one command:

```bash
# Scan 5000 reads and generate a landmark report
# Automatically saves to: results/sample_Significance.csv
platygeno --input sample.fastq --limit 5000 --threshold 5.0
```


🧩 The Scientific Dial: Tuning Significance

PlatyGeno uses the AI's "Excitement" (the activation strength of its concept nodes) as the primary scientific dial:

1. Signal Strength (min_activation)

  • 3.0 – 5.0: "Significance Scouting." Ideal for mapping the general landscape of a sample.
  • 8.0 – 12.0: "Landmark Identification." Targets high-confidence biological machinery.

2. Novelty Filter (--rarity-only) - Optional

  • Default (Off): Standard mode (Panoramic). Shows all important genes (Known and Unknown).
  • On (--rarity-only): Novelty mode. Automatically subtracts common housekeeping genes to find "Dark Matter."

3. Discovery Breadth (--top-n)

  • Default (-1): Ph.D. Survey Mode. Returns every significant landmark found in the sample (Unlimited).
  • Targeted (10-25): Precision Mode. Focuses only on the strongest outliers.
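
How the three dials could combine is sketched below. Everything here is illustrative: `apply_dials` and the `rel_freq` field are hypothetical names, and treating the novelty filter as a cap on relative frequency (in the spirit of the `rel_freq_max` parameter) is an assumption rather than documented behavior.

```python
from typing import Dict, List

Landmark = Dict[str, float]


def apply_dials(
    landmarks: List[Landmark],
    min_activation: float = 5.0,
    rel_freq_max: float = 1.0,
    top_n: int = -1,
) -> List[Landmark]:
    """Keep strong (and optionally rare) landmarks, strongest first."""
    kept = [
        lm for lm in landmarks
        if lm["activation"] >= min_activation and lm["rel_freq"] <= rel_freq_max
    ]
    kept.sort(key=lambda lm: lm["activation"], reverse=True)
    # A negative top_n means "return everything", mirroring the -1 default.
    return kept if top_n < 0 else kept[:top_n]


# Toy candidates; the numbers are made up for illustration.
candidates = [
    {"feature_id": 212, "activation": 9.1, "rel_freq": 0.60},
    {"feature_id": 16509, "activation": 6.4, "rel_freq": 0.02},
    {"feature_id": 7, "activation": 3.2, "rel_freq": 0.01},
]
survey = [lm["feature_id"] for lm in apply_dials(candidates)]
rare_only = [lm["feature_id"] for lm in apply_dials(candidates, rel_freq_max=0.05)]
strongest = [lm["feature_id"] for lm in apply_dials(candidates, top_n=1)]
print(survey, rare_only, strongest)
```

Applying the activation threshold before the top-N cut means a small `top_n` never promotes weak features just to fill the quota.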

🧪 Core Methodology

PlatyGeno’s "Golden Configuration" is built on two primary scientific pillars:

1. Mean-Pooling (Global Semantic Averaging)

Instead of scanning token-by-token (which can be noisy), PlatyGeno averages the entire sequence embedding into a single global summary before SAE encoding. This "denoises" the data and allows the model to identify the overall biological identity of the read with high stability.
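
A toy version of this pooling step, in plain Python, might look like the sketch below. The encoder shown is a generic ReLU sparse-autoencoder form, ReLU(Wx + b), chosen for illustration; the actual SAE architecture, weights, and dimensions used with Evo 2 are not specified here, and both function names are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_pool(token_embeddings: List[Vector]) -> Vector:
    """Average per-token embeddings into one vector for the whole read."""
    n_tokens = len(token_embeddings)
    return [sum(column) / n_tokens for column in zip(*token_embeddings)]


def sae_encode(pooled: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Toy SAE encoder: ReLU(W @ x + b), one activation per feature."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, pooled)) + b)
        for row, b in zip(weights, bias)
    ]


# Three tokens with 2-dimensional embeddings (toy numbers, not Evo 2 output).
tokens = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
pooled = mean_pool(tokens)

# Two toy SAE features; the real dictionary has 32,768.
activations = sae_encode(pooled, weights=[[1.0, 1.0], [-1.0, 0.0]], bias=[0.0, 0.5])
print(pooled, activations)
```

Pooling first means the SAE is evaluated once per read instead of once per token, which trades positional detail for a single, more stable summary.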

2. Zero-Gate Discovery (Unrestricted Semantic Census)

Most feature-extraction pipelines use a "Top-K" gate to record only the strongest signals. PlatyGeno's Zero-Gate mode removes this bottleneck: it records every biological concept that shows activation (capped at 64 per read), so that rare regulatory motifs or subtle protein domains are not overshadowed by common genomic grammar.
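
The contrast with a Top-K gate can be sketched in a few lines. The function name is hypothetical, and "shows activation" is interpreted here as a strictly positive value, which is an assumption.

```python
from typing import List, Tuple


def active_features(activations: List[float], max_per_read: int = 64) -> List[Tuple[int, float]]:
    """Return every feature with positive activation, strongest first, capped per read."""
    active = [(idx, value) for idx, value in enumerate(activations) if value > 0.0]
    active.sort(key=lambda pair: pair[1], reverse=True)
    return active[:max_per_read]


acts = [0.0, 7.5, 0.2, 0.0, 1.1]

# Zero-Gate style: keep all active features (up to the per-read cap).
zero_gate = active_features(acts)

# Top-K style with K = 1: only the single strongest feature survives.
top_1 = active_features(acts, max_per_read=1)
print(zero_gate, top_1)
```

In this toy read, the weak feature at index 2 is kept by the Zero-Gate call but lost under the Top-1 gate.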


📈 Validation Stability: The Padding Filter

The "Golden Configuration" (Batched Mean-Pooling) reaches its stable 98-landmark validation result through a natural "Padding Filter." Because sequences are processed in batches, sequence padding subtly dilutes weaker semantic noise, so that mainly the strongest, high-confidence biological signals survive the pooling phase. The result is a high-precision discovery report.
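
One possible reading of the dilution effect is sketched below. It assumes that padded positions contribute zero vectors and that the mean is taken over the padded batch length without masking; neither detail is confirmed by the text above, and the function name is hypothetical.

```python
from typing import List

Vector = List[float]


def padded_mean_pool(batch: List[List[Vector]]) -> List[Vector]:
    """Mean-pool each read over the padded batch length.

    Assumption: padded positions contribute zero vectors and are NOT masked
    out, so shorter reads are scaled down by len(read) / longest_read.
    """
    longest = max(len(read) for read in batch)
    dim = len(batch[0][0])
    pooled = []
    for read in batch:
        totals = [sum(token[d] for token in read) for d in range(dim)]
        pooled.append([total / longest for total in totals])
    return pooled


long_read = [[4.0, 4.0]] * 4   # four tokens
short_read = [[4.0, 4.0]] * 2  # two tokens, same per-token signal
diluted = padded_mean_pool([long_read, short_read])
print(diluted)
```

Under these assumptions the shorter read's pooled vector is halved, so a weak signal in a short read is more likely to fall below a fixed activation threshold. A side effect worth noting is that results would then depend on which reads share a batch.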


🧬 Technical Performance Highlights (v1.0)

  • Mechanism: Mean-Pooling (Iterative sequence averaging).
  • Diversity: Zero-Gate Discovery (Captures ALL active biological signals).
  • Performance: Tuned against the 98-landmark validation run.

4. Strategic Subtraction (--exclude)

  • Usage: --exclude 212,16509
  • Purpose: "Mutes" features you have already identified as known biology. This forces the engine to look deeper and surface the next layer of genomic candidates.
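
A minimal sketch of the muting step follows, assuming the flag value is a comma-separated list of feature IDs as in the usage line above. Both helper names are hypothetical.

```python
from typing import Dict, List, Set


def parse_exclude(arg: str) -> Set[int]:
    """Turn a comma-separated string such as '212,16509' into a set of feature IDs."""
    return {int(part) for part in arg.split(",") if part.strip()}


def mute_features(landmarks: List[Dict], excluded: Set[int]) -> List[Dict]:
    """Drop landmarks whose feature_id is in the excluded set."""
    return [lm for lm in landmarks if lm["feature_id"] not in excluded]


excluded = parse_exclude("212,16509")
remaining = mute_features(
    [{"feature_id": 212}, {"feature_id": 7}, {"feature_id": 16509}],
    excluded,
)
print(remaining)
```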

📂 Unified Directory Structure

PlatyGeno automatically manages your experiment audit trails:

  • results/: Sample-aware Significance and Validation CSVs.
  • data/: Your raw FASTQ/FASTA input files.

🧪 Validation Dataset: Gut Metagenome (IBD-MDB)

PlatyGeno includes a high-density clinical validation set for testing novelty discovery in complex human samples:

  • Origin: Inflammatory Bowel Disease Multi'omics Database (IBDMDB).
  • Role: Validating the engine's ability to identify autonomous biological landmarks in high-complexity clinical metagenomes.
  • Local Data: Validation reads are provided in data/sample.fastq for reproducibility.

🧪 Use Case: Hunting for the Unknown

While PlatyGeno identifies all important genes, it is uniquely tuned for Genomic Dark Matter:

  • Reference-Free: Identify significance in exotic metagenomes where no reference genomes exist.
  • Structural Discovery: Translate AI-flagged coding sequences to protein and feed them into AlphaFold to look for never-before-seen 3D protein folds.
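
Structure predictors such as AlphaFold take amino-acid sequences, so a flagged DNA read has to be translated first. The sketch below uses the standard genetic code and translates only the forward strand in frame 0, stopping at the first stop codon; choosing the correct strand and reading frame is left out, and the `translate` helper is hypothetical rather than part of PlatyGeno.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate frame 0 of the forward strand, stopping at the first stop codon."""
    protein = []
    dna = dna.upper()
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for ambiguous codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


peptide = translate("ATGGCCTAAGGG")
print(peptide)
```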

📚 API Reference

platygeno.discover_genes()

| Parameter | Type | Default | Description |
|---|---|---|---|
| input_path | str | Required | Path to sequence file. |
| min_activation | float | 5.0 | Minimum signal strength. |
| rel_freq_max | float | 1.0 | Rarity cap (1.0 = all significant features). |
| scan_end | int | None | Last read index (None for end of file). |
| top_n | int | -1 | Max features to return (-1 returns all). |

📜 Primary References

1. PlatyGeno (This Package):

@software{PlatyGeno2026,
  author = {Khoa Tu Tran},
  title = {PlatyGeno: Unsupervised Significance Mapping via Evo 2},
  url = {https://github.com/khoatran1995/PlatyGeno},
  year = {2026}
}

2. Evo 2 Model: Arc Institute. (2026). Genome modeling and design across all domains of life with Evo 2. Nature.
