Unsupervised Gene Discovery via Evo 2 & SAE Interpretability

Unsupervised Biological Significance Mapping via Evo 2 & Sparse Autoencoders

PlatyGeno is a Python package for identifying genomic landmarks directly from raw sequence data. Using the Evo 2 foundation model, it flags biologically significant DNA structures (promoters, coding sequences, sequence motifs, and others) based purely on the model's internal activation strength, without requiring labels, reference databases, or BLAST.

🔭 Scientific Philosophy: Zero-Reference Significance

Most bioinformatics tools are designed to find "matches" to known lists. PlatyGeno is a Reference-Free Microscope that detects the "Signal" of life itself:

Signals over Samples: Traditional tools (like BLAST) find genes by comparing them to a library. PlatyGeno "reads" the DNA grammar and detects significance peaks directly, so an important sequence can be flagged even if nothing like it has been sequenced before.
Significance First: We prioritize Activation Strength (the intensity of the AI's internal response). A high activation score is a "Biological Beacon" that points to a functional region.
Optional Novelty Mining: Once significant landmarks are identified, researchers can optionally filter for Rarity to isolate "Genomic Dark Matter" (novel viruses, exotic enzymes, or extremophiles).

⚙️ Installation

PlatyGeno requires a CUDA-enabled GPU, such as an RTX 3090, RTX 4090, A100, or H100. See the Batch Size Guide under Hardware Optimization for settings on smaller cards.

```bash
# 1. Install the core package
pip install platygeno

# 2. Install high-performance GPU kernels (mandatory for speed)
pip install ninja  # speeds up the flash-attn build
pip install flash-attn --no-build-isolation
```

---

## 🚀 Quick Start for GitHub Clones
If you are cloning the repository for research or development, follow these three steps to run your first discovery:

```bash
# 1. Clone & Enter
git clone https://github.com/khoatran1995/PlatyGeno.git
cd PlatyGeno

# 2. Install in Editable Mode
pip install -e .

# 3. Trigger Discovery (on the validation sample)
platygeno --input data/sample.fastq --limit 5000
```

🏗️ Simplified Architecture

PlatyGeno adds a decoding layer on top of the Evo 2 foundation model:

Evo 2 (The Brain): A 7B-parameter genomic language model trained on DNA from all domains of life.
Sparse Autoencoders (The Interpreter): 32,768 discrete concept nodes that translate the model's internal activations into human-interpretable biological signals.
Landmark Scouter: Scans raw FASTQ data to locate where these concept nodes fire with the highest intensity.
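
To make the "interpreter" layer concrete, here is a minimal sketch of how a sparse autoencoder can turn one embedding vector into a few active concept nodes. The tiny dimensions, the weights, and the name `sae_encode` are illustrative assumptions; the real SAE has 32,768 nodes, and its exact encoder is not described in this README.

```python
def sae_encode(embedding, weights, biases):
    """Encode one embedding into non-negative concept activations (ReLU)."""
    activations = []
    for row, bias in zip(weights, biases):
        pre_activation = sum(w * x for w, x in zip(row, embedding)) + bias
        activations.append(max(0.0, pre_activation))
    return activations


# Toy example: a 3-dimensional embedding and 4 hypothetical concept nodes.
embedding = [1.0, -2.0, 0.5]
weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
    [-1.0, -1.0, 0.0],
]
biases = [0.0, 0.0, -0.5, 0.0]

activations = sae_encode(embedding, weights, biases)
# Keep only the nodes that actually fired.
active_nodes = {i: a for i, a in enumerate(activations) if a > 0.0}
print(active_nodes)
```

Most nodes stay at zero for any given input, which is what makes the surviving activations readable as discrete signals.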

📚 Documentation & Reference

Technical API Reference: Detailed documentation for every function in the platygeno core.
Architecture Guide: Deep dive into Evo 2, Sparse Autoencoders, and Mean-Pooling theory.
Validation Methodology: Detailed audit trail for clinical gene discovery.

🚀 One-Line Discovery Quickstart

Researchers can perform complete biological landmark discovery with just a single function call:

```python
import platygeno

# Complete Discovery: Scan, Pool, Extract, and Annotate
results = platygeno.discover_genes(
    input_path="data/sample.fastq",
    scan_end=5000,          # Scan first 5000 reads
    min_activation=5.0      # Target high-confidence landmarks
)

# View discovered biological features
print(results[['feature_id', 'feature_name', 'activation', 'sequence']])
```

🚀 Step-by-Step Discovery (Ph.D. Suite)

PlatyGeno is organized as a single, unified discovery workflow:

One-Touch Discovery: `python validation/discovery_pipeline.py --input sample.fastq` performs both significance scanning and automated BLAST validation (via `validation/step2_blast.py`).
AI-Aware Validation: The engine automatically labels known features (Coding Regions, Alpha Helices) and prioritizes unknown "Dark Matter" for validation.
Recursive OOM Guard: Automatically scales batch sizes down to fit your GPU VRAM, so that large files don't crash the discovery process.
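
The idea behind an OOM guard can be sketched as follows. This is not PlatyGeno's implementation: the function names are hypothetical, the retry is written as a loop rather than recursion, and the built-in `MemoryError` stands in for the out-of-memory error a GPU framework would raise (for example `torch.cuda.OutOfMemoryError` in PyTorch).

```python
def run_with_oom_backoff(process_batch, reads, batch_size):
    """Process reads in batches, halving the batch size whenever memory runs out."""
    results = []
    start = 0
    while start < len(reads):
        batch = reads[start:start + batch_size]
        try:
            results.extend(process_batch(batch))
        except MemoryError:
            if batch_size == 1:
                raise  # a single read does not fit; nothing left to shrink
            batch_size = max(1, batch_size // 2)
            continue  # retry the same position with a smaller batch
        start += len(batch)
    return results, batch_size


# Stand-in for GPU inference: pretend more than 4 reads exhausts memory.
def fake_process_batch(batch):
    if len(batch) > 4:
        raise MemoryError("simulated out-of-memory")
    return [len(read) for read in batch]


reads = ["ACGT" * n for n in range(1, 11)]
lengths, final_batch_size = run_with_oom_backoff(fake_process_batch, reads, batch_size=16)
print(final_batch_size, lengths)
```

Retrying from the same position means no read is skipped or processed twice when the batch size shrinks.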

⚙️ Hardware Optimization

PlatyGeno is optimized for high-throughput discovery. To avoid the "12-hour bottleneck" on large datasets, use the Batched Inference engine via `--batch-size`.

Batch Size Guide (`--batch-size`)

Parallelizing your scan is the fastest way to get results. Match this setting to your GPU VRAM:

| Hardware | VRAM | Recommended Batch Size |
| --- | --- | --- |
| A100 / H100 | 80GB | `32` – `64` |
| RTX 3090 / 4090 | 24GB | `8` – `16` |
| RTX 3060 / 4070 | 12GB | `1` – `2` |

[!TIP] Out of Memory? If you encounter an OOM error, lower the `--batch-size` value.
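
The table can be turned into a small lookup helper. The function below is a hypothetical convenience, not part of the package; rounding in-between cards (for example 48GB) down to the nearest listed tier is an assumption chosen to stay on the safe side.

```python
def recommended_batch_size(vram_gb):
    """Return the (low, high) batch-size range from the Batch Size Guide table."""
    if vram_gb >= 80:
        return (32, 64)
    if vram_gb >= 24:
        return (8, 16)
    if vram_gb >= 12:
        return (1, 2)
    raise ValueError("The Batch Size Guide lists no setting below 12GB of VRAM")


low, high = recommended_batch_size(24)
print(f"--batch-size between {low} and {high}")
```

Starting at the low end of the range and increasing it is the cautious way to find the largest batch that fits.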

🚀 One-Line Discovery (Terminal)

If you prefer the command line, you can trigger a full biological scan with one command:

```bash
# Scan 5000 reads and generate a landmark report
# Automatically saves to: results/sample_Significance.csv
platygeno --input sample.fastq --limit 5000 --threshold 5.0
```
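
Once the report exists, it can be inspected with the standard library alone. The sketch below assumes the CSV carries the same columns as the Python quickstart output (`feature_id`, `feature_name`, `activation`, `sequence`); that column layout is an assumption, and the rows shown are made-up placeholders rather than real results.

```python
import csv
import io


def top_landmarks(csv_file, n=3):
    """Return the n rows with the highest activation from an open CSV file."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: float(row["activation"]), reverse=True)
    return rows[:n]


# Stand-in for open("results/sample_Significance.csv"); values are invented.
demo = io.StringIO(
    "feature_id,feature_name,activation,sequence\n"
    "101,feature_a,6.2,ACGTAC\n"
    "102,feature_b,11.4,GGCATT\n"
    "103,feature_c,5.1,TTAGGC\n"
)

for row in top_landmarks(demo, n=2):
    print(row["feature_id"], row["activation"])
```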

🧩 The Scientific Dial: Tuning Significance

PlatyGeno uses SAE activation strength (the AI's "excitement") as its primary tuning dial:

1. Signal Strength (`min_activation`)

3.0 – 5.0: "Significance Scouting." Ideal for mapping the general landscape of a sample.
8.0 – 12.0: "Landmark Identification." Targets high-confidence biological machinery.

2. Novelty Filter (`--rarity-only`) - Optional

Default (Off): Standard mode (Panoramic). Shows all important genes (Known and Unknown).
On (--rarity-only): Novelty mode. Automatically subtracts common housekeeping genes to find "Dark Matter."

3. Discovery Breadth (`--top-n`)

Default (-1): Ph.D. Survey Mode. Returns every significant landmark found in the sample (Unlimited).
Targeted (10-25): Precision Mode. Focuses only on the strongest outliers.
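
How the three dials combine can be sketched as a plain filter. The defaults mirror the API reference (`min_activation=5.0`, `rel_freq_max=1.0`, `top_n=-1`), but the per-feature `rel_freq` field, the function name, and the exact way `--rarity-only` maps onto a frequency cap are assumptions made for illustration.

```python
def apply_dials(features, min_activation=5.0, rel_freq_max=1.0, top_n=-1):
    """Filter by signal strength and rarity, then keep the strongest top_n."""
    kept = [
        f for f in features
        if f["activation"] >= min_activation and f["rel_freq"] <= rel_freq_max
    ]
    kept.sort(key=lambda f: f["activation"], reverse=True)
    return kept if top_n < 0 else kept[:top_n]


# Invented example features; rel_freq is a hypothetical "how common" score.
features = [
    {"feature_id": 1, "activation": 9.5, "rel_freq": 0.80},
    {"feature_id": 2, "activation": 6.0, "rel_freq": 0.02},
    {"feature_id": 3, "activation": 3.2, "rel_freq": 0.01},
    {"feature_id": 4, "activation": 12.1, "rel_freq": 0.05},
]

survey = apply_dials(features)                               # everything significant
novel = apply_dials(features, rel_freq_max=0.1)              # rare features only
precise = apply_dials(features, min_activation=8.0, top_n=1) # strongest outlier
print([f["feature_id"] for f in survey])
print([f["feature_id"] for f in novel])
print([f["feature_id"] for f in precise])
```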

🧪 Core Methodology

PlatyGeno’s "Golden Configuration" is built on two primary scientific pillars:

1. Mean-Pooling (Global Semantic Averaging)

Instead of scanning token-by-token (which can be noisy), PlatyGeno averages the per-token embeddings of each read into a single global summary vector before SAE encoding. This "denoises" the data and lets the model identify the overall biological identity of the read with high stability.
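
Mean-pooling itself is a one-line operation. The sketch below uses plain Python lists and a made-up 3-token, 2-dimensional embedding; the real embeddings come from Evo 2 and are far larger.

```python
def mean_pool(token_embeddings):
    """Average per-token embedding vectors into one sequence-level vector."""
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n_tokens for d in range(dim)]


# Three tokens, two embedding dimensions (illustrative values).
token_embeddings = [
    [0.0, 4.0],
    [2.0, 0.0],
    [4.0, 2.0],
]

pooled = mean_pool(token_embeddings)
print(pooled)  # [2.0, 2.0]
```

The trade-off is resolution: one pooled vector describes the read as a whole, not an individual position inside it.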

2. Zero-Gate Discovery (Unrestricted Semantic Census)

Most feature-extraction pipelines use a "Top-K" gate that records only the strongest signals. PlatyGeno's Zero-Gate mode removes this bottleneck: it records every biological concept that shows activation, up to a cap of 64 per read, so that rare regulatory motifs or subtle protein domains are less likely to be overshadowed by common genomic grammar.
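
The difference between the two gates can be shown on a single read. Both functions are hypothetical sketches; in particular, applying the 64-feature cap strongest-first is an assumption, and the activation values are invented.

```python
def top_k_gate(activations, k):
    """Keep only the k strongest features, regardless of how many fired."""
    ranked = sorted(activations.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def zero_gate(activations, cap=64):
    """Keep every feature with non-zero activation, up to a per-read cap."""
    active = {fid: a for fid, a in activations.items() if a > 0.0}
    ranked = sorted(active.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:cap])


# One read: two strong features, one faint one, one silent one.
read_activations = {212: 9.1, 16509: 7.4, 3001: 0.6, 77: 0.0}

gated = top_k_gate(read_activations, k=2)
census = zero_gate(read_activations)
print(sorted(gated))   # [212, 16509]
print(sorted(census))  # [212, 3001, 16509]
```

The faint feature `3001` is dropped by the Top-K gate but recorded by the Zero-Gate census; filtering by `min_activation` can still remove it later.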

📈 Validation Stability: The Padding Filter

The "Golden Configuration" (Batched Mean-Pooling) achieves its stable 98-landmark validation result with the help of a natural "Padding Filter." Because sequences are processed in batches, shorter reads are padded, and the padding dilutes weaker activations during pooling. As a result, only the strongest, high-confidence biological signals survive the pooling phase, which yields a high-precision discovery report.
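
The dilution effect is simple arithmetic, sketched below under two assumptions: padding positions contribute zeros to the mean (one reading of the paragraph above), and pooling is shown directly on per-token signal values for brevity, whereas the real pipeline pools embeddings before SAE encoding.

```python
def pooled_activation(token_activations, padded_length):
    """Mean over the padded length; padding positions count as zeros."""
    return sum(token_activations) / padded_length


MIN_ACTIVATION = 5.0  # default threshold from the API reference

weak_read = [6.0, 6.0, 6.0, 6.0]        # 4 tokens, unpadded mean 6.0
strong_read = [12.0, 12.0, 12.0, 12.0]  # 4 tokens, unpadded mean 12.0

# Alone, a read is pooled over its own 4 tokens.
weak_alone = pooled_activation(weak_read, padded_length=4)
# Batched with an 8-token read, both reads are padded to length 8.
weak_batched = pooled_activation(weak_read, padded_length=8)
strong_batched = pooled_activation(strong_read, padded_length=8)

print(weak_alone >= MIN_ACTIVATION)      # True
print(weak_batched >= MIN_ACTIVATION)    # False
print(strong_batched >= MIN_ACTIVATION)  # True
```

If padding behaves this way, the pooled value of a read depends on the longest read in its batch, so reproducing a result requires keeping the batch size and read order fixed.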

🧬 Technical Performance Highlights (v1.0)

Mechanism: Mean-Pooling (global sequence averaging).
Diversity: Zero-Gate Discovery (captures all active biological signals, up to 64 per read).
Performance: Tuned to reproduce the 98-landmark validation run.

4. Strategic Subtraction (`--exclude`)

This fourth dial complements the three listed under The Scientific Dial.

Usage: `--exclude 212,16509` (a comma-separated list of feature IDs)
Purpose: "Mutes" features you have already identified as known biology. This pushes the engine to surface the next layer of genomic candidates.
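
A sketch of what muting amounts to, assuming the values passed to `--exclude` are integer feature IDs. Both helper names are hypothetical, and the feature list is invented.

```python
def parse_exclude(value):
    """Turn a string such as "212,16509" into a set of integer feature IDs."""
    return {int(part) for part in value.split(",") if part.strip()}


def mute_features(features, excluded_ids):
    """Drop every feature whose ID has been marked as already-known biology."""
    return [f for f in features if f["feature_id"] not in excluded_ids]


excluded = parse_exclude("212,16509")
features = [
    {"feature_id": 212, "activation": 9.1},
    {"feature_id": 16509, "activation": 7.4},
    {"feature_id": 404, "activation": 6.3},
]

remaining = mute_features(features, excluded)
print([f["feature_id"] for f in remaining])  # [404]
```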

📂 Unified Directory Structure

PlatyGeno automatically manages your experiment audit trails:

results/: Sample-aware Significance and Validation CSVs.
data/: Your raw FASTQ/FASTA input files.
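
The "sample-aware" naming can be sketched with `pathlib`. The pattern `results/<sample>_Significance.csv` is inferred from the single example path shown in the terminal quickstart, so treat the helper below as an assumption rather than the package's actual logic.

```python
from pathlib import Path


def significance_report_path(input_path, results_dir="results"):
    """Derive the report path from the input file name (assumed pattern)."""
    sample_name = Path(input_path).stem  # "data/sample.fastq" -> "sample"
    return Path(results_dir) / f"{sample_name}_Significance.csv"


report = significance_report_path("data/sample.fastq")
print(report.as_posix())  # results/sample_Significance.csv
```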

🧪 Validation Dataset: Gut Metagenome (IBD-MDB)

PlatyGeno includes a high-density clinical validation set for testing novelty discovery in complex human samples:

Origin: Inflammatory Bowel Disease Multi'omics Database (IBDMDB).
Role: Validating the engine's ability to identify biological landmarks autonomously in high-complexity clinical metagenomes.
Local Data: Validation reads are provided in `data/sample.fastq` for reproducibility.

🧪 Use Case: Hunting for the Unknown

While PlatyGeno identifies all important genes, it is uniquely tuned for Genomic Dark Matter:

Reference-Free: Identify significance in exotic metagenomes where no reference genomes exist.
Structural Discovery: Translate AI-flagged coding sequences into protein and pass the result to AlphaFold to look for never-before-seen 3D protein folds.
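
Structure predictors such as AlphaFold take amino-acid sequences, so a flagged DNA read has to be translated first. The sketch below is not part of PlatyGeno: it uses the standard genetic code and translates only the forward strand in frame 0, whereas a real workflow would also need to check the other reading frames and the reverse complement.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate frame 0 of a DNA string, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks ambiguous codons (e.g. N)
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAATAA"))  # MAK
```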

📚 API Reference

`platygeno.discover_genes()`

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `input_path` | `str` | Required | Path to sequence file. |
| `min_activation` | `float` | `5.0` | Minimum signal strength. |
| `rel_freq_max` | `float` | `1.0` | Rarity cap (`1.0` keeps all significant features). |
| `scan_end` | `int` | `None` | Last read index (`None` for end of file). |
| `top_n` | `int` | `-1` | Max features to return (`-1` returns all). |

📜 Primary References

1. PlatyGeno (This Package):

```bibtex
@software{PlatyGeno2026,
  author = {Khoa Tu Tran},
  title = {PlatyGeno: Unsupervised Significance Mapping via Evo 2},
  url = {https://github.com/khoatran1995/PlatyGeno},
  year = {2026}
}
```

2. Evo 2 Model: Arc Institute. (2026). Genome modeling and design across all domains of life with Evo 2. Nature.
