Genome analysis toolkit powered by Evo
EvoSeq
EvoSeq is a small Colab-friendly toolkit for preparing paired reference/mutant FASTA files and scoring variants with Evo2.
It is designed for workflows in which positive datasets may include a manifest.tsv,
negative datasets may have only paired FASTA files, and the Evo2 model should be
loaded once and then kept in memory for the rest of the Colab runtime.
Quick Start
1. Install
From PyPI:
pip install evoseq
For Evo2 scoring support:
pip install "evoseq[evo2]"
In Google Colab, Evo2 often needs runtime-specific GPU packages:
!pip uninstall -y torch torchvision torchaudio
!pip install -q torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
!pip install -q flash-attn==2.8.0.post2 --no-build-isolation
!pip install -q evo2
!pip install -q evoseq
Preprocessing only needs the base evoseq install. Evo2 scoring requires a GPU
runtime with torch, flash-attn, and evo2.
2. Prepare paired FASTA files
Reference FASTA:
>variant1
ACGTACGTACGT
Mutant FASTA:
>variant1
ACGTTCGTACGT
The FASTA IDs must match between reference and mutant files.
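Checking the ID requirement before preprocessing can save a failed run. The following is a minimal standard-library sketch, not part of EvoSeq: the helper names `read_fasta_ids` and `compare_fasta_ids` are hypothetical, and it assumes the record ID is the first whitespace-delimited word of each header line.

```python
import io


def read_fasta_ids(handle):
    """Return the record IDs (first word after '>') in file order."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            header = line[1:].strip()
            ids.append(header.split()[0] if header else "")
    return ids


def compare_fasta_ids(ref_handle, mut_handle):
    """Return (IDs only in the reference, IDs only in the mutant), both sorted."""
    ref_ids = set(read_fasta_ids(ref_handle))
    mut_ids = set(read_fasta_ids(mut_handle))
    return sorted(ref_ids - mut_ids), sorted(mut_ids - ref_ids)


# In-memory stand-ins for reference.fa and mutant.fa.
reference = io.StringIO(">variant1\nACGTACGTACGT\n>variant2\nACGT\n")
mutant = io.StringIO(">variant1\nACGTTCGTACGT\n")

missing_in_mutant, missing_in_reference = compare_fasta_ids(reference, mutant)
print(missing_in_mutant)     # IDs that have no mutant record
print(missing_in_reference)  # IDs that have no reference record
```

With real files, pass `open("reference.fa")` and `open("mutant.fa")` instead of the `StringIO` objects; both lists should be empty before preprocessing.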
3. Preprocess
from evoseq.preprocess import preprocess_files

evo_df, paths = preprocess_files(
    reference_fasta_path="reference.fa",
    mutant_fasta_path="mutant.fa",
)
By default, outputs are written next to the input FASTA files in
evoseq_preprocess_output/.
Generated files:
- evo2_pairs.tsv: one row per variant with ref_seq and mut_seq
- evo2_reference.fa
- evo2_mutant.fa
- evo2_all.fa
- preprocessing_report.tsv
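To sanity-check a pairs table before scoring, it can help to list where each mutant differs from its reference. This is a rough sketch using only the standard library. The `ref_seq` and `mut_seq` columns come from the description above; the `record_id` column name and the helper `substitution_positions` are assumptions made for illustration.

```python
import csv
import io


def substitution_positions(ref_seq, mut_seq):
    """Return 0-based positions where two equal-length sequences differ."""
    if len(ref_seq) != len(mut_seq):
        raise ValueError("sequences differ in length; not a pure substitution")
    return [i for i, (r, m) in enumerate(zip(ref_seq, mut_seq)) if r != m]


# Stand-in for evo2_pairs.tsv; the 'record_id' column name is an assumption.
pairs_tsv = io.StringIO(
    "record_id\tref_seq\tmut_seq\n"
    "variant1\tACGTACGTACGT\tACGTTCGTACGT\n"
)

diffs = {
    row["record_id"]: substitution_positions(row["ref_seq"], row["mut_seq"])
    for row in csv.DictReader(pairs_tsv, delimiter="\t")
}
print(diffs)
```

For the example pair from step 2, the only difference is the A to T change at 0-based position 4.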
4. Score variants with Evo2
from evoseq.scoring import score_pairs_file

result_df, result_paths = score_pairs_file(
    pairs_path=paths["pairs"],
    model_name="evo2_7b",
    batch_size=8,
)
By default, scoring outputs are written inside
evoseq_preprocess_output/evoseq_scoring_output/.
Generated files:
- evo2_variant_scores_unique.tsv
- evo2_variant_scores_manifest.tsv, when a manifest is available
- environment_info.tsv
- scoring_report.tsv
Reference sequences are scored once per unique sequence and reused. This is useful when many variants share the same reference window.
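The idea of scoring each unique reference once can be illustrated without Evo2. The sketch below is not EvoSeq's implementation: `score_unique` is a hypothetical helper, and `fake_score` is a placeholder (GC fraction) standing in for a real model call so that the example runs on a CPU.

```python
def score_unique(sequences, score_fn):
    """Score each distinct sequence once and map results back to input order."""
    cache = {}
    for seq in sequences:
        if seq not in cache:
            cache[seq] = score_fn(seq)
    return [cache[seq] for seq in sequences]


calls = []


def fake_score(seq):
    """Placeholder scorer: records the call and returns the GC fraction."""
    calls.append(seq)
    return (seq.count("G") + seq.count("C")) / len(seq)


# Three of the four variants share the same reference window.
refs = ["ACGTACGT", "ACGTACGT", "GGGGCCCC", "ACGTACGT"]
scores = score_unique(refs, fake_score)
print(len(calls), scores)
```

Four reference rows trigger only two scorer calls here; with a large model, the saving grows with the number of variants that share a window.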
Typical Workflow
reference.fa + mutant.fa
↓
preprocess
↓
evo2_pairs.tsv
↓
Evo2 scoring
↓
variant score tables
Example:
from evoseq.preprocess import preprocess_files
from evoseq.scoring import score_pairs_file

evo_df, paths = preprocess_files(
    reference_fasta_path="reference.fa",
    mutant_fasta_path="mutant.fa",
)

scores, outputs = score_pairs_file(
    pairs_path=paths["pairs"],
    model_name="evo2_7b",
)

print(scores.head())
Typical output structure:
project/
├── reference.fa
├── mutant.fa
└── evoseq_preprocess_output/
    ├── evo2_pairs.tsv
    ├── evo2_reference.fa
    ├── evo2_mutant.fa
    ├── evo2_all.fa
    ├── preprocessing_report.tsv
    └── evoseq_scoring_output/
        ├── evo2_variant_scores_unique.tsv
        ├── evo2_variant_scores_manifest.tsv
        ├── scoring_report.tsv
        └── environment_info.tsv
The important files are:
- evo2_pairs.tsv: the main table passed to Evo2 scoring. It contains matched reference and mutant sequences.
- evo2_variant_scores_unique.tsv: the main Evo2 scoring result table.
- evo2_variant_scores_manifest.tsv: the score table merged with manifest.tsv, when a manifest is available.
- preprocessing_report.tsv: records what was generated during preprocessing.
- scoring_report.tsv: records model name, device, batch size, elapsed time, and output paths.
- environment_info.tsv: records package, CUDA, and GPU versions for reproducibility.
Manifest Support
manifest.tsv is optional.
When present, metadata are merged by record_id:
from evoseq.preprocess import preprocess_files

evo_df, paths = preprocess_files(
    reference_fasta_path="reference.fa",
    mutant_fasta_path="mutant.fa",
    manifest_path="manifest.tsv",
)
You can also let EvoSeq look for a manifest automatically:
evo_df, paths = preprocess_files(
    reference_fasta_path="reference.fa",
    mutant_fasta_path="mutant.fa",
    manifest_path="auto",
)
When no manifest is provided, metadata are inferred from FASTA IDs when possible.
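Merging by record_id can be pictured as a left join from the pairs table onto the manifest. The sketch below shows that idea with the standard library only. It is not EvoSeq's code: the manifest columns `gene` and `label` are hypothetical, and keeping variants that lack a manifest row is an assumption about the desired behavior.

```python
import csv
import io


def left_merge_by_record_id(pair_rows, manifest_rows):
    """Attach manifest columns to each pair row; unmatched pair rows are kept."""
    manifest = {row["record_id"]: row for row in manifest_rows}
    merged = []
    for row in pair_rows:
        extra = manifest.get(row["record_id"], {})
        merged.append({**extra, **row})  # pair columns win on name clashes
    return merged


pairs = list(csv.DictReader(io.StringIO(
    "record_id\tref_seq\tmut_seq\n"
    "variant1\tACGT\tACTT\n"
    "variant2\tGGCC\tGGCA\n"
), delimiter="\t"))

# Hypothetical manifest: only variant1 has metadata.
manifest = list(csv.DictReader(io.StringIO(
    "record_id\tgene\tlabel\n"
    "variant1\tGENE_A\tpositive\n"
), delimiter="\t"))

merged = left_merge_by_record_id(pairs, manifest)
for row in merged:
    print(row["record_id"], row.get("gene"), row.get("label"))
```

A left join keeps every variant in the output, which matters for negative datasets that ship without a manifest.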
Folder Discovery
If a folder contains paired FASTA files, EvoSeq can discover them:
from evoseq.preprocess import preprocess_folder
evo_df, paths = preprocess_folder("test")
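The rules that preprocess_folder uses to recognize a pair are not described here. Purely as an illustration of what name-based discovery could look like, the sketch below assumes one file whose name contains "ref" and one whose name contains "mut"; the function `find_fasta_pair` and the accepted suffixes are hypothetical, not EvoSeq behavior.

```python
import tempfile
from pathlib import Path

# Assumed FASTA extensions; EvoSeq may accept a different set.
FASTA_SUFFIXES = {".fa", ".fasta", ".fna"}


def find_fasta_pair(folder):
    """Return (reference_path, mutant_path) guessed from file names, or raise."""
    fastas = [
        p for p in sorted(Path(folder).iterdir())
        if p.suffix.lower() in FASTA_SUFFIXES
    ]
    refs = [p for p in fastas if "ref" in p.stem.lower()]
    muts = [p for p in fastas if "mut" in p.stem.lower()]
    if len(refs) != 1 or len(muts) != 1:
        raise FileNotFoundError(
            f"expected one reference and one mutant FASTA in {folder}"
        )
    return refs[0], muts[0]


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "reference.fa").write_text(">variant1\nACGT\n")
    Path(tmp, "mutant.fa").write_text(">variant1\nACTT\n")
    Path(tmp, "notes.txt").write_text("not a FASTA file\n")
    ref_path, mut_path = find_fasta_pair(tmp)
    found = (ref_path.name, mut_path.name)

print(found)
```

Raising on ambiguous folders, rather than guessing, keeps a mislabeled file from silently being scored as the reference.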
Custom Output Directories
Preprocessing:
evo_df, paths = preprocess_files(
    reference_fasta_path="reference.fa",
    mutant_fasta_path="mutant.fa",
    output_dir="outputs/preprocessing",
)
Scoring:
from evoseq.scoring import score_pairs_file

result_df, result_paths = score_pairs_file(
    pairs_path="outputs/preprocessing/evo2_pairs.tsv",
    model_name="evo2_7b",
    output_dir="outputs/scoring",
)
Per-Base Log-Probabilities
EvoSeq can export per-base Evo2 log-probabilities for aligned sequences:
from evoseq.scoring import export_perbase_logprobs

path = export_perbase_logprobs(
    fasta_path="representative_perbase.fasta",
    model_name="evo2_7b",
    center=4096,
    half_window=320,
)
By default, this writes:
evoseq_perbase_output/perbase_logprobs.tsv
The output can be used to visualize:
- reference vs mutant tracks
- Δ log-probability profiles
- local sequence effects
- long-range context effects
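A Δ log-probability profile is the per-position difference between the mutant and reference tracks inside the exported window. The sketch below illustrates both steps under stated assumptions: that `center` is a 0-based index, that the window spans `center - half_window` through `center + half_window` inclusive, and that the sequence is 8192 bases long. The helper names and the toy log-probability values are made up for illustration.

```python
def window_bounds(center, half_window, seq_len):
    """Clip [center - half_window, center + half_window] to the sequence.

    Returns (start, end) with an exclusive end, ready for slicing.
    """
    start = max(0, center - half_window)
    end = min(seq_len, center + half_window + 1)
    return start, end


def delta_profile(ref_logprobs, mut_logprobs):
    """Per-position mutant minus reference log-probability."""
    if len(ref_logprobs) != len(mut_logprobs):
        raise ValueError("tracks must be aligned to the same length")
    return [m - r for r, m in zip(ref_logprobs, mut_logprobs)]


# seq_len is a hypothetical value; center and half_window match the call above.
start, end = window_bounds(center=4096, half_window=320, seq_len=8192)
print(start, end, end - start)

# Toy tracks, not real Evo2 output.
ref_track = [-1.0, -0.5, -2.0, -0.25]
mut_track = [-1.0, -1.5, -2.0, -0.75]
delta = delta_profile(ref_track, mut_track)
print(delta)
```

Under these assumptions the window covers 641 positions, and negative Δ values mark positions where the mutant context lowers the model's log-probability.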
Model Handling
EvoSeq caches the loaded Evo2 model inside the Python process:
from evoseq.scoring import Evo2Scorer
scorer = Evo2Scorer(model_name="evo2_7b", device="cuda:0")
scores = scorer.score_sequences(["ACGTACGT"])
Calling another scoring function with the same model reuses it.
Attempting to load a different Evo2 model in the same runtime raises an explicit error by default, because loading multiple large models often exhausts Colab GPU memory. Restart the runtime when switching from 7B to 20B.
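The load-once behavior described above resembles a process-level cache keyed by model name. The following sketch shows that pattern with a fake loader so it runs without a GPU. It is not EvoSeq's code: `get_model`, the `allow_switch` parameter, and the error message are all hypothetical.

```python
# Process-level cache: at most one model is held at a time.
_loaded = {"name": None, "model": None}


def get_model(model_name, loader, allow_switch=False):
    """Return the cached model, loading it on first use.

    Raises RuntimeError if a different model is already loaded,
    unless allow_switch is True.
    """
    if _loaded["model"] is not None:
        if _loaded["name"] == model_name:
            return _loaded["model"]
        if not allow_switch:
            raise RuntimeError(
                f"{_loaded['name']} is already loaded; restart the runtime "
                f"before loading {model_name}"
            )
    _loaded["model"] = loader(model_name)
    _loaded["name"] = model_name
    return _loaded["model"]


load_calls = []


def fake_loader(name):
    """Stand-in for an expensive model load."""
    load_calls.append(name)
    return object()


first = get_model("evo2_7b", fake_loader)
second = get_model("evo2_7b", fake_loader)  # served from the cache

try:
    get_model("evo2_20b", fake_loader)
    switch_error = None
except RuntimeError as exc:
    switch_error = str(exc)

print(first is second, load_calls, switch_error)
```

Failing loudly on a model switch is cheaper than discovering an out-of-memory error partway through a scoring run.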
Common model names:
- evo2_7b
- evo2_7b_base
- evo2_20b
For local model weights:
from evoseq.scoring import score_evo2_pairs

score_evo2_pairs(
    base_dir=".",
    model_name="evo2_20b",
    local_path="/content/drive/MyDrive/Models/evo2_20b.pt",
)
TOML Config
Copy evoseq.example.toml, edit the input paths and model, then run:
from evoseq import run_from_config
outputs = run_from_config("evoseq.example.toml")
or from the command line:
evoseq-run evoseq.example.toml
Reproducibility
EvoSeq writes small TSV reports that can be cited in a methods section and used to reproduce a run.
Reports include:
- input paths and output paths
- number of variants and unique reference sequences
- model name, batch size, device, and elapsed time
- Python, PyTorch, CUDA, GPU, NumPy, pandas, Biopython, and Evo2 versions
Save these files alongside each analysis directory.
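For analyses that run outside EvoSeq, a similar environment record can be produced with the standard library. The sketch below is an approximation, not a copy of environment_info.tsv: the `key`/`value` column names and the helper functions are assumptions, and it skips CUDA and GPU details, which would require torch.

```python
import csv
import io
import platform
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def collect_environment(packages):
    """Gather interpreter, platform, and package versions as (key, value) rows."""
    rows = [
        ("python", sys.version.split()[0]),
        ("platform", platform.platform()),
    ]
    rows += [(name, package_version(name)) for name in packages]
    return rows


def write_tsv(rows, handle):
    """Write rows as a two-column TSV with a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["key", "value"])
    writer.writerows(rows)


env_rows = collect_environment(["numpy", "pandas", "biopython", "torch", "evo2"])
buffer = io.StringIO()
write_tsv(env_rows, buffer)
print(buffer.getvalue())
```

Replacing the `StringIO` buffer with an open file in the analysis directory keeps the record next to the results it describes.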
Development
For local development from this repository:
git clone https://github.com/mizomizo1/EvoSeq.git
cd EvoSeq
pip install -e .
For local development with Evo2 extras:
pip install -e ".[evo2]"
Install a specific GitHub release directly:
pip install "git+https://github.com/mizomizo1/EvoSeq.git@v0.1.0"
Run tests without Evo2, torch, or flash-attn:
python -m unittest discover -s tests -v
These tests cover preprocessing, folder discovery, score-table export with a fake
scorer, and the missing Evo2 dependency message. Real Evo2 scoring requires a
Colab GPU runtime with torch, flash-attn, and evo2 installed.