Sylphy 🧬
Lightweight Python toolkit for protein sequence representation — transform sequences into numerical formats for machine learning and bioinformatics.
Three core components:
- Classical encoders — one-hot, ordinal, frequency, k-mers, physicochemical, FFT
- Embedding extraction — ESM2, ProtT5, ProtBERT, Ankh2, Mistral-Prot, ESM-C
- Dimensionality reduction — PCA, UMAP, t-SNE, and more
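To illustrate what a classical encoder produces, here is a minimal one-hot encoding in plain NumPy. This is an independent sketch of the concept, not Sylphy's own implementation:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence: str, max_length: int) -> np.ndarray:
    """Encode a protein sequence as a (max_length, 20) one-hot matrix,
    zero-padded on the right."""
    matrix = np.zeros((max_length, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence[:max_length]):
        matrix[pos, AA_INDEX[aa]] = 1.0
    return matrix

encoded = one_hot_encode("MKTAYIAKQR", max_length=12)
print(encoded.shape)       # (12, 20)
print(int(encoded.sum()))  # 10 — one active column per residue
```

Each row is a residue position; padding rows stay all-zero, which is why a fixed max_length is needed for batching.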
Quick Example
import pandas as pd
from sylphy.embedding_extractor import create_embedding
# Extract embeddings from protein sequences
df = pd.DataFrame({"sequence": ["MKTAYIAKQR", "GAVLIMPFWK", "PEPTIDE"]})
embedder = create_embedding(
model_name="facebook/esm2_t6_8M_UR50D",
dataset=df,
column_seq="sequence",
name_device="cuda",
precision="fp16"
)
embedder.run_process(batch_size=8, pool="mean")
embeddings = embedder.coded_dataset # pandas DataFrame with embeddings
embedder.export_encoder("embeddings.parquet")
Installation
Recommended: Use a virtual environment to isolate dependencies:
# Create virtual environment
python -m venv venv
# Activate (Linux/macOS)
source venv/bin/activate
Install with pip:
# Basic installation
pip install sylphy
# With optional variants
pip install 'sylphy[embeddings,parquet]'
The basic installation includes classical sequence encoders and core utilities. For additional features, install optional variants:
Installation Variants
| Variant | Description |
|---|---|
| embeddings | Adds PyTorch, Transformers, and the ESM-C SDK for protein language model embedding extraction (ESM2, ProtT5, ProtBERT, Ankh2, Mistral-Prot, ESM-C). |
| parquet | Enables Parquet file format support via PyArrow and FastParquet for efficient storage and loading of large datasets. |
| reductions | Adds UMAP and ClustPy for advanced non-linear dimensionality reduction methods. Requires a C++ compiler and Python development headers to build ClustPy. |
| all | Installs all optional dependencies (embeddings + parquet + reductions). Requires compilation tools for ClustPy. |
| tests | Installs pytest and pytest-cov for running the test suite with coverage reports. |
| dev | Development tools including pytest, mypy, ruff, taskipy, and build utilities for contributing to Sylphy. |
Example installations:
# Embeddings + Parquet support
pip install sylphy-<version>-py3-none-any.whl[embeddings,parquet]
# Full installation with all features
pip install sylphy-<version>-py3-none-any.whl[all]
Requirements:
- Python 3.11–3.12
- Optional: CUDA for GPU-accelerated embedding extraction
- For the reductions variant: C++ compiler and Python development headers

# Ubuntu/Debian
sudo apt-get install build-essential python3-dev

# Fedora/RHEL
sudo dnf install gcc gcc-c++ python3-devel
Usage
Sequence Encoders
Transform sequences using classical encoding methods:
from sylphy.sequence_encoder import create_encoder
encoder = create_encoder(
"one_hot", # or: ordinal, kmers, frequency, physicochemical, fft
dataset=df,
sequence_column="sequence",
max_length=1024
)
encoder.run_process()
encoded = encoder.coded_dataset
encoder.export_encoder("encoded.csv")
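As a second illustration of the classical methods, the idea behind k-mer encoding can be sketched in plain Python (again an independent sketch, not Sylphy's implementation): count overlapping substrings of length k and normalize to relative frequencies.

```python
from collections import Counter

def kmer_frequencies(sequence: str, k: int = 2) -> dict[str, float]:
    """Relative frequency of each overlapping k-mer in the sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

freqs = kmer_frequencies("MKTAYIAKQR", k=2)
print(len(freqs))  # 9 distinct 2-mers in a 10-residue sequence
```

A 10-residue sequence yields 9 overlapping 2-mers; the frequencies sum to 1, giving a fixed-vocabulary feature vector regardless of sequence length.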
FFT encoding requires numeric input (use a two-stage pipeline):
# Stage 1: physicochemical properties
phys = create_encoder("physicochemical", dataset=df, name_property="ANDN920101")
phys.run_process()
# Stage 2: FFT on numeric matrix
fft = create_encoder("fft", dataset=phys.coded_dataset)
fft.run_process()
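The reason for the two-stage pipeline can be shown without Sylphy: FFT operates on a numeric signal, so residues must first be mapped to a per-residue property. The property values below are made up for illustration only:

```python
import numpy as np

# Hypothetical per-residue property values (illustrative only)
PROP = {"M": 0.64, "K": -0.99, "T": 0.26, "A": 0.62, "Y": 1.88,
        "I": 1.38, "Q": -0.85, "R": -2.53}

# Stage 1: sequence -> numeric signal
signal = np.array([PROP[aa] for aa in "MKTAYIAKQR"])

# Stage 2: magnitude spectrum of the real-valued FFT
spectrum = np.abs(np.fft.rfft(signal))
print(spectrum.shape)  # (6,) — n//2 + 1 frequency bins for n = 10
```

The magnitude spectrum captures periodic patterns in the property signal, which is why stage 1 must produce numbers rather than letters.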
Embedding Extraction
Extract embeddings from pretrained protein language models:
from sylphy.embedding_extractor import create_embedding
embedder = create_embedding(
model_name="facebook/esm2_t6_8M_UR50D",
dataset=df,
column_seq="sequence",
name_device="cuda",
precision="fp16", # fp32, fp16, or bf16
oom_backoff=True # auto-reduce batch size on OOM
)
embedder.run_process(
max_length=1024,
batch_size=16,
pool="mean" # mean, cls, or eos
)
Supported models: ESM2 • Ankh2 • ProtT5 • ProtBERT • Mistral-Prot • ESM-C
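The pool options reduce per-token embeddings to one vector per sequence. Roughly, on a dummy token matrix (this sketches the general semantics of mean/cls/eos pooling, not Sylphy's exact token handling):

```python
import numpy as np

# Dummy per-token embeddings: (num_tokens, hidden_dim)
tokens = np.arange(24, dtype=np.float32).reshape(6, 4)

mean_pooled = tokens.mean(axis=0)  # average over all token positions
cls_pooled = tokens[0]             # first (CLS/BOS) token only
eos_pooled = tokens[-1]            # last (EOS) token only

print(mean_pooled)  # [10. 11. 12. 13.]
print(cls_pooled)   # [0. 1. 2. 3.]
print(eos_pooled)   # [20. 21. 22. 23.]
```

All three yield a fixed-size vector per sequence, which is what makes variable-length proteins comparable downstream.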
Dimensionality Reduction
Reduce high-dimensional embeddings for visualization:
from sylphy.reductions import reduce_dimensionality
model, reduced = reduce_dimensionality(
method="umap", # pca, truncated_svd, umap, tsne, isomap, etc.
dataset=embeddings,
n_components=2,
random_state=42,
return_type="numpy" # numpy or pandas
)
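For reference, the same kind of reduction can be done directly with scikit-learn, which the acknowledgements list among Sylphy's foundations (random data stands in for real embeddings here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 320))  # 50 sequences, 320-dim embeddings

pca = PCA(n_components=2, random_state=42)
reduced = pca.fit_transform(X)
print(reduced.shape)  # (50, 2)
```

Two components are typical for scatter-plot visualization; keep more components when the reduced data feeds a downstream model.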
Command-Line Interface
# Extract embeddings
sylphy get-embedding run \
--model facebook/esm2_t6_8M_UR50D \
--input-data sequences.csv \
--sequence-identifier sequence \
--output embeddings.parquet \
--device cuda --precision fp16 --batch-size 16
# Encode sequences
sylphy encode-sequences run \
--encoder one_hot \
--input-data sequences.csv \
--sequence-identifier sequence \
--output encoded.csv
# Manage cache
sylphy cache ls # List cached files
sylphy cache stats # Cache statistics
sylphy cache prune # Prune cache (remove old files or reduce size)
sylphy cache rm # Remove files by pattern or age
sylphy cache clear # Clear entire cache
# Version info
sylphy --version
Configuration
Cache Management
Models and intermediate files are cached at:
- Linux: ~/.cache/sylphy
- macOS: ~/Library/Caches/sylphy
- Windows: %LOCALAPPDATA%\sylphy
Programmatic control:
from sylphy import get_config, set_cache_root, temporary_cache_root
# View current cache location
cfg = get_config()
print(cfg.cache_paths.cache_root)
# Change cache directory
set_cache_root("/custom/cache/path")
# Temporary override
with temporary_cache_root("/tmp/cache"):
# operations use temporary cache
pass
Environment variables:
export SYLPHY_CACHE_ROOT=/custom/cache # Override cache location
export SYLPHY_DEVICE=cuda # Force device (cpu/cuda)
export SYLPHY_LOG_FILE=/tmp/sylphy.log # Enable file logging
export SYLPHY_SEED=42 # Random seed
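Overrides of this kind typically follow an env-var-first resolution order; a hypothetical sketch (resolve_cache_root and its default are illustrative, not Sylphy's actual internals):

```python
import os
from pathlib import Path

def resolve_cache_root() -> Path:
    """Hypothetical: an environment override wins, else the platform default."""
    override = os.environ.get("SYLPHY_CACHE_ROOT")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "sylphy"  # Linux default listed above

os.environ["SYLPHY_CACHE_ROOT"] = "/tmp/sylphy-cache"
print(resolve_cache_root())  # /tmp/sylphy-cache
```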
Model Registry
Register custom models and aliases:
from sylphy import ModelSpec, register_model, register_alias, resolve_model
# Register a model
register_model(ModelSpec(
name="esm2_small",
provider="huggingface",
ref="facebook/esm2_t6_8M_UR50D"
))
# Create alias
register_alias("my_model", "esm2_small")
# Resolve to path
path = resolve_model("my_model")
Override model paths via environment:
export SYLPHY_MODEL_ESM2_SMALL=/path/to/local/model
Logging
Optional unified logging configuration:
from sylphy.logging import setup_logger
setup_logger(name="sylphy", level="INFO") # DEBUG, INFO, WARNING, ERROR
Examples
The examples/ directory contains complete working examples:
- 1_quick_start_encoders.ipynb — Jupyter notebook demonstrating all classical encoders
- 2_simple_demo_embedding_extractor.py — Extract embeddings using all supported model families
- 3_quick_start_reduction_process.ipynb — Dimensionality reduction workflows
- 4_demo_embedding_different_layers.py — Layer selection and aggregation strategies
- encoder_sequences_using_sylphy.py — Batch encoding with multiple encoder types
- extract_embedding_using_sylphy.py — Production-ready embedding extraction script
Run examples:
# Python scripts
python examples/2_simple_demo_embedding_extractor.py
# Jupyter notebooks
jupyter notebook examples/1_quick_start_encoders.ipynb
Development
Setup
Clone the repository and install in editable mode:
git clone https://github.com/kren-ai-lab/sylphy.git
cd sylphy
# Install with development dependencies
pip install -e ".[dev]"
# Or install with all features for testing
pip install -e ".[all,dev]"
Note: The -e flag installs in editable mode, meaning changes to the source code take effect immediately without reinstalling.
Testing
# Run tests
pytest # All tests (offline, mocked)
pytest -v # Verbose
pytest --cov=sylphy # With coverage
# Using taskipy shortcuts
uv run task test # Run tests (quiet)
uv run task test-v # Run tests (verbose)
uv run task test-cov # Run tests with coverage report
Code Quality
# Linting and formatting
ruff check sylphy/ # Lint
ruff format sylphy/ # Format
mypy sylphy/ # Type check
# Using taskipy shortcuts
uv run task lint # Lint check
uv run task lint-fix # Lint and auto-fix
uv run task format # Format code
Architecture
- Fully typed with annotations
- NumPy-style docstrings
- Factory pattern for all components
- Lazy imports for heavy dependencies
- Offline tests with mocked PyTorch/HF
API Reference
Main imports:
from sylphy import (
# Configuration / registry
get_config, set_cache_root, temporary_cache_root,
ModelSpec, register_model, resolve_model,
)
from sylphy.sequence_encoder import create_encoder
from sylphy.embedding_extractor import create_embedding
from sylphy.reductions import reduce_dimensionality
from sylphy.logging import setup_logger, get_logger
See CLAUDE.md for detailed architecture documentation.
License
GPL-3.0-only — See LICENSE for details.
Acknowledgements
Built with:
- Hugging Face Transformers ecosystem
- Meta ESM-C SDK
- scikit-learn • PyTorch • UMAP • ClustPy
Developed by KREN AI Lab at Universidad de Magallanes, Chile.