
Sylphy 🧬


Lightweight Python toolkit for protein sequence representation — transform sequences into numerical formats for machine learning and bioinformatics.

Three core components:

  • Classical encoders — one-hot, ordinal, frequency, k-mers, physicochemical, FFT
  • Embedding extraction — ESM2, ProtT5, ProtBERT, Ankh2, Mistral-Prot, ESM-C
  • Dimensionality reduction — PCA, UMAP, t-SNE, and more
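To make the "classical encoders" idea concrete, here is a minimal, library-independent sketch of one-hot encoding, which maps each residue to a 20-dimensional indicator vector (Sylphy's own encoders additionally handle padding, unknown residues, and batching):

```python
# One-hot encode a protein sequence over the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence: str) -> list[list[int]]:
    """Return an L x 20 matrix with exactly one 1 per row."""
    matrix = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1
        matrix.append(row)
    return matrix

encoded = one_hot("MKTAYIAKQR")
print(len(encoded), len(encoded[0]))  # 10 rows, 20 columns
```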

Quick Example

import pandas as pd
from sylphy.embedding_extractor import create_embedding

# Extract embeddings from protein sequences
df = pd.DataFrame({"sequence": ["MKTAYIAKQR", "GAVLIMPFWK", "PEPTIDE"]})

embedder = create_embedding(
    model_name="facebook/esm2_t6_8M_UR50D",
    dataset=df,
    column_seq="sequence",
    name_device="cuda",
    precision="fp16"
)

embedder.run_process(batch_size=8, pool="mean")
embeddings = embedder.coded_dataset  # pandas DataFrame with embeddings
embedder.export_encoder("embeddings.parquet")

Installation

Recommended: Use a virtual environment to isolate dependencies:

# Create virtual environment
python -m venv venv

# Activate (Linux/macOS)
source venv/bin/activate

# Activate (Windows)
venv\Scripts\activate

Install with pip:

# Basic installation
pip install sylphy

# With optional variants
pip install 'sylphy[embeddings,parquet]'

The basic installation includes classical sequence encoders and core utilities. For additional features, install optional variants:

Installation Variants

  • embeddings — Adds PyTorch, Transformers, and the ESM-C SDK for protein language model embedding extraction (ESM2, ProtT5, ProtBERT, Ankh2, Mistral-Prot, ESM-C).
  • parquet — Enables Parquet file format support via PyArrow and FastParquet for efficient storage and loading of large datasets.
  • reductions — Adds UMAP and ClustPy for advanced non-linear dimensionality reduction methods. Requires a C++ compiler and Python development headers to build ClustPy.
  • all — Installs all optional dependencies (embeddings + parquet + reductions). Requires compilation tools for ClustPy.
  • tests — Installs pytest and pytest-cov for running the test suite with coverage reports.
  • dev — Development tools (pytest, mypy, ruff, taskipy, and build utilities) for contributing to Sylphy.

Example installations:

# Embeddings + Parquet support (from a downloaded wheel; quote so the
# shell does not expand the brackets)
pip install 'sylphy-<version>-py3-none-any.whl[embeddings,parquet]'

# Full installation with all features
pip install 'sylphy-<version>-py3-none-any.whl[all]'

Requirements:

  • Python 3.11–3.12
  • Optional: CUDA for GPU-accelerated embedding extraction
  • For reductions variant: C++ compiler and Python development headers
    # Ubuntu/Debian
    sudo apt-get install build-essential python3-dev
    
    # Fedora/RHEL
    sudo dnf install gcc gcc-c++ python3-devel
    

Usage

Sequence Encoders

Transform sequences using classical encoding methods:

from sylphy.sequence_encoder import create_encoder

encoder = create_encoder(
    "one_hot",  # or: ordinal, kmers, frequency, physicochemical, fft
    dataset=df,
    sequence_column="sequence",
    max_length=1024
)

encoder.run_process()
encoded = encoder.coded_dataset
encoder.export_encoder("encoded.csv")

FFT encoding requires numeric input (use a two-stage pipeline):

# Stage 1: physicochemical properties
phys = create_encoder("physicochemical", dataset=df, name_property="ANDN920101")
phys.run_process()

# Stage 2: FFT on numeric matrix
fft = create_encoder("fft", dataset=phys.coded_dataset)
fft.run_process()
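Conceptually, the two-stage pipeline turns each sequence into a numeric property signal and then takes its frequency spectrum. A self-contained numpy sketch of that idea (the property values below are made-up illustrations, not the real ANDN920101 AAindex scale):

```python
import numpy as np

# Stage 1 analogue: map residues to a numeric property value.
PROPERTY = {"M": 1.9, "K": -3.9, "T": -0.7, "A": 1.8, "Y": -1.3,
            "I": 4.5, "Q": -3.5, "R": -4.5}
signal = np.array([PROPERTY[aa] for aa in "MKTAYIAKQR"])

# Stage 2 analogue: one-sided FFT magnitude spectrum of the signal.
spectrum = np.abs(np.fft.rfft(signal))
print(spectrum.shape)  # (6,) for a length-10 signal
```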

Embedding Extraction

Extract embeddings from pretrained protein language models:

from sylphy.embedding_extractor import create_embedding

embedder = create_embedding(
    model_name="facebook/esm2_t6_8M_UR50D",
    dataset=df,
    column_seq="sequence",
    name_device="cuda",
    precision="fp16",  # fp32, fp16, or bf16
    oom_backoff=True  # auto-reduce batch size on OOM
)

embedder.run_process(
    max_length=1024,
    batch_size=16,
    pool="mean"  # mean, cls, or eos
)

Supported models: ESM2 • Ankh2 • ProtT5 • ProtBERT • Mistral-Prot • ESM-C
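The pooling options reduce a per-token embedding matrix to one fixed-size vector per sequence. A minimal numpy sketch of what each strategy computes (not Sylphy internals, just the idea behind each option):

```python
import numpy as np

# A per-token embedding matrix of shape (seq_len, hidden_dim).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((12, 8))  # 12 tokens, 8-dim embeddings

mean_pooled = tokens.mean(axis=0)  # "mean": average over all tokens
cls_pooled = tokens[0]             # "cls": first (classification) token
eos_pooled = tokens[-1]            # "eos": last (end-of-sequence) token

print(mean_pooled.shape)  # (8,)
```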

Dimensionality Reduction

Reduce high-dimensional embeddings for visualization:

from sylphy.reductions import reduce_dimensionality

model, reduced = reduce_dimensionality(
    method="umap",  # pca, truncated_svd, umap, tsne, isomap, etc.
    dataset=embeddings,
    n_components=2,
    random_state=42,
    return_type="numpy"  # numpy or pandas
)
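For intuition, PCA-style reduction projects centered data onto its top singular vectors; `reduce_dimensionality` wraps scikit-learn (and UMAP/ClustPy) implementations of this family. A self-contained numpy sketch of the linear case:

```python
import numpy as np

# Minimal PCA via SVD: reduce 64-dim "embeddings" to 2 components.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 64))   # 100 embeddings, 64 dims

X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
reduced = X_centered @ Vt[:2].T      # project onto top 2 components

print(reduced.shape)  # (100, 2)
```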

Command-Line Interface

# Extract embeddings
sylphy get-embedding run \
  --model facebook/esm2_t6_8M_UR50D \
  --input-data sequences.csv \
  --sequence-identifier sequence \
  --output embeddings.parquet \
  --device cuda --precision fp16 --batch-size 16

# Encode sequences
sylphy encode-sequences run \
  --encoder one_hot \
  --input-data sequences.csv \
  --sequence-identifier sequence \
  --output encoded.csv

# Manage cache
sylphy cache ls        # List cached files
sylphy cache stats     # Cache statistics
sylphy cache prune     # Prune cache (remove old files or reduce size)
sylphy cache rm        # Remove files by pattern or age
sylphy cache clear     # Clear entire cache

# Version info
sylphy --version

Configuration

Cache Management

Models and intermediate files are cached at:

  • Linux: ~/.cache/sylphy
  • macOS: ~/Library/Caches/sylphy
  • Windows: %LOCALAPPDATA%\sylphy

Programmatic control:

from sylphy import get_config, set_cache_root, temporary_cache_root

# View current cache location
cfg = get_config()
print(cfg.cache_paths.cache_root)

# Change cache directory
set_cache_root("/custom/cache/path")

# Temporary override
with temporary_cache_root("/tmp/cache"):
    # operations use temporary cache
    pass

Environment variables:

export SYLPHY_CACHE_ROOT=/custom/cache     # Override cache location
export SYLPHY_DEVICE=cuda                  # Force device (cpu/cuda)
export SYLPHY_LOG_FILE=/tmp/sylphy.log     # Enable file logging
export SYLPHY_SEED=42                      # Random seed

Model Registry

Register custom models and aliases:

from sylphy import ModelSpec, register_model, register_alias, resolve_model

# Register a model
register_model(ModelSpec(
    name="esm2_small",
    provider="huggingface",
    ref="facebook/esm2_t6_8M_UR50D"
))

# Create alias
register_alias("my_model", "esm2_small")

# Resolve to path
path = resolve_model("my_model")

Override model paths via environment:

export SYLPHY_MODEL_ESM2_SMALL=/path/to/local/model

Logging

Optional unified logging configuration:

from sylphy.logging import setup_logger

setup_logger(name="sylphy", level="INFO")  # DEBUG, INFO, WARNING, ERROR
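For environments where you configure logging yourself, the equivalent pattern with the standard library looks like this (an assumption that Sylphy emits through standard `logging`, as its `get_logger` helper suggests):

```python
import logging

# Configure the "sylphy" logger by hand instead of calling setup_logger.
logger = logging.getLogger("sylphy")
logger.setLevel(logging.INFO)  # DEBUG, INFO, WARNING, ERROR

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logger.addHandler(handler)

logger.info("logging configured")
```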

Examples

The examples/ directory contains complete working examples:

  • 1_quick_start_encoders.ipynb — Jupyter notebook demonstrating all classical encoders
  • 2_simple_demo_embedding_extractor.py — Extract embeddings using all supported model families
  • 3_quick_start_reduction_process.ipynb — Dimensionality reduction workflows
  • 4_demo_embedding_different_layers.py — Layer selection and aggregation strategies
  • encoder_sequences_using_sylphy.py — Batch encoding with multiple encoder types
  • extract_embedding_using_sylphy.py — Production-ready embedding extraction script

Run examples:

# Python scripts
python examples/2_simple_demo_embedding_extractor.py

# Jupyter notebooks
jupyter notebook examples/1_quick_start_encoders.ipynb

Development

Setup

Clone the repository and install in editable mode:

git clone https://github.com/kren-ai-lab/sylphy.git
cd sylphy

# Install with development dependencies
pip install -e ".[dev]"

# Or install with all features for testing
pip install -e ".[all,dev]"

Note: The -e flag installs in editable mode, meaning changes to the source code take effect immediately without reinstalling.

Testing

# Run tests
pytest                # All tests (offline, mocked)
pytest -v             # Verbose
pytest --cov=sylphy   # With coverage

# Using taskipy shortcuts
uv run task test      # Run tests (quiet)
uv run task test-v    # Run tests (verbose)
uv run task test-cov  # Run tests with coverage report

Code Quality

# Linting and formatting
ruff check sylphy/    # Lint
ruff format sylphy/   # Format
mypy sylphy/          # Type check

# Using taskipy shortcuts
uv run task lint      # Lint check
uv run task lint-fix  # Lint and auto-fix
uv run task format    # Format code

Architecture

  • Fully typed with annotations
  • NumPy-style docstrings
  • Factory pattern for all components
  • Lazy imports for heavy dependencies
  • Offline tests with mocked PyTorch/HF

API Reference

Main imports:

from sylphy import (
    # Configuration / registry
    get_config, set_cache_root, temporary_cache_root,
    ModelSpec, register_model, resolve_model,
)
from sylphy.sequence_encoder import create_encoder
from sylphy.embedding_extractor import create_embedding
from sylphy.reductions import reduce_dimensionality
from sylphy.logging import setup_logger, get_logger

See CLAUDE.md for detailed architecture documentation.

License

GPL-3.0-only — See LICENSE for details.

Acknowledgements

Built with:

  • Hugging Face Transformers ecosystem
  • Meta ESM-C SDK
  • scikit-learn • PyTorch • UMAP • ClustPy

Developed by KREN AI Lab at Universidad de Magallanes, Chile.
