
Sylphy 🧬


Lightweight Python toolkit for protein sequence representation — transform sequences into numerical formats for machine learning and bioinformatics.

Three core components:

  • Classical encoders — one-hot, ordinal, frequency, k-mers, physicochemical, FFT
  • Embedding extraction — ESM2, ProtT5, ProtBERT, Ankh2, Mistral-Prot, ESM-C
  • Dimensionality reduction — PCA, UMAP, t-SNE, and more
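To make the "classical encoders" idea concrete, here is a minimal, library-independent sketch of one-hot encoding, which maps each residue to a 20-dimensional indicator vector (Sylphy's own encoders additionally handle padding, unknown residues, and batching):

```python
# One-hot encode a protein sequence over the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence: str) -> list[list[int]]:
    """Return an L x 20 matrix with exactly one 1 per row."""
    matrix = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1
        matrix.append(row)
    return matrix

encoded = one_hot("MKTAYIAKQR")
print(len(encoded), len(encoded[0]))  # 10 rows, 20 columns
```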

Quick Example

import pandas as pd
from sylphy.embedding_extractor import create_embedding

# Extract embeddings from protein sequences
df = pd.DataFrame({"sequence": ["MKTAYIAKQR", "GAVLIMPFWK", "PEPTIDE"]})

embedder = create_embedding(
    model_name="facebook/esm2_t6_8M_UR50D",
    dataset=df,
    column_seq="sequence",
    name_device="cuda",
    precision="fp16"
)

embedder.run_process(batch_size=8, pool="mean")
embeddings = embedder.coded_dataset  # pandas DataFrame with embeddings
embedder.export_encoder("embeddings.parquet")

Installation

Recommended: Use a virtual environment to isolate dependencies:

# Create virtual environment
python -m venv venv

# Activate (Linux/macOS)
source venv/bin/activate

# Activate (Windows)
venv\Scripts\activate

Install with pip:

# Basic installation
pip install sylphy

# With optional variants
pip install 'sylphy[embeddings,parquet]'

The basic installation includes classical sequence encoders and core utilities. For additional features, install optional variants:

Installation Variants

  • embeddings — Adds PyTorch, Transformers, and the ESM-C SDK for protein language model embedding extraction (ESM2, ProtT5, ProtBERT, Ankh2, Mistral-Prot, ESM-C).
  • parquet — Enables Parquet file format support via PyArrow and FastParquet for efficient storage and loading of large datasets.
  • reductions — Adds UMAP and ClustPy for advanced non-linear dimensionality reduction methods. Requires a C++ compiler and Python development headers to build ClustPy.
  • all — Installs all optional dependencies (embeddings + parquet + reductions). Requires compilation tools for ClustPy.
  • tests — Installs pytest and pytest-cov for running the test suite with coverage reports.
  • dev — Development tools (pytest, mypy, ruff, taskipy, and build utilities) for contributing to Sylphy.

Example installations:

# Embeddings + Parquet support (from a downloaded wheel; quote so the
# shell does not expand the brackets)
pip install 'sylphy-<version>-py3-none-any.whl[embeddings,parquet]'

# Full installation with all features
pip install 'sylphy-<version>-py3-none-any.whl[all]'

Requirements:

  • Python 3.11–3.12
  • Optional: CUDA for GPU-accelerated embedding extraction
  • For reductions variant: C++ compiler and Python development headers
    # Ubuntu/Debian
    sudo apt-get install build-essential python3-dev
    
    # Fedora/RHEL
    sudo dnf install gcc gcc-c++ python3-devel
    

Usage

Sequence Encoders

Transform sequences using classical encoding methods:

from sylphy.sequence_encoder import create_encoder

encoder = create_encoder(
    "one_hot",  # or: ordinal, kmers, frequency, physicochemical, fft
    dataset=df,
    sequence_column="sequence",
    max_length=1024
)

encoder.run_process()
encoded = encoder.coded_dataset
encoder.export_encoder("encoded.csv")

FFT encoding requires numeric input (use a two-stage pipeline):

# Stage 1: physicochemical properties
phys = create_encoder("physicochemical", dataset=df, name_property="ANDN920101")
phys.run_process()

# Stage 2: FFT on numeric matrix
fft = create_encoder("fft", dataset=phys.coded_dataset)
fft.run_process()
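Conceptually, the two-stage pipeline turns each sequence into a numeric property signal and then takes its frequency spectrum. A self-contained numpy sketch of that idea (the property values below are made-up illustrations, not the real ANDN920101 AAindex scale):

```python
import numpy as np

# Stage 1 analogue: map residues to a numeric property value.
PROPERTY = {"M": 1.9, "K": -3.9, "T": -0.7, "A": 1.8, "Y": -1.3,
            "I": 4.5, "Q": -3.5, "R": -4.5}
signal = np.array([PROPERTY[aa] for aa in "MKTAYIAKQR"])

# Stage 2 analogue: one-sided FFT magnitude spectrum of the signal.
spectrum = np.abs(np.fft.rfft(signal))
print(spectrum.shape)  # (6,) for a length-10 signal
```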

Embedding Extraction

Extract embeddings from pretrained protein language models:

from sylphy.embedding_extractor import create_embedding

embedder = create_embedding(
    model_name="facebook/esm2_t6_8M_UR50D",
    dataset=df,
    column_seq="sequence",
    name_device="cuda",
    precision="fp16",  # fp32, fp16, or bf16
    oom_backoff=True  # auto-reduce batch size on OOM
)

embedder.run_process(
    max_length=1024,
    batch_size=16,
    pool="mean"  # mean, cls, or eos
)

Supported models: ESM2 • Ankh2 • ProtT5 • ProtBERT • Mistral-Prot • ESM-C
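The pooling options reduce a per-token embedding matrix to one fixed-size vector per sequence. A minimal numpy sketch of what each strategy computes (not Sylphy internals, just the idea behind each option):

```python
import numpy as np

# A per-token embedding matrix of shape (seq_len, hidden_dim).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((12, 8))  # 12 tokens, 8-dim embeddings

mean_pooled = tokens.mean(axis=0)  # "mean": average over all tokens
cls_pooled = tokens[0]             # "cls": first (classification) token
eos_pooled = tokens[-1]            # "eos": last (end-of-sequence) token

print(mean_pooled.shape)  # (8,)
```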

Dimensionality Reduction

Reduce high-dimensional embeddings for visualization:

from sylphy.reductions import reduce_dimensionality

model, reduced = reduce_dimensionality(
    method="umap",  # pca, truncated_svd, umap, tsne, isomap, etc.
    dataset=embeddings,
    n_components=2,
    random_state=42,
    return_type="numpy"  # numpy or pandas
)
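For intuition, PCA-style reduction projects centered data onto its top singular vectors; `reduce_dimensionality` wraps scikit-learn (and UMAP/ClustPy) implementations of this family. A self-contained numpy sketch of the linear case:

```python
import numpy as np

# Minimal PCA via SVD: reduce 64-dim "embeddings" to 2 components.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 64))   # 100 embeddings, 64 dims

X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
reduced = X_centered @ Vt[:2].T      # project onto top 2 components

print(reduced.shape)  # (100, 2)
```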

Command-Line Interface

# Extract embeddings
sylphy get-embedding run \
  --model facebook/esm2_t6_8M_UR50D \
  --input-data sequences.csv \
  --sequence-identifier sequence \
  --output embeddings.parquet \
  --device cuda --precision fp16 --batch-size 16

# Encode sequences
sylphy encode-sequences run \
  --encoder one_hot \
  --input-data sequences.csv \
  --sequence-identifier sequence \
  --output encoded.csv

# Manage cache
sylphy cache ls        # List cached files
sylphy cache stats     # Cache statistics
sylphy cache prune     # Prune cache (remove old files or reduce size)
sylphy cache rm        # Remove files by pattern or age
sylphy cache clear     # Clear entire cache

# Version info
sylphy --version

Configuration

Cache Management

Models and intermediate files are cached at:

  • Linux: ~/.cache/sylphy
  • macOS: ~/Library/Caches/sylphy
  • Windows: %LOCALAPPDATA%\sylphy

Programmatic control:

from sylphy import get_config, set_cache_root, temporary_cache_root

# View current cache location
cfg = get_config()
print(cfg.cache_paths.cache_root)

# Change cache directory
set_cache_root("/custom/cache/path")

# Temporary override
with temporary_cache_root("/tmp/cache"):
    # operations use temporary cache
    pass

Environment variables:

export SYLPHY_CACHE_ROOT=/custom/cache     # Override cache location
export SYLPHY_DEVICE=cuda                  # Force device (cpu/cuda)
export SYLPHY_LOG_FILE=/tmp/sylphy.log     # Enable file logging
export SYLPHY_SEED=42                      # Random seed

Model Registry

Register custom models and aliases:

from sylphy import ModelSpec, register_model, register_alias, resolve_model

# Register a model
register_model(ModelSpec(
    name="esm2_small",
    provider="huggingface",
    ref="facebook/esm2_t6_8M_UR50D"
))

# Create alias
register_alias("my_model", "esm2_small")

# Resolve to path
path = resolve_model("my_model")

Override model paths via environment:

export SYLPHY_MODEL_ESM2_SMALL=/path/to/local/model

Logging

Optional unified logging configuration:

from sylphy.logging import setup_logger

setup_logger(name="sylphy", level="INFO")  # DEBUG, INFO, WARNING, ERROR
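For environments where you configure logging yourself, the equivalent pattern with the standard library looks like this (an assumption that Sylphy emits through standard `logging`, as its `get_logger` helper suggests):

```python
import logging

# Configure the "sylphy" logger by hand instead of calling setup_logger.
logger = logging.getLogger("sylphy")
logger.setLevel(logging.INFO)  # DEBUG, INFO, WARNING, ERROR

handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logger.addHandler(handler)

logger.info("logging configured")
```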

Examples

The examples/ directory contains complete working examples:

  • 1_quick_start_encoders.ipynb — Jupyter notebook demonstrating all classical encoders
  • 2_simple_demo_embedding_extractor.py — Extract embeddings using all supported model families
  • 3_quick_start_reduction_process.ipynb — Dimensionality reduction workflows
  • 4_demo_embedding_different_layers.py — Layer selection and aggregation strategies
  • encoder_sequences_using_sylphy.py — Batch encoding with multiple encoder types
  • extract_embedding_using_sylphy.py — Production-ready embedding extraction script

Run examples:

# Python scripts
python examples/2_simple_demo_embedding_extractor.py

# Jupyter notebooks
jupyter notebook examples/1_quick_start_encoders.ipynb

Development

Setup

Clone the repository and install in editable mode:

git clone https://github.com/kren-ai-lab/sylphy.git
cd sylphy

# Install with development dependencies
pip install -e ".[dev]"

# Or install with all features for testing
pip install -e ".[all,dev]"

Note: The -e flag installs in editable mode, meaning changes to the source code take effect immediately without reinstalling.

Testing

# Run tests
pytest                # All tests (offline, mocked)
pytest -v             # Verbose
pytest --cov=sylphy   # With coverage

# Using taskipy shortcuts
uv run task test      # Run tests (quiet)
uv run task test-v    # Run tests (verbose)
uv run task test-cov  # Run tests with coverage report

Code Quality

# Linting and formatting
ruff check sylphy/    # Lint
ruff format sylphy/   # Format
mypy sylphy/          # Type check

# Using taskipy shortcuts
uv run task lint      # Lint check
uv run task lint-fix  # Lint and auto-fix
uv run task format    # Format code

Architecture

  • Fully typed with annotations
  • NumPy-style docstrings
  • Factory pattern for all components
  • Lazy imports for heavy dependencies
  • Offline tests with mocked PyTorch/HF

API Reference

Main imports:

from sylphy import (
    # Configuration / registry
    get_config, set_cache_root, temporary_cache_root,
    ModelSpec, register_model, resolve_model,
)
from sylphy.sequence_encoder import create_encoder
from sylphy.embedding_extractor import create_embedding
from sylphy.reductions import reduce_dimensionality
from sylphy.logging import setup_logger, get_logger

See CLAUDE.md for detailed architecture documentation.

License

GPL-3.0-only — See LICENSE for details.

Acknowledgements

Built with:

  • Hugging Face Transformers ecosystem
  • Meta ESM-C SDK
  • scikit-learn • PyTorch • UMAP • ClustPy

Developed by KREN AI Lab at Universidad de Magallanes, Chile.
