Protein sequence representation: encoders, embeddings, and reductions.

Sylphy 🧬

Sylphy is a Python toolkit for turning protein sequences into machine-learning-ready representations.

It covers three main workflows:

  • Classical sequence encoders: one-hot, ordinal, frequency, k-mers, physicochemical, FFT
  • Embedding extraction from pretrained protein models: ESM2, ProtT5, ProtBERT, Ankh2, Mistral-Prot, ESM-C
  • Dimensionality reduction for downstream analysis and visualization

Installation

Sylphy supports Python 3.11 and 3.12.

pip install sylphy

Install optional extras as needed:

  • embeddings for PyTorch and Transformers-based embedding extraction
  • parquet for Parquet export support
  • reductions for UMAP and related optional reducers
  • all for all optional runtime dependencies

The reductions extra may require a C++ compiler and Python development headers because of optional native dependencies such as ClustPy.

pip install 'sylphy[embeddings,parquet]'
pip install 'sylphy[all]'

On Debian or Ubuntu systems, install the build prerequisites with:

sudo apt-get install build-essential python3-dev

On Fedora or RHEL systems:

sudo dnf install gcc gcc-c++ python3-devel
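To see which extras are usable in the current environment, a small check like the one below can help. It is a sketch, not part of Sylphy: the mapping from extra names to importable modules is an assumption (in particular, `pyarrow` as the Parquet backend is a guess), and `available_extras` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Assumed mapping from Sylphy extras to top-level module names.
# The package's real dependency list may differ.
EXTRA_MODULES = {
    "embeddings": ["torch", "transformers"],
    "parquet": ["pyarrow"],  # assumption: Parquet export could use another backend
    "reductions": ["umap"],  # umap-learn is imported as `umap`
}


def available_extras(mapping=EXTRA_MODULES):
    """Return {extra: True/False} depending on whether every module can be found."""
    return {
        extra: all(find_spec(module) is not None for module in modules)
        for extra, modules in mapping.items()
    }


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"{extra}: {'installed' if ok else 'missing'}")
```

`find_spec` only locates a module without importing it, so the check stays fast even when PyTorch is installed.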

Quick Start

Classical sequence encoding:

import pandas as pd
from sylphy.sequence_encoder import create_encoder

df = pd.DataFrame({"sequence": ["MKTAYIAKQR", "GAVLIMPFWK", "PEPTIDE"]})

encoder = create_encoder(
    "one_hot",  # or: ordinal, kmers, frequency, physicochemical, fft
    dataset=df,
    sequence_column="sequence",
)
encoder.run_process()
encoded = encoder.coded_dataset
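For intuition about what two of these encoders compute, here is a minimal pure-Python sketch of one-hot and frequency encoding. It is independent of Sylphy: the alphabet ordering, the zero-padding to `max_length`, and the function names are assumptions for illustration, and the layout of `coded_dataset` may differ.

```python
from collections import Counter

# The 20 canonical amino acids, ordered alphabetically by one-letter code (an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence, max_length):
    """Flattened (max_length x 20) one-hot vector; short sequences are zero-padded."""
    width = len(AMINO_ACIDS)
    vector = [0] * (max_length * width)
    for position, residue in enumerate(sequence[:max_length]):
        if residue in INDEX:  # non-canonical residues stay all-zero
            vector[position * width + INDEX[residue]] = 1
    return vector


def frequency(sequence):
    """Fraction of the sequence made up by each canonical amino acid."""
    if not sequence:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]


if __name__ == "__main__":
    print(sum(one_hot("PEPTIDE", max_length=10)))   # one set bit per residue
    print(frequency("PEPTIDE")[INDEX["P"]])          # share of proline
```

One-hot keeps residue order at the cost of a wide, sparse vector, while frequency encoding is compact and length-independent but discards order.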

Embedding extraction:

import pandas as pd
from sylphy.embedding_extractor import create_embedding

df = pd.DataFrame({"sequence": ["MKTAYIAKQR", "GAVLIMPFWK", "PEPTIDE"]})

embedder = create_embedding(
    model_name="facebook/esm2_t6_8M_UR50D",
    dataset=df,
    column_seq="sequence",
    name_device="cuda",
    precision="fp16",  # fp32, fp16, or bf16
)

embedder.run_process(batch_size=8, pool="mean")  # mean, cls, or eos
embeddings = embedder.coded_dataset
embedder.export_encoder("embeddings.parquet")
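The `pool` argument controls how per-token vectors are collapsed into one vector per sequence. The sketch below illustrates the three strategies in plain Python. It assumes an ESM-style token layout (a CLS token first, an EOS token last) and that mean pooling skips those special tokens; neither detail is confirmed for Sylphy, and `pool_tokens` is a hypothetical name.

```python
def pool_tokens(token_vectors, strategy="mean"):
    """Collapse per-token vectors into a single sequence-level vector.

    token_vectors: list of equal-length vectors ordered as
    [CLS], residue_1, ..., residue_n, [EOS] (assumed layout).
    """
    if strategy == "cls":
        return list(token_vectors[0])
    if strategy == "eos":
        return list(token_vectors[-1])
    if strategy == "mean":
        residues = token_vectors[1:-1]  # assumption: special tokens are excluded
        dim = len(residues[0])
        return [sum(vec[j] for vec in residues) / len(residues) for j in range(dim)]
    raise ValueError(f"unknown pooling strategy: {strategy!r}")


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [2.0, 4.0], [4.0, 8.0], [9.0, 9.0]]
    for name in ("mean", "cls", "eos"):
        print(name, pool_tokens(tokens, name))
```

Mean pooling is a common default for sequence-level tasks because it uses every residue; `cls` and `eos` rely on a single token and only make sense for models that emit one.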

Dimensionality reduction:

from sylphy.reductions import reduce_dimensionality

model, reduced = reduce_dimensionality(
    method="pca",  # pca, truncated_svd, umap, tsne, isomap, etc.
    dataset=embeddings,
    n_components=2,
    random_state=42,
)
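To make the `pca` option concrete, the following is a small, dependency-free sketch of what a PCA projection does: center the features, find the directions of largest variance, and project onto them. It uses power iteration rather than the SVD-based solvers that production libraries use, is not Sylphy's implementation, and `pca_project` is a hypothetical name. It assumes `n_components` does not exceed the number of features.

```python
import math
import random


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _orthonormalize(vector, basis):
    """Remove the parts of `vector` along `basis`, then scale to unit length."""
    for b in basis:
        proj = _dot(vector, b)
        vector = [x - proj * y for x, y in zip(vector, b)]
    norm = math.sqrt(_dot(vector, vector))
    return None if norm < 1e-12 else [x / norm for x in vector]


def pca_project(rows, n_components=2, n_iter=200, seed=42):
    """Project rows onto their top principal components via power iteration."""
    n, dim = len(rows), len(rows[0])
    # Center every feature at zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]

    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = _orthonormalize([rng.gauss(0.0, 1.0) for _ in range(dim)], components)
        for _ in range(n_iter):
            # Multiply by the covariance, then strip directions already found.
            w = _orthonormalize([_dot(row, v) for row in cov], components)
            if w is None:  # no variance left in the remaining directions
                break
            v = w
        components.append(v)
    # Coordinates of each centered row along each component.
    return [[_dot(c, comp) for comp in components] for c in centered]


if __name__ == "__main__":
    data = [[0, 0, 0.1], [1, 2, -0.1], [2, 4, 0.1], [3, 6, -0.1], [4, 8, 0.1]]
    for point in pca_project(data):
        print([round(x, 3) for x in point])
```

The sign of each component is arbitrary, so projected coordinates may be mirrored between implementations without changing their meaning.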

CLI

sylphy --help

sylphy get-embedding \
  --model facebook/esm2_t6_8M_UR50D \
  --input-data sequences.csv \
  --sequence-identifier sequence \
  --output embeddings.parquet \
  --device cuda --precision fp16 --batch-size 16

sylphy encode-sequences \
  --encoder one_hot \
  --input-data sequences.csv \
  --sequence-identifier sequence \
  --output encoded.csv

sylphy cache stats
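The commands above read `sequences.csv` and look up the column named by `--sequence-identifier`. A minimal way to produce such a file with the standard library is sketched below; the single-column layout with a header row is an assumption about the expected input, and `write_sequences_csv` is a hypothetical helper.

```python
import csv


def write_sequences_csv(path, sequences, column="sequence"):
    """Write one sequence per row under a header named `column`."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])
        writer.writerows([seq] for seq in sequences)


if __name__ == "__main__":
    write_sequences_csv("sequences.csv", ["MKTAYIAKQR", "GAVLIMPFWK", "PEPTIDE"])
```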

Configuration

By default, Sylphy stores cache data in the platform cache directory:

  • Linux: ~/.cache/sylphy
  • macOS: ~/Library/Caches/sylphy
  • Windows: %LOCALAPPDATA%\sylphy\Cache

Useful environment variables:

  • SYLPHY_CACHE_ROOT to override the cache location
  • SYLPHY_DEVICE to force cpu or cuda
  • SYLPHY_MODEL_<NAME> to override a registered model path
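The lookup order described above (environment override first, then the platform default) can be sketched as follows. This is an illustration of the documented behavior, not Sylphy's actual resolver; `default_cache_dir` and its parameters are hypothetical, and the Windows fallback when `LOCALAPPDATA` is unset is an assumption.

```python
import os
import sys
from pathlib import Path


def default_cache_dir(platform=sys.platform, env=os.environ, home=None):
    """Return the cache directory: SYLPHY_CACHE_ROOT if set, else the platform default."""
    override = env.get("SYLPHY_CACHE_ROOT")
    if override:
        return Path(override)
    home = Path(home) if home else Path.home()
    if platform.startswith("win"):
        # Assumed fallback when LOCALAPPDATA is not defined.
        base = env.get("LOCALAPPDATA", str(home / "AppData" / "Local"))
        return Path(base) / "sylphy" / "Cache"
    if platform == "darwin":
        return home / "Library" / "Caches" / "sylphy"
    return home / ".cache" / "sylphy"


if __name__ == "__main__":
    print(default_cache_dir())
```

Environment variables are usually read when a library is imported or first used, so setting `SYLPHY_CACHE_ROOT` in the shell before launching Python is the safest option.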

License

GPL-3.0-only. See LICENSE.

Acknowledgements

Built with the Hugging Face Transformers ecosystem, the Meta ESM-C SDK, and the broader scientific Python stack including scikit-learn, PyTorch, UMAP, and ClustPy.

Developed by KREN AI Lab at Universidad de Magallanes, Chile.
