Project description

fast-seqfunc

Painless sequence-function models for proteins and nucleotides.

Made with ❤️ by Eric Ma (@ericmjl).

Overview

Fast-SeqFunc is a Python package for efficient sequence-function modeling of protein and nucleotide sequences. It provides a simple, high-level API that handles sequence embedding and automates model selection and training.

The core purpose of Fast-SeqFunc is to quickly determine if there is meaningful "signal" in your sequence-function data. By rapidly building baseline models, you can discover early whether predictive relationships exist in your data and opportunistically use these models for scoring and ranking candidate sequences to test. When signal is detected, you can invest your time more effectively in developing advanced models (such as deep neural networks) as a second iteration.

Key Features

  • Multiple Embedding Methods:
    • One-hot encoding (currently implemented)
    • CARP (Microsoft's protein-sequence-models) - planned for future releases
    • ESM2 (Facebook's ESM) - planned for future releases
  • Automated Machine Learning:
    • Uses PyCaret for model selection and hyperparameter tuning
    • Supports regression and classification tasks
    • Evaluates performance with appropriate metrics
  • Sequence Handling:
    • Flexible handling of variable-length sequences
    • Configurable padding options for consistent embeddings
    • Support for custom alphabets
  • Simple API:
    • Single function call to train models
    • Handles data loading and preprocessing
  • Command-line Interface:
    • Train models directly from the command line
    • Make predictions on new sequences
    • Compare different embedding methods

Installation

Using pip

pip install fast-seqfunc

From Source

git clone git@github.com:ericmjl/fast-seqfunc
cd fast-seqfunc
pixi install

Quick Start

Python API
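Fast-SeqFunc expects tabular data with one column of sequences and one column of measured function values. A minimal sketch of preparing such a file (the sequences and values below are made up purely for illustration):

import pandas as pd

# Hypothetical toy dataset: sequences paired with a measured property.
toy = pd.DataFrame(
    {
        "sequence": ["MKTAYIAKQR", "MKTAYIPKQR", "MKTCYIAKQR"],
        "function": [0.82, 0.47, 0.63],
    }
)
toy.to_csv("train_data.csv", index=False)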

from fast_seqfunc import train_model, predict
import pandas as pd

# Load your sequence-function data
train_data = pd.read_csv("train_data.csv")
test_data = pd.read_csv("test_data.csv")

# Train a model
model = train_model(
    train_data=train_data,
    test_data=test_data,
    sequence_col="sequence",
    target_col="function",
    embedding_method="one-hot",  # currently the only implemented method
    model_type="regression",     # or "classification"
)

# Make predictions on new sequences
new_data = pd.read_csv("new_sequences.csv")
predictions = predict(model, new_data["sequence"])

# Save the model for later use
model.save("my_model.pkl")
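Continuing the example above, a quick baseline model like this can be used to score and rank candidate sequences for testing, as mentioned in the overview. A minimal sketch, assuming predict returns an array-like of scores aligned with the input sequences:

# Attach predictions to the candidates and rank them, highest predicted
# function first, to prioritize which sequences to test next.
ranked = new_data.assign(predicted_function=predictions).sort_values(
    "predicted_function", ascending=False
)
print(ranked.head(10))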

Command-line Interface

Train a model:

# All outputs (model, metrics, cache) will be saved to the 'outputs' directory
fast-seqfunc train train_data.csv --sequence-col sequence --target-col function --embedding-method one-hot --output-dir outputs

Make predictions:

# All prediction outputs will be saved to the 'prediction_outputs' directory
fast-seqfunc predict-cmd outputs/model.pkl new_sequences.csv --output-dir prediction_outputs

Compare embedding methods:

# All outputs (comparison results, metrics, models, cache) will be saved to the 'comparison_outputs' directory
fast-seqfunc compare-embeddings train_data.csv --test-data test_data.csv --output-dir comparison_outputs

Advanced Usage

Using Multiple Embedding Methods

Currently, only one-hot encoding is implemented. Support for multiple embedding methods is planned for future releases.

model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
)

Detailed Performance Metrics and Visualizations

The output directories from CLI commands contain comprehensive model performance metrics and visualizations:

outputs/                          # Main output directory
├── model.pkl                     # Saved model
├── summary.json                  # Summary of output locations and parameters
├── metrics/                      # Performance metrics and visualizations
│   ├── one-hot_metrics.json      # Detailed metrics in JSON format
│   ├── one-hot_predictions.csv   # Raw predictions and true values
│   ├── one-hot_scatter_plot.png  # Visualization plots
│   ├── one-hot_residual_plot.png
│   └── ...
└── cache/                        # Cached embeddings

For predictions:

prediction_outputs/               # Prediction output directory
├── predictions.csv               # Saved predictions
├── predictions_histogram.png     # Histogram of prediction values (for regression)
└── prediction_summary.json       # Summary of prediction parameters

When comparing embedding methods, a similar structure is created:

comparison_outputs/
├── embedding_comparison.csv      # Table comparing all methods
├── embedding_comparison_plot.png # Bar chart comparing metrics across methods
├── summary.json                  # Summary of output locations and parameters
├── models/                       # Saved models for each method
│   └── one-hot_model.pkl
├── metrics/                      # Performance metrics for each method
│   └── one-hot_metrics.json
└── cache/                        # Cached embeddings
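
Because the comparison table is a plain CSV file, it can be loaded directly to see how methods stack up once additional embedding methods become available. A minimal sketch (the exact column names depend on the task and the metrics recorded):

import pandas as pd

# Load the table comparing embedding methods across metrics.
comparison = pd.read_csv("comparison_outputs/embedding_comparison.csv")
print(comparison)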

You can also generate these outputs programmatically:

from pathlib import Path
from fast_seqfunc import train_model, save_model, save_detailed_metrics

# Create output directories
output_dir = Path("my_model_outputs")
output_dir.mkdir(exist_ok=True)
metrics_dir = output_dir / "metrics"
metrics_dir.mkdir(exist_ok=True)
cache_dir = output_dir / "cache"
cache_dir.mkdir(exist_ok=True)

# Train model
model_info = train_model(
    train_data=train_data,
    test_data=test_data,
    embedding_method="one-hot",
    cache_dir=cache_dir,
)

# Save model
save_model(model_info, output_dir / "model.pkl")

# Save detailed metrics if test data was provided
if model_info.get("test_results"):
    save_detailed_metrics(
        metrics_data=model_info["test_results"],
        output_dir=metrics_dir,
        model_type=model_info["model_type"],
        embedding_method="one-hot"
    )
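
Whether produced by the CLI or programmatically, the saved artifacts are plain JSON and CSV files, so they can be loaded back for custom analysis or plotting. A minimal sketch using the CLI output layout shown above (the exact JSON keys depend on the metrics recorded for your task):

import json
import pandas as pd

# Load detailed metrics for the one-hot model.
with open("outputs/metrics/one-hot_metrics.json") as f:
    metrics = json.load(f)
print(metrics)

# Load raw predictions vs. true values for custom plots.
results = pd.read_csv("outputs/metrics/one-hot_predictions.csv")
print(results.head())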

Custom Metrics for Optimization

Specify metrics to optimize during model selection:

model = train_model(
    train_data=train_data,
    model_type="regression",
    optimization_metric="r2"  # or "rmse", "mae", etc.
)

Handling Variable Length Sequences

Fast-SeqFunc handles variable length sequences with configurable padding:

# Default behavior pads all sequences to the max length with "-"
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
    embedder_kwargs={"pad_sequences": True, "gap_character": "-"}
)

# Disable padding for sequences of different lengths
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
    embedder_kwargs={"pad_sequences": False}
)

# Set a fixed maximum length and custom gap character
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
    embedder_kwargs={"max_length": 100, "gap_character": "X"}
)
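
Conceptually, padding brings all sequences to the same length before one-hot encoding, so every sequence embeds to a vector of the same size. An illustrative, standalone sketch of that idea (not the library's internal implementation; the alphabet and gap character here are arbitrary choices):

import numpy as np

alphabet = "ACGT-"  # toy nucleotide alphabet plus the gap character
char_to_idx = {c: i for i, c in enumerate(alphabet)}

def one_hot_pad(seqs, gap="-"):
    """Pad sequences to a common length, then one-hot encode and flatten."""
    max_len = max(len(s) for s in seqs)
    padded = [s.ljust(max_len, gap) for s in seqs]
    arr = np.zeros((len(seqs), max_len, len(alphabet)))
    for i, seq in enumerate(padded):
        for j, char in enumerate(seq):
            arr[i, j, char_to_idx[char]] = 1.0
    return arr.reshape(len(seqs), -1)  # one fixed-length vector per sequence

X = one_hot_pad(["ACGT", "AC", "ACGTA"])
print(X.shape)  # (3, 25): 3 sequences x (5 positions x 5 symbols)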

For a complete example, see examples/variable_length_sequences.py.

Documentation

For full documentation, visit https://ericmjl.github.io/fast-seqfunc/.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fast_seqfunc-0.2.0.tar.gz (195.8 kB)

Uploaded Source

Built Distribution

fast_seqfunc-0.2.0-py3-none-any.whl (40.2 kB)

Uploaded Python 3

File details

Details for the file fast_seqfunc-0.2.0.tar.gz.

File metadata

  • Download URL: fast_seqfunc-0.2.0.tar.gz
  • Upload date:
  • Size: 195.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for fast_seqfunc-0.2.0.tar.gz:

  • SHA256: 8bd6b864bb7e54520062332137cf10b239562e9838d0ec5126aa530bbf8f45b1
  • MD5: 4e119ac104727924a7a72d1d93ce6162
  • BLAKE2b-256: 57f6f2c265c322a6dcd485fb29d82773cd15c639f8d350c6d81125b94cfad0df

See more details on using hashes here.

Provenance

The following attestation bundles were made for fast_seqfunc-0.2.0.tar.gz:

Publisher: release-pypi-package.yaml on ericmjl/fast-seqfunc

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fast_seqfunc-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: fast_seqfunc-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 40.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for fast_seqfunc-0.2.0-py3-none-any.whl:

  • SHA256: 056e0879f811a1f432c6478797e33aa9858929b7f440d93a0276c4047e2da89d
  • MD5: a55d99be143596f8caad1153da2c2da9
  • BLAKE2b-256: b99e9403b2f5d60863c3020728244e241078d66fea1a034e1fdbdb83ce95f4e7

See more details on using hashes here.

Provenance

The following attestation bundles were made for fast_seqfunc-0.2.0-py3-none-any.whl:

Publisher: release-pypi-package.yaml on ericmjl/fast-seqfunc

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
