fast-seqfunc
Painless sequence-function models for proteins and nucleotides.
Made with ❤️ by Eric Ma (@ericmjl).
Overview
Fast-SeqFunc is a Python package for efficient sequence-function modeling of proteins and nucleotides. It provides a simple, high-level API that handles sequence embedding and automates model selection and training.
The core purpose of Fast-SeqFunc is to quickly determine if there is meaningful "signal" in your sequence-function data. By rapidly building baseline models, you can discover early whether predictive relationships exist in your data and opportunistically use these models for scoring and ranking candidate sequences to test. When signal is detected, you can invest your time more effectively in developing advanced models (such as deep neural networks) as a second iteration.
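As a rough illustration of the "scoring and ranking" step, the sketch below sorts candidate sequences by a model's predicted values. `rank_candidates` is a hypothetical helper, not part of fast-seqfunc, and it assumes the predictions are available as a plain list of numbers aligned with the sequences.

```python
def rank_candidates(sequences, scores, top_k=None, higher_is_better=True):
    """Pair each sequence with its predicted score and sort by score.

    Hypothetical helper for illustration; not part of fast-seqfunc.
    """
    if len(sequences) != len(scores):
        raise ValueError("sequences and scores must have the same length")
    ranked = sorted(
        zip(sequences, scores),
        key=lambda pair: pair[1],
        reverse=higher_is_better,
    )
    return ranked if top_k is None else ranked[:top_k]
```

Passing `higher_is_better=False` would suit targets where a smaller value is the desired outcome.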
Key Features
- Multiple Embedding Methods:
  - One-hot encoding (currently implemented)
  - CARP (Microsoft's protein-sequence-models) - planned for future releases
  - ESM2 (Facebook's ESM) - planned for future releases
- Automated Machine Learning:
  - Uses PyCaret for model selection and hyperparameter tuning
  - Supports regression and classification tasks
  - Evaluates performance with appropriate metrics
- Sequence Handling:
  - Flexible handling of variable-length sequences
  - Configurable padding options for consistent embeddings
  - Custom alphabet support
- Simple API:
  - Single function call to train models
  - Handles data loading and preprocessing
- Command-line Interface:
  - Train models directly from the command line
  - Make predictions on new sequences
  - Compare different embedding methods
Installation
Using pip
pip install fast-seqfunc
From Source
git clone git@github.com:ericmjl/fast-seqfunc
cd fast-seqfunc
pixi install
Quick Start
Python API
from fast_seqfunc import train_model, predict
import pandas as pd
# Load your sequence-function data
train_data = pd.read_csv("train_data.csv")
test_data = pd.read_csv("test_data.csv")
# Train a model
model = train_model(
    train_data=train_data,
    test_data=test_data,
    sequence_col="sequence",
    target_col="function",
    embedding_method="one-hot",  # currently the only implemented method
    model_type="regression",  # or "classification"
)
# Make predictions on new sequences
new_data = pd.read_csv("new_sequences.csv")
predictions = predict(model, new_data["sequence"])
# Save the model for later use
model.save("my_model.pkl")
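The Quick Start reads CSV files with a `sequence` column and a `function` column. For trying the API without real data, the sketch below writes a toy file in that layout. The helper names and the target values are made up for illustration (the target is just the alanine fraction plus noise), and nothing here comes from fast-seqfunc itself.

```python
import csv
import random

# Standard 20-letter amino acid alphabet, used only to generate toy peptides.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_toy_rows(n_rows, length=10, seed=0):
    """Return (sequence, function) rows with arbitrary illustrative values."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        # Made-up target: fraction of alanines plus a little Gaussian noise.
        target = seq.count("A") / length + rng.gauss(0, 0.05)
        rows.append((seq, round(target, 4)))
    return rows


def write_csv(path, rows):
    """Write rows under the column names used in the Quick Start."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sequence", "function"])
        writer.writerows(rows)


if __name__ == "__main__":
    write_csv("train_data.csv", make_toy_rows(80, seed=0))
    write_csv("test_data.csv", make_toy_rows(20, seed=1))
```

Because the seed is fixed, rerunning the script reproduces the same files.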
Command-line Interface
Train a model:
# All outputs (model, metrics, cache) will be saved to the 'outputs' directory
fast-seqfunc train train_data.csv --sequence-col sequence --target-col function --embedding-method one-hot --output-dir outputs
Make predictions:
# All prediction outputs will be saved to the 'prediction_outputs' directory
fast-seqfunc predict-cmd outputs/model.pkl new_sequences.csv --output-dir prediction_outputs
Compare embedding methods:
# All outputs (comparison results, metrics, models, cache) will be saved to the 'comparison_outputs' directory
fast-seqfunc compare-embeddings train_data.csv --test-data test_data.csv --output-dir comparison_outputs
Advanced Usage
Using Multiple Embedding Methods
Currently, only one-hot encoding is implemented. Support for multiple embedding methods is planned for future releases.
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
)
Detailed Performance Metrics and Visualizations
The output directories from CLI commands contain comprehensive model performance metrics and visualizations:
outputs/ # Main output directory
├── model.pkl # Saved model
├── summary.json # Summary of output locations and parameters
├── metrics/ # Performance metrics and visualizations
│ ├── one-hot_metrics.json # Detailed metrics in JSON format
│ ├── one-hot_predictions.csv # Raw predictions and true values
│ ├── one-hot_scatter_plot.png # Visualization plots
│ ├── one-hot_residual_plot.png
│ └── ...
└── cache/ # Cached embeddings
For predictions:
prediction_outputs/ # Prediction output directory
├── predictions.csv # Saved predictions
├── predictions_histogram.png # Histogram of prediction values (for regression)
└── prediction_summary.json # Summary of prediction parameters
When comparing embedding methods, a similar structure is created:
comparison_outputs/
├── embedding_comparison.csv # Table comparing all methods
├── embedding_comparison_plot.png # Bar chart comparing metrics across methods
├── summary.json # Summary of output locations and parameters
├── models/ # Saved models for each method
│ ├── one-hot_model.pkl
├── metrics/ # Performance metrics for each method
│ ├── one-hot_metrics.json
└── cache/ # Cached embeddings
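Since the metrics files in the layouts above are JSON, they can be inspected with the standard library. The sketch below gathers every `*_metrics.json` file under an output directory's `metrics/` folder, keyed by the embedding-method prefix in the file name. `collect_metrics` is a hypothetical helper, not part of fast-seqfunc, and it makes no assumption about the keys stored inside each file.

```python
import json
from pathlib import Path


def collect_metrics(output_dir):
    """Map each '<method>_metrics.json' in output_dir/metrics to its parsed JSON.

    Hypothetical helper for illustration; not part of fast-seqfunc.
    """
    metrics_dir = Path(output_dir) / "metrics"
    suffix = "_metrics.json"
    results = {}
    for path in sorted(metrics_dir.glob("*" + suffix)):
        method = path.name[: -len(suffix)]  # e.g. "one-hot"
        with path.open() as handle:
            results[method] = json.load(handle)
    return results
```

For example, `collect_metrics("comparison_outputs")` would return a dictionary with one entry per embedding method found in that directory.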
You can also generate these outputs programmatically:
from pathlib import Path
from fast_seqfunc import train_model, save_model, save_detailed_metrics
# Create output directories
output_dir = Path("my_model_outputs")
output_dir.mkdir(exist_ok=True)
metrics_dir = output_dir / "metrics"
metrics_dir.mkdir(exist_ok=True)
cache_dir = output_dir / "cache"
cache_dir.mkdir(exist_ok=True)
# Train model
model_info = train_model(
    train_data=train_data,
    test_data=test_data,
    embedding_method="one-hot",
    cache_dir=cache_dir,
)
# Save model
save_model(model_info, output_dir / "model.pkl")
# Save detailed metrics if test data was provided
if model_info.get("test_results"):
    save_detailed_metrics(
        metrics_data=model_info["test_results"],
        output_dir=metrics_dir,
        model_type=model_info["model_type"],
        embedding_method="one-hot",
    )
Custom Metrics for Optimization
Specify metrics to optimize during model selection:
model = train_model(
    train_data=train_data,
    model_type="regression",
    optimization_metric="r2",  # or "rmse", "mae", etc.
)
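For reference, the three regression metrics named above have standard definitions, sketched here in plain Python. These are the textbook formulas, not the code fast-seqfunc or PyCaret runs internally. Note that `r2` is better when higher, while `rmse` and `mae` are better when lower.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined (division by zero) when all true values are identical.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A model that always predicts the mean of the true values scores an `r2` of 0, which makes it a convenient reference point when judging whether a baseline has found any signal.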
Handling Variable Length Sequences
Fast-SeqFunc handles variable length sequences with configurable padding:
# Default behavior pads all sequences to the max length with "-"
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
    embedder_kwargs={"pad_sequences": True, "gap_character": "-"},
)
# Disable padding for sequences of different lengths
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
    embedder_kwargs={"pad_sequences": False},
)
# Set a fixed maximum length and custom gap character
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
    embedder_kwargs={"max_length": 100, "gap_character": "X"},
)
For a complete example, see examples/variable_length_sequences.py.
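To make the padding options concrete, here is a minimal sketch of one-hot encoding with gap padding in plain Python. It is an illustration of the general technique, not fast-seqfunc's embedder: the function name, its signature, and the choice to truncate sequences longer than `max_length` are all assumptions.

```python
def one_hot_encode(sequences, alphabet, pad=True, gap_character="-", max_length=None):
    """Return one flat 0/1 vector per sequence.

    Illustrative sketch only; not the fast-seqfunc implementation.
    A character outside the alphabet raises KeyError.
    """
    # Treat the gap character as one extra alphabet symbol.
    if gap_character not in alphabet:
        alphabet = alphabet + gap_character
    index = {ch: i for i, ch in enumerate(alphabet)}
    if max_length is None:
        max_length = max(len(seq) for seq in sequences)

    encoded = []
    for seq in sequences:
        if pad:
            # Truncate to max_length (an assumption), then right-pad with gaps.
            seq = seq[:max_length].ljust(max_length, gap_character)
        vector = []
        for ch in seq:
            position = [0] * len(alphabet)
            position[index[ch]] = 1
            vector.extend(position)
        encoded.append(vector)
    return encoded
```

With padding enabled, every vector has length `max_length * len(alphabet)`, which is what lets sequences of different lengths share one feature matrix. With `pad=False`, this sketch returns vectors of differing lengths.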
Documentation
For full documentation, visit https://ericmjl.github.io/fast-seqfunc/.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.