An Ensemble Machine Learning algorithm for classification of bacterial strains.
StrainFish
StrainFish is a weighted ensemble machine learning algorithm with multiple DNA sequence encoders, designed specifically for classifying marker sequences.
Conceived and built by Kranti Konganti, HFP
v0.2.0
- Multiple DNA sequence encoders for GPU-accelerated training.
- Weighted ensemble machine-learning model generation with sensible defaults.
- GPU-accelerated training and prediction only!
- Important Note: This software is under active development and as such some features are experimental. Results should be thoroughly validated and independently verified before use in critical applications or publications.
Table of Contents
- Installation
- Quick Start
- Training Models
- Making Predictions
- Configuration Options
- Test Data and Examples
- Dependencies
- License
Installation
StrainFish requires Python 3.11+ and can be installed via pip:
pip install strainfish
For development installation:
git clone https://github.com/your-repo/strainfish.git
cd strainfish
pip install -e .
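Because training and prediction are GPU-only, it can help to confirm that an NVIDIA GPU is visible before running anything. The following is a minimal pre-flight sketch and is not part of StrainFish: the helper name `visible_gpus` is hypothetical, and it assumes the NVIDIA driver utility `nvidia-smi` is on `PATH`.

```python
import shutil
import subprocess


def visible_gpus() -> list[str]:
    """Return one line per GPU reported by `nvidia-smi -L` (empty if none)."""
    # Hypothetical helper for illustration; StrainFish does not ship this.
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    # Keep only the device lines, which start with "GPU <index>:".
    return [
        line.strip()
        for line in result.stdout.splitlines()
        if line.strip().startswith("GPU ")
    ]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus:
        print(f"Found {len(gpus)} GPU(s):")
        for gpu in gpus:
            print(" ", gpu)
    else:
        print("No NVIDIA GPU detected; StrainFish training and prediction need one.")
```

An empty list only means that `nvidia-smi` reported no devices; it says nothing about whether the installed CUDA libraries match the driver.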
Quick Start
Training a Model
To train a model on your DNA sequences:
strainfish train run \
-f path/to/sequences.fasta \
-l path/to/labels.csv \
-o /path/to/models_output_dir/model_prefix
Predicting using a Model
To predict bacterial strains using a trained model:
strainfish predict run \
-f path/to/predict_sequences.fasta \
-m /path/to/models_output_dir/model_prefix \
-o path/to/results_directory
Training Models
StrainFish uses an ensemble approach for both training and prediction (XGBoost, RandomForest and NaiveBayes), with custom DNA sequence encodings optimized for GPU acceleration.
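As a rough illustration of what a weighted ensemble does at prediction time, the sketch below combines per-model class probabilities by weighted soft voting. It is an assumption-laden example, not StrainFish's actual combination rule: the function name `weighted_soft_vote`, the model weights, and the probabilities are all made up for illustration.

```python
def weighted_soft_vote(probabilities, weights):
    """Combine per-model class probabilities into one weighted distribution.

    probabilities: {model_name: {class_label: probability}}
    weights:       {model_name: non-negative weight}
    """
    total_weight = sum(weights[name] for name in probabilities)
    if total_weight <= 0:
        raise ValueError("at least one model needs a positive weight")

    combined = {}
    for name, class_probs in probabilities.items():
        share = weights[name] / total_weight  # normalise so shares sum to 1
        for label, prob in class_probs.items():
            combined[label] = combined.get(label, 0.0) + share * prob
    return combined


# Illustrative numbers only; they are not StrainFish output.
per_model = {
    "xgboost": {"strain_A": 0.70, "strain_B": 0.30},
    "random_forest": {"strain_A": 0.40, "strain_B": 0.60},
    "naive_bayes": {"strain_A": 0.20, "strain_B": 0.80},
}
model_weights = {"xgboost": 0.5, "random_forest": 0.3, "naive_bayes": 0.2}

combined = weighted_soft_vote(per_model, model_weights)
predicted = max(combined, key=combined.get)
print(predicted, round(combined[predicted], 3))
```

In this toy case two of the three models favor `strain_B`, yet the heavier weight on the first model tips the combined vote to `strain_A`, which is the practical difference between weighted and majority voting.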
Basic Training Command
# -f: input FASTA file
# -l: labels CSV (id,label)
# -o: output path for models (directory and model prefix)
strainfish train run \
-f training_sequences.fasta \
-l labels.csv \
-o /path/to/models_output_dir/model_prefix
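Before starting a GPU training run, it may be worth checking that the FASTA file and the labels CSV agree. The sketch below is a hypothetical pre-check, not a StrainFish feature. It assumes the CSV has a header row with `id` and `label` columns and that each `id` matches the first whitespace-delimited token of a FASTA header; both assumptions should be verified against your own data.

```python
import csv
import io


def fasta_ids(handle):
    """Yield the first whitespace-delimited token of each FASTA header."""
    for line in handle:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                yield header.split()[0]


def check_labels(fasta_handle, labels_handle):
    """Return (ids missing a label, labelled ids absent from the FASTA)."""
    # Assumes a header row with `id` and `label` columns.
    labelled = {row["id"] for row in csv.DictReader(labels_handle)}
    ids = set(fasta_ids(fasta_handle))
    return sorted(ids - labelled), sorted(labelled - ids)


# Tiny in-memory example; use open("...") on real files instead.
fasta_text = ">seq1 sample one\nACGTACGT\n>seq2\nGGCCTTAA\n>seq3\nTTTTAAAA\n"
labels_text = "id,label\nseq1,strain_A\nseq2,strain_B\nseq4,strain_B\n"

missing, unused = check_labels(io.StringIO(fasta_text), io.StringIO(labels_text))
print("Sequences without a label:", missing)
print("Labels without a sequence:", unused)
```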
Advanced Configuration
StrainFish configuration options during training:
# --encode-method: encoding method (sm, sp, or tf)
# --kmer: k-mer size for hashing
# --num-hashes: number of hashes per sequence
# --factor: sequence overlap factor
# --chunk-size: size of DNA chunks
# --pseknc-weight: weight for PseKNC encoding
# --xgb-n-estimators: XGBoost parameter
# --rf-n-estimators: RandomForest parameter
strainfish train run \
-f training_sequences.fasta \
-l labels.csv \
-o model_output_dir \
--encode-method tf \
--kmer 7 \
--num-hashes 100 \
--factor 21 \
--chunk-size 200 \
--pseknc-weight 0.1 \
--xgb-n-estimators 300 \
--rf-n-estimators 100
Encoding Methods
StrainFish supports three DNA sequence encoding methods:
- tf (TF-IDF): Traditional TF-IDF vectorization
- sp (SentencePiece): Subword tokenization using SentencePiece models (Experimental)
- sm (SOMH): MinHash-based approach with PseKNC and sequence composition weights (AT/GC ratio) (Experimental)
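For intuition about the `tf` option, the sketch below treats overlapping k-mers as "words" and weights them with TF-IDF. It is a pure-Python illustration under stated assumptions, not StrainFish's encoder: the function names are hypothetical, the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` is one common variant, and `k=3` is used only to keep the example readable.

```python
import math
from collections import Counter


def kmers(sequence, k):
    """Return all overlapping k-mers of an upper-cased sequence."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def tfidf(sequences, k):
    """Return one {kmer: weight} dict per sequence (raw TF x smoothed IDF)."""
    counts = [Counter(kmers(seq, k)) for seq in sequences]
    n_docs = len(counts)
    # Document frequency: in how many sequences each k-mer appears.
    doc_freq = Counter(kmer for c in counts for kmer in c)
    idf = {
        kmer: math.log((1 + n_docs) / (1 + df)) + 1.0
        for kmer, df in doc_freq.items()
    }
    return [{kmer: tf * idf[kmer] for kmer, tf in c.items()} for c in counts]


seqs = ["ACGTACGT", "ACGTTTTT", "GGGGACGT"]
vectors = tfidf(seqs, k=3)

# "ACG" occurs in every sequence, so it gets the minimum IDF of 1.0;
# "TTT" occurs only in the second sequence, so it is weighted more heavily.
print(vectors[1]["ACG"], vectors[1]["TTT"])
```

K-mers shared by every sequence carry little discriminating signal and receive the lowest weight, while k-mers specific to one sequence are emphasized.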
Making Predictions
Basic Prediction Command
# -f: input FASTA file(s)
# -m: path to trained model
# -o: output directory for predictions
strainfish predict run \
-f prediction_sequences.fasta \
-m /path/to/models_output_dir/model_prefix \
-o results_dir
Model Management
List available models:
strainfish predict list-models
# Or list models stored in a specific models directory:
strainfish predict list-models -md /path/to/models_dir
Configuration Options
StrainFish provides configuration options for training.
XGBoost Parameters
View all configurable XGBoost parameters:
strainfish train show-xgb-params
Key parameters:
- --xgb-n-estimators: Number of boosting rounds
- --xgb-max-depth: Maximum tree depth
- --xgb-learning-rate: Learning rate for boosting
- --xgb-subsample: Subsample ratio of the training instances
RandomForest Parameters
View all configurable RandomForest parameters:
strainfish train show-rf-params
Key parameters:
- --rf-n-estimators: Number of trees in the forest
- --rf-max-depth: Maximum depth of the tree
- --rf-random-state: Random seed for reproducibility
- --rf-min-samples-leaf: Minimum samples required at a leaf node
SentencePiece Parameters
View all configurable SentencePiece parameters:
strainfish train show-sp-params
Key parameters:
- --sp-vocab-size: Vocabulary size for tokenization
- --sp-max-sentence-length: Maximum sentence length
- --sp-char-cov: Character coverage ratio
Imbalance Handling Parameters
View all imbalance handling parameters:
strainfish train show-imb-params
Key parameters:
- --imb-smote-k-neighbors: Number of neighbors for SMOTE
- --imb-enn-n-neighbors: Number of neighbors for ENN cleaning
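To make the role of the SMOTE neighbor count concrete, the sketch below generates synthetic minority-class points by interpolating between a sample and one of its k nearest minority neighbors, which is the core idea of SMOTE. It is a simplified pure-Python illustration with a hypothetical function name (`smote_sample`) and toy 2-D points; it is not the implementation StrainFish uses and it omits the ENN cleaning step.

```python
import random


def smote_sample(minority, k_neighbors, rng):
    """Create one synthetic point between a minority point and a near neighbor."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # Keep the k nearest neighbors by squared Euclidean distance.
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbor = rng.choice(others[:k_neighbors])
    gap = rng.random()  # position along the segment base -> neighbor
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbor))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = [smote_sample(minority_points, k_neighbors=2, rng=rng) for _ in range(5)]

for point in synthetic:
    print(point)
```

A smaller neighbor count keeps synthetic points close to existing samples, while a larger one spreads them across more of the minority class and can blur class boundaries when the class is small.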
Test Data and Examples
The repository includes test data in the tests/test_input/ directory:
- test.train.fasta: Training sequences in FASTA format
- test.train.csv: Labels file with id,label columns
- predict.fasta: Sequences for prediction using trained models
You can use these to test StrainFish functionality:
# Train a model using test data
strainfish train run \
-f tests/test_input/test.train.fasta \
-l tests/test_input/test.train.csv \
-o test_output/test_model
# Make predictions with the trained model
strainfish predict run \
-f tests/test_input/predict.fasta \
-m test_output/test_model \
-o prediction_results
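If you prefer to drive the two steps above from a script, the sketch below assembles the same command lines in Python. The helper names `train_command` and `predict_command` are hypothetical; only the subcommands and the `-f`, `-l`, `-m`, and `-o` flags shown above are used, and the commands are printed rather than executed.

```python
import shlex


def train_command(fasta, labels, output_prefix):
    """Build the argument list for `strainfish train run`."""
    return ["strainfish", "train", "run",
            "-f", fasta, "-l", labels, "-o", output_prefix]


def predict_command(fasta, model_prefix, results_dir):
    """Build the argument list for `strainfish predict run`."""
    return ["strainfish", "predict", "run",
            "-f", fasta, "-m", model_prefix, "-o", results_dir]


steps = [
    train_command("tests/test_input/test.train.fasta",
                  "tests/test_input/test.train.csv",
                  "test_output/test_model"),
    predict_command("tests/test_input/predict.fasta",
                    "test_output/test_model",
                    "prediction_results"),
]

for argv in steps:
    print(shlex.join(argv))
    # To execute on a machine with StrainFish and a GPU available:
    #   import subprocess
    #   subprocess.run(argv, check=True)
```

Passing an argument list to `subprocess.run` avoids shell quoting problems when paths contain spaces.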
Dependencies
StrainFish has the following key dependencies:
- Core ML Libraries: numpy, pandas, scikit-learn, xgboost, cuml (GPU-accelerated)
- Sequence Processing: biopython, sourmash, sentencepiece
- CLI Interface: rich, rich-click
- Utilities: joblib, psutil, humanize, pynvml
- Testing: pytest, pytest-cov
For a complete list of dependencies, see pyproject.toml.
License
This project is licensed under the MIT License - see the LICENSE.md file for details.