
An Ensemble Machine Learning algorithm for classification of bacterial strains.


StrainFish

StrainFish is a weighted ensemble machine learning algorithm with multiple DNA sequence encoders and encoding logic, designed specifically for classifying marker sequences.

Conceived and built by Kranti Konganti, HFP

v0.2.0

  • Multiple DNA sequence encoders for GPU-accelerated training.
  • Weighted ensemble machine-learning model generation with sensible defaults.
  • GPU-accelerated training and prediction only!
  • Important Note: This software is under active development and, as such, some features are experimental. Results should be thoroughly validated and independently verified before use in critical applications or publications.

Table of Contents

  1. Installation
  2. Quick Start
  3. Training Models
  4. Making Predictions
  5. Configuration Options
  6. Test Data and Examples
  7. Dependencies
  8. License

Installation

StrainFish requires Python 3.11+ and can be installed via pip:

pip install strainfish

For development installation:

git clone https://github.com/your-repo/strainfish.git
cd strainfish
pip install -e .

Quick Start

Training a Model

To train a model on your DNA sequences:

strainfish train run \
  -f path/to/sequences.fasta \
  -l path/to/labels.csv \
  -o /path/to/models_output_dir/model_prefix

Predicting using a Model

To predict bacterial strains using a trained model:

strainfish predict run \
  -f path/to/predict_sequences.fasta \
  -m /path/to/models_output_dir/model_prefix \
  -o path/to/results_directory

Training Models

StrainFish uses an ensemble approach for both training and prediction (XGBoost, RandomForest and NaiveBayes), with custom DNA sequence encodings optimized for GPU acceleration.
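
How StrainFish weights its three models is not spelled out here, so the following is only a minimal sketch of one common scheme, weighted soft voting. The function name, the weights, and the probabilities are all hypothetical and are not taken from the StrainFish code base.

```python
from collections import defaultdict


def weighted_soft_vote(model_probs, weights):
    """Combine per-model class probabilities into one weighted prediction.

    model_probs: {model_name: {label: probability}}
    weights:     {model_name: non-negative weight}
    """
    total = sum(weights[name] for name in model_probs)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = defaultdict(float)
    for name, probs in model_probs.items():
        for label, p in probs.items():
            # Each model contributes in proportion to its normalized weight.
            combined[label] += weights[name] * p / total
    best = max(combined, key=combined.get)
    return best, dict(combined)


# Hypothetical probabilities from the three model families (illustrative only).
model_probs = {
    "xgboost": {"strain_A": 0.7, "strain_B": 0.3},
    "random_forest": {"strain_A": 0.4, "strain_B": 0.6},
    "naive_bayes": {"strain_A": 0.2, "strain_B": 0.8},
}
weights = {"xgboost": 0.5, "random_forest": 0.3, "naive_bayes": 0.2}

label, scores = weighted_soft_vote(model_probs, weights)
print(label, scores)  # strain_A wins, roughly 0.51 vs 0.49
```

Two of the three models prefer strain_B here, yet the higher-weighted model tips the result to strain_A. That is the practical difference between weighted voting and a simple majority vote.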

Basic Training Command

# -f: input FASTA file
# -l: labels CSV (id,label)
# -o: output path for the trained models (models directory plus model prefix)
strainfish train run \
  -f training_sequences.fasta \
  -l labels.csv \
  -o /path/to/models_output_dir/model_prefix
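
Before training, it can help to confirm that every sequence in the FASTA file has a label. The sketch below makes two assumptions: the labels CSV has a header row named id,label, and each id matches the first whitespace-delimited token of a FASTA header. The helper names are hypothetical and are not part of StrainFish.

```python
import csv
import io


def fasta_ids(handle):
    """Yield the ID (first whitespace-delimited token) of each FASTA record."""
    for line in handle:
        if line.startswith(">"):
            header = line[1:].strip()
            yield header.split()[0] if header else ""


def read_labels(handle):
    """Read a CSV with an assumed 'id,label' header row into a dict."""
    return {row["id"]: row["label"] for row in csv.DictReader(handle)}


def unlabeled_ids(fasta_handle, labels_handle):
    """Return FASTA record IDs that have no entry in the labels file."""
    labels = read_labels(labels_handle)
    return [seq_id for seq_id in fasta_ids(fasta_handle) if seq_id not in labels]


# In-memory stand-ins for training_sequences.fasta and labels.csv.
fasta_text = ">seq1 sample one\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n"
labels_text = "id,label\nseq1,strain_A\nseq2,strain_B\n"

missing = unlabeled_ids(io.StringIO(fasta_text), io.StringIO(labels_text))
print(missing)  # ['seq3']
```

With real files, replace the io.StringIO objects with open(path, newline="") handles.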

Advanced Configuration

Additional options can be set during training:

# --encode-method:    encoding method: sm, sp, or tf
# --kmer:             k-mer size for hashing
# --num-hashes:       number of hashes per sequence
# --factor:           sequence overlap factor
# --chunk-size:       size of DNA chunks
# --pseknc-weight:    weight for PseKNC encoding
# --xgb-n-estimators: XGBoost parameter
# --rf-n-estimators:  RandomForest parameter
strainfish train run \
  -f training_sequences.fasta \
  -l labels.csv \
  -o model_output_dir \
  --encode-method tf \
  --kmer 7 \
  --num-hashes 100 \
  --factor 21 \
  --chunk-size 200 \
  --pseknc-weight 0.1 \
  --xgb-n-estimators 300 \
  --rf-n-estimators 100
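
To give a feel for what a k-mer size and a number of hashes control, here is a minimal MinHash sketch in plain Python. It is an illustration under stated assumptions, not StrainFish's implementation: the bottom-sketch variant, the BLAKE2b hash, and every function name are choices made for this example only.

```python
import hashlib


def kmers(seq, k):
    """Return the set of overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq, k=7, num_hashes=100):
    """Bottom-sketch MinHash: keep the num_hashes smallest k-mer hash values."""
    hashes = sorted(
        int.from_bytes(hashlib.blake2b(km.encode(), digest_size=8).digest(), "big")
        for km in kmers(seq, k)
    )
    return hashes[:num_hashes]


def jaccard_estimate(sig_a, sig_b, num_hashes=100):
    """Estimate Jaccard similarity of two k-mer sets from their bottom sketches."""
    merged = sorted(set(sig_a) | set(sig_b))[:num_hashes]
    if not merged:
        return 0.0
    shared = set(sig_a) & set(sig_b)
    return sum(1 for h in merged if h in shared) / len(merged)


seq_a = "ACGTTGCAAGCTTAGGCTAACGT"
same = jaccard_estimate(minhash_signature(seq_a), minhash_signature(seq_a))
disjoint = jaccard_estimate(minhash_signature("AAAAAAAAAA"), minhash_signature("CCCCCCCCCC"))
print(same, disjoint)  # 1.0 0.0
```

A larger k makes k-mers more specific, and a larger number of hashes gives a finer similarity estimate at the cost of a longer signature.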

Encoding Methods

StrainFish supports three DNA sequence encoding methods:

  • tf (TF-IDF): Traditional TF-IDF vectorization
  • sp (SentencePiece): Subword tokenization using SentencePiece models (Experimental)
  • sm (SOMH): MinHash-based approach with PseKNC and sequence composition weights (AT/GC ratio) (Experimental)
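
As an illustration of the tf idea, the sketch below treats each sequence as a "document" of overlapping k-mers and computes plain TF-IDF weights. It assumes the unsmoothed textbook formula tf × log(N/df). Library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization by default, so their values differ, and StrainFish's own tokenization is not shown here. The function names are hypothetical.

```python
import math
from collections import Counter


def kmer_tokens(seq, k):
    """Split a sequence into overlapping k-mers (the 'words' of the document)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf(sequences, k=3):
    """Return one {k-mer: weight} dict per sequence using tf * log(N / df)."""
    docs = [Counter(kmer_tokens(s, k)) for s in sequences]
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # document frequency: in how many sequences a k-mer occurs
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append(
            {km: (count / total) * math.log(n_docs / df[km]) for km, count in doc.items()}
        )
    return vectors


vectors = tfidf(["ACGTAC", "ACGTTT"], k=3)
for vec in vectors:
    print(vec)
```

K-mers shared by every sequence (ACG and CGT here) get weight zero, while k-mers unique to one sequence get positive weight, which is why the encoding emphasizes discriminating subsequences.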

Making Predictions

Basic Prediction Command

strainfish predict run \
  -f prediction_sequences.fasta \                        # Input FASTA file(s)
  -m /path/to/models_output_dir/model_prefix \           # Path to trained model
  -o results_dir                                         # Output directory for predictions

Model Management

List available models:

strainfish predict list-models
# Or list models stored in a specific models directory:
strainfish predict list-models -md /path/to/models_dir

Configuration Options

StrainFish groups its training options by component. Each group has a show-*-params subcommand that lists every configurable parameter.

XGBoost Parameters

View all configurable XGBoost parameters:

strainfish train show-xgb-params

Key parameters:

  • --xgb-n-estimators: Number of boosting rounds
  • --xgb-max-depth: Maximum tree depth
  • --xgb-learning-rate: Learning rate for boosting
  • --xgb-subsample: Subsample ratio of the training instances

RandomForest Parameters

View all configurable RandomForest parameters:

strainfish train show-rf-params

Key parameters:

  • --rf-n-estimators: Number of trees in the forest
  • --rf-max-depth: Maximum depth of the tree
  • --rf-random-state: Random seed for reproducibility
  • --rf-min-samples-leaf: Minimum samples required at a leaf node

SentencePiece Parameters

View all configurable SentencePiece parameters:

strainfish train show-sp-params

Key parameters:

  • --sp-vocab-size: Vocabulary size for tokenization
  • --sp-max-sentence-length: Maximum sentence length
  • --sp-char-cov: Character coverage ratio

Imbalance Handling Parameters

View all imbalance handling parameters:

strainfish train show-imb-params

Key parameters:

  • --imb-smote-k-neighbors: Number of neighbors for SMOTE
  • --imb-enn-n-neighbors: Number of neighbors for ENN cleaning
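
For readers unfamiliar with SMOTE, the sketch below shows its core step: each synthetic minority sample is an interpolation between a real sample and one of its k nearest minority neighbors. This is a simplified stdlib-only illustration, not the routine StrainFish calls. The function name and data are hypothetical, and ENN cleaning is not shown.

```python
import math
import random


def smote(minority, n_new, k_neighbors=5, seed=0):
    """Create n_new synthetic points by interpolating toward nearest neighbors.

    minority: list of equal-length numeric tuples (needs at least two points).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points by Euclidean distance.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k_neighbors]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical 2-D feature vectors for an under-represented class.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=10, k_neighbors=2)
print(len(new_points))  # 10
```

A smaller neighbor count keeps synthetic points close to existing local clusters, while a larger one spreads them across more of the minority class.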

Test Data and Examples

The repository includes test data in the tests/test_input/ directory:

  • test.train.fasta: Training sequences in FASTA format
  • test.train.csv: Labels file with id,label columns
  • predict.fasta: Sequences for prediction using trained models

You can use these to test StrainFish functionality:

# Train a model using test data
strainfish train run \
  -f tests/test_input/test.train.fasta \
  -l tests/test_input/test.train.csv \
  -o test_output/test_model

# Make predictions with the trained model
strainfish predict run \
  -f tests/test_input/predict.fasta \
  -m test_output/test_model \
  -o prediction_results
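
When scripting the two steps above, one option is to assemble the command lines in Python. The sketch below only builds and prints the commands, using the flags shown in this document. The helper names are hypothetical, and nothing is executed.

```python
import shlex


def train_cmd(fasta, labels, out_prefix):
    """Build the argument list for 'strainfish train run'."""
    return ["strainfish", "train", "run", "-f", fasta, "-l", labels, "-o", out_prefix]


def predict_cmd(fasta, model_prefix, out_dir):
    """Build the argument list for 'strainfish predict run'."""
    return ["strainfish", "predict", "run", "-f", fasta, "-m", model_prefix, "-o", out_dir]


train = train_cmd(
    "tests/test_input/test.train.fasta",
    "tests/test_input/test.train.csv",
    "test_output/test_model",
)
predict = predict_cmd(
    "tests/test_input/predict.fasta",
    "test_output/test_model",
    "prediction_results",
)

# shlex.join quotes arguments safely for display or logging.
print(shlex.join(train))
print(shlex.join(predict))
```

To actually run them, each list can be passed to subprocess.run(cmd, check=True), which raises an error if training fails so that prediction is not attempted against a missing model.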

Dependencies

StrainFish has the following key dependencies:

  • Core ML Libraries: numpy, pandas, scikit-learn, xgboost, cuml (GPU-accelerated)
  • Sequence Processing: biopython, sourmash, sentencepiece
  • CLI Interface: rich, rich-click
  • Utilities: joblib, psutil, humanize, pynvml
  • Testing: pytest, pytest-cov

For a complete list of dependencies, see pyproject.toml.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.
