Skip to main content

Classical Machine Learning methods for Reterival Augmented Generation

Project description

RAG-Classic-ML

RAG-Classic-ML is a versatile Python package designed to provide out-of-the-box machine learning pipelines for both basic and advanced tasks. It simplifies the process of building, training, and evaluating models for tasks like classification, regression, autoencoder-based feature extraction, and survival clustering. The package is designed for ease of use, offering pre-built pipelines and customizable parameters for a variety of machine learning algorithms.

Table of Contents

Features

  • Basic Machine Learning Pipelines: Ready-to-use pipelines for common supervised learning tasks, including classification and regression, with a variety of machine learning models (e.g., Logistic Regression, SVC, Random Forest).
  • Advanced Pipelines
    • Autoencoder : Dimensionality reduction and feature extraction using deep learning autoencoders.
    • Survival Clustering Analysis: Performs clustering on patient features and integrates clinical data to generate Kaplan-Meier survival plots and log-rank tests.
  • Customizable Models and Parameters: Easily define and customize machine learning models and hyperparameters.
  • Prediction and Metrics Generation: Generates and saves predictions, feature importance scores, and various performance metrics for each model and run.
  • Aggregation of Results: Aggregates results across runs and models for comprehensive analysis, facilitating comparison and evaluation.
  • Visualization Tools: Generates plots including AUC curves, AUC box plots, feature importance charts, radar charts for model performance comparison, and survival analysis plots.

Installation

You can install the package directly from PyPI:

pip install classic-ml

Alternatively, install from source:

git clone https://github.com/yourusername/classic-ml.git
cd benchmark-adv-ml
pip install .

Useage

The classic-ml package provides a command-line interface (CLI) for ease of use. Below are examples of how to use the various components.

Basic Machine Learning Pipelines

Classification

Train and evaluate a classification model using the classic-ml CLI. You can specify different models and hyperparameters.

Example 1: Support Vector Classifier (SVC)

classic-ml classification \
    --data ./Raisin_Dataset.data \
    --target 'label' \
    --output ./results/svc_rbf/ \
    --model SVC \
    --model_params '{"C": 1.0, "kernel": "rbf", "gamma": "scale", "probability": true}' \
    --test_size 0.2 \
    --seed 42

Example 2: Logistic Regression

classic-ml classification \
    --data ./Raisin_Dataset.data \
    --target 'label' \
    --output ./results/logistic_regression/ \
    --model LogisticRegression \
    --model_params '{"C": 0.5, "penalty": "l1", "solver": "saga", "max_iter": 1000, "class_weight": "balanced"}' \
    --test_size 0.2 \
    --seed 42

Example 3: Random Forest Classifier

classic-ml classification \
    --data ./Raisin_Dataset.data \
    --target 'label' \
    --output ./results/random_forest/ \
    --model RandomForestClassifier \
    --model_params '{"n_estimators": 100, "max_depth": 10}' \
    --test_size 0.2 \
    --seed 42

Benchmark Machine Learning Models

Run the benchmark ML pipeline to evaluate model stability across multiple runs.

benchmark-adv-ml benchmark --data ./your_dataset.csv --output ./final_results --prelim_output ./prelim_results --n_runs 10 --seed 42

Train Autoencoder Model

Train and evaluate an autoencoder model for feature extraction.

classic-ml autoencoder \
    --data ./your_dataset.csv \
    --sampleID 'PatientID' \
    --output_dir ./final_results \
    --prelim_output ./prelim_results \
    --latent_dim 10 \
    --epochs 50 \
    --batch_size 32 \
    --validation_split 0.1 \
    --test_size 0.2 \
    --seed 42

Survival Clustering Analysis

classic-ml survival_clustering \
    --data_path ./latent_features.csv \
    --clinical_df_path ./clinical_data.csv \
    --save_dir ./final_results

Command-Line Arguments

Common Arguments

  • --data: Path to the existing CSV file containing the dataset.
  • --output: Directory to save the final results and plots.
  • --prelim_output: Directory to save the preliminary results (predictions).
  • --seed: Seed for random state (default is 42).
  • --test_size: Fraction of data to use for testing (default: 0.2).

Classification/Regression Command Arguments

  • --target: Target column name in the dataset (e.g., 'label' for classification or 'price' for regression).
  • --model: Specify the machine learning model to use (e.g., SVC, LogisticRegression, RandomForestClassifier, LinearRegression).
  • --model_params: Hyperparameters for the specified model in JSON format (e.g., {"C": 1.0, "kernel": "rbf"}).

Autoencoder Command Arguments

  • --sampleID: Column name representing the sample or patient ID (default: 'sampleID').
  • --latent_dim: Dimensionality of the latent space (default: input_dim // 8).
  • --epochs: Number of training epochs (default: 50).
  • --batch_size: Training batch size (default: 32).
  • --validation_split: Proportion of training data to use as validation set (default: 0.1).
  • --test_size: Proportion of data to use as test set (default: 0.2).
  • --early_stopping: Enable early stopping (use flag to activate).
  • --patience: Patience for early stopping (default: 5).
  • --checkpoint: Enable model checkpointing (use flag to activate).

Benchmark Command Arguments

  • --target: Target column name in the dataset (default: 'label').
  • --n_runs: Number of runs for model stability evaluation (default: 20).

Survival Clustering Command Arguments

  • --data_path: Path to the CSV file containing patient features.
  • --clinical_df_path: Path to the CSV file containing clinical data.
  • --save_dir: Directory to save the results.

Dependencies

  • Python 3.11+
  • numpy
  • pandas
  • scikit-learn
  • matplotlib
  • seaborn
  • tensorflow
  • lifelines
  • yellowbrick

License

This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. See the LICENSE file for details.

Author

Vatsal Patel - VatsalPatel18

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rag_classic_ml-0.1.4.tar.gz (82.2 kB view details)

Uploaded Source

Built Distribution

rag_classic_ml-0.1.4-py3-none-any.whl (93.4 kB view details)

Uploaded Python 3

File details

Details for the file rag_classic_ml-0.1.4.tar.gz.

File metadata

  • Download URL: rag_classic_ml-0.1.4.tar.gz
  • Upload date:
  • Size: 82.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.11.7 Linux/6.8.0-35-generic

File hashes

Hashes for rag_classic_ml-0.1.4.tar.gz
Algorithm Hash digest
SHA256 642a813ab9f4e8f396963ba34076b0f95fae7f622e015d8e5c1636be7221d474
MD5 43e308ce93f747bb7b1f31106eaf841b
BLAKE2b-256 393191ece7b2d386aba561cf668fdbc44b0d99f30d9189de007e5f17ee9689d0

See more details on using hashes here.

File details

Details for the file rag_classic_ml-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: rag_classic_ml-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 93.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.11.7 Linux/6.8.0-35-generic

File hashes

Hashes for rag_classic_ml-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 6b1a6192ca3b0d5ecb7361120d39931e3da8a08fc1f1381868fe121a03f1e31f
MD5 69ef666e265a83b28a16b3690d5a6088
BLAKE2b-256 9d3cb1d42881e3bfe477925111c9048303332ba148cdfd42567a11e6ddaa96d6

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page