Classical Machine Learning methods for Reterival Augmented Generation
Project description
RAG-Classic-ML
RAG-Classic-ML is a versatile Python package designed to provide out-of-the-box machine learning pipelines for both basic and advanced tasks. It simplifies the process of building, training, and evaluating models for tasks like classification, regression, autoencoder-based feature extraction, and survival clustering. The package is designed for ease of use, offering pre-built pipelines and customizable parameters for a variety of machine learning algorithms.
Table of Contents
Features
- Basic Machine Learning Pipelines: Ready-to-use pipelines for common supervised learning tasks, including classification and regression, with a variety of machine learning models (e.g., Logistic Regression, SVC, Random Forest).
- Advanced Pipelines
- Autoencoder : Dimensionality reduction and feature extraction using deep learning autoencoders.
- Survival Clustering Analysis: Performs clustering on patient features and integrates clinical data to generate Kaplan-Meier survival plots and log-rank tests.
- Customizable Models and Parameters: Easily define and customize machine learning models and hyperparameters.
- Prediction and Metrics Generation: Generates and saves predictions, feature importance scores, and various performance metrics for each model and run.
- Aggregation of Results: Aggregates results across runs and models for comprehensive analysis, facilitating comparison and evaluation.
- Visualization Tools: Generates plots including AUC curves, AUC box plots, feature importance charts, radar charts for model performance comparison, and survival analysis plots.
Installation
You can install the package directly from PyPI:
pip install classic-ml
Alternatively, install from source:
git clone https://github.com/yourusername/classic-ml.git
cd benchmark-adv-ml
pip install .
Useage
The classic-ml package provides a command-line interface (CLI) for ease of use. Below are examples of how to use the various components.
Basic Machine Learning Pipelines
Classification
Train and evaluate a classification model using the classic-ml CLI. You can specify different models and hyperparameters.
Example 1: Support Vector Classifier (SVC)
classic-ml classification \
--data ./Raisin_Dataset.data \
--target 'label' \
--output ./results/svc_rbf/ \
--model SVC \
--model_params '{"C": 1.0, "kernel": "rbf", "gamma": "scale", "probability": true}' \
--test_size 0.2 \
--seed 42
Example 2: Logistic Regression
classic-ml classification \
--data ./Raisin_Dataset.data \
--target 'label' \
--output ./results/logistic_regression/ \
--model LogisticRegression \
--model_params '{"C": 0.5, "penalty": "l1", "solver": "saga", "max_iter": 1000, "class_weight": "balanced"}' \
--test_size 0.2 \
--seed 42
Example 3: Random Forest Classifier
classic-ml classification \
--data ./Raisin_Dataset.data \
--target 'label' \
--output ./results/random_forest/ \
--model RandomForestClassifier \
--model_params '{"n_estimators": 100, "max_depth": 10}' \
--test_size 0.2 \
--seed 42
Benchmark Machine Learning Models
Run the benchmark ML pipeline to evaluate model stability across multiple runs.
benchmark-adv-ml benchmark --data ./your_dataset.csv --output ./final_results --prelim_output ./prelim_results --n_runs 10 --seed 42
Train Autoencoder Model
Train and evaluate an autoencoder model for feature extraction.
classic-ml autoencoder \
--data ./your_dataset.csv \
--sampleID 'PatientID' \
--output_dir ./final_results \
--prelim_output ./prelim_results \
--latent_dim 10 \
--epochs 50 \
--batch_size 32 \
--validation_split 0.1 \
--test_size 0.2 \
--seed 42
Survival Clustering Analysis
classic-ml survival_clustering \
--data_path ./latent_features.csv \
--clinical_df_path ./clinical_data.csv \
--save_dir ./final_results
Command-Line Arguments
Common Arguments
--data
: Path to the existing CSV file containing the dataset.--output
: Directory to save the final results and plots.--prelim_output
: Directory to save the preliminary results (predictions).--seed
: Seed for random state (default is 42).--test_size
: Fraction of data to use for testing (default: 0.2).
Classification/Regression Command Arguments
--target
: Target column name in the dataset (e.g., 'label' for classification or 'price' for regression).--model
: Specify the machine learning model to use (e.g., SVC, LogisticRegression, RandomForestClassifier, LinearRegression).--model_params
: Hyperparameters for the specified model in JSON format (e.g., {"C": 1.0, "kernel": "rbf"}).
Autoencoder Command Arguments
--sampleID
: Column name representing the sample or patient ID (default: 'sampleID').--latent_dim
: Dimensionality of the latent space (default: input_dim // 8).--epochs
: Number of training epochs (default: 50).--batch_size
: Training batch size (default: 32).--validation_split
: Proportion of training data to use as validation set (default: 0.1).--test_size
: Proportion of data to use as test set (default: 0.2).--early_stopping
: Enable early stopping (use flag to activate).--patience
: Patience for early stopping (default: 5).--checkpoint
: Enable model checkpointing (use flag to activate).
Benchmark Command Arguments
--target
: Target column name in the dataset (default: 'label').--n_runs
: Number of runs for model stability evaluation (default: 20).
Survival Clustering Command Arguments
--data_path
: Path to the CSV file containing patient features.--clinical_df_path
: Path to the CSV file containing clinical data.--save_dir
: Directory to save the results.
Dependencies
- Python 3.11+
- numpy
- pandas
- scikit-learn
- matplotlib
- seaborn
- tensorflow
- lifelines
- yellowbrick
License
This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. See the LICENSE file for details.
Author
Vatsal Patel - VatsalPatel18
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for rag_classic_ml-0.1.4-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 6b1a6192ca3b0d5ecb7361120d39931e3da8a08fc1f1381868fe121a03f1e31f |
|
MD5 | 69ef666e265a83b28a16b3690d5a6088 |
|
BLAKE2b-256 | 9d3cb1d42881e3bfe477925111c9048303332ba148cdfd42567a11e6ddaa96d6 |