Supervised Fine-Tuning (SFT) with LoRA and DeepSpeed
This project provides a streamlined pipeline for fine-tuning Large Language Models (LLMs) like Llama 3.1 using Low-Rank Adaptation (LoRA) and DeepSpeed for efficient distributed training.
📂 Directory Structure
SFT/
├── configs/ # Configuration files (planned)
├── <placeholder>/ # Datasets
├── <placeholder>/ # Output directory for checkpoints and adapters
├── scripts/
│   └── run.sh # Main entry point script
├── lora_sft.py # Main training and inference script
├── ray_train_lora_sft.py # Ray Train entrypoint
├── k8s/ # KubeRay RayJob template and Dockerfile
├── config.yaml # Hyperparameters and paths configuration
├── ds_config.json # DeepSpeed configuration
├── requirements.txt # Python dependencies
└── README.md # This file
🚀 Setup
- Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
(Ensure the installed deepspeed version is compatible with your CUDA version.)
- Configure Environment: Create a .env file in the root directory:
HF_TOKEN=your_huggingface_token
WANDB_API_KEY=your_wandb_key # Optional, for logging
🛠️ Configuration
- config.yaml: Controls model ID, dataset paths, training hyperparameters (learning rate, epochs, batch size), and LoRA settings.
- ds_config.json: Configures DeepSpeed optimization (ZeRO stage, offloading, mixed precision).
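For orientation, a sketch of the kind of fields config.yaml typically carries is shown below. The key names here are illustrative assumptions, not the repo's actual schema; consult the file itself for the authoritative names.

# Illustrative config.yaml sketch -- key names are assumptions
model_id: meta-llama/Llama-3.1-8B   # base model to fine-tune
train_file: data/train.jsonl        # dataset path
output_dir: outputs/                # checkpoints and LoRA adapters
learning_rate: 2.0e-4
num_epochs: 3
per_device_batch_size: 4
lora:
  r: 16        # LoRA rank
  alpha: 32    # LoRA scaling factor
  dropout: 0.05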
🏃 Usage
Use the provided scripts/run.sh wrapper for easy execution. It automatically handles directory paths.
Training
To start fine-tuning the model:
# Default: Train on 2 GPUs
./scripts/run.sh train 2
# Train on 4 GPUs
./scripts/run.sh train 4
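Under the hood, the wrapper presumably expands to a DeepSpeed launcher invocation. A rough, hypothetical equivalent of the two-GPU run (the flags lora_sft.py accepts are an assumption, not documented here):

# Hypothetical expansion of `./scripts/run.sh train 2` -- script flags are assumptions
deepspeed --num_gpus 2 lora_sft.py --config config.yaml --deepspeed ds_config.json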
Inference
To evaluate the fine-tuned model on the test set:
./scripts/run.sh inference
Custom Configuration
You can specify a custom DeepSpeed config file:
./scripts/run.sh --config custom_ds_config.json train 4
📊 Monitoring
Training progress (loss, accuracy, etc.) is logged to MLflow (and/or WandB if configured).
To view MLflow logs locally:
mlflow ui
Then open http://localhost:5000 in your browser.
🐛 Troubleshooting
- deepspeed: command not found: Ensure you have activated the virtual environment where deepspeed is installed.
- CUDA Errors: Check ds_config.json to ensure batch sizes and offloading settings fit your GPU memory.
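As a reference point for that tuning, the sketch below shows a generic ZeRO stage-2 DeepSpeed config with CPU optimizer offloading and bf16. These are standard DeepSpeed option names, but this is a common shape for such files rather than the contents of this repo's ds_config.json.

{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" }
  }
}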
☸️ Ray Train + Kubernetes (KubeRay)
This repo now includes a Ray Train entrypoint for running multi-GPU training on a Ray cluster (including on Kubernetes via KubeRay).
- Ray entrypoint: SFT/ray_train_lora_sft.py
- KubeRay RayJob template: SFT/k8s/rayjob-lora-sft.yaml
- Container build: SFT/k8s/Dockerfile
Local Ray (single node)
pip install -r requirements.txt
python ray_train_lora_sft.py --num_workers 2
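For a sense of the shape of such an entrypoint, here is a minimal Ray Train sketch. A toy linear model stands in for the actual LoRA/DeepSpeed wiring in ray_train_lora_sft.py; it assumes a recent Ray 2.x with ray[train] and torch installed.

# Minimal Ray Train sketch -- a toy model stands in for the real
# LoRA/DeepSpeed training loop; not the repo's actual ray_train_lora_sft.py.
import torch
import ray.train
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Runs on each Ray worker; prepare_model moves the model to the worker's
    # device and wraps it for distributed data-parallel training.
    model = ray.train.torch.prepare_model(torch.nn.Linear(8, 1))
    device = ray.train.torch.get_device()
    optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
    for epoch in range(config["epochs"]):
        x = torch.randn(64, 8, device=device)  # stand-in batch
        loss = model(x).pow(2).mean()          # dummy objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ray.train.report({"epoch": epoch, "loss": loss.item()})

if __name__ == "__main__":
    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"lr": 2e-4, "epochs": 2},
        # Mirrors --num_workers 2 above; set use_gpu=False for a CPU-only test.
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    )
    trainer.fit()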
Kubernetes (high level)
- Build/push the image from SFT/k8s/Dockerfile and set it in rayjob-lora-sft.yaml.
- Create PVCs for:
  - /workspace/SFT/data_dir (your training data)
  - /mnt/ray-results (Ray Train run storage / checkpoints)
- Apply the RayJob:
kubectl apply -f SFT/k8s/rayjob-lora-sft.yaml
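For reference, a KubeRay RayJob generally has the shape sketched below. This skeleton is generic, not the repo's rayjob-lora-sft.yaml: the image name, replica counts, and GPU limits are placeholders you would replace with your own values.

# Generic KubeRay RayJob skeleton -- placeholders throughout, not the repo's file
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: lora-sft
spec:
  entrypoint: python /workspace/SFT/ray_train_lora_sft.py --num_workers 2
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: your-registry/sft:latest   # built from SFT/k8s/Dockerfile
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 2
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: your-registry/sft:latest
                resources:
                  limits:
                    nvidia.com/gpu: 1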