Supervised Fine-Tuning (SFT) with LoRA and DeepSpeed

This project provides a streamlined pipeline for fine-tuning Large Language Models (LLMs) like Llama 3.1 using Low-Rank Adaptation (LoRA) and DeepSpeed for efficient distributed training.
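
LoRA freezes the base model's weights and trains only small low-rank adapter matrices, which keeps GPU memory use and checkpoint size low. As background, this is roughly how adapters are attached with the peft library (a minimal sketch; the rank, target modules, and model ID below are illustrative, not this project's settings):

    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative values only; the project's real settings live in config.yaml.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
    )
    lora_config = LoraConfig(
        r=16,                                 # rank of the low-rank update
        lora_alpha=32,                        # scaling applied to the update
        target_modules=["q_proj", "v_proj"],  # adapt the attention projections
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only adapter weights are trainable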

📂 Directory Structure

SFT/
├── configs/                 # Configuration files (planned)
├── <placeholder>/           # Datasets
├── <placeholder>/           # Output directory for checkpoints and adapters
├── scripts/
│   └── run.sh               # Main entry point script
├── lora_sft.py              # Main training and inference script
├── config.yaml              # Hyperparameters and paths configuration
├── ds_config.json           # DeepSpeed configuration
├── requirements.txt         # Python dependencies
└── README.md                # This file

🚀 Setup

  1. Create and activate a virtual environment:

    python3 -m venv venv
    source venv/bin/activate
    
  2. Install dependencies:

    pip install -r requirements.txt
    

    (Ensure the installed deepspeed build is compatible with your CUDA version.)

  3. Configure Environment: Create a .env file in the root directory (a sketch for loading it follows below):

    HF_TOKEN=your_huggingface_token
    WANDB_API_KEY=your_wandb_key  # Optional, for logging
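
If your own scripts need these variables, a common pattern is python-dotenv (a minimal sketch, assuming python-dotenv is installed):

    import os
    from dotenv import load_dotenv

    load_dotenv()  # reads key=value pairs from .env into the environment
    hf_token = os.environ["HF_TOKEN"]            # required for gated models like Llama 3.1
    wandb_key = os.environ.get("WANDB_API_KEY")  # optional; None if unset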
    

🛠️ Configuration

  • config.yaml: Controls the model ID, dataset paths, training hyperparameters (learning rate, epochs, batch size), and LoRA settings.
  • ds_config.json: Configures DeepSpeed optimization (ZeRO stage, offloading, mixed precision). Illustrative sketches of both files follow below.
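
A hypothetical config.yaml showing the kinds of keys described above (key names are illustrative; check the shipped file for the real schema):

    # Illustrative only; not the repo's actual config.yaml.
    model_id: meta-llama/Llama-3.1-8B-Instruct
    dataset_path: data_dir/train.jsonl
    learning_rate: 2.0e-4
    num_epochs: 3
    per_device_batch_size: 4
    lora:
      r: 16
      alpha: 32
      dropout: 0.05

And a minimal DeepSpeed config of the kind this file usually contains: ZeRO stage 2 with CPU optimizer offload and bf16, where "auto" lets the Hugging Face Trainer integration fill in values from its own arguments. A sketch, not the repo's exact file:

    {
      "train_micro_batch_size_per_gpu": "auto",
      "gradient_accumulation_steps": "auto",
      "gradient_clipping": "auto",
      "bf16": { "enabled": "auto" },
      "zero_optimization": {
        "stage": 2,
        "offload_optimizer": { "device": "cpu", "pin_memory": true }
      }
    }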

🏃 Usage

Use the provided scripts/run.sh wrapper for easy execution. It automatically handles directory paths.

Training

To start fine-tuning the model:

# Default: Train on 2 GPUs
./scripts/run.sh train 2

# Train on 4 GPUs
./scripts/run.sh train 4
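
A wrapper like this typically expands to a deepspeed launcher invocation along these lines (the script flags are illustrative; run.sh and lora_sft.py define the real ones):

    deepspeed --num_gpus 2 lora_sft.py --config config.yaml --deepspeed ds_config.json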

Inference

To evaluate the fine-tuned model on the test set:

./scripts/run.sh inference
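
Inference on a LoRA run typically means loading the frozen base model and attaching the saved adapter. A sketch with the peft API (the base model ID and adapter path are placeholders for whatever your run produced):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    base_id = "meta-llama/Llama-3.1-8B-Instruct"  # example base model
    adapter_dir = "outputs/lora-adapter"          # placeholder checkpoint path

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, adapter_dir)  # attach trained adapter

    inputs = tokenizer("Explain LoRA in one sentence.", return_tensors="pt")
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))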

Custom Configuration

You can specify a custom DeepSpeed config file:

./scripts/run.sh --config custom_ds_config.json train 4

📊 Monitoring

Training progress (loss, accuracy, etc.) is logged to MLflow (and/or WandB if configured).

To view MLflow logs locally:

mlflow ui

Then open http://localhost:5000 in your browser.
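
For reference, metric logging in MLflow follows this pattern (the experiment and metric names here are illustrative; lora_sft.py defines the real ones):

    import mlflow

    mlflow.set_experiment("lora-sft")  # illustrative experiment name
    with mlflow.start_run():
        mlflow.log_params({"learning_rate": 2e-4, "epochs": 3})
        mlflow.log_metric("train_loss", 1.23, step=100)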

🐛 Troubleshooting

  • deepspeed: command not found: Ensure you have activated the virtual environment where deepspeed is installed.
  • CUDA Errors: Check ds_config.json to ensure batch sizes and offloading settings fit your GPU memory.

☸️ Ray Train + Kubernetes (KubeRay)

This repo now includes a Ray Train entrypoint for running multi-GPU training on a Ray cluster (including on Kubernetes via KubeRay).

  • Ray entrypoint: SFT/ray_train_lora_sft.py
  • KubeRay RayJob template: SFT/k8s/rayjob-lora-sft.yaml
  • Container build: SFT/k8s/Dockerfile

Local Ray (single node)

pip install -r requirements.txt
python ray_train_lora_sft.py --num_workers 2
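
An entrypoint like this is expected to follow the standard Ray Train pattern of wrapping a per-worker training function in a TorchTrainer; a minimal sketch (function and config names are illustrative, not the exact contents of ray_train_lora_sft.py):

    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_loop_per_worker(loop_config):
        # Each Ray worker runs this; Ray sets up torch.distributed, so the
        # existing LoRA/DeepSpeed training code can run inside unchanged.
        ...

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"config_path": "config.yaml"},  # illustrative
        scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    )
    result = trainer.fit()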

Kubernetes (high level)

  1. Build/push the image from SFT/k8s/Dockerfile and set it in rayjob-lora-sft.yaml.
  2. Create PVCs for:
    • /workspace/SFT/data_dir (your training data)
    • /mnt/ray-results (Ray Train run storage / checkpoints)
  3. Apply the RayJob (a trimmed sketch of the template follows below):

    kubectl apply -f SFT/k8s/rayjob-lora-sft.yaml
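
A trimmed sketch of the usual KubeRay RayJob shape (image names and GPU counts are placeholders; SFT/k8s/rayjob-lora-sft.yaml is authoritative):

    apiVersion: ray.io/v1
    kind: RayJob
    metadata:
      name: lora-sft
    spec:
      entrypoint: python SFT/ray_train_lora_sft.py --num_workers 2
      rayClusterSpec:
        headGroupSpec:
          rayStartParams: {}
          template:
            spec:
              containers:
                - name: ray-head
                  image: your-registry/lora-sft:latest  # placeholder
        workerGroupSpecs:
          - groupName: gpu-workers
            replicas: 2
            rayStartParams: {}
            template:
              spec:
                containers:
                  - name: ray-worker
                    image: your-registry/lora-sft:latest  # placeholder
                    resources:
                      limits:
                        nvidia.com/gpu: 1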
