UI-based Fine-Tuning Framework for Large Language Models (LLMs)

Upasak - UI-based Fine-Tuning for Large Language Models (LLMs)

Upasak is a flexible, privacy-conscious, no-code/low-code framework for fine-tuning large language models, built around Hugging Face Transformers. It features an easy-to-use Streamlit-based interface, multi-format dataset support, built-in sanitization of PII and other sensitive information, and a customizable training process. Whether you're experimenting, researching, or running internal fine-tuning tasks, Upasak makes fine-tuning accessible while helping keep training data privacy-compliant.

Key Features

LLM Fine-Tuning

  • Developed on top of Hugging Face's Transformers library.
  • Supports text-only models from the Gemma-3 family for instruction tuning or domain adaptation.
  • Full-parameter fine-tuning or LoRA (Parameter-Efficient Fine-Tuning).
  • Future support planned for image-text-to-text Gemma-3 models, LLaMA, Qwen, Phi, Mixtral.

Flexible Dataset Handling

Upload or import datasets in multiple file formats:

  • .json
  • .jsonl
  • .csv
  • .zip (containing .txt)

Or select datasets directly from the Hugging Face Hub.
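
As a rough illustration of what multi-format loading involves, the sketch below reads each of the file types listed above into a list of records using only the Python standard library. The function name load_records and the choice to wrap each .txt file from a .zip archive as a {"text": ...} record are assumptions made for this example; this is not Upasak's actual loader.

```python
import csv
import json
import zipfile
from pathlib import Path


def load_records(path):
    """Return a list of dict records from a .json, .jsonl, .csv, or .zip file.

    Hypothetical helper for illustration; not part of the Upasak API.
    """
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Accept either a list of records or a single record object.
        return data if isinstance(data, list) else [data]

    if suffix == ".jsonl":
        # One JSON object per non-empty line.
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    if suffix == ".csv":
        # The header row supplies the column names used as record keys.
        with path.open(encoding="utf-8", newline="") as f:
            return list(csv.DictReader(f))

    if suffix == ".zip":
        # Assumption: every .txt member becomes one plain-text record.
        records = []
        with zipfile.ZipFile(path) as archive:
            for name in sorted(archive.namelist()):
                if name.lower().endswith(".txt"):
                    text = archive.read(name).decode("utf-8")
                    records.append({"text": text})
        return records

    raise ValueError(f"Unsupported file type: {suffix}")
```

Returning plain dictionaries for every format keeps later steps, such as schema detection, independent of where the data came from.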

Auto-Detection of Dataset Schema

Upasak intelligently identifies and structures your dataset into training-ready format. Supported schemas:

Schema | Format | Notes
DAPT | [{"text":"..."}] or text column | Domain-adaptive / continued pretraining
ALPACA | [{"instruction":"...", "output":"..."}] (+ optional "input") or instruction, output, input (optional) columns | Converted to user → assistant turns
CHATML | [{"messages":[{"role":"...", "content":"..."}]}] or messages column | Supports role/content pairs
SHARE_GPT | [{"conversations":[{"from":"...", "value":"..."}]}] or conversations column | Converts human ↔ model to user ↔ assistant
PROMPT_RESPONSE | [{"prompt":"...", "response":"..."}] or prompt, response columns | Simple instruction → answer
QA | [{"question":"...", "answer":"..."}] or question, answer columns | Q&A format
QLA | [{"question":"...", "long_answer":"..."}] or question, long_answer columns | Long-form generation
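
The schemas above can be told apart by the keys (or columns) a record carries. The following is a minimal sketch of that idea together with a conversion to user/assistant turns. The names detect_schema and to_messages, the order in which schemas are checked, the "gpt" and "model" role aliases, and the way an Alpaca "input" is appended to the instruction are all assumptions for illustration; Upasak's own detection logic may differ.

```python
# Required keys per schema, checked in order (the order is an assumption).
SCHEMA_KEYS = [
    ("CHATML", {"messages"}),
    ("SHARE_GPT", {"conversations"}),
    ("ALPACA", {"instruction", "output"}),
    ("PROMPT_RESPONSE", {"prompt", "response"}),
    ("QLA", {"question", "long_answer"}),
    ("QA", {"question", "answer"}),
    ("DAPT", {"text"}),
]

# Schemas that map directly to one user turn and one assistant turn.
PAIR_KEYS = {
    "ALPACA": ("instruction", "output"),
    "PROMPT_RESPONSE": ("prompt", "response"),
    "QA": ("question", "answer"),
    "QLA": ("question", "long_answer"),
}

# Assumed ShareGPT speaker names and their chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "model": "assistant"}


def detect_schema(record):
    """Return the first schema whose required keys are all present, else None."""
    keys = set(record)
    for name, required in SCHEMA_KEYS:
        if required <= keys:
            return name
    return None


def to_messages(record):
    """Convert a conversational record into a list of role/content turns."""
    schema = detect_schema(record)
    if schema == "CHATML":
        return record["messages"]
    if schema == "SHARE_GPT":
        return [
            {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
            for turn in record["conversations"]
        ]
    if schema in PAIR_KEYS:
        user_key, assistant_key = PAIR_KEYS[schema]
        user_text = record[user_key]
        # Assumption: an optional Alpaca "input" is appended to the instruction.
        if schema == "ALPACA" and record.get("input"):
            user_text = f"{user_text}\n\n{record['input']}"
        return [
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": record[assistant_key]},
        ]
    # DAPT records are plain text and have no chat turns.
    raise ValueError(f"No chat conversion for schema: {schema}")
```

For CSV input, the same check applies to column names, since each row can be read as a record keyed by its headers.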

Built-In PII & Sensitive Information Sanitization

Upasak supports privacy compliance by:

  • Automatically detecting and redacting/masking PII
  • Using placeholder tokens to preserve dataset utility
  • Offering AI-assisted detection, based on the GLiNER named entity recognition (NER) model, with manual review loops
  • Logging sanitization results for auditability

Upasak automatically detects and redacts:

  • Personal names
  • Emails / phone numbers
  • IP addresses, IMEI
  • Credit card / bank details
  • National IDs (Aadhaar, PAN, Voter ID)
  • API keys
  • GitHub/GitLab tokens
  • Database credentials
  • Residential & workplace addresses
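
To make the rule-based idea concrete, here is a small sketch of regex-driven redaction with placeholder tokens. The patterns, the placeholder names, and the sanitize function are simplified assumptions for illustration and cover only a few of the categories listed above; they are not the rules Upasak ships with.

```python
import re

# Simplified, illustrative patterns (assumptions, not Upasak's rule set).
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "PAN": re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"),
    "GITHUB_TOKEN": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def sanitize(text):
    """Replace matches with [LABEL] placeholders and return (text, findings).

    findings is a list of (label, original_value) tuples that can be
    logged for auditing.
    """
    findings = []
    for label, pattern in PATTERNS.items():
        # Bind label as a default argument so each callback keeps its own value.
        def replace(match, label=label):
            findings.append((label, match.group(0)))
            return f"[{label}]"

        text = pattern.sub(replace, text)
    return text, findings


clean, found = sanitize("Mail jane.doe@example.com from 192.168.1.10, PAN ABCDE1234F.")
print(clean)  # Mail [EMAIL] from [IP_ADDRESS], PAN [PAN].
```

Typed placeholders such as [EMAIL] keep the sentence structure intact, so the sanitized text remains usable for training while the sensitive value itself is gone.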

Two sanitization modes:

  1. Rule-Based (default)

  2. Hybrid (Rule-Based + NER-based)

    • Optional human review
    • Configure HITL ratio & max samples for human review
    • Accept/reject uncertain detections directly in the UI
    • Preview sanitized sample before training
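
One way to picture the human-in-the-loop (HITL) settings is as a review budget: a fraction of the uncertain detections is sent to a reviewer, capped by a maximum sample count. The sketch below assumes that each detection carries a confidence score and that "uncertain" means a score inside a configurable band. The function name, the score band, and the random sampling are all assumptions; the source does not specify how Upasak applies the ratio.

```python
import math
import random


def select_for_review(detections, hitl_ratio, max_samples, low=0.4, high=0.8, seed=0):
    """Pick uncertain detections for manual review.

    detections: list of dicts with a "score" key in [0, 1] (assumed shape).
    hitl_ratio: fraction of uncertain detections to review.
    max_samples: hard cap on the number of items sent to the reviewer.
    """
    # Treat scores inside [low, high) as uncertain; thresholds are assumptions.
    uncertain = [d for d in detections if low <= d["score"] < high]
    budget = min(max_samples, math.ceil(hitl_ratio * len(uncertain)))
    # A fixed seed keeps the selection reproducible between runs.
    rng = random.Random(seed)
    return rng.sample(uncertain, budget)


detections = [
    {"span": "J. Smith", "score": 0.95},
    {"span": "Acme Road", "score": 0.55},
    {"span": "Blue Team", "score": 0.62},
    {"span": "Room 12", "score": 0.45},
]
for item in select_for_review(detections, hitl_ratio=0.5, max_samples=10):
    print(item["span"], item["score"])
```

Detections the reviewer accepts would then be redacted, and rejected ones left as they are.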

Streamlit UI – No-Code Training Workflow

The visual interface provides fully interactive control:

1. Model Selection

Choose supported base models (currently Gemma-3 text-only). Future updates will include LLaMA, Mixtral, Phi, Qwen and multimodal variants.

2. HF Token Handling

  • Read token for pulling models
  • Write token for pushing fine-tuned models back to HF Hub

3. Dataset Input

  • Upload dataset files
  • Or load from Hugging Face dataset list

4. PII Sanitization Panel

  • Enable/disable sanitization
  • Select detection method (rule-based / hybrid)
  • Enable Human Review & configure ratios
  • View uncertain detections and choose actions
  • Preview sanitized sample before training

5. Hyperparameter Controls

Basic Hyperparameters

  • Learning rate
  • Batch size
  • Epochs
  • Max sequence length
  • Logging steps
  • LR scheduler

Advanced Hyperparameters

  • Gradient accumulation
  • Gradient clipping
  • LR warmup ratio
  • Weight decay
  • Checkpoint save strategy
  • Evaluation strategy + steps
  • Validation split
  • Model tracker platform (Comet / WandB / none)
  • Tracker API keys
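
Several of these settings interact. The sketch below works through the arithmetic that links batch size, gradient accumulation, epochs, validation split, and warmup ratio. The function name and the rounding choices are assumptions; real trainers can round slightly differently (for example, when the last partial batch is dropped), so treat the numbers as estimates.

```python
import math


def training_schedule(num_samples, batch_size, grad_accum, epochs,
                      warmup_ratio, val_split=0.0, num_devices=1):
    """Estimate optimizer steps and warmup steps from basic hyperparameters."""
    val_samples = round(num_samples * val_split)
    train_samples = num_samples - val_samples
    # Samples consumed per optimizer step.
    effective_batch = batch_size * grad_accum * num_devices
    steps_per_epoch = math.ceil(train_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "train_samples": train_samples,
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


plan = training_schedule(num_samples=1000, batch_size=4, grad_accum=8,
                         epochs=3, warmup_ratio=0.1, val_split=0.2)
print(plan)
# {'train_samples': 800, 'effective_batch': 32, 'steps_per_epoch': 25,
#  'total_steps': 75, 'warmup_steps': 8}
```

An estimate like this helps when choosing logging, evaluation, and checkpoint intervals, since those are usually expressed in optimizer steps.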

6. LoRA Configuration

  • LoRA rank
  • LoRA alpha
  • LoRA dropout
  • Target modules
  • Optional merging of LoRA adapters
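
For intuition about rank, alpha, and merging, the sketch below uses the standard LoRA formulation: a frozen weight W of shape d_out × d_in receives a low-rank update B·A scaled by alpha / rank, and merging folds that update back into W. It uses plain Python lists so it runs without dependencies. The helper names are hypothetical and the code is not taken from Upasak.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, rank, alpha):
    """Return W + (alpha / rank) * B @ A as a new nested list."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# A 4096 x 4096 projection has 16,777,216 weights; rank 8 trains far fewer.
print(lora_param_count(4096, 4096, rank=8))  # 65536

# Tiny merge example with d_in = d_out = 2 and rank = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]          # rank x d_in
B = [[1.0], [0.0]]        # d_out x rank
print(merge_lora(W, A, B, rank=1, alpha=2))  # [[3.0, 4.0], [0.0, 1.0]]
```

LoRA dropout only affects the adapter path during training, so it plays no part in the merge step.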

7. Training Control

  • Start / Stop training

  • Live training metrics inside the app:

    • Training loss
    • Validation loss
    • Token-level curves
  • Optional external tracking (Comet / WandB)

8. Inference Script Generation

After training completes, Upasak automatically generates a customized inference.py script tailored to your training configuration.

  • Handles every training scenario:
    • LoRA + merged adapters: loads the fully merged model.
    • LoRA + unmerged adapters: loads the base model and applies the LoRA adapters at runtime.
    • Full fine-tune: standard model loading.
  • Ready to use: find it in your output directory.

Usage

cd path_to_output_dir
python inference.py
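
The three loading scenarios can be sketched with the public Transformers and PEFT APIs as shown below. This is not the content of the generated inference.py. The function names, the model_dir and base_model_id parameters, and the assumption that adapters are saved in model_dir are hypothetical, and the transformers and peft packages must be installed to call load_model.

```python
def loading_plan(use_lora, merged):
    """Map the training configuration to one of three loading strategies."""
    if not use_lora:
        return "full"
    return "merged" if merged else "adapter"


def load_model(model_dir, plan, base_model_id=None):
    """Load a model and tokenizer according to the chosen plan.

    model_dir: directory with the saved model or LoRA adapters (assumed layout).
    base_model_id: Hub ID or path of the base model, needed for the "adapter" plan.
    """
    # Imported lazily so loading_plan can be used without these packages.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    if plan in ("full", "merged"):
        # Full fine-tunes and merged LoRA models load like any saved model.
        model = AutoModelForCausalLM.from_pretrained(model_dir)
        tokenizer = AutoTokenizer.from_pretrained(model_dir)
    elif plan == "adapter":
        from peft import PeftModel

        if base_model_id is None:
            raise ValueError("base_model_id is required for unmerged adapters")
        # Load the base weights first, then attach the adapters on top.
        base = AutoModelForCausalLM.from_pretrained(base_model_id)
        model = PeftModel.from_pretrained(base, model_dir)
        tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    else:
        raise ValueError(f"Unknown plan: {plan}")
    return model, tokenizer


print(loading_plan(use_lora=True, merged=False))  # adapter
```

Keeping the strategy decision separate from the loading code makes the branching easy to check without downloading any weights.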

9. Export & Push

  • Output directory for checkpoints, final model, and merged model
  • Push to HF Hub (when write-enabled token is provided)

Installation

Install from PyPI (recommended)

pip install upasak

Or install from source

# Clone this repo
git clone https://github.com/shrut2702/upasak
cd upasak

# Optional: create and activate a virtual environment
## For Windows
python -m venv vir_env
.\vir_env\Scripts\activate

## For macOS / Linux
python -m venv vir_env
source vir_env/bin/activate

# Install required dependencies
pip install -r requirements.txt

Usage

Upasak runs as a Streamlit app that you launch from a small Python script.

After installing the package:

1. Create a Python launcher file

For example: run_upasak.py

from upasak import main

if __name__ == "__main__":
    main()

2. Launch the Streamlit application

streamlit run run_upasak.py

or

streamlit run run_upasak.py --server.maxUploadSize=1024 # for configuring upload file size limit in MB

This opens the Upasak UI in your browser.

After installing from source

1. Launch app.py

streamlit run app.py

or

streamlit run app.py --server.maxUploadSize=1024 # for configuring upload file size limit in MB

Reusability of Upasak Modules

Although Upasak provides a full end-to-end UI, every internal component is designed to be reusable in isolation. You can import and use modules such as:

  • TokenizerWrapper → standalone tokenization
  • TrainingEngine + TrainerConfig → run full or LoRA fine-tuning programmatically
  • PIISanitizer → rule-based or hybrid PII detection/sanitization

Refer to the examples for more details.

This allows you to integrate Upasak directly into custom pipelines, backend services, notebooks, or data-processing workflows — without launching the Streamlit UI.


Use Cases

  • Educational fine-tuning demonstrations
  • Rapid prototyping in quick-shipping environments
  • Dataset preparation and anonymization workflows
  • Internal LLM fine-tuning on sensitive or regulated data
  • Developers with no domain expertise who want to add an LLM to their application

Contributing

Contributions are welcome! Please open an issue or submit a pull request for bug fixes, features, documentation, or dataset schema support.


Support

For issues, questions, or feature requests: Create a GitHub issue in this repository.

