UI-based Fine-Tuning Framework for Large Language Models (LLMs)
Upasak - UI-based Fine-Tuning for Large Language Models (LLMs)
Upasak is a flexible, privacy-conscious, no-code/low-code framework for fine-tuning large language models, built around Hugging Face Transformers. It features an easy-to-use Streamlit-based interface, multi-format dataset support, built-in PII and sensitive information sanitization, and a customizable training process. Whether you're experimenting, researching, or performing internal fine-tuning tasks, Upasak makes fine-tuning accessible and helps keep training data compliant with privacy requirements.
Key Features
LLM Fine-Tuning
- Developed on top of Hugging Face's Transformers library.
- Supports text-only models from the Gemma-3 family for instruction tuning or domain adaptation.
- Full-parameter fine-tuning or LoRA (Parameter-Efficient Fine-Tuning).
- Future support planned for image-text-to-text Gemma-3 models, LLaMA, Qwen, Phi, Mixtral.
Flexible Dataset Handling
Upload or import datasets in multiple file formats:
- .json
- .jsonl
- .csv
- .zip (containing .txt files)
Or select datasets directly from the Hugging Face Hub.
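As a rough illustration of how the four upload formats can be read into one common list of records, here is a minimal standard-library sketch. The function name `load_records` and the choice to turn each .txt file in a zip into a `{"text": ...}` record are assumptions made for this example, not Upasak's actual loader.

```python
import csv
import json
import zipfile
from pathlib import Path


def load_records(path):
    """Return a list of dict records from a .json, .jsonl, .csv, or .zip file.

    Hypothetical helper for illustration; not part of Upasak's API.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        # A JSON file is expected to hold a list of records.
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        return data if isinstance(data, list) else [data]
    if suffix == ".jsonl":
        # One JSON object per non-empty line.
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    if suffix == ".csv":
        # The header row supplies the column names.
        with path.open(encoding="utf-8", newline="") as f:
            return list(csv.DictReader(f))
    if suffix == ".zip":
        # Assumption: each .txt member becomes one {"text": ...} record.
        with zipfile.ZipFile(path) as archive:
            names = sorted(
                n for n in archive.namelist() if n.lower().endswith(".txt")
            )
            return [{"text": archive.read(n).decode("utf-8")} for n in names]
    raise ValueError(f"Unsupported file type: {suffix}")
```

Returning plain dicts for every format keeps the later schema-detection step independent of where the data came from.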
Auto-Detection of Dataset Schema
Upasak identifies your dataset's schema and restructures it into a training-ready format. Supported schemas:
| Schema | Format | Notes |
|---|---|---|
| DAPT | [{"text":"..."}] or text column | Domain adaptation / continued pretraining |
| ALPACA | [{"instruction":"...", "output":"..."}] (+ optional "input") or instruction, output, input (optional) columns | Converted to user → assistant turns |
| CHATML | [{"messages":[{"role":"...", "content":"..."}]}] or messages column | Supports role/content pairs |
| SHARE_GPT | [{"conversations":[{"from":"...", "value":"..."}]}] or conversations column | Converts human ↔ model to user ↔ assistant |
| PROMPT_RESPONSE | [{"prompt":"...", "response":"..."}] or prompt, response columns | Simple instruction → answer |
| QA | [{"question":"...", "answer":"..."}] or question, answer columns | Q&A format |
| QLA | [{"question":"...", "long_answer":"..."}] or question, long_answer columns | Long-form generation |
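To make the table concrete, the sketch below guesses a schema from the keys of a record and converts conversational records into role/content turns. It is only one plausible way to do this: the function names, the check order, the ShareGPT speaker names ("human", "gpt", "model"), and the way an Alpaca "input" is appended to the instruction are all assumptions, not Upasak's internal logic.

```python
def detect_schema(record):
    """Guess the schema name of one record from the keys it contains."""
    keys = set(record)
    if "messages" in keys:
        return "CHATML"
    if "conversations" in keys:
        return "SHARE_GPT"
    if {"instruction", "output"} <= keys:
        return "ALPACA"
    if {"prompt", "response"} <= keys:
        return "PROMPT_RESPONSE"
    if {"question", "long_answer"} <= keys:
        return "QLA"
    if {"question", "answer"} <= keys:
        return "QA"
    if "text" in keys:
        return "DAPT"
    raise ValueError(f"Unrecognized schema with keys: {sorted(keys)}")


# Assumed ShareGPT speaker names mapped onto chat roles.
SHAREGPT_ROLES = {"human": "user", "gpt": "assistant", "model": "assistant"}

# (user field, assistant field) for the single-turn schemas.
PAIR_FIELDS = {
    "PROMPT_RESPONSE": ("prompt", "response"),
    "QA": ("question", "answer"),
    "QLA": ("question", "long_answer"),
}


def to_messages(record):
    """Convert a conversational record into a list of role/content turns."""
    schema = detect_schema(record)
    if schema == "CHATML":
        return record["messages"]
    if schema == "SHARE_GPT":
        return [
            {"role": SHAREGPT_ROLES.get(t["from"], t["from"]), "content": t["value"]}
            for t in record["conversations"]
        ]
    if schema == "ALPACA":
        user = record["instruction"]
        if record.get("input"):
            # Assumption: the optional input is appended after a blank line.
            user = f"{user}\n\n{record['input']}"
        return [
            {"role": "user", "content": user},
            {"role": "assistant", "content": record["output"]},
        ]
    if schema in PAIR_FIELDS:
        user_key, assistant_key = PAIR_FIELDS[schema]
        return [
            {"role": "user", "content": record[user_key]},
            {"role": "assistant", "content": record[assistant_key]},
        ]
    raise ValueError("DAPT records hold raw text, not conversation turns")
```

DAPT records are left out of `to_messages` on purpose, since continued pretraining works on raw text rather than chat turns.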
Built-In PII & Sensitive Information Sanitization
Upasak ensures privacy compliance by:
- Automatically detecting and redacting/masking PII
- Using placeholder tokens to preserve dataset utility
- Offering AI-assisted detection with manual review loops, using the GLiNER named entity recognition (NER) model
- Logging sanitization results for auditability
Upasak automatically detects and redacts:
- Personal names
- Emails / phone numbers
- IP addresses, IMEI
- Credit card / bank details
- National IDs (Aadhaar, PAN, Voter ID)
- API keys
- GitHub/GitLab tokens
- Database credentials
- Residential & workplace addresses
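For a sense of what rule-based redaction with placeholder tokens looks like, here is a small regex sketch covering a few of the categories above. The patterns and placeholder names are illustrative assumptions and are much looser than a production detector would need; they are not the rules Upasak ships.

```python
import re

# Illustrative patterns only; real detectors need stricter validation.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("IP_ADDRESS", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    # PAN: five letters, four digits, one letter.
    ("PAN", re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")),
    # Aadhaar: twelve digits, optionally grouped in fours.
    ("AADHAAR", re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b")),
    # Classic GitHub personal access token prefix.
    ("GITHUB_TOKEN", re.compile(r"\bghp_[A-Za-z0-9]{36}\b")),
]


def redact(text):
    """Replace each match with a typed placeholder and count what was found."""
    counts = {}
    for label, pattern in PATTERNS:
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            counts[label] = n
    return text, counts
```

Typed placeholders such as `[EMAIL]` keep the sentence structure intact for training, and the returned counts can feed an audit log.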
Two sanitization modes:
- Rule-Based (default)
- Hybrid (Rule-Based + NER-based)
  - Optional human review
  - Configure HITL ratio & max samples for human review
  - Accept/reject uncertain detections directly in the UI
  - Preview sanitized sample before training
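How the HITL ratio and max-samples settings interact is not spelled out here, so the following is only a guess at one reasonable interpretation: detections with a middling confidence score count as uncertain, and a random fraction of them, capped at a maximum, is queued for human review. The function name, the score thresholds, and the detection dict layout are all hypothetical.

```python
import random


def plan_review(detections, hitl_ratio, max_samples, low=0.4, high=0.8, seed=0):
    """Split detections into auto-accepted, queued-for-review, and unreviewed.

    Each detection is a dict with a "score" between 0 and 1. Scores at or
    above `high` are accepted automatically, scores below `low` are dropped,
    and the band in between is treated as uncertain.
    """
    accepted = [d for d in detections if d["score"] >= high]
    uncertain = [d for d in detections if low <= d["score"] < high]
    # Review at most hitl_ratio of the uncertain items, capped at max_samples.
    budget = min(max_samples, int(hitl_ratio * len(uncertain)))
    review = random.Random(seed).sample(uncertain, budget)
    chosen = {id(d) for d in review}
    unreviewed = [d for d in uncertain if id(d) not in chosen]
    return accepted, review, unreviewed
```

A fixed seed makes the review sample reproducible, which helps when the same dataset is sanitized more than once.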
Streamlit UI – No-Code Training Workflow
The visual interface provides fully interactive control:
1. Model Selection
Choose supported base models (currently Gemma-3 text-only). Future updates will include LLaMA, Mixtral, Phi, Qwen and multimodal variants.
2. HF Token Handling
- Read token for pulling models
- Write token for pushing fine-tuned models back to HF Hub
3. Dataset Input
- Upload dataset files
- Or load from Hugging Face dataset list
4. PII Sanitization Panel
- Enable/disable sanitization
- Select detection method (rule-based / hybrid)
- Enable Human Review & configure ratios
- View uncertain detections and choose actions
- Preview sanitized sample before training
5. Hyperparameter Controls
Basic Hyperparameters
- Learning rate
- Batch size
- Epochs
- Max sequence length
- Logging steps
- LR scheduler
Advanced Hyperparameters
- Gradient accumulation
- Gradient clipping
- LR warmup ratio
- Weight decay
- Checkpoint save strategy
- Evaluation strategy + steps
- Validation split
- Model tracker platform (Comet / WandB / none)
- Tracker API keys
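Several of these settings interact: batch size and gradient accumulation together set the effective batch size, which in turn fixes the number of optimizer steps and therefore how many steps the warmup ratio covers. The sketch below works through that arithmetic for a single device. The class name `RunPlan` and its default values are assumptions for illustration, and the step counts a real trainer reports may differ slightly.

```python
import math
from dataclasses import dataclass


@dataclass
class RunPlan:
    """Hypothetical helper that derives step counts from basic settings."""

    num_examples: int
    batch_size: int = 4
    gradient_accumulation: int = 4
    epochs: int = 3
    warmup_ratio: float = 0.1
    validation_split: float = 0.1

    @property
    def train_examples(self):
        # Examples left for training after the validation split is held out.
        return self.num_examples - int(self.num_examples * self.validation_split)

    @property
    def effective_batch_size(self):
        # Examples consumed per optimizer step on one device.
        return self.batch_size * self.gradient_accumulation

    @property
    def total_steps(self):
        steps_per_epoch = math.ceil(self.train_examples / self.effective_batch_size)
        return steps_per_epoch * self.epochs

    @property
    def warmup_steps(self):
        # Number of optimizer steps spent ramping the learning rate up.
        return math.ceil(self.total_steps * self.warmup_ratio)


plan = RunPlan(num_examples=1000)
print(plan.effective_batch_size, plan.total_steps, plan.warmup_steps)
```

With 1000 examples and the defaults above, 900 examples remain for training, giving 57 steps per epoch at an effective batch size of 16.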
6. LoRA Configuration
- LoRA rank
- LoRA alpha
- LoRA dropout
- Target modules
- Optional merging of LoRA adapters
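The rank and alpha settings are easier to reason about with the underlying arithmetic in view. In standard LoRA, a frozen weight W of shape (d_out x d_in) gets a low-rank update B @ A scaled by alpha / rank, and merging simply adds that update into W. The pure-Python sketch below illustrates this with nested lists; the function names are made up for the example, and this is not how Upasak or PEFT implement it internally.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha, rank):
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return alpha / rank


def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) using plain nested lists."""
    scale = lora_scaling(alpha, rank)
    merged = [row[:] for row in weight]  # copy so the input stays untouched
    for i in range(len(lora_b)):
        for j in range(len(lora_a[0])):
            delta = sum(lora_b[i][k] * lora_a[k][j] for k in range(rank))
            merged[i][j] += scale * delta
    return merged


# A 4096 x 4096 projection has 16,777,216 weights; rank 8 trains far fewer.
print(lora_trainable_params(4096, 4096, rank=8))
```

Because the merged matrix has the same shape as the original weight, a merged model can be loaded like any ordinary checkpoint, while unmerged adapters stay small and can be swapped at runtime.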
7. Training Control
- Start / Stop training
- Live training metrics inside the app:
  - Training loss
  - Validation loss
  - Token-level curves
- Optional external tracking (Comet / WandB)
8. Inference Script Generation
After training completes, Upasak automatically generates a customized inference.py script tailored to your training configuration.
- LoRA support – Handles both scenarios:
  - LoRA + merged adapters – Loads the fully merged model.
  - LoRA + unmerged adapters – Loads base model + applies LoRA adapters at runtime.
- Full fine-tune – Standard model loading
- Ready to use – Access it in your output directory
Usage
cd path_to_output_dir
python inference.py
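The contents of the generated script are not shown here, but the three loading paths described above can be sketched with the public Transformers and PEFT APIs. Treat this as an approximation: the function names, the `mode` strings, and the directory layout are assumptions, and the real inference.py may be organized differently.

```python
def load_model(model_dir, mode, base_model_id=None):
    """Load a fine-tuned causal LM for one of three training outcomes.

    mode is "full", "lora_merged", or "lora_unmerged" (hypothetical names).
    """
    if mode not in ("full", "lora_merged", "lora_unmerged"):
        raise ValueError(f"Unknown mode: {mode}")
    if mode == "lora_unmerged" and base_model_id is None:
        raise ValueError("base_model_id is required for unmerged adapters")

    # Imported lazily so the heavy dependencies load only when needed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    if mode == "lora_unmerged":
        from peft import PeftModel

        # Load the base weights, then attach the saved adapters at runtime.
        base = AutoModelForCausalLM.from_pretrained(base_model_id)
        model = PeftModel.from_pretrained(base, model_dir)
        tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    else:
        # A full fine-tune and a merged LoRA model are both plain checkpoints.
        model = AutoModelForCausalLM.from_pretrained(model_dir)
        tokenizer = AutoTokenizer.from_pretrained(model_dir)
    return model, tokenizer


def generate(model, tokenizer, prompt, max_new_tokens=128):
    """Run one chat-style generation and return only the new text."""
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    )
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
```

For example, `load_model("path_to_output_dir", "lora_unmerged", base_model_id="google/gemma-3-1b-it")` would load a base Gemma-3 model and apply adapters saved in the output directory; both the path and the model ID are placeholders.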
9. Export & Push
- Output directory for checkpoints, final model, and merged model
- Push to HF Hub (when write-enabled token is provided)
Installation
Install from PyPI (recommended)
pip install upasak
Or install from source
# Clone this repo
git clone https://github.com/shrut2702/upasak
cd upasak
# Optional: create and activate a virtual environment
## For Windows
python -m venv vir_env
vir_env\Scripts\activate
## For macOS / Linux
python -m venv vir_env
source vir_env/bin/activate
# Install required dependencies
pip install -r requirements.txt
Usage
Upasak runs as a Streamlit app that you launch from a Python file.
After installing the package:
1. Create a Python launcher file
For example: run_upasak.py
from upasak import main

if __name__ == "__main__":
    main()
2. Launch the Streamlit application
streamlit run run_upasak.py
or
streamlit run run_upasak.py --server.maxUploadSize=1024 # for configuring upload file size limit in MB
This opens the Upasak UI in your browser.
After installing from source
1. Launch app.py
streamlit run app.py
or
streamlit run app.py --server.maxUploadSize=1024 # for configuring upload file size limit in MB
Reusability of Upasak Modules
Although Upasak provides a full end-to-end UI, every internal component is designed to be reusable in isolation. You can import and use modules such as:
- TokenizerWrapper → standalone tokenization
- TrainingEngine + TrainerConfig → run full or LoRA fine-tuning programmatically
- PIISanitizer → rule-based or hybrid PII detection/sanitization

Refer to the examples for more details.
This allows you to integrate Upasak directly into custom pipelines, backend services, notebooks, or data-processing workflows — without launching the Streamlit UI.
Use Cases
- Educational fine-tuning demonstrations
- Rapid prototyping in fast-moving environments
- Dataset preparation and anonymization workflows
- Internal LLM fine-tuning on sensitive or regulated data
- Developers without domain expertise who want an LLM in their application
Contributing
Contributions are welcome! Please open an issue or submit a pull request for bug fixes, features, documentation, or dataset schema support.
Support
For issues, questions, or feature requests: Create a GitHub issue in this repository.