UI-based Fine-Tuning Framework for Large Language Models (LLMs)

Upasak - UI-based Fine-Tuning for Large Language Models (LLMs)

Upasak is a flexible, privacy-aware, no-code/low-code framework for fine-tuning large language models, built around Hugging Face Transformers. It provides an intuitive Streamlit-based interface, multi-format dataset support, built-in PII and sensitive information sanitization, and a fully customizable training pipeline. Whether you're prototyping, researching, or running internal fine-tuning workflows, Upasak makes the process simple, accessible, and compliant.

✨ Key Features

LLM Fine-Tuning

Developed on top of Hugging Face's Transformers library.
Supports the text-only models of the Gemma-3 family for instruction tuning or domain adaptation.
Full-parameter fine-tuning or LoRA (Parameter-Efficient Fine-Tuning).
Future support planned for image-text-to-text Gemma-3 models, LLaMA, Qwen, Phi, Mixtral.

Flexible Dataset Handling

Upload or import datasets in multiple file formats:

.json
.jsonl
.csv
.zip (containing .txt)

Or select datasets directly from the Hugging Face Hub.
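
As a rough illustration of how the four upload formats can be normalized into a list of records, here is a minimal standard-library sketch. The function name `load_records` and the choice to turn each .txt file in a .zip into one `{"text": ...}` record are assumptions made for this example; they are not taken from Upasak's code.

```python
import csv
import io
import json
import zipfile


def load_records(filename: str, data: bytes) -> list[dict]:
    """Parse an uploaded file into a list of dict records (hypothetical helper)."""
    name = filename.lower()
    if name.endswith(".jsonl"):
        # One JSON object per non-empty line.
        text = data.decode("utf-8")
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if name.endswith(".json"):
        # A top-level JSON array of objects.
        return json.loads(data.decode("utf-8"))
    if name.endswith(".csv"):
        # The header row supplies the column names.
        return list(csv.DictReader(io.StringIO(data.decode("utf-8"))))
    if name.endswith(".zip"):
        # Assumption: each .txt member becomes one {"text": ...} record.
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return [
                {"text": archive.read(member).decode("utf-8")}
                for member in sorted(archive.namelist())
                if member.lower().endswith(".txt")
            ]
    raise ValueError(f"Unsupported file type: {filename}")
```

Non-.txt members of the archive are skipped, and any other extension raises a `ValueError`.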

Auto-Detection of Dataset Schema

Upasak automatically identifies your dataset's schema and restructures it into a training-ready format. Supported schemas:

Schema	Format	Notes
DAPT	`[{"text":"..."}]` or `text` column	Domain-adaptive pretraining (continued pretraining on raw text)
ALPACA	`[{"instruction":"...", "output":"..."}]` (+ optional `"input"`) or `instruction`, `output`, `input` (optional) columns	Converted to user → assistant turns
CHATML	`[{"messages":[{"role":"...", "content":"..."}]}]` or `messages` column	Supports role/content pairs
SHARE_GPT	`[{"conversations":[{"from":"...", "value":"..."}]}]` or `conversations` column	Converts human ↔ model to user ↔ assistant
PROMPT_RESPONSE	`[{"prompt":"...", "response":"..."}]` or `prompt`, `response` columns	Simple instruction → answer
QA	`[{"question":"...", "answer":"..."}]` or `question`, `answer` columns	Q&A format
QLA	`[{"question":"...", "long_answer":"..."}]` or `question`, `long_answer` columns	Long-form generation
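
The table above can be read as a key-based lookup. The sketch below shows one way such detection and the ALPACA conversion could work. `detect_schema` and `alpaca_to_messages` are hypothetical names, and both the order of the checks and the way the optional `input` is appended to the instruction are assumptions, not Upasak's actual logic.

```python
def detect_schema(record: dict) -> str:
    """Guess the schema name from the keys of one record (illustrative only)."""
    keys = set(record)
    if "messages" in keys:
        return "CHATML"
    if "conversations" in keys:
        return "SHARE_GPT"
    if {"instruction", "output"} <= keys:
        return "ALPACA"
    if {"prompt", "response"} <= keys:
        return "PROMPT_RESPONSE"
    # Check QLA before QA so a record with both answer fields is not ambiguous.
    if {"question", "long_answer"} <= keys:
        return "QLA"
    if {"question", "answer"} <= keys:
        return "QA"
    if "text" in keys:
        return "DAPT"
    raise ValueError(f"Unrecognized schema with keys: {sorted(keys)}")


def alpaca_to_messages(record: dict) -> list[dict]:
    """Convert one ALPACA record into user/assistant turns."""
    user = record["instruction"]
    # Assumption: a non-empty "input" is appended to the instruction.
    if record.get("input"):
        user = f"{user}\n\n{record['input']}"
    return [
        {"role": "user", "content": user},
        {"role": "assistant", "content": record["output"]},
    ]
```

For CSV input, the same check would apply to the column names instead of the JSON keys.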

Built-In PII & Sensitive Information Sanitization

Upasak automatically detects and redacts:

Personal names
Emails / phone numbers
IP addresses, IMEI
Credit card / bank details
National IDs (Aadhaar, PAN, Voter ID, etc.)
API keys
GitHub/GitLab tokens
Database credentials
Residential & workplace addresses
And more…

Two sanitization modes:

Rule-Based (default)
Hybrid (Rule-Based + NER-based)
- Optional human review
- Configure HITL ratio & max samples for human review
- Accept/reject uncertain detections directly in the UI
- Preview sanitized sample before training
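
To make the HITL ratio and max-samples settings concrete, here is a small sketch of how a review queue could be drawn from the uncertain detections. It assumes the ratio is a fraction of the uncertain detections and that max samples caps the result; Upasak may define either setting differently, and `pick_for_review` is a hypothetical name.

```python
import math
import random


def pick_for_review(uncertain_ids, hitl_ratio: float, max_samples: int, seed: int = 0):
    """Select which uncertain detections are shown to a human reviewer."""
    if not 0.0 <= hitl_ratio <= 1.0:
        raise ValueError("hitl_ratio must be between 0 and 1")
    pool = list(uncertain_ids)
    # Assumption: the ratio applies to the uncertain pool, capped by max_samples.
    target = min(max_samples, math.ceil(hitl_ratio * len(pool)))
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return sorted(rng.sample(pool, target))
```

A fixed seed means the same samples are queued again if the panel is reloaded.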

🖥️ Streamlit UI – No-Code Training Workflow

The visual interface provides fully interactive control:

1. Model Selection

Choose supported base models (currently Gemma-3 text-only). Future updates will include LLaMA, Mixtral, Phi, Qwen and multimodal variants.

2. HF Token Handling

Read token for pulling models
Write token for pushing fine-tuned models back to HF Hub

3. Dataset Input

Upload dataset files
Or load from Hugging Face dataset list

4. PII Sanitization Panel

Enable/disable sanitization
Select detection method (rule-based / hybrid)
Enable Human Review & configure ratios
View uncertain detections and choose actions
Preview sanitized sample before training

5. Hyperparameter Controls

Basic Hyperparameters

Learning rate
Batch size
Epochs
Max sequence length
Logging steps
LR scheduler

Advanced Hyperparameters

Gradient accumulation
Gradient clipping
LR warmup ratio
Weight decay
Checkpoint save strategy
Evaluation strategy + steps
Validation split
Model tracker platform (Comet / WandB / none)
Tracker API keys
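
Several of these settings interact. The sketch below estimates the effective batch size, the number of optimizer steps, and the warmup length from the values above. It is back-of-the-envelope arithmetic under stated assumptions: the function and argument names are made up for this example, and the exact rounding used by the training library may differ by a step or two.

```python
import math


def training_schedule(num_examples: int, batch_size: int, grad_accum_steps: int,
                      epochs: int, warmup_ratio: float,
                      validation_split: float = 0.0, num_devices: int = 1) -> dict:
    """Estimate step counts from the basic and advanced hyperparameters."""
    # Examples held out for validation do not contribute to training steps.
    train_examples = num_examples - round(num_examples * validation_split)
    # One optimizer step consumes this many examples.
    effective_batch = batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(train_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    # The learning rate ramps up over this many steps before the scheduler decays it.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "train_examples": train_examples,
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }
```

For example, 1,000 examples with a 10% validation split, batch size 4, and gradient accumulation 8 give an effective batch of 32 and roughly 29 optimizer steps per epoch.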

6. LoRA Configuration

LoRA rank
LoRA alpha
LoRA dropout
Target modules
Optional merging of LoRA adapters
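
For intuition on rank and alpha: standard LoRA replaces the update to a weight matrix of shape d_out × d_in with two low-rank factors, adding rank × (d_in + d_out) trainable parameters, and scales the update by alpha / rank. The helper below only does this arithmetic for a single matrix; its name and the layer sizes in the usage note are illustrative, not Gemma-3's actual dimensions.

```python
def lora_stats(d_in: int, d_out: int, rank: int, alpha: float) -> dict:
    """Parameter count and scaling factor for one LoRA-adapted weight matrix."""
    full_params = d_in * d_out            # parameters in the frozen weight W
    lora_params = rank * (d_in + d_out)   # parameters in the factors A (rank x d_in) and B (d_out x rank)
    return {
        "scaling": alpha / rank,          # merged weight is W + scaling * (B @ A)
        "lora_params": lora_params,
        "full_params": full_params,
        "trainable_fraction": lora_params / full_params,
    }
```

With a 1024 × 1024 matrix, rank 8, and alpha 16, the adapter trains 16,384 parameters (about 1.6% of the full matrix) with a scaling factor of 2.0. Merging the adapters folds the scaled update into the base weights so no extra matrices are needed at inference time.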

7. Training Control

Start / Stop training
Live training metrics inside the app:
- Training loss
- Validation loss
- Token-level curves
Optional external tracking (Comet / WandB)

8. Inference Script Generation

After training completes, Upasak automatically generates a customized inference.py script tailored to your training configuration.

Covers all three training modes:
- LoRA + merged adapters – Loads the fully merged model.
- LoRA + unmerged adapters – Loads the base model and applies the LoRA adapters at runtime.
- Full fine-tune – Standard model loading.
Ready to use – The script is written to your output directory.

Usage

cd path_to_output_dir
python inference.py

9. Export & Push

Output directory for checkpoints, final model, and merged model
Push to HF Hub (when write-enabled token is provided)

📦 Installation

Install from PyPI (recommended)

pip install upasak

Or install from source

# Clone this repo
git clone https://github.com/shrut2702/upasak
cd upasak

# Optional: create and activate a virtual environment

## For Windows
python -m venv vir_env
vir_env\Scripts\activate

## For macOS / Linux
python -m venv vir_env
source vir_env/bin/activate

# Install required dependencies
pip install -r requirements.txt

🚀 Usage

Upasak runs as a Streamlit app that you start from a small Python launcher script.

After installing the package:

1. Create a Python launcher file

For example: run_upasak.py

from upasak import main

if __name__ == "__main__":
    main()

2. Launch the Streamlit application

streamlit run run_upasak.py

streamlit run run_upasak.py --server.maxUploadSize=1024 # for configuring upload file size limit in MB

This opens the Upasak UI in your browser.

After installing from source

1. Launch `app.py`

streamlit run app.py

streamlit run app.py --server.maxUploadSize=1024 # for configuring upload file size limit in MB

📁 Repository Structure

research/
    can_be_used.ipynb
    dirs-pii.ipynb
    tokenization_wrapper.ipynb

src/Upasak/
    __init__.py
    data_sanitization/
        __init__.py
        pii-detection-patterns.yaml
        pii-generator-mapping.yaml
        pii.py
    fine_tune/
        __init__.py
        trainer_config.py
        training_engine.py
    preprocessing/
        __init__.py
        tokenizer_wrapper.py
    ui/
        __init__.py
        interface.py
    utils/
        __init__.py
        common.py

app.py
README.md
requirements.txt
.gitignore

Reusability of Upasak Modules

Although Upasak provides a full end-to-end UI, every internal component is designed to be reusable in isolation. You can import and use modules such as:

TokenizerWrapper → standalone tokenization
TrainingEngine + TrainerConfig → run full or LoRA fine-tuning programmatically
PIISanitizer → rule-based or hybrid PII detection/sanitization

This allows you to integrate Upasak directly into custom pipelines, backend services, notebooks, or data-processing workflows — without launching the Streamlit UI.

🔒 PII & Sensitive Information Handling

Upasak supports privacy compliance by:

Automatically detecting and redacting/masking PII
Using placeholder tokens to preserve dataset utility
Offering AI-assisted detection with manual review loops, backed by the GLiNER named entity recognition (NER) model
Logging sanitization results for auditability
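
As a minimal sketch of rule-based redaction with placeholder tokens, the example below covers only emails and IPv4 addresses. The two regular expressions, the `<LABEL>` placeholder format, and the `redact` helper are assumptions for illustration; Upasak's own patterns live in its YAML configuration and are not reproduced here.

```python
import re

# Deliberately simple patterns; real detectors need many more rules and validation.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact(text: str):
    """Replace each match with a placeholder and return the findings for logging."""
    findings = []
    for label, pattern in PATTERNS.items():
        def _replace(match, label=label):
            # Record what was removed so the run can be audited later.
            findings.append((label, match.group(0)))
            return f"<{label}>"
        text = pattern.sub(_replace, text)
    return text, findings
```

Calling `redact("Mail jane.doe@example.com from 10.0.0.1.")` returns the text `Mail <EMAIL> from <IP_ADDRESS>.` along with the two original values. Keeping a typed placeholder rather than deleting the span preserves sentence structure for training.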

PII categories include:

Personal identifiers
Contact information
Financial identifiers
API credentials
Device IDs
Addresses (residential, office)
Tokens, keys, connection strings
IP, MAC, IMEI … and more

🧩 Use Cases

Privacy-safe LLM fine-tuning
Educational fine-tuning demonstrations
Rapid prototyping for research teams
Dataset preparation and anonymization workflows
Internal LLM fine-tuning on sensitive or regulated data

📍 Roadmap

The following enhancements are actively planned:

Model Support

Gemma-3 multimodal models
LLaMA / Mixtral / Phi / Qwen
QLoRA for efficient training on quantized weights
DPO (Direct Preference Optimization) for preference alignment

Data Processing

Profanity detection and handling
Additional schema support
Multimodal (image + text) tokenization

UI Enhancements

More detailed training logs

🤝 Contributing

Contributions are welcome! Please open an issue or submit a pull request for bug fixes, features, documentation, or dataset schema support.

💬 Support

For issues, questions, or feature requests: Create a GitHub issue in this repository.

Release history

0.1.1 (Dec 4, 2025)
0.1.0 (Dec 3, 2025), this version

Download files

Source Distribution

upasak-0.1.0.tar.gz (300.6 kB view details)

Uploaded Dec 3, 2025 Source

Built Distribution

upasak-0.1.0-py3-none-any.whl (297.7 kB view details)

Uploaded Dec 3, 2025 Python 3

File details

Details for the file upasak-0.1.0.tar.gz.

File metadata

Download URL: upasak-0.1.0.tar.gz
Upload date: Dec 3, 2025
Size: 300.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.0

File hashes

Hashes for upasak-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e27bcc977b8343b3a6f5b6667d8e5fbe5dc6757c1f9a6f36e89747d0490fe343`
MD5	`38f48efc6e1294e585e8e577ea4274b7`
BLAKE2b-256	`2867364816b5374a7a987a41aaca6fad493b657cb9f37326fce13c358d787a70`

File details

Details for the file upasak-0.1.0-py3-none-any.whl.

File metadata

Download URL: upasak-0.1.0-py3-none-any.whl
Upload date: Dec 3, 2025
Size: 297.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.0

File hashes

Hashes for upasak-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3c8e3f7c741516fce27572c047a101afd9932a7b8d68c7b4d5c12e5d71ff72f3`
MD5	`7aafc04fecb5ee57cd43880393e842db`
BLAKE2b-256	`16dd8a8fd352f2968b45a0b8fd0ccf85ec3a0e347a9248c64c9968d91483c751`
