Data curation engine for LLM fine-tuning
Truva
Truva curates your fine-tuning data so you train on signal, not noise.
A CLI-first data curation engine for ML engineers who fine-tune language models. Truva takes a messy dataset and produces a smaller, higher-quality "gold" dataset by removing redundancy, scoring information density, and detecting contradictions.
Goal: Reduce dataset size by 50–80% while maintaining or improving downstream model accuracy, cutting GPU training costs proportionally.
Quick Install
pip install truva
30-Second Example
# Deduplicate a dataset with default settings
truva dedupe ./data.jsonl --output ./deduped.jsonl
# Deduplicate with a custom threshold and generate a report
truva dedupe ./data.jsonl --threshold 0.9 --output ./deduped.jsonl --report ./report.json
# Generate embeddings for a dataset
truva embed ./data.jsonl --output ./embeddings.npy
What It Does
| Before | After |
|---|---|
| 50,000 rows | 12,000 rows |
| Redundant examples | Unique, representative samples |
| Unknown quality | Scored and filtered |
| Hidden contradictions | Flagged for review |
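As a quick sanity check on the numbers in the table, the row counts can be turned into a reduction ratio and compared with the stated 50–80% goal. This is a small arithmetic sketch; the helper name `reduction_ratio` is illustrative and not part of Truva.

```python
def reduction_ratio(rows_before: int, rows_after: int) -> float:
    """Fraction of rows removed by curation (0.0 means nothing was removed)."""
    if rows_before <= 0:
        raise ValueError("rows_before must be positive")
    return 1.0 - rows_after / rows_before


# Row counts taken from the Before/After table.
ratio = reduction_ratio(50_000, 12_000)
print(f"Rows removed: {ratio:.0%}")
print(f"Within the 50-80% goal: {0.5 <= ratio <= 0.8}")
```

If training cost scales roughly linearly with the number of rows, as the goal statement assumes, the same ratio approximates the GPU cost saved.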
Features
Semantic Deduplication
Removes near-duplicate rows using embedding similarity and Union-Find clustering. Each cluster keeps the single most representative example (closest to centroid).
truva dedupe ./data.jsonl --threshold 0.95
- --threshold 0.95 (default): Aggressive but safe for most fine-tuning datasets
- --threshold 0.85: More aggressive, catches paraphrases
- --threshold 1.0: Only removes exact semantic matches
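The clustering step described above can be sketched with NumPy. This is a minimal illustration of the technique (cosine similarity, Union-Find, keep the member closest to the centroid), not Truva's actual implementation; the function name `dedupe_indices` is hypothetical, and the full pairwise similarity matrix makes it suitable only for small datasets.

```python
import numpy as np


def dedupe_indices(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Return the index of one representative row per near-duplicate cluster."""
    # L2-normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T  # O(n^2) memory: a sketch for small inputs only

    n = len(unit)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union every pair (i < j) whose similarity reaches the threshold.
    rows, cols = np.triu_indices(n, k=1)
    mask = sim[rows, cols] >= threshold
    for i, j in zip(rows[mask].tolist(), cols[mask].tolist()):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Group row indices by their cluster root.
    clusters: dict[int, list[int]] = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)

    # Keep the member with the highest cosine similarity to the cluster centroid.
    keep = []
    for members in clusters.values():
        centroid = unit[members].mean(axis=0)
        keep.append(members[int(np.argmax(unit[members] @ centroid))])
    return sorted(keep)
```

Because Union-Find merges transitively, two rows can land in the same cluster even when their direct similarity is below the threshold, as long as a chain of similar rows connects them. That is one reason a lower threshold removes noticeably more rows.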
Embedding Generation
Compute vector embeddings for your dataset using local models or the OpenAI API.
# Local (free, no API key needed)
truva embed ./data.jsonl --provider local --model all-MiniLM-L6-v2
# OpenAI API
truva embed ./data.jsonl --provider api --model text-embedding-3-small
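The resulting embeddings can be inspected from Python. The sketch below assumes the `.npy` output is a 2-D array with one row per input example, in input order; that layout is an assumption, not something documented here. The helper names are illustrative.

```python
import numpy as np


def load_unit_embeddings(path: str) -> np.ndarray:
    """Load a .npy embedding matrix and L2-normalize each row."""
    vectors = np.load(path)
    if vectors.ndim != 2:
        # Assumed layout: (num_rows, embedding_dim).
        raise ValueError(f"expected a 2-D array, got shape {vectors.shape}")
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def cosine_similarity(unit: np.ndarray, i: int, j: int) -> float:
    """Cosine similarity between rows i and j of a row-normalized matrix."""
    return float(unit[i] @ unit[j])
```

Comparing a few row pairs this way is a cheap method for choosing a `--threshold` value before running a full deduplication pass.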
Supported Formats
- JSONL: One JSON object per line (.jsonl, .json)
- CSV: Auto-detects the text column, or use --text-field
- Hugging Face Datasets: Pass a dataset identifier like username/dataset
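To illustrate what auto-detecting a text field can look like, here is one plausible heuristic for JSONL input: pick the string field with the most total characters across a sample of rows. This is a hypothetical sketch, not Truva's documented detection logic, and the function names are illustrative.

```python
import json
from collections import defaultdict


def read_jsonl(path: str) -> list[dict]:
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def guess_text_field(rows: list[dict], sample_size: int = 100) -> str:
    """Guess the main text field: the string field with the most characters."""
    totals: dict[str, int] = defaultdict(int)
    for row in rows[:sample_size]:
        for key, value in row.items():
            if isinstance(value, str):
                totals[key] += len(value)
    if not totals:
        raise ValueError("no string fields found; pass the field name explicitly")
    return max(totals, key=totals.get)
```

A heuristic like this can pick the wrong column when a dataset has several long string fields (for example, a prompt and a response), which is the case where setting `--text-field` explicitly is the safer choice.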
Configuration
All options are available as CLI flags:
--threshold FLOAT        Cosine similarity threshold for dedup (0.0–1.0)
--provider [local|api]   Embedding provider
--model TEXT             Model name
--text-field TEXT        Column/field to use (auto-detected if not set)
--format TEXT            Input format: auto, jsonl, csv, hf
--output, -o TEXT        Output file path
--report TEXT            Path for JSON report
Requirements
- Python 3.10+
- Works on macOS (Apple Silicon) and Linux
License
TBD
Download files
File details
Details for the file truva-0.1.0.tar.gz.
File metadata
- Download URL: truva-0.1.0.tar.gz
- Upload date:
- Size: 14.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | eb6cfecd8e29dd366259d1e0043f471401e61f91a0f0a9700c73144fdb2d2c18 |
| MD5 | 0fc9cb901345e4e339eb8ee18cc1f8a7 |
| BLAKE2b-256 | 30d8a7a0131882b0ae87411763c4ce2df67d70d0d39fb412e93faaea967ef2dc |
File details
Details for the file truva-0.1.0-py3-none-any.whl.
File metadata
- Download URL: truva-0.1.0-py3-none-any.whl
- Upload date:
- Size: 16.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | baf887513fb849002c3eb87cb02a2ee7a06d9f9713af4819e4d74de5e274d019 |
| MD5 | badd175b92302a791ed4607537f98c51 |
| BLAKE2b-256 | 1b1f13db30df24801fa7c20897063a110f4f5ad516ec1b577ad4c8bf486df661 |