Data curation engine for LLM fine-tuning
Truva
Truva curates your fine-tuning data so you train on signal, not noise.
A CLI-first data curation engine for ML engineers who fine-tune language models. Truva takes a messy dataset and produces a smaller, higher-quality "gold" dataset — starting with semantic deduplication today, with quality scoring and contradiction detection coming soon.
Goal: Reduce dataset size by 50–80% while maintaining or improving downstream model accuracy, cutting GPU training costs proportionally.
Quick Install
pip install truva
30-Second Example
# Deduplicate a dataset with default settings
truva dedupe ./data.jsonl --output ./deduped.jsonl
# Deduplicate with a custom threshold and generate a report
truva dedupe ./data.jsonl --threshold 0.9 --output ./deduped.jsonl --report ./report.json
# Generate embeddings for a dataset
truva embed ./data.jsonl --output ./embeddings.npy
What It Does
| Before | After |
|---|---|
| 50,000 rows | 12,000 rows |
| Redundant examples | Unique, representative samples |
| Unknown quality | Scored and filtered (coming soon) |
| Hidden contradictions | Flagged for review (coming soon) |
Features
Semantic Deduplication
Removes near-duplicate rows using embedding similarity and Union-Find clustering. Each cluster keeps only its most representative example (the row closest to the cluster centroid).
truva dedupe ./data.jsonl --threshold 0.95
- --threshold 0.95 (default): Aggressive but safe for most fine-tuning datasets
- --threshold 0.85: More aggressive, catches paraphrases
- --threshold 1.0: Only removes exact semantic matches
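The clustering step described above can be sketched in plain Python. This is a minimal illustration of threshold-based Union-Find deduplication with centroid selection, not Truva's actual implementation: the names `cosine`, `find`, and `dedupe` are hypothetical, and the all-pairs loop is O(n²), chosen for readability rather than speed.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find(parent, i):
    """Return the root of i, shortening the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def dedupe(vectors, threshold=0.95):
    """Return sorted indices of the rows to keep, one per cluster (hypothetical helper)."""
    n = len(vectors)
    parent = list(range(n))
    # Union every pair whose similarity meets the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                ri, rj = find(parent, i), find(parent, j)
                if ri != rj:
                    parent[rj] = ri
    # Group row indices by their cluster root.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(parent, i), []).append(i)
    kept = []
    for members in clusters.values():
        dim = len(vectors[members[0]])
        centroid = [sum(vectors[m][d] for m in members) / len(members) for d in range(dim)]
        # Keep the member most similar to the cluster centroid.
        kept.append(max(members, key=lambda m: cosine(vectors[m], centroid)))
    return sorted(kept)


# Toy 2-D "embeddings": rows 0-2 point in nearly the same direction, row 3 does not.
vectors = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.98, -0.05],
    [0.0, 1.0],
]
print(dedupe(vectors, threshold=0.95))
```

One property of Union-Find worth knowing when choosing a threshold: merging is transitive, so if A matches B and B matches C, all three land in one cluster even when A and C are below the threshold on their own. Lower thresholds make such chains more likely.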
Embedding Generation
Compute vector embeddings for your dataset using local models or the OpenAI API.
# Local (free, no API key needed)
truva embed ./data.jsonl --provider local --model all-MiniLM-L6-v2
# OpenAI API
truva embed ./data.jsonl --provider api --model text-embedding-3-small
Example Report Output
When you pass --report ./report.json, Truva writes a structured summary of what it found:
{
  "input_rows": 50000,
  "kept_rows": 12380,
  "removed_rows": 37620,
  "reduction_pct": 75.24,
  "threshold": 0.95,
  "num_clusters": 12380,
  "clusters": [
    {
      "representative_idx": 41,
      "size": 23,
      "avg_similarity": 0.9812
    },
    {
      "representative_idx": 7,
      "size": 14,
      "avg_similarity": 0.9734
    }
  ]
}
Each cluster shows the representative row kept, how many duplicates were merged, and the average pairwise similarity within the group.
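Because the report is plain JSON, it can be checked from a script before you commit to a training run. The sketch below embeds the example report inline so it runs on its own. The helper names `totals_consistent` and `largest_clusters` are hypothetical, and the field meanings are inferred from the example above rather than from a published schema.

```python
import json

# The example report from above, inlined so the sketch is self-contained.
REPORT_JSON = """
{
  "input_rows": 50000,
  "kept_rows": 12380,
  "removed_rows": 37620,
  "reduction_pct": 75.24,
  "threshold": 0.95,
  "num_clusters": 12380,
  "clusters": [
    {"representative_idx": 41, "size": 23, "avg_similarity": 0.9812},
    {"representative_idx": 7, "size": 14, "avg_similarity": 0.9734}
  ]
}
"""


def totals_consistent(report):
    """Check that the row counts add up and the percentage matches them."""
    rows_ok = report["kept_rows"] + report["removed_rows"] == report["input_rows"]
    expected_pct = 100 * report["removed_rows"] / report["input_rows"]
    pct_ok = abs(expected_pct - report["reduction_pct"]) < 0.01
    return rows_ok and pct_ok


def largest_clusters(report, n=5):
    """Return the n clusters with the most members, largest first."""
    return sorted(report["clusters"], key=lambda c: c["size"], reverse=True)[:n]


report = json.loads(REPORT_JSON)
print("totals consistent:", totals_consistent(report))
for cluster in largest_clusters(report, n=2):
    print(cluster["representative_idx"], cluster["size"], cluster["avg_similarity"])
```

For a real run, replace the inline string with the file written by --report, for example by passing an open file handle to `json.load`. Large clusters with a comparatively low `avg_similarity` are reasonable candidates for manual review.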
Supported Formats
- JSONL: One JSON object per line (.jsonl, .json)
- CSV: Auto-detects the text column, or set it with --text-field
- Hugging Face Datasets: Pass a dataset identifier like username/dataset
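To make the three input kinds concrete, here is one way a format could be inferred from the input string. This is a hypothetical heuristic for illustration only, since the source does not describe how Truva's own auto-detection works. The function name `guess_format` and the identifier pattern are assumptions.

```python
import re
from pathlib import PurePosixPath

# Assumed shape of a Hugging Face dataset identifier: "owner/name".
HF_ID = re.compile(r"[A-Za-z0-9][\w.-]*/[A-Za-z0-9][\w.-]*")


def guess_format(source):
    """Map an input string to 'jsonl', 'csv', or 'hf' (hypothetical heuristic)."""
    suffix = PurePosixPath(source).suffix.lower()
    if suffix in {".jsonl", ".json"}:
        return "jsonl"
    if suffix == ".csv":
        return "csv"
    if HF_ID.fullmatch(source):
        return "hf"
    raise ValueError(f"cannot infer format for {source!r}; set it explicitly")


for source in ["./data.jsonl", "./rows.csv", "username/dataset"]:
    print(source, "->", guess_format(source))
```

A relative path such as `data/train` looks the same as a dataset identifier under this heuristic, which is one reason an explicit override like --format is useful.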
Configuration
All options are available as CLI flags:
--threshold FLOAT         Cosine similarity threshold for dedup (0.0–1.0)
--provider [local|api]    Embedding provider
--model TEXT              Model name
--text-field TEXT         Column/field to use (auto-detected if not set)
--format TEXT             Input format: auto, jsonl, csv, hf
--output, -o TEXT         Output file path
--report TEXT             Path for JSON report
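When Truva runs as one step of a larger Python pipeline, the flags above can be assembled programmatically. The sketch below only builds and prints the command. `build_dedupe_command` is a hypothetical helper, and it assumes `truva dedupe` accepts --text-field, which the flag list suggests but does not state per subcommand.

```python
import shlex


def build_dedupe_command(input_path, output_path, threshold=None,
                         report_path=None, text_field=None):
    """Assemble an argv list for `truva dedupe` from the documented flags."""
    cmd = ["truva", "dedupe", input_path]
    if threshold is not None:
        # The documented range for --threshold is 0.0 to 1.0.
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        cmd += ["--threshold", str(threshold)]
    if text_field is not None:
        cmd += ["--text-field", text_field]
    cmd += ["--output", output_path]
    if report_path is not None:
        cmd += ["--report", report_path]
    return cmd


cmd = build_dedupe_command(
    "./data.jsonl", "./deduped.jsonl", threshold=0.9, report_path="./report.json"
)
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True);
# that requires truva to be installed and on PATH.
```

Passing the command as a list instead of a single shell string avoids quoting problems with paths that contain spaces.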
Roadmap
- Quality scoring — LLM-based information density scoring to filter low-value rows
- Contradiction detection — Flag rows that teach conflicting information
- Calibration — Human-in-the-loop threshold tuning
Requirements
- Python 3.10+
- Works on macOS (Apple Silicon) and Linux
License
Apache 2.0
Feedback
Found a bug or have a feature request? Send us an email at team@turingspark.com — we'd love to hear from you.