Data curation engine for LLM fine-tuning
Truva
Truva curates your fine-tuning data so you train on signal, not noise.
A CLI-first data curation engine for ML engineers who fine-tune language models. Truva takes a messy dataset and produces a smaller, higher-quality "gold" dataset by removing redundancy, scoring information density, and detecting contradictions.
Goal: Reduce dataset size by 50–80% while maintaining or improving downstream model accuracy, cutting GPU training costs proportionally.
Quick Install
pip install truva
30-Second Example
# Deduplicate a dataset
truva dedupe ./data.jsonl --output ./deduped.jsonl
# Score for information density
truva score ./data.jsonl --provider ollama --model llama3:8b --min-quality 6
# Detect contradicting rows
truva contradict ./data.jsonl --confidence 0.8 --output ./contradictions.json
# Full pipeline in one command
truva audit ./data.jsonl \
--provider openai --model gpt-4o-mini \
--min-quality 6 --detect-contradictions \
--output ./results/
What It Does
| Before | After |
|---|---|
| 50,000 rows | 12,000 rows |
| Redundant examples | Unique, representative samples |
| Unknown quality | Scored and filtered |
| Hidden contradictions | Flagged for review |
Features
Semantic Deduplication
Removes near-duplicate rows using embedding similarity and Union-Find clustering. Each cluster keeps the single most representative example (closest to centroid).
truva dedupe ./data.jsonl --threshold 0.95
- `--threshold 0.95` (default): Aggressive but safe for most fine-tuning datasets
- `--threshold 0.85`: More aggressive, catches paraphrases
- `--threshold 1.0`: Only removes exact semantic matches
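The clustering step described above can be sketched in plain Python: union any pair of rows whose cosine similarity meets the threshold, then group rows by their Union-Find root. This is an illustrative sketch, not Truva's internal implementation; the toy vectors and the `cosine` helper are assumptions.

```python
from itertools import combinations

def find(parent, i):
    # Path-compressing find: walk up to the root, flattening as we go
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def dedupe_clusters(vectors, threshold=0.95):
    """Group rows whose pairwise cosine similarity >= threshold."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)

    parent = list(range(len(vectors)))
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(parent, j)] = find(parent, i)  # union the two clusters

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(parent, i), []).append(i)
    return list(clusters.values())

# Toy example: rows 0 and 1 are near-identical, row 2 is different
vecs = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
print(dedupe_clusters(vecs, threshold=0.95))  # → [[0, 1], [2]]
```

In a real run the vectors come from the embedding model, and each cluster then keeps its member closest to the centroid.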
Quality Scoring
Scores each row on a 1–10 scale for educational value using an LLM judge. Drop low-quality rows with --min-quality.
# Free, local (requires Ollama running locally)
truva score ./data.jsonl --provider ollama --model llama3:8b --min-quality 6
# OpenAI
truva score ./data.jsonl --provider openai --model gpt-4o-mini --min-quality 6
# Anthropic
truva score ./data.jsonl --provider anthropic --model claude-haiku-4-5-20251001 --min-quality 6
# With a domain-specific calibration file
truva score ./data.jsonl --calibration ./calibration_medical.yaml --min-quality 7
# Interactive dry-run: score 50 samples and give feedback before committing
truva score ./data.jsonl --interactive --sample 50
Fast mode (default): Scores only cluster representatives from a prior dedup run, then propagates scores to cluster members with a −1 penalty. 5–10x cheaper than scoring every row.
Thorough mode: Scores every row individually.
truva score ./data.jsonl --mode thorough --provider openai --model gpt-4o-mini
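In fast mode, only each cluster's representative is judged and members inherit that score minus one. A minimal sketch of that propagation; the data shapes here (dicts keyed by representative id) are assumptions, not Truva's internal format.

```python
def propagate_scores(clusters, rep_scores, penalty=1):
    """Spread each representative's judge score to its cluster members.

    clusters:   {representative_id: [member_ids]}
    rep_scores: {representative_id: judge score on a 1-10 scale}
    """
    scores = {}
    for rep, members in clusters.items():
        scores[rep] = rep_scores[rep]  # representative keeps its own score
        for m in members:
            # Members are assumed slightly less representative: score - penalty,
            # floored at 1 so values stay on the 1-10 scale
            scores[m] = max(1, rep_scores[rep] - penalty)
    return scores

scores = propagate_scores({"r1": ["a", "b"], "r2": []}, {"r1": 8, "r2": 3})
print(scores)  # → {'r1': 8, 'a': 7, 'b': 7, 'r2': 3}
```

This is why fast mode is roughly as cheap as the number of clusters rather than the number of rows.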
Contradiction Detection
Finds rows that teach conflicting information using a local NLI model (cross-encoder/nli-deberta-v3-base). Only semantically similar rows (within the same cluster) are compared, keeping the number of checks tractable.
truva contradict ./data.jsonl --confidence 0.8 --output ./contradictions.json
Output is a JSON report listing each contradicting pair, the texts, and the NLI confidence score:
{
"input_rows": 1000,
"num_clusters": 312,
"confidence_threshold": 0.8,
"contradictions_found": 4,
"contradictions": [
{
"row_a_id": "c1a",
"row_b_id": "c1b",
"text_a": "Refunds are processed instantly.",
"text_b": "Refunds take 3 to 5 business days.",
"confidence": 0.9341
}
]
}
No API key needed — NLI runs entirely on CPU.
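Restricting the NLI check to rows within the same cluster is what keeps the pair count far below the full n·(n−1)/2. A sketch of that candidate-pair generation (the cluster shapes are assumptions):

```python
from itertools import combinations

def candidate_pairs(clusters):
    """Yield only within-cluster pairs for the NLI model to check,
    instead of every pair across the whole dataset."""
    for members in clusters:
        yield from combinations(members, 2)

# 6 rows in 3 clusters: 4 NLI checks instead of 15
clusters = [[0, 1, 2], [3, 4], [5]]
print(list(candidate_pairs(clusters)))  # → [(0, 1), (0, 2), (1, 2), (3, 4)]
```

Each yielded pair would then be passed to the cross-encoder, and pairs scoring above `--confidence` on the contradiction label are written to the report.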
Full Pipeline (audit)
Run deduplication, quality scoring, and contradiction detection in one command. Writes gold.jsonl and report.json to the output directory.
truva audit ./data.jsonl \
--provider openai --model gpt-4o-mini \
--min-quality 6 --detect-contradictions \
--output ./results/
After the run, inspect the results:
# View top duplicate clusters
truva inspect ./results/report.json --what clusters --top 10
# View contradiction pairs
truva inspect ./results/report.json --what contradictions
# Export gold dataset as CSV
truva export ./results/gold.jsonl --format csv -o ./gold.csv
# Quick health check on any dataset
truva health ./data.jsonl
Embedding Generation
Compute vector embeddings for your dataset using local models or the OpenAI API.
# Local (free, no API key needed)
truva embed ./data.jsonl --provider local --model all-MiniLM-L6-v2
# OpenAI API
truva embed ./data.jsonl --provider api --model text-embedding-3-small
Example Dedup Report
When you pass --report ./report.json to truva dedupe, Truva writes a structured summary:
{
"input_rows": 50000,
"kept_rows": 12380,
"removed_rows": 37620,
"reduction_pct": 75.24,
"threshold": 0.95,
"num_clusters": 12380,
"clusters": [
{
"representative_idx": 41,
"size": 23,
"avg_similarity": 0.9812
}
]
}
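The report fields are internally consistent: `reduction_pct` follows directly from kept vs. input rows, and `num_clusters` equals `kept_rows` because each cluster keeps exactly one representative. Checking the numbers above:

```python
input_rows, kept_rows = 50000, 12380
removed_rows = input_rows - kept_rows           # 37620
reduction_pct = round(removed_rows / input_rows * 100, 2)
print(reduction_pct)  # → 75.24
```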
Connecting to API Providers
OpenAI
- Get an API key from https://platform.openai.com/api-keys
- Set the environment variable:
export OPENAI_API_KEY=sk-...
Then use --provider openai with any truva command.
Anthropic
- Get an API key from https://console.anthropic.com/settings/keys
- Set the environment variable:
export ANTHROPIC_API_KEY=sk-ant-...
Then use --provider anthropic with any truva command.
Ollama (Free, Local)
Install Ollama from https://ollama.com, then pull a model:
ollama pull llama3:8b
No API key needed. Use --provider ollama with truva score.
Calibration
Calibration files let you inject domain-specific scoring rules and few-shot examples into the judge prompt. This is useful when the default judge undervalues domain-specific shorthand (e.g., medical abbreviations, legal citations).
# calibration_medical.yaml
rubric: |
Prioritize clinical accuracy over grammar.
Short medical abbreviations (SOB, CHF, hx) are acceptable.
Penalize vague or speculative language.
examples:
- text: "pt presents w/ SOB, hx of CHF, started on furosemide 40mg"
score: 9
reasoning: "Clinically precise with actionable treatment detail"
- text: "The patient is not feeling well and might have a heart problem"
score: 2
reasoning: "Vague, no clinical specificity"
truva score ./data.jsonl --calibration ./calibration_medical.yaml --min-quality 7
See examples/ for ready-made calibration files for medical and customer support domains.
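Conceptually, calibration just splices your rubric and few-shot examples into the judge prompt ahead of the row being scored. A rough sketch, with the calibration shown as the parsed YAML dict; the prompt template itself is an assumption, not Truva's actual prompt.

```python
def build_judge_prompt(calibration, row_text):
    """Prepend a domain rubric and scored examples to the judge prompt."""
    parts = ["Rate the educational value of the text on a 1-10 scale."]
    parts.append("Domain rubric:\n" + calibration["rubric"].strip())
    for ex in calibration.get("examples", []):
        parts.append(
            f"Example (score {ex['score']}): {ex['text']}\n"
            f"Reasoning: {ex['reasoning']}"
        )
    parts.append("Text to score:\n" + row_text)
    return "\n\n".join(parts)

calibration = {  # as parsed from calibration_medical.yaml
    "rubric": "Prioritize clinical accuracy over grammar.",
    "examples": [
        {"text": "pt presents w/ SOB", "score": 9,
         "reasoning": "Clinically precise"},
    ],
}
prompt = build_judge_prompt(calibration, "BP 120/80, afebrile")
print(prompt.splitlines()[0])  # → Rate the educational value of the text on a 1-10 scale.
```

The effect is that the judge sees your domain's definition of "high quality" before it sees any row, which is what corrects the undervaluation of shorthand-heavy text.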
Supported Formats
- JSONL — One JSON object per line (`.jsonl`, `.json`)
- CSV — Auto-detects the text column, or use `--text-field`
- Hugging Face Datasets — Pass a dataset identifier like `username/dataset`
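Auto-detection presumably keys off the input path. A hypothetical sketch of that dispatch, mirroring what `--format auto` might do; the exact rules are assumptions:

```python
from pathlib import Path

def detect_format(source):
    """Guess the input format from the source string."""
    suffix = Path(source).suffix.lower()
    if suffix in {".jsonl", ".json"}:
        return "jsonl"
    if suffix == ".csv":
        return "csv"
    # No file extension and a namespace-style id: treat as a Hub dataset
    if "/" in source and not suffix:
        return "hf"
    raise ValueError(f"Cannot detect format for {source!r}; pass --format")

print(detect_format("./data.jsonl"))      # → jsonl
print(detect_format("username/dataset"))  # → hf
```

When detection would be ambiguous, passing `--format` explicitly sidesteps the guess.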
Configuration
truva dedupe
--threshold FLOAT Cosine similarity threshold (0.0–1.0), default 0.95
--provider [local|api] Embedding provider
--model TEXT Embedding model name
--text-field TEXT Column/field to embed (auto-detected if not set)
--format TEXT Input format: auto, jsonl, csv, hf
--output, -o TEXT Output file path
--report TEXT Path for JSON report
truva score
--provider Judge model provider: ollama, openai, anthropic
--model TEXT Judge model name
--mode [fast|thorough] Scoring mode, default fast
--min-quality FLOAT Drop rows below this score (0.0 = keep all)
--calibration TEXT Path to calibration YAML file
--text-field TEXT Column/field to score (auto-detected if not set)
--output, -o TEXT Output file path
--report TEXT Path for scoring report JSON
--interactive Interactive dry-run mode
--sample INT Number of rows to sample in interactive mode
truva contradict
--confidence FLOAT Min NLI confidence to flag a contradiction, default 0.8
--nli-model TEXT NLI model name, default cross-encoder/nli-deberta-v3-base
--text-field TEXT Column/field to compare (auto-detected if not set)
--dedupe-threshold FLOAT Similarity threshold for clustering before NLI check, default 0.95
--format TEXT Input format: auto, jsonl, csv, hf
--output, -o TEXT Output path for contradiction report JSON
truva audit
--provider TEXT Judge model provider: ollama, openai, anthropic
--model TEXT Judge model name
--embed-provider TEXT Embedding provider: local, api (default: local)
--embed-model TEXT Embedding model name
--dedupe-threshold FLOAT Cosine similarity threshold for dedup, default 0.95
--min-quality FLOAT Drop rows below this score, default 6.0
--mode [fast|thorough] Scoring mode, default fast
--detect-contradictions Run NLI contradiction detection after dedup
--contradiction-confidence FLOAT Min confidence to flag, default 0.8
--text-field TEXT Column/field name (auto-detected if not set)
--format TEXT Input format: auto, jsonl, csv, hf
--calibration TEXT Path to calibration YAML file
--output, -o TEXT Output directory (default: ./truva_output)
truva inspect
REPORT_PATH Path to report.json from a prior audit run
--what [clusters|contradictions] What to inspect (default: clusters)
--top INT How many entries to show (default: 20)
truva export
GOLD_PATH Path to gold.jsonl from a prior audit run
--format [jsonl|csv] Output format (default: jsonl)
--columns TEXT Comma-separated fields to include (default: all)
--output, -o TEXT Output file path
truva health
INPUT_PATH Dataset to inspect
--text-field TEXT Column/field to analyse (auto-detected if not set)
--format TEXT Input format: auto, jsonl, csv, hf
Roadmap
- Caching — Skip recomputation on re-runs
- Data leakage detection — Flag training rows that overlap with your eval set
Quickstart Script
A runnable demo script is included in the repo:
bash examples/quickstart.sh
It runs a health check, full audit with local embeddings, cluster inspection, and CSV export on the bundled 50-row fixture — no API key required.
Development
# Clone and install in editable mode with dev dependencies
git clone https://github.com/turingspark/truva
cd truva
pip install -e ".[dev]"
# Run fast unit tests (no models required)
pytest tests/ -v -m "not integration"
# Run full test suite including end-to-end integration tests
pytest tests/ -v
Integration tests load real embedding (all-MiniLM-L6-v2) and NLI (cross-encoder/nli-deberta-v3-base) models locally — no API keys needed, but they take a few minutes.
Requirements
- Python 3.10+
- Works on macOS (Apple Silicon) and Linux
License
Apache 2.0
Feedback
Found a bug or have a feature request? Send us an email at team@turingspark.com — we'd love to hear from you.
File details
Details for the file truva-0.2.0.tar.gz.
File metadata
- Download URL: truva-0.2.0.tar.gz
- Upload date:
- Size: 44.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `e7a63e9f6e058208a70bc44574bc80350aca8b4d2bcf7722594c84478c88dd0a` |
| MD5 | `df29ab352c7404aa51eac7683a6cc805` |
| BLAKE2b-256 | `1a7e6044e038fdb07ac30a6b046982d75087f1746759a24ef543fb5942f59de2` |
File details
Details for the file truva-0.2.0-py3-none-any.whl.
File metadata
- Download URL: truva-0.2.0-py3-none-any.whl
- Upload date:
- Size: 50.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `6ac662d79993f45a458eaab238da24b994d46aeaa0666c6138c4284d8d24cb6a` |
| MD5 | `e2cf9af5188fd9aeffd0006e520ebc42` |
| BLAKE2b-256 | `3ba702df8fb1b0d29439769a72c720e4589a3b529b785c132e4dad0c95cde77a` |