Data curation engine for LLM fine-tuning

Truva

Truva curates your fine-tuning data so you train on signal, not noise.

A CLI-first data curation engine for ML engineers who fine-tune language models. Truva takes a messy dataset and produces a smaller, higher-quality "gold" dataset by removing redundancy, scoring information density, and detecting contradictions.

Goal: Reduce dataset size by 50–80% while maintaining or improving downstream model accuracy, cutting GPU training costs proportionally.

Quick Install

pip install truva

30-Second Example

# Deduplicate a dataset
truva dedupe ./data.jsonl --output ./deduped.jsonl

# Score for information density
truva score ./data.jsonl --provider ollama --model llama3:8b --min-quality 6

# Detect contradicting rows
truva contradict ./data.jsonl --confidence 0.8 --output ./contradictions.json

# Full pipeline in one command
truva audit ./data.jsonl \
  --provider openai --model gpt-4o-mini \
  --min-quality 6 --detect-contradictions \
  --output ./results/

What It Does

Before                   After
50,000 rows              12,000 rows
Redundant examples       Unique, representative samples
Unknown quality          Scored and filtered
Hidden contradictions    Flagged for review

Features

Semantic Deduplication

Removes near-duplicate rows using embedding similarity and Union-Find clustering. Each cluster keeps the single most representative example (closest to centroid).

truva dedupe ./data.jsonl --threshold 0.95
  • --threshold 0.95 (default): Aggressive but safe for most fine-tuning datasets
  • --threshold 0.85: More aggressive, catches paraphrases
  • --threshold 1.0: Only removes exact semantic matches
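
The clustering idea can be pictured with a minimal sketch (illustrative only, not Truva's internal code): unit-normalize the embeddings so dot products equal cosine similarity, union rows whose similarity crosses the threshold, then keep the member of each cluster closest to its centroid.

```python
# Sketch of embedding dedup via Union-Find (illustrative, not Truva's code).
import numpy as np

def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def dedupe(embeddings, threshold=0.95):
    # Unit-normalize so the dot product equals cosine similarity.
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    n = len(emb)
    parent = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                parent[find(parent, i)] = find(parent, j)
    # Group rows by root, then keep the member closest to the cluster centroid.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(parent, i), []).append(i)
    keep = [max(members, key=lambda i: float(emb[i] @ emb[members].mean(axis=0)))
            for members in clusters.values()]
    return sorted(keep)
```

Note the sketch computes a full n×n similarity matrix for clarity; that is fine for small datasets but would not scale to 50,000 rows without approximate nearest-neighbour search.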

Quality Scoring

Scores each row on a 1–10 scale for educational value using an LLM judge. Drop low-quality rows with --min-quality.

# Free, local (requires Ollama running locally)
truva score ./data.jsonl --provider ollama --model llama3:8b --min-quality 6

# OpenAI
truva score ./data.jsonl --provider openai --model gpt-4o-mini --min-quality 6

# Anthropic
truva score ./data.jsonl --provider anthropic --model claude-haiku-4-5-20251001 --min-quality 6

# With a domain-specific calibration file
truva score ./data.jsonl --calibration ./calibration_medical.yaml --min-quality 7

# Interactive dry-run: score 50 samples and give feedback before committing
truva score ./data.jsonl --interactive --sample 50

Fast mode (default): Scores only cluster representatives from a prior dedup run, then propagates scores to cluster members with a −1 penalty. 5–10x cheaper than scoring every row.
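
The propagation step can be sketched as follows (hypothetical data shapes, not Truva's internals): each representative keeps its judged score, and cluster members inherit that score minus a 1-point penalty.

```python
# Illustrative sketch of fast-mode score propagation (assumed data shapes).
def propagate_scores(clusters, rep_scores, penalty=1.0, floor=1.0):
    """clusters: {representative_id: [member_ids]}; rep_scores: {representative_id: score}."""
    scores = {}
    for rep, members in clusters.items():
        scores[rep] = rep_scores[rep]  # representative keeps its judged score
        for m in members:
            if m != rep:
                # members inherit the representative's score with a -1 penalty
                scores[m] = max(floor, rep_scores[rep] - penalty)
    return scores
```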

Thorough mode: Scores every row individually.

truva score ./data.jsonl --mode thorough --provider openai --model gpt-4o-mini

Contradiction Detection

Finds rows that teach conflicting information using a local NLI model (cross-encoder/nli-deberta-v3-base). Only semantically similar rows (within the same cluster) are compared, keeping the number of checks tractable.

truva contradict ./data.jsonl --confidence 0.8 --output ./contradictions.json

Output is a JSON report listing each contradicting pair, the texts, and the NLI confidence score:

{
  "input_rows": 1000,
  "num_clusters": 312,
  "confidence_threshold": 0.8,
  "contradictions_found": 4,
  "contradictions": [
    {
      "row_a_id": "c1a",
      "row_b_id": "c1b",
      "text_a": "Refunds are processed instantly.",
      "text_b": "Refunds take 3 to 5 business days.",
      "confidence": 0.9341
    }
  ]
}

No API key needed — NLI runs entirely on CPU.
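
The reason clustering first keeps NLI tractable: instead of checking all n·(n−1)/2 row pairs, only pairs inside the same cluster are compared. A rough sketch of that pairing logic, with a stand-in scoring function in place of the real cross-encoder:

```python
# Sketch of within-cluster pair generation (illustrative; nli_score is a stand-in
# for the real NLI model, not Truva's API).
from itertools import combinations

def candidate_pairs(clusters):
    """clusters: list of lists of row ids; yields only within-cluster pairs."""
    for members in clusters:
        yield from combinations(members, 2)

def find_contradictions(clusters, nli_score, confidence=0.8):
    # nli_score(a, b) -> probability that rows a and b contradict each other
    return [(a, b, s) for a, b in candidate_pairs(clusters)
            if (s := nli_score(a, b)) >= confidence]
```

With two clusters of sizes 3 and 2, only 4 pairs are scored instead of the 10 a full cross-product would require; the gap widens sharply as the dataset grows.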

Full Pipeline (audit)

Run deduplication, quality scoring, and contradiction detection in one command. Writes gold.jsonl and report.json to the output directory.

truva audit ./data.jsonl \
  --provider openai --model gpt-4o-mini \
  --min-quality 6 --detect-contradictions \
  --output ./results/

After the run, inspect the results:

# View top duplicate clusters
truva inspect ./results/report.json --what clusters --top 10

# View contradiction pairs
truva inspect ./results/report.json --what contradictions

# Export gold dataset as CSV
truva export ./results/gold.jsonl --format csv -o ./gold.csv

# Quick health check on any dataset
truva health ./data.jsonl

Embedding Generation

Compute vector embeddings for your dataset using local models or the OpenAI API.

# Local (free, no API key needed)
truva embed ./data.jsonl --provider local --model all-MiniLM-L6-v2

# OpenAI API
truva embed ./data.jsonl --provider api --model text-embedding-3-small
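
These embeddings are what the dedupe step compares: the cosine similarity of two rows is the dot product of their unit-normalized vectors. A small illustrative helper (not part of Truva's API):

```python
# Pairwise cosine similarity for a (n_rows, dim) embedding matrix (illustrative).
import numpy as np

def cosine_sim_matrix(embeddings):
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T
```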

Example Dedup Report

When you pass --report ./report.json to truva dedupe, Truva writes a structured summary:

{
  "input_rows": 50000,
  "kept_rows": 12380,
  "removed_rows": 37620,
  "reduction_pct": 75.24,
  "threshold": 0.95,
  "num_clusters": 12380,
  "clusters": [
    {
      "representative_idx": 41,
      "size": 23,
      "avg_similarity": 0.9812
    }
  ]
}
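
The reduction_pct field is derived directly from the row counts: removed_rows / input_rows × 100. A quick check of the figures above:

```python
input_rows, kept_rows = 50_000, 12_380
removed = input_rows - kept_rows                      # 37,620 removed rows
reduction_pct = round(removed / input_rows * 100, 2)  # 75.24
```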

Connecting to API Providers

OpenAI

  1. Get an API key from https://platform.openai.com/api-keys
  2. Set the environment variable:
export OPENAI_API_KEY=sk-...

Then use --provider openai with any truva command.

Anthropic

  1. Get an API key from https://console.anthropic.com/settings/keys
  2. Set the environment variable:
export ANTHROPIC_API_KEY=sk-ant-...

Then use --provider anthropic with any truva command.

Ollama (Free, Local)

Install Ollama from https://ollama.com, then pull a model:

ollama pull llama3:8b

No API key needed. Use --provider ollama with truva score.

Calibration

Calibration files let you inject domain-specific scoring rules and few-shot examples into the judge prompt. This is useful when the default judge undervalues domain-specific shorthand (e.g., medical abbreviations, legal citations).

# calibration_medical.yaml
rubric: |
  Prioritize clinical accuracy over grammar.
  Short medical abbreviations (SOB, CHF, hx) are acceptable.
  Penalize vague or speculative language.

examples:
  - text: "pt presents w/ SOB, hx of CHF, started on furosemide 40mg"
    score: 9
    reasoning: "Clinically precise with actionable treatment detail"
  - text: "The patient is not feeling well and might have a heart problem"
    score: 2
    reasoning: "Vague, no clinical specificity"

truva score ./data.jsonl --calibration ./calibration_medical.yaml --min-quality 7

See examples/ for ready-made calibration files for medical and customer support domains.

Supported Formats

  • JSONL — One JSON object per line (.jsonl, .json)
  • CSV — Auto-detects the text column, or specify one with --text-field
  • Hugging Face Datasets — Pass a dataset identifier like username/dataset
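
For reference, JSONL means one standalone JSON object per line. A minimal reader that also guesses a text field (illustrative; the candidate field names here are assumptions, not Truva's exact auto-detection logic):

```python
# Minimal JSONL reader with a naive text-field guess (illustrative sketch).
import json

CANDIDATE_FIELDS = ("text", "content", "prompt", "message")  # assumed heuristics

def read_jsonl(path, text_field=None):
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                rows.append(json.loads(line))
    if text_field is None and rows:
        # pick the first candidate field present in the first row
        text_field = next((k for k in CANDIDATE_FIELDS if k in rows[0]), None)
    return rows, text_field
```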

Configuration

truva dedupe

--threshold FLOAT         Cosine similarity threshold (0.0–1.0), default 0.95
--provider [local|api]    Embedding provider
--model TEXT              Embedding model name
--text-field TEXT         Column/field to embed (auto-detected if not set)
--format TEXT             Input format: auto, jsonl, csv, hf
--output, -o TEXT         Output file path
--report TEXT             Path for JSON report

truva score

--provider                Judge model provider: ollama, openai, anthropic
--model TEXT              Judge model name
--mode [fast|thorough]    Scoring mode, default fast
--min-quality FLOAT       Drop rows below this score (0.0 = keep all)
--calibration TEXT        Path to calibration YAML file
--text-field TEXT         Column/field to score (auto-detected if not set)
--output, -o TEXT         Output file path
--report TEXT             Path for scoring report JSON
--interactive             Interactive dry-run mode
--sample INT              Number of rows to sample in interactive mode

truva contradict

--confidence FLOAT        Min NLI confidence to flag a contradiction, default 0.8
--nli-model TEXT          NLI model name, default cross-encoder/nli-deberta-v3-base
--text-field TEXT         Column/field to compare (auto-detected if not set)
--dedupe-threshold FLOAT  Similarity threshold for clustering before NLI check, default 0.95
--format TEXT             Input format: auto, jsonl, csv, hf
--output, -o TEXT         Output path for contradiction report JSON

truva audit

--provider TEXT           Judge model provider: ollama, openai, anthropic
--model TEXT              Judge model name
--embed-provider TEXT     Embedding provider: local, api (default: local)
--embed-model TEXT        Embedding model name
--dedupe-threshold FLOAT  Cosine similarity threshold for dedup, default 0.95
--min-quality FLOAT       Drop rows below this score, default 6.0
--mode [fast|thorough]    Scoring mode, default fast
--detect-contradictions   Run NLI contradiction detection after dedup
--contradiction-confidence FLOAT  Min confidence to flag, default 0.8
--text-field TEXT         Column/field name (auto-detected if not set)
--format TEXT             Input format: auto, jsonl, csv, hf
--calibration TEXT        Path to calibration YAML file
--output, -o TEXT         Output directory (default: ./truva_output)

truva inspect

REPORT_PATH               Path to report.json from a prior audit run
--what [clusters|contradictions]  What to inspect (default: clusters)
--top INT                 How many entries to show (default: 20)

truva export

GOLD_PATH                 Path to gold.jsonl from a prior audit run
--format [jsonl|csv]      Output format (default: jsonl)
--columns TEXT            Comma-separated fields to include (default: all)
--output, -o TEXT         Output file path

truva health

INPUT_PATH                Dataset to inspect
--text-field TEXT         Column/field to analyse (auto-detected if not set)
--format TEXT             Input format: auto, jsonl, csv, hf

Roadmap

  • Caching — Skip recomputation on re-runs
  • Data leakage detection — Flag training rows that overlap with your eval set

Quickstart Script

A runnable demo script is included in the repo:

bash examples/quickstart.sh

It runs a health check, full audit with local embeddings, cluster inspection, and CSV export on the bundled 50-row fixture — no API key required.

Development

# Clone and install in editable mode with dev dependencies
git clone https://github.com/turingspark/truva
cd truva
pip install -e ".[dev]"

# Run fast unit tests (no models required)
pytest tests/ -v -m "not integration"

# Run full test suite including end-to-end integration tests
pytest tests/ -v

Integration tests load real embedding (all-MiniLM-L6-v2) and NLI (cross-encoder/nli-deberta-v3-base) models locally — no API keys needed, but they take a few minutes.

Requirements

  • Python 3.10+
  • Works on macOS (Apple Silicon) and Linux

License

Apache 2.0

Feedback

Found a bug or have a feature request? Send us an email at team@turingspark.com — we'd love to hear from you.
