Data curation engine for LLM fine-tuning

Truva

Truva curates your fine-tuning data so you train on signal, not noise.

A CLI-first data curation engine for ML engineers who fine-tune language models. Truva takes a messy dataset and produces a smaller, higher-quality "gold" dataset by removing redundancy, scoring information density, and detecting contradictions.

Goal: Reduce dataset size by 50–80% while maintaining or improving downstream model accuracy, cutting GPU training costs proportionally.

Quick Install

pip install truva

30-Second Example

# Deduplicate a dataset
truva dedupe ./data.jsonl --output ./deduped.jsonl

# Score for information density
truva score ./data.jsonl --provider ollama --model llama3:8b --min-quality 6

# Detect contradicting rows
truva contradict ./data.jsonl --confidence 0.8 --output ./contradictions.json

# Full pipeline in one command
truva audit ./data.jsonl \
  --provider openai --model gpt-4o-mini \
  --min-quality 6 --detect-contradictions \
  --output ./results/

What It Does

Before                   After
50,000 rows              12,000 rows
Redundant examples       Unique, representative samples
Unknown quality          Scored and filtered
Hidden contradictions    Flagged for review

Features

Semantic Deduplication

Removes near-duplicate rows using embedding similarity and Union-Find clustering. Each cluster keeps the single most representative example (closest to centroid).

truva dedupe ./data.jsonl --threshold 0.95
  • --threshold 0.95 (default): Aggressive but safe for most fine-tuning datasets
  • --threshold 0.85: More aggressive, catches paraphrases
  • --threshold 1.0: Only removes exact semantic matches
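
The clustering idea can be pictured with a minimal sketch (illustrative only, not Truva's internal code): unit-normalize the embeddings so dot products equal cosine similarity, union rows whose similarity crosses the threshold, then keep the member of each cluster closest to its centroid.

```python
# Sketch of embedding dedup via Union-Find (illustrative, not Truva's code).
import numpy as np

def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path compression
        i = parent[i]
    return i

def dedupe(embeddings, threshold=0.95):
    # Unit-normalize so the dot product equals cosine similarity.
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    n = len(emb)
    parent = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                parent[find(parent, i)] = find(parent, j)
    # Group rows by root, then keep the member closest to the cluster centroid.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(parent, i), []).append(i)
    keep = [max(members, key=lambda i: float(emb[i] @ emb[members].mean(axis=0)))
            for members in clusters.values()]
    return sorted(keep)
```

Note the sketch computes a full n×n similarity matrix for clarity; that is fine for small datasets but would not scale to 50,000 rows without approximate nearest-neighbour search.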

Quality Scoring

Scores each row on a 1–10 scale for educational value using an LLM judge. Drop low-quality rows with --min-quality.

# Free, local (requires Ollama running locally)
truva score ./data.jsonl --provider ollama --model llama3:8b --min-quality 6

# OpenAI
truva score ./data.jsonl --provider openai --model gpt-4o-mini --min-quality 6

# Anthropic
truva score ./data.jsonl --provider anthropic --model claude-haiku-4-5-20251001 --min-quality 6

# With a domain-specific calibration file
truva score ./data.jsonl --calibration ./calibration_medical.yaml --min-quality 7

# Interactive dry-run: score 50 samples and give feedback before committing
truva score ./data.jsonl --interactive --sample 50

Fast mode (default): Scores only cluster representatives from a prior dedup run, then propagates scores to cluster members with a −1 penalty. 5–10x cheaper than scoring every row.
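
The propagation step can be sketched as follows (hypothetical data shapes, not Truva's internals): each representative keeps its judged score, and cluster members inherit that score minus a 1-point penalty.

```python
# Illustrative sketch of fast-mode score propagation (assumed data shapes).
def propagate_scores(clusters, rep_scores, penalty=1.0, floor=1.0):
    """clusters: {representative_id: [member_ids]}; rep_scores: {representative_id: score}."""
    scores = {}
    for rep, members in clusters.items():
        scores[rep] = rep_scores[rep]  # representative keeps its judged score
        for m in members:
            if m != rep:
                # members inherit the representative's score with a -1 penalty
                scores[m] = max(floor, rep_scores[rep] - penalty)
    return scores
```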

Thorough mode: Scores every row individually.

truva score ./data.jsonl --mode thorough --provider openai --model gpt-4o-mini

Contradiction Detection

Finds rows that teach conflicting information using a local NLI model (cross-encoder/nli-deberta-v3-base). Only semantically similar rows (within the same cluster) are compared, keeping the number of checks tractable.

truva contradict ./data.jsonl --confidence 0.8 --output ./contradictions.json

Output is a JSON report listing each contradicting pair, the texts, and the NLI confidence score:

{
  "input_rows": 1000,
  "num_clusters": 312,
  "confidence_threshold": 0.8,
  "contradictions_found": 4,
  "contradictions": [
    {
      "row_a_id": "c1a",
      "row_b_id": "c1b",
      "text_a": "Refunds are processed instantly.",
      "text_b": "Refunds take 3 to 5 business days.",
      "confidence": 0.9341
    }
  ]
}

No API key needed — NLI runs entirely on CPU.
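
The reason clustering first keeps NLI tractable: instead of checking all n·(n−1)/2 row pairs, only pairs inside the same cluster are compared. A rough sketch of that pairing logic, with a stand-in scoring function in place of the real cross-encoder:

```python
# Sketch of within-cluster pair generation (illustrative; nli_score is a stand-in
# for the real NLI model, not Truva's API).
from itertools import combinations

def candidate_pairs(clusters):
    """clusters: list of lists of row ids; yields only within-cluster pairs."""
    for members in clusters:
        yield from combinations(members, 2)

def find_contradictions(clusters, nli_score, confidence=0.8):
    # nli_score(a, b) -> probability that rows a and b contradict each other
    return [(a, b, s) for a, b in candidate_pairs(clusters)
            if (s := nli_score(a, b)) >= confidence]
```

With two clusters of sizes 3 and 2, only 4 pairs are scored instead of the 10 a full cross-product would require; the gap widens sharply as the dataset grows.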

Full Pipeline (audit)

Run deduplication, quality scoring, and contradiction detection in one command. Writes gold.jsonl and report.json to the output directory.

truva audit ./data.jsonl \
  --provider openai --model gpt-4o-mini \
  --min-quality 6 --detect-contradictions \
  --output ./results/

After the run, inspect the results:

# View top duplicate clusters
truva inspect ./results/report.json --what clusters --top 10

# View contradiction pairs
truva inspect ./results/report.json --what contradictions

# Export gold dataset as CSV
truva export ./results/gold.jsonl --format csv -o ./gold.csv

# Quick health check on any dataset
truva health ./data.jsonl

Embedding Generation

Compute vector embeddings for your dataset using local models or the OpenAI API.

# Local (free, no API key needed)
truva embed ./data.jsonl --provider local --model all-MiniLM-L6-v2

# OpenAI API
truva embed ./data.jsonl --provider api --model text-embedding-3-small
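
These embeddings are what the dedupe step compares: the cosine similarity of two rows is the dot product of their unit-normalized vectors. A small illustrative helper (not part of Truva's API):

```python
# Pairwise cosine similarity for a (n_rows, dim) embedding matrix (illustrative).
import numpy as np

def cosine_sim_matrix(embeddings):
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T
```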

Example Dedup Report

When you pass --report ./report.json to truva dedupe, Truva writes a structured summary:

{
  "input_rows": 50000,
  "kept_rows": 12380,
  "removed_rows": 37620,
  "reduction_pct": 75.24,
  "threshold": 0.95,
  "num_clusters": 12380,
  "clusters": [
    {
      "representative_idx": 41,
      "size": 23,
      "avg_similarity": 0.9812
    }
  ]
}
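
The reduction_pct field is derived directly from the row counts: removed_rows / input_rows × 100. A quick check of the figures above:

```python
input_rows, kept_rows = 50_000, 12_380
removed = input_rows - kept_rows                      # 37,620 removed rows
reduction_pct = round(removed / input_rows * 100, 2)  # 75.24
```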

Connecting to API Providers

OpenAI

  1. Get an API key from https://platform.openai.com/api-keys
  2. Set the environment variable:
export OPENAI_API_KEY=sk-...

Then use --provider openai with any truva command.

Anthropic

  1. Get an API key from https://console.anthropic.com/settings/keys
  2. Set the environment variable:
export ANTHROPIC_API_KEY=sk-ant-...

Then use --provider anthropic with any truva command.

Ollama (Free, Local)

Install Ollama from https://ollama.com, then pull a model:

ollama pull llama3:8b

No API key needed. Use --provider ollama with truva score.

Calibration

Calibration files let you inject domain-specific scoring rules and few-shot examples into the judge prompt. This is useful when the default judge undervalues domain-specific shorthand (e.g., medical abbreviations, legal citations).

# calibration_medical.yaml
rubric: |
  Prioritize clinical accuracy over grammar.
  Short medical abbreviations (SOB, CHF, hx) are acceptable.
  Penalize vague or speculative language.

examples:
  - text: "pt presents w/ SOB, hx of CHF, started on furosemide 40mg"
    score: 9
    reasoning: "Clinically precise with actionable treatment detail"
  - text: "The patient is not feeling well and might have a heart problem"
    score: 2
    reasoning: "Vague, no clinical specificity"

truva score ./data.jsonl --calibration ./calibration_medical.yaml --min-quality 7

See examples/ for ready-made calibration files for medical and customer support domains.

Supported Formats

  • JSONL — One JSON object per line (.jsonl, .json)
  • CSV — Auto-detects the text column, or specify one with --text-field
  • Hugging Face Datasets — Pass a dataset identifier like username/dataset
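
For reference, JSONL means one standalone JSON object per line. A minimal reader that also guesses a text field (illustrative; the candidate field names here are assumptions, not Truva's exact auto-detection logic):

```python
# Minimal JSONL reader with a naive text-field guess (illustrative sketch).
import json

CANDIDATE_FIELDS = ("text", "content", "prompt", "message")  # assumed heuristics

def read_jsonl(path, text_field=None):
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                rows.append(json.loads(line))
    if text_field is None and rows:
        # pick the first candidate field present in the first row
        text_field = next((k for k in CANDIDATE_FIELDS if k in rows[0]), None)
    return rows, text_field
```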

Configuration

truva dedupe

--threshold FLOAT         Cosine similarity threshold (0.0–1.0), default 0.95
--provider [local|api]    Embedding provider
--model TEXT              Embedding model name
--text-field TEXT         Column/field to embed (auto-detected if not set)
--format TEXT             Input format: auto, jsonl, csv, hf
--output, -o TEXT         Output file path
--report TEXT             Path for JSON report

truva score

--provider                Judge model provider: ollama, openai, anthropic
--model TEXT              Judge model name
--mode [fast|thorough]    Scoring mode, default fast
--min-quality FLOAT       Drop rows below this score (0.0 = keep all)
--calibration TEXT        Path to calibration YAML file
--text-field TEXT         Column/field to score (auto-detected if not set)
--output, -o TEXT         Output file path
--report TEXT             Path for scoring report JSON
--interactive             Interactive dry-run mode
--sample INT              Number of rows to sample in interactive mode

truva contradict

--confidence FLOAT        Min NLI confidence to flag a contradiction, default 0.8
--nli-model TEXT          NLI model name, default cross-encoder/nli-deberta-v3-base
--text-field TEXT         Column/field to compare (auto-detected if not set)
--dedupe-threshold FLOAT  Similarity threshold for clustering before NLI check, default 0.95
--format TEXT             Input format: auto, jsonl, csv, hf
--output, -o TEXT         Output path for contradiction report JSON

truva audit

--provider TEXT           Judge model provider: ollama, openai, anthropic
--model TEXT              Judge model name
--embed-provider TEXT     Embedding provider: local, api (default: local)
--embed-model TEXT        Embedding model name
--dedupe-threshold FLOAT  Cosine similarity threshold for dedup, default 0.95
--min-quality FLOAT       Drop rows below this score, default 6.0
--mode [fast|thorough]    Scoring mode, default fast
--detect-contradictions   Run NLI contradiction detection after dedup
--contradiction-confidence FLOAT  Min confidence to flag, default 0.8
--text-field TEXT         Column/field name (auto-detected if not set)
--format TEXT             Input format: auto, jsonl, csv, hf
--calibration TEXT        Path to calibration YAML file
--output, -o TEXT         Output directory (default: ./truva_output)

truva inspect

REPORT_PATH               Path to report.json from a prior audit run
--what [clusters|contradictions]  What to inspect (default: clusters)
--top INT                 How many entries to show (default: 20)

truva export

GOLD_PATH                 Path to gold.jsonl from a prior audit run
--format [jsonl|csv]      Output format (default: jsonl)
--columns TEXT            Comma-separated fields to include (default: all)
--output, -o TEXT         Output file path

truva health

INPUT_PATH                Dataset to inspect
--text-field TEXT         Column/field to analyse (auto-detected if not set)
--format TEXT             Input format: auto, jsonl, csv, hf

Roadmap

  • Caching — Skip recomputation on re-runs
  • Data leakage detection — Flag training rows that overlap with your eval set

Quickstart Script

A runnable demo script is included in the repo:

bash examples/quickstart.sh

It runs a health check, full audit with local embeddings, cluster inspection, and CSV export on the bundled 50-row fixture — no API key required.

Development

# Clone and install in editable mode with dev dependencies
git clone https://github.com/turingspark/truva
cd truva
pip install -e ".[dev]"

# Run fast unit tests (no models required)
pytest tests/ -v -m "not integration"

# Run full test suite including end-to-end integration tests
pytest tests/ -v

Integration tests load real embedding (all-MiniLM-L6-v2) and NLI (cross-encoder/nli-deberta-v3-base) models locally — no API keys needed, but they take a few minutes.

Requirements

  • Python 3.10+
  • Works on macOS (Apple Silicon) and Linux

License

Apache 2.0

Feedback

Found a bug or have a feature request? Send us an email at team@turingspark.com — we'd love to hear from you.
