Data curation engine for LLM fine-tuning

Truva

Truva curates your fine-tuning data so you train on signal, not noise.

A CLI-first data curation engine for ML engineers who fine-tune language models. Truva takes a messy dataset and produces a smaller, higher-quality "gold" dataset by removing redundancy, scoring information density, and detecting contradictions.

Goal: Reduce dataset size by 50–80% while maintaining or improving downstream model accuracy, cutting GPU training costs proportionally.

Quick Install

pip install truva

30-Second Example

# Deduplicate a dataset with default settings
truva dedupe ./data.jsonl --output ./deduped.jsonl

# Deduplicate with a custom threshold and generate a report
truva dedupe ./data.jsonl --threshold 0.9 --output ./deduped.jsonl --report ./report.json

# Generate embeddings for a dataset
truva embed ./data.jsonl --output ./embeddings.npy

What It Does

Before                   After
50,000 rows              12,000 rows
Redundant examples       Unique, representative samples
Unknown quality          Scored and filtered
Hidden contradictions    Flagged for review
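
As a quick check of the row counts above against the stated 50–80% goal, the reduction can be computed directly. This is only illustrative arithmetic; the function name is hypothetical and not part of Truva.

```python
def reduction_pct(rows_before: int, rows_after: int) -> float:
    """Percentage of rows removed by curation."""
    if rows_before <= 0:
        raise ValueError("rows_before must be positive")
    return 100.0 * (rows_before - rows_after) / rows_before


# Row counts taken from the Before/After comparison above.
print(reduction_pct(50_000, 12_000))  # 76.0
```

A 76% reduction sits inside the 50–80% target range. Whether training cost falls by the same fraction depends on your training setup (epochs, sequence lengths, batch packing).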

Features

Semantic Deduplication

Removes near-duplicate rows using embedding similarity and Union-Find clustering. Each cluster keeps the single most representative example (closest to centroid).

truva dedupe ./data.jsonl --threshold 0.95
  • --threshold 0.95 (default): Aggressive but safe for most fine-tuning datasets
  • --threshold 0.85: More aggressive, catches paraphrases
  • --threshold 1.0: Removes only rows whose embeddings are identical (cosine similarity of 1.0)
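
The clustering step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Truva's actual implementation: it compares every pair of rows (quadratic cost), measures closeness to the centroid with cosine similarity, and all names (`UnionFind`, `dedupe_indices`) are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class UnionFind:
    """Disjoint-set structure with path halving."""

    def __init__(self, size: int) -> None:
        self.parent = list(range(size))

    def find(self, i: int) -> int:
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def union(self, i: int, j: int) -> None:
        root_i, root_j = self.find(i), self.find(j)
        if root_i != root_j:
            self.parent[root_j] = root_i


def dedupe_indices(embeddings: List[Vector], threshold: float = 0.95) -> List[int]:
    """Return the indices of rows to keep, one per similarity cluster."""
    n = len(embeddings)
    uf = UnionFind(n)
    # Link every pair whose similarity meets the threshold (assumed all-pairs scan).
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                uf.union(i, j)

    # Group row indices by their cluster root.
    clusters: Dict[int, List[int]] = {}
    for i in range(n):
        clusters.setdefault(uf.find(i), []).append(i)

    keep = []
    for members in clusters.values():
        dim = len(embeddings[members[0]])
        centroid = [
            sum(embeddings[m][d] for m in members) / len(members) for d in range(dim)
        ]
        # Keep the member most similar to the cluster centroid.
        keep.append(max(members, key=lambda m: cosine(embeddings[m], centroid)))
    return sorted(keep)
```

Because Union-Find merges transitively, two rows can land in the same cluster even when their direct similarity is below the threshold, as long as a chain of similar rows connects them. Lower thresholds therefore tend to produce fewer, larger clusters.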

Embedding Generation

Compute vector embeddings for your dataset using local models or the OpenAI API.

# Local (free, no API key needed)
truva embed ./data.jsonl --provider local --model all-MiniLM-L6-v2

# OpenAI API
truva embed ./data.jsonl --provider api --model text-embedding-3-small
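
For readers who want to see roughly what local embedding involves, here is a hedged sketch. The README does not say which library backs the local provider; all-MiniLM-L6-v2 is published as a sentence-transformers model, so this assumes that third-party package is installed. The helper names (`l2_normalize`, `embed_local`) are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so cosine similarity becomes a dot product."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return [0.0 for _ in vector]
    return [x / norm for x in vector]


def embed_local(texts: List[str], model_name: str = "all-MiniLM-L6-v2") -> List[List[float]]:
    """Embed texts with a local model (assumes sentence-transformers is installed)."""
    # Imported lazily so the helper above works without the dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    vectors = model.encode(texts)  # one row per input text
    return [l2_normalize(row.tolist()) for row in vectors]
```

Normalizing once up front is a common design choice when the vectors will be compared many times, as in deduplication.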

Supported Formats

  • JSONL — One JSON object per line (.jsonl, .json)
  • CSV — Auto-detects the text column or use --text-field
  • Hugging Face Datasets — Pass a dataset identifier like username/dataset
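
To make text-field auto-detection concrete, here is one plausible heuristic for JSONL and CSV input. It is a sketch only: the README does not document Truva's detection rules, so the candidate field names and the longest-string fallback are assumptions, and `read_rows` and `detect_text_field` are hypothetical names.

```python
import csv
import io
import json
from typing import Any, Dict, List, Optional

# Assumed common field names; Truva's real list is not documented here.
CANDIDATE_FIELDS = ("text", "content", "prompt", "instruction")


def read_rows(raw: str, fmt: str) -> List[Dict[str, Any]]:
    """Parse raw file contents as JSONL (one object per line) or CSV."""
    if fmt == "jsonl":
        return [json.loads(line) for line in raw.splitlines() if line.strip()]
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(raw)))
    raise ValueError(f"unsupported format: {fmt}")


def detect_text_field(row: Dict[str, Any], preferred: Optional[str] = None) -> str:
    """Pick the field holding the text, honoring an explicit choice first."""
    if preferred is not None:
        if preferred not in row:
            raise KeyError(f"field not found: {preferred}")
        return preferred
    for name in CANDIDATE_FIELDS:
        if isinstance(row.get(name), str):
            return name
    # Fallback: the field with the longest string value.
    strings = {k: v for k, v in row.items() if isinstance(v, str)}
    if not strings:
        raise ValueError("no string fields to choose from")
    return max(strings, key=lambda k: len(strings[k]))
```

Passing `preferred` plays the role that --text-field plays on the command line: an explicit choice overrides any guess.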

Configuration

All options are available as CLI flags:

--threshold FLOAT       Cosine similarity threshold for dedup (0.0–1.0)
--provider [local|api]  Embedding provider
--model TEXT            Model name
--text-field TEXT       Column/field to use (auto-detected if not set)
--format TEXT           Input format: auto, jsonl, csv, hf
--output, -o TEXT       Output file path
--report TEXT           Path for JSON report
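
If you drive Truva from a Python script, the flags above can be assembled into an argument list. This sketch uses only the flags that appear with `truva dedupe` in the examples earlier (--threshold, --output, --report); the function name is hypothetical, and the range check mirrors the documented 0.0–1.0 bounds.

```python
from typing import List, Optional


def build_dedupe_command(
    input_path: str,
    output: Optional[str] = None,
    threshold: Optional[float] = None,
    report: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `truva dedupe` from optional settings."""
    command = ["truva", "dedupe", input_path]
    if threshold is not None:
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        command += ["--threshold", str(threshold)]
    if output is not None:
        command += ["--output", output]
    if report is not None:
        command += ["--report", report]
    return command


# Usage (requires truva to be installed and on PATH):
#   import subprocess
#   subprocess.run(build_dedupe_command("./data.jsonl", output="./deduped.jsonl"), check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.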

Requirements

  • Python 3.10+
  • Works on macOS (Apple Silicon) and Linux

License

Apache 2.0
