Data curation engine for LLM fine-tuning

Project description

Truva

Truva curates your fine-tuning data so you train on signal, not noise.

Truva is a CLI-first data curation engine for ML engineers who fine-tune language models. It takes a messy dataset and produces a smaller, higher-quality "gold" dataset by removing redundancy, scoring information density, and detecting contradictions.

Goal: Reduce dataset size by 50–80% while maintaining or improving downstream model accuracy, cutting GPU training costs proportionally.

Quick Install

pip install truva

30-Second Example

# Deduplicate a dataset with default settings
truva dedupe ./data.jsonl --output ./deduped.jsonl

# Deduplicate with a custom threshold and generate a report
truva dedupe ./data.jsonl --threshold 0.9 --output ./deduped.jsonl --report ./report.json

# Generate embeddings for a dataset
truva embed ./data.jsonl --output ./embeddings.npy

What It Does

Before	After
50,000 rows	12,000 rows
Redundant examples	Unique, representative samples
Unknown quality	Scored and filtered
Hidden contradictions	Flagged for review
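
As a quick check on the figures in the table, the short sketch below computes the reduction they imply and compares it with the stated 50–80% goal. The helper name `reduction_percent` is made up for illustration and is not part of Truva.

```python
def reduction_percent(rows_before: int, rows_after: int) -> float:
    """Return the percentage of rows removed by curation."""
    if rows_before <= 0:
        raise ValueError("rows_before must be positive")
    return 100.0 * (rows_before - rows_after) / rows_before


# Figures from the Before/After table: 50,000 rows in, 12,000 rows out.
pct = reduction_percent(50_000, 12_000)
print(f"{pct:.0f}% of rows removed")  # 76% of rows removed
print(50.0 <= pct <= 80.0)            # True: inside the stated goal range
```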

Features

Semantic Deduplication

Removes near-duplicate rows using embedding similarity and Union-Find clustering. Each cluster keeps the single most representative example (closest to centroid).

truva dedupe ./data.jsonl --threshold 0.95

--threshold 0.95 (default): Aggressive but safe for most fine-tuning datasets
--threshold 0.85: More aggressive, catches paraphrases
--threshold 1.0: Only removes exact semantic matches
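
To make the description above concrete, here is a minimal pure-Python sketch of threshold-based deduplication with Union-Find. It is an illustration under stated assumptions, not Truva's actual implementation: the all-pairs comparison, the use of cosine similarity to pick the member closest to the centroid, and the names `cosine`, `find`, and `dedupe` are all choices made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def find(parent, i):
    """Return the root of i, shortening the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i


def dedupe(vectors, threshold=0.95):
    """Return the sorted indices of the rows kept, one per cluster."""
    n = len(vectors)
    parent = list(range(n))

    # Link every pair whose similarity reaches the threshold (O(n^2) sketch).
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                root_i, root_j = find(parent, i), find(parent, j)
                if root_i != root_j:
                    parent[root_j] = root_i

    # Group row indices by their cluster root.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(parent, i), []).append(i)

    # Keep the member most similar to the cluster centroid (assumed metric).
    kept = []
    for members in clusters.values():
        dim = len(vectors[members[0]])
        centroid = [
            sum(vectors[m][d] for m in members) / len(members)
            for d in range(dim)
        ]
        kept.append(max(members, key=lambda m: cosine(vectors[m], centroid)))
    return sorted(kept)


# Toy 2-D "embeddings": rows 0-2 point in nearly the same direction.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
print(dedupe(vectors, threshold=0.95))  # [1, 3]
```

In this toy input, rows 0, 1, and 2 form one cluster and row 1 sits at its centroid, so it is the one kept. Raising the threshold to 1.0 leaves all four rows in place, because no two rows point in exactly the same direction.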

Embedding Generation

Compute vector embeddings for your dataset using local models or the OpenAI API.

# Local (free, no API key needed)
truva embed ./data.jsonl --provider local --model all-MiniLM-L6-v2

# OpenAI API
truva embed ./data.jsonl --provider api --model text-embedding-3-small

Supported Formats

JSONL — One JSON object per line (.jsonl, .json)
CSV — Auto-detects the text column or use --text-field
Hugging Face Datasets — Pass a dataset identifier like username/dataset
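
For readers unfamiliar with JSONL, the sketch below reads one JSON object per line and guesses which field holds the text. The detection rule Truva uses is not documented here, so the heuristic shown (pick the string field with the most total characters) and the function names `read_jsonl` and `guess_text_field` are purely hypothetical.

```python
import io
import json


def read_jsonl(stream):
    """Parse a JSONL stream into a list of dicts, skipping blank lines."""
    rows = []
    for line in stream:
        line = line.strip()
        if line:
            rows.append(json.loads(line))
    return rows


def guess_text_field(rows):
    """Hypothetical heuristic: the string field with the most characters."""
    totals = {}
    for row in rows:
        for key, value in row.items():
            if isinstance(value, str):
                totals[key] = totals.get(key, 0) + len(value)
    if not totals:
        raise ValueError("no string fields found")
    return max(totals, key=totals.get)


# An in-memory stand-in for a small .jsonl file.
sample = io.StringIO(
    '{"id": "a1", "text": "How do I reset my password?"}\n'
    '{"id": "a2", "text": "Steps to change a forgotten password"}\n'
)
rows = read_jsonl(sample)
print(len(rows))               # 2
print(guess_text_field(rows))  # text
```

When a guess like this would be ambiguous, passing --text-field explicitly avoids relying on auto-detection.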

Configuration

All options are available as CLI flags:

--threshold FLOAT       Cosine similarity threshold for dedup (0.0–1.0)
--provider [local|api]  Embedding provider
--model TEXT            Model name
--text-field TEXT       Column/field to use (auto-detected if not set)
--format TEXT           Input format: auto, jsonl, csv, hf
--output, -o TEXT       Output file path
--report TEXT           Path for JSON report
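
When calling the CLI from a Python script, it can help to assemble the argument list in one place. The sketch below uses only the flags shown with `truva dedupe` in the examples above (--threshold, --output, --report); the helper `build_dedupe_command` is hypothetical, and it assumes the `truva` executable is on your PATH.

```python
import shlex


def build_dedupe_command(input_path, *, threshold=None, output=None, report=None):
    """Build an argv list for `truva dedupe` from optional settings."""
    cmd = ["truva", "dedupe", input_path]
    if threshold is not None:
        # The documented range for the cosine similarity threshold is 0.0-1.0.
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be between 0.0 and 1.0")
        cmd += ["--threshold", str(threshold)]
    if output is not None:
        cmd += ["--output", output]
    if report is not None:
        cmd += ["--report", report]
    return cmd


cmd = build_dedupe_command(
    "./data.jsonl",
    threshold=0.9,
    output="./deduped.jsonl",
    report="./report.json",
)
print(shlex.join(cmd))
# truva dedupe ./data.jsonl --threshold 0.9 --output ./deduped.jsonl --report ./report.json
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.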

Requirements

Python 3.10+
Works on macOS (Apple Silicon) and Linux

License

TBD

Project details

Release history

0.2.0

Apr 11, 2026

0.1.3

Apr 5, 2026

0.1.2

Apr 5, 2026

0.1.1

Apr 5, 2026

This version

0.1.0

Apr 5, 2026

Download files

Source Distribution

truva-0.1.0.tar.gz (14.4 kB)

Uploaded Apr 5, 2026 Source

Built Distribution

truva-0.1.0-py3-none-any.whl (16.8 kB)

Uploaded Apr 5, 2026 Python 3

File details

Details for the file truva-0.1.0.tar.gz.

File metadata

Download URL: truva-0.1.0.tar.gz
Upload date: Apr 5, 2026
Size: 14.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for truva-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`eb6cfecd8e29dd366259d1e0043f471401e61f91a0f0a9700c73144fdb2d2c18`
MD5	`0fc9cb901345e4e339eb8ee18cc1f8a7`
BLAKE2b-256	`30d8a7a0131882b0ae87411763c4ce2df67d70d0d39fb412e93faaea967ef2dc`

File details

Details for the file truva-0.1.0-py3-none-any.whl.

File metadata

Download URL: truva-0.1.0-py3-none-any.whl
Upload date: Apr 5, 2026
Size: 16.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for truva-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`baf887513fb849002c3eb87cb02a2ee7a06d9f9713af4819e4d74de5e274d019`
MD5	`badd175b92302a791ed4607537f98c51`
BLAKE2b-256	`1b1f13db30df24801fa7c20897063a110f4f5ad516ec1b577ad4c8bf486df661`
