Skip to main content

Lightweight data quality toolkit for LLM instruction tuning. Deduplication, PII detection, contamination checking, and quality scoring — no GPU required.

Project description

datacruxai

PyPI Downloads CI Python 3.9+ License: Apache 2.0

Data quality toolkit for LLM instruction tuning.

Clean your training data before fine-tuning. No GPU needed.

datacruxai quality report

from datacruxai import load_dataset, exact_dedup, scan_examples, score_dataset

# Load any instruction-tuning dataset
examples = load_dataset("training_data.jsonl")

# Remove exact duplicates
result = exact_dedup(examples)
print(f"Removed {result.n_duplicates} duplicates")

# Scan for PII
pii_results = scan_examples(result.originals)
print(f"Found PII in {len(pii_results)} examples")

# Score quality
scores = score_dataset(result.originals, min_score=0.5)
print(f"{len(scores)} low-quality examples flagged")

Why datacruxai?

If you're fine-tuning an LLM, your training data quality matters more than quantity. Garbage in, garbage out — except now garbage costs you GPU hours and makes your model worse.

Existing options are either overkill (NeMo Curator needs NVIDIA GPUs and processes terabytes of web crawl data) or too narrow (scattered scripts in random repos). datacruxai fills the gap: a single pip install that gives you everything needed to validate and clean instruction-tuning datasets on a laptop.

What it does:

  • Deduplication — exact (hash-based) and near-duplicate (MinHash + LSH) detection
  • PII detection — regex-based scanning for emails, phones, SSNs, credit cards, IPs
  • PII redaction — replace detected PII with placeholders
  • Benchmark contamination — n-gram overlap checking against MMLU, GSM8K, HellaSwag, ARC, TruthfulQA, WinoGrande
  • Quality scoring — heuristic checks for instruction quality, response completeness, repetition, formatting
  • Format support — Alpaca, ShareGPT, OpenAI chat format; JSONL, JSON, Parquet
  • Dataset statistics — length distributions, token estimates, field coverage

Everything runs on CPU. Everything is deterministic. No API keys, no signups, no cloud dependencies.

Install

pip install datacruxai

With fuzzy deduplication (MinHash + LSH):

pip install datacruxai[fuzzy]

With Parquet support:

pip install datacruxai[formats]

Everything:

pip install datacruxai[all]

Usage

Load data

datacruxai auto-detects Alpaca, ShareGPT, and OpenAI chat formats.

from datacruxai import load_dataset, detect_format

examples = load_dataset("training_data.jsonl")  # JSONL, JSON, or Parquet
print(f"Loaded {len(examples)} examples")
print(f"Format: {detect_format(examples[0].raw)}")

Deduplicate

from datacruxai import exact_dedup, fuzzy_dedup, dedup

# Fast exact dedup (hash-based)
result = exact_dedup(examples)
print(f"{result.n_total}{result.n_unique} (removed {result.n_duplicates})")

# Near-duplicate detection (requires datacruxai[fuzzy])
result = fuzzy_dedup(examples, threshold=0.8)

# Combined: exact first, then fuzzy on the remainder
result = dedup(examples, exact=True, fuzzy=True, fuzzy_threshold=0.8)
clean_examples = result.originals

Detect PII

from datacruxai import scan_text, scan_examples, redact_text, redact_examples

# Scan a single string
entities = scan_text("Email me at john@example.com or call 555-0123")
for e in entities:
    print(f"  {e.kind}: '{e.text}' at [{e.start}:{e.end}]")

# Scan an entire dataset
pii_results = scan_examples(examples)
for r in pii_results:
    print(f"  Example {r.example_index}: {[e.kind for e in r.entities]}")

# Redact PII
safe = redact_text("SSN: 123-45-6789")
# "SSN: [SSN]"

safe_examples = redact_examples(examples)

Check benchmark contamination

from datacruxai import check_contamination, list_benchmarks

# Built-in benchmarks
print(list_benchmarks())
# ['arc', 'gsm8k', 'hellaswag', 'mmlu', 'truthfulqa', 'winogrande']

report = check_contamination(examples, ngram_size=8)
print(f"Flagged {report.total_flagged} / {report.total_checked} examples")
for bench, count in report.by_benchmark.items():
    print(f"  {bench}: {count} matches")

# Custom benchmark dataset
custom = {"my_eval": ["question one text", "question two text"]}
report = check_contamination(examples, benchmarks=custom, ngram_size=5)

Score quality

from datacruxai import score_example, score_dataset, filter_by_quality

# Score a single example
score = score_example(examples[0])
print(f"Overall: {score.overall:.2f}")
print(f"Details: {score.details}")
print(f"Flags: {score.flags}")

# Flag low-quality examples
low_quality = score_dataset(examples, min_score=0.5)
for s in low_quality:
    print(f"  [{s.example_index}] {s.overall:.2f}{s.flags}")

# Filter and keep only good examples
clean = filter_by_quality(examples, min_score=0.5)
print(f"Kept {len(clean)} / {len(examples)}")

Dataset statistics

from datacruxai import compute_stats, length_distribution

stats = compute_stats(examples)
print(f"Examples: {stats['n_examples']}")
print(f"Token estimate: ~{stats['token_estimate']:,}")
print(f"Empty outputs: {stats['empty_outputs']}")
print(f"Avg instruction length: {stats['instruction_lengths']['mean']:.0f} chars")

# Length histogram
hist = length_distribution(examples, field="output", bins=10)
for bucket in hist:
    print(f"  {bucket['range']}: {'█' * bucket['count']}")

Save results

from datacruxai import save_jsonl

save_jsonl(clean_examples, "cleaned_training_data.jsonl")

CLI

# Dataset statistics
datacruxai stats training_data.jsonl

# Deduplicate
datacruxai dedup training_data.jsonl -o deduped.jsonl

# Fuzzy dedup
datacruxai dedup training_data.jsonl --fuzzy -t 0.8 -o deduped.jsonl

# PII scan
datacruxai pii training_data.jsonl

# PII redact and save
datacruxai pii training_data.jsonl -o redacted.jsonl

# Contamination check
datacruxai contamination training_data.jsonl -n 8

# Quality scoring
datacruxai quality training_data.jsonl -t 0.5 -o filtered.jsonl

Quality Checks

The quality scorer applies these deterministic heuristics:

Check Weight What it catches
Instruction quality 25% Empty, trivial, or all-caps instructions
Response completeness 30% Empty, trivial, or refusal-only responses
Length 15% Extremely short or long examples
Repetition 20% Repeated words, repeated n-grams
Language 10% Excessive special characters, whitespace

Scores range from 0.0 (terrible) to 1.0 (clean). The default threshold of 0.5 catches the obvious problems without being overly aggressive.

Supported Formats

Format Auto-detected Key fields
Alpaca instruction, input, output Standard fine-tuning format
ShareGPT conversations Multi-turn with from/value
OpenAI chat messages role/content pairs

File types: .jsonl, .json, .parquet (with datacruxai[formats])

Performance

Everything is single-threaded and CPU-only by design. On a typical laptop:

  • Exact dedup: ~100k examples/sec
  • PII scan: ~50k examples/sec
  • Quality scoring: ~80k examples/sec
  • Contamination check: depends on n-gram size, ~10k examples/sec for n=8

For datasets under 1M examples, everything runs in seconds to minutes. If you're working with larger datasets, consider NeMo Curator (but you'll need GPUs).

Contributing

PRs welcome — especially:

  • Additional PII patterns (non-US phone formats, EU identifiers)
  • More benchmark fingerprints
  • New quality heuristics
  • Performance improvements
git clone https://github.com/zbhatti/datacruxai.git
cd datacruxai
pip install -e ".[all]"
pip install pytest ruff
pytest

See Also

Part of the stef41 LLM toolkit — open-source tools for every stage of the LLM lifecycle:

Project What it does
tokonomics Token counting & cost management for LLM APIs
castwright Synthetic instruction data generation
datamix Dataset mixing & curriculum optimization
toksight Tokenizer analysis & comparison
trainpulse Training health monitoring
ckpt Checkpoint inspection, diffing & merging
quantbench Quantization quality analysis
infermark Inference benchmarking
modeldiff Behavioral regression testing
vibesafe AI-generated code safety scanner
injectionguard Prompt injection detection

License

Apache 2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datacruxai-0.4.0.tar.gz (60.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

datacruxai-0.4.0-py3-none-any.whl (42.4 kB view details)

Uploaded Python 3

File details

Details for the file datacruxai-0.4.0.tar.gz.

File metadata

  • Download URL: datacruxai-0.4.0.tar.gz
  • Upload date:
  • Size: 60.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for datacruxai-0.4.0.tar.gz
Algorithm Hash digest
SHA256 281563e2d3145bd1573497fd68545eba2b51ec280d2fdb807804637229fdb963
MD5 310d02b9087c99b2fa69b360c98f6190
BLAKE2b-256 7b01a146ed66aaa44ea77c2bb259e6e1c00ce69dc51b3b31a9782e405a827d11

See more details on using hashes here.

File details

Details for the file datacruxai-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: datacruxai-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 42.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for datacruxai-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e05f2079e80637c7b08a1684fb9224f808718879d84667aebe7660503153df88
MD5 aebc82cae1c079381bd0dee43d1ff38d
BLAKE2b-256 1d7454b3fd2f60e0b44a9a888f1be53ea5c1b841f5aff34055e18c581fb83af6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page