cleanllm

Streaming JSONL cleaner for LLM fine-tuning datasets. Minimal dependencies, memory-safe, and fast — processes files line-by-line without loading them into memory.


What it does

cleanllm gives you a pipeline for cleaning, validating, and profiling JSONL datasets before fine-tuning:

raw.jsonl → scan → fix → dedup → validate → stats → audit bundle → shards

Every step is streaming (no full-file load), resumable, and produces machine-readable JSON reports for CI gating.


Install

pip install cleanllm

Or from source:

git clone https://github.com/verma8076/cleanllm
cd cleanllm
pip install -e .

Quickstart

# Scan for issues
cleanllm scan data.jsonl

# Fix: remove URLs, normalize whitespace, redact forbidden patterns
cleanllm fix data.jsonl -o data.cleaned.jsonl

# Deduplicate by prompt content
cleanllm dedup data.cleaned.jsonl -o data.dedup.jsonl --by prompt

# Profile the cleaned dataset
cleanllm stats data.dedup.jsonl --report-json stats.json

# Gate in CI: fail if invalid rows increased
cleanllm gate --compare compare.json --rules gate_rules.json

CLI reference

scan

Streaming scan for issues — invalid JSON, missing keys, URLs, forbidden patterns, language distribution, duplicate estimate.

cleanllm scan data.jsonl
cleanllm scan data.jsonl --report-json scan_report.json --dup-estimate
cleanllm scan data.jsonl --preset cp_portable

fix

Remove URLs, normalize whitespace, redact or drop rows with forbidden patterns.

cleanllm fix data.jsonl -o cleaned.jsonl
cleanllm fix data.jsonl -o cleaned.jsonl --drop-on forbidden_pattern --drop-on invalid_json
cleanllm fix data.jsonl -o cleaned.jsonl --preset cpp17_clean --report-json fix_report.json

Drop rules: invalid_json, missing_required_keys, forbidden_pattern, empty_assistant, placeholder, repetitive_response, bad_conversation.

Note on empty_assistant: By default this drops assistant responses shorter than 20 characters — calibrated for code datasets where very short responses are almost always errors. For text/chat datasets, set --min-assistant-chars 1 to only drop truly blank responses.
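
To make the threshold concrete, here is a minimal sketch of a length-based check. It is not cleanllm's implementation: the helper name has_empty_assistant is hypothetical, and stripping whitespace before measuring and checking every assistant turn are both assumptions.

```python
def has_empty_assistant(row: dict, min_chars: int = 20) -> bool:
    """Return True if any assistant message is shorter than min_chars.

    Assumption: length is measured after stripping surrounding whitespace.
    """
    for msg in row.get("messages", []):
        if msg.get("role") != "assistant":
            continue
        content = str(msg.get("content", "")).strip()
        if len(content) < min_chars:
            return True
    return False
```

With the default of 20, a reply such as "ok" is flagged; with min_chars=1 only blank replies are.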

validate

Schema validation, line by line. Exit code 0 only if all rows pass.

cleanllm validate data.jsonl --schema basic_sft
cleanllm validate data.jsonl --schema cp_sft_v1

| Schema | Required fields |
| --- | --- |
| basic_sft | id, messages (list of role/content dicts) |
| cp_sft_v1 | id, source, problem_id, messages, tests (non-empty, with input/output) |
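
As a rough illustration of what a line-by-line basic_sft check involves, the sketch below validates one JSONL line against the required fields listed above. The function name check_basic_sft and the problem strings are hypothetical; cleanllm's own validator may report errors differently.

```python
import json


def check_basic_sft(line: str) -> list[str]:
    """Return a list of problems for one JSONL line; an empty list means it passes."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(row, dict):
        return ["not_an_object"]

    problems = []
    if "id" not in row:
        problems.append("missing id")

    messages = row.get("messages")
    if not isinstance(messages, list):
        problems.append("messages is not a list")
    else:
        # Each message must be a dict carrying both role and content.
        for i, msg in enumerate(messages):
            if not (isinstance(msg, dict) and "role" in msg and "content" in msg):
                problems.append(f"messages[{i}] lacks role/content")
    return problems
```

A file passes when every line returns an empty list, which mirrors the "exit code 0 only if all rows pass" behaviour.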

dedup

First-occurrence deduplication — by full record, prompt (system+user), or code (assistant).

cleanllm dedup data.jsonl -o deduped.jsonl --by record
cleanllm dedup data.jsonl -o deduped.jsonl --by prompt --normalized
cleanllm dedup data.jsonl -o deduped.jsonl --by code --report-json dedup_report.json
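
First-occurrence deduplication by prompt can be sketched as a single streaming pass that remembers a hash of each prompt it has seen. This is an illustration only: prompt_key and dedup_lines are hypothetical names, and the normalization shown (lowercasing and collapsing whitespace) is an assumption about what --normalized might do.

```python
import hashlib
import json
from typing import Iterable, Iterator


def prompt_key(row: dict, normalized: bool = False) -> str:
    """Hash the system and user message contents of one record."""
    parts = [
        str(m.get("content", ""))
        for m in row.get("messages", [])
        if m.get("role") in ("system", "user")
    ]
    text = "\n".join(parts)
    if normalized:
        # Assumed normalization: case-fold and collapse runs of whitespace.
        text = " ".join(text.lower().split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedup_lines(lines: Iterable[str], normalized: bool = False) -> Iterator[str]:
    """Yield only the first line seen for each distinct prompt key."""
    seen = set()
    for line in lines:
        key = prompt_key(json.loads(line), normalized)
        if key not in seen:
            seen.add(key)
            yield line
```

Only the set of digests is held in memory, so the cost grows with the number of distinct prompts rather than with file size.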

stats

Single-pass profiler: distributions, structural stats, schema counts, response lengths, language distribution.

cleanllm stats data.jsonl
cleanllm stats data.jsonl --schema cp_sft_v1 --keys source,difficulty_bucket --top-k 20
cleanllm stats data.jsonl --report-json stats.json

compare

Diff two stats reports to catch regressions between dataset versions.

cleanllm compare old_stats.json new_stats.json
cleanllm compare old_stats.json new_stats.json --report-json compare.json
cleanllm compare old.jsonl new.jsonl --from-jsonl --schema cp_sft_v1
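
Conceptually, a comparison of two reports reduces to a per-metric difference. The sketch below assumes each report exposes a flat mapping of counter names to integers; the old key and the diff_counts name are hypothetical, while new and delta follow the metric paths used in the gate rules.

```python
def diff_counts(old: dict, new: dict) -> dict:
    """Build {metric: {"old", "new", "delta"}} for every counter in either report."""
    result = {}
    for key in sorted(set(old) | set(new)):
        old_value = old.get(key, 0)  # a counter missing from one side counts as 0
        new_value = new.get(key, 0)
        result[key] = {"old": old_value, "new": new_value, "delta": new_value - old_value}
    return result
```

A positive delta on a counter such as invalid_json_rows is the kind of regression a gate rule can then reject.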

gate

CI-friendly quality gating. Nonzero exit on failures.

cleanllm gate --stats stats.json --rules gate_rules.json
cleanllm gate --compare compare.json --rules gate_rules.json --strict
cleanllm gate --compare compare.json --inline-rule "counts_diff.invalid_json_rows.delta<=0"

Gate rules JSON:

{
  "version": 1,
  "mode": "compare",
  "rules": [
    {"name": "no_new_invalid", "metric": "counts_diff.invalid_json_rows.delta", "op": "<=", "value": 0},
    {"name": "enough_valid",   "metric": "counts_diff.valid_json_rows.new",    "op": ">=", "value": 1000}
  ]
}

Supported ops: ==, !=, <, <=, >, >=. Severities: error (default), warn.
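
To show how such rules might be evaluated, the sketch below resolves each dotted metric path in a report and applies the comparison operator. It assumes the report is nested JSON whose keys match the dotted path and that rules carry an optional severity field; lookup and evaluate are hypothetical names, not part of cleanllm's API.

```python
import operator

# The six comparison operators listed above.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def lookup(report: dict, dotted: str):
    """Follow a dotted path such as 'counts_diff.invalid_json_rows.delta'."""
    node = report
    for part in dotted.split("."):
        node = node[part]
    return node


def evaluate(report: dict, rules: dict) -> list[tuple]:
    """Return (rule name, severity, actual value) for every rule that fails."""
    failures = []
    for rule in rules["rules"]:
        actual = lookup(report, rule["metric"])
        if not OPS[rule["op"]](actual, rule["value"]):
            failures.append((rule["name"], rule.get("severity", "error"), actual))
    return failures
```

A CI wrapper could exit nonzero whenever any returned failure has severity error, and print warn failures without failing the build.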

run

Execute a JSON-defined multi-step pipeline with variable substitution.

cleanllm run --config pipeline.json
cleanllm run --config pipeline.json --set input_path=data.jsonl --set outdir=out/v2
cleanllm run --config pipeline.json --dry-run

Supported step types: fix, validate, dedup, stats, audit, sample, shard, manifest, scan, compare.

sample

Reservoir sampling — random or stratified, deterministic with --seed.

cleanllm sample data.jsonl -o sample.jsonl -n 500 --seed 42
cleanllm sample data.jsonl -o sample.jsonl -n 500 --stratify source,difficulty_bucket
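
For readers unfamiliar with reservoir sampling, the following is a minimal sketch of the classic single-pass algorithm (Algorithm R) with a seeded generator. It illustrates the technique in general; the function name reservoir_sample is hypothetical, and cleanllm's own implementation, especially the stratified mode, may differ.

```python
import random
from typing import Iterable


def reservoir_sample(lines: Iterable[str], n: int, seed: int) -> list[str]:
    """Pick up to n lines uniformly at random in one pass over the input."""
    rng = random.Random(seed)  # a private, seeded generator keeps runs reproducible
    reservoir: list[str] = []
    for i, line in enumerate(lines):
        if i < n:
            # Fill the reservoir with the first n lines.
            reservoir.append(line)
        else:
            # Replace a random slot with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = line
    return reservoir
```

Memory use is bounded by n regardless of input size, which is why the technique suits streaming over large JSONL files.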

audit

Build a reproducible audit bundle in one command: sampled JSONL + CSV review index (with original line numbers) + summary + manifest.

cleanllm audit data.jsonl --outdir audit_bundle -n 200 --seed 42
cleanllm audit data.jsonl --outdir audit_bundle -n 200 --stratify source --schema cp_sft_v1

Bundle contents: audit_sample.jsonl, audit_index.csv, audit_summary.json, AUDIT_README.md, manifest.json.

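shard / manifest

Split a JSONL file into fixed-size shards (optionally gzip-compressed), then write a manifest for the shard directory.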

cleanllm shard data.jsonl --outdir shards --size 5000 --gzip
cleanllm manifest shards -o manifest.json
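
The idea behind sharding plus a manifest can be sketched as follows. Everything here is an assumption made for illustration: the shard file naming, the manifest fields (file, bytes, sha256), and the helper names shard_lines and manifest_for are hypothetical and not taken from cleanllm.

```python
import gzip
import hashlib
from itertools import islice
from pathlib import Path
from typing import Iterable


def shard_lines(lines: Iterable[str], outdir: str, size: int) -> list[Path]:
    """Write gzip-compressed shards of at most `size` lines each."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    it = iter(lines)
    paths: list[Path] = []
    while True:
        chunk = list(islice(it, size))  # only one shard is held in memory at a time
        if not chunk:
            break
        path = out / f"shard-{len(paths):05d}.jsonl.gz"  # hypothetical naming scheme
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            for line in chunk:
                fh.write(line.rstrip("\n") + "\n")
        paths.append(path)
    return paths


def manifest_for(paths: Iterable[Path]) -> dict:
    """Record name, size, and SHA-256 of each shard so copies can be verified."""
    entries = []
    for path in paths:
        data = Path(path).read_bytes()
        entries.append({
            "file": Path(path).name,
            "bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return {"shards": entries}
```

Recording a checksum per shard lets a downstream job confirm that the files it received match the ones that were produced.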

convert

Convert a JSONL file between sharegpt, alpaca, and chatml formats.

cleanllm convert data.jsonl -o converted.jsonl --from sharegpt --to chatml
cleanllm convert data.jsonl -o converted.jsonl --from alpaca --to sharegpt

Supported formats: sharegpt (conversations list), alpaca (instruction/output), chatml (messages list).
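
As an illustration of the record shapes involved, the sketch below converts one sharegpt-style record into a messages list. It assumes the commonly used ShareGPT convention of from/value turns with human, gpt, and system speakers; the role mapping and the function name sharegpt_to_chatml are assumptions, not cleanllm's documented behaviour.

```python
# Assumed mapping from ShareGPT speaker labels to chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def sharegpt_to_chatml(row: dict) -> dict:
    """Turn {"conversations": [{"from", "value"}, ...]} into {"messages": [{"role", "content"}, ...]}."""
    messages = [
        {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in row["conversations"]
    ]
    # Keep every other field (id, source, ...) unchanged.
    converted = {key: value for key, value in row.items() if key != "conversations"}
    converted["messages"] = messages
    return converted
```

Unknown speaker labels are passed through unchanged here so that nothing is silently dropped.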

merge

Merge multiple JSONL files into one, with optional deduplication.

cleanllm merge a.jsonl b.jsonl c.jsonl -o merged.jsonl
cleanllm merge a.jsonl b.jsonl -o merged.jsonl --dedup

split

Split a JSONL file into train and val sets.

cleanllm split data.jsonl --outdir splits/
cleanllm split data.jsonl --outdir splits/ --ratio 0.95 --seed 42 --no-shuffle

Outputs <basename>_train.jsonl and <basename>_val.jsonl in the output directory. Default ratio is 0.9 (90% train).
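
A seeded train/val split can be sketched in a few lines. This version loads all lines into memory for simplicity, which is an assumption of the sketch and not a description of how cleanllm splits files; split_lines is a hypothetical helper.

```python
import random
from typing import Iterable


def split_lines(
    lines: Iterable[str], ratio: float, seed: int, shuffle: bool = True
) -> tuple[list[str], list[str]]:
    """Return (train, val), where train holds the first `ratio` share of rows."""
    rows = list(lines)
    if shuffle:
        # A seeded generator makes the shuffle, and so the split, reproducible.
        random.Random(seed).shuffle(rows)
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]
```

With shuffle=False the split preserves input order, which corresponds to what a --no-shuffle style option would be expected to do.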

recipes

Bootstrap pipelines and gate rules from built-in templates.

cleanllm recipes list
cleanllm recipes show cp_pipeline_cp_portable
cleanllm recipes write cp_bundle --outdir bootstrap/

Built-in recipes: cp_pipeline_basic, cp_pipeline_cp_portable, cp_pipeline_fast_audit, gate_stats_basic, gate_compare_basic, gate_compare_strict, cp_bundle.


Python API

from cleanllm import (
    scan_jsonl, fix_jsonl, FixRules,
    dedup_jsonl, validate_jsonl, stats_jsonl,
    sample_jsonl, audit_bundle,
    shard_jsonl, make_manifest,
    download_from_hub, detect_hf_schema,
)
from cleanllm.convert import convert_jsonl
from cleanllm.merge import merge_jsonl
from cleanllm.split import split_jsonl

# Scan
report = scan_jsonl("data.jsonl")

# Fix (code dataset)
rules = FixRules(
    drop_on={"forbidden_pattern", "empty_assistant"},
    max_tokens=4096,
    keep_language="python",
)
summary = fix_jsonl("data.jsonl", "cleaned.jsonl", rules)

# Fix (text/chat dataset — only drop truly blank responses)
rules = FixRules(drop_on={"empty_assistant"}, min_assistant_chars=1, forbidden_patterns=[])

# Dedup
result = dedup_jsonl("cleaned.jsonl", "deduped.jsonl", by="prompt", normalized=True)

# Stats
stats = stats_jsonl("deduped.jsonl", schema="cp_sft_v1", keys=["source", "difficulty_bucket"])

# Sample + audit
sample_jsonl("deduped.jsonl", "sample.jsonl", num_rows=200, seed=42)
audit_bundle("deduped.jsonl", "audit_bundle", num_rows=200, seed=42, stratify=["source"])

# Shard + manifest
shard_jsonl("deduped.jsonl", "shards", shard_size=5000, gzip_output=True)
make_manifest("shards", "manifest.json")

# Convert between formats
convert_jsonl("data.jsonl", "out.jsonl", from_fmt="sharegpt", to_fmt="chatml")

# Merge + split
merge_jsonl(["a.jsonl", "b.jsonl"], "merged.jsonl", dedup=True)
split_jsonl("merged.jsonl", "splits/", ratio=0.9, seed=42)

# Download from HuggingFace Hub (requires pip install cleanllm[hf])
result = download_from_hub("HuggingFaceH4/ultrachat_200k", "data.jsonl", split="train_sft")

Presets

| Preset | Description |
| --- | --- |
| general | URL removal + whitespace normalization, no domain-specific forbidden patterns |
| security_scan | Redacts secrets: AWS keys, GitHub tokens, API keys, private keys |
| pii_scan | Redacts PII: emails, US phone numbers, SSNs, credit cards, IPv4 addresses |
| cpp17_clean | URL removal + whitespace normalization + redact C++ portability issues |
| cp_portable | Strict CP portability: drops rows with forbidden patterns |
| deterministic_only | Drops rows with non-deterministic APIs (rand(), random_device, etc.) |

Defaults

  • Required keys: id, messages
  • Forbidden patterns (default): none — use --preset cpp17_clean or --preset cp_portable for CP datasets
  • empty_assistant threshold: 20 characters (responses shorter than this are flagged as empty)

CP datasets: To apply competitive-programming forbidden patterns (freopen, ifstream, bits/extc++.h, etc.) use a preset: cleanllm fix data.jsonl -o out.jsonl --preset cp_portable. In Python, pass forbidden_patterns=list(DEFAULT_FORBIDDEN_PATTERNS) explicitly.


Data format

cleanllm expects JSONL where each line is a JSON object. By default, only id and messages are required (the basic_sft schema):

{
  "id": "unique-id",
  "messages": [
    {"role": "system",    "content": "..."},
    {"role": "user",      "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}

Optional fields: source, difficulty_bucket, problem_id, tests. The cp_sft_v1 schema makes source, problem_id, and tests required as well.


Development

pip install -e .[dev]
pytest
python -m build
twine check dist/*

See RELEASE_CHECKLIST.md for the full release workflow.


License

MIT
