cleanllm
Streaming JSONL cleaner for LLM fine-tuning datasets. Minimal dependencies, memory-safe, and fast — processes files line-by-line without loading them into memory.
What it does
cleanllm gives you a pipeline for cleaning, validating, and profiling JSONL datasets before fine-tuning:
raw.jsonl → scan → fix → dedup → validate → stats → audit bundle → shards
Every step is streaming (no full-file load), resumable, and produces machine-readable JSON reports for CI gating.
Install
pip install cleanllm
Or from source:
git clone https://github.com/verma8076/cleanllm
cd cleanllm
pip install -e .
Quickstart
# Scan for issues
cleanllm scan data.jsonl
# Fix: remove URLs, normalize whitespace, redact forbidden patterns
cleanllm fix data.jsonl -o data.cleaned.jsonl
# Deduplicate by prompt content
cleanllm dedup data.cleaned.jsonl -o data.dedup.jsonl --by prompt
# Profile the cleaned dataset
cleanllm stats data.dedup.jsonl --report-json stats.json
# Gate in CI: fail if invalid rows increased (compare.json is a report written by cleanllm compare --report-json)
cleanllm gate --compare compare.json --rules gate_rules.json
CLI reference
scan
Streaming scan for issues — invalid JSON, missing keys, URLs, forbidden patterns, language distribution, duplicate estimate.
cleanllm scan data.jsonl
cleanllm scan data.jsonl --report-json scan_report.json --dup-estimate
cleanllm scan data.jsonl --preset cp_portable
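To give a feel for what a single-pass scan involves, here is a minimal sketch that counts invalid JSON lines and rows missing the default required keys. It is not cleanllm's implementation: `scan_lines` and the report layout are hypothetical, and only the `invalid_json_rows` name is borrowed from the gate examples in this README.

```python
import json
from typing import Iterable

REQUIRED_KEYS = ("id", "messages")  # the default required keys listed in this README


def scan_lines(lines: Iterable[str]) -> dict:
    """Count basic issues in one pass; rows are never accumulated in memory."""
    report = {"total_rows": 0, "invalid_json_rows": 0, "missing_required_keys": 0}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        report["total_rows"] += 1
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            report["invalid_json_rows"] += 1
            continue
        if not isinstance(row, dict) or any(key not in row for key in REQUIRED_KEYS):
            report["missing_required_keys"] += 1
    return report


sample = [
    '{"id": "a", "messages": []}',
    '{"id": "b"}',
    "not json",
]
print(scan_lines(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly and is consumed one line at a time.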
fix
Remove URLs, normalize whitespace, redact or drop rows with forbidden patterns.
cleanllm fix data.jsonl -o cleaned.jsonl
cleanllm fix data.jsonl -o cleaned.jsonl --drop-on forbidden_pattern --drop-on invalid_json
cleanllm fix data.jsonl -o cleaned.jsonl --preset cpp17_clean --report-json fix_report.json
Drop rules: invalid_json, missing_required_keys, forbidden_pattern, empty_assistant, placeholder, repetitive_response, bad_conversation.
Note on empty_assistant: By default this drops assistant responses shorter than 20 characters, a threshold calibrated for code datasets where very short responses are almost always errors. For text/chat datasets, set --min-assistant-chars 1 to drop only truly blank responses.
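The effect of the threshold can be sketched as below. This is an illustration only: `has_empty_assistant` is a hypothetical helper, and stripping surrounding whitespace before measuring length is an assumption, not documented cleanllm behavior.

```python
def has_empty_assistant(row: dict, min_chars: int = 20) -> bool:
    """Return True if any assistant message is shorter than min_chars.

    Assumption: whitespace at either end is not counted toward the length.
    """
    for msg in row.get("messages", []):
        if msg.get("role") == "assistant":
            content = msg.get("content") or ""
            if len(content.strip()) < min_chars:
                return True
    return False


row = {
    "id": "1",
    "messages": [
        {"role": "user", "content": "Say hi"},
        {"role": "assistant", "content": "Hi!"},
    ],
}
print(has_empty_assistant(row))               # short reply is flagged at the default threshold
print(has_empty_assistant(row, min_chars=1))  # only blank replies would be flagged
```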
validate
Schema validation, line by line. Exit code 0 only if all rows pass.
cleanllm validate data.jsonl --schema basic_sft
cleanllm validate data.jsonl --schema cp_sft_v1
| Schema | Required fields |
|---|---|
| basic_sft | id, messages (list of role/content dicts) |
| cp_sft_v1 | id, source, problem_id, messages, tests (non-empty, with input/output) |
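A check for the basic_sft row in the table might look roughly like this. The function name and error strings are hypothetical, and the real validator may enforce more than is shown here.

```python
def validate_basic_sft(row) -> list[str]:
    """Return a list of problems; an empty list means the row passes this sketch."""
    if not isinstance(row, dict):
        return ["row is not a JSON object"]
    errors = []
    if "id" not in row:
        errors.append("missing id")
    messages = row.get("messages")
    if not isinstance(messages, list):
        errors.append("messages is not a list")
    else:
        for index, msg in enumerate(messages):
            if not (isinstance(msg, dict) and "role" in msg and "content" in msg):
                errors.append(f"messages[{index}] lacks role/content")
    return errors


good = {"id": "x", "messages": [{"role": "user", "content": "hi"}]}
bad = {"messages": "oops"}
print(validate_basic_sft(good))
print(validate_basic_sft(bad))
```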
dedup
First-occurrence deduplication — by full record, prompt (system+user), or code (assistant).
cleanllm dedup data.jsonl -o deduped.jsonl --by record
cleanllm dedup data.jsonl -o deduped.jsonl --by prompt --normalized
cleanllm dedup data.jsonl -o deduped.jsonl --by code --report-json dedup_report.json
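First-occurrence deduplication by prompt can be sketched as follows: hash the system and user contents of each record and keep a row only the first time its hash appears. The helper names are hypothetical, and the normalization shown (lowercasing and collapsing whitespace) is an assumption about what --normalized might do, not a description of cleanllm's behavior.

```python
import hashlib
import json
from typing import Iterable, Iterator


def prompt_key(row: dict, normalized: bool = False) -> str:
    """Hash the system and user message contents of one record."""
    parts = [
        str(msg.get("content", ""))
        for msg in row.get("messages", [])
        if isinstance(msg, dict) and msg.get("role") in ("system", "user")
    ]
    text = "\n".join(parts)
    if normalized:
        # Assumed normalization: lowercase and collapse whitespace runs.
        text = " ".join(text.lower().split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedup_lines(lines: Iterable[str], normalized: bool = False) -> Iterator[str]:
    """Yield each line whose prompt hash has not been seen before."""
    seen = set()
    for line in lines:
        key = prompt_key(json.loads(line), normalized)
        if key not in seen:
            seen.add(key)
            yield line


rows = [
    '{"id": "1", "messages": [{"role": "user", "content": "Hello  World"}]}',
    '{"id": "2", "messages": [{"role": "user", "content": "hello world"}]}',
    '{"id": "3", "messages": [{"role": "user", "content": "Hello  World"}]}',
]
print(len(list(dedup_lines(rows))))
print(len(list(dedup_lines(rows, normalized=True))))
```

Only the hashes are held in memory, so the footprint grows with the number of distinct prompts rather than with the file size. The sketch assumes every input line is valid JSON, as it would be after a fix pass.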
stats
Single-pass profiler: distributions, structural stats, schema counts, response lengths, language distribution.
cleanllm stats data.jsonl
cleanllm stats data.jsonl --schema cp_sft_v1 --keys source,difficulty_bucket --top-k 20
cleanllm stats data.jsonl --report-json stats.json
compare
Diff two stats reports to catch regressions between dataset versions.
cleanllm compare old_stats.json new_stats.json
cleanllm compare old_stats.json new_stats.json --report-json compare.json
cleanllm compare old.jsonl new.jsonl --from-jsonl --schema cp_sft_v1
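The comparison idea can be illustrated with a small diff over two count dictionaries. The `counts_diff`, `new`, and `delta` names mirror the metric paths used in the gate examples in this README; the `old` field, the input layout, and the `diff_counts` helper are assumptions made for the sketch.

```python
def diff_counts(old_counts: dict, new_counts: dict) -> dict:
    """Build a per-metric old/new/delta table from two count dictionaries."""
    diff = {}
    for name in sorted(set(old_counts) | set(new_counts)):
        old = old_counts.get(name, 0)
        new = new_counts.get(name, 0)
        diff[name] = {"old": old, "new": new, "delta": new - old}
    return {"counts_diff": diff}


old = {"valid_json_rows": 990, "invalid_json_rows": 10}
new = {"valid_json_rows": 1200, "invalid_json_rows": 4}
report = diff_counts(old, new)
print(report["counts_diff"]["invalid_json_rows"])
```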
gate
CI-friendly quality gating. Nonzero exit on failures.
cleanllm gate --stats stats.json --rules gate_rules.json
cleanllm gate --compare compare.json --rules gate_rules.json --strict
cleanllm gate --compare compare.json --inline-rule "counts_diff.invalid_json_rows.delta<=0"
Gate rules JSON:
{
  "version": 1,
  "mode": "compare",
  "rules": [
    {"name": "no_new_invalid", "metric": "counts_diff.invalid_json_rows.delta", "op": "<=", "value": 0},
    {"name": "enough_valid", "metric": "counts_diff.valid_json_rows.new", "op": ">=", "value": 1000}
  ]
}
Supported ops: ==, !=, <, <=, >, >=. Severities: error (default), warn.
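One way to read the rules format is as a dotted-path lookup followed by a comparison. The sketch below evaluates rules of that shape against a report dictionary; `evaluate_rules` and `lookup` are hypothetical helpers, the `severity` key name is an assumption, and cleanllm's own evaluator may differ.

```python
import operator

OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def lookup(report: dict, dotted_path: str):
    """Resolve a path such as 'counts_diff.invalid_json_rows.delta'."""
    value = report
    for part in dotted_path.split("."):
        value = value[part]
    return value


def evaluate_rules(report: dict, rules: list[dict]) -> list[str]:
    """Return the names of failed rules whose severity is 'error'."""
    failures = []
    for rule in rules:
        actual = lookup(report, rule["metric"])
        passed = OPS[rule["op"]](actual, rule["value"])
        if not passed and rule.get("severity", "error") == "error":
            failures.append(rule["name"])
    return failures


report = {
    "counts_diff": {
        "invalid_json_rows": {"delta": 2},
        "valid_json_rows": {"new": 1500},
    }
}
rules = [
    {"name": "no_new_invalid", "metric": "counts_diff.invalid_json_rows.delta", "op": "<=", "value": 0},
    {"name": "enough_valid", "metric": "counts_diff.valid_json_rows.new", "op": ">=", "value": 1000},
]
print(evaluate_rules(report, rules))
```

A CI wrapper around this would exit nonzero whenever the returned list is non-empty.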
run
Execute a JSON-defined multi-step pipeline with variable substitution.
cleanllm run --config pipeline.json
cleanllm run --config pipeline.json --set input_path=data.jsonl --set outdir=out/v2
cleanllm run --config pipeline.json --dry-run
Supported step types: fix, validate, dedup, stats, audit, sample, shard, manifest, scan, compare.
sample
Reservoir sampling — random or stratified, deterministic with --seed.
cleanllm sample data.jsonl -o sample.jsonl -n 500 --seed 42
cleanllm sample data.jsonl -o sample.jsonl -n 500 --stratify source,difficulty_bucket
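Reservoir sampling picks a uniform random sample of fixed size from a stream whose length is unknown in advance. Below is a minimal sketch of the classic Algorithm R with a seeded generator; `reservoir_sample` is a hypothetical name, and the stratified variant is not shown.

```python
import random
from typing import Iterable


def reservoir_sample(lines: Iterable[str], n: int, seed: int = 42) -> list[str]:
    """Uniformly sample up to n items from a stream in one pass (Algorithm R)."""
    rng = random.Random(seed)  # a private generator keeps the result reproducible
    reservoir = []
    for index, line in enumerate(lines):
        if index < n:
            reservoir.append(line)  # fill the reservoir with the first n items
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < n:
                reservoir[slot] = line  # replace with probability n / (index + 1)
    return reservoir


lines = [f"row-{i}" for i in range(100)]
first = reservoir_sample(lines, 5, seed=42)
second = reservoir_sample(lines, 5, seed=42)
print(first == second)
```

Memory use is bounded by the sample size, which is what makes the approach suitable for files that do not fit in RAM.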
audit
Build a reproducible audit bundle in one command: sampled JSONL + CSV review index (with original line numbers) + summary + manifest.
cleanllm audit data.jsonl --outdir audit_bundle -n 200 --seed 42
cleanllm audit data.jsonl --outdir audit_bundle -n 200 --stratify source --schema cp_sft_v1
Bundle contents: audit_sample.jsonl, audit_index.csv, audit_summary.json, AUDIT_README.md, manifest.json.
shard / manifest
cleanllm shard data.jsonl --outdir shards --size 5000 --gzip
cleanllm manifest shards -o manifest.json
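Sharding can be pictured as cutting the stream into fixed-size chunks and writing each chunk to its own gzip file. The sketch below is an illustration, not cleanllm's code: `shard_lines` and the `shard-00000.jsonl.gz` naming scheme are hypothetical, and it buffers one shard of lines at a time.

```python
import gzip
import itertools
import tempfile
from pathlib import Path
from typing import Iterable


def shard_lines(lines: Iterable[str], outdir: str, size: int) -> list[Path]:
    """Write consecutive chunks of `size` lines to gzip-compressed shard files."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    iterator = iter(lines)
    paths = []
    for index in itertools.count():
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            break  # stream exhausted
        path = out / f"shard-{index:05d}.jsonl.gz"  # hypothetical naming scheme
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            for line in chunk:
                fh.write(line.rstrip("\n") + "\n")
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    paths = shard_lines((f'{{"id": {i}}}' for i in range(12)), tmp, size=5)
    shard_names = [p.name for p in paths]
    with gzip.open(paths[-1], "rt", encoding="utf-8") as fh:
        last_shard_rows = fh.read().splitlines()

print(shard_names)
print(last_shard_rows)
```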
convert
Convert a JSONL file between sharegpt, alpaca, and chatml formats.
cleanllm convert data.jsonl -o converted.jsonl --from sharegpt --to chatml
cleanllm convert data.jsonl -o converted.jsonl --from alpaca --to sharegpt
Supported formats: sharegpt (conversations list), alpaca (instruction/output), chatml (messages list).
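As an illustration of the sharegpt to chatml direction, the sketch below maps a conversations list onto a messages list. It assumes the commonly seen ShareGPT turn layout (`from`/`value` with `human`, `gpt`, and `system` speakers); the exact field and role mapping cleanllm applies is not stated here, and `sharegpt_to_chatml` is a hypothetical helper.

```python
# Assumed speaker-to-role mapping; unknown speakers pass through unchanged.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_chatml(row: dict) -> dict:
    """Convert one ShareGPT-style record to a messages-style record."""
    messages = [
        {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in row["conversations"]
    ]
    converted = {key: value for key, value in row.items() if key != "conversations"}
    converted["messages"] = messages
    return converted


record = {
    "id": "1",
    "conversations": [
        {"from": "human", "value": "hi"},
        {"from": "gpt", "value": "hello"},
    ],
}
print(sharegpt_to_chatml(record))
```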
merge
Merge multiple JSONL files into one, with optional deduplication.
cleanllm merge a.jsonl b.jsonl c.jsonl -o merged.jsonl
cleanllm merge a.jsonl b.jsonl -o merged.jsonl --dedup
split
Split a JSONL file into train and val sets.
cleanllm split data.jsonl --outdir splits/
cleanllm split data.jsonl --outdir splits/ --ratio 0.95 --seed 42 --no-shuffle
Outputs <basename>_train.jsonl and <basename>_val.jsonl in the output directory. Default ratio is 0.9 (90% train).
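The ratio, seed, and shuffle options interact roughly as in the sketch below. For brevity it works on an in-memory list, which is a simplification for illustration and not a claim about how cleanllm performs the split; `split_rows` is a hypothetical name.

```python
import random


def split_rows(
    rows: list[str], ratio: float = 0.9, seed: int = 42, shuffle: bool = True
) -> tuple[list[str], list[str]]:
    """Return (train, val), where train holds the first `ratio` share of rows."""
    rows = list(rows)  # copy so the caller's list is left untouched
    if shuffle:
        random.Random(seed).shuffle(rows)  # seeded, so the split is repeatable
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]


data = [str(i) for i in range(8)]
train, val = split_rows(data, ratio=0.75, shuffle=False)
print(train, val)
```

With shuffling disabled the original row order is preserved, so the validation set is simply the tail of the file.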
recipes
Bootstrap pipelines and gate rules from built-in templates.
cleanllm recipes list
cleanllm recipes show cp_pipeline_cp_portable
cleanllm recipes write cp_bundle --outdir bootstrap/
Built-in recipes: cp_pipeline_basic, cp_pipeline_cp_portable, cp_pipeline_fast_audit, gate_stats_basic, gate_compare_basic, gate_compare_strict, cp_bundle.
Python API
from cleanllm import (
    scan_jsonl, fix_jsonl, FixRules,
    dedup_jsonl, validate_jsonl, stats_jsonl,
    sample_jsonl, audit_bundle,
    shard_jsonl, make_manifest,
    download_from_hub, detect_hf_schema,
)
from cleanllm.convert import convert_jsonl
from cleanllm.merge import merge_jsonl
from cleanllm.split import split_jsonl
# Scan
report = scan_jsonl("data.jsonl")
# Fix (code dataset)
rules = FixRules(
    drop_on={"forbidden_pattern", "empty_assistant"},
    max_tokens=4096,
    keep_language="python",
)
summary = fix_jsonl("data.jsonl", "cleaned.jsonl", rules)
# Fix (text/chat dataset — only drop truly blank responses)
rules = FixRules(drop_on={"empty_assistant"}, min_assistant_chars=1, forbidden_patterns=[])
# Dedup
result = dedup_jsonl("cleaned.jsonl", "deduped.jsonl", by="prompt", normalized=True)
# Stats
stats = stats_jsonl("deduped.jsonl", schema="cp_sft_v1", keys=["source", "difficulty_bucket"])
# Sample + audit
sample_jsonl("deduped.jsonl", "sample.jsonl", num_rows=200, seed=42)
audit_bundle("deduped.jsonl", "audit_bundle", num_rows=200, seed=42, stratify=["source"])
# Shard + manifest
shard_jsonl("deduped.jsonl", "shards", shard_size=5000, gzip_output=True)
make_manifest("shards", "manifest.json")
# Convert between formats
convert_jsonl("data.jsonl", "out.jsonl", from_fmt="sharegpt", to_fmt="chatml")
# Merge + split
merge_jsonl(["a.jsonl", "b.jsonl"], "merged.jsonl", dedup=True)
split_jsonl("merged.jsonl", "splits/", ratio=0.9, seed=42)
# Download from HuggingFace Hub (requires pip install cleanllm[hf])
result = download_from_hub("HuggingFaceH4/ultrachat_200k", "data.jsonl", split="train_sft")
Presets
| Preset | Description |
|---|---|
| general | URL removal + whitespace normalization, no domain-specific forbidden patterns |
| security_scan | Redacts secrets: AWS keys, GitHub tokens, API keys, private keys |
| pii_scan | Redacts PII: emails, US phone numbers, SSNs, credit cards, IPv4 addresses |
| cpp17_clean | URL removal + whitespace normalization + redact C++ portability issues |
| cp_portable | Strict CP portability; drops rows with forbidden patterns |
| deterministic_only | Drops rows with non-deterministic APIs (rand(), random_device, etc.) |
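Pattern-based redaction of the kind the pii_scan preset describes can be sketched with a few regular expressions. The patterns, the `[REDACTED]` token, and the `redact` helper below are illustrative assumptions; they are deliberately simple and are not the patterns shipped with cleanllm.

```python
import re

# Simplified, hypothetical patterns for two of the PII types named above.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> tuple[str, dict]:
    """Replace every match with a placeholder and count hits per pattern."""
    hits = {}
    for name, pattern in PATTERNS.items():
        text, count = pattern.subn("[REDACTED]", text)
        if count:
            hits[name] = count
    return text, hits


cleaned, hits = redact("Mail bob@example.com, SSN 123-45-6789.")
print(cleaned)
print(hits)
```

Returning the hit counts alongside the cleaned text makes it straightforward to choose between redacting a row and dropping it.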
Defaults
- Required keys: id, messages
- Forbidden patterns (default): none; use --preset cpp17_clean or --preset cp_portable for CP datasets
- empty_assistant threshold: 20 characters (responses shorter than this are flagged as empty)

CP datasets: To apply competitive-programming forbidden patterns (freopen, ifstream, bits/extc++.h, etc.), use a preset: cleanllm fix data.jsonl -o out.jsonl --preset cp_portable. In Python, pass forbidden_patterns=list(DEFAULT_FORBIDDEN_PATTERNS) explicitly.
Data format
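cleanllm expects JSONL where each line is a JSON object. The default required keys are id and messages, matching the basic_sft schema; cp_sft_v1 additionally requires source, problem_id, and tests. A minimal record: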
{
  "id": "unique-id",
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
Optional fields: source, difficulty_bucket, problem_id, tests.
Development
pip install -e .[dev]
pytest
python -m build
twine check dist/*
See RELEASE_CHECKLIST.md for the full release workflow.
License
MIT