cleanllm

Streaming JSONL cleaner for LLM fine-tuning datasets. Minimal dependencies, memory-safe, and fast — processes files line-by-line without loading them into memory.

What it does

cleanllm gives you a pipeline for cleaning, validating, and profiling JSONL datasets before fine-tuning:

raw.jsonl → scan → fix → dedup → validate → stats → audit bundle → shards

Every step is streaming (no full-file load), resumable, and produces machine-readable JSON reports for CI gating.
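
To illustrate what "streaming" means here, the following is a minimal sketch of line-by-line JSONL processing. It is not cleanllm's implementation; `iter_jsonl` and `count_rows` are hypothetical helper names, and the counter keys simply mirror the metric names used in the gate rules further down.

```python
import json
from typing import Iterator, Optional, Tuple


def iter_jsonl(path: str) -> Iterator[Tuple[int, Optional[dict], Optional[str]]]:
    """Yield (line_number, record, error) for each non-blank line.

    Only one line is held in memory at a time.
    """
    with open(path, "r", encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines are skipped, not counted
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                yield line_no, None, str(exc)
                continue
            if not isinstance(obj, dict):
                yield line_no, None, "not a JSON object"
                continue
            yield line_no, obj, None


def count_rows(path: str) -> dict:
    """Count valid and invalid rows in a single pass (hypothetical report shape)."""
    counts = {"valid_json_rows": 0, "invalid_json_rows": 0}
    for _, _record, error in iter_jsonl(path):
        key = "valid_json_rows" if error is None else "invalid_json_rows"
        counts[key] += 1
    return counts
```

Because the reader is a generator, downstream steps can consume records one at a time, so memory use does not grow with file size.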

Install

pip install cleanllm

Or from source:

git clone https://github.com/verma8076/cleanllm
cd cleanllm
pip install -e .

Quickstart

# Scan for issues
cleanllm scan data.jsonl

# Fix: remove URLs, normalize whitespace, redact forbidden patterns
cleanllm fix data.jsonl -o data.cleaned.jsonl

# Deduplicate by prompt content
cleanllm dedup data.cleaned.jsonl -o data.dedup.jsonl --by prompt

# Profile the cleaned dataset
cleanllm stats data.dedup.jsonl --report-json stats.json

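# Compare against a stats report saved from an earlier dataset version
cleanllm compare old_stats.json stats.json --report-json compare.json

# Gate in CI: fail if invalid rows increased
cleanllm gate --compare compare.json --rules gate_rules.json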

CLI reference

`scan`

Streaming scan for issues — invalid JSON, missing keys, URLs, forbidden patterns, language distribution, duplicate estimate.

cleanllm scan data.jsonl
cleanllm scan data.jsonl --report-json scan_report.json --dup-estimate
cleanllm scan data.jsonl --preset cp_portable

`fix`

Remove URLs, normalize whitespace, redact or drop rows with forbidden patterns.

cleanllm fix data.jsonl -o cleaned.jsonl
cleanllm fix data.jsonl -o cleaned.jsonl --drop-on forbidden_pattern --drop-on invalid_json
cleanllm fix data.jsonl -o cleaned.jsonl --preset cpp17_clean --report-json fix_report.json

Drop rules: invalid_json, missing_required_keys, forbidden_pattern, empty_assistant, placeholder, repetitive_response, bad_conversation.

Note on empty_assistant: by default, assistant responses shorter than 20 characters count as empty. The threshold is calibrated for code datasets, where very short responses are almost always errors. For text/chat datasets, set --min-assistant-chars 1 so that only truly blank responses are dropped.

`validate`

Schema validation, line by line. Exit code 0 only if all rows pass.

cleanllm validate data.jsonl --schema basic_sft
cleanllm validate data.jsonl --schema cp_sft_v1

| Schema | Required fields |
| --- | --- |
| `basic_sft` | `id`, `messages` (list of `role`/`content` dicts) |
| `cp_sft_v1` | `id`, `source`, `problem_id`, `messages`, `tests` (non-empty, with `input`/`output`) |
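
As a rough illustration of the `basic_sft` requirements listed above, here is a minimal per-record check. This is a sketch, not cleanllm's validator: `validate_basic_sft` is a hypothetical name, and the error strings are invented for the example.

```python
def validate_basic_sft(record: object) -> list:
    """Return a list of problems; an empty list means the record passes."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    errors = []
    if "id" not in record:
        errors.append("missing key: id")
    messages = record.get("messages")
    if not isinstance(messages, list):
        errors.append("messages must be a list")
        return errors
    # Each message is expected to be a dict with role and content.
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            errors.append(f"messages[{i}] is not an object")
            continue
        for field in ("role", "content"):
            if field not in msg:
                errors.append(f"messages[{i}] missing {field}")
    return errors
```

A line-by-line validator would call a check like this on each record and exit nonzero if any record returns a non-empty list.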

`dedup`

First-occurrence deduplication — by full record, prompt (system+user), or code (assistant).

cleanllm dedup data.jsonl -o deduped.jsonl --by record
cleanllm dedup data.jsonl -o deduped.jsonl --by prompt --normalized
cleanllm dedup data.jsonl -o deduped.jsonl --by code --report-json dedup_report.json
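
First-occurrence deduplication by prompt can be sketched as follows. This is an assumption-laden illustration rather than cleanllm's code: `prompt_key` and `dedup_first` are hypothetical names, and "normalized" is assumed here to mean lowercasing plus whitespace collapsing, which may differ from what `--normalized` actually does.

```python
import hashlib
from typing import Iterable, Iterator


def prompt_key(record: dict, normalized: bool = False) -> str:
    """Hash the system and user messages of a record into a fixed-size key."""
    parts = []
    for msg in record.get("messages", []):
        role = msg.get("role")
        if role in ("system", "user"):
            text = str(msg.get("content", ""))
            if normalized:
                # Assumed normalization: lowercase and collapse whitespace.
                text = " ".join(text.lower().split())
            parts.append(f"{role}\x00{text}")
    return hashlib.sha256("\x01".join(parts).encode("utf-8")).hexdigest()


def dedup_first(records: Iterable[dict], normalized: bool = False) -> Iterator[dict]:
    """Yield each record whose prompt key has not been seen before."""
    seen = set()
    for record in records:
        key = prompt_key(record, normalized)
        if key in seen:
            continue  # a later duplicate: drop it, keep the first occurrence
        seen.add(key)
        yield record
```

Storing digests instead of full prompts keeps the memory cost at one fixed-size key per unique prompt, which is what lets this style of dedup run over a stream.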

`stats`

Single-pass profiler: distributions, structural stats, schema counts, response lengths, language distribution.

cleanllm stats data.jsonl
cleanllm stats data.jsonl --schema cp_sft_v1 --keys source,difficulty_bucket --top-k 20
cleanllm stats data.jsonl --report-json stats.json

`compare`

Diff two stats reports to catch regressions between dataset versions.

cleanllm compare old_stats.json new_stats.json
cleanllm compare old_stats.json new_stats.json --report-json compare.json
cleanllm compare old.jsonl new.jsonl --from-jsonl --schema cp_sft_v1

`gate`

CI-friendly quality gating. Nonzero exit on failures.

cleanllm gate --stats stats.json --rules gate_rules.json
cleanllm gate --compare compare.json --rules gate_rules.json --strict
cleanllm gate --compare compare.json --inline-rule "counts_diff.invalid_json_rows.delta<=0"

Gate rules JSON:

{
  "version": 1,
  "mode": "compare",
  "rules": [
    {"name": "no_new_invalid", "metric": "counts_diff.invalid_json_rows.delta", "op": "<=", "value": 0},
    {"name": "enough_valid",   "metric": "counts_diff.valid_json_rows.new",    "op": ">=", "value": 1000}
  ]
}

Supported ops: ==, !=, <, <=, >, >=. Severities: error (default), warn.
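
To make the rule semantics concrete, here is a minimal sketch of how rules like the ones above could be evaluated against a report. It is not cleanllm's gate logic: `lookup` and `evaluate` are hypothetical names, the nested report shape is inferred from the dotted metric paths, and the `severity` key name is an assumption.

```python
import operator
from typing import Any

# The six supported comparison operators, mapped to Python functions.
OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def lookup(report: dict, dotted: str) -> Any:
    """Resolve a dotted path such as 'counts_diff.invalid_json_rows.delta'."""
    node = report
    for part in dotted.split("."):
        node = node[part]  # raises KeyError if the metric is missing
    return node


def evaluate(report: dict, rules: list) -> list:
    """Return the names of failed rules whose severity is 'error'."""
    failed = []
    for rule in rules:
        actual = lookup(report, rule["metric"])
        passed = OPS[rule["op"]](actual, rule["value"])
        # Assumed key name: "severity", defaulting to "error".
        if not passed and rule.get("severity", "error") == "error":
            failed.append(rule["name"])
    return failed
```

A CI wrapper around this would exit nonzero whenever `evaluate` returns a non-empty list, while `warn` rules would only be reported.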

`run`

Execute a JSON-defined multi-step pipeline with variable substitution.

cleanllm run --config pipeline.json
cleanllm run --config pipeline.json --set input_path=data.jsonl --set outdir=out/v2
cleanllm run --config pipeline.json --dry-run

Supported step types: fix, validate, dedup, stats, audit, sample, shard, manifest, scan, compare.

`sample`

Reservoir sampling — random or stratified, deterministic with --seed.

cleanllm sample data.jsonl -o sample.jsonl -n 500 --seed 42
cleanllm sample data.jsonl -o sample.jsonl -n 500 --stratify source,difficulty_bucket
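
For readers unfamiliar with reservoir sampling, the classic single-pass version (Algorithm R) looks roughly like this. It is a generic sketch of the technique, not cleanllm's sampler; `reservoir_sample` is a hypothetical name, and stratification is not covered.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], n: int, seed: int = 0) -> List[T]:
    """Pick up to n items uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)  # a private RNG makes the result reproducible
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < n:
            reservoir.append(item)  # fill the reservoir with the first n items
        else:
            j = rng.randint(0, i)  # uniform over 0..i, both ends inclusive
            if j < n:
                reservoir[j] = item  # replace with probability n / (i + 1)
    return reservoir
```

Only `n` items are ever held in memory, and the same seed over the same input order yields the same sample.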

`audit`

Build a reproducible audit bundle in one command: sampled JSONL + CSV review index (with original line numbers) + summary + manifest.

cleanllm audit data.jsonl --outdir audit_bundle -n 200 --seed 42
cleanllm audit data.jsonl --outdir audit_bundle -n 200 --stratify source --schema cp_sft_v1

Bundle contents: audit_sample.jsonl, audit_index.csv, audit_summary.json, AUDIT_README.md, manifest.json.

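`shard` / `manifest`

Split a JSONL file into fixed-size shards (optionally gzip-compressed) and write a manifest for the shard directory.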

cleanllm shard data.jsonl --outdir shards --size 5000 --gzip
cleanllm manifest shards -o manifest.json
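
The sharding idea can be sketched as below, assuming the size is a row count per shard. This is an illustration only: `shard_lines` and the `shard-00000.jsonl` naming scheme are hypothetical and may not match what cleanllm writes.

```python
import gzip
import os
from typing import Iterable, List


def shard_lines(lines: Iterable[str], outdir: str, size: int, use_gzip: bool = False) -> List[str]:
    """Write lines into consecutive shard files of at most `size` rows each."""
    os.makedirs(outdir, exist_ok=True)
    paths: List[str] = []
    fh = None
    for i, line in enumerate(lines):
        if i % size == 0:
            # Start a new shard every `size` rows.
            if fh is not None:
                fh.close()
            name = f"shard-{i // size:05d}.jsonl" + (".gz" if use_gzip else "")
            path = os.path.join(outdir, name)
            opener = gzip.open if use_gzip else open
            fh = opener(path, "wt", encoding="utf-8")
            paths.append(path)
        fh.write(line.rstrip("\n") + "\n")
    if fh is not None:
        fh.close()
    return paths
```

Only one shard file is open at a time, so the input can be an arbitrarily long stream.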

`convert`

Convert a JSONL file between sharegpt, alpaca, and chatml formats.

cleanllm convert data.jsonl -o converted.jsonl --from sharegpt --to chatml
cleanllm convert data.jsonl -o converted.jsonl --from alpaca --to sharegpt

Supported formats: sharegpt (conversations list), alpaca (instruction/output), chatml (messages list).
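
As a rough picture of what such a conversion involves, here is a sketch that maps sharegpt and alpaca records to a chatml-style `messages` list. It is not cleanllm's converter. The function names are hypothetical, the `from`/`value` keys and `human`/`gpt` role names follow the common ShareGPT convention, and the optional alpaca `input` field follows the common Alpaca convention; none of these are confirmed by this README.

```python
# Assumed ShareGPT speaker names mapped to chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def sharegpt_to_chatml(record: dict) -> dict:
    """Turn a 'conversations' list into a 'messages' list, keeping other keys."""
    messages = []
    for turn in record.get("conversations", []):
        speaker = turn.get("from")
        role = ROLE_MAP.get(speaker, speaker)  # unknown speakers pass through
        messages.append({"role": role, "content": turn.get("value", "")})
    out = {k: v for k, v in record.items() if k != "conversations"}
    out["messages"] = messages
    return out


def alpaca_to_chatml(record: dict) -> dict:
    """Turn instruction/output (plus optional input) into a two-message chat."""
    prompt = record.get("instruction", "")
    if record.get("input"):
        prompt = f"{prompt}\n\n{record['input']}"
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": record.get("output", "")},
        ]
    }
```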

`merge`

Merge multiple JSONL files into one, with optional deduplication.

cleanllm merge a.jsonl b.jsonl c.jsonl -o merged.jsonl
cleanllm merge a.jsonl b.jsonl -o merged.jsonl --dedup

`split`

Split a JSONL file into train and val sets.

cleanllm split data.jsonl --outdir splits/
cleanllm split data.jsonl --outdir splits/ --ratio 0.95 --seed 42 --no-shuffle

Outputs <basename>_train.jsonl and <basename>_val.jsonl in the output directory. Default ratio is 0.9 (90% train).
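
The ratio and seed semantics can be illustrated with a small in-memory sketch. It is not cleanllm's implementation (which may well avoid loading all rows at once); `split_rows` is a hypothetical name, and truncating the cut point with `int()` is an assumption.

```python
import random
from typing import Iterable, List, Tuple


def split_rows(rows: Iterable, ratio: float = 0.9, seed: int = 42,
               shuffle: bool = True) -> Tuple[List, List]:
    """Split rows into (train, val); `ratio` is the train fraction."""
    rows = list(rows)  # in-memory for simplicity
    if shuffle:
        random.Random(seed).shuffle(rows)  # same seed, same order
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]
```

With shuffling disabled, the first 90% of rows go to train in their original order, which is useful when the file is already ordered deliberately.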

`recipes`

Bootstrap pipelines and gate rules from built-in templates.

cleanllm recipes list
cleanllm recipes show cp_pipeline_cp_portable
cleanllm recipes write cp_bundle --outdir bootstrap/

Built-in recipes: cp_pipeline_basic, cp_pipeline_cp_portable, cp_pipeline_fast_audit, gate_stats_basic, gate_compare_basic, gate_compare_strict, cp_bundle.

Python API

from cleanllm import (
    scan_jsonl, fix_jsonl, FixRules,
    dedup_jsonl, validate_jsonl, stats_jsonl,
    sample_jsonl, audit_bundle,
    shard_jsonl, make_manifest,
    download_from_hub, detect_hf_schema,
)
from cleanllm.convert import convert_jsonl
from cleanllm.merge import merge_jsonl
from cleanllm.split import split_jsonl

# Scan
report = scan_jsonl("data.jsonl")

# Fix (code dataset)
rules = FixRules(
    drop_on={"forbidden_pattern", "empty_assistant"},
    max_tokens=4096,
    keep_language="python",
)
summary = fix_jsonl("data.jsonl", "cleaned.jsonl", rules)

# Fix (text/chat dataset — only drop truly blank responses)
rules = FixRules(drop_on={"empty_assistant"}, min_assistant_chars=1, forbidden_patterns=[])

# Dedup
result = dedup_jsonl("cleaned.jsonl", "deduped.jsonl", by="prompt", normalized=True)

# Stats
stats = stats_jsonl("deduped.jsonl", schema="cp_sft_v1", keys=["source", "difficulty_bucket"])

# Sample + audit
sample_jsonl("deduped.jsonl", "sample.jsonl", num_rows=200, seed=42)
audit_bundle("deduped.jsonl", "audit_bundle", num_rows=200, seed=42, stratify=["source"])

# Shard + manifest
shard_jsonl("deduped.jsonl", "shards", shard_size=5000, gzip_output=True)
make_manifest("shards", "manifest.json")

# Convert between formats
convert_jsonl("data.jsonl", "out.jsonl", from_fmt="sharegpt", to_fmt="chatml")

# Merge + split
merge_jsonl(["a.jsonl", "b.jsonl"], "merged.jsonl", dedup=True)
split_jsonl("merged.jsonl", "splits/", ratio=0.9, seed=42)

# Download from the Hugging Face Hub (requires: pip install "cleanllm[hf]")
result = download_from_hub("HuggingFaceH4/ultrachat_200k", "data.jsonl", split="train_sft")

Presets

| Preset | Description |
| --- | --- |
| `general` | URL removal + whitespace normalization, no domain-specific forbidden patterns |
| `security_scan` | Redacts secrets: AWS keys, GitHub tokens, API keys, private keys |
| `pii_scan` | Redacts PII: emails, US phone numbers, SSNs, credit cards, IPv4 addresses |
| `cpp17_clean` | URL removal + whitespace normalization + redact C++ portability issues |
| `cp_portable` | Strict CP portability: drops rows with forbidden patterns |
| `deterministic_only` | Drops rows with non-deterministic APIs (`rand()`, `random_device`, etc.) |

Defaults

Required keys: id, messages
Forbidden patterns (default): none — use --preset cpp17_clean or --preset cp_portable for CP datasets
empty_assistant threshold: 20 characters (responses shorter than this are flagged as empty)

CP datasets: To apply competitive-programming forbidden patterns (freopen, ifstream, bits/extc++.h, etc.) use a preset: cleanllm fix data.jsonl -o out.jsonl --preset cp_portable. In Python, pass forbidden_patterns=list(DEFAULT_FORBIDDEN_PATTERNS) explicitly.

Data format

cleanllm expects JSONL where each line is a JSON object. The minimal record, matching the default required keys (id, messages) and the basic_sft schema, looks like this:

{
  "id": "unique-id",
  "messages": [
    {"role": "system",    "content": "..."},
    {"role": "user",      "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}

Optional fields: source, difficulty_bucket, problem_id, tests. The cp_sft_v1 schema additionally requires source, problem_id, and tests.

Development

pip install -e ".[dev]"
pytest
python -m build
twine check dist/*

See RELEASE_CHECKLIST.md for the full release workflow.

License

MIT

Release history

1.0.0 (this version): Jun 6, 2026
0.4.0: Jun 6, 2026

