parallelogram

Strict validator for fine-tuning datasets. Run it before you train.

Every fine-tuning framework assumes your data is clean. None of them verify it. Axolotl will start a run on malformed data and either crash mid-way or — worse — complete silently while producing a broken model. TRL will truncate samples that exceed the context window without telling you. Unsloth will train on duplicates that cause your model to memorize instead of generalize.

parallelogram sits between your raw dataset and your training run. It hard-blocks on anything that would silently corrupt training. If it exits 0 with all rules enabled, your run won't fail because of data.

Install

pip install parallelogram

For the context-window check, also install the tokenizer extras:

pip install 'parallelogram[tokenizer]'

Use

parallelogram check data.jsonl

With a tokenizer for context-window validation:

parallelogram check data.jsonl \
  --tokenizer meta-llama/Llama-3-8B \
  --max-seq-len 8192
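To make the context-window check concrete, here is a minimal Python sketch of the idea. It is not parallelogram's implementation. `count_tokens`, `exceeds_budget`, and `oversized_lines` are hypothetical names, and the whitespace split is only a stand-in for a real tokenizer: it undercounts and ignores chat-template tokens. The sketch also assumes each record already has the openai-chat shape, with string `content` values.

```python
import json
from typing import Callable, Iterable


def count_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def exceeds_budget(record: dict, max_seq_len: int,
                   counter: Callable[[str], int] = count_tokens) -> bool:
    """Return True if the record's messages total more than max_seq_len tokens."""
    total = sum(counter(m.get("content") or "") for m in record.get("messages", []))
    return total > max_seq_len


def oversized_lines(lines: Iterable[str], max_seq_len: int) -> list[int]:
    """Return 1-based line numbers of JSONL records that exceed the budget."""
    return [
        lineno
        for lineno, line in enumerate(lines, start=1)
        if line.strip() and exceeds_budget(json.loads(line), max_seq_len)
    ]
```

Swapping `counter` for a function backed by the tokenizer you actually train with would give counts that match what the training framework sees.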

Write only the clean records to a new file:

parallelogram check data.jsonl --output clean.jsonl

--fix — mechanical repair

When parallelogram check finds errors, --fix attempts to repair what it can mechanically, with no model involved. Mechanical fixes are free, run locally, and require no network call.

parallelogram check data.jsonl --fix --output clean.jsonl

Fixes applied in order:

  1. encoding — strip BOM markers, replace mojibake (donâ€™t → don't)
  2. empty-content — drop empty/whitespace-only message turns
  3. context-window — truncate the longest user message until the record fits
  4. duplicates — keep the first occurrence, drop subsequent
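The fixes listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code parallelogram runs. All function names are hypothetical, mojibake is assumed to be UTF-8 text that was wrongly decoded as Windows-1252, and duplicates are assumed to mean records whose JSON content is identical. The truncation step is left out because it depends on a tokenizer.

```python
import json


def strip_bom(text: str) -> str:
    """Remove leading byte-order marks (U+FEFF) if present."""
    return text.lstrip("\ufeff")


def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Not a clean round trip, so assume the text was fine as-is.
        return text


def drop_empty_turns(messages: list[dict]) -> list[dict]:
    """Drop turns whose content is empty or whitespace-only."""
    return [m for m in messages if str(m.get("content") or "").strip()]


def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each exact-content record."""
    seen: set[str] = set()
    kept = []
    for record in records:
        # Canonical JSON makes key order irrelevant when comparing records.
        key = json.dumps(record, sort_keys=True, ensure_ascii=False)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept
```

The round-trip in `fix_mojibake` is deliberately conservative: text that does not decode cleanly is returned unchanged rather than guessed at.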

After fixes are applied, the dataset is re-validated. Anything still erroring is dropped from the output. The CLI tells you exactly what was unchanged, fixed, and dropped:

✓ encoding · 4 fixes
✓ duplicates · 12 fixes

✗ dropped:
    data.jsonl:23 → roles (unfixable)
    data.jsonl:147 → schema (unfixable)

547 records  531 unchanged  4 fixed  11 dropped  1 unparseable

Errors that require understanding the content (broken role sequences, incomplete assistant turns) cannot be fixed mechanically. A hosted SLM tier planned for a future release is meant to address them. For now, such records are dropped.

Use --dry-run to preview without writing:

parallelogram check data.jsonl --fix --dry-run

--disable and the exit-0 guarantee

Rules can be disabled by id (e.g. --disable encoding), but with three constraints:

  • The schema rule cannot be disabled. Every other rule depends on its structural guarantees, and disabling it would let other rules silently no-op on malformed records.
  • Unknown rule ids are rejected. Typos like --disable encding exit non-zero with a list of valid options rather than silently doing nothing.
  • Whenever any rule is disabled, a loud stderr warning names exactly which ones, and reminds you that the exit-0 guarantee no longer applies. The terminal output and JSON report (disabled_rules field) both surface this so CI tooling can refuse to merge a PR that disabled rules.

The guarantee is precise: a clean exit with all rules enabled means your run won't fail because of data. A clean exit with rules disabled means only that the rules you left enabled passed — which may or may not be enough.
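A CI job could enforce this distinction along the following lines. This is a sketch only: the text above confirms that the JSON report has a `disabled_rules` field, but its exact shape is an assumption here (a top-level list of rule ids), and both function names are hypothetical.

```python
import json


def disabled_rules(report_json: str) -> list[str]:
    """Return the rule ids under `disabled_rules` (assumed to be a top-level list)."""
    report = json.loads(report_json)
    return list(report.get("disabled_rules") or [])


def guarantee_holds(exit_code: int, report_json: str) -> bool:
    """The exit-0 guarantee applies only to a clean exit with no rules disabled."""
    return exit_code == 0 and not disabled_rules(report_json)
```

A merge gate would then block on `not guarantee_holds(...)` instead of checking the exit code alone.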

Options

| Flag | Description |
| --- | --- |
| --format, -f | Dataset format. Only openai-chat is supported as of v0.2. |
| --tokenizer, -t | HuggingFace tokenizer name. Required for the context-window check. |
| --max-seq-len | Token budget per record (default 4096). |
| --output, -o | Write error-free records to this file. With --fix, writes the repaired dataset. |
| --fix | Attempt mechanical repair of fixable issues. |
| --dry-run | With --fix, report what would change without writing. |
| --json | Machine-readable report on stdout. |
| --disable | Disable a rule by id. Repeatable. |
| --no-color | Plain output. |

Exit codes

| Code | Meaning (check) | Meaning (--fix) |
| --- | --- | --- |
| 0 | Clean. | All records emitted clean. |
| 1 | Warnings only. | Some records dropped (partial fix). |
| 2 | Errors. | Nothing fixable. |

These map directly to CI gates without any extra wiring.
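If you launch training from Python rather than a shell pipeline, the same gate can be sketched as below. It uses only the documented `parallelogram check <file>` invocation and the exit codes listed above. `should_train`, `run_check`, and the `allow_warnings` switch are hypothetical names for this illustration.

```python
import subprocess


def should_train(exit_code: int, allow_warnings: bool = False) -> bool:
    """Decide whether to start training based on the validator's exit code."""
    if exit_code == 0:
        return True  # clean
    if exit_code == 1:
        return allow_warnings  # warnings only
    return False  # 2 (errors) or anything unexpected blocks the run


def run_check(path: str) -> int:
    """Run `parallelogram check <path>` and return its exit code."""
    return subprocess.run(["parallelogram", "check", path]).returncode
```

A training script would call `should_train(run_check("data.jsonl"))` and exit early when it returns False. Treating unknown exit codes as failures is a conservative choice made for this sketch.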

Rules

| id | severity | catches |
| --- | --- | --- |
| schema | error | malformed records, missing fields, wrong types |
| roles | error | bad role sequences (system out of place, no alternation, doesn't end on assistant) |
| empty-content | error | empty or whitespace-only message content |
| context-window | error | records exceeding max_seq_len (TRL truncates these silently) |
| duplicates | error | exact-content duplicate records (memorization → poor generalization) |
| encoding | warning | BOM markers, mojibake patterns |
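As an illustration of what the roles rule looks for, here is a minimal Python sketch of the three conditions it names. It is not the project's implementation. `role_errors` is a hypothetical name, and the sketch assumes a conversation has an optional leading system message followed by strictly alternating user and assistant turns, starting with user. Tool or function roles are not considered.

```python
def role_errors(messages: list[dict]) -> list[str]:
    """Return problems found in the role sequence (empty list if none)."""
    roles = [m.get("role") for m in messages]
    errors = []

    # A system message is only allowed in the first position.
    if "system" in roles[1:]:
        errors.append("system message is not first")

    turns = [r for r in roles if r != "system"]
    if not turns:
        errors.append("no user/assistant turns")
        return errors

    # Turns must alternate user, assistant, user, assistant, ...
    for i, role in enumerate(turns):
        expected = "user" if i % 2 == 0 else "assistant"
        if role != expected:
            errors.append(f"turn {i}: expected {expected}, got {role}")
            break

    # The final turn is the one the model learns to produce.
    if turns[-1] != "assistant":
        errors.append("does not end on assistant")
    return errors
```

Returning a list of messages instead of a boolean lets a single record report several problems at once.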

Status

v0.2 — solo dev, local, pre-training run. No telemetry, no network, no upload boundary.

Roadmap

  • ShareGPT and raw-completion formats
  • --fix mechanical tier (dedupe, truncate, normalize encoding) ✓ shipped in v0.2
  • Opt-in anonymized error-type analytics (informs SLM tier scope)
  • --fix --slm paid hosted tier — repairs broken role sequences, incomplete turns

License

Apache-2.0
