Strict validator for fine-tuning datasets. Run before you train.
Project description
parallelogram
Strict validator for fine-tuning datasets. Run it before you train.
Every fine-tuning framework assumes your data is clean. None of them verify it. Axolotl will start a run on malformed data and either crash mid-way or — worse — complete silently while producing a broken model. TRL will truncate samples that exceed the context window without telling you. Unsloth will train on duplicates that cause your model to memorize instead of generalize.
parallelogram sits between your raw dataset and your training run. It hard-blocks
on anything that would silently corrupt training. If it exits 0 with all rules enabled,
your run won't fail because of data.
Install
pip install parallelogram
For an exact context-window count, install the tokenizer extras — HuggingFace
tokenizers for open-weight models and tiktoken for OpenAI models:
pip install 'parallelogram[tokenizer]'
Without the extras (or for a model with no offline tokenizer, like Claude), the context-window check still runs using an approximate length-based count.
Use
parallelogram check data.jsonl
With a model-specific tokenizer for an exact context-window count — an OpenAI model,
or any HuggingFace repo or short alias (mistral, qwen, llama-3, …):
parallelogram check data.jsonl \
--tokenizer Qwen/Qwen2.5-7B \
--max-seq-len 8192
Omit --tokenizer and the check still runs with an approximate count, reported as
warnings instead of errors.
Write only the clean records to a new file:
parallelogram check data.jsonl --output clean.jsonl
--fix — mechanical repair
When parallelogram check finds errors, --fix attempts to repair what it can without
touching the model. Mechanical fixes are free, local, and require no network call.
parallelogram check data.jsonl --fix --output clean.jsonl
Fixes applied in order:
- encoding — strip BOM markers, replace mojibake (
don’t→don't) - empty-content — drop empty/whitespace-only message turns
- context-window — truncate the longest user message until the record fits
- duplicates — keep the first occurrence, drop subsequent
After fixes are applied, the dataset is re-validated. Anything still erroring is dropped from the output. The CLI tells you exactly what was unchanged, fixed, and dropped:
✓ encoding · 4 fixes
✓ duplicates · 12 fixes
✗ dropped:
data.jsonl:23 → roles (unfixable)
data.jsonl:147 → schema (unfixable)
547 records 531 unchanged 4 fixed 11 dropped 1 unparseable
Errors that need understanding the content (broken role sequences, incomplete assistant turns) are not fixable mechanically. These will be addressed by a hosted SLM tier in a future release. For now, they are dropped.
Use --dry-run to preview without writing:
parallelogram check data.jsonl --fix --dry-run
--disable and the exit-0 guarantee
Rules can be disabled by id (e.g. --disable encoding), but with three constraints:
- The
schemarule cannot be disabled. Every other rule depends on its structural guarantees, and disabling it would let other rules silently no-op on malformed records. - Unknown rule ids are rejected. Typos like
--disable encdingexit non-zero with a list of valid options rather than silently doing nothing. - Whenever any rule is disabled, a loud stderr warning names exactly which ones, and
reminds you that the exit-0 guarantee no longer applies. The terminal output and
JSON report (
disabled_rulesfield) both surface this so CI tooling can refuse to merge a PR that disabled rules.
The guarantee is precise: a clean exit with all rules enabled means your run won't fail because of data. A clean exit with rules disabled means only that the rules you left enabled passed — which may or may not be enough.
Options
| Flag | Description |
|---|---|
--format, -f |
Dataset format. Only openai-chat in v0.1. |
--tokenizer, -t |
Model or tokenizer for the context-window check — an OpenAI model (gpt-4o), or an HF repo/alias (Qwen/Qwen2.5-7B, mistral). Optional: omit for an approximate count. |
--max-seq-len |
Token budget per record (default 4096). |
--output, -o |
Write error-free records to this file. With --fix, writes the repaired dataset. |
--fix |
Attempt mechanical repair of fixable issues. |
--dry-run |
With --fix, report what would change without writing. |
--json |
Machine-readable report on stdout. |
--disable |
Disable a rule by id. Repeatable. |
--no-color |
Plain output. |
Exit codes
| Code | Meaning (check) | Meaning (--fix) |
|---|---|---|
0 |
Clean. | All records emitted clean. |
1 |
Warnings only. | Some records dropped (partial fix). |
2 |
Errors. | Nothing fixable. |
These map directly to CI gates without any extra wiring.
Rules
| id | severity | catches |
|---|---|---|
schema |
error | malformed records, missing fields, wrong types |
roles |
error | bad role sequences (system out of place, no alternation, doesn't end on assistant) |
empty-content |
error | empty or whitespace-only message content |
context-window |
error / warning | records exceeding max_seq_len (TRL truncates these silently) — error with an exact tokenizer, warning when the count is approximate |
duplicates |
error | exact-content duplicate records (memorization → poor generalization) |
encoding |
warning | BOM markers, mojibake patterns |
Status
v0.3 — solo dev, local, pre-training run. No telemetry, no network, no upload boundary.
Roadmap
- ShareGPT and raw-completion formats
✓ shipped in v0.2--fixmechanical tier (dedupe, truncate, normalize encoding)Model-specific tokenizers (tiktoken/HF) with approximate fallback✓ shipped in v0.3- Opt-in anonymized error-type analytics (informs SLM tier scope)
--fix --slmpaid hosted tier — repairs broken role sequences, incomplete turns
License
Apache-2.0
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file parallelogram-0.3.0.tar.gz.
File metadata
- Download URL: parallelogram-0.3.0.tar.gz
- Upload date:
- Size: 34.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
88e2ff1f9d553ff7511c67efcc56e454ece82b7fa23ad85f356024b29dfe9322
|
|
| MD5 |
4e944115211413c0cd96f75b4e121baf
|
|
| BLAKE2b-256 |
4d7a783c47614eaf136fb1bcdae00becc153a0fad99adc34e8e27e79801c50b3
|
Provenance
The following attestation bundles were made for parallelogram-0.3.0.tar.gz:
Publisher:
release.yml on Thatayotlhe04/Parallelogram
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
parallelogram-0.3.0.tar.gz -
Subject digest:
88e2ff1f9d553ff7511c67efcc56e454ece82b7fa23ad85f356024b29dfe9322 - Sigstore transparency entry: 1785519165
- Sigstore integration time:
-
Permalink:
Thatayotlhe04/Parallelogram@2726b07c2b853b4e948b000a14e17365f05f96c7 -
Branch / Tag:
refs/tags/cli-v0.3.0 - Owner: https://github.com/Thatayotlhe04
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@2726b07c2b853b4e948b000a14e17365f05f96c7 -
Trigger Event:
push
-
Statement type:
File details
Details for the file parallelogram-0.3.0-py3-none-any.whl.
File metadata
- Download URL: parallelogram-0.3.0-py3-none-any.whl
- Upload date:
- Size: 36.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ca8f75418a5b219c44fcda2fbc8f504c66dd46a59886c5d6b39e805595bc1bc0
|
|
| MD5 |
beb5a8226a87ea21b0bd3e32d11eb80d
|
|
| BLAKE2b-256 |
11807b9714de8abbeea6c4bfb0dc8a48e81af31a48da5b49ecb1fe74d8fc71ea
|
Provenance
The following attestation bundles were made for parallelogram-0.3.0-py3-none-any.whl:
Publisher:
release.yml on Thatayotlhe04/Parallelogram
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
parallelogram-0.3.0-py3-none-any.whl -
Subject digest:
ca8f75418a5b219c44fcda2fbc8f504c66dd46a59886c5d6b39e805595bc1bc0 - Sigstore transparency entry: 1785519369
- Sigstore integration time:
-
Permalink:
Thatayotlhe04/Parallelogram@2726b07c2b853b4e948b000a14e17365f05f96c7 -
Branch / Tag:
refs/tags/cli-v0.3.0 - Owner: https://github.com/Thatayotlhe04
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@2726b07c2b853b4e948b000a14e17365f05f96c7 -
Trigger Event:
push
-
Statement type: