Normalize messy JSONL into dict-only, deduplicated, BigQuery-friendly JSONL with discard logs.

jsonl-normalizer

A fast, fault-tolerant tool that normalizes messy JSONL files into clean, dict-only, BigQuery-friendly JSONL. Supports discard logging, SHA-256 deduplication, JSONL concatenation, and mixed-type top-level lines (dicts, lists, strings, numbers).

🚀 Features

Normalization

- Normalize any JSONL file
  - Accepts dicts, lists, numbers, strings, and malformed lines
  - Extracts dicts from lists
  - Logs non-dict elements instead of failing
- BigQuery-friendly output
  - Ensures one JSON object per line
- Robust error handling
  - Malformed JSON → logged
  - Non-dict top-level values → logged
  - Mixed lists → dicts kept, junk discarded
- Optional SHA-256 deduplication
  - Canonical JSON hashing removes duplicate objects across large files
- Zero dependencies
  - Pure standard library. Fast and lightweight.

NEW (v0.2.0): JSONL Concatenation

Combine many normalized_*.jsonl files into one newline-delimited JSONL
Perfect for BigQuery (NEWLINE_DELIMITED_JSON)
Optional dedupe via SHA-256
Gentle warnings for non-standard output filenames
Clean argparse-based CLI (jsonl-concat)

📦 Installation

pip install jsonl-normalizer

Development install:

pip install -e .

🖥️ CLI Usage

1. Normalize JSONL

Normalize a JSONL file:

jsonl-normalize input.jsonl

Produces:

normalized.jsonl   # clean dict-only output
discarded.jsonl    # log of malformed or discarded items

Enable deduplication:

jsonl-normalize input.jsonl --dedupe

Specify custom output paths:

jsonl-normalize input.jsonl --output clean.jsonl --discarded junk.jsonl

🔗 NEW: `jsonl-concat` — JSONL Concatenation Tool

jsonl-concat concatenates multiple normalized JSONL files into a single newline-delimited JSONL file.

This is ideal when your workflow produces many files such as:

norm_jsonl/
  normalized_0044a4b1d5099e2a.jsonl
  normalized_007b2d5c01abc0b9.jsonl
  normalized_02231d6de9a07833.jsonl
  ...

Combine them into one BigQuery-friendly file:

jsonl-concat

Default behavior is equivalent to:

jsonl-concat norm_jsonl/ combined.jsonl

Features

Reads all normalized_*.jsonl under the given directory
Writes one JSON object per line
SHA-256 dedupe, enabled by default (pass --no-dedupe to disable)
Verbose mode (--verbose)
Gentle suffix warning when output file is not .jsonl/.ndjson

Examples

Use defaults:

jsonl-concat

Explicit directory and output:

jsonl-concat norm_jsonl/ final.jsonl

Verbose output:

jsonl-concat --verbose norm_jsonl/ combined.jsonl

Disable deduplication:

jsonl-concat --no-dedupe norm_jsonl/ combined.jsonl

With --verbose, a non-standard output filename triggers a warning and the run continues:

[WARN] Output file 'out.json' does not use .jsonl or .ndjson. Continuing.
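
For readers who want to see roughly what the concatenation step involves, here is a minimal standard-library sketch. It is not the package's implementation: the function name `concat_jsonl`, its parameters, the non-recursive glob, and sending the warning to stderr are all assumptions made for illustration. It also assumes every input line is already valid JSON, as it would be after normalization.

```python
import hashlib
import json
import sys
from pathlib import Path


def concat_jsonl(src_dir, out_path, dedupe=True, verbose=False):
    """Hypothetical helper: merge normalized_*.jsonl files into one JSONL file.

    Returns the number of records written.
    """
    out_path = Path(out_path)
    if verbose and out_path.suffix not in {".jsonl", ".ndjson"}:
        # Same wording as the warning shown above; stderr is an assumption.
        print(
            f"[WARN] Output file '{out_path}' does not use .jsonl or .ndjson. Continuing.",
            file=sys.stderr,
        )

    # Collect inputs first so the output file can never be picked up as an input.
    inputs = sorted(Path(src_dir).glob("normalized_*.jsonl"))

    seen = set()
    written = 0
    with out_path.open("w", encoding="utf-8") as out:
        for path in inputs:
            with path.open(encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    obj = json.loads(line)
                    if dedupe:
                        # Key order must not affect the hash, so sort keys first.
                        canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
                        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
                        if digest in seen:
                            continue
                        seen.add(digest)
                    out.write(json.dumps(obj, ensure_ascii=False) + "\n")
                    written += 1
    return written
```

Calling `concat_jsonl("norm_jsonl", "combined.jsonl")` in this sketch mirrors the default CLI invocation; passing `dedupe=False` corresponds to --no-dedupe.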

📄 Example (Normalization)

Input (`mixed.jsonl`)

{"a": 1, "b": 2}
[{"a": 2}, [7]]
"just a string"

Output: `normalized.jsonl`

{"a": 1, "b": 2}
{"a": 2}

Output: `discarded.jsonl`

{"line": 2, "index": 1, "type": "list", "value": [7], "reason": "non-dict element in list"}
{"line": 3, "type": "str", "value": "just a string", "reason": "top-level value is not dict or list"}
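
The routing shown in this example can be approximated in a few lines of standard-library Python. This is only a sketch of the behavior described above, not the package's actual code: `normalize_line` is a hypothetical name, and the shape of the entry logged for malformed JSON (the `raw` field and the reason text) is an assumption, since the example does not show one.

```python
import json


def normalize_line(line_no, raw):
    """Hypothetical helper: return (kept_dicts, discard_entries) for one raw line."""
    kept, discarded = [], []
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Assumed entry shape for malformed input; the real log may differ.
        discarded.append(
            {"line": line_no, "raw": raw, "reason": f"malformed JSON: {exc.msg}"}
        )
        return kept, discarded

    if isinstance(value, dict):
        kept.append(value)
    elif isinstance(value, list):
        # Keep dict elements, log everything else with its position in the list.
        for index, item in enumerate(value):
            if isinstance(item, dict):
                kept.append(item)
            else:
                discarded.append(
                    {
                        "line": line_no,
                        "index": index,
                        "type": type(item).__name__,
                        "value": item,
                        "reason": "non-dict element in list",
                    }
                )
    else:
        discarded.append(
            {
                "line": line_no,
                "type": type(value).__name__,
                "value": value,
                "reason": "top-level value is not dict or list",
            }
        )
    return kept, discarded
```

Feeding the three lines of `mixed.jsonl` through this function one at a time yields the same two kept dicts and the same two discard entries as the example.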

🧪 Library Usage

from pathlib import Path
from jsonl_normalizer import normalize_jsonl

stats = normalize_jsonl(
    input_path=Path("input.jsonl"),
    output_path=Path("normalized.jsonl"),
    discarded_path=Path("discarded.jsonl"),
    dedupe=True,
)

print(stats)
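
After a run, it can be useful to see why items were discarded. The sketch below tallies the `reason` field of each entry in the discard log. It relies only on the log format shown in the normalization example; `summarize_discards` is a hypothetical helper, not part of the package's API.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_discards(discarded_path):
    """Hypothetical helper: count discard-log entries by their 'reason' field."""
    counts = Counter()
    with Path(discarded_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            counts[entry.get("reason", "unknown")] += 1
    return counts


if __name__ == "__main__":
    log = Path("discarded.jsonl")
    if log.exists():
        for reason, count in summarize_discards(log).most_common():
            print(f"{count:6d}  {reason}")
```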

❓ Why jsonl-normalizer?

Real-world JSONL is messy:

LLMs output arrays or malformed fragments
Excel corrupts JSON strings
Some APIs return non-dict top-level structures
Data lakes accumulate junk
BigQuery requires strict dict-per-line JSONL
ETL pipelines fail on partial corruption

jsonl-normalizer fixes these problems by:

Normalizing structure
Logging all junk transparently
Keeping valid dicts only
Providing optional dedupe mode
Producing warehouse-ready JSONL

🧹 Deduplication

When --dedupe is enabled:

Each object is canonicalized (sorted keys, compact JSON)
Hashed using SHA-256
Duplicates are skipped automatically

Example:

Normalized records seen: 200
Unique records written: 173
Duplicates skipped: 27
Discarded items logged: 12 → discarded.jsonl
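
The canonicalize-then-hash steps listed in this section can be sketched as follows. This is an illustration under assumptions rather than the package's code: `record_digest` and `unique_records` are hypothetical names, and the exact canonical form (here `sort_keys=True` with compact separators) is inferred from the description.

```python
import hashlib
import json


def record_digest(obj):
    """Hypothetical helper: SHA-256 hex digest of a canonical JSON encoding."""
    # Sorted keys and compact separators make the encoding independent of
    # key order and whitespace in the original line.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def unique_records(records):
    """Yield each record the first time its digest is seen; skip repeats."""
    seen = set()
    for record in records:
        digest = record_digest(record)
        if digest in seen:
            continue
        seen.add(digest)
        yield record
```

Storing only the digests keeps memory use proportional to the number of unique records rather than their size, which is one plausible reason to hash instead of comparing full objects.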

🧪 Testing

pip install -e .
pip install pytest
pytest

🤝 Contributing

Pull requests are welcome. Please ensure:

Tests pass
Code follows PEP 8
Changes remain backward compatible

📄 License

MIT License. See LICENSE for details.

Release history

0.2.1: Feb 13, 2026
0.2.0: Nov 29, 2025 (this version)
0.1.1: Nov 24, 2025
0.1.0: Nov 21, 2025