Normalize messy JSONL into dict-only, deduplicated, BigQuery-friendly JSONL with discard logs.
jsonl-normalizer
A fast, fault-tolerant tool that normalizes messy JSONL files into clean, dict-only, BigQuery-friendly JSONL. It supports discard logging, SHA-256 deduplication, JSONL concatenation, and input lines whose top-level value is a dict, list, string, or number.
🚀 Features
Normalization
- Normalize any JSONL file
  - Accepts dicts, lists, numbers, strings, malformed lines
  - Extracts dicts from lists
  - Logs non-dict elements instead of failing
- BigQuery-friendly output: ensures one JSON object per line.
- Robust error handling
  - Malformed JSON → logged
  - Non-dict top-level values → logged
  - Mixed lists → dicts kept, junk discarded
- Optional SHA-256 deduplication: canonical JSON hashing removes duplicate objects across large files.
- Zero dependencies: pure standard library. Fast and lightweight.
NEW (v0.2.0): JSONL Concatenation
- Combine many normalized_*.jsonl files into one newline-delimited JSONL
- Perfect for BigQuery (NEWLINE_DELIMITED_JSON)
- Optional dedupe via SHA-256
- Gentle warnings for non-standard output filenames
- Clean argparse-based CLI (jsonl-concat)
📦 Installation
pip install jsonl-normalizer
Development install:
pip install -e .
🖥️ CLI Usage
1. Normalize JSONL
Normalize a JSONL file:
jsonl-normalize input.jsonl
Produces:
normalized.jsonl # clean dict-only output
discarded.jsonl # log of malformed or discarded items
Enable deduplication:
jsonl-normalize input.jsonl --dedupe
Specify custom output:
jsonl-normalize input.jsonl --output clean.jsonl --discarded junk.jsonl
🔗 NEW: jsonl-concat — JSONL Concatenation Tool
jsonl-concat concatenates multiple normalized JSONL files into a single multi-line JSONL file.
This is ideal when your workflow produces many files such as:
norm_jsonl/
normalized_0044a4b1d5099e2a.jsonl
normalized_007b2d5c01abc0b9.jsonl
normalized_02231d6de9a07833.jsonl
...
Combine them into one BigQuery-friendly file:
jsonl-concat
Default behavior is equivalent to:
jsonl-concat norm_jsonl/ combined.jsonl
Features
- Reads all normalized_*.jsonl under the given directory
- Writes one JSON object per line
- Optional SHA-256 dedupe (--no-dedupe to disable)
- Verbose mode (--verbose)
- Gentle suffix warning when output file is not .jsonl/.ndjson
Examples
Use defaults:
jsonl-concat
Explicit directory and output:
jsonl-concat norm_jsonl/ final.jsonl
Verbose output:
jsonl-concat --verbose norm_jsonl/ combined.jsonl
Disable deduplication:
jsonl-concat --no-dedupe norm_jsonl/ combined.jsonl
With --verbose and an output filename that does not end in .jsonl or .ndjson, a warning is printed:
[WARN] Output file 'out.json' does not use .jsonl or .ndjson. Continuing.
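The concatenation behavior described above can be approximated with the standard library alone. The following is a minimal sketch, not the package's implementation: the function name concat_jsonl and its parameters are hypothetical, the directory scan is assumed to be non-recursive, and deduplication is left out to keep the example short.

```python
import sys
from pathlib import Path


def concat_jsonl(input_dir, output_path, verbose=False):
    """Concatenate normalized_*.jsonl files into one newline-delimited file.

    Hypothetical helper for illustration; returns the number of lines written.
    """
    output_path = Path(output_path)

    # Gentle warning only: a non-standard suffix does not stop the run.
    if verbose and output_path.suffix not in {".jsonl", ".ndjson"}:
        print(
            f"[WARN] Output file '{output_path}' does not use "
            ".jsonl or .ndjson. Continuing.",
            file=sys.stderr,
        )

    # Collect inputs before opening the output, and never read the output
    # file itself if it happens to match the pattern.
    files = sorted(
        p
        for p in Path(input_dir).glob("normalized_*.jsonl")
        if p.resolve() != output_path.resolve()
    )

    written = 0
    with output_path.open("w", encoding="utf-8") as out:
        for path in files:
            with path.open(encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines between records
                    out.write(line + "\n")
                    written += 1
    return written
```

Sorting the input files makes the output order reproducible from run to run, which helps when comparing two combined files.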
📄 Example (Normalization)
Input (mixed.jsonl)
{"a": 1, "b": 2}
[{"a": 2}, [7]]
"just a string"
Output: normalized.jsonl
{"a": 1, "b": 2}
{"a": 2}
Output: discarded.jsonl
{"line": 2, "index": 1, "type": "list", "value": [7], "reason": "non-dict element in list"}
{"line": 3, "type": "str", "value": "just a string", "reason": "top-level value is not dict or list"}
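The rules behind this example can be sketched in a few lines of Python. This is an illustrative approximation, not the package's source: the function name normalize_lines is hypothetical, the shape of the record logged for malformed JSON (the "raw" key and the reason text) is an assumption, and skipping blank lines is also assumed.

```python
import json


def normalize_lines(lines):
    """Split raw JSONL lines into (kept dicts, discard records)."""
    kept, discarded = [], []
    for line_no, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # assumed: blank lines are ignored
        try:
            value = json.loads(text)
        except json.JSONDecodeError as exc:
            # Assumed record shape for malformed input.
            discarded.append(
                {"line": line_no, "raw": text, "reason": f"malformed JSON: {exc.msg}"}
            )
            continue

        if isinstance(value, dict):
            kept.append(value)
        elif isinstance(value, list):
            # Keep dict elements, log everything else with its list index.
            for index, item in enumerate(value):
                if isinstance(item, dict):
                    kept.append(item)
                else:
                    discarded.append(
                        {
                            "line": line_no,
                            "index": index,
                            "type": type(item).__name__,
                            "value": item,
                            "reason": "non-dict element in list",
                        }
                    )
        else:
            discarded.append(
                {
                    "line": line_no,
                    "type": type(value).__name__,
                    "value": value,
                    "reason": "top-level value is not dict or list",
                }
            )
    return kept, discarded
```

Feeding the three lines of mixed.jsonl to this sketch yields the two kept dicts and the two discard records shown above.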
🧪 Library Usage
from pathlib import Path
from jsonl_normalizer import normalize_jsonl
stats = normalize_jsonl(
    input_path=Path("input.jsonl"),
    output_path=Path("normalized.jsonl"),
    discarded_path=Path("discarded.jsonl"),
    dedupe=True,
)
print(stats)
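After a run, the discard log can be inspected with ordinary JSONL tooling. The sketch below counts discard reasons. It relies only on the "reason" field shown in the example discard records, and the helper name summarize_discards is hypothetical rather than part of the package.

```python
import json
from collections import Counter


def summarize_discards(lines):
    """Count discard records per reason from an iterable of JSONL lines."""
    reasons = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Fall back to "unknown" if a record carries no reason field.
        reasons[record.get("reason", "unknown")] += 1
    return reasons
```

In practice the lines would come from an open file handle on discarded.jsonl, for example by passing the result of open("discarded.jsonl", encoding="utf-8") to the function.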
❓ Why jsonl-normalizer?
Real-world JSONL is messy:
- LLMs output arrays or malformed fragments
- Excel corrupts JSON strings
- Some APIs return non-dict top-level structures
- Data lakes accumulate junk
- BigQuery requires strict dict-per-line JSONL
- ETL pipelines fail on partial corruption
jsonl-normalizer fixes these problems by:
- Normalizing structure
- Logging all junk transparently
- Keeping valid dicts only
- Providing optional dedupe mode
- Producing warehouse-ready JSONL
🧹 Deduplication
When --dedupe is enabled:
- Each object is canonicalized (sorted keys, compact JSON)
- Hashed using SHA-256
- Duplicates are skipped automatically
Example:
Normalized records seen: 200
Unique records written: 173
Duplicates skipped: 27
Discarded items logged: 12 → discarded.jsonl
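The canonicalize-then-hash steps can be sketched with the standard library. This is a minimal illustration under stated assumptions, not the package's actual code: the names canonical_hash and dedupe_records are hypothetical, and the exact canonical form (sorted keys, compact separators, UTF-8 encoding) is inferred from the description above.

```python
import hashlib
import json


def canonical_hash(obj):
    """SHA-256 hex digest of a canonical JSON encoding of obj."""
    # Sorted keys and compact separators make key order and whitespace
    # irrelevant, so logically equal objects hash identically.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_records(records):
    """Return records with later duplicates removed, preserving order."""
    seen = set()
    unique = []
    for record in records:
        digest = canonical_hash(record)
        if digest in seen:
            continue  # duplicate of an earlier record
        seen.add(digest)
        unique.append(record)
    return unique
```

Storing only the digests keeps memory use at one fixed-size string per unique record, regardless of how large the records themselves are.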
🧪 Testing
pip install -e .
pip install pytest
pytest
🤝 Contributing
Pull requests are welcome. Please ensure:
- Tests pass
- Code follows PEP 8
- Changes remain backward compatible
📄 License
MIT License. See LICENSE for details.