Normalize messy JSONL into dict-only, deduplicated, BigQuery-friendly JSONL with discard logs.

jsonl-normalizer

A fast, fault-tolerant tool that normalizes messy JSONL files into clean, dict-only, BigQuery-friendly JSONL. Supports discard logging, SHA-256 deduplication, JSONL concatenation, and mixed-type top-level lines (dicts, lists, strings, numbers).


🚀 Features

Normalization

  • Normalize any JSONL file

    • Accepts dicts, lists, numbers, strings, malformed lines
    • Extracts dicts from lists
    • Logs non-dict elements instead of failing
  • BigQuery-friendly output
    Ensures one JSON object per line.

  • Robust error handling

    • Malformed JSON → logged
    • Non-dict top-level values → logged
    • Mixed lists → dicts kept, junk discarded
  • Optional SHA-256 deduplication
    Canonical JSON hashing removes duplicate objects across large files.

  • Zero dependencies
    Pure standard library. Fast and lightweight.

NEW (v0.2.0): JSONL Concatenation

  • Combine many normalized_*.jsonl files into one newline-delimited JSONL
  • Perfect for BigQuery (NEWLINE_DELIMITED_JSON)
  • Optional dedupe via SHA-256
  • Gentle warnings for non-standard output filenames
  • Clean argparse-based CLI (jsonl-concat)

📦 Installation

pip install jsonl-normalizer

Development install:

pip install -e .

🖥️ CLI Usage

1. Normalize JSONL

Normalize a JSONL file:

jsonl-normalize input.jsonl

Produces:

normalized.jsonl   # clean dict-only output
discarded.jsonl    # log of malformed or discarded items

Enable deduplication:

jsonl-normalize input.jsonl --dedupe

Specify custom output:

jsonl-normalize input.jsonl --output clean.jsonl --discarded junk.jsonl

🔗 NEW: jsonl-concat — JSONL Concatenation Tool

jsonl-concat concatenates multiple normalized JSONL files into a single newline-delimited JSONL file.

This is ideal when your workflow produces many files such as:

norm_jsonl/
  normalized_0044a4b1d5099e2a.jsonl
  normalized_007b2d5c01abc0b9.jsonl
  normalized_02231d6de9a07833.jsonl
  ...

Combine them into one BigQuery-friendly file:

jsonl-concat

Default behavior is equivalent to:

jsonl-concat norm_jsonl/ combined.jsonl

Features

  • Reads all normalized_*.jsonl under the given directory
  • Writes one JSON object per line
  • SHA-256 dedupe, on by default (--no-dedupe to disable)
  • Verbose mode (--verbose)
  • Gentle suffix warning when output file is not .jsonl/.ndjson
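
The behavior listed above can be approximated with a few lines of standard-library Python. This is a minimal sketch, not the package's actual implementation: the function name concat_jsonl is hypothetical, and the non-recursive glob, the sorted file order, and the silent skipping of blank lines are assumptions.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def concat_jsonl(src_dir: Path, out_path: Path, dedupe: bool = True) -> int:
    """Concatenate normalized_*.jsonl files in src_dir into out_path.

    Returns the number of lines written. Hypothetical helper for illustration.
    """
    seen = set()
    written = 0
    with out_path.open("w", encoding="utf-8") as out:
        # Assumption: only files directly inside src_dir are read, in sorted order.
        for path in sorted(src_dir.glob("normalized_*.jsonl")):
            with path.open(encoding="utf-8") as src:
                for raw in src:
                    raw = raw.strip()
                    if not raw:
                        continue  # assumption: blank lines are skipped
                    obj = json.loads(raw)  # inputs are assumed to be normalized already
                    if dedupe:
                        # Canonical form: sorted keys, compact separators.
                        canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
                        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
                        if digest in seen:
                            continue
                        seen.add(digest)
                    out.write(json.dumps(obj) + "\n")
                    written += 1
    return written


# Small demonstration on temporary files.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "norm_jsonl"
    src_dir.mkdir()
    (src_dir / "normalized_a.jsonl").write_text('{"a": 1}\n{"a": 2}\n', encoding="utf-8")
    (src_dir / "normalized_b.jsonl").write_text('{"a": 1}\n', encoding="utf-8")
    out_path = Path(tmp) / "combined.jsonl"
    written = concat_jsonl(src_dir, out_path)
    combined = out_path.read_text(encoding="utf-8").splitlines()

print(written, combined)
```

Keeping the set of seen digests rather than the objects themselves bounds memory to one fixed-size hash per unique record, which matters when combining many large files.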

Examples

Use defaults:

jsonl-concat

Explicit directory and output:

jsonl-concat norm_jsonl/ final.jsonl

Verbose output:

jsonl-concat --verbose norm_jsonl/ combined.jsonl

Disable deduplication:

jsonl-concat --no-dedupe norm_jsonl/ combined.jsonl

If verbose mode is on and the output filename does not end in .jsonl or .ndjson, a warning is printed:

[WARN] Output file 'out.json' does not use .jsonl or .ndjson. Continuing.

📄 Example (Normalization)

Input (mixed.jsonl)

{"a": 1, "b": 2}
[{"a": 2}, [7]]
"just a string"

Output: normalized.jsonl

{"a": 1, "b": 2}
{"a": 2}

Output: discarded.jsonl

{"line": 2, "index": 1, "type": "list", "value": [7], "reason": "non-dict element in list"}
{"line": 3, "type": "str", "value": "just a string", "reason": "top-level value is not dict or list"}

🧪 Library Usage

from pathlib import Path
from jsonl_normalizer import normalize_jsonl

stats = normalize_jsonl(
    input_path=Path("input.jsonl"),
    output_path=Path("normalized.jsonl"),
    discarded_path=Path("discarded.jsonl"),
    dedupe=True,
)

print(stats)
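
The structure of the returned stats value is not described here, so a pipeline may want an independent check that the output really is dict-only before loading it into a warehouse. The sketch below is one possible check; check_dict_only is a hypothetical helper, not part of jsonl_normalizer.

```python
import json
import tempfile
from pathlib import Path


def check_dict_only(path: Path) -> int:
    """Return the number of records in path; raise if any line is not a JSON object."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line_no, raw in enumerate(handle, start=1):
            if not raw.strip():
                continue  # tolerate a trailing blank line
            value = json.loads(raw)
            if not isinstance(value, dict):
                raise ValueError(
                    f"{path}:{line_no}: expected object, got {type(value).__name__}"
                )
            count += 1
    return count


# Demonstration on a temporary file standing in for normalized.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "normalized.jsonl"
    sample.write_text('{"a": 1, "b": 2}\n{"a": 2}\n', encoding="utf-8")
    record_count = check_dict_only(sample)

print(record_count)
```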

❓ Why jsonl-normalizer?

Real-world JSONL is messy:

  • LLMs output arrays or malformed fragments
  • Excel corrupts JSON strings
  • Some APIs return non-dict top-level structures
  • Data lakes accumulate junk
  • BigQuery's newline-delimited JSON loader expects exactly one JSON object per line
  • ETL pipelines fail on partial corruption

jsonl-normalizer fixes these problems by:

  • Normalizing structure
  • Logging all junk transparently
  • Keeping valid dicts only
  • Providing optional dedupe mode
  • Producing warehouse-ready JSONL

🧹 Deduplication

When --dedupe is enabled:

  • Each object is canonicalized (sorted keys, compact JSON)
  • Hashed using SHA-256
  • Duplicates are skipped automatically

Example:

Normalized records seen: 200
Unique records written: 173
Duplicates skipped: 27
Discarded items logged: 12 → discarded.jsonl
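
The canonicalize-then-hash steps described above can be sketched with the standard library. This is an illustration of the technique rather than the package's code; canonical_hash and dedupe_records are hypothetical names, and the exact canonical form the tool uses (for example, its separator and escaping choices) is an assumption.

```python
import hashlib
import json


def canonical_hash(obj: dict) -> str:
    """Return the SHA-256 hex digest of a canonical JSON encoding of obj."""
    # Sorted keys and compact separators make key order and whitespace irrelevant.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_records(records):
    """Yield each record the first time its canonical hash is seen."""
    seen = set()
    for record in records:
        digest = canonical_hash(record)
        if digest in seen:
            continue  # duplicate: skip
        seen.add(digest)
        yield record


records = [{"a": 1, "b": 2}, {"b": 2, "a": 1}, {"a": 2}]
unique = list(dedupe_records(records))
print(unique)
```

Because keys are sorted before hashing, {"a": 1, "b": 2} and {"b": 2, "a": 1} produce the same digest and count as one record.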

🧪 Testing

pip install -e .
pip install pytest
pytest

🤝 Contributing

Pull requests are welcome. Please ensure:

  • Tests pass
  • Code follows PEP 8
  • Changes remain backward compatible

📄 License

MIT License. See LICENSE for details.
