Normalize messy JSONL into dict-only, deduplicated, BigQuery-friendly JSONL with discard logs.

jsonl-normalizer

A fast, fault-tolerant tool that normalizes messy JSONL files into clean, dict-only, BigQuery-friendly JSONL. Supports discard logging, SHA-256 deduplication, and mixed-type top-level lines (dicts, lists, strings, numbers).


🚀 Features

  • Normalize any JSONL file

    • Accepts dicts, lists, numbers, strings, malformed lines
    • Extracts dicts from lists
    • Logs non-dict elements instead of failing
  • BigQuery-friendly output
    Ensures one JSON object per line.

  • Robust error handling

    • Malformed JSON → logged
    • Non-dict top-level values → logged
    • Mixed lists → dicts kept, junk discarded
  • Optional SHA-256 deduplication
    Canonical JSON hashing removes duplicate objects across large files.

  • Zero dependencies
    Pure standard library. Fast and lightweight.


📦 Installation

pip install jsonl-normalizer

Development install:

pip install -e .

🖥️ CLI Usage

Normalize a JSONL file:

jsonl-normalize input.jsonl

Produces:

normalized.jsonl   # clean dict-only output
discarded.jsonl    # log of malformed or discarded items

Enable deduplication:

jsonl-normalize input.jsonl --dedupe

Specify custom output paths:

jsonl-normalize input.jsonl --output clean.jsonl --discarded junk.jsonl

📄 Example

Input (mixed.jsonl)

{"a": 1, "b": 2}
[{"a": 2}, [7]]
"just a string"

Output: normalized.jsonl

{"a": 1, "b": 2}
{"a": 2}

Output: discarded.jsonl

{"line": 2, "index": 1, "type": "list", "value": [7], "reason": "non-dict element in list"}
{"line": 3, "type": "str", "value": "just a string", "reason": "top-level value is not dict or list"}
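The routing shown in this example can be approximated with a short sketch. This is not the library's actual implementation: the function name `normalize_lines` is hypothetical, and the record shape for malformed lines and the silent skipping of blank lines are assumptions. Only the two discard record shapes for non-dict values are taken from the example above.

```python
import json
from typing import Any, Iterable


def normalize_lines(lines: Iterable[str]) -> tuple[list[dict], list[dict]]:
    """Split raw JSONL lines into kept dicts and discard-log records.

    Hypothetical helper for illustration; not part of jsonl-normalizer's API.
    """
    kept: list[dict] = []
    discarded: list[dict] = []

    for line_no, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            # Assumption: blank lines are skipped without logging.
            continue

        try:
            value: Any = json.loads(text)
        except json.JSONDecodeError:
            # Assumption: this record shape is invented for the sketch.
            discarded.append(
                {"line": line_no, "value": text, "reason": "malformed JSON"}
            )
            continue

        if isinstance(value, dict):
            # A top-level dict is kept as-is.
            kept.append(value)
        elif isinstance(value, list):
            # Dicts inside a top-level list are kept; other elements are logged.
            for index, item in enumerate(value):
                if isinstance(item, dict):
                    kept.append(item)
                else:
                    discarded.append(
                        {
                            "line": line_no,
                            "index": index,
                            "type": type(item).__name__,
                            "value": item,
                            "reason": "non-dict element in list",
                        }
                    )
        else:
            # Strings, numbers, booleans and null at the top level are logged.
            discarded.append(
                {
                    "line": line_no,
                    "type": type(value).__name__,
                    "value": value,
                    "reason": "top-level value is not dict or list",
                }
            )

    return kept, discarded
```

Feeding the three lines of mixed.jsonl to this sketch yields the two kept dicts and the two discard records shown in the example.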

🧪 Library Usage

from pathlib import Path
from jsonl_normalizer import normalize_jsonl

stats = normalize_jsonl(
    input_path=Path("input.jsonl"),
    output_path=Path("normalized.jsonl"),
    discarded_path=Path("discarded.jsonl"),
    dedupe=True,
)

print(stats)
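To sanity-check the file written by a call like the one above, a small standard-library check can confirm that every line parses to a single JSON object. The helper name `find_non_dict_lines` is hypothetical and is not part of the package; this is only a sketch of such a check.

```python
import json
from typing import Iterable


def find_non_dict_lines(lines: Iterable[str]) -> list[int]:
    """Return 1-based numbers of lines that are not a single JSON object.

    Hypothetical helper for illustration; not part of jsonl-normalizer's API.
    """
    bad: list[int] = []
    for line_no, raw in enumerate(lines, start=1):
        try:
            value = json.loads(raw)
        except json.JSONDecodeError:
            # Unparseable lines (including blank ones) are flagged.
            bad.append(line_no)
            continue
        if not isinstance(value, dict):
            # Valid JSON that is not an object is flagged too.
            bad.append(line_no)
    return bad


# Example use against a file on disk (path taken from the call above):
#
#     with open("normalized.jsonl", encoding="utf-8") as fh:
#         print(find_non_dict_lines(fh))
```

An empty list means every line is a JSON object, which is the shape the normalized output is meant to have.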

❓ Why jsonl-normalizer?

Real-world JSONL is messy:

  • LLMs sometimes output arrays or malformed JSON
  • Spreadsheet round-trips (for example through Excel) can corrupt JSON strings
  • Some APIs return non-dict top-level structures
  • Data lakes accumulate junk
  • BigQuery's newline-delimited JSON loader expects exactly one JSON object (dict) per line
  • ETL pipelines often fail on partial corruption

jsonl-normalizer fixes these problems by:

  • Normalizing structure
  • Logging all junk transparently
  • Keeping valid dicts only
  • Providing dedupe mode
  • Producing reliable, warehouse-ready JSONL

🧹 Deduplication

When --dedupe is enabled:

  • Each object is canonicalized (sorted keys, compact JSON)
  • The canonical form is hashed using SHA-256
  • Objects whose hash has already been seen are skipped

Example output:

Normalized records seen: 200
Unique records written: 173
Duplicates skipped: 27
Discarded items logged: 12 → discarded.jsonl
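The canonicalize-then-hash steps described in this section can be sketched with `json` and `hashlib` from the standard library. The function names `canonical_hash` and `dedupe` are hypothetical, and the exact serialization settings the tool uses are an assumption; the sketch only shows one common way to implement "sorted keys, compact JSON, SHA-256".

```python
import hashlib
import json
from typing import Iterable, Iterator


def canonical_hash(obj: dict) -> str:
    """Hash a dict by its canonical JSON form.

    Assumption: canonical means sorted keys and no extra whitespace.
    """
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe(records: Iterable[dict]) -> Iterator[dict]:
    """Yield each record the first time its canonical hash is seen."""
    seen: set[str] = set()
    for record in records:
        digest = canonical_hash(record)
        if digest in seen:
            continue  # duplicate: skip it
        seen.add(digest)
        yield record
```

Keeping only the hex digests in memory, rather than the objects themselves, bounds memory use per record regardless of object size. Note that in this sketch key order does not affect equality, but `1` and `1.0` serialize differently and therefore count as distinct values.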

🧪 Testing

pip install -e .
pip install pytest
pytest

🤝 Contributing

Pull requests are welcome. Please ensure:

  • Tests pass
  • Code follows PEP 8
  • Changes remain backward compatible

📄 License

MIT License. See LICENSE for details.
