Normalize messy JSONL into dict-only, deduplicated, BigQuery-friendly JSONL with discard logs.

jsonl-normalizer

A fast, fault-tolerant tool that normalizes messy JSONL files into clean, dict-only, BigQuery-friendly JSONL. Supports discard logging, SHA-256 deduplication, JSONL concatenation, and mixed-type top-level lines (dicts, lists, strings, numbers).

🚀 Features

Normalization

- Normalize any JSONL file
  - Accepts dicts, lists, numbers, strings, and malformed lines
  - Extracts dicts from lists
  - Logs non-dict elements instead of failing
- BigQuery-friendly output
  - Ensures one JSON object per line
- Robust error handling
  - Malformed JSON → logged
  - Non-dict top-level values → logged
  - Mixed lists → dicts kept, junk discarded
- Optional SHA-256 deduplication
  - Canonical JSON hashing removes duplicate objects across large files
- Zero dependencies
  - Pure standard library. Fast and lightweight.

NEW (v0.2.1): Batch JSON to JSONL

- Batch convert a directory of classic .json files to .jsonl
- Perfect for converting legacy exports or API dumps
- Optional dedupe and custom discard logging
- Clean argparse-based CLI (json-to-jsonl)

NEW (v0.2.1): JSONL Concatenation

- Combine many normalized_*.jsonl files into one newline-delimited JSONL file
- Perfect for BigQuery (NEWLINE_DELIMITED_JSON)
- Optional dedupe via SHA-256
- Gentle warnings for non-standard output filenames
- Clean argparse-based CLI (jsonl-concat)

📦 Installation

pip install jsonl-normalizer

Development install:

pip install -e .

🖥️ CLI Usage

1. Normalize JSONL

Normalize a JSONL file:

jsonl-normalize input.jsonl

Produces:

normalized.jsonl   # clean dict-only output
discarded.jsonl    # log of malformed or discarded items

Enable deduplication:

jsonl-normalize input.jsonl --dedupe

Specify custom output:

jsonl-normalize input.jsonl --output clean.jsonl --discarded junk.jsonl
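
To make "dict-only" concrete, here is a minimal sketch of a check you could run on an output file before loading it downstream. It is not part of jsonl-normalizer; the function name `find_non_dict_lines` is hypothetical and it uses only the standard library.

```python
import json
from typing import Iterable


def find_non_dict_lines(lines: Iterable[str]) -> list[int]:
    """Return the 1-based numbers of lines that are not a single JSON object.

    Hypothetical helper for illustration; blank lines are flagged as well,
    because json.loads rejects empty input.
    """
    bad: list[int] = []
    for line_no, raw in enumerate(lines, start=1):
        try:
            value = json.loads(raw)
        except json.JSONDecodeError:
            bad.append(line_no)  # malformed (or empty) line
            continue
        if not isinstance(value, dict):
            bad.append(line_no)  # valid JSON, but not an object
    return bad


sample = ['{"a": 1}', "[1, 2]", "oops", '"s"', '{"b": {"c": 2}}']
bad_lines = find_non_dict_lines(sample)
```

For a real file, pass the open file handle instead of a list, for example `find_non_dict_lines(open("normalized.jsonl", encoding="utf-8"))`. An empty result means every line parsed as a JSON object.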

📂 NEW: `json-to-jsonl` — Batch JSON to JSONL Converter

json-to-jsonl converts all .json files in a source directory to .jsonl files in an output directory.

Usage

json-to-jsonl source_dir output_dir

If --discarded-dir is not provided, discard logs go to a discarded_json directory. The directory is created only when there is at least one item to discard.

Features

- Detects all .json files in source_dir
- Converts each to output_dir/<filename>.jsonl
- Optional SHA-256 dedupe (--dedupe)
- Discard logs default to discarded_json (override with --discarded-dir)
- No empty discard files: a discard log is created only when items are actually discarded
- Quiet mode (--quiet)

Examples

json-to-jsonl ./raw_jsons ./converted_jsonls

With deduplication and discarded logs:

json-to-jsonl ./raw_jsons ./converted_jsonls --discarded-dir ./discarded --dedupe
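
The conversion and the "no empty discard files" behavior described above can be sketched roughly as follows. This is an illustration under assumptions, not the package's implementation: `convert_dir_sketch` and `write_jsonl` are hypothetical names, the output file is assumed to reuse the input file's stem, and the shape of the discard entries is made up for the example.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path: Path, records: list) -> None:
    """Write one JSON value per line."""
    with path.open("w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record, ensure_ascii=False) + "\n")


def convert_dir_sketch(source_dir: Path, output_dir: Path, discarded_dir: Path) -> dict[str, int]:
    """Convert every *.json file in source_dir to a .jsonl file in output_dir.

    Returns a mapping of input filename to number of records written.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    counts: dict[str, int] = {}
    for src in sorted(source_dir.glob("*.json")):
        kept: list = []
        discarded: list = []
        try:
            data = json.loads(src.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            data = []
            discarded.append({"reason": f"malformed JSON: {exc.msg}"})
        # A top-level list contributes its dict elements; a single dict is one record.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict):
                kept.append(item)
            else:
                discarded.append({"value": item, "reason": "not a dict"})
        write_jsonl(output_dir / f"{src.stem}.jsonl", kept)
        if discarded:
            # The discard directory and file appear only when something was discarded.
            discarded_dir.mkdir(parents=True, exist_ok=True)
            write_jsonl(discarded_dir / f"{src.stem}.jsonl", discarded)
        counts[src.name] = len(kept)
    return counts


# Small demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "in").mkdir()
    (root / "in" / "users.json").write_text('[{"id": 1}, {"id": 2}, 3]', encoding="utf-8")
    (root / "in" / "single.json").write_text('{"id": 9}', encoding="utf-8")
    counts = convert_dir_sketch(root / "in", root / "out", root / "discarded_json")
    users_lines = (root / "out" / "users.jsonl").read_text(encoding="utf-8").splitlines()
    discard_files = sorted(p.name for p in (root / "discarded_json").iterdir())
```

In the demo, only `users.json` contains a non-dict element, so only `users.jsonl` shows up in the discard directory.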

🔗 NEW: `jsonl-concat` — JSONL Concatenation Tool

jsonl-concat concatenates multiple normalized JSONL files into a single newline-delimited JSONL file.

This is ideal when your workflow produces many files such as:

norm_jsonl/
  normalized_0044a4b1d5099e2a.jsonl
  normalized_007b2d5c01abc0b9.jsonl
  normalized_02231d6de9a07833.jsonl
  ...

Combine them into one BigQuery-friendly file:

jsonl-concat

Default behavior is equivalent to:

jsonl-concat norm_jsonl/ combined.jsonl --pattern "*.jsonl"

Features

- Reads files matching the given pattern (default *.jsonl) under the given directory
- Processes files line by line, so deduplication works at the record level
- Writes one JSON object per line
- SHA-256 dedupe, enabled by default (--no-dedupe to disable)
- Quiet mode (--quiet)
- Gentle suffix warning when the output file is not .jsonl/.ndjson

Examples

Use defaults:

jsonl-concat

Explicit directory and output:

jsonl-concat norm_jsonl/ final.jsonl

Custom file pattern (e.g., if your files use a different extension):

jsonl-concat norm_jsonl/ combined.jsonl --pattern "*.json"

Quiet mode:

jsonl-concat --quiet norm_jsonl/ combined.jsonl

Disable deduplication:

jsonl-concat --no-dedupe norm_jsonl/ combined.jsonl

Unless --quiet is set, a warning is printed when the output filename does not end in .jsonl or .ndjson:

[WARN] Output file 'out.json' does not use .jsonl or .ndjson. Continuing.
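
As a rough picture of what line-by-line concatenation with record-level dedupe and the suffix warning involve, here is a minimal standard-library sketch. It is not the package's code: `concat_sketch` is a hypothetical name, and the choices to sort input files, skip blank lines, print the warning to stdout, and let malformed lines raise are all assumptions.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def concat_sketch(source_dir: Path, output_file: Path, pattern: str = "*.jsonl",
                  dedupe: bool = True, quiet: bool = False) -> int:
    """Concatenate matching JSONL files into output_file; return records written."""
    if not quiet and output_file.suffix.lower() not in {".jsonl", ".ndjson"}:
        print(f"[WARN] Output file '{output_file.name}' does not use .jsonl or .ndjson. Continuing.")
    seen: set[str] = set()
    written = 0
    with output_file.open("w", encoding="utf-8") as out:
        for path in sorted(source_dir.glob(pattern)):
            if path.resolve() == output_file.resolve():
                continue  # never read the file we are writing
            with path.open(encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue  # assumption: blank lines are ignored
                    record = json.loads(line)
                    if dedupe:
                        # Canonical form: sorted keys, compact separators.
                        canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
                        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
                        if digest in seen:
                            continue
                        seen.add(digest)
                    out.write(json.dumps(record) + "\n")
                    written += 1
    return written


# Demo: the two input files share one record that differs only in key order.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src_dir = root / "norm_jsonl"
    src_dir.mkdir()
    (src_dir / "a.jsonl").write_text('{"a": 1, "b": 2}\n{"a": 2}\n', encoding="utf-8")
    (src_dir / "b.jsonl").write_text('{"b": 2, "a": 1}\n{"a": 3}\n', encoding="utf-8")
    total_deduped = concat_sketch(src_dir, root / "combined.jsonl", quiet=True)
    combined_lines = (root / "combined.jsonl").read_text(encoding="utf-8").splitlines()
    total_raw = concat_sketch(src_dir, root / "combined.jsonl", dedupe=False, quiet=True)
```

Because hashing works on the canonical form, `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` count as the same record, which is why deduplication has to happen per record rather than per raw line.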

📄 Example (Normalization)

Input (`mixed.jsonl`)

{"a": 1, "b": 2}
[{"a": 2}, [7]]
"just a string"

Output: `normalized.jsonl`

{"a": 1, "b": 2}
{"a": 2}

Output: `discarded.jsonl`

{"line": 2, "index": 1, "type": "list", "value": [7], "reason": "non-dict element in list"}
{"line": 3, "type": "str", "value": "just a string", "reason": "top-level value is not dict or list"}
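
The per-line handling shown in the example above can be sketched as follows. This is a minimal illustration, not the package's implementation: `normalize_lines` is a hypothetical name, and the log entry for malformed JSON is an assumed shape, since the example above only shows the two non-dict cases.

```python
import json
from typing import Any, Iterable


def normalize_lines(lines: Iterable[str]) -> tuple[list[dict], list[dict]]:
    """Split raw JSONL lines into kept dicts and discard-log entries."""
    kept: list[dict] = []
    discarded: list[dict] = []
    for line_no, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # assumption: blank lines are skipped silently
        try:
            value: Any = json.loads(text)
        except json.JSONDecodeError as exc:
            # Assumed entry shape for malformed input.
            discarded.append({"line": line_no, "raw": text,
                              "reason": f"malformed JSON: {exc.msg}"})
            continue
        if isinstance(value, dict):
            kept.append(value)
        elif isinstance(value, list):
            # Keep dict elements, log everything else with its index.
            for index, item in enumerate(value):
                if isinstance(item, dict):
                    kept.append(item)
                else:
                    discarded.append({"line": line_no, "index": index,
                                      "type": type(item).__name__, "value": item,
                                      "reason": "non-dict element in list"})
        else:
            discarded.append({"line": line_no, "type": type(value).__name__,
                              "value": value,
                              "reason": "top-level value is not dict or list"})
    return kept, discarded


mixed = ['{"a": 1, "b": 2}', '[{"a": 2}, [7]]', '"just a string"']
kept, discarded = normalize_lines(mixed)
```

Run on the three lines of `mixed.jsonl`, the sketch keeps the same two dicts and produces the same two discard entries as the example output.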

🧪 Library Usage

from pathlib import Path
from jsonl_normalizer import normalize_jsonl, convert_json_dir_to_jsonl, concat_jsonl

# 1. Normalize a single file
stats = normalize_jsonl(
    input_path=Path("input.jsonl"),
    output_path=Path("normalized.jsonl"),
    discarded_path=Path("discarded.jsonl"),
    dedupe=True,
)
print(f"Single file: {stats}")

# 2. Batch convert a directory (discarded_dir is optional)
results = convert_json_dir_to_jsonl(
    source_dir=Path("./json_inputs"),
    output_dir=Path("./jsonl_outputs"),
    dedupe=True,
)
for filename, stats in results.items():
    print(f"{filename}: {stats.written} records")

# 3. Concatenate multiple JSONL files
concat_jsonl(
    source_dir=Path("./norm_jsonl"),
    output_file=Path("combined.jsonl"),
    pattern="*.jsonl",
    dedupe=True,
)

❓ Why jsonl-normalizer?

Real-world JSONL is messy:

- LLMs output arrays or malformed fragments
- Excel corrupts JSON strings
- Some APIs return non-dict top-level structures
- Data lakes accumulate junk
- BigQuery requires strict dict-per-line JSONL
- ETL pipelines fail on partial corruption

jsonl-normalizer fixes these problems by:

- Normalizing structure
- Logging all junk transparently
- Keeping valid dicts only
- Providing optional dedupe mode
- Producing warehouse-ready JSONL

🧹 Deduplication

When --dedupe is enabled:

- Each object is canonicalized (sorted keys, compact JSON)
- The canonical form is hashed using SHA-256
- Objects whose hash has already been seen are skipped automatically

Example:

Normalized records seen: 200
Unique records written: 173
Duplicates skipped: 27
Discarded items logged: 12 → discarded.jsonl
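
A minimal sketch of canonical hashing as described above, using only `json` and `hashlib`. The function names `record_digest` and `dedupe_records` are hypothetical, and the exact canonical form the package uses may differ in details such as separators or Unicode escaping.

```python
import hashlib
import json
from typing import Iterable, Iterator


def record_digest(record: dict) -> str:
    """SHA-256 hex digest of a record's canonical JSON form."""
    # Sorted keys and compact separators make key order and spacing irrelevant.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_records(records: Iterable[dict]) -> Iterator[dict]:
    """Yield each distinct record once, in first-seen order."""
    seen: set[str] = set()
    for record in records:
        digest = record_digest(record)
        if digest in seen:
            continue  # duplicate: skip
        seen.add(digest)
        yield record


records = [{"a": 1, "b": 2}, {"b": 2, "a": 1}, {"a": 2}]
unique = list(dedupe_records(records))
```

Storing 64-character digests instead of the records themselves keeps the memory needed for the "seen" set bounded per record, which matters on large files.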

🧪 Testing

pip install -e .
pip install pytest
pytest

🤝 Contributing

Pull requests are welcome. Please ensure:

- Tests pass
- Code follows PEP 8
- Changes remain backward compatible

📄 License

MIT License. See LICENSE for details.

Release history

- 0.2.1 (this version): Feb 13, 2026
- 0.2.0: Nov 29, 2025
- 0.1.1: Nov 24, 2025
- 0.1.0: Nov 21, 2025
