A command-line utility to validate and normalize CSV files

csvnorm

A command-line utility to validate and normalize CSV files for initial exploration.

Version 1.0 Breaking Change

If upgrading from v0.x: The default output has changed from file to stdout for better Unix composability.

# v0.x behavior
csvnorm data.csv              # Created data.csv in current directory

# v1.0 behavior (NEW)
csvnorm data.csv              # Outputs to stdout
csvnorm data.csv -o clean.csv # Explicitly save to file (the output path must differ from the input)
csvnorm data.csv > clean.csv  # Or use shell redirect (never redirect onto the input file)

This follows the Unix philosophy and matches tools like jq, csvkit, and xsv.

Installation

Recommended (uv):

uv tool install csvnorm

Or with pip:

pip install csvnorm

Purpose

This tool prepares CSV files for basic exploratory data analysis (EDA), not for complex transformations. It focuses on achieving a clean, standardized baseline format that allows you to quickly assess data quality and structure before designing more sophisticated ETL pipelines.

What it does:

Validates CSV structure and reports errors
Normalizes encoding to UTF-8 when needed
Normalizes delimiters and field names
Creates a consistent starting point for data exploration
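
As a rough illustration of the encoding and delimiter steps listed above, the following Python sketch decodes raw bytes and rewrites the rows with comma separators using only the standard library. It is not csvnorm's implementation (the tool relies on charset-normalizer and DuckDB); the function name `normalize_delimiter`, the Latin-1 fallback, and the candidate delimiter set are assumptions made for this example.

```python
import csv
import io


def normalize_delimiter(raw: bytes, candidates: str = ",;\t|") -> str:
    """Decode bytes and rewrite the rows with comma delimiters (illustrative only)."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption for this sketch: fall back to Latin-1 when UTF-8 fails.
        text = raw.decode("latin-1")

    # Guess the input delimiter from a sample; raises csv.Error if none fits.
    dialect = csv.Sniffer().sniff(text[:4096], delimiters=candidates)

    rows = csv.reader(io.StringIO(text, newline=""), dialect)
    out = io.StringIO(newline="")
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    sample = "name;city\nAnna;Città\n".encode("latin-1")
    print(normalize_delimiter(sample), end="")
```

The sniffer needs at least one consistent candidate delimiter per row, so single-column inputs are outside the scope of this sketch.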

What it doesn't do:

Complex data transformations or business logic
Type inference or data validation beyond structure
Heavy processing or aggregations

Features

CSV Validation: Checks for common CSV errors and inconsistencies using DuckDB
Delimiter Normalization: Converts all field separators to standard commas (,)
Field Name Normalization: Converts column headers to snake_case format
Encoding Normalization: Auto-detects encoding and converts to UTF-8 when needed (ASCII is already UTF-8 compatible)
Processing Summary: Displays comprehensive statistics (rows, columns, file sizes) and error details
Error Reporting: Exports a detailed error file for invalid rows and shows a summary panel
Remote URL Support: Process CSV files directly from HTTP/HTTPS URLs without downloading (unless --fix-mojibake or --download-remote is used)
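
To make the field name normalization concrete, here is a small Python sketch of one possible snake_case conversion. The helper `to_snake_case` and its exact rules (splitting camelCase, collapsing punctuation into underscores) are assumptions for illustration; the names csvnorm actually produces may differ in edge cases.

```python
import re


def to_snake_case(header: str) -> str:
    """Approximate snake_case conversion for a column header (hypothetical rules)."""
    # Insert an underscore at lowercase/digit -> uppercase boundaries (camelCase).
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", header.strip())
    # Collapse every run of non-alphanumeric ASCII characters into one underscore.
    s = re.sub(r"[^0-9A-Za-z]+", "_", s)
    return s.strip("_").lower()


if __name__ == "__main__":
    for name in ["First Name", "totalAmount", "Zip-Code (US)"]:
        print(name, "->", to_snake_case(name))
```

Note that this sketch treats non-ASCII letters as separators, which a real normalizer would probably handle more carefully.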

Usage

csvnorm input.csv [options]
csvnorm -                    # read from stdin

By default, csvnorm writes to stdout for easy piping and composability with other Unix tools. Use - as input to read from stdin.

Options

Option	Description
`-o, --output-file PATH`	Write to file instead of stdout
`-f, --force`	Force overwrite of existing output file (when `-o` is specified)
`-k, --keep-names`	Keep original column names (disable snake_case)
`-d, --delimiter CHAR`	Set custom output delimiter (default: `,`)
`-s, --skip-rows N`	Skip first N rows of input file (useful for metadata/comments)
`--fix-mojibake [N]`	Fix mojibake using ftfy (optional sample size `N`; use `0` to force repair)
`--strict`	Exit with error code 1 if any validation errors occur (fail-fast mode)
`--check`	Validate CSV without processing or normalizing (exit code 0=valid, 1=invalid)
`--download-remote`	Download remote CSV locally before processing (needed for remote .zip/.gz)
`-V, --verbose`	Enable verbose output for debugging
`-v, --version`	Show version number
`-h, --help`	Show help message

Examples

# Default: output to stdout
csvnorm data.csv

# Read from stdin
cat data.csv | csvnorm -
curl -s https://example.com/data.csv | csvnorm - -o clean.csv
csvnorm - --check < data.csv

# Preview first rows
csvnorm data.csv | head -20

# Pipe to other tools
csvnorm data.csv | csvcut -c name,age | csvstat

# Save to file
csvnorm data.csv -o output.csv

# Shell redirect
csvnorm data.csv > output.csv

# Process remote CSV from URL
csvnorm "https://raw.githubusercontent.com/aborruso/csvnorm/refs/heads/main/test/Trasporto%20Pubblico%20Locale%20Settore%20Pubblico%20Allargato%20-%20Indicatore%202000-2020%20Trasferimenti%20Correnti%20su%20Entrate%20Correnti.csv" -o output.csv

# Process remote compressed CSV (download first, then handle gzip/zip locally)
csvnorm "https://example.com/data.csv.gz" --download-remote -o output.csv

# Custom delimiter
csvnorm data.csv -d ';' -o output.csv

# Keep original headers
csvnorm data.csv --keep-names -o output.csv

# Skip first 2 rows (metadata or comments)
csvnorm data.csv --skip-rows 2 -o output.csv

# Force overwrite with verbose output
csvnorm data.csv -f -V -o processed.csv

# Fix mojibake using ftfy (default sample size)
csvnorm data.csv --fix-mojibake -o fixed.csv

# Fix mojibake with custom sample size
csvnorm data.csv --fix-mojibake 4000 -o fixed.csv

# Force mojibake repair even with low badness score
csvnorm data.csv --fix-mojibake 0 -o fixed.csv

# Fail-fast mode: exit with error if validation errors occur
csvnorm data.csv --strict > output.csv || echo "Validation failed!"

# Use in pipelines where data quality is critical
# (in bash, set -o pipefail so csvnorm's exit code is not masked by other_tool)
set -o pipefail
csvnorm remote_data.csv --strict | other_tool || handle_error

# Quick validation check (no processing or output)
csvnorm data.csv --check && echo "Valid CSV" || echo "Invalid CSV"

# Check remote CSV for validity
csvnorm https://example.com/data.csv --check

# Use in CI/CD pipelines for validation
csvnorm raw_data.csv --check || exit 1
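
Because the normalized CSV goes to stdout, the same composability is available from a script. The sketch below calls the CLI from Python and parses its output; `normalize_to_rows` is a hypothetical helper, and the injectable `run` parameter exists only so the function can be exercised without csvnorm installed. Treating any non-zero exit code as a failure is an assumption of this example.

```python
import csv
import io
import subprocess
from typing import Callable, List


def normalize_to_rows(
    path: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> List[List[str]]:
    """Run `csvnorm PATH` and parse the normalized CSV written to stdout."""
    result = run(
        ["csvnorm", path],
        capture_output=True,  # keep stdout (data) and stderr (messages) apart
        text=True,
        encoding="utf-8",     # csvnorm output is UTF-8
    )
    if result.returncode != 0:
        raise RuntimeError(f"csvnorm failed: {result.stderr.strip()}")
    return list(csv.reader(io.StringIO(result.stdout, newline="")))


if __name__ == "__main__":
    for row in normalize_to_rows("data.csv")[:5]:
        print(row)
```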

Output

Default behavior (stdout):

Writes normalized CSV to stdout
Progress and errors go to stderr
Validation errors (if any) are written to stderr before the output data
The reject file is saved to ./reject_errors.csv in the current working directory
Perfect for piping to other tools or shell redirection

File output (with -o):

Creates a normalized CSV file at the specified path with:
- UTF-8 encoding
- Consistent field delimiters
- Normalized column names (unless --keep-names is specified)
Writes an error report if any invalid rows are found (saved as {output_name}_reject_errors.csv in the same directory as the output file)
Shows a success table with statistics (rows, columns, file sizes)
Supports absolute and relative paths
Any file extension is allowed (not limited to .csv)
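
When scripting around csvnorm it can help to know where the reject file will land. The following sketch derives the expected location from the naming described above; `expected_reject_path` is a hypothetical helper, and reading `{output_name}` as the output file name without its final extension is an assumption based on the `output/malformed_rows_reject_errors.csv` example in this document.

```python
from pathlib import Path
from typing import Optional


def expected_reject_path(output_file: Optional[str] = None) -> Path:
    """Guess where csvnorm saves rejected rows (assumed naming, see lead-in)."""
    if output_file is None:
        # stdout mode: reject file goes to the current working directory
        return Path("reject_errors.csv")
    out = Path(output_file)
    # -o mode: same directory as the output, name derived from the output stem
    return out.with_name(f"{out.stem}_reject_errors.csv")


if __name__ == "__main__":
    print(expected_reject_path())
    print(expected_reject_path("output/malformed_rows.csv"))
```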

Input file protection:

csvnorm will never overwrite the input file, even with --force
If you try to use the same path for input and output, you'll get an error
Use -o to specify a different output path
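
A wrapper script can apply the same guard before invoking the tool. This is a minimal sketch of such a check, not csvnorm's own logic; `is_same_path` is a hypothetical name, and comparing resolved paths is just one reasonable way to detect the clash.

```python
from pathlib import Path


def is_same_path(input_path: str, output_path: str) -> bool:
    """Return True when both arguments point at the same file location."""
    # resolve() normalizes "./", "..", and symlinks; the files need not exist.
    return Path(input_path).resolve() == Path(output_path).resolve()


if __name__ == "__main__":
    if is_same_path("data.csv", "./data.csv"):
        print("Refusing to overwrite the input file; choose another -o path.")
```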

Remote URLs:

Encoding is handled automatically by DuckDB
If --fix-mojibake is enabled, the URL is downloaded to a temp file first
HTTP timeout is set to 30 seconds
Only public URLs are supported (no authentication)

Mojibake repair (--fix-mojibake [N]):

Mojibake is garbled text produced by decoding bytes with the wrong character encoding (e.g., CittÃ instead of Città).
--fix-mojibake enables optional repair with ftfy, for text that has already been misdecoded.
N is the sample size (number of characters) used by the detector; the default is 5000.
The repair runs only when ftfy's badness heuristic flags the sample as "bad."
Use N=0 to force repair without detection (useful for files with low badness scores but visible mojibake).
Note: ftfy cannot recover bytes that were irreversibly lost in the original encoding. Replacement characters (�) may remain where data was corrupted beyond repair.
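
The mojibake mechanism described in this section can be reproduced with the Python standard library alone. The sketch below shows how the Città example arises, how a clean round trip undoes it, and why replacement characters cannot be repaired. It only illustrates the failure mode; ftfy uses its own heuristics, and nothing here reflects how csvnorm calls it.

```python
original = "Città"

# UTF-8 bytes wrongly decoded as Latin-1: "à" (0xC3 0xA0) becomes "Ã" plus
# a non-breaking space, which is why the garbled text looks like "CittÃ".
garbled = original.encode("utf-8").decode("latin-1")

# When no bytes were lost, reversing the wrong decode restores the text.
repaired = garbled.encode("latin-1").decode("utf-8")

# Latin-1 bytes decoded as UTF-8 with errors="replace" lose information:
# the invalid byte becomes U+FFFD and the original letter is gone.
lossy = original.encode("latin-1").decode("utf-8", errors="replace")

if __name__ == "__main__":
    print(repr(garbled))
    print(repr(repaired))
    print(repr(lossy))
```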

The tool provides modern terminal output (shown only when using -o to write to a file) with:

Progress indicators for multi-step processing
Color-coded error messages with panels
Success summary table with statistics (rows, columns, file sizes)
Encoding conversion status (converted/no conversion/remote; ASCII is already UTF-8 compatible)
Error summary panel with reject count and error types when validation fails
ASCII art banner with --version and -V verbose mode

Success Example: (shown only when using -o)

 ✓ Success
 Input:        test/utf8_basic.csv
 Output:       output/utf8_basic.csv
 Encoding:     ascii (ASCII is UTF-8 compatible; no conversion needed)
 Rows:         2
 Columns:      3
 Input size:   42 B
 Output size:  43 B
 Headers:      normalized to snake_case

Error Example: (shown only when using -o)

 ✓ Success
 Input:        test/malformed_rows.csv
 Output:       output/malformed_rows.csv
 Encoding:     ascii (ASCII is UTF-8 compatible; no conversion needed)
 Rows:         1
 Columns:      4
 Input size:   24 B
 Output size:  40 B
 Headers:      normalized to snake_case

╭──────────────────────────── ! Validation Failed ─────────────────────────────╮
│ Validation Errors:                                                           │
│                                                                              │
│ Rejected rows: 2                                                             │
│                                                                              │
│ Error types:                                                                 │
│   • Expected Number of Columns: 3 Found: 2                                   │
│   • Expected Number of Columns: 3 Found: 4                                   │
│                                                                              │
│ Details: output/malformed_rows_reject_errors.csv                             │
╰──────────────────────────────────────────────────────────────────────────────╯

Exit Codes

Code	Meaning
0	Success
1	Error (validation failed, file not found, etc.)
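
These exit codes make it straightforward to gate a script on validity. The following is a small sketch, assuming csvnorm is on PATH; `csv_is_valid` is a hypothetical wrapper, and the `run` parameter is injectable only to allow testing without the CLI.

```python
import subprocess
from typing import Callable


def csv_is_valid(
    source: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True when `csvnorm SOURCE --check` exits with code 0."""
    completed = run(["csvnorm", source, "--check"], capture_output=True, text=True)
    return completed.returncode == 0


if __name__ == "__main__":
    import sys

    # Mirror the CI usage: fail the job when the file is not a valid CSV.
    sys.exit(0 if csv_is_valid("raw_data.csv") else 1)
```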

Requirements

Python 3.9+
Dependencies (automatically installed):
- charset-normalizer>=3.0.0 - Encoding detection
- duckdb>=0.9.0 - CSV validation and normalization
- ftfy>=6.3.1 - Mojibake repair
- rich>=13.0.0 - Modern terminal output formatting
- rich-argparse>=1.0.0 - Enhanced CLI help formatting

Optional extras:

[dev] - Development dependencies (pytest>=7.0.0, pytest-cov>=4.0.0, ruff>=0.1.0)

Development

Setup

git clone https://github.com/aborruso/csvnorm
cd csvnorm

# Create and activate venv with uv (recommended)
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Or with pip
pip install -e ".[dev]"

Testing

pytest tests/ -v

Project Structure

csvnorm/
├── src/csvnorm/
│   ├── __init__.py      # Package version
│   ├── __main__.py      # python -m support
│   ├── cli.py           # CLI argument parsing
│   ├── core.py          # Main processing pipeline
│   ├── encoding.py      # Encoding detection/conversion
│   ├── validation.py    # DuckDB validation
│   └── utils.py         # Helper functions
├── tests/               # Test suite
├── test/                # CSV fixtures
└── pyproject.toml       # Package configuration

Stay Updated

Get notified of new releases

Watch → Custom → ✓ Releases to receive notifications for all new versions.

Get notified of breaking changes only

Subscribe to Announcements to be notified only about:

Breaking changes (major version bumps)
Security updates
Important deprecation notices

We follow Semantic Versioning:

MAJOR (e.g., 1.0.0 → 2.0.0): Breaking changes
MINOR (e.g., 1.0.0 → 1.1.0): New features, backward compatible
PATCH (e.g., 1.0.0 → 1.0.1): Bug fixes only

See docs/COMMUNICATION.md for details.

License

Acknowledgments

csvnorm is built on top of excellent open-source libraries:

charset-normalizer - Universal encoding detection
DuckDB - Fast in-process analytical database for CSV processing
ftfy - Fixes mojibake and other text encoding issues
Rich - Beautiful terminal output formatting
rich-argparse - Enhanced CLI help formatting

We are grateful to the creators and maintainers of these libraries, without whom csvnorm would not exist.


