A command-line utility to validate and normalize CSV files

csvnorm

A command-line utility to validate and normalize CSV files for initial exploration.

Version 1.0 Breaking Change

If upgrading from v0.x: The default output has changed from file to stdout for better Unix composability.

# v0.x behavior
csvnorm data.csv              # Created data.csv in current directory

# v1.0 behavior (NEW)
csvnorm data.csv              # Outputs to stdout
csvnorm data.csv -o output.csv  # Explicitly save to file
csvnorm data.csv > output.csv   # Or use shell redirect (never redirect onto the input file)

This follows the Unix philosophy and matches tools like jq, csvkit, and xsv.

Installation

Recommended (uv):

uv tool install csvnorm

Or with pip:

pip install csvnorm

Purpose

This tool prepares CSV files for basic exploratory data analysis (EDA), not for complex transformations. It focuses on achieving a clean, standardized baseline format that allows you to quickly assess data quality and structure before designing more sophisticated ETL pipelines.

What it does:

  • Validates CSV structure and reports errors
  • Normalizes encoding to UTF-8 when needed
  • Normalizes delimiters and field names
  • Creates a consistent starting point for data exploration
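To make the idea of a standardized baseline concrete, the sketch below re-emits a small semicolon-delimited, Latin-1 byte string as comma-delimited UTF-8 using only Python's standard library. It is an illustration, not csvnorm's implementation: csvnorm detects the encoding and delimiter itself and relies on DuckDB, while the hypothetical helper `normalize_bytes` takes both as explicit arguments.

```python
import csv
import io


def normalize_bytes(raw: bytes, encoding: str, delimiter: str) -> bytes:
    """Decode raw CSV bytes and re-emit them as comma-delimited UTF-8.

    Hypothetical helper: the source encoding and delimiter are passed in
    explicitly here, whereas a real tool would detect them.
    """
    text = raw.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    out = io.StringIO(newline="")
    # QUOTE_MINIMAL (the default) quotes only fields that contain a comma,
    # a quote character, or a line break.
    writer = csv.writer(out, delimiter=",", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue().encode("utf-8")


sample = "name;city\nAnna;Città\nLuca;Roma, IT\n".encode("latin-1")
print(normalize_bytes(sample, "latin-1", ";").decode("utf-8"))
```

Note that the field `Roma, IT` gains quotes in the output, because a comma inside a value would otherwise be read as a separator once the delimiter changes.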

What it doesn't do:

  • Complex data transformations or business logic
  • Type inference or data validation beyond structure
  • Heavy processing or aggregations

Features

  • CSV Validation: Checks for common CSV errors and inconsistencies using DuckDB
  • Delimiter Normalization: Converts all field separators to standard commas (,)
  • Field Name Normalization: Converts column headers to snake_case format
  • Encoding Normalization: Auto-detects encoding and converts to UTF-8 when needed (ASCII is already UTF-8 compatible)
  • Processing Summary: Displays comprehensive statistics (rows, columns, file sizes) and error details
  • Error Reporting: Exports detailed error file for invalid rows with summary panel
  • Remote URL Support: Process CSV files directly from HTTP/HTTPS URLs without downloading them first (unless --fix-mojibake or --download-remote is used)
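As a rough picture of what snake_case header normalization involves, here is a minimal Python sketch. The function `to_snake_case` and its rules (accent stripping, camelCase splitting, collapsing punctuation) are assumptions chosen for illustration; csvnorm's actual rules may differ, so check its output on your own headers.

```python
import re
import unicodedata


def to_snake_case(header: str) -> str:
    """Convert a column header to snake_case (illustrative rule set only)."""
    # Strip accents so that "Città" becomes "Citta".
    ascii_text = (
        unicodedata.normalize("NFKD", header)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Insert an underscore at lowercase/digit -> uppercase boundaries (camelCase).
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", ascii_text)
    # Collapse every run of non-alphanumeric characters into one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", spaced)
    return cleaned.strip("_").lower()


for name in ["Total Amount (EUR)", "firstName", "Città di Nascita"]:
    print(f"{name!r} -> {to_snake_case(name)!r}")
```

If the original headers matter downstream, the -k / --keep-names option leaves them untouched.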

Usage

csvnorm input.csv [options]
csvnorm -                    # read from stdin

By default, csvnorm writes to stdout for easy piping and composability with other Unix tools. Use - as input to read from stdin.

Options

Option                    Description
-o, --output-file PATH    Write to file instead of stdout
-f, --force               Force overwrite of existing output file (when -o is specified)
-k, --keep-names          Keep original column names (disable snake_case)
-d, --delimiter CHAR      Set custom output delimiter (default: ,)
-s, --skip-rows N         Skip first N rows of input file (useful for metadata/comments)
--fix-mojibake [N]        Fix mojibake using ftfy (optional sample size N; use 0 to force repair)
--strict                  Exit with error code 1 if any validation errors occur (fail-fast mode)
--check                   Validate CSV without processing or normalizing (exit code 0=valid, 1=invalid)
--download-remote         Download remote CSV locally before processing (needed for remote .zip/.gz)
-V, --verbose             Enable verbose output for debugging
-v, --version             Show version number
-h, --help                Show help message

Examples

# Default: output to stdout
csvnorm data.csv

# Read from stdin
cat data.csv | csvnorm -
curl -s https://example.com/data.csv | csvnorm - -o clean.csv
csvnorm - --check < data.csv

# Preview first rows
csvnorm data.csv | head -20

# Pipe to other tools
csvnorm data.csv | csvcut -c name,age | csvstat

# Save to file
csvnorm data.csv -o output.csv

# Shell redirect
csvnorm data.csv > output.csv

# Process remote CSV from URL
csvnorm "https://raw.githubusercontent.com/aborruso/csvnorm/refs/heads/main/test/Trasporto%20Pubblico%20Locale%20Settore%20Pubblico%20Allargato%20-%20Indicatore%202000-2020%20Trasferimenti%20Correnti%20su%20Entrate%20Correnti.csv" -o output.csv

# Process remote compressed CSV (download first, then handle gzip/zip locally)
csvnorm "https://example.com/data.csv.gz" --download-remote -o output.csv

# Custom delimiter
csvnorm data.csv -d ';' -o output.csv

# Keep original headers
csvnorm data.csv --keep-names -o output.csv

# Skip first 2 rows (metadata or comments)
csvnorm data.csv --skip-rows 2 -o output.csv

# Force overwrite with verbose output
csvnorm data.csv -f -V -o processed.csv

# Fix mojibake using ftfy (default sample size)
csvnorm data.csv --fix-mojibake -o fixed.csv

# Fix mojibake with custom sample size
csvnorm data.csv --fix-mojibake 4000 -o fixed.csv

# Force mojibake repair even with low badness score
csvnorm data.csv --fix-mojibake 0 -o fixed.csv

# Fail-fast mode: exit with error if validation errors occur
csvnorm data.csv --strict > output.csv || echo "Validation failed!"

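# Use in pipelines where data quality is critical
# (pipefail keeps a csvnorm failure from being masked by the last command's exit status)
set -o pipefail
csvnorm remote_data.csv --strict | other_tool || handle_error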

# Quick validation check (no processing or output)
csvnorm data.csv --check && echo "Valid CSV" || echo "Invalid CSV"

# Check remote CSV for validity
csvnorm https://example.com/data.csv --check

# Use in CI/CD pipelines for validation
csvnorm raw_data.csv --check || exit 1

Output

Default behavior (stdout):

  • Writes normalized CSV to stdout
  • Progress and errors go to stderr
  • Validation errors (if any) are written to stderr before the output data
  • If rows are rejected, the reject file is saved to ./reject_errors.csv in the current working directory
  • Perfect for piping to other tools or shell redirection

File output (with -o):

  • Creates a normalized CSV file at the specified path with:
    • UTF-8 encoding
    • Consistent field delimiters
    • Normalized column names (unless --keep-names is specified)
  • Error report if any invalid rows are found (saved as {output_name}_reject_errors.csv in the same directory)
  • Shows success table with statistics (rows, columns, file sizes)
  • Supports absolute and relative paths
  • Any file extension is allowed (not limited to .csv)
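The reject-file naming described above can be sketched as a small Python helper, for example to locate the error report from a wrapper script. The function `reject_path_for` is hypothetical, and it assumes that {output_name} means the output file name without its extension, which matches the malformed_rows example in this README but is not otherwise confirmed here.

```python
from pathlib import Path
from typing import Optional, Union


def reject_path_for(output: Optional[Union[str, Path]] = None) -> Path:
    """Guess where the reject file lands for a given -o path.

    Assumption: {output_name} is the output file name without its extension.
    """
    if output is None:
        # stdout mode: documented as ./reject_errors.csv in the working directory.
        return Path("reject_errors.csv")
    out = Path(output)
    # Same directory as the output file, with a "_reject_errors.csv" suffix.
    return out.with_name(f"{out.stem}_reject_errors.csv")


print(reject_path_for("output/malformed_rows.csv"))
print(reject_path_for())
```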

Input file protection:

  • csvnorm will never overwrite the input file, even with --force
  • If you try to use the same path for input and output, you'll get an error
  • Use -o to specify a different output path

Remote URLs:

  • Encoding is handled automatically by DuckDB
  • If --fix-mojibake is enabled, the URL is downloaded to a temp file first

Mojibake repair (--fix-mojibake [N]):

  • Mojibake is garbled text produced by decoding bytes with the wrong character encoding (e.g., Città instead of Città).
  • Enables optional mojibake repair using ftfy (for already-misdecoded text).
  • N is the sample size (number of characters) used by the detector; default is 5000.
  • The repair runs only when ftfy's badness heuristic flags the sample as "bad."
  • Use N=0 to force repair without detection (useful for files with low badness scores but visible mojibake).
  • Note: ftfy cannot recover bytes that were irreversibly lost in the original encoding. Replacement characters () may remain where data was corrupted beyond repair.
  • HTTP timeout is set to 30 seconds
  • Only public URLs are supported (no authentication)
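The following standard-library Python sketch shows how mojibake arises and why some damage cannot be undone. It reproduces one specific mistake (UTF-8 bytes read as Latin-1) and reverses it by hand; ftfy handles many more cases heuristically, so this is an illustration of the concept, not of how ftfy or csvnorm work internally. The helper names are made up for the example.

```python
def make_mojibake(text: str) -> str:
    """Simulate mojibake: UTF-8 bytes wrongly decoded as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def undo_mojibake(garbled: str) -> str:
    """Reverse the specific UTF-8-read-as-Latin-1 mistake."""
    return garbled.encode("latin-1").decode("utf-8")


garbled = make_mojibake("Città")
print(repr(garbled))           # the accented letter turns into two characters
print(undo_mojibake(garbled))  # round-trips back to the original text

# Irreversible case: the bytes were decoded with errors="replace", so the
# original byte values are gone and only U+FFFD placeholders remain.
lossy = "Città".encode("utf-8").decode("ascii", errors="replace")
print(repr(lossy))
```

The first case is repairable because every original byte is still recoverable from the garbled string; in the second, the replacement characters carry no information about what they replaced.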

The tool provides modern terminal output (shown only when using -o to write to a file) with:

  • Progress indicators for multi-step processing
  • Color-coded error messages with panels
  • Success summary table with statistics (rows, columns, file sizes)
  • Encoding conversion status (converted/no conversion/remote; ASCII is already UTF-8 compatible)
  • Error summary panel with reject count and error types when validation fails
  • ASCII art banner with --version and -V verbose mode

Success Example: (shown only when using -o)

 ✓ Success
 Input:        test/utf8_basic.csv
 Output:       output/utf8_basic.csv
 Encoding:     ascii (ASCII is UTF-8 compatible; no conversion needed)
 Rows:         2
 Columns:      3
 Input size:   42 B
 Output size:  43 B
 Headers:      normalized to snake_case

Error Example: (shown only when using -o)

 ✓ Success
 Input:        test/malformed_rows.csv
 Output:       output/malformed_rows.csv
 Encoding:     ascii (ASCII is UTF-8 compatible; no conversion needed)
 Rows:         1
 Columns:      4
 Input size:   24 B
 Output size:  40 B
 Headers:      normalized to snake_case

╭──────────────────────────── ! Validation Failed ─────────────────────────────╮
│ Validation Errors:                                                           │
│                                                                              │
│ Rejected rows: 2                                                             │
│                                                                              │
│ Error types:                                                                 │
│   • Expected Number of Columns: 3 Found: 2                                   │
│   • Expected Number of Columns: 3 Found: 4                                   │
│                                                                              │
│ Details: output/malformed_rows_reject_errors.csv                             │
╰──────────────────────────────────────────────────────────────────────────────╯

Exit Codes

Code    Meaning
0       Success
1       Error (validation failed, file not found, etc.)
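When calling csvnorm from a Python script rather than a shell, the exit code can be read through subprocess. The sketch below wraps the documented --check flag; the function `check_csv` and its `command` parameter are hypothetical conveniences (the parameter exists only so the executable can be swapped out, for example in tests), and the code assumes csvnorm is installed and on PATH.

```python
import subprocess
from typing import Sequence


def check_csv(path: str, command: Sequence[str] = ("csvnorm",)) -> bool:
    """Return True when `csvnorm PATH --check` exits with code 0."""
    result = subprocess.run(
        [*command, path, "--check"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,  # inspect the return code instead of raising
    )
    return result.returncode == 0
```

A call such as check_csv("raw_data.csv") then mirrors the shell idiom csvnorm raw_data.csv --check || exit 1. Because exit code 1 covers every error, a False result does not distinguish an invalid CSV from a missing file.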

Requirements

  • Python 3.9+
  • Dependencies (automatically installed):
    • charset-normalizer>=3.0.0 - Encoding detection
    • duckdb>=0.9.0 - CSV validation and normalization
    • ftfy>=6.3.1 - Mojibake repair
    • rich>=13.0.0 - Modern terminal output formatting
    • rich-argparse>=1.0.0 - Enhanced CLI help formatting

Optional extras:

  • [dev] - Development dependencies (pytest>=7.0.0, pytest-cov>=4.0.0, ruff>=0.1.0)

Development

Setup

git clone https://github.com/aborruso/csvnorm
cd csvnorm

# Create and activate venv with uv (recommended)
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Or with pip
pip install -e ".[dev]"

Testing

pytest tests/ -v

Project Structure

csvnorm/
├── src/csvnorm/
│   ├── __init__.py      # Package version
│   ├── __main__.py      # python -m support
│   ├── cli.py           # CLI argument parsing
│   ├── core.py          # Main processing pipeline
│   ├── encoding.py      # Encoding detection/conversion
│   ├── validation.py    # DuckDB validation
│   └── utils.py         # Helper functions
├── tests/               # Test suite
├── test/                # CSV fixtures
└── pyproject.toml       # Package configuration

Stay Updated

Get notified of new releases

Watch → Custom → ✓ Releases to receive notifications for all new versions.

Get notified of breaking changes only

Subscribe to Announcements to be notified only about:

  • Breaking changes (major version bumps)
  • Security updates
  • Important deprecation notices

We follow Semantic Versioning:

  • MAJOR (e.g., 1.0.0 → 2.0.0): Breaking changes
  • MINOR (e.g., 1.0.0 → 1.1.0): New features, backward compatible
  • PATCH (e.g., 1.0.0 → 1.0.1): Bug fixes only

See docs/COMMUNICATION.md for details.

License

MIT License (c) 2026 aborruso@gmail.com - See LICENSE file for details

Acknowledgments

csvnorm is built on top of excellent open-source libraries:

  • charset-normalizer - Universal encoding detection
  • DuckDB - Fast in-process analytical database for CSV processing
  • ftfy - Fixes mojibake and other text encoding issues
  • Rich - Beautiful terminal output formatting
  • rich-argparse - Enhanced CLI help formatting

We are grateful to the creators and maintainers of these libraries, without whom csvnorm would not exist.
