csvnorm
A command-line utility to validate and normalize CSV files for initial exploration.
Version 1.0 Breaking Change
If upgrading from v0.x: The default output has changed from file to stdout for better Unix composability.
# v0.x behavior
csvnorm data.csv # Created data.csv in current directory
# v1.0 behavior (NEW)
csvnorm data.csv # Outputs to stdout
csvnorm data.csv -o output.csv # Explicitly save to file
csvnorm data.csv > output.csv # Or use shell redirect
This follows the Unix philosophy and matches tools like jq, csvkit, and xsv.
Installation
Recommended (uv):
uv tool install csvnorm
Or with pip:
pip install csvnorm
Purpose
This tool prepares CSV files for basic exploratory data analysis (EDA), not for complex transformations. It focuses on achieving a clean, standardized baseline format that allows you to quickly assess data quality and structure before designing more sophisticated ETL pipelines.
What it does:
- Validates CSV structure and reports errors
- Normalizes encoding to UTF-8 when needed
- Normalizes delimiters and field names
- Creates a consistent starting point for data exploration
What it doesn't do:
- Complex data transformations or business logic
- Type inference or data validation beyond structure
- Heavy processing or aggregations
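To make the encoding and delimiter steps from "What it does" concrete, here is a minimal sketch using only the Python standard library. It is not csvnorm's implementation (the tool relies on DuckDB and on automatic encoding detection); the function name `normalize_csv_bytes` is hypothetical, and the source encoding and delimiter are passed in explicitly as an assumption instead of being detected.

```python
import csv
import io


def normalize_csv_bytes(raw: bytes, encoding: str, delimiter: str) -> str:
    """Hypothetical helper: decode bytes and rewrite them as comma-separated text."""
    # Decode using the (assumed known) source encoding; the result is a Python
    # str, which can later be written out as UTF-8.
    text = raw.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    out = io.StringIO(newline="")
    # Re-emit every row with a comma delimiter; the csv module quotes fields
    # that themselves contain commas.
    writer = csv.writer(out, delimiter=",", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


raw = "nome;città\nAnna;Forlì\n".encode("latin-1")
print(normalize_csv_bytes(raw, encoding="latin-1", delimiter=";"))
```

Header renaming and structural validation are left out of this sketch on purpose.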
Features
- CSV Validation: Checks for common CSV errors and inconsistencies using DuckDB
- Delimiter Normalization: Converts all field separators to standard commas (,)
- Field Name Normalization: Converts column headers to snake_case format
- Encoding Normalization: Auto-detects encoding and converts to UTF-8 when needed (ASCII is already UTF-8 compatible)
- Processing Summary: Displays comprehensive statistics (rows, columns, file sizes) and error details
- Error Reporting: Exports detailed error file for invalid rows with summary panel
- Remote URL Support: Process CSV files directly from HTTP/HTTPS URLs without downloading (unless --fix-mojibake is used)
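As an illustration of the field name normalization listed above, the following sketch shows one plausible way to turn headers into snake_case. The exact rules csvnorm applies are not documented here, so `to_snake_case` is a hypothetical helper, and both the accent stripping and the camelCase splitting are assumptions.

```python
import re
import unicodedata


def to_snake_case(header: str) -> str:
    """Hypothetical helper: approximate a snake_case column name."""
    # Drop accents so "Città" becomes "Citta" (an assumption, not confirmed
    # csvnorm behavior).
    ascii_text = (
        unicodedata.normalize("NFKD", header)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Mark camelCase boundaries: "userAge" -> "user_Age".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", ascii_text)
    # Collapse each run of non-alphanumeric characters into one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", spaced)
    return cleaned.strip("_").lower()


headers = ["First Name", "userAge", "E-mail Address ", "Città"]
print([to_snake_case(h) for h in headers])
# → ['first_name', 'user_age', 'e_mail_address', 'citta']
```

To compare against the tool's real output, run csvnorm with and without --keep-names on the same file.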
Usage
csvnorm input.csv [options]
By default, csvnorm writes to stdout for easy piping and composability with other Unix tools.
Options
| Option | Description |
|---|---|
| -o, --output-file PATH | Write to file instead of stdout |
| -f, --force | Force overwrite of existing output file (when -o is specified) |
| -k, --keep-names | Keep original column names (disable snake_case) |
| -d, --delimiter CHAR | Set custom output delimiter (default: ,) |
| -s, --skip-rows N | Skip first N rows of input file (useful for metadata/comments) |
| --fix-mojibake [N] | Fix mojibake using ftfy (optional sample size N; use 0 to force repair) |
| --strict | Exit with error code 1 if any validation errors occur (fail-fast mode) |
| --check | Validate CSV without processing or normalizing (exit code 0=valid, 1=invalid) |
| --download-remote | Download remote CSV locally before processing (needed for remote .zip/.gz) |
| -V, --verbose | Enable verbose output for debugging |
| -v, --version | Show version number |
| -h, --help | Show help message |
Examples
# Default: output to stdout
csvnorm data.csv
# Preview first rows
csvnorm data.csv | head -20
# Pipe to other tools
csvnorm data.csv | csvcut -c name,age | csvstat
# Save to file
csvnorm data.csv -o output.csv
# Shell redirect
csvnorm data.csv > output.csv
# Process remote CSV from URL
csvnorm "https://raw.githubusercontent.com/aborruso/csvnorm/refs/heads/main/test/Trasporto%20Pubblico%20Locale%20Settore%20Pubblico%20Allargato%20-%20Indicatore%202000-2020%20Trasferimenti%20Correnti%20su%20Entrate%20Correnti.csv" -o output.csv
# Process remote compressed CSV (download first, then handle gzip/zip locally)
csvnorm "https://example.com/data.csv.gz" --download-remote -o output.csv
# Custom delimiter
csvnorm data.csv -d ';' -o output.csv
# Keep original headers
csvnorm data.csv --keep-names -o output.csv
# Skip first 2 rows (metadata or comments)
csvnorm data.csv --skip-rows 2 -o output.csv
# Force overwrite with verbose output
csvnorm data.csv -f -V -o processed.csv
# Fix mojibake using ftfy (default sample size)
csvnorm data.csv --fix-mojibake -o fixed.csv
# Fix mojibake with custom sample size
csvnorm data.csv --fix-mojibake 4000 -o fixed.csv
# Force mojibake repair even with low badness score
csvnorm data.csv --fix-mojibake 0 -o fixed.csv
# Fail-fast mode: exit with error if validation errors occur
csvnorm data.csv --strict > output.csv || echo "Validation failed!"
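# Use in pipelines where data quality is critical
# (pipefail makes a csvnorm failure fail the whole pipeline, not just other_tool's status)
set -o pipefail
csvnorm remote_data.csv --strict | other_tool || handle_error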
# Quick validation check (no processing or output)
csvnorm data.csv --check && echo "Valid CSV" || echo "Invalid CSV"
# Check remote CSV for validity
csvnorm https://example.com/data.csv --check
# Use in CI/CD pipelines for validation
csvnorm raw_data.csv --check || exit 1
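When csvnorm is driven from a Python script instead of a shell, the command lines above can be assembled as argument lists. The sketch below is one way to do that; `build_csvnorm_args` and `run_csvnorm` are hypothetical helpers, and only flags from the Options table are used.

```python
import shutil
import subprocess
from typing import List, Optional


def build_csvnorm_args(
    source: str,
    output: Optional[str] = None,
    strict: bool = False,
    skip_rows: int = 0,
) -> List[str]:
    """Hypothetical helper: build an argv list for the csvnorm CLI."""
    args = ["csvnorm", source]
    if output is not None:
        args += ["-o", output]  # write to a file instead of stdout
    if strict:
        args.append("--strict")  # exit code 1 on validation errors
    if skip_rows > 0:
        args += ["--skip-rows", str(skip_rows)]
    return args


def run_csvnorm(args: List[str]) -> int:
    """Run the command and return its exit code (requires csvnorm on PATH)."""
    if shutil.which(args[0]) is None:
        raise FileNotFoundError("csvnorm is not on PATH")
    return subprocess.run(args).returncode


print(build_csvnorm_args("data.csv", output="output.csv", strict=True, skip_rows=2))
```

Passing a list to `subprocess.run` avoids shell quoting problems with long URLs or file names that contain spaces.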
Output
Default behavior (stdout):
- Writes normalized CSV to stdout
- Progress and errors go to stderr
- Validation errors (if any) are shown to stderr before the output data
- Reject file saved to ./reject_errors.csv in current working directory
- Perfect for piping to other tools or shell redirection
File output (with -o):
- Creates a normalized CSV file at the specified path with:
  - UTF-8 encoding
  - Consistent field delimiters
  - Normalized column names (unless --keep-names is specified)
- Error report if any invalid rows are found (saved as {output_name}_reject_errors.csv in the same directory)
- Shows success table with statistics (rows, columns, file sizes)
- Supports absolute and relative paths
- Any file extension is allowed (not limited to .csv)
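A script that needs to locate the error report can derive its path from the output path using the {output_name}_reject_errors.csv pattern. The sketch below assumes {output_name} is the output file name without its final extension, which matches the error example in this README (output/malformed_rows.csv gives output/malformed_rows_reject_errors.csv). `reject_path_for` is a hypothetical helper, and its behavior for names with several extensions is a guess.

```python
from pathlib import Path


def reject_path_for(output_file: str) -> Path:
    """Hypothetical helper: expected location of the reject file for an -o path."""
    out = Path(output_file)
    # Same directory as the output; stem (name without the last suffix) plus
    # the fixed "_reject_errors.csv" ending.
    return out.with_name(f"{out.stem}_reject_errors.csv")


print(reject_path_for("output/malformed_rows.csv").as_posix())
# → output/malformed_rows_reject_errors.csv
```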
Input file protection:
- csvnorm will never overwrite the input file, even with --force
- If you try to use the same path for input and output, you'll get an error
- Use -o to specify a different output path
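A wrapper script can run the same kind of check before calling the tool. The following is a minimal sketch of how "same path for input and output" might be detected; it is not taken from csvnorm's source, and `is_same_file` is a hypothetical name.

```python
import os
from pathlib import Path


def is_same_file(input_path: str, output_path: str) -> bool:
    """Hypothetical guard: True if both paths point at the same file."""
    src, dst = Path(input_path), Path(output_path)
    if src.exists() and dst.exists():
        # samefile compares the underlying files, so it also catches
        # symlinks and hard links.
        return os.path.samefile(src, dst)
    # The output usually does not exist yet: compare resolved absolute paths.
    return src.resolve() == dst.resolve()
```

For example, `is_same_file("data.csv", "./data.csv")` is true, while `is_same_file("data.csv", "output.csv")` is false.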
Remote URLs:
- Encoding is handled automatically by DuckDB
- If --fix-mojibake is enabled, the URL is downloaded to a temp file first
- HTTP timeout is set to 30 seconds
- Only public URLs are supported (no authentication)
Mojibake repair (--fix-mojibake [N]):
- Mojibake is garbled text produced by decoding bytes with the wrong character encoding (e.g., CittÃ instead of Città).
- Enables optional mojibake repair using ftfy (for already-misdecoded text).
- N is the sample size (number of characters) used by the detector; default is 5000.
- The repair runs only when ftfy's badness heuristic flags the sample as "bad."
- Use N=0 to force repair without detection (useful for files with low badness scores but visible mojibake).
- Note: ftfy cannot recover bytes that were irreversibly lost in the original encoding. Replacement characters (�) may remain where data was corrupted beyond repair.
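The mojibake mechanism described above can be reproduced with the standard library alone. This sketch shows how the Città example gets garbled, how a reversible case can be undone, and why a replacement character cannot be recovered. It does not use ftfy, whose detection and repair logic is considerably more general.

```python
original = "Città"

# Wrong decode: the UTF-8 bytes of "à" (0xC3 0xA0) read as Latin-1 become
# "Ã" followed by a non-breaking space.
garbled = original.encode("utf-8").decode("latin-1")
print(repr(garbled))

# Reversible case: re-encode with the wrong codec, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # → True

# Irreversible case: Latin-1 bytes decoded as UTF-8 with errors="replace".
# The invalid byte becomes U+FFFD and the original byte value is lost.
lossy = original.encode("latin-1").decode("utf-8", errors="replace")
print(repr(lossy))
```

The second character of the garbled pair is an invisible non-breaking space, which is why the example appears as CittÃ in most terminals.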
The tool provides modern terminal output (shown only when using -o to write to a file) with:
- Progress indicators for multi-step processing
- Color-coded error messages with panels
- Success summary table with statistics (rows, columns, file sizes)
- Encoding conversion status (converted/no conversion/remote; ASCII is already UTF-8 compatible)
- Error summary panel with reject count and error types when validation fails
- ASCII art banner with --version and -V verbose mode
Success Example: (shown only when using -o)
✓ Success
Input: test/utf8_basic.csv
Output: output/utf8_basic.csv
Encoding: ascii (ASCII is UTF-8 compatible; no conversion needed)
Rows: 2
Columns: 3
Input size: 42 B
Output size: 43 B
Headers: normalized to snake_case
Error Example: (shown only when using -o)
✓ Success
Input: test/malformed_rows.csv
Output: output/malformed_rows.csv
Encoding: ascii (ASCII is UTF-8 compatible; no conversion needed)
Rows: 1
Columns: 4
Input size: 24 B
Output size: 40 B
Headers: normalized to snake_case
╭──────────────────────────── ! Validation Failed ─────────────────────────────╮
│ Validation Errors: │
│ │
│ Rejected rows: 2 │
│ │
│ Error types: │
│ • Expected Number of Columns: 3 Found: 2 │
│ • Expected Number of Columns: 3 Found: 4 │
│ │
│ Details: output/malformed_rows_reject_errors.csv │
╰──────────────────────────────────────────────────────────────────────────────╯
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Error (validation failed, file not found, etc.) |
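From Python, the exit code can be read off the finished process. The sketch below wraps `csvnorm --check`; `csv_is_valid` is a hypothetical helper, and the `run` parameter exists only so the function can be exercised without csvnorm installed.

```python
import subprocess
from typing import Callable


def csv_is_valid(path: str, run: Callable = subprocess.run) -> bool:
    """Hypothetical helper: True when `csvnorm PATH --check` exits with code 0."""
    result = run(["csvnorm", path, "--check"])
    # Per the table above: 0 means success, 1 means an error
    # (validation failed, file not found, etc.).
    return result.returncode == 0
```

Because exit code 1 covers both invalid data and problems such as a missing file, a `False` result alone does not say which of the two occurred.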
Requirements
- Python 3.9+
- Dependencies (automatically installed):
  - charset-normalizer>=3.0.0 - Encoding detection
  - duckdb>=0.9.0 - CSV validation and normalization
  - ftfy>=6.3.1 - Mojibake repair
  - rich>=13.0.0 - Modern terminal output formatting
  - rich-argparse>=1.0.0 - Enhanced CLI help formatting
Optional extras:
- [dev] - Development dependencies (pytest>=7.0.0, pytest-cov>=4.0.0, ruff>=0.1.0)
Development
Setup
git clone https://github.com/aborruso/csvnorm
cd csvnorm
# Create and activate venv with uv (recommended)
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"
# Or with pip
pip install -e ".[dev]"
Testing
pytest tests/ -v
Project Structure
csvnorm/
├── src/csvnorm/
│ ├── __init__.py # Package version
│ ├── __main__.py # python -m support
│ ├── cli.py # CLI argument parsing
│ ├── core.py # Main processing pipeline
│ ├── encoding.py # Encoding detection/conversion
│ ├── validation.py # DuckDB validation
│ └── utils.py # Helper functions
├── tests/ # Test suite
├── test/ # CSV fixtures
└── pyproject.toml # Package configuration
Stay Updated
Get notified of new releases
Watch → Custom → ✓ Releases to receive notifications for all new versions.
Get notified of breaking changes only
Subscribe to Announcements to be notified only about:
- Breaking changes (major version bumps)
- Security updates
- Important deprecation notices
We follow Semantic Versioning:
- MAJOR (e.g., 1.0.0 → 2.0.0): Breaking changes
- MINOR (e.g., 1.0.0 → 1.1.0): New features, backward compatible
- PATCH (e.g., 1.0.0 → 1.0.1): Bug fixes only
See docs/COMMUNICATION.md for details.
License
MIT License (c) 2026 aborruso@gmail.com - See LICENSE file for details