Read, inspect and rewrite malformed CSV files with automatic encoding and separator detection.
csvbench
csvbench is a Python library for reading, diagnosing, and repairing malformed CSV files. It is under active development and not yet production-ready.
It does not use Python's csv module: handling broken files is the point.
Status
Early stage. The core pipeline (encoding detection, parsing, diagnosis) works. Repair strategies are under active development.
Battle-tested in production? Probably not. But you're welcome to try and to contribute.
Features
- Automatic detection of encoding, delimiter, and quote character
- Multi-character separator support (e.g. ||, @@@)
- Structured diagnostic reports with per-row issue tracking
- Pluggable repair strategies via the Strategy pattern
- CLI with rich terminal output and JSON output for programmatic use
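To give a feel for what "pluggable repair strategies" can look like, here is a minimal Strategy-pattern sketch. Every name in it (RepairStrategy, StripWhitespace, PadShortRows, apply_strategies) is hypothetical and chosen for illustration; it is not csvbench's actual API, which is still under development.

```python
from typing import Protocol

Row = list[str]


class RepairStrategy(Protocol):
    """Hypothetical interface: take parsed rows, return repaired rows."""

    def repair(self, rows: list[Row]) -> list[Row]: ...


class StripWhitespace:
    """Trim leading and trailing whitespace from every field."""

    def repair(self, rows: list[Row]) -> list[Row]:
        return [[field.strip() for field in row] for row in rows]


class PadShortRows:
    """Pad rows that are shorter than the first row with empty fields."""

    def repair(self, rows: list[Row]) -> list[Row]:
        if not rows:
            return rows
        width = len(rows[0])
        # Multiplying a list by a negative number yields [], so long rows are untouched.
        return [row + [""] * (width - len(row)) for row in rows]


def apply_strategies(rows: list[Row], strategies: list[RepairStrategy]) -> list[Row]:
    """Run each strategy in order; the caller decides which ones to plug in."""
    for strategy in strategies:
        rows = strategy.repair(rows)
    return rows


rows = [["id", "name", "city"], [" 1", "Ada ", "London"], ["2", "Linus"]]
repaired = apply_strategies(rows, [StripWhitespace(), PadShortRows()])
```

Because each strategy only has to provide a repair method, new repairs can be added without touching the code that runs them.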
Usage
CLI
csvbench inspect appointments.csv
╭────────────────────────────── csvbench inspect ────────────────────────────────╮
│ │
│ 📁 File ~/data/appointments.csv │
│ 🔤 Encoding utf-8-sig (100% confidence - bom) │
│ 🔀 Separator ';' (98% confidence - sniffed) │
│ 💬 Quotechar '"' (97% confidence - detected) │
│ 📊 Columns 12 │
│ 📈 Lines 19847 │
│ ❌ Errors 0 │
│ ⚠️ Warnings 0 │
│ ⏱️ Elapsed 0.0013s │
│ │
╰────────────────────────────────────────────────────────────────────────────────╯
✔ No issues found.
JSON output for scripting:
csvbench inspect appointments.csv --format json
csvbench inspect appointments.csv --format json --output report.json
Reading from stdin:
cat appointments.csv | csvbench inspect -
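For scripting from Python, the JSON mode can be driven through subprocess. The sketch below uses only the subcommand and flags shown above (inspect, --format json, --output). It assumes that csvbench is on PATH, that the report is written to stdout when --output is omitted, and that the top-level JSON value is an object. The helper names are made up for this example.

```python
import json
import subprocess
from typing import Optional


def build_inspect_command(path: str, output: Optional[str] = None) -> list[str]:
    """Build the argument list for `csvbench inspect` in JSON mode."""
    command = ["csvbench", "inspect", path, "--format", "json"]
    if output is not None:
        command += ["--output", output]
    return command


def inspect_as_dict(path: str) -> dict:
    """Run csvbench and parse the JSON report.

    Assumption: without --output the report is printed to stdout,
    and its top-level JSON value is an object.
    """
    result = subprocess.run(
        build_inspect_command(path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)
```

The keys of the returned dictionary depend on csvbench's report schema, which is not documented here, so inspect the output before relying on specific fields.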
Python API
from csvbench import CsvWorkbench
workbench = CsvWorkbench()
csv_file = workbench.read("appointments.csv")
print(csv_file.delimiter) # ';'
print(csv_file.encoding) # 'utf-8-sig'
print(csv_file.report.has_errors) # False
Override detection when you already know the parameters:
csv_file = workbench.read("appointments.csv", delimiter=";", encoding="utf-8")
Design
No csv module. csvbench implements its own parser. Python's csv module assumes
the file is well-formed enough to be parsed — csvbench doesn't. The parser operates
character by character to correctly handle malformed quoting, embedded newlines, and
inconsistent delimiters.
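To illustrate the idea of character-by-character parsing, here is a small lenient state machine. It is a sketch written for this README, not csvbench's actual parser: the function name parse_lenient and its recovery rules (a stray quote inside an unquoted field is kept literally, an unterminated quote runs to the end of the input) are assumptions.

```python
def parse_lenient(text: str, delimiter: str = ";", quotechar: str = '"') -> list[list[str]]:
    """Parse CSV text one character at a time without raising on bad quoting.

    The delimiter may be several characters long; quotechar must be one character.
    """
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    rows: list[list[str]] = []
    row: list[str] = []
    field: list[str] = []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == quotechar:
                if text[i + 1 : i + 2] == quotechar:
                    field.append(quotechar)  # doubled quote means a literal quote
                    i += 2
                    continue
                in_quotes = False  # closing quote
            else:
                field.append(ch)  # embedded newlines and delimiters are plain data
            i += 1
        elif ch == quotechar and not field:
            in_quotes = True  # a quote only opens a field at the field's start
            i += 1
        elif text.startswith(delimiter, i):
            row.append("".join(field))
            field = []
            i += len(delimiter)
        elif ch in "\r\n":
            if ch == "\r" and text[i + 1 : i + 2] == "\n":
                i += 1  # treat \r\n as a single line break
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
            i += 1
        else:
            field.append(ch)  # includes stray quotes in the middle of a field
            i += 1
    if field or row:
        # Flush the last row when the input does not end with a line break.
        row.append("".join(field))
        rows.append(row)
    return rows


sample = 'id;note\n1;"line one\nline two"\n2;say "hi"\n3;"unterminated'
parsed = parse_lenient(sample)
```

A real repair tool would also record where each recovery happened so that it can be reported per row; this sketch only shows the parsing loop.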
Multi-character separators. The delimiter detector considers both single-character
(|, ;, \t) and multi-character candidates (||, ::) when sniffing the file.
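One possible way to sniff a delimiter across single- and multi-character candidates is sketched below. It is an illustrative heuristic, not csvbench's detector: the candidate list, the function name sniff_delimiter, and the scoring rule (the share of lines that agree on the same non-zero count) are assumptions, and quoting is ignored for brevity.

```python
from collections import Counter
from typing import Sequence

# Assumed candidate list for this example.
CANDIDATES = ("||", "::", "|", ";", ",", "\t")


def sniff_delimiter(sample: str, candidates: Sequence[str] = CANDIDATES) -> tuple[str, float]:
    """Return (delimiter, score), where score is the share of agreeing lines."""
    lines = [line for line in sample.splitlines() if line]
    if not lines:
        raise ValueError("sample is empty")

    counts: dict[str, list[int]] = {}
    remaining = lines
    # Longest candidates first, so "||" claims its characters before "|" is counted.
    for cand in sorted(candidates, key=len, reverse=True):
        counts[cand] = [line.count(cand) for line in remaining]
        # Mask matches with NUL so shorter candidates cannot reuse them.
        remaining = [line.replace(cand, "\0") for line in remaining]

    best, best_score = "", 0.0
    for cand, per_line in counts.items():
        nonzero = [n for n in per_line if n > 0]
        if not nonzero:
            continue
        # How many lines share the most common non-zero count?
        _, frequency = Counter(nonzero).most_common(1)[0]
        score = frequency / len(lines)
        if score > best_score:
            best, best_score = cand, score
    if not best:
        raise ValueError("no candidate delimiter found")
    return best, best_score


double_pipe = sniff_delimiter("a||b||c\nd||e||f\n")
single_pipe = sniff_delimiter("a|b|c\nd||f\ng|h|i\n")
```

In the second sample the empty field in "d||f" looks like a double-pipe separator on that one line, but the single pipe still wins because it is consistent across more lines.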
Pydantic v2 models throughout. CSVFile, DiagnosticReport, Issue, and all
detector results are Pydantic models. This keeps the data layer typed, validated, and
serializable without extra glue code.
CLI with two output modes. rich for humans, json for pipelines. Both use the
same underlying models — the formatter is swapped, not the data.
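The "swap the formatter, not the data" idea can be sketched with the standard library alone. In this example dataclasses stand in for the Pydantic models, and a plain-text formatter stands in for the rich one, to keep it dependency-free. The fields on Issue and the formatter names are hypothetical; only the class names and has_errors come from this README.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Issue:
    row: int        # hypothetical field
    severity: str   # assumed values: "error" or "warning"
    message: str    # hypothetical field


@dataclass
class DiagnosticReport:
    issues: list[Issue] = field(default_factory=list)

    @property
    def has_errors(self) -> bool:
        return any(issue.severity == "error" for issue in self.issues)


def format_text(report: DiagnosticReport) -> str:
    """Human-readable output (stand-in for the rich formatter)."""
    if not report.issues:
        return "No issues found."
    return "\n".join(f"row {i.row}: [{i.severity}] {i.message}" for i in report.issues)


def format_json(report: DiagnosticReport) -> str:
    """Machine-readable output built from the same report object."""
    return json.dumps(asdict(report), sort_keys=True)


# The CLI only needs to pick a formatter; the report itself never changes.
FORMATTERS = {"text": format_text, "json": format_json}

report = DiagnosticReport([Issue(row=3, severity="error", message="unterminated quote")])
text_output = FORMATTERS["text"](report)
json_output = FORMATTERS["json"](report)
```

With Pydantic v2 models the JSON side would typically come from the model's own serialization instead of asdict, which is the "no extra glue code" benefit mentioned above.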
Contributing
Issues and pull requests are welcome.
If you find a CSV file that csvbench misparses or misdiagnoses, opening an issue with the file (or a minimal reproduction) is already a meaningful contribution.
License
MIT
File details
Details for the file csvbench-0.1.0.tar.gz.
File metadata
- Download URL: csvbench-0.1.0.tar.gz
- Upload date:
- Size: 34.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6095348535db4f80491871816aaa26e6595857482290a88bca3843ba50b60933 |
| MD5 | b0ec01f8294f43668b9071410104e0d2 |
| BLAKE2b-256 | 9656feabb39157cf3afbfd8d9abc5e97fab254b2c701c47494bc47ad98bcc5a8 |
File details
Details for the file csvbench-0.1.0-py3-none-any.whl.
File metadata
- Download URL: csvbench-0.1.0-py3-none-any.whl
- Upload date:
- Size: 41.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 314938ea20ad0f98f41922126f3c330cc2442a3aa7732c6ee2177ac6bec0282d |
| MD5 | 4e87a35617275fd8dcfc3b6b5a81bedb |
| BLAKE2b-256 | 2fc4f3db9fd296fbd8ba194f690968643928718013505fc208f88b8fd41637e4 |