
Read, inspect and rewrite malformed CSV files with automatic encoding and separator detection.


csvbench

csvbench is a Python library for reading, diagnosing, and repairing malformed CSV files. It is under active development and not yet production-ready.

It does not use Python's csv module, because handling files that a standard parser cannot read is the point.


Status

Early stage. The core pipeline (encoding detection, parsing, diagnosis) works. Repair strategies are under active development.

Battle-tested in production? Probably not. But you're welcome to try it and to contribute.


Features

  • Automatic detection of encoding, delimiter, and quote character
  • Multi-character separator support (e.g. ||, @@@)
  • Structured diagnostic reports with per-row issue tracking
  • Pluggable repair strategies via the Strategy pattern
  • CLI with rich terminal output and JSON output for programmatic use
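
As an illustration of the pluggable repair strategies listed above, here is a minimal sketch of the Strategy pattern applied to row repair. It is not csvbench's actual API: `RepairStrategy`, `PadShortRows`, `DropEmptyRows`, and `apply_strategies` are hypothetical names chosen for this example.

```python
from typing import List, Protocol, Sequence

Row = List[str]


class RepairStrategy(Protocol):
    """Hypothetical interface: anything with a matching repair() method qualifies."""

    def repair(self, rows: List[Row], expected_columns: int) -> List[Row]: ...


class DropEmptyRows:
    """Remove rows whose cells are all empty or whitespace."""

    def repair(self, rows: List[Row], expected_columns: int) -> List[Row]:
        return [row for row in rows if any(cell.strip() for cell in row)]


class PadShortRows:
    """Pad rows that have too few cells; longer rows are left untouched."""

    def repair(self, rows: List[Row], expected_columns: int) -> List[Row]:
        # A negative multiplier yields an empty list, so long rows are unchanged.
        return [row + [""] * (expected_columns - len(row)) for row in rows]


def apply_strategies(
    rows: List[Row], expected_columns: int, strategies: Sequence[RepairStrategy]
) -> List[Row]:
    """Run each strategy in order, feeding the output of one into the next."""
    for strategy in strategies:
        rows = strategy.repair(rows, expected_columns)
    return rows
```

Because each strategy only depends on the shared `repair` signature, new repairs can be added without touching the code that runs them.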

Usage

CLI

csvbench inspect appointments.csv
╭────────────────────────────── csvbench inspect ────────────────────────────────╮
│                                                                                │
│   📁 File  ~/data/appointments.csv                                             │
│   🔤 Encoding  utf-8-sig  (100% confidence - bom)                              │
│   🔀 Separator  ';'  (98% confidence - sniffed)                                │
│   💬 Quotechar  '"'  (97% confidence - detected)                               │
│   📊 Columns  12                                                               │
│   📈 Lines  19847                                                              │
│   ❌ Errors  0                                                                 │
│   ⚠️  Warnings  0                                                              │
│   ⏱️  Elapsed  0.0013s                                                         │
│                                                                                │
╰────────────────────────────────────────────────────────────────────────────────╯
  ✔  No issues found.
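
The "bom" label in the encoding line above suggests that the encoding was read from a byte-order mark. How csvbench does this internally is not described here; the following is only a sketch of BOM-based detection with the standard library, and `detect_bom` is a hypothetical helper name.

```python
import codecs
from typing import Optional

# UTF-32 marks are checked first because the UTF-32 LE mark begins with
# the same two bytes as the UTF-16 LE mark.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_bom(raw: bytes) -> Optional[str]:
    """Return a codec name if the data starts with a known BOM, else None."""
    for bom, codec_name in _BOMS:
        if raw.startswith(bom):
            return codec_name
    return None
```

The returned codec names ("utf-8-sig", "utf-16", "utf-32") all consume the mark while decoding, so it does not leak into the first field.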

JSON output for scripting:

csvbench inspect appointments.csv --format json
csvbench inspect appointments.csv --format json --output report.json

Reading from stdin:

cat appointments.csv | csvbench inspect -

Python API

from csvbench import CsvWorkbench

workbench = CsvWorkbench()
csv_file = workbench.read("appointments.csv")

print(csv_file.delimiter)           # ';'
print(csv_file.encoding)            # 'utf-8-sig'
print(csv_file.report.has_errors)   # False

Override detection when you already know the parameters:

csv_file = workbench.read("appointments.csv", delimiter=";", encoding="utf-8")

Design

No csv module. csvbench implements its own parser. Python's csv module assumes the file is well-formed enough to be parsed — csvbench doesn't. The parser operates character by character to correctly handle malformed quoting, embedded newlines, and inconsistent delimiters.
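
To make the character-by-character approach concrete, here is a simplified sketch of such a parser. It is not csvbench's implementation: `parse_rows` is a hypothetical function that only shows how quoting, embedded newlines, and a multi-character delimiter can be handled in a single pass.

```python
from typing import List


def parse_rows(text: str, delimiter: str = ",", quotechar: str = '"') -> List[List[str]]:
    """Split text into rows of fields by walking it one character at a time."""
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    rows: List[List[str]] = []
    row: List[str] = []
    field: List[str] = []
    in_quotes = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch != quotechar:
                field.append(ch)  # delimiters and newlines are literal here
                i += 1
            elif i + 1 < n and text[i + 1] == quotechar:
                field.append(quotechar)  # doubled quote is an escaped quote
                i += 2
            else:
                in_quotes = False  # closing quote
                i += 1
        elif ch == quotechar and not field:
            in_quotes = True  # a quote only opens at the start of a field
            i += 1
        elif text.startswith(delimiter, i):
            row.append("".join(field))
            field = []
            i += len(delimiter)  # skip the whole (possibly multi-char) delimiter
        elif ch in "\r\n":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1  # treat CRLF as one line ending
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
            i += 1
        else:
            field.append(ch)  # includes stray quotes in the middle of a field
            i += 1
    if field or row:
        # Flush the last row when the text does not end with a newline.
        # An unterminated quote ends up here with the rest of the text in one field.
        row.append("".join(field))
        rows.append(row)
    return rows
```

A real diagnostic parser would also record where things went wrong (for example the line on which an unterminated quote started); this sketch only recovers the fields.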

Multi-character separators. When sniffing the file, the delimiter detector considers both single-character candidates (|, ;, \t) and multi-character candidates (||, ::).
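
One way such a detector could work is to score each candidate by how consistently it appears across lines. The sketch below assumes that heuristic; the candidate list, the scoring rule, and the name `sniff_delimiter` are illustrative and not taken from csvbench.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

CANDIDATES = (",", ";", "\t", "|", "||", "::", "@@@")


def sniff_delimiter(
    sample: str, candidates: Sequence[str] = CANDIDATES
) -> Optional[Tuple[str, float]]:
    """Return (delimiter, score) for the most consistent candidate, or None."""
    lines = [line for line in sample.splitlines() if line.strip()]
    if not lines:
        return None
    best: Optional[Tuple[float, int, str]] = None
    for candidate in candidates:
        counts = Counter(line.count(candidate) for line in lines)
        counts.pop(0, None)  # lines without the candidate say nothing about it
        if not counts:
            continue
        _, lines_with_modal_count = counts.most_common(1)[0]
        # Score: share of lines that contain the candidate the same number of times.
        score = lines_with_modal_count / len(lines)
        # On equal scores prefer the longer candidate, so "||" beats "|"
        # in a file that is really separated by "||".
        key = (score, len(candidate), candidate)
        if best is None or key[:2] > best[:2]:
            best = key
    if best is None:
        return None
    return best[2], best[0]
```

This version ignores quoting, so a delimiter inside a quoted field lowers the score; a more careful detector would count only occurrences outside quotes.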

Pydantic v2 models throughout. CSVFile, DiagnosticReport, Issue, and all detector results are Pydantic models. This keeps the data layer typed, validated, and serializable without extra glue code.

CLI with two output modes. rich for humans, json for pipelines. Both use the same underlying models — the formatter is swapped, not the data.
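
The idea of swapping the formatter rather than the data can be sketched as follows. To stay dependency-free, this example uses standard-library dataclasses as a stand-in for the Pydantic models, and `SketchIssue`, `SketchReport`, `format_text`, `format_json`, and `render` are hypothetical names, not csvbench's.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List


@dataclass
class SketchIssue:
    line: int
    severity: str
    message: str


@dataclass
class SketchReport:
    path: str
    delimiter: str
    issues: List[SketchIssue] = field(default_factory=list)


def format_text(report: SketchReport) -> str:
    """Human-readable rendering of the report."""
    lines = [f"File: {report.path}", f"Separator: {report.delimiter!r}"]
    for issue in report.issues:
        lines.append(f"  [{issue.severity}] line {issue.line}: {issue.message}")
    if not report.issues:
        lines.append("  No issues found.")
    return "\n".join(lines)


def format_json(report: SketchReport) -> str:
    """Machine-readable rendering of the same report."""
    return json.dumps(asdict(report), indent=2)


# The report never changes; only the function that renders it does.
FORMATTERS: Dict[str, Callable[[SketchReport], str]] = {
    "text": format_text,
    "json": format_json,
}


def render(report: SketchReport, fmt: str = "text") -> str:
    return FORMATTERS[fmt](report)
```

Adding a third output mode would then mean registering one more function, with no change to the report model.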


Contributing

Issues and pull requests are welcome.

If you find a CSV file that csvbench misparses or misdiagnoses, opening an issue with the file (or a minimal reproduction) is already a meaningful contribution.


License

MIT
