dsvmonkey
Detect, profile, normalize and repair delimiter-separated-values files.
CSV is a polite lie. Real files are tab-separated, pipe-separated, or
semicolon-separated; start with decorative title rows; carry BOMs and
mixed encodings; include ragged rows and quoted newlines. dsvmonkey
reads them anyway, tells you what it found, and hands you a clean
stream of rows.
Status
Alpha. API is not yet stable.
Install
pip install dsvmonkey
For development (editable install with test tooling):
pip install -e ".[dev]"
# or equivalently:
pip install -r requirements-dev.txt
Both requirements.txt and requirements-dev.txt are thin pointers
to pyproject.toml — the single source of truth for dependency
lists. Edit dependencies in pyproject.toml; the requirements files
need no maintenance.
What it does
- Detect encoding, delimiter, quote char, header row and line endings — each with a confidence score, runner-up alternatives and the reasoning behind the choice.
- Normalize cells on read using cleanmonkey (BOMs, NBSPs, zero-width spaces, smart quotes, stray control chars).
- Profile date columns via datemonkey.
- Repair ragged rows, stray BOMs and inconsistent line endings.
- Stream row-by-row; large files are fine.
- Chain cleanly into pgmonkey (DB import), xlfilldown (Excel output) and typemonkey (type inference).
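To make the idea of a delimiter guess with a confidence score and runner-up alternatives concrete, here is a minimal sketch. It is not dsvmonkey's detection logic, which this README does not describe; the function name `rank_delimiters`, the candidate list and the scoring rule are all assumptions made for illustration. It scores each candidate by the fraction of lines that share its most common non-zero count, and it deliberately ignores quoting.

```python
from collections import Counter

# Hypothetical candidate set, chosen to match the formats named above.
CANDIDATES = [",", "\t", "|", ";"]


def rank_delimiters(sample: str) -> list[tuple[str, float]]:
    """Rank candidate delimiters by how consistently they split the lines.

    The score is the share of non-empty lines whose delimiter count equals
    the most common non-zero count. Quoted fields are not handled.
    """
    lines = [ln for ln in sample.splitlines() if ln]
    ranked = []
    for delim in CANDIDATES:
        counts = Counter(ln.count(delim) for ln in lines)
        counts.pop(0, None)  # lines without the delimiter give it no support
        if not counts:
            ranked.append((delim, 0.0))
            continue
        _, support = counts.most_common(1)[0]
        ranked.append((delim, support / len(lines)))
    # Highest score first; ties keep the order of CANDIDATES.
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# A semicolon-separated sample with a decorative title row on top.
sample = "Quarterly report, draft\nid;name;city\n1;Ada;London\n2;Linus;Helsinki\n"
best, runner_up = rank_delimiters(sample)[:2]
print("best:", best, "runner-up:", runner_up)
```

The title row costs the semicolon some confidence and gives the comma a small score, which is the kind of situation where reporting a runner-up is more useful than returning a single answer.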
CLI
dsvmonkey inspect file.csv # human-readable detection report
dsvmonkey normalize file.csv -o clean.csv # strip BOM, fix ragged rows, normalize endings
dsvmonkey convert file.csv -o out.jsonl --to jsonl
Run dsvmonkey --help or dsvmonkey <command> --help for the full
list. Flags are command-specific:
- inspect: -v/--verbose, --no-columns, --sample-rows, --excel-serial-min, --no-deep-scan, --clean-sample, --strict (exit 3 instead of 0 when the profile recommends human review; this is the unattended-pipeline gate).
- normalize: --encoding, --line-ending lf|crlf|cr, --delimiter, --field-count, --no-clean, --no-deep-scan, --keep-empty-rows, --sanitize-formulas, --strict (same gate semantics as inspect --strict: profile first, exit 3 with no output written when detection isn't confident enough).
- convert: --to {csv,tsv,jsonl}, --no-clean, --no-deep-scan, --keep-empty-rows, --sanitize-formulas (applies to every output format, including jsonl, because JSONL output is commonly transformed back to CSV/Excel later, where formula payloads that survived as JSON string values become live formulas), --strict (gate as above).
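The --strict gate is meant for unattended pipelines, so a small wrapper may help show how the exit code could be consumed. This is a sketch under assumptions: the function name `gate` and the injectable `run` parameter are hypothetical, and only exit codes 0 and 3 are documented above, so any other code is treated as a failure here.

```python
import subprocess

REVIEW_EXIT_CODE = 3  # documented above: --strict exits 3 when review is recommended


def gate(path, run=subprocess.run):
    """Run `dsvmonkey inspect --strict` and classify the outcome.

    `run` is injectable so the wrapper can be exercised without the CLI
    installed. Returns "ok" or "review"; other exit codes raise, because
    their meaning is not documented in this README.
    """
    code = run(["dsvmonkey", "inspect", "--strict", path]).returncode
    if code == 0:
        return "ok"
    if code == REVIEW_EXIT_CODE:
        return "review"
    raise RuntimeError(f"dsvmonkey inspect failed with exit code {code}")
```

A pipeline step could call `gate("file.csv")` and route "review" files to a quarantine directory instead of loading them.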
Python API
import dsvmonkey
# Profile a file — encoding, delimiter, headers, etc.
profile = dsvmonkey.profile_file("file.csv")
# Stream cleaned rows as dicts
for row in dsvmonkey.read("file.csv"):
    ...
# Write a cleaned version
report = dsvmonkey.repair("messy.csv", "clean.csv")
# Convert to JSON Lines
dsvmonkey.to_jsonl("file.csv", "file.jsonl")
# Per-column profiling (date-format detection via datemonkey)
columns = dsvmonkey.profile_columns("file.csv")
Limitations
Some behaviours are deliberate design tradeoffs rather than bugs (e.g.
mixed-encoding detection requires UTF-8 multi-byte evidence to avoid
false-positives on cp1252 files; duplicate header names in dict mode
warn-and-collapse rather than raise). See LIMITATIONS.md for the
full list with rationale and escape hatches.
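The mixed-encoding tradeoff can be illustrated with a short sketch. This is not dsvmonkey's implementation; the helper names `classify_lines` and `looks_mixed` and the line-by-line approach are assumptions. The idea shown is that a file is only called mixed when some non-ASCII lines decode as UTF-8 and others do not, so a pure cp1252 file is never flagged.

```python
def classify_lines(data: bytes) -> dict:
    """Count lines that are ASCII, valid multi-byte UTF-8, or not UTF-8."""
    counts = {"ascii": 0, "utf8_multibyte": 0, "not_utf8": 0}
    for line in data.splitlines():
        if line.isascii():
            counts["ascii"] += 1  # ASCII is neutral: valid in both encodings
            continue
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            counts["not_utf8"] += 1  # high bytes that are not UTF-8 sequences
        else:
            counts["utf8_multibyte"] += 1  # positive UTF-8 evidence
    return counts


def looks_mixed(data: bytes) -> bool:
    """Report mixed encodings only when both kinds of evidence are present."""
    counts = classify_lines(data)
    return counts["utf8_multibyte"] > 0 and counts["not_utf8"] > 0


pure_cp1252 = "café\nnaïve\n".encode("cp1252")
mixed = "café\n".encode("utf-8") + "naïve\n".encode("cp1252")
print(looks_mixed(pure_cp1252), looks_mixed(mixed))
```

Requiring positive UTF-8 evidence means a file with only ASCII and cp1252 lines is treated as a single legacy encoding rather than flagged as mixed.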
Using with AI assistants
SKILL.md at the repo root is a drop-in Claude Code / agent skill that
teaches LLMs how to call dsvmonkey correctly — decision tree, failure
modes it already handles, worked examples, and a "don't" list so agents
stop reinventing broken CSV parsing. Copy it to ~/.claude/skills/ or
include it in a project's AGENTS.md / CLAUDE.md for automatic
discovery.
License
MIT. See LICENSE.