Detect, profile, normalize and repair delimiter-separated values files (CSV, TSV, pipe, semicolon).

dsvmonkey

Detect, profile, normalize and repair delimiter-separated values (DSV) files.

CSV is a polite lie. Real files are tab-separated, pipe-separated, or semicolon-separated; start with decorative title rows; carry BOMs and mixed encodings; include ragged rows and quoted newlines. dsvmonkey reads them anyway, tells you what it found, and hands you a clean stream of rows.

Status

Alpha. API is not yet stable.

Install

pip install dsvmonkey

For development (editable install with test tooling):

pip install -e ".[dev]"   # quoted so shells such as zsh do not expand the brackets
# or equivalently:
pip install -r requirements-dev.txt

Both requirements.txt and requirements-dev.txt are thin pointers to pyproject.toml — the single source of truth for dependency lists. Edit dependencies in pyproject.toml; the requirements files need no maintenance.

What it does

  • Detect encoding, delimiter, quote char, header row and line endings — each with a confidence score, runner-up alternatives and the reasoning behind the choice.
  • Normalize cells on read using cleanmonkey (BOMs, NBSPs, zero-width spaces, smart quotes, stray control chars).
  • Profile date columns via datemonkey.
  • Repair ragged rows, stray BOMs and inconsistent line endings.
  • Stream row-by-row; large files are fine.
  • Chain cleanly into pgmonkey (DB import), xlfilldown (Excel output) and typemonkey (type inference).
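To make the detection step concrete, here is a minimal sketch that uses only the Python standard library (codecs and csv.Sniffer). It is not dsvmonkey's implementation and produces no confidence scores or runner-up alternatives. The function name sniff_bytes and the candidate delimiter set are assumptions made for illustration.

```python
import codecs
import csv


def sniff_bytes(raw: bytes) -> dict:
    """Guess BOM presence, delimiter and quote char from a byte sample.

    Illustrative only: assumes the sample is UTF-8 (with or without BOM)
    and that the delimiter is one of comma, tab, pipe or semicolon.
    """
    has_bom = raw.startswith(codecs.BOM_UTF8)
    # "utf-8-sig" decodes UTF-8 and silently drops a leading BOM if present.
    text = raw.decode("utf-8-sig")
    dialect = csv.Sniffer().sniff(text, delimiters=",\t|;")
    return {
        "bom": has_bom,
        "delimiter": dialect.delimiter,
        "quotechar": dialect.quotechar,
    }


sample = codecs.BOM_UTF8 + b"id|name|city\n1|Ann|Oslo\n2|Bob|Rome\n"
report = sniff_bytes(sample)
print(report["bom"], repr(report["delimiter"]))
```

A bare guess like this is easy to fool with title rows, ragged rows or quoted newlines, which is the gap the scored, explained detection described above is meant to fill.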

CLI

dsvmonkey inspect   file.csv                       # human-readable detection report
dsvmonkey normalize file.csv -o clean.csv          # strip BOM, fix ragged rows, normalize endings
dsvmonkey convert   file.csv -o out.jsonl --to jsonl

Run dsvmonkey --help or dsvmonkey <command> --help for the full list. Flags are command-specific:

  • inspect: -v/--verbose, --no-columns, --sample-rows, --excel-serial-min, --no-deep-scan, --clean-sample, --strict (exit 3 instead of 0 when the profile recommends human review — the unattended-pipeline gate).
  • normalize: --encoding, --line-ending lf|crlf|cr, --delimiter, --field-count, --no-clean, --no-deep-scan, --keep-empty-rows, --sanitize-formulas, --strict (same gate semantics as inspect --strict: profile first, exit 3 with no output written when detection isn't confident enough).
  • convert: --to {csv,tsv,jsonl}, --no-clean, --no-deep-scan, --keep-empty-rows, --sanitize-formulas (applies to every output format, including jsonl: JSONL output is often converted back to CSV or Excel later, where formula payloads that survived as JSON string values would become live formulas), --strict (gate as above).
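The --strict gate can be consumed from an unattended pipeline by branching on the exit code. The sketch below assumes only what is stated above: exit 0 when detection is confident and exit 3 when the profile recommends human review. Treating every other exit code as an error is an assumption, and the helper name run_gate is hypothetical.

```python
import subprocess

REVIEW_EXIT_CODE = 3  # documented above: --strict exits 3 when review is recommended


def run_gate(command: list[str]) -> str:
    """Run a command and map its exit code to a pipeline decision.

    Returns "proceed" for exit 0, "review" for exit 3, and "error" for
    anything else (the last mapping is an assumption, not documented).
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return "proceed"
    if result.returncode == REVIEW_EXIT_CODE:
        return "review"
    return "error"


# Intended use, with the flags listed above:
#   decision = run_gate(["dsvmonkey", "inspect", "file.csv", "--strict"])
#   if decision != "proceed":
#       ...park the file for a human instead of importing it
```

Keeping the command as a parameter means the same helper can wrap inspect, normalize or convert, since all three are described as sharing the gate semantics.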

Python API

import dsvmonkey

# Profile a file — encoding, delimiter, headers, etc.
profile = dsvmonkey.profile_file("file.csv")

# Stream cleaned rows as dicts
for row in dsvmonkey.read("file.csv"):
    ...

# Write a cleaned version
report = dsvmonkey.repair("messy.csv", "clean.csv")

# Convert to JSON Lines
dsvmonkey.to_jsonl("file.csv", "file.jsonl")

# Per-column profiling (date-format detection via datemonkey)
columns = dsvmonkey.profile_columns("file.csv")

Limitations

Some behaviours are deliberate design tradeoffs rather than bugs. For example, mixed-encoding detection requires UTF-8 multi-byte evidence to avoid false positives on cp1252 files, and duplicate header names in dict mode warn and collapse rather than raise. See LIMITATIONS.md for the full list with rationale and escape hatches.
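The mixed-encoding tradeoff is easier to see with a small standard-library sketch. It is not dsvmonkey's algorithm: the function names and the line-by-line strategy are assumptions chosen to show why a file is only called "mixed" when at least one line carries valid UTF-8 multi-byte sequences and at least one other line does not decode as UTF-8.

```python
def classify_lines(data: bytes) -> dict:
    """Count lines that are pure ASCII, valid non-ASCII UTF-8, or neither."""
    counts = {"ascii": 0, "utf8": 0, "other": 0}
    for line in data.splitlines():
        if line.isascii():
            counts["ascii"] += 1  # ASCII lines are no evidence either way
            continue
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            counts["other"] += 1  # high bytes that are not valid UTF-8
        else:
            counts["utf8"] += 1  # real multi-byte UTF-8 evidence
    return counts


def looks_mixed(data: bytes) -> bool:
    """Flag mixed encodings only when both kinds of evidence are present."""
    counts = classify_lines(data)
    return counts["utf8"] > 0 and counts["other"] > 0


pure_cp1252 = "café\nnaïve\n".encode("cp1252")
mixed = "café\n".encode("utf-8") + "naïve\n".encode("cp1252")
print(looks_mixed(pure_cp1252), looks_mixed(mixed))
```

Without the UTF-8 evidence requirement, any cp1252 file with accented characters would fail UTF-8 decoding on some lines and could be misreported as mixed.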

Using with AI assistants

SKILL.md at the repo root is a drop-in Claude Code / agent skill that teaches LLMs how to call dsvmonkey correctly — decision tree, failure modes it already handles, worked examples, and a "don't" list so agents stop reinventing broken CSV parsing. Copy it to ~/.claude/skills/ or include it in a project's AGENTS.md / CLAUDE.md for automatic discovery.

License

MIT. See LICENSE.
