Clean and merge messy CSV files from the command line — offline, big-file capable, recipe-driven.

These details have not been verified by PyPI

Project links

Project description

csvtidy

Clean and merge messy CSV files from the command line — trim whitespace, fix dates, drop duplicates, and combine an entire folder of exports into one tidy file. For data engineers, analysts, and anyone wrangling CSV/Excel exports that are too big to open in a spreadsheet or too many to merge by hand. 100% offline, powered by DuckDB so it streams files bigger than RAM, and driven by reusable recipes — define your cleanup once, replay it on any file or whole folder.

Python Powered by DuckDB

pipx install csvtidy

# Merge every CSV in a folder into one, tagging each row with its source file
csvtidy merge ./exports --output combined.csv --source-column file

Why csvtidy

Offline and private. Your data never leaves your machine. No accounts, no uploads, no cloud.
Handles huge files. The DuckDB engine streams and spills to disk, so you can merge and clean CSVs that won't fit in memory — the kind that crash a spreadsheet.
Recipe-driven. Save a sequence of steps to a small YAML file and re-run it on next month's data with one command. Build once, reuse forever.
Non-destructive by default. Values are read as text and never silently retyped; dates it can't parse keep their original text instead of becoming blank.
Merges messy schemas. Files with mismatched or reordered columns are aligned by name — missing cells become empty, nothing shifts.

Install

# Recommended: isolated install with pipx
pipx install csvtidy

# Or with pip
pip install csvtidy

# From source
git clone https://github.com/abhishekrai43/csvtidy
cd csvtidy
pip install -e .

Requires Python 3.9+.

Examples

The repo ships small messy sample files under examples/ so every command below runs as-is.

Merge a whole folder into one file

csvtidy merge examples/data/exports --output combined.csv --source-column source_file

Per-file headers are read as headers (never repeated as data rows), columns are aligned by name across files, and --source-column adds the originating file name to every row.

Remove duplicate rows

# Exact duplicate rows
csvtidy dedupe combined.csv --output deduped.csv

# Or treat rows as duplicates when one column matches (keep the first seen)
csvtidy dedupe combined.csv --subset email --keep first

Clean up a file

csvtidy clean examples/data/exports/feb.csv \
  --fix-dates signup_date \
  --collapse-spaces \
  --output feb.clean.csv

Trims whitespace, collapses internal whitespace runs, drops all-blank rows, and rewrites the dates in signup_date to ISO YYYY-MM-DD — leaving anything it can't parse untouched.

Run a recipe (build once, re-run forever)

csvtidy run examples/recipe.yaml

Pipe results between commands or into other tools — omit --output and csvtidy writes CSV to stdout, while progress goes to stderr:

csvtidy merge ./exports | csvtidy dedupe - --subset email > clean.csv

Recipes

A recipe captures a whole cleanup as a small YAML file so you can replay it on new data without re-typing flags. Steps run top to bottom and compose into a single DuckDB query, keeping the same larger-than-RAM streaming as the individual commands.

# examples/recipe.yaml
input: examples/data/exports/*.csv   # files, a folder, or a glob
output: customers.clean.csv
source_column: source_file           # tag each row with its source file

steps:
  - clean:
      trim: true
      drop_empty_rows: true
      collapse_spaces: true
      fix_dates: [signup_date]
      date_format: "%Y-%m-%d"

  - dedupe:
      subset: [email]
      keep: first

csvtidy run examples/recipe.yaml            # writes customers.clean.csv
csvtidy run examples/recipe.yaml -o out.csv # override the output path

Command reference

`csvtidy merge <inputs...>`

Combine multiple CSVs, a folder, or globs into one file.

Option	Description
`-o, --output PATH`	Write to this CSV (streamed to disk). Omit to print to stdout.
`--source-column NAME`	Add a column with each row's source file name.
`--pattern GLOB`	Glob used when an input is a folder (default `*.csv`).
`-r, --recursive`	Recurse into sub-folders.

`csvtidy dedupe <input>`

Remove duplicate rows, preserving input order.

Option	Description
`-o, --output PATH`	Write to this CSV. Omit for stdout.
`--subset COLS`	Comma-separated columns to match on (default: all columns).
`--keep first\|last`	Which duplicate to keep (default `first`).

`csvtidy clean <input>`

Apply cleanup primitives.

Option	Description
`-o, --output PATH`	Write to this CSV. Omit for stdout.
`--trim / --no-trim`	Trim leading/trailing whitespace (default on).
`--drop-empty-rows / --keep-empty-rows`	Drop rows where every cell is blank (default on).
`--collapse-spaces`	Collapse internal whitespace runs to a single space.
`--fix-dates COLS`	Comma-separated columns to normalize to one date format.
`--date-format FMT`	Output format for `--fix-dates` (default `%Y-%m-%d`).
`--dayfirst`	Read ambiguous dates as day/month (e.g. `03/04` = 3 April).

`csvtidy run <recipe.yaml>`

Run a saved recipe. -o, --output PATH overrides the recipe's output path.

How it works

csvtidy is a thin, friendly layer over DuckDB. Each command builds a single SQL query and lets DuckDB do the heavy lifting: it reads CSVs directly, processes data in a streaming fashion, and spills to disk when a job is larger than memory. That's why merging a folder of multi-gigabyte exports works on an ordinary laptop without loading everything into RAM at once.

Development

git clone https://github.com/abhishekrai43/csvtidy
cd csvtidy
pip install -e ".[dev]"
pytest

License

MIT © Abhishek Rai

csvtidy is the open-source command-line tool. If you'd prefer a full desktop app with a visual recipe builder and the same offline, big-file engine, see Kramata.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Jun 29, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

csvtidy-0.1.0.tar.gz (12.7 kB view details)

Uploaded Jun 29, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

csvtidy-0.1.0-py3-none-any.whl (13.6 kB view details)

Uploaded Jun 29, 2026 Python 3

File details

Details for the file csvtidy-0.1.0.tar.gz.

File metadata

Download URL: csvtidy-0.1.0.tar.gz
Upload date: Jun 29, 2026
Size: 12.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for csvtidy-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e42a6e707ed91e3c708754e0999adb5fdf46b2d0eaebb039837ec02b875579aa`
MD5	`cfe12e87b533a117b3c0a14e78a8571b`
BLAKE2b-256	`39a87e0cb5e4b8f843d32a90b00d953bdad1566907f1d51e171aebb227f250ac`

See more details on using hashes here.

File details

Details for the file csvtidy-0.1.0-py3-none-any.whl.

File metadata

Download URL: csvtidy-0.1.0-py3-none-any.whl
Upload date: Jun 29, 2026
Size: 13.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for csvtidy-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8f56ff240bc7c36da529aec8cc80df4a37bbaf23838f61d0d8c19303723aa5fb`
MD5	`c09e168744fd45e0a2fe7426c67a3284`
BLAKE2b-256	`0425a5352c8c798b2ffda9c33f7bb1f7e8b0ff4cc6b49459e40f59c8dce3b70b`

See more details on using hashes here.

csvtidy 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

csvtidy

Why csvtidy

Install

Examples

Merge a whole folder into one file

Remove duplicate rows

Clean up a file

Run a recipe (build once, re-run forever)

Recipes

Command reference

csvtidy merge <inputs...>

csvtidy dedupe <input>

csvtidy clean <input>

csvtidy run <recipe.yaml>

How it works

Development

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`csvtidy merge <inputs...>`

`csvtidy dedupe <input>`

`csvtidy clean <input>`

`csvtidy run <recipe.yaml>`