Clean and merge messy CSV files from the command line — offline, big-file capable, recipe-driven.
Project description
csvtidy
Clean and merge messy CSV files from the command line — trim whitespace, fix dates, drop duplicates, and combine an entire folder of exports into one tidy file. For data engineers, analysts, and anyone wrangling CSV/Excel exports that are too big to open in a spreadsheet or too many to merge by hand. 100% offline, powered by DuckDB so it streams files bigger than RAM, and driven by reusable recipes — define your cleanup once, replay it on any file or whole folder.
pipx install csvtidy
# Merge every CSV in a folder into one, tagging each row with its source file
csvtidy merge ./exports --output combined.csv --source-column file
Why csvtidy
- Offline and private. Your data never leaves your machine. No accounts, no uploads, no cloud.
- Handles huge files. The DuckDB engine streams and spills to disk, so you can merge and clean CSVs that won't fit in memory — the kind that crash a spreadsheet.
- Recipe-driven. Save a sequence of steps to a small YAML file and re-run it on next month's data with one command. Build once, reuse forever.
- Non-destructive by default. Values are read as text and never silently retyped; dates it can't parse keep their original text instead of becoming blank.
- Merges messy schemas. Files with mismatched or reordered columns are aligned by name — missing cells become empty, nothing shifts.
Install
# Recommended: isolated install with pipx
pipx install csvtidy
# Or with pip
pip install csvtidy
# From source
git clone https://github.com/abhishekrai43/csvtidy
cd csvtidy
pip install -e .
Requires Python 3.9+.
Examples
The repo ships small messy sample files under examples/ so every command below runs as-is.
Merge a whole folder into one file
csvtidy merge examples/data/exports --output combined.csv --source-column source_file
Per-file headers are read as headers (never repeated as data rows), columns are aligned by name across files, and --source-column adds the originating file name to every row.
Remove duplicate rows
# Exact duplicate rows
csvtidy dedupe combined.csv --output deduped.csv
# Or treat rows as duplicates when one column matches (keep the first seen)
csvtidy dedupe combined.csv --subset email --keep first
Clean up a file
csvtidy clean examples/data/exports/feb.csv \
--fix-dates signup_date \
--collapse-spaces \
--output feb.clean.csv
Trims whitespace, collapses internal whitespace runs, drops all-blank rows, and rewrites the dates in signup_date to ISO YYYY-MM-DD — leaving anything it can't parse untouched.
Run a recipe (build once, re-run forever)
csvtidy run examples/recipe.yaml
Pipe results between commands or into other tools — omit --output and csvtidy writes CSV to stdout, while progress goes to stderr:
csvtidy merge ./exports | csvtidy dedupe - --subset email > clean.csv
Recipes
A recipe captures a whole cleanup as a small YAML file so you can replay it on new data without re-typing flags. Steps run top to bottom and compose into a single DuckDB query, keeping the same larger-than-RAM streaming as the individual commands.
# examples/recipe.yaml
input: examples/data/exports/*.csv # files, a folder, or a glob
output: customers.clean.csv
source_column: source_file # tag each row with its source file
steps:
- clean:
trim: true
drop_empty_rows: true
collapse_spaces: true
fix_dates: [signup_date]
date_format: "%Y-%m-%d"
- dedupe:
subset: [email]
keep: first
csvtidy run examples/recipe.yaml # writes customers.clean.csv
csvtidy run examples/recipe.yaml -o out.csv # override the output path
Command reference
csvtidy merge <inputs...>
Combine multiple CSVs, a folder, or globs into one file.
| Option | Description |
|---|---|
-o, --output PATH |
Write to this CSV (streamed to disk). Omit to print to stdout. |
--source-column NAME |
Add a column with each row's source file name. |
--pattern GLOB |
Glob used when an input is a folder (default *.csv). |
-r, --recursive |
Recurse into sub-folders. |
csvtidy dedupe <input>
Remove duplicate rows, preserving input order.
| Option | Description |
|---|---|
-o, --output PATH |
Write to this CSV. Omit for stdout. |
--subset COLS |
Comma-separated columns to match on (default: all columns). |
--keep first|last |
Which duplicate to keep (default first). |
csvtidy clean <input>
Apply cleanup primitives.
| Option | Description |
|---|---|
-o, --output PATH |
Write to this CSV. Omit for stdout. |
--trim / --no-trim |
Trim leading/trailing whitespace (default on). |
--drop-empty-rows / --keep-empty-rows |
Drop rows where every cell is blank (default on). |
--collapse-spaces |
Collapse internal whitespace runs to a single space. |
--fix-dates COLS |
Comma-separated columns to normalize to one date format. |
--date-format FMT |
Output format for --fix-dates (default %Y-%m-%d). |
--dayfirst |
Read ambiguous dates as day/month (e.g. 03/04 = 3 April). |
csvtidy run <recipe.yaml>
Run a saved recipe. -o, --output PATH overrides the recipe's output path.
How it works
csvtidy is a thin, friendly layer over DuckDB. Each command builds a single SQL query and lets DuckDB do the heavy lifting: it reads CSVs directly, processes data in a streaming fashion, and spills to disk when a job is larger than memory. That's why merging a folder of multi-gigabyte exports works on an ordinary laptop without loading everything into RAM at once.
Development
git clone https://github.com/abhishekrai43/csvtidy
cd csvtidy
pip install -e ".[dev]"
pytest
License
MIT © Abhishek Rai
csvtidy is the open-source command-line tool. If you'd prefer a full desktop app with a visual recipe builder and the same offline, big-file engine, see Kramata.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file csvtidy-0.1.0.tar.gz.
File metadata
- Download URL: csvtidy-0.1.0.tar.gz
- Upload date:
- Size: 12.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e42a6e707ed91e3c708754e0999adb5fdf46b2d0eaebb039837ec02b875579aa
|
|
| MD5 |
cfe12e87b533a117b3c0a14e78a8571b
|
|
| BLAKE2b-256 |
39a87e0cb5e4b8f843d32a90b00d953bdad1566907f1d51e171aebb227f250ac
|
File details
Details for the file csvtidy-0.1.0-py3-none-any.whl.
File metadata
- Download URL: csvtidy-0.1.0-py3-none-any.whl
- Upload date:
- Size: 13.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8f56ff240bc7c36da529aec8cc80df4a37bbaf23838f61d0d8c19303723aa5fb
|
|
| MD5 |
c09e168744fd45e0a2fe7426c67a3284
|
|
| BLAKE2b-256 |
0425a5352c8c798b2ffda9c33f7bb1f7e8b0ff4cc6b49459e40f59c8dce3b70b
|