Small CSV utilities: classification, duplicates, row digests, and CLI helpers.
csvsmith
Introduction
csvsmith is a lightweight collection of CSV utilities designed for
data integrity, deduplication, and organization. It provides a robust
Python API for programmatic data cleaning and a convenient CLI for quick
operations.
Whether you need to organize thousands of files based on their structural
signatures or pinpoint duplicate rows in a complex dataset, csvsmith
ensures the process is predictable, transparent, and reversible.
As of recent versions, CSV classification supports:
- strict vs relaxed header matching
- exact vs subset (“contains”) matching
- auto clustering with collision‑resistant hashes
- dry‑run preview
- report‑only planning mode (scan without moving)
- full rollback via manifest
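The two matching axes above can be illustrated with a small standalone sketch. It does not use csvsmith's internals: the helper names `normalize` and `matches` are hypothetical, and it assumes only what this README states elsewhere, namely that relaxed mode ignores column order and that "contains" requires every signature column to be present.

```python
def normalize(header, mode="strict"):
    """Return a comparable form of a header row.

    Assumption: "strict" keeps column order, "relaxed" ignores it.
    """
    if mode == "strict":
        return tuple(header)
    return tuple(sorted(header))


def matches(header, signature, mode="strict", match="exact"):
    """Check a CSV header against a signature (hypothetical helper)."""
    h = normalize(header, mode)
    s = normalize(signature, mode)
    if match == "exact":
        return h == s
    # "contains": every signature column must appear in the header
    return set(s) <= set(h)


print(matches(["name", "id"], ["id", "name"]))                    # False
print(matches(["name", "id"], ["id", "name"], mode="relaxed"))    # True
print(matches(["id", "name", "email"], ["id", "name"], match="contains"))  # True
```

The actual classifier may normalize headers differently (for example, trimming whitespace), so treat this only as a mental model of the options.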
Installation
From PyPI:
pip install csvsmith
For local development:
git clone https://github.com/yeiichi/csvsmith.git
cd csvsmith
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
Python API Usage
Count duplicate values
from csvsmith import count_duplicates_sorted
items = ["a", "b", "a", "c", "a", "b"]
print(count_duplicates_sorted(items))
# [('a', 3), ('b', 2)]
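For readers who want to see what such a count involves, here is a rough standard-library equivalent. It is a sketch, not csvsmith's implementation: `count_duplicates_sketch` is a hypothetical name, and the ordering (descending count, then value) is an assumption inferred from the output above.

```python
from collections import Counter


def count_duplicates_sketch(items):
    """Return (value, count) pairs for values seen more than once."""
    counts = Counter(items)
    dups = [(value, n) for value, n in counts.items() if n > 1]
    # Assumed ordering: most frequent first, ties broken by value
    return sorted(dups, key=lambda pair: (-pair[1], str(pair[0])))


print(count_duplicates_sketch(["a", "b", "a", "c", "a", "b"]))
# [('a', 3), ('b', 2)]
```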
Find duplicate rows in a DataFrame
import pandas as pd
from csvsmith import find_duplicate_rows
df = pd.read_csv("input.csv")
dup_rows = find_duplicate_rows(df)
Deduplicate with report
import pandas as pd
from csvsmith import dedupe_with_report
df = pd.read_csv("input.csv")
deduped, report = dedupe_with_report(df)
deduped.to_csv("deduped.csv", index=False)
report.to_csv("duplicate_report.csv", index=False)
# Exclude columns (e.g. IDs or timestamps)
deduped2, report2 = dedupe_with_report(df, exclude=["id"])
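The effect of excluding columns can be shown without pandas. The sketch below works on plain dictionaries and is only an illustration of the idea: `dedupe_rows` and the shape of its report are hypothetical, and the real `dedupe_with_report` may report duplicates differently.

```python
def dedupe_rows(rows, exclude=()):
    """Drop repeated rows, comparing only the non-excluded columns.

    rows is a list of dicts; returns (deduped_rows, report).
    """
    seen = {}      # comparison key -> number of occurrences
    deduped = []
    for row in rows:
        key = tuple((k, v) for k, v in sorted(row.items()) if k not in exclude)
        if key not in seen:
            deduped.append(row)   # keep the first occurrence
            seen[key] = 0
        seen[key] += 1
    # Hypothetical report format: one entry per duplicated key
    report = [{"key": dict(k), "count": n} for k, n in seen.items() if n > 1]
    return deduped, report


rows = [
    {"id": "1", "name": "Ann"},
    {"id": "2", "name": "Ann"},
    {"id": "3", "name": "Bob"},
]
print(dedupe_rows(rows)[1])                  # []
print(dedupe_rows(rows, exclude=["id"])[1])  # [{'key': {'name': 'Ann'}, 'count': 2}]
```

With the `id` column included, every row is unique. Once it is excluded, the two "Ann" rows compare equal and only the first is kept.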
CSV File Classification (Python)
from csvsmith.classify import CSVClassifier
classifier = CSVClassifier(
source_dir="./raw_data",
dest_dir="./organized",
auto=True,
mode="relaxed", # or "strict"
match="exact", # or "contains"
)
classifier.run()
# Roll back using the generated manifest
classifier.rollback("./organized/manifest_YYYYMMDD_HHMMSS.json")
CLI Usage
csvsmith provides a CLI for duplicate detection and CSV organization.
Show duplicate rows
csvsmith row-duplicates input.csv
Save duplicate rows only:
csvsmith row-duplicates input.csv -o duplicates_only.csv
Deduplicate and generate a report
csvsmith dedupe input.csv --deduped deduped.csv --report duplicate_report.csv
Classify CSVs
# Dry-run (preview only)
csvsmith classify --src ./raw --dest ./out --auto --dry-run
# Exact matching (default)
csvsmith classify --src ./raw --dest ./out --config signatures.json
# Relaxed matching (ignore column order)
csvsmith classify --src ./raw --dest ./out --config signatures.json --mode relaxed
# Subset matching (signature columns must be present)
csvsmith classify --src ./raw --dest ./out --config signatures.json --match contains
# Report-only (plan without moving files)
csvsmith classify --src ./raw --dest ./out --auto --report-only
# Roll back using manifest
csvsmith classify --rollback ./out/manifest_YYYYMMDD_HHMMSS.json
Report-only mode
--report-only scans all CSVs and writes a manifest describing what would
happen, without moving any files. Downstream pipelines can then consume
the classification plan for custom processing.
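As an illustration of consuming such a plan, the sketch below groups planned moves by destination folder. The manifest schema is not documented in this README, so the `operations`, `src`, and `dest` keys are purely hypothetical placeholders; inspect a real manifest and adjust the key names before relying on anything like this.

```python
import json
from collections import defaultdict
from pathlib import PurePosixPath


def plan_by_folder(manifest_text):
    """Group planned file names by destination folder.

    Assumes a HYPOTHETICAL schema:
    {"operations": [{"src": "...", "dest": "..."}, ...]}
    """
    manifest = json.loads(manifest_text)
    plan = defaultdict(list)
    for op in manifest["operations"]:
        folder = str(PurePosixPath(op["dest"]).parent)
        plan[folder].append(PurePosixPath(op["src"]).name)
    return dict(plan)


example = json.dumps({"operations": [
    {"src": "raw/a.csv", "dest": "out/group_1/a.csv"},
    {"src": "raw/b.csv", "dest": "out/group_1/b.csv"},
    {"src": "raw/c.csv", "dest": "out/group_2/c.csv"},
]})
print(plan_by_folder(example))
```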
Philosophy
- CSVs deserve tools that are simple, predictable, and transparent.
- A row has meaning only when its identity is stable and hashable.
- Collisions are sin; determinism is virtue.
- Let no delimiter sow ambiguity among fields.
- Love thy \x1f — the unseen separator, guardian of clean hashes.
- The pipeline should be silent unless something is wrong.
- Your data deserves respect — and your tools should help you give it.
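The points about delimiters and \x1f can be made concrete with a short sketch. It assumes a row digest is computed by joining field values with the ASCII unit separator and hashing the result with SHA-256. The README does not spell out csvsmith's exact algorithm, so `row_digest` here is a hypothetical stand-in.

```python
import hashlib

UNIT_SEP = "\x1f"  # ASCII unit separator, unlikely to occur inside CSV fields


def row_digest(values):
    """Return a hex SHA-256 digest of a row (illustrative only)."""
    joined = UNIT_SEP.join(str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


row_a = ["a,b", "c"]
row_b = ["a", "b,c"]

# Joining with a comma makes two different rows look identical...
print(",".join(row_a) == ",".join(row_b))      # True
# ...while the unit separator keeps them distinct.
print(row_digest(row_a) == row_digest(row_b))  # False
```

A separator that can also appear inside field values lets distinct rows collapse to the same string, and therefore the same hash. A control character such as \x1f makes that far less likely, though a field containing \x1f itself would still need escaping.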
License
MIT License.
File details
Details for the file csvsmith-0.2.1.tar.gz.
File metadata
- Download URL: csvsmith-0.2.1.tar.gz
- Upload date:
- Size: 16.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 70e37377e7c7ccec023fc3fe3b3fca2344b3688d7b193778c252e2c5cc74641f |
| MD5 | c2633442ceca5c26fa503da3f41507ec |
| BLAKE2b-256 | 2b9e4342445ba20fa02a69651f6dd8c813d1aa73d7ae129b46b1f2178421179d |
File details
Details for the file csvsmith-0.2.1-py3-none-any.whl.
File metadata
- Download URL: csvsmith-0.2.1-py3-none-any.whl
- Upload date:
- Size: 13.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9032df2bd2ab069c1b7d1aca0b8ddea89cdb7cdcb48483a52a76b3a9adc2ea7d |
| MD5 | fb8b70896580ada78738d3cb553c2efe |
| BLAKE2b-256 | 707e97694f85a482e8e30e102bf54fd8ac79e7e25f9726dadd71b06b09914c22 |