# csvsmith

Small CSV utilities: classification, duplicates, row digests, and CLI helpers.

## Introduction
csvsmith is a lightweight collection of CSV utilities designed for
data integrity, deduplication, and organization. It provides a robust
Python API for programmatic data cleaning and a convenient CLI for quick
operations. Whether you need to organize thousands of files based on
their structural signatures or pinpoint duplicate rows in a complex
dataset, csvsmith ensures the process is predictable, transparent, and
reversible.
## Table of Contents

- [Python API Usage](#python-api-usage)
  - [Count duplicate values](#count-duplicate-values)
  - [Find duplicate rows in a DataFrame](#find-duplicate-rows-in-a-dataframe)
  - [Deduplicate with report](#deduplicate-with-report)
  - [CSV File Classification](#csv-file-classification)
- [CLI Usage](#cli-usage)
  - [Show duplicate rows](#show-duplicate-rows)
  - [Deduplicate and generate a duplicate report](#deduplicate-and-generate-a-duplicate-report)
  - [Classify CSVs](#classify-csvs)
## Installation

From PyPI:

```bash
pip install csvsmith
```

For local development:

```bash
git clone https://github.com/yeiichi/csvsmith.git
cd csvsmith
python -m venv .venv
source .venv/bin/activate
pip install -e .[dev]
```
## Python API Usage

### Count duplicate values

Works on any iterable of hashable items:

```python
from csvsmith import count_duplicates_sorted

items = ["a", "b", "a", "c", "a", "b"]
print(count_duplicates_sorted(items))
# [('a', 3), ('b', 2)]
```
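For context, the behavior above matches a simple `collections.Counter` filter; a stdlib sketch of the same idea (illustrative only, not the library's actual implementation):

```python
from collections import Counter


def count_duplicates_sorted_sketch(items):
    """Count items, keep only those seen more than once,
    sorted by descending count (stand-in illustration of
    csvsmith.count_duplicates_sorted)."""
    counts = Counter(items)
    return sorted(
        ((item, n) for item, n in counts.items() if n > 1),
        key=lambda pair: (-pair[1], pair[0]),
    )


print(count_duplicates_sorted_sketch(["a", "b", "a", "c", "a", "b"]))
# [('a', 3), ('b', 2)]
```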
### Find duplicate rows in a DataFrame

```python
import pandas as pd

from csvsmith import find_duplicate_rows

df = pd.read_csv("input.csv")
dup_rows = find_duplicate_rows(df)
print(dup_rows)
```
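To see what such a result looks like without a CSV on disk, the same behavior can be approximated in plain pandas, where `keep=False` marks every member of each duplicate group (csvsmith's exact semantics may differ):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["ann", "bob", "ann", "cy"],
    "age": [30, 25, 30, 41],
})

# keep=False flags all copies of a duplicated row, not just the repeats
dup_rows = df[df.duplicated(keep=False)]
print(dup_rows)  # rows 0 and 2 ("ann", 30)
```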
### Deduplicate with report

```python
import pandas as pd

from csvsmith import dedupe_with_report

df = pd.read_csv("input.csv")

# Use all columns
deduped, report = dedupe_with_report(df)
deduped.to_csv("deduped.csv", index=False)
report.to_csv("duplicate_report.csv", index=False)

# Use all columns except an ID column
deduped_no_id, report_no_id = dedupe_with_report(df, exclude=["id"])
```
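As a rough mental model (an assumption about the semantics, not the library's code), "dedupe with report" splits a frame into first occurrences and the dropped repeats. In plain pandas, with an `id` column excluded from the comparison:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "city": ["osaka", "osaka", "kobe"],
})

# Compare on every column except "id": rows 1 and 2 collide on "city"
key_cols = [c for c in df.columns if c != "id"]
mask = df.duplicated(subset=key_cols, keep="first")
deduped_sketch = df[~mask]  # first occurrences survive
report_sketch = df[mask]    # dropped repeats go to the report

print(len(deduped_sketch), len(report_sketch))
# 2 1
```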
### CSV File Classification

Organize files into directories based on their headers:

```python
from csvsmith.classify import CSVClassifier

classifier = CSVClassifier(
    source_dir="./raw_data",
    dest_dir="./organized",
    auto=True,  # Automatically group files with identical headers
)

# Execute the classification
classifier.run()

# Or roll back a previous run using its manifest
classifier.rollback("./organized/manifest_20260121_120000.json")
```
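The core idea of header-based classification can be sketched with the standard library alone: read each file's header row as a tuple and bucket files by that "signature". This illustrates the concept, not `CSVClassifier`'s actual code:

```python
import csv
import tempfile
from pathlib import Path


def header_signature(path: Path) -> tuple:
    """Return the header row of a CSV file as a tuple."""
    with path.open(newline="") as fh:
        return tuple(next(csv.reader(fh)))


# Build a throwaway source directory with two header signatures
src = Path(tempfile.mkdtemp())
(src / "a.csv").write_text("name,age\nann,30\n")
(src / "b.csv").write_text("name,age\nbob,25\n")
(src / "c.csv").write_text("sku,price\nx1,9.99\n")

groups = {}
for f in sorted(src.glob("*.csv")):
    groups.setdefault(header_signature(f), []).append(f.name)

for sig, names in groups.items():
    print(sig, names)
# ('name', 'age') ['a.csv', 'b.csv']
# ('sku', 'price') ['c.csv']
```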
## CLI Usage

csvsmith includes a command-line interface for duplicate detection and
file organization.

### Show duplicate rows

```bash
csvsmith row-duplicates input.csv
```

Save only the duplicate rows to a file:

```bash
csvsmith row-duplicates input.csv -o duplicates_only.csv
```
### Deduplicate and generate a duplicate report

```bash
csvsmith dedupe input.csv --deduped deduped.csv --report duplicate_report.csv
```
### Classify CSVs

Organize a mess of CSV files into structured folders based on their column headers:

```bash
# Preview what would happen (dry run)
csvsmith classify --src ./raw_data --dest ./organized --auto --dry-run

# Run classification with a signature config
csvsmith classify --src ./raw_data --dest ./organized --config signatures.json

# Undo a classification run
csvsmith classify --rollback ./organized/manifest_20260121_120000.json
```
## Philosophy

- CSVs deserve tools that are simple, predictable, and transparent.
- A row has meaning only when its identity is stable and hashable.
- Collisions are sin; determinism is virtue.
- Let no delimiter sow ambiguity among fields.
- Love thy `\x1f`: the unseen separator, the quiet guardian of clean hashes.
- The pipeline should be silent unless something is wrong.
- Your data deserves respect, and your tools should help you give it.

For more, see MANIFESTO.md.
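The `\x1f` line refers to the ASCII unit separator. Joining fields with a character that cannot occur in the data keeps `("ab", "c")` and `("a", "bc")` from hashing to the same digest. A minimal sketch of such a row digest, illustrating the principle rather than csvsmith's exact scheme:

```python
import hashlib


def row_digest(fields) -> str:
    # "ab" + "c" and "a" + "bc" would collide if naively concatenated;
    # the \x1f unit separator keeps field boundaries unambiguous.
    joined = "\x1f".join(str(f) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


print(row_digest(["ab", "c"]) == row_digest(["a", "bc"]))
# False
print(row_digest(["ab", "c"]) == row_digest(["ab", "c"]))
# True
```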
## License

MIT License.
## File details

Details for the file csvsmith-0.2.0.tar.gz.

### File metadata

- Download URL: csvsmith-0.2.0.tar.gz
- Upload date:
- Size: 14.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 93f1af0c24b2e5426fc5a8b8f91e668a4b3bdfbc95b324f19825cff1ff58fa73 |
| MD5 | 64ba6e30c1e06c6cc92a0cf32e1c4c28 |
| BLAKE2b-256 | dadd75e169a56d4b8d34410aed7932d5e7b360260de19f68bed15026dcd0e761 |
## File details

Details for the file csvsmith-0.2.0-py3-none-any.whl.

### File metadata

- Download URL: csvsmith-0.2.0-py3-none-any.whl
- Upload date:
- Size: 11.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.2

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | fabf7764d01b9bbbaec53092ac9458a61506b516d15c6094ca33efc73a38d9ce |
| MD5 | 73b55db1bdbab98aadd168976989f5b0 |
| BLAKE2b-256 | 6e992113a27ad6320693fd8ae29dd225b610fb75f7d9211034fd6ba368833d69 |