# csvsmith

Small CSV utilities: duplicates, row digests, and CLI helpers.

csvsmith is a small collection of CSV utilities. Current focus:

- Duplicate value counting (`count_duplicates_sorted`)
- Row-level digest creation (`add_row_digest`)
- Duplicate-row detection (`find_duplicate_rows`)
- Deduplication with a full duplicate report (`dedupe_with_report`)
- Command-line interface (CLI) for quick operations
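`add_row_digest` is listed above but not shown in the examples below. As a rough illustration only (not the library's actual implementation), a row digest can be built by joining a row's field values with an unambiguous separator and hashing the result; the function name and choice of SHA-256 here are assumptions:

```python
import hashlib

def row_digest(values):
    """Hypothetical sketch: join a row's fields with the unit separator
    (\\x1f), which does not occur in ordinary CSV data, then take a
    SHA-256 hex digest of the joined string."""
    joined = "\x1f".join(str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Two rows that would collide under a naive comma join stay distinct:
a = row_digest(["a,b", "c"])
b = row_digest(["a", "b,c"])
print(a != b)  # True
```

Joining with a plain comma would make `["a,b", "c"]` and `["a", "b,c"]` hash identically; an out-of-band separator avoids that ambiguity.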
## Installation

From PyPI (future):

```shell
pip install csvsmith
```

For local development:

```shell
git clone https://github.com/yeiichi/csvsmith.git
cd csvsmith
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```

(The quotes around `.[dev]` prevent shells such as zsh from interpreting the brackets as a glob pattern.)
## Python API Usage

### Count duplicate values

```python
from csvsmith import count_duplicates_sorted

items = ["a", "b", "a", "c", "a", "b"]
print(count_duplicates_sorted(items))
# [('a', 3), ('b', 2)]
```
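For reference, the output above suggests the function keeps only values seen more than once, sorted by count in descending order. A minimal sketch of equivalent behavior using `collections.Counter` (this is an illustration, not csvsmith's actual code):

```python
from collections import Counter

def count_duplicates_sorted_sketch(items):
    # Count every value, keep only those seen more than once,
    # and sort by count, highest first.
    counts = Counter(items)
    return sorted(
        ((value, n) for value, n in counts.items() if n > 1),
        key=lambda pair: pair[1],
        reverse=True,
    )

print(count_duplicates_sorted_sketch(["a", "b", "a", "c", "a", "b"]))
# [('a', 3), ('b', 2)]
```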
### Find duplicate rows in a DataFrame

```python
import pandas as pd

from csvsmith import find_duplicate_rows

df = pd.read_csv("input.csv")
dup_rows = find_duplicate_rows(df)
print(dup_rows)
```
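A function like this presumably builds on pandas' own duplicate detection. A minimal sketch of comparable behavior (the helper name and the `keep=False` semantics are assumptions, not csvsmith's documented API):

```python
import pandas as pd

def find_duplicate_rows_sketch(df, subset=None):
    # keep=False marks *every* member of a duplicate group,
    # not just the repeats after the first occurrence.
    return df[df.duplicated(subset=subset, keep=False)]

df = pd.DataFrame({"x": [1, 1, 2], "y": ["a", "a", "b"]})
print(find_duplicate_rows_sketch(df))  # rows 0 and 1
```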
### Deduplicate with report

```python
import pandas as pd

from csvsmith import dedupe_with_report

df = pd.read_csv("input.csv")

# Use all columns
deduped, report = dedupe_with_report(df)
deduped.to_csv("deduped.csv", index=False)
report.to_csv("duplicate_report.csv", index=False)

# Use all columns except an ID column
deduped_no_id, report_no_id = dedupe_with_report(df, exclude=["id"])
```
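As a sketch of what `dedupe_with_report` might do under the hood, assuming the report contains every row involved in a duplicate group and deduplication keeps the first occurrence (both assumptions, not confirmed behavior):

```python
import pandas as pd

def dedupe_with_report_sketch(df, subset=None, exclude=None):
    # Columns used for the duplicate check: an explicit subset if given,
    # otherwise all columns minus any excluded ones (e.g. an ID column).
    if subset is None:
        subset = [c for c in df.columns if not (exclude and c in exclude)]
    # Report: all members of every duplicate group.
    report = df[df.duplicated(subset=subset, keep=False)]
    # Deduped: keep the first occurrence of each group.
    deduped = df.drop_duplicates(subset=subset, keep="first")
    return deduped, report

df = pd.DataFrame({"id": [1, 2, 3], "v": ["x", "x", "y"]})
deduped, report = dedupe_with_report_sketch(df, exclude=["id"])
print(len(deduped), len(report))  # 2 2
```

Excluding `id` matters here: with the ID column included, no two rows would ever match, so nothing would be reported as a duplicate.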
## CLI Usage

csvsmith includes a small command-line interface for duplicate detection and CSV deduplication.

### Show duplicate rows

```shell
csvsmith row-duplicates input.csv
```

Save only duplicate rows to a file:

```shell
csvsmith row-duplicates input.csv -o duplicates_only.csv
```

Use only a subset of columns to determine duplicates:

```shell
csvsmith row-duplicates input.csv --subset col1 col2 -o dup_rows_subset.csv
```

Exclude ID column(s) when looking for duplicates:

```shell
csvsmith row-duplicates input.csv --exclude id -o dup_rows_no_id.csv
```

### Deduplicate and generate a duplicate report

```shell
csvsmith dedupe input.csv --deduped deduped.csv --report duplicate_report.csv
```

Deduplicate using selected columns:

```shell
csvsmith dedupe input.csv --subset col1 col2 --deduped deduped_subset.csv --report duplicate_report_subset.csv
```

Remove all occurrences of duplicated rows:

```shell
csvsmith dedupe input.csv --subset col1 --keep False --deduped deduped_no_dups.csv --report duplicate_report_col1.csv
```

Exclude "id" from duplicate logic:

```shell
csvsmith dedupe input.csv --exclude id --deduped deduped_no_id.csv --report duplicate_report_no_id.csv
```
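A CLI like the one above is commonly wired with `argparse` subcommands. The following is an illustrative sketch, not csvsmith's actual entry point; the command and flag names mirror the examples above, but the wiring and defaults are assumptions:

```python
import argparse

def build_parser():
    # Illustrative parser mirroring the commands shown above.
    parser = argparse.ArgumentParser(prog="csvsmith")
    sub = parser.add_subparsers(dest="command", required=True)

    dup = sub.add_parser("row-duplicates", help="show duplicate rows")
    dup.add_argument("input")
    dup.add_argument("-o", "--output")
    dup.add_argument("--subset", nargs="+")
    dup.add_argument("--exclude", nargs="+")

    ded = sub.add_parser("dedupe", help="deduplicate and write a report")
    ded.add_argument("input")
    ded.add_argument("--subset", nargs="+")
    ded.add_argument("--exclude", nargs="+")
    ded.add_argument("--keep", default="first")
    ded.add_argument("--deduped")
    ded.add_argument("--report")
    return parser

args = build_parser().parse_args(
    ["dedupe", "input.csv", "--subset", "col1", "--keep", "False"]
)
print(args.command, args.subset, args.keep)  # dedupe ['col1'] False
```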
## Philosophy ("csvsmith Manifesto")

- CSVs deserve tools that are simple, predictable, and transparent.
- A row has meaning only when its identity is stable and hashable.
- Collisions are sin; determinism is virtue.
- Let no delimiter sow ambiguity among fields.
- Love thy `\x1f`: the unseen separator, the quiet guardian of clean hashes, chosen not for aesthetics but for truth.
- The pipeline should be silent unless something is wrong.
- Your data deserves respect, and your tools should help you give it.

For more, see MANIFESTO.md.
## License

MIT License.
## File details

### csvsmith-0.1.1.tar.gz

- Download URL: csvsmith-0.1.1.tar.gz
- Size: 10.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9

| Algorithm | Hash digest |
|---|---|
| SHA256 | ddc623ff580aacc0df4a2456b66cbe8c2425cfcde5dcfb83d7d59f3f3e0d5446 |
| MD5 | 5871b64c5ba16fb209417f95ea0ecd31 |
| BLAKE2b-256 | 83994c2c4949826e8c30e8d8c37fba201bebf03f9cf2aeae19b40c78f29be711 |
### csvsmith-0.1.1-py3-none-any.whl

- Download URL: csvsmith-0.1.1-py3-none-any.whl
- Size: 8.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9

| Algorithm | Hash digest |
|---|---|
| SHA256 | bc9b9e74146523c23366f99cfad860a820642762f3cb11a8fbfe044fcad44375 |
| MD5 | e0f364c8c670096bc1a477a89edf952a |
| BLAKE2b-256 | 06f2fa2094a1ce70705942bd98c45142fc3ed2dd3c6e02d54202d31129e8c683 |