Small CSV utilities: classification, duplicates, row digests, and CLI helpers.

csvsmith

Introduction

csvsmith is a lightweight collection of CSV utilities designed for data integrity, deduplication, and organization. It provides a robust Python API for programmatic data cleaning and a convenient CLI for quick operations. Whether you need to organize thousands of files based on their structural signatures or pinpoint duplicate rows in a complex dataset, csvsmith ensures the process is predictable, transparent, and reversible.

Table of Contents

- [Python API Usage](#python-api-usage)
  - [Count duplicate values](#count-duplicate-values)
  - [Find duplicate rows in a DataFrame](#find-duplicate-rows-in-a-dataframe)
  - [Deduplicate with report](#deduplicate-with-report)
  - [CSV File Classification](#csv-file-classification)
- [CLI Usage](#cli-usage)
  - [Show duplicate rows](#show-duplicate-rows)
  - [Deduplicate and generate a duplicate report](#deduplicate-and-generate-a-duplicate-report)
  - [Classify CSVs](#classify-csvs)

Installation

From PyPI:

pip install csvsmith

For local development:

git clone https://github.com/yeiichi/csvsmith.git
cd csvsmith
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"  # quotes keep shells like zsh from globbing the brackets

Python API Usage

Count duplicate values

Works on any iterable of hashable items.

from csvsmith import count_duplicates_sorted

items = ["a", "b", "a", "c", "a", "b"]
print(count_duplicates_sorted(items))
# [('a', 3), ('b', 2)]
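
The behavior above can be reproduced with the standard library's Counter; a minimal sketch of the same idea (not the library's actual implementation):

```python
from collections import Counter

def count_dupes_sketch(items):
    """Return (value, count) pairs for values seen more than once,
    most frequent first (ties broken by value)."""
    counts = Counter(items)
    dupes = [(value, n) for value, n in counts.items() if n > 1]
    return sorted(dupes, key=lambda pair: (-pair[1], pair[0]))

print(count_dupes_sketch(["a", "b", "a", "c", "a", "b"]))
# [('a', 3), ('b', 2)]
```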

Find duplicate rows in a DataFrame

import pandas as pd
from csvsmith import find_duplicate_rows

df = pd.read_csv("input.csv")
dup_rows = find_duplicate_rows(df)
print(dup_rows)
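
To see what "duplicate rows" means here without an input file, plain pandas flags every member of a duplicate group with `duplicated(keep=False)`; a self-contained sketch (the exact semantics of find_duplicate_rows may differ):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["alice", "bob", "alice", "carol"],
    "city": ["tokyo", "osaka", "tokyo", "kyoto"],
})

# keep=False marks every row in a duplicate group, not just the later copies
dup_rows = df[df.duplicated(keep=False)]
# rows 0 and 2 (alice/tokyo) are both flagged
```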

Deduplicate with report

import pandas as pd
from csvsmith import dedupe_with_report

df = pd.read_csv("input.csv")

# Use all columns
deduped, report = dedupe_with_report(df)
deduped.to_csv("deduped.csv", index=False)
report.to_csv("duplicate_report.csv", index=False)

# Use all columns except an ID column
deduped_no_id, report_no_id = dedupe_with_report(df, exclude=["id"])
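
A rough plain-pandas equivalent of deduplicate-plus-report, assuming the report holds the rows that were dropped (csvsmith's actual report columns may differ):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["alice", "bob", "alice", "bob"],
})

# Compare on every column except "id", mirroring exclude=["id"]
key_cols = [c for c in df.columns if c != "id"]

deduped = df.drop_duplicates(subset=key_cols, keep="first")
dropped = df[df.duplicated(subset=key_cols, keep="first")]  # rows removed
```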

CSV File Classification

Organize files into directories based on their headers.

from csvsmith.classify import CSVClassifier

classifier = CSVClassifier(
    source_dir="./raw_data",
    dest_dir="./organized",
    auto=True  # Automatically group files with identical headers
)

# Execute the classification
classifier.run()

# Or rollback a previous run using its manifest
classifier.rollback("./organized/manifest_20260121_120000.json")
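
The grouping idea can be sketched with the standard library alone: hash each file's header row so files with identical columns share a signature. This only illustrates the concept; CSVClassifier's real logic, manifest format, and directory naming are not shown here:

```python
import csv
import hashlib
from collections import defaultdict
from pathlib import Path

def header_signature(path):
    """Hash the header row so files with identical columns match."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))
    # "\x1f" (unit separator) keeps ["a,b"] distinct from ["a", "b"]
    joined = "\x1f".join(header)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]

def group_by_header(source_dir):
    """Map header signature -> list of CSV file names in source_dir."""
    groups = defaultdict(list)
    for path in sorted(Path(source_dir).glob("*.csv")):
        groups[header_signature(path)].append(path.name)
    return dict(groups)
```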

CLI Usage

csvsmith includes a command-line interface for duplicate detection and file organization.

Show duplicate rows

csvsmith row-duplicates input.csv

Save only duplicate rows to a file:

csvsmith row-duplicates input.csv -o duplicates_only.csv

Deduplicate and generate a duplicate report

csvsmith dedupe input.csv --deduped deduped.csv --report duplicate_report.csv

Classify CSVs

Organize a mess of CSV files into structured folders based on their column headers.

# Preview what would happen (Dry Run)
csvsmith classify --src ./raw_data --dest ./organized --auto --dry-run

# Run classification with a signature config
csvsmith classify --src ./raw_data --dest ./organized --config signatures.json

# Undo a classification run
csvsmith classify --rollback ./organized/manifest_20260121_120000.json

Philosophy

  1. CSVs deserve tools that are simple, predictable, and transparent.
  2. A row has meaning only when its identity is stable and hashable.
  3. Collisions are sin; determinism is virtue.
  4. Let no delimiter sow ambiguity among fields.
  5. Love thy \x1f. The unseen separator, the quiet guardian of clean hashes.
  6. The pipeline should be silent unless something is wrong.
  7. Your data deserves respect, and your tools should help you give it.

For more, see MANIFESTO.md.
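
Item 5 alludes to a standard trick for stable row identity: join fields with a character that should never appear in the data, then hash. A hedged sketch of the idea, not csvsmith's internal code:

```python
import hashlib

def row_digest(fields):
    """Stable digest of a row. Joining with "," would make ("a,b", "c")
    collide with ("a", "b,c"); the ASCII unit separator avoids that."""
    joined = "\x1f".join(str(f) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

assert row_digest(["a,b", "c"]) != row_digest(["a", "b,c"])
assert row_digest(["a", "b"]) == row_digest(["a", "b"])
```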

License

MIT License.
