Small CSV utilities: row deduplication, classification, row filtering, and CLI helpers.

Project description

Small, focused CSV utilities for common data wrangling tasks.

csvsmith provides a handful of practical tools for working with CSV files, including cleaning numeric values, filtering rows, deduplicating records, classifying files, converting Excel spreadsheets to CSV, moving files by suffix, and finding matches inside CSV content.

Documentation

Read the full documentation at:

https://csvsmith.readthedocs.io/en/latest/

Features

  • Clean numeric strings into normalized values

  • Filter CSV rows by substring matching

  • Deduplicate row data and generate reports

  • Classify CSV files into folders based on headers/signatures

  • Convert Excel workbooks to CSV

  • Move files by suffix

  • Find matching values inside CSV files

  • Concatenate CSV files with identical headers

  • Use the tools either from Python or from the command line

Installation

Install the package from PyPI with pip:

pip install csvsmith

If you are developing locally, install it in editable mode from the project root:

pip install -e .

Quick start

You can use the library from Python:

from csvsmith.utils.clean_numeric import clean_currency_numeric

print(clean_currency_numeric("$1,234.56"))
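To give a rough idea of what this kind of cleaning involves, here is a minimal standard-library sketch. It is not csvsmith's implementation: the function name strip_currency_number and its sep and decimal parameters are hypothetical, chosen only to mirror the --sep and --decimal options of the CLI.

```python
from decimal import Decimal, InvalidOperation


def strip_currency_number(text, sep=",", decimal="."):
    """Hypothetical helper: parse a string such as "$1,234.56" into a Decimal.

    Illustration only; csvsmith's own cleaning rules may differ.
    """
    cleaned = text.strip()
    # Remember a leading minus sign before discarding non-numeric characters.
    negative = cleaned.startswith("-")
    # Keep ASCII digits and the two separator characters; drop symbols and spaces.
    kept = "".join(ch for ch in cleaned if ch in "0123456789" or ch in (sep, decimal))
    # Remove thousands separators, then normalize the decimal mark to ".".
    kept = kept.replace(sep, "")
    if decimal != ".":
        kept = kept.replace(decimal, ".")
    try:
        value = Decimal(kept)
    except InvalidOperation:
        raise ValueError(f"not a numeric string: {text!r}") from None
    return -value if negative else value


print(strip_currency_number("$1,234.56"))  # prints 1234.56
```

Returning a Decimal rather than a float avoids binary rounding surprises with money-like values; whether csvsmith returns a float, a Decimal, or a string is not stated here.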

For command-line usage, list the available subcommands:

csvsmith --help

When passing values that contain $, wrap them in single quotes so the shell does not expand them.

Command-line usage

The package provides a CLI with several subcommands.

Clean numeric values:

csvsmith clean-numeric "1,234.56" --sep "," --decimal "."

Clean currency-prefixed numeric values:

csvsmith clean-currency-numeric '$1,234.56' --sep "," --decimal "."

Filter rows in a CSV:

csvsmith drop-rows input.csv notes spam --case-insensitive --drop-header
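The idea behind substring-based row filtering can be sketched with the standard library. This sketch assumes the positional arguments name a column and a substring, which the text above does not spell out, and the helper drop_rows_containing is hypothetical rather than part of csvsmith.

```python
import csv
import io


def drop_rows_containing(rows, column, needle, case_insensitive=False):
    """Hypothetical helper: keep only rows whose `column` does not contain `needle`."""
    if case_insensitive:
        needle = needle.casefold()
    kept = []
    for row in rows:
        # Treat a missing or empty cell as an empty string.
        cell = row.get(column) or ""
        if case_insensitive:
            cell = cell.casefold()
        if needle not in cell:
            kept.append(row)
    return kept


sample = "id,notes\n1,Great\n2,SPAM offer\n3,spam\n"
rows = list(csv.DictReader(io.StringIO(sample)))
# Case-sensitive matching only drops row 3; case-insensitive also drops row 2.
print([r["id"] for r in drop_rows_containing(rows, "notes", "spam")])
print([r["id"] for r in drop_rows_containing(rows, "notes", "spam", case_insensitive=True)])
```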

Deduplicate rows:

csvsmith dedupe input.csv -o out.csv --subset id --keep first
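As a sketch of what deduplicating on a subset of columns means, the following standard-library example keeps one row per distinct key. The helper dedupe_rows is hypothetical and only mirrors the --subset and --keep options conceptually; it is not csvsmith's code.

```python
import csv
import io


def dedupe_rows(rows, subset, keep="first"):
    """Hypothetical helper: keep one row per distinct combination of `subset` columns."""
    chosen = {}  # key -> row; dicts preserve insertion order
    for row in rows:
        key = tuple(row[col] for col in subset)
        if keep == "last":
            # Re-insert so the surviving row sits at the position of its last occurrence.
            chosen.pop(key, None)
            chosen[key] = row
        elif key not in chosen:
            chosen[key] = row
    return list(chosen.values())


sample = "id,name\n1,a\n2,b\n1,c\n"
rows = list(csv.DictReader(io.StringIO(sample)))
print([r["name"] for r in dedupe_rows(rows, subset=["id"])])               # ['a', 'b']
print([r["name"] for r in dedupe_rows(rows, subset=["id"], keep="last")])  # ['b', 'c']
```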

Classify CSV files:

csvsmith classify src_dir dst_dir --mode relaxed --match subset --auto --dry-run

Convert Excel to CSV:

csvsmith excel2csv input.xlsx

Move files by suffix:

csvsmith move-files src_dir dst_dir --suffixes .csv,.pdf

Find matches in a CSV:

csvsmith find-matches input.csv target --ignore-case --ignore-whitespace

Concatenate CSV files:

csvsmith strict-concat file1.csv file2.csv -o combined.csv
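Concatenating only when headers are identical can be illustrated as follows. This is a minimal sketch, assuming "strict" means the header rows must match exactly; the helper concat_same_header is hypothetical and works on in-memory CSV text rather than files.

```python
import csv
import io


def concat_same_header(sources):
    """Hypothetical helper: concatenate CSV texts, refusing mismatched headers."""
    header = None
    combined = []
    for text in sources:
        reader = csv.reader(io.StringIO(text))
        current = next(reader, None)
        if current is None:
            continue  # an empty input contributes nothing
        if header is None:
            header = current
        elif current != header:
            raise ValueError(f"header mismatch: {current} != {header}")
        # The header row is already consumed, so only data rows remain.
        combined.extend(reader)
    return header, combined


header, rows = concat_same_header(["a,b\n1,2\n", "a,b\n3,4\n"])
print(header, rows)
```

Failing loudly on a mismatch, instead of aligning columns by name, keeps accidental schema drift from going unnoticed.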

Find matches in a CSV

find_matches_in_csv searches a CSV file for a target value and returns match records containing coordinates and row context information.

Python API:

from csvsmith import find_matches_in_csv

results = find_matches_in_csv("input.csv", "target")

CLI:

csvsmith find-matches input.csv target

Options:

  • --ignore-case: ignore case while matching

  • --ignore-whitespace: ignore whitespace while matching

  • --no-nfkc: disable NFKC normalization

If matches are found, the CLI prints formatted JSON. If no matches are found, it prints a simple message.
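The effect of the three options can be illustrated with a small standard-library sketch. It assumes whole-cell comparison after normalization and returns plain (row, column, cell) tuples; csvsmith's actual matching rules and record format may differ, and the helpers normalize and find_cells are hypothetical.

```python
import csv
import io
import unicodedata


def normalize(value, ignore_case=False, ignore_whitespace=False, nfkc=True):
    """Hypothetical helper: apply the normalizations the options describe."""
    if nfkc:
        # NFKC folds compatibility forms, e.g. fullwidth letters, into plain ones.
        value = unicodedata.normalize("NFKC", value)
    if ignore_whitespace:
        value = "".join(value.split())
    if ignore_case:
        value = value.casefold()
    return value


def find_cells(text, target, **options):
    """Hypothetical helper: return (row_index, column_index, cell) for matching cells."""
    wanted = normalize(target, **options)
    hits = []
    for r, row in enumerate(csv.reader(io.StringIO(text))):
        for c, cell in enumerate(row):
            if normalize(cell, **options) == wanted:
                hits.append((r, c, cell))
    return hits


# Row 1 holds fullwidth "ABC"; row 2 holds " a b c " with extra spaces.
sample = "name,code\nAlice,\uff21\uff22\uff23\nbob, a b c \n"
print(find_cells(sample, "ABC"))
print(find_cells(sample, "abc", ignore_case=True, ignore_whitespace=True))
print(find_cells(sample, "ABC", nfkc=False))
```

With NFKC enabled, the fullwidth cell matches the plain target; with it disabled, the same search finds nothing.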

Other Python APIs

The package also exposes a few other helper functions and classes from its top-level API.

Numeric and row tools:

from csvsmith import (
    clean_numeric,
    count_duplicates_sorted,
    add_row_digest,
    find_duplicate_rows,
    dedupe_with_report,
    read_csv_rows,
    write_csv_rows,
)

CSV classification and filtering:

from csvsmith import CSVClassifier, DropRowsBySubstring, CSVCleaner

File and conversion helpers:

from csvsmith import excel_to_csv, move_by_suffix, strict_concat_rows, save_csv

String comparison utilities:

from csvsmith import StringDistance, Relation, Result, analyze_pair

Project structure

The code is organized into two main areas:

  • csvsmith.tools for higher-level CSV workflows

  • csvsmith.utils for reusable utility helpers

Testing

Run the test suite with your preferred Python test runner.

Example:

pytest

License

See the project license for details.

