A utility library for data cleaning and parsing.

smith-utils

Smith Utils is a central hub for data cleaning and parsing scripts. This package consolidates distributed utility functions to improve code reuse and maintenance efficiency across all yeiichi projects.

Key Features

Datetime Utilities (smith_utils.datetime)

Robust date parsing and formatting.

  • ensure_date: Flexible conversion of strings, datetime.date objects, or None (returns today) into a date object.
  • parse_strict_date: Strict parsing for YYYYMMDD or YYYY-MM-DD formats, rejecting ambiguous inputs.
  • format_ordinal: Converts integers to ordinal strings (e.g., 1 → "1st", 22 → "22nd").
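
The strict-parsing and ordinal ideas above can be sketched with the standard library alone. This is a minimal illustration and not the smith_utils implementation: the names strict_date_sketch and ordinal_sketch are hypothetical, and the choice to raise ValueError on rejected input is an assumption.

```python
import re
from datetime import date, datetime

# Only these two layouts are accepted; anything else is rejected outright.
_STRICT_PATTERNS = (
    (re.compile(r"[0-9]{8}"), "%Y%m%d"),
    (re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}"), "%Y-%m-%d"),
)


def strict_date_sketch(text: str) -> date:
    """Parse YYYYMMDD or YYYY-MM-DD; raise ValueError for any other input."""
    for pattern, fmt in _STRICT_PATTERNS:
        if pattern.fullmatch(text):
            # strptime also rejects impossible dates such as 2023-02-30.
            return datetime.strptime(text, fmt).date()
    raise ValueError(f"not a strict date: {text!r}")


def ordinal_sketch(n: int) -> str:
    """Return n followed by its English ordinal suffix."""
    if 10 <= abs(n) % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(abs(n) % 10, "th")
    return f"{n}{suffix}"
```

Matching the whole string against a fixed pattern before calling strptime is what keeps ambiguous inputs such as "12/25/2023" or "2023-1-5" out.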

Numeric Refinement (smith_utils.numeric)

Clean and parse messy numeric data.

  • parse_numeric_value: Handles custom separators, decimals, and negative formats like (1,234.56).
  • parse_currency_value: Alias for numeric parsing, specifically for currency strings.
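
As a rough illustration of the parsing described above, the sketch below handles configurable separators and accounting-style negatives. It is not the library's code: parse_numeric_sketch and its thousands and decimal parameters are hypothetical, and returning a Decimal (rather than whatever type parse_numeric_value returns) is an assumption made to avoid float rounding.

```python
from decimal import Decimal


def parse_numeric_sketch(text: str, thousands: str = ",", decimal: str = ".") -> Decimal:
    """Parse strings such as '1,234.56', '-42' or '(1,234.56)' into a Decimal."""
    s = text.strip()
    negative = False
    if s.startswith("(") and s.endswith(")"):
        # Accounting notation: parentheses mark a negative amount.
        negative = True
        s = s[1:-1].strip()
    elif s.startswith("-"):
        negative = True
        s = s[1:].strip()
    # Drop the grouping separator first, then map the decimal mark to '.'.
    s = s.replace(thousands, "")
    if decimal != ".":
        s = s.replace(decimal, ".")
    value = Decimal(s)  # raises decimal.InvalidOperation on malformed input
    return -value if negative else value
```

For example, parse_numeric_sketch("1.234,56", thousands=".", decimal=",") treats the period as the grouping separator and the comma as the decimal mark.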

Text Normalization & Metrics (smith_utils.text)

Standardize text and compare string similarity.

  • normalize_text: Unicode NFKC normalization, case folding, and whitespace handling.
  • StringDistance: Implementation of Damerau-Levenshtein and Jaro-Winkler algorithms for fuzzy matching.
  • analyze_pair: Convenience function for string comparison returning a Result.
  • Relation / Result: Relation enum and typed result for text comparisons.
  • make_unicode_char_name_records: Extract Unicode codepoint/name metadata from text.
  • normalize_newlines_stream: Stream-based newline normalization to LF with newline type detection.
  • normalize_file_to_lf: File-based newline normalization helper.
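
To make the normalization and fuzzy-matching ideas concrete, here is a small standard-library sketch. It does not use or reproduce smith_utils internals: normalize_text_sketch and osa_distance are hypothetical names, and the distance shown is the restricted (optimal string alignment) variant of Damerau-Levenshtein, which may differ from what StringDistance implements.

```python
import re
import unicodedata


def normalize_text_sketch(text: str) -> str:
    """Apply NFKC normalization, case folding, and whitespace collapsing."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Collapse runs of whitespace to one space and trim both ends.
    return re.sub(r"\s+", " ", folded).strip()


def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute, and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                # Two adjacent characters swapped.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

Normalizing both strings before measuring distance means that differences in width, case, or spacing do not count as edits.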

Crypto Hash Utilities (smith_utils.crypto)

Calculate SHA-256 digests for text and files.

  • get_text_digest: Returns the SHA-256 hexadecimal digest for a text string.
  • get_file_digest: Returns the SHA-256 hexadecimal digest for a file using streaming reads.
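
Streaming reads keep memory use constant regardless of file size. The sketch below shows the general pattern with hashlib; it is an illustration only, and the function names, the UTF-8 default, and the 64 KiB chunk size are assumptions rather than details of smith_utils.

```python
import hashlib


def text_digest_sketch(text: str, encoding: str = "utf-8") -> str:
    """Return the SHA-256 hex digest of text after encoding it to bytes."""
    return hashlib.sha256(text.encode(encoding)).hexdigest()


def file_digest_sketch(path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Feeding the hash object one chunk at a time yields the same digest as hashing the whole file at once.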

File Classification Utilities (smith_utils.file)

Classify files from multiple evidence sources.

  • classify_file: Returns extension, MIME, magic-number, and file(1) classification evidence.
  • FileClassification: Dataclass result containing raw signals, file_class, and derived categories.
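
The idea of collecting independent signals can be sketched as follows. This is a simplified illustration, not the library's classify_file: EvidenceSketch, classify_sketch, and the short signature table are hypothetical, and the file(1) signal is left out to keep the example free of external commands.

```python
import mimetypes
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# A few well-known leading byte signatures (illustrative, far from complete).
_MAGIC = (
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"PK\x03\x04", "zip"),
    (b"\xff\xd8\xff", "jpeg"),
)


@dataclass(frozen=True)
class EvidenceSketch:
    extension: str             # lowercased suffix, e.g. ".pdf"
    mime_guess: Optional[str]  # MIME type guessed from the file name only
    magic: Optional[str]       # label from the leading bytes, if recognized


def classify_sketch(path) -> EvidenceSketch:
    """Gather extension, name-based MIME, and magic-number evidence for a file."""
    p = Path(path)
    with p.open("rb") as fh:
        head = fh.read(16)
    magic = next((name for sig, name in _MAGIC if head.startswith(sig)), None)
    mime_guess, _ = mimetypes.guess_type(p.name)
    return EvidenceSketch(p.suffix.lower(), mime_guess, magic)
```

Keeping the raw signals separate lets a caller notice disagreements, such as a .pdf extension on a file whose leading bytes are a ZIP signature.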

Installation

Install via pip:

pip install smith-utils

Quick Start

from smith_utils import ensure_date, get_text_digest, parse_numeric_value, normalize_text
from smith_utils import make_unicode_char_name_records
from smith_utils import classify_file, get_file_digest, normalize_file_to_lf

# Datetime
date = ensure_date("20231225") # datetime.date(2023, 12, 25)

# Numeric
value = parse_numeric_value("(1,250.50)") # -1250.5

# Text
clean_text = normalize_text("  Smith  Utils  ") # "smith utils"

# Unicode metadata
records = make_unicode_char_name_records("Aあ")
# [UnicodeCharNameRecord(index=0, codepoint='U+0041', ...), ...]

# Normalize a file's newlines to LF
summary = normalize_file_to_lf("input.txt", "output.txt")
# {'newline_type': 'CRLF', 'bytes_in': ..., 'bytes_out': ...}

# SHA-256 digests
text_digest = get_text_digest("smith-utils")
file_digest = get_file_digest("input.txt")

# File classification
classification = classify_file("input.pdf")
# FileClassification(extension='.pdf', file_class='document', categories=('document', 'pdf'), ...)

Directory Structure

  • src/smith_utils/: Main package source.
  • legacy/: Legacy scripts and templates (not included in distribution).
  • tests/: Comprehensive test suite.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Download files

Source Distribution

smith_utils-0.4.0.tar.gz (15.2 kB)

Uploaded Source

Built Distribution

smith_utils-0.4.0-py3-none-any.whl (16.5 kB)

Uploaded Python 3

File details

Details for the file smith_utils-0.4.0.tar.gz.

File metadata

  • Download URL: smith_utils-0.4.0.tar.gz
  • Upload date:
  • Size: 15.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for smith_utils-0.4.0.tar.gz

  • SHA256: c944e8d066cebcbac3e6bb67f2355cfe1b56b345e576f2d423868cf59799bb86
  • MD5: 5cb2eece9254c69a5a349c0488e5c00a
  • BLAKE2b-256: 27434049bc884c9aac3b6ecc4b6c589b15a4f2755c619cf9be1120aa60a8f1f2

File details

Details for the file smith_utils-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: smith_utils-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 16.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for smith_utils-0.4.0-py3-none-any.whl

  • SHA256: 99f9294c8ccace089899e8fefd71f5efb6945e999bf2a8cc395edb3d9dbf891a
  • MD5: 990c79501215eae8fc70813d6d35fff0
  • BLAKE2b-256: ea1a8bf67442315aee9fbeddbf56e2a97ea51ac589f62e53f22d8575e211f5b3
