A utility library for data cleaning and parsing.
smith-utils
Smith Utils is a central hub for data cleaning and parsing scripts. This package consolidates scattered utility functions to improve code reuse and maintainability across all yeiichi projects.
Key Features
Datetime Utilities (smith_utils.datetime)
Robust date parsing and formatting.
- ensure_date: Flexible conversion of strings, datetime.date objects, or None (returns today) into a date object.
- parse_strict_date: Strict parsing for YYYYMMDD or YYYY-MM-DD formats, rejecting ambiguous inputs.
- format_ordinal: Converts integers to ordinal strings (e.g., 1 → "1st", 22 → "22nd").
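The behavior described above can be approximated with the standard library alone. The sketch below is not the smith_utils implementation: the names ensure_date_sketch, parse_strict_date_sketch, and format_ordinal_sketch are hypothetical, and the real ensure_date may accept more string formats than the two strict ones handled here.

```python
import re
from datetime import date, datetime

# Hypothetical stand-ins for the smith_utils.datetime helpers (assumption:
# the real functions may differ in signature and accepted inputs).
_COMPACT = re.compile(r"(\d{4})(\d{2})(\d{2})", re.ASCII)
_DASHED = re.compile(r"(\d{4})-(\d{2})-(\d{2})", re.ASCII)


def parse_strict_date_sketch(text):
    """Accept only YYYYMMDD or YYYY-MM-DD; raise ValueError otherwise."""
    match = _COMPACT.fullmatch(text) or _DASHED.fullmatch(text)
    if match is None:
        raise ValueError(f"expected YYYYMMDD or YYYY-MM-DD, got {text!r}")
    year, month, day = map(int, match.groups())
    return date(year, month, day)  # raises ValueError for impossible dates


def ensure_date_sketch(value=None):
    """Return a date for None (today), a date/datetime, or a strict date string."""
    if value is None:
        return date.today()
    if isinstance(value, datetime):  # check datetime first: it subclasses date
        return value.date()
    if isinstance(value, date):
        return value
    return parse_strict_date_sketch(value)


def format_ordinal_sketch(n):
    """Format a non-negative integer as an English ordinal string."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"
```

Checking datetime before date matters because datetime is a subclass of date; without that order a datetime would be returned unchanged instead of being reduced to its date part.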
Numeric Refinement (smith_utils.numeric)
Clean and parse messy numeric data.
- parse_numeric_value: Handles custom separators, decimals, and negative formats like (1,234.56).
- parse_currency_value: Alias for numeric parsing, specifically for currency strings.
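As a rough illustration of this kind of parsing, the following sketch handles accounting-style negatives, configurable separators, and a few leading currency symbols. The function name parse_numeric_sketch and its thousands and decimal parameters are assumptions made for this example, not the signature of parse_numeric_value.

```python
def parse_numeric_sketch(text, thousands=",", decimal="."):
    """Parse a messy numeric string into a float.

    Hypothetical helper: the parameter names are illustrative and are not
    taken from smith_utils.
    """
    s = text.strip()
    negative = False
    if s.startswith("(") and s.endswith(")"):
        # Accounting notation: (1,234.56) means -1234.56
        negative = True
        s = s[1:-1].strip()
    elif s.startswith("-"):
        negative = True
        s = s[1:].strip()
    s = s.lstrip("$€£¥ ")           # drop a leading currency symbol, if any
    s = s.replace(thousands, "")    # remove grouping separators first
    s = s.replace(decimal, ".")     # then map the decimal mark to "."
    value = float(s)                # raises ValueError on anything left over
    return -value if negative else value


print(parse_numeric_sketch("(1,250.50)"))
print(parse_numeric_sketch("1.234,56", thousands=".", decimal=","))
```

Removing the grouping separator before translating the decimal mark is what lets the same function handle both 1,234.56 and 1.234,56 style inputs.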
Text Normalization & Metrics (smith_utils.text)
Standardize text and compare string similarity.
- normalize_text: Unicode NFKC normalization, case folding, and whitespace handling.
- StringDistance: Implementation of Damerau-Levenshtein and Jaro-Winkler algorithms for fuzzy matching.
- analyze_pair: Convenience function for string comparison returning a Result.
- Relation / Result: Relation enum and typed result for text comparisons.
- make_unicode_char_name_records: Extract Unicode codepoint/name metadata from text.
- normalize_newlines_stream: Stream-based newline normalization to LF with newline type detection.
- normalize_file_to_lf: File-based newline normalization helper.
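To make the normalization and edit-distance ideas concrete, here is a minimal standard-library sketch. It is not the smith_utils code: normalize_text_sketch and osa_distance are hypothetical names, and the distance shown is the optimal string alignment variant (sometimes called restricted Damerau-Levenshtein), which may or may not be the variant StringDistance implements.

```python
import unicodedata


def normalize_text_sketch(text):
    """NFKC-normalize, case-fold, and collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def osa_distance(a, b):
    """Optimal string alignment distance between two strings.

    Counts insertions, deletions, substitutions, and transpositions of
    adjacent characters, with no substring edited more than once.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


print(normalize_text_sketch("  Smith   Utils "))
print(osa_distance("smith", "smiht"))
```

Normalizing both strings before measuring distance keeps differences in case, width, or spacing from inflating the score.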
Crypto Hash Utilities (smith_utils.crypto)
Calculate SHA-256 digests for text and files.
- get_text_digest: Returns the SHA-256 hexadecimal digest for a text string.
- get_file_digest: Returns the SHA-256 hexadecimal digest for a file using streaming reads.
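The streaming approach mentioned above can be sketched with hashlib. The function names, the UTF-8 default encoding, and the 64 KiB chunk size are assumptions for illustration and are not taken from smith_utils.

```python
import hashlib


def text_digest_sketch(text, encoding="utf-8"):
    """Return the SHA-256 hex digest of a string (encoding is an assumption)."""
    return hashlib.sha256(text.encode(encoding)).hexdigest()


def file_digest_sketch(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at end of file
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Reading in chunks keeps memory use bounded by chunk_size regardless of file size, which is the usual reason to prefer streaming over reading the whole file at once.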
File Classification Utilities (smith_utils.file)
Classify files from multiple evidence sources.
- classify_file: Returns extension, MIME, magic-number, and file(1) classification evidence.
- FileClassification: Dataclass result containing raw signals, file_class, and derived categories.
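Combining several evidence sources might look roughly like the sketch below, which gathers the extension, a MIME guess from the file name, and a magic-number match from the first bytes. ClassificationSketch, classify_file_sketch, and the small signature table are hypothetical; the sketch omits the file(1) signal and the derived categories that FileClassification is described as carrying.

```python
import mimetypes
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# A deliberately tiny signature table for illustration only.
MAGIC_SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip",
}


@dataclass(frozen=True)
class ClassificationSketch:
    """Hypothetical result type; not the smith_utils FileClassification."""
    extension: str
    mime_type: Optional[str]
    magic: Optional[str]


def classify_file_sketch(path):
    """Collect independent signals about a file without trusting any single one."""
    p = Path(path)
    mime_type, _encoding = mimetypes.guess_type(p.name)  # based on the name only
    with open(p, "rb") as handle:
        head = handle.read(16)  # enough for the signatures listed above
    magic = next(
        (label for sig, label in MAGIC_SIGNATURES.items() if head.startswith(sig)),
        None,
    )
    return ClassificationSketch(p.suffix.lower(), mime_type, magic)
```

Keeping the raw signals separate lets a caller spot disagreements, such as a file named report.pdf whose leading bytes match a ZIP signature.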
Installation
Install via pip:
pip install smith-utils
Quick Start
from smith_utils import ensure_date, get_text_digest, parse_numeric_value, normalize_text
from smith_utils import make_unicode_char_name_records
from smith_utils import classify_file, get_file_digest, normalize_file_to_lf
# Datetime
date = ensure_date("20231225") # datetime.date(2023, 12, 25)
# Numeric
value = parse_numeric_value("(1,250.50)") # -1250.5
# Text
clean_text = normalize_text(" Smith Utils ") # "smith utils"
# Unicode metadata
records = make_unicode_char_name_records("Aあ")
# [UnicodeCharNameRecord(index=0, codepoint='U+0041', ...), ...]
# Normalize a file's newlines to LF
summary = normalize_file_to_lf("input.txt", "output.txt")
# {'newline_type': 'CRLF', 'bytes_in': ..., 'bytes_out': ...}
# SHA-256 digests
text_digest = get_text_digest("smith-utils")
file_digest = get_file_digest("input.txt")
# File classification
classification = classify_file("input.pdf")
# FileClassification(extension='.pdf', file_class='document', categories=('document', 'pdf'), ...)
Directory Structure
- src/smith_utils/: Main package source.
- legacy/: Legacy scripts and templates (not included in distribution).
- tests/: Comprehensive test suite.
License
This project is licensed under the MIT License - see the LICENSE file for details.