HumanMint
Clean, functional data processing for human-centric applications. Normalize and standardize names, emails, phones, addresses, departments, job titles, and organizations with a single unified API.
Installation
pip install humanmint
Quick Start
from humanmint import mint
result = mint(
    name="Dr. John Smith",
    email="JOHN.SMITH@CITY.GOV",
    phone="(555) 123-4567",
    address="123 Main St, Springfield, IL 62701",
    department="Public Works 850-123-1234 ext 200",
    title="Chief of Police"
)
print(result.name) # {'full': 'John Smith', 'first': 'John', 'last': 'Smith', 'gender': 'Male'}
print(result.email) # 'john.smith@city.gov'
print(result.phone) # '+1 555-123-4567'
print(result.address) # {'street': '123 Main St', 'city': 'Springfield', 'state': 'IL', 'zip': '62701'}
print(result.department) # 'Public Works'
print(result.department_category) # 'Infrastructure'
print(result.title) # {'canonical': 'police chief', 'is_valid': True, ...}
Why HumanMint?
Problem
Real-world contact data is messy:
- Names with titles: "Dr. Jane Smith, PhD"
- Phone numbers in various formats: "(555) 123-4567" vs "555.123.4567"
- Departments with codes and noise: "000171 - Public Works 555-123-1234 ext 200"
- Job titles that need standardization: "Chief of Police" -> canonical form
- Emails with inconsistent casing: "JOHN.SMITH@EXAMPLE.COM"
- Single names that need gender inference: "Jo", "Al", "Ty"
Solution
HumanMint cleans and standardizes everything in one call.
Features
- Unified API: One mint() call to normalize names, emails, phones, addresses, departments, titles, and organizations.
- Rich names: Normalize, enrich, infer gender, detect nicknames; handles single, hyphenated, and titled names.
- Emails: Lowercasing, validation, generic inbox detection, domain extraction, free provider detection.
- Phones: E164/pretty formats, extension extraction, validation, type detection (mobile/landline/fax), VoIP and impossible number detection.
- Addresses: US postal address parsing (street, city, state, ZIP).
- Departments: 23,452 mappings to 64 canonicals with fuzzy matching, categorization, and custom overrides.
- Titles: Canonicalization and fuzzy matching against curated heuristics with confidence scores.
- Organizations: Normalize agency/organization names by removing civic prefixes and suffixes.
- Pandas: df.humanmint.clean(...) accessor with heuristic column guessing or explicit name_col/email_col/... mapping.
- CLI: humanmint clean input.csv output.csv with auto-guessing or explicit column flags.
- Batch processing: Parallel processing with bulk() for handling large datasets.
- Gzip-backed data: Reference data ships as .json.gz caches for fast loads; raw sources live under src/humanmint/data/original/.
- Ethics: Gender inference is probabilistic from historical name data and not a determination of identity; downstream use should respect that.
API Reference
Core Functions
mint(name, email, phone, address, department, title, organization, ...)
One function. All your data cleaned.
from humanmint import mint
result = mint(
    name="Jane Doe",                      # optional
    email="jane@example.com",             # optional
    phone="(555) 555-5555",               # optional
    address="123 Main St, City, ST ZIP",  # optional
    department="Water Utilities",         # optional
    title="Chief of Water",               # optional
    organization="City of Springfield"    # optional
)
# Access cleaned data
result.name # dict: {full, first, last, middle, suffix, gender}
result.email # str
result.phone # str (normalized, e.g. '+1 555-123-4567' as in the examples in this README)
result.address # dict: {street, city, state, zip}
result.department # str (canonical)
result.department_category # str
result.title # dict: {raw, cleaned, canonical, is_valid, confidence}
result.organization # dict: {raw, normalized, canonical, confidence}
# Convert to dict for JSON
result.model_dump()
bulk(records, workers=4, progress=False)
Process multiple records in parallel.
from humanmint import bulk
records = [
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob", "email": "bob@example.com"},
]
results = bulk(records, workers=4, progress=True)
# Returns: list[MintResult]
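For intuition, fanning a list of records out to a worker pool can be sketched with the standard library alone. This is not HumanMint's implementation: `clean_record` and `bulk_sketch` are hypothetical stand-ins, and the per-record work here only trims and lowercases the email.

```python
from concurrent.futures import ThreadPoolExecutor


def clean_record(record: dict) -> dict:
    """Hypothetical stand-in for per-record cleaning: trims and lowercases the email."""
    cleaned = dict(record)
    if "email" in cleaned:
        cleaned["email"] = cleaned["email"].strip().lower()
    return cleaned


def bulk_sketch(records: list, workers: int = 4) -> list:
    """Clean records in parallel; Executor.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(clean_record, records))


records = [
    {"name": "Alice", "email": " ALICE@Example.com"},
    {"name": "Bob", "email": "bob@EXAMPLE.com "},
]
results = bulk_sketch(records, workers=2)
```

Because `map` preserves ordering, `results[i]` corresponds to `records[i]`, which is convenient when joining cleaned output back to the source rows.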
compare(result_a, result_b)
Compare two normalized records and return a similarity score (0-100).
from humanmint import mint, compare
r1 = mint(name="John Smith", email="john@example.com")
r2 = mint(name="Jon Smith", email="john.smith@example.com")
similarity = compare(r1, r2) # Returns float 0-100
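How a 0-100 score might be built from per-field similarities can be sketched as follows. The equal weighting and the use of `difflib.SequenceMatcher` are assumptions made for illustration; the README does not say how `compare()` scores records, and `compare_sketch` is a hypothetical name.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two strings, in the range 0.0-1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare_sketch(rec_a: dict, rec_b: dict) -> float:
    """Average the similarity of the fields both records share, scaled to 0-100."""
    shared = [key for key in rec_a if key in rec_b]
    if not shared:
        return 0.0
    total = sum(field_similarity(rec_a[key], rec_b[key]) for key in shared)
    return 100.0 * total / len(shared)


a = {"name": "John Smith", "email": "john@example.com"}
b = {"name": "Jon Smith", "email": "john.smith@example.com"}
score = compare_sketch(a, b)
```

A production scorer would likely weight fields differently (an exact email match is stronger evidence than a similar name), so treat the flat average as a starting point only.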
Names Module
Import: from humanmint.names import ...
| Function | Purpose | Returns |
|---|---|---|
| normalize_name(raw) | Parse and normalize a full name | Dict: {first, last, middle, suffix, full, canonical, is_valid} |
| infer_gender(first_name, confidence=False) | Infer gender from first name | Dict: {gender, confidence} or str |
| enrich_name(normalized_dict, include_gender=True) | Add gender/enrichment data to normalized name | Dict with enriched fields |
| detect_nickname(first_name) | Detect if a name is a nickname, return canonical form | Optional[str] |
| get_nickname_variants(canonical_name) | Get all known nicknames for a name | set[str] |
| get_name_equivalents(name) | Get all equivalent forms (nicknames + canonicals) | set[str] |
| compare_first_names(name1, name2, use_nicknames=True) | Compare two first names with fuzzy matching | float (0-1) |
| compare_last_names(last1, last2) | Compare two last names | float (0-1) |
| match_names(raw1, raw2, strict=False) | Full name matching with detailed scoring | Dict: {score, is_match, reasons, ...} |
Examples:
from humanmint.names import normalize_name, infer_gender, detect_nickname, get_name_equivalents
# Normalize
result = normalize_name("Dr. Jane Smith, PhD")
# {'first': 'Jane', 'last': 'Smith', 'full': 'Jane Smith', 'gender': None, ...}
# Infer gender
gender = infer_gender("Jane", confidence=True) # {'gender': 'Female', 'confidence': 0.98}
# Detect nickname
canonical = detect_nickname("Bobby") # 'Robert'
variants = get_name_equivalents("Robert") # {'Robert', 'Bob', 'Bobby', 'Rob', ...}
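The nickname lookups above can be pictured as a mapping from canonical names to sets of nicknames. The sketch below uses a tiny hypothetical table (`NICKNAMES`) and hypothetical helper names; it illustrates the idea only and does not reflect HumanMint's data or internals.

```python
from typing import Optional

# Hypothetical, tiny nickname table; a real dataset would be far larger.
NICKNAMES = {
    "robert": {"bob", "bobby", "rob"},
    "william": {"bill", "billy", "will"},
}


def canonical_for(first_name: str) -> Optional[str]:
    """Return the canonical name if first_name is a known nickname, else None."""
    key = first_name.strip().lower()
    for canonical, nicknames in NICKNAMES.items():
        if key in nicknames:
            return canonical.title()
    return None


def equivalents(name: str) -> set:
    """Return the canonical form plus every known nickname for name."""
    key = name.strip().lower()
    for canonical, nicknames in NICKNAMES.items():
        if key == canonical or key in nicknames:
            return {canonical.title()} | {nick.title() for nick in nicknames}
    return {name.strip().title()}
```

With this table, `canonical_for("Bobby")` gives `'Robert'`, while `canonical_for("Robert")` gives `None` because the input is already canonical.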
Emails Module
Import: from humanmint.emails import ...
| Function | Purpose | Returns |
|---|---|---|
| normalize_email(raw, generic_inboxes=None) | Normalize email and extract metadata | Dict: {email, local, domain, is_generic, is_free_provider, is_valid} |
| is_free_provider(domain) | Check if domain is a free email provider | bool |
| guess_email(name, domain, known=[]) | Guess likely email pattern from known examples | str (email or empty) |
| get_pattern_scores(known) | Analyze known emails and return detected patterns | list[(pattern_id, confidence)] |
| describe_pattern(pattern_id) | Get documentation for an email pattern | Optional[Dict] |
Examples:
from humanmint.emails import normalize_email, guess_email, is_free_provider
# Normalize
result = normalize_email("JOHN.SMITH@GMAIL.COM")
# {'email': 'john.smith@gmail.com', 'domain': 'gmail.com', 'is_free_provider': True, ...}
# Check if free
is_free = is_free_provider("gmail.com") # True
# Guess email pattern
known = [("John Smith", "jsmith@company.com"), ("Jane Doe", "jdoe@company.com")]
guess = guess_email("Bob Jones", "company.com", known) # 'bjones@company.com'
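One way to think about pattern guessing: test each known (name, email) pair against a few candidate patterns, keep the pattern that explains the most pairs, then apply it to the new name. The pattern ids and helper names below are hypothetical and only illustrate the idea; they are not HumanMint's pattern catalogue.

```python
from collections import Counter
from typing import Optional

# Hypothetical pattern ids mapped to functions that build the local part.
PATTERNS = {
    "first.last": lambda first, last: f"{first}.{last}",
    "flast": lambda first, last: f"{first[0]}{last}",
    "first": lambda first, last: first,
}


def split_name(full_name: str) -> tuple:
    """Lowercase a 'First Last' name and return (first, last)."""
    parts = full_name.lower().split()
    return parts[0], parts[-1]


def detect_pattern(known: list) -> Optional[str]:
    """Return the pattern id that explains the most known (name, email) pairs."""
    votes = Counter()
    for full_name, email in known:
        first, last = split_name(full_name)
        local = email.lower().split("@", 1)[0]
        for pattern_id, build in PATTERNS.items():
            if build(first, last) == local:
                votes[pattern_id] += 1
    return votes.most_common(1)[0][0] if votes else None


def guess_email_sketch(name: str, domain: str, known: list) -> str:
    """Apply the dominant pattern to a new name; return '' if none was detected."""
    pattern_id = detect_pattern(known)
    if pattern_id is None:
        return ""
    first, last = split_name(name)
    return f"{PATTERNS[pattern_id](first, last)}@{domain}"


known = [("John Smith", "jsmith@company.com"), ("Jane Doe", "jdoe@company.com")]
guess = guess_email_sketch("Bob Jones", "company.com", known)  # 'bjones@company.com'
```

Returning an empty string when no pattern fits mirrors the "email or empty" return documented for guess_email in the table above.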
Phones Module
Import: from humanmint.phones import ...
| Function | Purpose | Returns |
|---|---|---|
| normalize_phone(raw, country="US") | Normalize phone to E.164 format | Dict: {e164, pretty, extension, country, type, is_valid} |
| detect_impossible(phone_dict) | Detect if phone appears impossible/test/fake | bool |
| detect_fax_pattern(phone_dict) | Detect if phone matches known fax patterns | bool |
| detect_voip_pattern(phone_dict) | Detect if phone matches VoIP provider patterns | bool |
Examples:
from humanmint.phones import normalize_phone, detect_fax_pattern
# Normalize
result = normalize_phone("(555) 123-4567 ext 201")
# {'e164': '+15551234567', 'pretty': '+1 555-123-4567', 'extension': '201', 'type': 'mobile', ...}
# Detect fax
fax = detect_fax_pattern(result) # False
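To make the e164, pretty, and extension fields concrete, here is a minimal, US-only sketch built on regular expressions. It is an illustration under simplifying assumptions (10-digit numbers, an optional leading 1, a trailing "ext"/"x" extension), not the library's parser, and `normalize_us_phone` is a hypothetical name.

```python
import re
from typing import Optional

# Matches a trailing extension such as "ext 201", "ext. 201", or "x201".
EXTENSION_RE = re.compile(r"(?:ext\.?|x)\s*(\d+)\s*$", re.IGNORECASE)


def normalize_us_phone(raw: str) -> Optional[dict]:
    """Return e164/pretty/extension for a 10-digit US number, or None if it does not fit."""
    extension = None
    match = EXTENSION_RE.search(raw)
    if match:
        extension = match.group(1)
        raw = raw[: match.start()]  # keep only the part before the extension
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the US country code
    if len(digits) != 10:
        return None
    return {
        "e164": f"+1{digits}",
        "pretty": f"+1 {digits[:3]}-{digits[3:6]}-{digits[6:]}",
        "extension": extension,
    }


result = normalize_us_phone("(555) 123-4567 ext 201")
```

Splitting off the extension before stripping non-digits matters: otherwise "ext 201" would contribute three extra digits and the length check would reject a valid number.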
Departments Module
Import: from humanmint.departments import ...
| Function | Purpose | Returns |
|---|---|---|
| normalize_department(raw_dept) | Normalize dept name (remove noise, standardize format) | str |
| find_best_match(dept_name, threshold=0.6, normalize=True) | Find best canonical match | Optional[str] |
| find_all_matches(dept_name, threshold=0.6, top_n=3, normalize=True) | Find all matches ranked by similarity | list[str] |
| match_departments(dept_names, threshold=0.6, normalize=True) | Match multiple depts at once | dict[str, Optional[str]] |
| get_similarity_score(dept1, dept2) | Calculate similarity between two depts | float (0-1) |
| get_department_category(dept) | Get category for canonical dept | Optional[str] |
| get_all_categories() | Get all available categories | set[str] |
| get_departments_by_category(category) | Get depts in a specific category | list[str] |
| categorize_departments(depts) | Categorize multiple depts | dict[str, Optional[str]] |
| get_canonical_departments() | Get all canonical dept names | list[str] |
| is_canonical(dept) | Check if dept is canonical | bool |
| load_mappings() | Load all dept mappings | dict[str, list[str]] |
| get_mapping_for_original(original) | Get canonical for original name | Optional[str] |
| get_originals_for_canonical(canonical) | Get all original names for canonical | list[str] |
Examples:
from humanmint.departments import normalize_department, find_best_match, find_all_matches, get_department_category
# Normalize
clean = normalize_department("000171 - Police Department") # 'Police'
# Find match
match = find_best_match("Police Dept") # 'Police'
category = get_department_category("Police") # 'Public Safety'
# Find all matches
all_matches = find_all_matches("Finance", threshold=0.7) # ['Finance', 'Budget']
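The two stages described here, removing noise and then fuzzy-matching against canonical names with a threshold, can be sketched with `re` and `difflib`. The canonical list, the regular expressions, and the function names are hypothetical; the library's actual matching logic and data are not shown in this README.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical canonical list; the real mapping data is much larger.
CANONICALS = ["Police", "Public Works", "Finance", "Planning"]

LEADING_CODE_RE = re.compile(r"^\s*\d+\s*-\s*")  # e.g. "000171 - "
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")  # e.g. "555-123-1234"
EXTENSION_RE = re.compile(r"\b(?:ext\.?|x)\s*\d+\b", re.IGNORECASE)  # e.g. "ext 200"


def strip_noise(raw: str) -> str:
    """Remove leading numeric codes, phone numbers, and extensions."""
    text = LEADING_CODE_RE.sub("", raw)
    text = PHONE_RE.sub("", text)
    text = EXTENSION_RE.sub("", text)
    return " ".join(text.split())  # collapse leftover whitespace


def best_match(raw: str, threshold: float = 0.6) -> Optional[str]:
    """Return the most similar canonical name, or None if below the threshold."""
    cleaned = strip_noise(raw).lower()
    scored = [(SequenceMatcher(None, cleaned, name.lower()).ratio(), name) for name in CANONICALS]
    score, canonical = max(scored)
    return canonical if score >= threshold else None
```

Raising the threshold trades recall for precision: fewer inputs get a canonical, but the ones that do are closer matches.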
Titles Module
Import: from humanmint.titles import ...
| Function | Purpose | Returns |
|---|---|---|
| normalize_title_full(raw_title, threshold=0.6, dept_canonical=None, overrides=None) | Full normalization with confidence | TitleResult Dict: {raw, cleaned, canonical, is_valid, confidence} |
| normalize_title(raw_title) | Core title cleaning | str |
| find_best_match(title, threshold=0.6, normalize=True) | Find best canonical match | tuple[Optional[str], float] |
| find_all_matches(title, threshold=0.6, top_n=3) | Find all matches | list[str] |
| get_similarity_score(title1, title2) | Calculate similarity | float (0-1) |
| get_canonical_titles() | Get all canonical titles | list[str] |
| is_canonical(title) | Check if title is canonical | bool |
| get_mapping_for_variant(variant) | Get canonical for variant | Optional[str] |
| get_all_mappings() | Get all title mappings | dict[str, str] |
Examples:
from humanmint.titles import normalize_title_full, find_best_match, get_canonical_titles
# Full normalization
result = normalize_title_full("0001 - Chief of Police (Downtown)")
# {'raw': '0001 - Chief of Police (Downtown)', 'cleaned': 'Chief of Police', 'canonical': 'police chief', ...}
# Find match with confidence
title, confidence = find_best_match("Police Chief", threshold=0.7)
# ('police chief', 0.95)
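The difference between the cleaned and canonical fields can be illustrated with a small sketch: cleaning strips codes and parentheticals, while canonicalization rewrites the result into a standard form. The single "Chief of X" rewrite rule below is an assumption chosen to reproduce the example above; the library's curated heuristics are not documented here, and both function names are hypothetical.

```python
import re

LEADING_CODE_RE = re.compile(r"^\s*\d+\s*-\s*")  # e.g. "0001 - "
PARENTHETICAL_RE = re.compile(r"\([^)]*\)")  # e.g. "(Downtown)"
CHIEF_OF_RE = re.compile(r"^chief of (.+)$", re.IGNORECASE)


def clean_title(raw: str) -> str:
    """Strip a leading numeric code and any parenthetical notes."""
    text = LEADING_CODE_RE.sub("", raw)
    text = PARENTHETICAL_RE.sub("", text)
    return " ".join(text.split())


def canonicalize_title(cleaned: str) -> str:
    """Hypothetical rule: rewrite 'Chief of X' as 'x chief'; otherwise just lowercase."""
    match = CHIEF_OF_RE.match(cleaned)
    if match:
        return f"{match.group(1)} chief".lower()
    return cleaned.lower()


cleaned = clean_title("0001 - Chief of Police (Downtown)")
canonical = canonicalize_title(cleaned)
```

Keeping the two steps separate makes it possible to show users the cleaned title while grouping and deduplicating on the canonical one.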
Addresses Module
Import: from humanmint.addresses import ...
| Function | Purpose | Returns |
|---|---|---|
| normalize_address(raw) | Parse US postal address | Optional[Dict]: {street, city, state, zip, canonical} |
Examples:
from humanmint.addresses import normalize_address
result = normalize_address("123 Main St, Springfield, IL 62701")
# {'street': '123 Main St', 'city': 'Springfield', 'state': 'IL', 'zip': '62701', ...}
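For a sense of what parsing involves, the sketch below handles only the simplest comma-separated layout, "street, city, ST ZIP", with a single regular expression. Real addresses vary far more than this, so read it as an illustration of the output shape rather than how `normalize_address` works; `parse_us_address` is a hypothetical name.

```python
import re
from typing import Optional

# Expects "street, city, ST 12345" with an optional ZIP+4 suffix.
ADDRESS_RE = re.compile(
    r"^\s*(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)


def parse_us_address(raw: str) -> Optional[dict]:
    """Split a simple one-line US address into its parts, or return None."""
    match = ADDRESS_RE.match(raw)
    if match is None:
        return None
    parts = {key: value.strip() for key, value in match.groupdict().items()}
    parts["state"] = parts["state"].upper()
    return parts


result = parse_us_address("123 Main St, Springfield, IL 62701")
```

Returning `None` for anything that does not match follows the Optional[Dict] return type listed in the table above and lets callers decide how to handle unparsed input.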
Organizations Module
Import: from humanmint.organizations import ...
| Function | Purpose | Returns |
|---|---|---|
| normalize_organization(raw) | Normalize agency/org name (remove civic suffixes) | Optional[Dict]: {raw, normalized, canonical, confidence} |
Examples:
from humanmint.organizations import normalize_organization
result = normalize_organization("City of Springfield")
# {'raw': 'City of Springfield', 'normalized': 'Springfield', 'canonical': 'Springfield', ...}
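Stripping a civic prefix such as "City of" can be sketched with one regular expression. The prefix list is an assumption for illustration and `strip_civic_prefix` is a hypothetical helper, not part of the library.

```python
import re

# Hypothetical list of civic prefixes of the form "<kind> of ".
CIVIC_PREFIX_RE = re.compile(r"^(?:city|town|village|county)\s+of\s+", re.IGNORECASE)


def strip_civic_prefix(raw: str) -> str:
    """Collapse whitespace, then drop a leading 'City of'-style prefix if present."""
    return CIVIC_PREFIX_RE.sub("", " ".join(raw.split()))


normalized = strip_civic_prefix("City of Springfield")
```

Names without a recognized prefix pass through unchanged apart from whitespace cleanup.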
Customization
You can steer canonicals without forking data files:
- Department overrides: Map normalized departments to your preferred canonical. Example: mint(department="RevOps", dept_overrides={"revenue operations": "Sales"}), or pass the same dict into bulk()/pandas/CLI.
- Title overrides / ignores: Map cleaned titles to a canonical string with title_overrides. To ignore a title, set it to None and drop records where result.title is None or result.title["is_valid"] is False.
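Conceptually, an override is a lookup that runs before the built-in mapping. The sketch below assumes override keys are lowercase normalized strings and treats a `None` value as "ignore"; `DEFAULT_CANONICALS` and `resolve_canonical` are hypothetical names used only for illustration.

```python
from typing import Optional

# Hypothetical built-in mapping from normalized names to canonicals.
DEFAULT_CANONICALS = {
    "revenue operations": "Revenue Operations",
    "police dept": "Police",
}


def resolve_canonical(normalized: str, overrides: Optional[dict] = None) -> Optional[str]:
    """Check caller-supplied overrides first, then fall back to the defaults."""
    key = normalized.strip().lower()
    if overrides and key in overrides:
        return overrides[key]  # may be None, meaning "ignore this value"
    return DEFAULT_CANONICALS.get(key)
```

Checking `key in overrides` rather than `overrides.get(key)` is what allows an explicit `None` to suppress a value instead of falling through to the default.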
Examples
Government Contact
from humanmint import mint

result = mint(
    name="Chief Robert Patterson",
    email="robert.patterson@police.gov",
    phone="(555) 123-4567",
    department="000171 - Police",
    title="Chief of Police"
)
# Cleaned:
# name: {'full': 'Robert Patterson', 'first': 'Robert', 'last': 'Patterson', 'gender': 'Male'}
# email: 'robert.patterson@police.gov'
# phone: '+1 555-123-4567'
# department: 'Police' (category: 'Public Safety')
# title: {'canonical': 'police chief', 'is_valid': True}
Messy Data
result = mint(
    name="Dr. Jane Smith, PhD",
    email="JANE@EXAMPLE.COM",
    phone="555 123 4567",
    department="Planning and Development 555-123-4567 ext 200"
)
# Cleaned:
# name: {'full': 'Jane Smith', 'first': 'Jane', 'last': 'Smith', 'gender': 'Female'}
# email: 'jane@example.com'
# phone: '+1 555-123-4567'
# department: 'Planning' (category: 'Planning & Development')
Single Names
result = mint(name="Madonna")
# name: {'full': 'Madonna', 'first': 'Madonna', 'last': '', 'gender': 'Female'}
result = mint(name="Jo")
# name: {'full': 'Jo', 'first': 'Jo', 'last': '', 'gender': 'Female'}
Advanced Usage
Use individual modules for specific needs. See API Reference above for the complete list of functions across all modules:
# Names: normalize, enrich, infer gender, detect nicknames
from humanmint.names import normalize_name, infer_gender, detect_nickname
# Emails: validate, detect free providers, guess patterns
from humanmint.emails import normalize_email, is_free_provider, guess_email
# Phones: normalize, detect types, identify fax/VoIP/test numbers
from humanmint.phones import normalize_phone, detect_fax_pattern, detect_voip_pattern
# Departments: normalize, match, categorize
from humanmint.departments import normalize_department, find_best_match, get_department_category
# Titles: normalize, match, get mappings
from humanmint.titles import normalize_title_full, find_best_match as match_title
# Addresses: parse and normalize
from humanmint.addresses import normalize_address
# Organizations: normalize and extract canonical names
from humanmint.organizations import normalize_organization
# Batch processing and comparison
from humanmint import bulk, compare
Testing
pytest -q unittests
CLI
Clean a CSV (auto-detecting columns, or override with flags):
humanmint clean input.csv output.csv --name-col name --email-col email
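Column auto-detection of this kind usually comes down to matching header names against keyword hints. The sketch below shows one such heuristic; the hint table and `guess_columns` are hypothetical and are not taken from the CLI's source.

```python
from typing import Optional

# Hypothetical keyword hints per field, matched as substrings of each header.
FIELD_HINTS = {
    "email": ("email", "e-mail"),
    "phone": ("phone", "tel", "mobile"),
    "name": ("name",),
    "department": ("department", "dept"),
    "title": ("title", "position"),
}


def guess_columns(headers: list) -> dict:
    """Map each field to the first header containing one of its hint keywords."""
    guesses = {field: None for field in FIELD_HINTS}
    for field, hints in FIELD_HINTS.items():
        for header in headers:
            lowered = header.lower()
            if any(hint in lowered for hint in hints):
                guesses[field] = header
                break
    return guesses


columns = guess_columns(["Full Name", "Work Email", "Dept", "Job Title"])
```

Substring heuristics misfire on headers like "Department Name", which is a good reason to pass explicit column flags whenever the input layout is known.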
Regenerating data caches
If you edit the raw datasets in src/humanmint/data/original/, rebuild the packaged .json.gz caches with:
python scripts/build_caches.py
Performance
- Batch processing: ~0.1-0.2ms per contact
- 10 contacts: ~1-2ms
- 100 contacts: ~10-20ms
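To check figures like these on your own hardware, a small timing harness is enough. The sketch below times a placeholder workload (lowercasing an email); `time_per_record_ms` is a hypothetical helper, and you would swap in the function you actually want to measure.

```python
import time


def time_per_record_ms(func, records: list, repeats: int = 5) -> float:
    """Return the best observed wall-clock time per record, in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for record in records:
            func(record)
        best = min(best, time.perf_counter() - start)
    return best * 1000.0 / len(records)


# Placeholder workload; replace the lambda with the call under test.
sample = [{"email": f"USER{i}@EXAMPLE.COM"} for i in range(100)]
per_record = time_per_record_ms(lambda record: record["email"].lower(), sample)
```

Taking the minimum over several repeats reduces noise from other processes; multiplying the per-record figure by the batch size gives the totals quoted above (for example, 0.1-0.2 ms × 100 contacts ≈ 10-20 ms).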
What Gets Cleaned
| Input | Output |
|---|---|
| name="Dr. Jane Smith, PhD" | first='Jane', last='Smith' |
| email="JOHN@EXAMPLE.COM" | john@example.com |
| phone="(555) 123-4567 x101" | +1 555-123-4567, extension: 101 |
| address="123 Main St, Springfield, IL 62701" | street='123 Main St', city='Springfield', state='IL', zip='62701' |
| department="000171 - Public Works 555-123-1234 ext 200" | Public Works (category: Infrastructure) |
| title="0001 - Chief of Police (Downtown)" | police chief |
| organization="City of Springfield Police" | normalized='Springfield', canonical='Springfield' |
| name="Jo" | first='Jo', gender: Female |
Version
0.1.0
License
MIT